The NER-tagger corpus is a collection of sentences with manually labelled named entities. The labelling is partial: only one selected word from each sentence is labelled. As a result, the labelled word may be only a part of a named entity, and the sentence may contain other named entities that are not labelled. We distinguish the following types of named entities: PER: person, LOC: location, ORG: organisation, FAC: facility, PRD: product, O: other. For each labelled word, the label is determined by the largest named entity containing it. For instance, "Eesti" in the sentence "Eesti Ühispanga Tartu kontor oli inimesi täis" ("The Tartu office of Eesti Ühispank was full of people") is labelled as a facility, although "Eesti" on its own is a location and "Eesti Ühispank" is an organisation.
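The "largest containing entity" rule can be illustrated with a minimal Python sketch. It is not part of the corpus tooling: the `Span` class, the `label_for_word` and `span_of` helpers, the half-open character offsets, and the fallback to "O" when no entity contains the word are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical helper: a labelled character span, end exclusive."""
    start: int
    end: int
    label: str


def label_for_word(word_start, word_end, spans):
    """Return the label of the largest span that fully contains the word.

    Falls back to "O" when no span contains the word (an assumption,
    not something the corpus description states).
    """
    containing = [s for s in spans if s.start <= word_start and word_end <= s.end]
    if not containing:
        return "O"
    return max(containing, key=lambda s: s.end - s.start).label


sentence = "Eesti Ühispanga Tartu kontor oli inimesi täis"


def span_of(text, label):
    """Build a Span for the first occurrence of `text` in the sentence."""
    start = sentence.find(text)
    return Span(start, start + len(text), label)


# Nested entities from the example: location inside organisation inside facility.
spans = [
    span_of("Eesti", "LOC"),
    span_of("Eesti Ühispanga", "ORG"),
    span_of("Eesti Ühispanga Tartu kontor", "FAC"),
]

word = span_of("Eesti", "")
print(label_for_word(word.start, word.end, spans))  # FAC
```

The word "Eesti" lies inside all three spans, so the longest one, the facility, decides its label.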
The corpus was created using the nertagger web tool: https://github.com/estnltk/ner-tagger. Two human annotators took part in the annotation process.
The data file contains one sentence per line with the following columns:
name: named entity token
start: entity start offset in the sentence
end: entity end offset in the sentence
label: assigned label
annotator: human annotator id
time: number of milliseconds it took the annotator to tag the word
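As a rough illustration, the columns listed above could be read into typed records as in the following Python sketch. Several things here are assumptions rather than facts about the data file: that fields are tab-separated, that a record consists of exactly these six fields in the listed order, and that offsets and times are integers. The `Annotation` class, the `parse_record` function, and the sample line (including its annotator id and time) are made up for this example.

```python
from dataclasses import dataclass

# Column names as listed in this description; the order is assumed.
COLUMNS = ("name", "start", "end", "label", "annotator", "time")


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record type for one labelled word."""
    name: str
    start: int
    end: int
    label: str
    annotator: str
    time_ms: int


def parse_record(line, delimiter="\t"):
    """Parse one delimited line into an Annotation.

    The tab delimiter is an assumption; pass another one if the
    actual file differs.
    """
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    name, start, end, label, annotator, time_ms = fields
    return Annotation(name, int(start), int(end), label, annotator, int(time_ms))


# Made-up sample line, not taken from the corpus.
record = parse_record("Eesti\t0\t5\tFAC\tA1\t1500\n")
print(record.label, record.end - record.start)
```

Whether the end offset is inclusive or exclusive is not stated here, so it is worth checking against the sentence text before slicing with these values.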