Estonian Reference Corpus analysed with EstNLTK v1.6.b

Eesti keele koondkorpus analüüsitud EstNLTK v1.6.b abil




This resource contains texts from the Estonian Reference Corpus (Eesti keele koondkorpus) that have been converted into JSON format and linguistically analysed with EstNLTK ver. 1.6_b. The corpus consists of 705,259 text files in EstNLTK's JSON format.

Source of the corpus

The texts originate from the XML files of the Estonian Reference Corpus, which are available from here:


The texts were first converted into EstNLTK's JSON format (the metadata of the text documents was preserved) and then processed automatically. Processing consisted of tokenizing the texts into words, sentences and paragraphs, followed by morphological analysis and disambiguation. The results of the processing were recorded as annotation layers.
There are two layers of morphological annotations:
1) the layer that uses Vabamorf's category system[1],
2) the layer that uses Giellatekno's category system[2].

The processing was done on 2017-12-28, using the latest EstNLTK version available at that time (version 1.6.0_beta).
The scripts that were used for processing (along with instructions) are available here:

Loading JSON files with EstNLTK
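
EstNLTK's own JSON loader is not reproduced here. As a library-independent sketch, the files can also be inspected with Python's standard json module. This assumes that each file holds a single UTF-8 encoded JSON object and that the files carry a .json extension; the directory name "koondkorpus_json" and the "text" key are assumptions made for illustration and should be checked against the actual files. The helper names are hypothetical and not part of EstNLTK.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, Optional


def iter_json_files(corpus_dir: str) -> Iterator[Path]:
    """Lazily yield the paths of all *.json files below corpus_dir.

    The corpus has several hundred thousand files, so the paths are
    yielded one by one instead of being collected into a list.
    """
    yield from Path(corpus_dir).rglob("*.json")


def load_json_document(path: Path) -> Dict[str, Any]:
    """Read one corpus file (assumed: a single UTF-8 JSON object) as a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def describe_document(doc: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Report the top-level keys and, if a string "text" key exists, its length."""
    text = doc.get("text")  # assumed key name; None if the key is absent
    return {
        "keys": sorted(doc),
        "text_length": len(text) if isinstance(text, str) else None,
    }


if __name__ == "__main__":
    # "koondkorpus_json" is a hypothetical directory holding the unpacked files.
    for path in iter_json_files("koondkorpus_json"):
        print(path.name, describe_document(load_json_document(path)))
        break  # look at the first file only
```

Inspecting one file this way shows which top-level keys and annotation layers are present before the whole collection is processed.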






