Estonian Reference Corpus analysed with EstNLTK v1.6.b

Eesti keele koondkorpus analüüsitud EstNLTK v1.6.b abil

DOI: 10.15155/1-00-0000-0000-0000-00156L

Estonian Reference Corpus analysed with EstNLTK ver.1.6_b

This resource contains texts from the Estonian Reference Corpus (Eesti keele koondkorpus) that have been converted to JSON and linguistically analysed with EstNLTK ver. 1.6_b. The corpus comprises 705,259 text files in EstNLTK's JSON format.

Source of the corpus

The source data are the XML files of the Estonian Reference Corpus, available at:
http://www.cl.ut.ee/korpused/segakorpus/

Processing

Texts were first converted into EstNLTK's JSON format, preserving the metadata of the text documents, and then processed automatically. Processing consisted of tokenization into words, sentences and paragraphs, followed by morphological analysis and disambiguation. The results were recorded as annotation layers.
There are two layers of morphological annotations:
1) the layer that uses Vabamorf's category system[1],
2) the layer that uses Giellatekno's category system[2].

Processing was done on 2017-12-28 with the latest EstNLTK version available at the time (version 1.6.0_beta).
The processing scripts, along with instructions, are available here:
https://github.com/estnltk/estnltk/tree/aed554e15e7f9e0f854d7a49bb2e2674e274cabc/estnltk/corpus_processing

Loading JSON files with EstNLTK
See the "Import from file" section of the tutorial:
https://github.com/estnltk/estnltk/blob/aed554e15e7f9e0f854d7a49bb2e2674e274cabc/tutorials/json_exporter_importer.ipynb
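
The tutorial linked above is the reference for importing these files into EstNLTK objects. For a quick look at the files before setting up EstNLTK, something like the following may help. It is a minimal sketch that uses only the Python standard library. The directory name koondkorpus_json and the helper names iter_corpus_json and top_level_keys are hypothetical, and the sketch makes no assumption about which keys each file contains.

```python
import json
from pathlib import Path


def iter_corpus_json(corpus_dir):
    """Yield (path, document) pairs for every .json file under corpus_dir."""
    for path in sorted(Path(corpus_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)


def top_level_keys(document):
    """Return the sorted top-level keys of a loaded document (empty if not a dict)."""
    return sorted(document) if isinstance(document, dict) else []


if __name__ == "__main__":
    # "koondkorpus_json" is a placeholder for the unpacked corpus directory.
    for path, document in iter_corpus_json("koondkorpus_json"):
        print(path.name, top_level_keys(document))
        break  # inspect only the first file
```

Because the generator reads one file at a time, it can walk the whole collection without holding all 705,259 documents in memory.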


[1] Vabamorf's tagset; a description in Estonian is available at: https://github.com/Filosoft/vabamorf/blob/master/doc/tagset.html
[2] Giellatekno's tagset; a description in Estonian is available at: http://www2.keeleveeb.ee/dict/corpus/shared/categories.html
