Web13 corpus analysed with EstNLTK v1.6.0_beta

This resource contains texts from the Web13 Corpus (also known as the etTenTen corpus) that have been converted into JSON format and linguistically analysed with EstNLTK version 1.6.0_beta. The corpus contains 686,325 text files in EstNLTK's JSON format.

Source of the corpus

The raw texts of the Web13 Corpus are available from here:


The texts were first converted into EstNLTK's JSON format (the metadata of the text documents was preserved) and then processed automatically. Processing consisted of tokenizing the texts into words, sentences and paragraphs, followed by morphological analysis and disambiguation. The results of the processing were recorded as annotation layers.
There are two layers of morphological annotations:
1) the layer that uses Vabamorf's category system[1],
2) the layer that uses Giellatekno's category system[2].

The processing was done on 2017-12-22, using the latest EstNLTK version available at that time (version 1.6.0_beta).
Scripts that were used for processing (along with the instructions) are available here:

Loading JSON files with EstNLTK
See the tutorial:
(Import from file)
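
As a rough illustration of working with the files, the sketch below walks a directory of corpus JSON files using only the Python standard library, and shows how a single file might be handed to EstNLTK. The names iter_corpus_documents, load_as_estnltk_text and the directory name "web13_json" are hypothetical. UTF-8 encoding, the ".json" file extension, and the availability of estnltk.converters.json_to_text with a `file` keyword are assumptions; the tutorial remains the authoritative source for the exact import call.

```python
import json
from pathlib import Path


def iter_corpus_documents(corpus_dir):
    """Yield (path, parsed_json) for every *.json file under corpus_dir.

    Assumes the files are UTF-8 encoded and carry a .json extension.
    Files are visited in sorted path order, one at a time, so the whole
    corpus never has to be held in memory.
    """
    for path in sorted(Path(corpus_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield path, json.load(handle)


def load_as_estnltk_text(path):
    """Load one corpus file as an EstNLTK Text object.

    Assumption: EstNLTK 1.6 is installed and exposes the importer as
    estnltk.converters.json_to_text(file=...). The import is done lazily
    so that the plain-JSON helper above works without EstNLTK.
    """
    from estnltk.converters import json_to_text
    return json_to_text(file=str(path))
```

For example, `for path, doc in iter_corpus_documents("web13_json"): print(path.name)` lists the files in order; the same `path` can then be passed to load_as_estnltk_text to obtain the annotated text.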

[1] -- Vabamorf's tagset -- Estonian description is available here: https://github.com/Filosoft/vabamorf/blob/master/doc/tagset.html
[2] -- Giellatekno's tagset -- Estonian description is available here: http://www2.keeleveeb.ee/dict/corpus/shared/categories.html

