Web13 corpus analysed with EstNLTK v1.6.b
This resource contains texts from the Web13 Corpus (also known as the etTenTen corpus), converted into JSON format and linguistically analysed with EstNLTK version 1.6_b. The corpus comprises 686,325 text files in EstNLTK's JSON format.
Source of the corpus
The raw texts of the Web13 Corpus, which are available from here:
https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f
Processing
Texts were first converted into EstNLTK's JSON format, preserving the metadata of the text documents, and then processed automatically. Processing consisted of tokenizing the texts into words, sentences and paragraphs, followed by morphological analysis and disambiguation. The results were recorded as annotation layers.
There are two layers of morphological annotations:
1) the layer that uses Vabamorf's category system[1],
2) the layer that uses Giellatekno's category system[2].
The processing was done on 2017-12-22, using the latest EstNLTK version available at that time (version 1.6.0_beta).
Scripts that were used for processing (along with the instructions) are available here:
https://github.com/estnltk/estnltk/tree/aed554e15e7f9e0f854d7a49bb2e2674e274cabc/estnltk/corpus_processing
Loading JSON files with EstNLTK
See the "Import from file" section of the tutorial:
https://github.com/estnltk/estnltk/blob/aed554e15e7f9e0f854d7a49bb2e2674e274cabc/tutorials/json_exporter_importer.ipynb
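Before loading files through EstNLTK's own importer (covered in the tutorial), it can help to look at the raw structure of a single document. The following is a minimal sketch that uses only Python's standard library; it does not use EstNLTK's API and assumes nothing about the key names inside the files. The function names `summarize_json_document` and `iter_json_files` are hypothetical helpers introduced here for illustration, and the `.json` file extension is an assumption about how the corpus files are named.

```python
import json
from pathlib import Path


def summarize_json_document(path):
    """Load one JSON file and report its top-level structure.

    No key names are assumed: the function only reports what it finds,
    mapping each top-level key to the type name of its value.
    """
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    if isinstance(doc, dict):
        return {key: type(value).__name__ for key, value in doc.items()}
    # The root is not a JSON object (e.g. a list): report its type only.
    return {"<root>": type(doc).__name__}


def iter_json_files(corpus_dir):
    """Yield paths of all *.json files under corpus_dir, in sorted order.

    Assumption: corpus files carry the .json extension.
    """
    yield from sorted(Path(corpus_dir).rglob("*.json"))
```

For example, calling `summarize_json_document` on the first path yielded by `iter_json_files("path/to/corpus")` (a placeholder directory) shows which top-level fields a document carries, which is a useful sanity check before importing the full set of files.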
[1] Vabamorf's tagset. A description in Estonian is available here: https://github.com/Filosoft/vabamorf/blob/master/doc/tagset.html
[2] Giellatekno's tagset. A description in Estonian is available here: http://www2.keeleveeb.ee/dict/corpus/shared/categories.html