Web13 corpus analysed with EstNLTK v1.6.b

This resource contains texts from the Web13 Corpus (also known as the etTenTen corpus) that have been converted into JSON format and linguistically analysed with EstNLTK version 1.6_b. The corpus consists of 686,325 text files in EstNLTK's JSON format.

Source of the corpus

Raw texts of the Web13 Corpus, which are available from here:
https://metashare.ut.ee/repository/browse/ettenten-korpus-toortekst/b564ca760de111e6a6e4005056b4002419cacec839ad4b7a93c3f7c45a97c55f

Processing

Texts were first converted into EstNLTK's JSON format (the metadata of the text documents was preserved) and then processed automatically. Processing consisted of tokenizing the texts into words, sentences and paragraphs, followed by morphological analysis and disambiguation. The results of the processing were recorded as annotation layers.
There are two layers of morphological annotations:
1) the layer that uses Vabamorf's category system[1],
2) the layer that uses Giellatekno's category system[2].

The processing was done on 2017-12-22, using the latest EstNLTK version available at that time (version 1.6.0_beta).
Scripts that were used for processing (along with the instructions) are available here:
https://github.com/estnltk/estnltk/tree/aed554e15e7f9e0f854d7a49bb2e2674e274cabc/estnltk/corpus_processing

Loading JSON files with EstNLTK

See the tutorial (section "Import from file"):
https://github.com/estnltk/estnltk/blob/aed554e15e7f9e0f854d7a49bb2e2674e274cabc/tutorials/json_exporter_importer.ipynb
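
For a quick look at the files before setting up EstNLTK, the sketch below may help. It uses only Python's standard json module and does not call EstNLTK at all. The helper names summarize_json_document and iter_corpus_files are hypothetical, and the top-level "layers" list with "name" fields is an assumption about the file layout rather than something stated in this description. To obtain EstNLTK Text objects, use the importer described in the tutorial.

```python
import json
from pathlib import Path


def summarize_json_document(path):
    """Return a small summary of one corpus file, read as plain JSON."""
    with open(path, encoding="utf-8") as handle:
        document = json.load(handle)

    summary = {"file": Path(path).name, "keys": []}
    if not isinstance(document, dict):
        # Unexpected layout: report the file name only.
        return summary

    summary["keys"] = sorted(document)

    # ASSUMPTION: annotation layers are stored as a top-level "layers" list
    # whose items are objects with a "name" field. Check this against a real file.
    layers = document.get("layers")
    if isinstance(layers, list):
        summary["layer_names"] = [
            layer.get("name") for layer in layers if isinstance(layer, dict)
        ]
    return summary


def iter_corpus_files(corpus_dir):
    """Yield the .json files found directly in corpus_dir, in sorted order."""
    yield from sorted(Path(corpus_dir).glob("*.json"))
```

Calling summarize_json_document on the first few paths from iter_corpus_files is usually enough to see which top-level keys and layer names the files actually contain, without loading all 686,325 documents.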


[1] Vabamorf's tagset. A description in Estonian is available here: https://github.com/Filosoft/vabamorf/blob/master/doc/tagset.html
[2] Giellatekno's tagset. A description in Estonian is available here: http://www2.keeleveeb.ee/dict/corpus/shared/categories.html
