Corpus of Estonian Web sentences 2020 – META-SHARE

Last view: 2026-04-20


Corpus of Estonian Web sentences 2020


Eesti veebilausete korpus 2020

The corpus consists of sentences extracted from the Estonian National Corpus 2019 and the Estonian RSS Feed Corpus 2020 using GDEX (Good Dictionary Examples; Kilgarriff et al., 2008), a tool for detecting good dictionary examples, together with examples from the Estonian Collocations Dictionary 2019 (Kallas et al., 2015). For more information, see Koppel (2020). The corpus does not include full documents. It contains 331,491,665 tokens, 280,961,465 words and 27,987,754 sentences.

References:
Kilgarriff, Adam; Husák, Milos; McAdam, Katy; Rundell, Michael; Rychlý, Pavel (2008). GDEX: Automatically finding good dictionary examples in a corpus. In Elisenda Bernal, Janet DeCesaris (Eds.), Proceedings of the 13th EURALEX International Congress. Barcelona: Institut Universitari de Linguistica Aplicada, Universitat Pompeu Fabra, 425–432.
Kallas, Jelena; Kilgarriff, Adam; Koppel, Kristina; Kudritski, Elgar; Langemets, Margit; Michelfeit, Jan; Tuulik, Maria; Viks, Ülle (2015). Automatic generation of the Estonian Collocations Dictionary database. Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11–13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd, 1–20.
Koppel, Kristina (2020). Näitelausete korpuspõhine automaattuvastus eesti keele õppesõnastikele. (Doctoral thesis, University of Tartu). Tartu: Tartu Ülikooli Kirjastus.


The corpus contains sentences selected from the Estonian National Corpus 2019 ("Eesti keele ühendkorpus 2019") and the Estonian RSS Feed Corpus 2020 ("Eesti uudisvoogude korpus 2020") with GDEX (Good Dictionary Examples), a tool for detecting good example sentences (Kilgarriff et al. 2008; for the Estonian module, see Koppel 2020), together with example sentences from the Estonian Collocations Dictionary 2019 ("Eesti keele naabersõnade sõnastik 2019"); for details, see Koppel 2020. The corpus does not contain full texts. The corpus size is 331,491,665 tokens, 280,961,465 words and 27,987,754 sentences.

References:
Kilgarriff, Adam; Husák, Milos; McAdam, Katy; Rundell, Michael; Rychlý, Pavel (2008). GDEX: Automatically finding good dictionary examples in a corpus. In Elisenda Bernal, Janet DeCesaris (Eds.), Proceedings of the 13th EURALEX International Congress. Barcelona: Institut Universitari de Linguistica Aplicada, Universitat Pompeu Fabra, 425–432.
Kallas, Jelena; Koppel, Kristina; Tuulik, Maria (2015). Korpusleksikograafia uued võimalused eesti keele kollokatsioonisõnastiku näitel. Eesti Rakenduslingvistika Ühingu aastaraamat, 11, 75–94.
Koppel, Kristina (2020). Näitelausete korpuspõhine automaattuvastus eesti keele õppesõnastikele. (Doctoral thesis, University of Tartu). Tartu: Tartu Ülikooli Kirjastus.


Distribution

DOI

10.15155/3-00-0000-0000-0000-085B4L

Availability

Available - Unrestricted Use

Licence

CC BY

IPR Holder

Eesti Keele Instituut, Institute of the Estonian Language

Contact Person

Kristina Koppel

text

Monolingual text corpus

Languages

Estonian

Linguality

Linguality type: Monolingual

Size

27,987,754 Sentences

280,961,465 Words
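The sizes stated in the description (tokens, words, sentences) allow a few derived figures, such as average sentence length. The following is a minimal sketch in Python; the function name `corpus_stats` is hypothetical, and reading the gap between tokens and words as punctuation and other non-word tokens is an assumption, not something the record states.

```python
# Sizes as stated in the corpus description.
TOKENS = 331_491_665
WORDS = 280_961_465
SENTENCES = 27_987_754


def corpus_stats(tokens: int, words: int, sentences: int) -> dict:
    """Derive simple per-sentence averages from the published totals.

    Assumption: tokens that are not counted as words are punctuation
    or other non-word tokens; the record itself does not define this.
    """
    non_word_tokens = tokens - words
    return {
        "tokens_per_sentence": tokens / sentences,
        "words_per_sentence": words / sentences,
        "non_word_tokens": non_word_tokens,
        "non_word_share": non_word_tokens / tokens,
    }


stats = corpus_stats(TOKENS, WORDS, SENTENCES)
for name, value in stats.items():
    # Integers print as-is; ratios are rounded for readability.
    if isinstance(value, int):
        print(f"{name}: {value}")
    else:
        print(f"{name}: {value:.3f}")
```

Averages of this kind are only rough indicators, since the sentences were selected by GDEX rather than sampled at random.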

Metadata

Created: 2020-07-01

Last Updated: 2022-03-30

Version

Version: 2020
