Corpus of Estonian Web Sentences 2021 – META-SHARE


Corpus of Estonian Web Sentences 2021


Eesti keele veebilausete korpus 2021

The corpus consists of sentences extracted from the Estonian National Corpus 2020 (see Koppel and Kallas 2022) using GDEX (Good Dictionary Examples; Kilgarriff et al. 2008), a tool for detecting good dictionary examples, together with example sentences from the Estonian Collocations Dictionary 2019 (Kallas et al. 2015; see also Koppel 2020). The corpus does not include full documents. It contains 558,647,923 tokens, 473,455,876 words, and 47,011,383 sentences.

References:
Kilgarriff, Adam; Husák, Milos; McAdam, Katy; Rundell, Michael; Rychlý, Pavel (2008). GDEX: Automatically finding good dictionary examples in a corpus. In Elisenda Bernal, Janet DeCesaris (Eds.), Proceedings of the 13th EURALEX International Congress. Barcelona: Institut Universitari de Linguistica Aplicada, Universitat Pompeu Fabra, 425–432.
Kallas, Jelena; Kilgarriff, Adam; Koppel, Kristina; Kudritski, Elgar; Langemets, Margit; Michelfeit, Jan; Tuulik, Maria; Viks, Ülle (2015). Automatic generation of the Estonian Collocations Dictionary database. Electronic lexicography in the 21st century: linking lexical data in the digital age. Proceedings of the eLex 2015 conference, 11–13 August 2015, Herstmonceux Castle, United Kingdom. Ljubljana/Brighton: Trojina, Institute for Applied Slovene Studies/Lexical Computing Ltd, 1–20.
Koppel, Kristina (2020). Näitelausete korpuspõhine automaattuvastus eesti keele õppesõnastikele. (Doctoral thesis, University of Tartu). Tartu: Tartu Ülikooli Kirjastus.
Koppel, Kristina; Kallas, Jelena (2022). Eesti keele ühendkorpuste sari 2013–2021: mahukaim eestikeelsete digitekstide kogu. Eesti Rakenduslingvistika Ühingu aastaraamat 18, [to be published].


The corpus contains sentences selected from the Estonian National Corpus 2021 (see Koppel and Kallas 2022) using GDEX (Good Dictionary Examples), a tool for detecting good example sentences (Kilgarriff et al. 2008; on the Estonian module, see Koppel 2020), together with example sentences from the Estonian Collocations Dictionary 2019 (see Koppel 2020). The corpus does not contain full texts. The corpus size is 558,647,923 tokens, 473,455,876 words, and 47,011,383 sentences.



Distribution

DOI

10.15155/3-00-0000-0000-0000-08EB2L

Availability

Available - Restricted Use

Licence

CC - BY

Contact Person

Kristina Koppel

text

Monolingual text corpus

Languages

Estonian

Linguality

Linguality type: Monolingual

Size

47,011,383 Sentences

473,455,876 Words

Metadata

Created: 05/23/2022

Last Updated: 05/23/2022

Version

Version: 2021
