Estonian Web Treebank with manually annotated sentence and token boundaries

Estonian title: Käsitsi lausestatud ja sõnestatud Eesti veebipuudepank

Veebipuud laused / web trees sentences

Texts of the Estonian Web Treebank (Muischnek et al., 2019), manually annotated with both orthographic and syntactic sentence boundaries. The tokenization has also been manually checked and corrected. The sentence boundary annotation process is described by Sirts and Peekman (2020); the tokenization verification process is described in Kairit Peekman's (2020) bachelor's thesis.

When using this data, please cite the article by Sirts and Peekman (2020).

Muischnek, K., Müürisep, K., & Särg, D. D. (2019). CG Roots of UD Treebank of Estonian Web Language. In Proceedings of the NoDaLiDa 2019 Workshop on Constraint Grammar-Methods, Tools and Applications, 30 September 2019, Turku, Finland (No. 168, pp. 23-26). Linköping University Electronic Press.
Peekman, K. (2020). Automaatse lausestamise ja sõnestamise hindamine uue meedia keele korpusel [Evaluation of Automatic Sentence and Word Tokenization on the Corpus of New Media Language] (Bachelor's thesis). University of Tartu. Retrieved from https://comserv.cs.ut.ee/ati_thesis/datasheet.php?id=69690&year=2020.
Sirts, K., & Peekman, K. (2020). Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts. In Human Language Technologies – The Baltic Perspective (Frontiers in Artificial Intelligence and Applications, Vol. 328, pp. 174-181).
