EKI error-annotated Estonian L2 learner corpus (version 2) – META-SHARE

Last view: 2026-04-10

Last update: 2025-10-30

EKI error-annotated Estonian L2 learner corpus (version 2)

Estonian name: EKI veamärgendatud E2 õppijakorpus (versioon 2)

ID: https://doi.org/10.15155/27bh-ny83

The error-annotated corpus is built on material from the Estonian learner corpus EMMA: Estonian learner assessment tests (7th grade, 504 texts), basic school final exams (9th grade, 501 texts) and state exams (12th grade, 998 texts) from the Education and Youth Board. A manually created error annotation layer adds to the value of the corpus by enabling more in-depth study of the language use of learners of Estonian as a second language. Errors are marked using categories based on the ERRANT-M2 scheme. The corpus contains 2,003 texts in total, and the goal is to keep expanding it as new material arrives.
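As a rough illustration of what ERRANT-style M2 annotation looks like, the sketch below parses one M2 block and applies its edits. It assumes the standard M2 layout (an `S` line with the tokenised sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`). The record above does not state the corpus's actual file format, so that layout is an assumption. The sample sentence, the `R:SPELL` label and the names `Edit`, `parse_m2` and `apply_edits` are invented for illustration and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    start: int        # token offset where the erroneous span begins
    end: int          # token offset where it ends (exclusive)
    error_type: str   # error category label, e.g. "R:SPELL"
    correction: str   # replacement text; may be empty for deletions
    annotator: int    # annotator id from the last field


def parse_m2(block: str):
    """Parse one M2 block: an 'S' line followed by zero or more 'A' lines."""
    tokens, edits = [], []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
    return tokens, edits


def apply_edits(tokens, edits):
    """Return the corrected token list; 'noop' edits (offset -1) are skipped."""
    corrected = list(tokens)
    # Apply from right to left so earlier offsets stay valid.
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        if edit.start < 0:
            continue
        corrected[edit.start:edit.end] = edit.correction.split()
    return corrected


# Invented example: a misspelled object form in "I am reading a book."
SAMPLE = """
S Ma loen ramatut .
A 2 3|||R:SPELL|||raamatut|||REQUIRED|||-NONE-|||0
"""

tokens, edits = parse_m2(SAMPLE)
corrected = apply_edits(tokens, edits)
```

Applying edits from right to left avoids recomputing offsets when a correction changes the number of tokens.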

Annotation layers: The corpus includes an error annotation layer added manually by annotators. In addition, the corpus is automatically annotated for morphology (lemma, part of speech and grammatical categories of each word), surface syntax (syntactic functions) and dependency syntax. In the dependency-syntactic analysis, each relation links two words, a dependent and its head, and is named according to the syntactic function.
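The head/dependent relation described above can be sketched with a small reader for CoNLL-U-style rows, the tab-separated format that UDPipe produces. Whether the corpus is distributed in CoNLL-U is an assumption. The example sentence and its analysis are hand-written for illustration, and `Token`, `parse_conllu_sentence` and `dependency_pairs` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # word form as written
    lemma: str    # dictionary form
    upos: str     # universal part-of-speech tag
    head: int     # idx of the head word; 0 marks the root
    deprel: str   # name of the relation to the head


def parse_conllu_sentence(text: str):
    """Read the token rows of one CoNLL-U sentence (10 tab-separated columns)."""
    tokens = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        tokens.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7])
        )
    return tokens


def dependency_pairs(tokens):
    """List (dependent, relation, head) triples for a parsed sentence."""
    by_idx = {t.idx: t for t in tokens}
    return [
        (t.form, t.deprel, by_idx[t.head].form if t.head else "ROOT")
        for t in tokens
    ]


# Hand-written illustrative analysis of "Ma loen raamatut ." ("I am reading a book.")
ROWS = [
    ["1", "Ma", "mina", "PRON", "_", "_", "2", "nsubj", "_", "_"],
    ["2", "loen", "lugema", "VERB", "_", "_", "0", "root", "_", "_"],
    ["3", "raamatut", "raamat", "NOUN", "_", "_", "2", "obj", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

pairs = dependency_pairs(parse_conllu_sentence(SAMPLE))
```

Each triple names the subordinate word first, then the relation, then its head, so `("raamatut", "obj", "loen")` reads as "raamatut is the object of loen".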

The materials of the error-annotated corpus are based on the EMMA learner language corpus and contain data from the Education and Youth Board's assessment tests (7th grade, 504 texts), basic school final exams (9th grade, 501 texts) and state exams (12th grade, 998 texts). The corpus material has been enriched with a manually added error annotation layer, which makes it possible to analyse the language use and typical errors of learners of Estonian as a second language. The ERRANT-M2 error annotation scheme was used. The corpus comprises 2,003 texts in total. The aim is to keep supplementing the corpus and to increase its size with new materials.

As annotation layers, the corpus contains a manually created error annotation layer and an automatically created grammatical layer. The UDPipe parser was used for the grammatical annotation.

Distribution

DOI

10.15155/5hwt-3n69

Availability

Available - Restricted Use

Licence

CLARIN ACA

Contact Person

Kristjan Suluste

text

Monolingual text corpus

Languages

Estonian

Linguality

Linguality type: Monolingual

Size

22,518 Sentences

2,003 Texts

Resource Creation

Resource Creator

Kristjan Suluste

Metadata

Created: 2025-10-20

Last Updated: 2025-10-30

Version

Version: 2

Validation

Validated
