Phonetic Corpus of Estonian Spontaneous Speech v.1.0.6

Eesti keele spontaanse kõne foneetiline korpus v.1.0.6



The corpus consists of high-quality audio recordings of spontaneous Estonian speech and their segmentation on different levels. The main body of the corpus contains dialogues, but it also includes a sub-corpus of lecture monologues and a sub-corpus of spontaneous discussions with three participants. The speakers come from different age groups and have various dialect backgrounds.

Most of the recordings were made in a recording studio, and some during fieldwork. The audio signal of each speaker is recorded on a separate channel. The distance between the speakers is about 1.5-2 meters to minimize the effect of overlaps. Recordings are saved in PCM WAV format. Annotations are saved in Praat TextGrid format as UTF-8 text files.
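Because each speaker is on a separate channel, a recording can be split into per-speaker signals before analysis. The following is a minimal sketch using only the Python standard library. It assumes 16-bit PCM samples, which the corpus description does not state, and the function name split_channels is hypothetical.

```python
import array
import sys
import wave


def split_channels(path):
    """Return (sample_rate, [samples_of_channel_0, samples_of_channel_1, ...]).

    Assumption: the file holds 16-bit PCM; other sample widths are rejected.
    """
    with wave.open(path, "rb") as wav:
        n_channels = wav.getnchannels()
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    samples = array.array("h")
    samples.frombytes(raw)
    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()

    # Samples are interleaved frame by frame: ch0, ch1, ch0, ch1, ...
    channels = [samples[i::n_channels] for i in range(n_channels)]
    return rate, channels
```

For a two-speaker dialogue, channels[0] and channels[1] would then hold the two speakers' signals. Which channel belongs to which speaker is not specified here and has to be checked against the corpus documentation.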

The current version of the corpus contains approximately 127 hours of recordings from 195 speakers. Manual word- and phoneme-level annotation is available for 100 hours of recordings (770 000 words). Video recordings (mp4) are also available for 18 h of dialogues and 15 h of trialogues. The subset of trialogues includes a breathing signal recorded with a belt plethysmograph.

Segmentation and annotation are done with the Praat program. Recordings are segmented manually on different levels. The following tiers are used:
-Words (in orthographic spelling),
-Phonemes (SAMPA adjusted for Estonian),
-Syllables (short – long, open – closed),
-Prosodic feet (stress pattern, quantity),
-Intonation phrases or inter-pausal units,
-Voice quality (creaky voice),
-Morphological information (automatically annotated using Estmorf/Vabamorf).
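
The tiers listed above can be read from a TextGrid file with a few lines of code. The sketch below is illustrative only: it assumes the files use Praat's long ("full") text format in UTF-8, it handles interval tiers only, and it does not support labels that span several lines. The function name read_interval_tiers is hypothetical, and the actual tier names in the corpus files may differ from the headings above.

```python
def _value(line):
    """Return the part of a 'key = value' line after the first '='."""
    return line.split("=", 1)[1].strip()


def _unquote(value):
    """Strip the surrounding quotes and undo Praat's doubled-quote escaping."""
    return value[1:-1].replace('""', '"')


def read_interval_tiers(path):
    """Map each tier name to a list of (xmin, xmax, label) tuples.

    Assumption: long-format TextGrid, one 'key = value' pair per line.
    """
    tiers = {}
    current = None
    xmin = xmax = None
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if line.startswith("name ="):
                # A new tier starts; later intervals are collected under it.
                current = tiers.setdefault(_unquote(_value(line)), [])
            elif line.startswith("xmin ="):
                xmin = float(_value(line))
            elif line.startswith("xmax ="):
                xmax = float(_value(line))
            elif line.startswith("text =") and current is not None:
                # 'text' closes an interval, so xmin/xmax are the interval's.
                current.append((xmin, xmax, _unquote(_value(line))))
    return tiers
```

With the result in hand, the duration of each labelled unit is simply xmax - xmin, so per-word or per-phoneme durations can be computed by iterating over the corresponding tier.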

