This page was last updated 2021-05-01
Please note that selected corpora of this catalogue and other corpora not listed here may be downloaded for free by academic users from the CLARIN Repository (partly marked with a (*) in the following).
At present, the following corpora are available on CD-R, DVD-R, hard disk, or online.
Note that a subset of these corpora is also accessible online to members of academic institutions
and to licensees of BAS resources in the
BAS CLARIN Repository
(tagged with (*) in the following list).
The following speech corpora are accessible exclusively via the
BAS CLARIN Repository;
commercial use is possible in some cases (inquiries via bas@bas.uni-muenchen.de):
CH-Jugendsprache, MOCHA, NM-MoCap-Corpus, NSC, Sprecherinnen, VERIF1DE, VMEmo, WaSeP
Speech Corpora
(If not stated otherwise, the language of the corpora is German.)
Entire Catalog
Some audio files of the available corpora.
10 speakers - 10000 utterances - dictation - orthography
100 speakers - 10000 utterances - dictation - orthography
201 speakers - 21681 utterances - read speech - orthography, canonical
transcription, automatic segmentation
16 speakers - 3200 utterances - read speech - orthography, canonical
transcription, automatic segmentation, prosodic labeling
88 speakers - 1 story - read speech - orthography, canonical transcription
8 speakers - 8 repetitions of 100 utterances - field recordings with real
background noise - noise annotated - orthography, canonical pronunciation, noises
70 speakers (67 non-native, 3 native German speakers) - 100 phonetically
balanced sentences, numbers from 1 to 100, 1 story, 1 dialogue, 1
re-telling of a German story - transliteration, orthography, canonical transcription
106 speakers - 11100 utterances - read speech - orthography
22 speakers - robot control - 10810 utterances - read speech -
phoneme and word segmentation
2 professional speakers - laryngographic signal - prosodic labeling - 4 CD-ROMs
94 dialogues of a German speaking cab dispatcher and an English speaking
client - recorded via real phone connections (fixed and GSM) - orthography, canonical pronunciation, translation
3909 recordings of spontaneous speech (monologues) via public phone lines -
SpeechDat transcription
17293 recordings of 4366 speakers answering 4 questions via public phone (land)lines -
SpeechDat transcription
recordings of read and spontaneous speech by adolescents aged 13-20 -
SpeechDat annotation
15600 recordings of commands addressed to a web pad - British English and French - 49 speakers - office environment - SpeechDat annotation
7746 recordings of street names, ZIP codes, city names and phone numbers - 1957 speakers - all environments - SpeechDat annotation
11036 logatom recordings covering all German diphones - 4 professional speakers - studio, 2 mics, laryngo - manual segmentation, BAS Partitur Format
6732 sentence recordings covering all German diphones in different prosodic contexts - 4 professional speakers - studio, 2 mics, laryngo - manual phonetic segmentation and prosodic annotation, BAS Partitur Format
Special edition containing all audio channel recordings of the SmartKom corpora - 224 speakers - 448 sessions - Scenarios: Public, Home, Mobile
10966 human-machine queries using a smart phone - 156 speakers - natural environment, 2 mics (collar, Bluetooth), UMTS + high quality channel - transliteration (Verbmobil compatible), BAS Partitur Format
2315 human-machine queries on running motorcycle - 36 speakers - natural environment, 2 mics (Bluetooth helmet, neck micro), UMTS + high quality channel - transliteration (Verbmobil compatible), BAS Partitur Format
2218 human-machine queries in a human-human-machine situation using a smart phone, video face-capture of asking person - 99 speakers - natural environment, 2 mics (collar, Bluetooth), UMTS + high quality channel - transliteration (Verbmobil compatible), manual turn segmentation, BAS Partitur Format
1019 sessions of up to 138 items each (read, spontaneous) of adolescent speakers aged 12-20 - 1019 speakers - natural environment (school), 2 microphones (headset, desktop), demoscopic distribution within Germany - transliteration according to the SpeechDat standard, manual segmentation of utterance start/end, BAS Partitur Format files, MAUS segmentation
Recordings of intoxicated and sober speakers aged 22-75 - 150 speakers (estimate for the final corpus) - automotive environment, 2 microphones (headset, mouse micro) - transliteration according to the extended SpeechDat standard, manual segmentation of utterance start/end, BAS Partitur Format files, MAUS segmentation
Corpus for speaker verification via telephone - 150 speakers - 20 recording sessions per speaker - transliteration according to the SpeechDat standard, SpeechDat database format
Paralinguistic speaker classification via phone lines - 945 speakers - 1-7 recording sessions (calls) per speaker - transliteration according to SpeechDat standard, SpeechDat database format
Italian Maptask Recordings from CLIPS - 30 Speakers - 2 Recording Sessions per Speaker - Transliteration, Segmentation according to the CLIPS Standard - BPF, TextGrid, Emu
Lombard Dialogue speech - 24 Speakers - 12 Recording Sessions per Speaker Pair - Speech/non-speech segmentation - BPF
Recordings of Calabrese (Italy) - 68 Speakers - 331 Recording Sessions - Orth.-Phon. Transcription - TextGrid
Historic recordings of Saxonian German spoken in Romania in 1970 - 1805 Speakers - 2264 Recording Sessions - Orth. and phon. transcriptions - TextGrid
Speech data of the dissertation Feiser (2016) - German - 20 speakers (10 brother pairs) - 7240 recording sessions - orth. and autom. phon. transcription - TextGrid, emuDB
The TED corpus is currently distributed by ELDA. BAS will therefore disseminate further copies of the corpus only once this first edition has run out.
For further questions or orders please contact
In a second step the signals are analysed in more detail. An automatic segmentation into phonemes and words is carried out (MAUS), deviations from the canonical word form are detected, and other features are extracted. All results from further analysis are stored in the BAS Partitur Format (BPF).
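As a rough illustration of how such results can be read back, the following minimal sketch extracts the MAUS segmentation from a BPF file. It assumes the commonly documented layout of the MAU tier (start sample, duration in samples, word link, phonetic label, with a link of -1 for segments that belong to no word); the function name parse_mau_tier, the class MauSegment and the sample data are invented for illustration and are not part of any BAS tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MauSegment:
    start: int       # first sample of the segment
    duration: int    # segment length in samples, as stored in the file
    word_index: int  # link to the word tier; -1 assumed to mean "no word"
    label: str       # phonetic label


def parse_mau_tier(bpf_text: str) -> List[MauSegment]:
    """Collect all MAU tier entries from the text of a BPF file."""
    segments = []
    for line in bpf_text.splitlines():
        if not line.startswith("MAU:"):
            continue  # skip header lines and all other tiers
        fields = line.split(None, 4)
        if len(fields) != 5:
            continue  # ignore malformed lines rather than guessing
        _, start, duration, word_index, label = fields
        segments.append(
            MauSegment(int(start), int(duration), int(word_index), label.strip())
        )
    return segments


# Invented sample data, for illustration only.
SAMPLE = """\
LHD: Partitur 1.3
SAM: 16000
LBD:
ORT: 0 guten
ORT: 1 Tag
MAU: 0 1599 -1 <p:>
MAU: 1600 799 0 g
MAU: 2400 1199 0 u:
"""

segments = parse_mau_tier(SAMPLE)
for seg in segments:
    print(seg.start, seg.duration, seg.word_index, seg.label)
```

The sketch only reads the fields as they appear; converting sample positions to seconds would additionally require the sampling rate from the header (the SAM entry in the sample above).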
In a sub-project of the German BITS project (TP8) all available
BAS corpora have been re-validated against
public guidelines. The results of this
re-validation will be published on the BITS webserver.
Within the CLARIN initiative, these validation guidelines must be followed before
publication in the BAS CLARIN repository.
Florian Schiel