This page was last updated 2009-05-14
Speech Corpora
(If not stated otherwise, the language of the corpora is German.)
Entire Catalog
The following corpora are currently available on CD-ROM:
Some audio files of the available corpora.
10 speakers - 10000 utterances - dictation - orthography
100 speakers - 10000 utterances - dictation - orthography
201 speakers - 21681 utterances - read speech - orthography, canonical
transcription, automatic segmentation
16 speakers - 3200 utterances - read speech - orthography, canonical
transcription, automatic segmentation, prosodic labeling
88 speakers - 1 story - read speech - orthography, canonical transcription
8 speakers - 8 repetitions of 100 utterances - field recordings with real
background noise - noise annotated - orthography, canonical pronunciation, noises
70 speakers (67 non-native, 3 native German speakers) - 100 phonetically
balanced sentences, numbers from 1 to 100, 1 story, 1 dialogue, 1
re-telling of a German story - transliteration, orthography, canonical transcription
106 speakers - 11100 utterances - read speech - orthography
22 speakers - robot control - 10810 utterances - read speech -
phoneme and word segmentation
2 professional speakers - laryngographic signal - prosodic labeling - 4 CD-ROMs
94 dialogues of a German speaking cab dispatcher and an English speaking
client - recorded via real phone connections (fixed and GSM) - orthography, canonical pronunciation, translation
3909 recordings of spontaneous speech (monologues) via public phone lines -
SpeechDat transcription
recordings of read and spontaneous speech by adolescents aged 13-20 -
SpeechDat annotation
15600 recordings of commands addressed to a web pad - British English and French - 49 speakers - office environment - SpeechDat annotation
7746 recordings of street names, ZIP codes, city names and phone numbers - 1957 speakers - all environments - SpeechDat annotation
11036 logatom recordings covering all German diphones - 4 professional speakers - studio, 2 mics, laryngo - manual segmentation, BAS Partitur Format
6732 sentence recordings covering all German diphones in different prosodic contexts - 4 professional speakers - studio, 2 mics, laryngo - manual phonetic segmentation and prosodic annotation, BAS Partitur Format
Special edition containing all audio channel recordings of the SmartKom corpora - 224 speakers - 448 sessions - Scenarios: Public, Home, Mobil
10966 human-machine queries using a smart phone - 156 speakers - natural environment, 2 mics (collar, Bluetooth), UMTS + high quality channel - transliteration (Verbmobil compatible), BAS Partitur Format
2315 human-machine queries on a running motorcycle - 36 speakers - natural environment, 2 mics (Bluetooth helmet, neck microphone), UMTS + high quality channel - transliteration (Verbmobil compatible), BAS Partitur Format
2218 human-machine queries in a human-human-machine situation using a smart phone, video face-capture of the person asking - 99 speakers - natural environment, 2 mics (collar, Bluetooth), UMTS + high quality channel - transliteration (Verbmobil compatible), manual turn segmentation, BAS Partitur Format
864 sessions of up to 138 items each (read, spontaneous) by adolescent speakers aged 12-20 - 864 speakers - natural environment (school), 2 microphones (headset, desktop), demographic distribution within Germany - transliteration according to the SpeechDat standard, manual segmentation of utterance start/end, BAS Partitur Format files, MAUS segmentation
Recordings of intoxicated and sober speakers aged 22-75 - 150 speakers (estimate for the final corpus) - automotive environment, 2 microphones (headset, mouse microphone) - transliteration according to the extended SpeechDat standard, manual segmentation of utterance start/end, BAS Partitur Format files, MAUS segmentation
The TED corpus is currently distributed by ELDA. BAS will therefore distribute further copies of the corpus only once this first edition has run out.
For further questions or orders please contact
In a second step the signals are analysed in more detail: an automatic segmentation into phonemes and words is carried out, deviations from the canonical word form are detected, and other features are extracted. These algorithms are currently in the beta phase. First results are available for the corpora VM 1-5,7,12, PD1, PD2, SI100, SI1000. All results of this further analysis are stored in the BAS Partitur Format (BPF).
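As a rough illustration of how segmentation results stored in a BPF file might be read, the following Python sketch collects the entries of one segmental tier. The tier name `MAU`, the field order (start sample, duration in samples, word index, label), and the sample lines are assumptions made for this example; the BPF description and the corpus documentation remain the authoritative reference.

```python
def parse_bpf_tier(lines, tier="MAU"):
    """Collect the segments of one tier from the lines of a BPF file.

    Assumed line layout (not confirmed by this page):
        TIER: <start sample> <duration in samples> <word index> <label>
    """
    prefix = tier + ":"
    segments = []
    for line in lines:
        if not line.startswith(prefix):
            continue  # header entry or a different tier
        parts = line[len(prefix):].split(None, 3)
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed layout
        start, duration, word, label = parts
        segments.append({
            "start": int(start),
            "duration": int(duration),
            "word": int(word),
            "label": label.strip(),
        })
    return segments


# Hypothetical file content, invented for illustration only.
sample = [
    "LHD: Partitur 1.2",
    "SAM: 16000",
    "LBD:",
    "ORT: 0 guten",
    "MAU: 0 1599 -1 <p:>",
    "MAU: 1600 799 0 g",
    "MAU: 2400 1199 0 u:",
]
segments = parse_bpf_tier(sample)
labels = [s["label"] for s in segments]
```

Start and duration are kept as sample counts here; dividing by the sampling rate of the signal file would give seconds.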
In a sub-project of the German BITS project (TP8) all currently available
BAS corpora are re-validated against
public guidelines. The results of this
re-validation will be published on the BITS webserver.
File Formats and Software
Most of the speech corpora distributed by BAS contain signal files
in NIST SPHERE format. Some corpora use the SAM or PhonDat
formats instead.
A description of the formats used in BAS corpora can be found
here.
All formats are also described in detail in the documentation
accompanying each corpus on CD-ROM (most of it can be accessed
online via the WWW page of the corpus).
Finally, each BAS CD-ROM includes a small collection of
software and ANSI C functions for accessing the signal files,
as well as tools to convert the PhonDat format into
NIST SPHERE format, SAM format, or raw sound files.
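For readers who want to inspect a signal file without the supplied C functions, here is a minimal Python sketch that reads the text header of a NIST SPHERE file. It assumes the usual SPHERE layout: a `NIST_1A` line, a line giving the header size in bytes, then `name type value` entries terminated by `end_head`. The function name `read_sphere_header` and the demo bytes are made up for this example and are not part of the BAS software.

```python
import io


def read_sphere_header(stream):
    """Parse the text header of a NIST SPHERE file opened in binary mode.

    Returns (header_size, fields); the sample data starts at header_size.
    """
    # The first two 8-byte lines hold the magic string and the header size.
    preamble = stream.read(16)
    magic, size_text = preamble[:8].strip(), preamble[8:].strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(size_text)
    raw = stream.read(header_size - 16).decode("ascii", errors="replace")

    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # blank line or comment
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # malformed or empty entry
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


# Synthetic header built in memory, for illustration only.
demo = (b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\n"
        b"channel_count -i 1\n"
        b"sample_coding -s3 pcm\n"
        b"end_head\n").ljust(1024, b" ") + b"\x00\x00"
size, fields = read_sphere_header(io.BytesIO(demo))
```

For a real file, pass `open(path, "rb")` instead of the in-memory demo. Which header fields are present depends on the corpus, so the corpus documentation should be checked before relying on any particular name.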
Florian Schiel