Speech Corpora

(If not stated otherwise, the language of the corpora is German.)

Entire Catalog

Presently the following corpora are available on CD-R, DVD-R or hard disk. Note that a subset of these corpora is also accessible online, in the BAS CLARIN Repository, for university members and licensees of BAS resources. Some audio files of the available corpora are also provided.

The TED corpus is currently distributed by ELDA. BAS will therefore disseminate further copies of the corpus only once this first edition has run out.

Corpora for commercial usage

All speech corpora of BAS are available for commercial usage. Commercial usage covers any development of speech technology on the basis of the BAS data, as well as the commercial exploitation of products that were developed on the basis of the BAS data. Commercial usage does not include direct exploitation of the data itself: no BAS data may be distributed to third parties under any circumstances. Some BAS corpora require a special license fee for commercial usage; see the corpus pages for details.

Corpora of read speech

The following corpora contain read speech, some of them recorded as a dictation task:

Corpora of spontaneous speech

The following corpora contain spontaneous recorded speech:

Corpora of accented/dialectal/alcoholized speech

The following corpora contain speech with classified (foreign) accent / dialect:

Corpora with telephone speech

The following BAS corpora contain speech recorded via public telephone lines (traditional and cellular, GSM):

Corpora with high quality speech

'High quality speech' denotes recordings made with a sampling frequency of at least 16 kHz in a controlled environment (studio). The following BAS corpora contain high quality speech:

Planned Corpora

The following corpora are presently in preparation for edition by BAS and will be available in the near future:

Processing and Evaluation

Before distribution the BAS corpora are evaluated for certain formal properties (BAS Short Test). These properties include:

After passing this formal evaluation, the corpora are stored as 'master volumes' in our archive. They are linked to a central documentation and software server. When an order is placed, the volumes are copied to CDROM and distributed (press on demand).

In a second step the signals are analysed in more detail. An automatic segmentation into phonemes and words is carried out (MAUS), deviations from the canonical word form are detected, and other features are extracted. All results of this further analysis are stored in the BAS Partitur Format (BPF).
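As an illustration of how segmentation results stored in BPF might be read, here is a minimal Python sketch. It assumes the commonly documented BPF layout: a header of `KEY: value` lines terminated by `LBD:`, followed by tier lines, with `MAU` entries of the form `<start sample> <duration> <word index> <label>`. The function names `parse_bpf` and `parse_mau` and the sample content are hypothetical; the BPF documentation remains the authority on the exact field semantics.

```python
def parse_bpf(text):
    """Split BPF text into header fields and tier entries.

    Assumption: header lines come first and end with an 'LBD:' line;
    every later line is '<TIER>: <entry>'.
    """
    header = {}
    tiers = {}
    in_body = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        # Split at the first colon only, so labels such as '<p:>' survive.
        key, _, rest = line.partition(":")
        rest = rest.strip()
        if not in_body:
            if key == "LBD":
                in_body = True
            else:
                header[key] = rest
            continue
        tiers.setdefault(key, []).append(rest)
    return header, tiers


def parse_mau(entries):
    """Turn MAU entries into (start, duration, word_index, label) tuples.

    Assumption: each entry holds three integers followed by a label.
    """
    segments = []
    for entry in entries:
        start, duration, word_index, label = entry.split(None, 3)
        segments.append((int(start), int(duration), int(word_index), label))
    return segments


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = (
        "LHD: Partitur 1.2\n"
        "SAM: 16000\n"
        "LBD:\n"
        "ORT: 0 guten\n"
        "MAU: 0 4799 -1 <p:>\n"
        "MAU: 4800 1599 0 g\n"
    )
    header, tiers = parse_bpf(sample)
    for segment in parse_mau(tiers.get("MAU", [])):
        print(segment)
```

Keeping the tiers as plain lists per tier name leaves the interpretation of each tier (orthography, canonical form, segmentation) to the caller.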

In a sub-project of the German BITS project (TP8) all currently available BAS corpora are being re-validated against public guidelines. The results of this re-validation will be published on the BITS webserver.
Within the CLARIN initiative, these validation guidelines must be followed before publication in the BAS CLARIN repository.

File Formats and Software

Most of the speech corpora disseminated by BAS contain signal files in NIST SPHERE format. Some corpora contain SAM or PhonDat formats.
A description of the formats used in BAS corpora is available separately.
In addition, all formats are described in detail in the accompanying corpus documentation on CDROM (most of it can also be accessed online via the WWW page of the corpus).
Finally, each BAS CDROM contains a small collection of software and ANSI C functions for accessing the signal files, as well as tools to convert the PhonDat format into NIST SPHERE format, SAM format or raw sound files.
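For readers without the ANSI C functions at hand, the following Python sketch shows how the ASCII header of a NIST SPHERE file might be inspected. It is not the BAS software. It assumes the usual SPHERE layout: a first line `NIST_1A`, a second line giving the header size in bytes, then `name -type value` lines (`-i` integer, `-r` real, `-sN` string) up to `end_head`. The function name `parse_sphere_header` and the field names in the demo are illustrative.

```python
import io


def parse_sphere_header(stream):
    """Read the ASCII header of a NIST SPHERE file from a binary stream.

    Returns (header_size_in_bytes, dict_of_fields).
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line holds the total header size in bytes.
    header_size = int(stream.readline().strip())
    stream.seek(0)
    header = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in header.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        name, type_flag, value = parts
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        else:
            # '-sN' marks a string of N characters; keep it as text.
            fields[name] = value
    return header_size, fields


if __name__ == "__main__":
    # Synthetic header for illustration; real files carry more fields.
    raw = (
        b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\n"
        b"channel_count -i 1\n"
        b"end_head\n"
    ).ljust(1024, b" ")
    size, fields = parse_sphere_header(io.BytesIO(raw))
    print(size, fields)
```

The signal data start at byte offset `header_size`; how the samples are encoded depends on header fields that this sketch only reads and does not interpret.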


The following section gives some of the most common uses of BAS speech corpora.

Automatic speech recognition

To initialize statistically based speech recognition applications, phonetically labelled and segmented corpora are needed.
The following corpora may be used for this purpose:
For embedded training without segmentation (after bootstrapping):

Human-machine interaction (HMI)

Speech synthesis

For PSOLA synthesis, all corpora with segmental information may be used (corpora with automatically segmented speech are given in brackets):

Speaker recognition, verification, adaptation, paralinguistic classification

PD1 and SI100 contain a variety of speakers of both sexes and different ages. SI1000 contains a large amount of data from fewer speakers (1000 newspaper articles).

Empirical phonetic investigations

All BAS corpora with manual segmentations may be used. Since such data are naturally scarce, it may be wise to use automatically segmented data as well (in brackets):

Prosodic investigations

Foreign accents / speaker characteristics

Dialectal variation

