This page was last updated 2015-08-21
Please note that academic users may download selected corpora free of charge from the CLARIN Repository (some of these are marked with (*) in the following).
(Unless stated otherwise, the language of the corpora is German.)
At present the following corpora are available on CD-R, DVD-R or hard disk.
Note that a subset of these corpora is also accessible online to university members and licensees of BAS resources in the BAS CLARIN Repository (tagged with (*) in the following list).
The TED corpus is currently distributed by ELDA. BAS will therefore only distribute further copies of the corpus once this first edition has run out.
For further questions or orders please contact
In a second step the signals are analysed in more detail. An automatic segmentation into phonemes and words is carried out (MAUS), deviations from the canonical word form are detected, and other features are extracted. All results of this further analysis are stored in the BAS Partitur Format (BPF).
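As a rough illustration of how such results might be read back, the following Python sketch splits a BPF text into its header and its tiers. It is a minimal sketch under stated assumptions, not BAS software: it assumes the header ends at an "LBD:" line, that each line has the form "KEY: value", and that MAU entries hold start sample, duration in samples, word index and label in that order. The function names and the sample text are hypothetical; the corpus documentation remains the authority on the actual layout.

```python
from collections import defaultdict

# Hypothetical sample text; the field layout is an assumption (see lead-in).
SAMPLE_BPF = """LHD: Partitur 1.2
SAM: 16000
LBD:
ORT: 0 guten
ORT: 1 Tag
KAN: 0 gu:t@n
KAN: 1 ta:k
MAU: 0 1600 -1 <p:>
MAU: 1601 800 0 g
"""


def parse_bpf(text):
    """Return (header, tiers): header maps key -> value string,
    tiers maps tier name -> list of raw entry strings."""
    header = {}
    tiers = defaultdict(list)
    in_body = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split at the first colon only; labels such as "<p:>" contain colons.
        key, _, rest = line.partition(":")
        key, rest = key.strip(), rest.strip()
        if not in_body:
            if key == "LBD":  # assumed marker for the end of the header
                in_body = True
            else:
                header[key] = rest
        else:
            tiers[key].append(rest)
    return header, dict(tiers)


def mau_segments(tiers, sample_rate):
    """Convert MAU entries to (start_sec, duration_sec, word_index, label),
    using the sample counts exactly as stored in the file."""
    segments = []
    for entry in tiers.get("MAU", []):
        start, duration, word_index, label = entry.split(None, 3)
        segments.append(
            (int(start) / sample_rate, int(duration) / sample_rate,
             int(word_index), label)
        )
    return segments


header, tiers = parse_bpf(SAMPLE_BPF)
segments = mau_segments(tiers, int(header["SAM"]))
```

Keeping the tier entries as raw strings and interpreting only the tier of interest avoids guessing at the layout of tiers the sketch does not cover.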
In a sub-project of the German BITS project (TP8), all currently available BAS corpora are being re-validated against public guidelines. The results of this re-validation will be published on the BITS web server.
Within the CLARIN initiative, these validation guidelines must be followed before a corpus is published in the BAS CLARIN repository.
File Formats and Software
Most of the speech corpora distributed by BAS contain signal files in NIST SPHERE format. Some corpora contain SAM or PhonDat files.
A description of the formats used in BAS corpora can be found here.
All formats are also described in detail in the corpus documentation that accompanies each CD-ROM (most of this documentation is available online via the web page of the respective corpus).
Finally, each BAS CD-ROM contains a small collection of software and ANSI C functions for accessing the signal files, as well as tools for converting the PhonDat format into NIST SPHERE format, SAM format or raw sound files.
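For readers without the CD-ROM tools at hand, the following Python sketch shows one way a NIST SPHERE header might be read and the raw samples extracted. It is not the BAS software and makes several assumptions: the file starts with the "NIST_1A" magic line followed by the header size, header fields have the form "name -type value" with types -i, -r and -sN, the header ends at "end_head", and the samples are uncompressed PCM. The function names are hypothetical, and compressed files (for example shorten-coded ones) are rejected rather than decoded.

```python
def parse_sphere_header(data):
    """Return (header_size, fields) for the bytes of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("not a NIST SPHERE file")
    # The second 8-byte line holds the total header size in bytes.
    header_size = int(data[8:16].decode("ascii").strip())
    fields = {}
    text = data[16:header_size].decode("ascii", errors="replace")
    for raw in text.splitlines():
        line = raw.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        name, type_flag, value = line.split(None, 2)
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        elif type_flag.startswith("-s"):
            # "-sN" declares a string of N characters.
            fields[name] = value[: int(type_flag[2:])]
        else:
            raise ValueError("unknown field type: " + type_flag)
    return header_size, fields


def raw_samples(data):
    """Return the sample bytes after the header (uncompressed PCM only)."""
    header_size, fields = parse_sphere_header(data)
    coding = fields.get("sample_coding", "pcm")
    if coding != "pcm":
        raise ValueError("unsupported sample coding: " + coding)
    return data[header_size:]
```

A typical use would read the file in binary mode, for instance `data = open(path, "rb").read()`, and pass the result to `raw_samples`; the byte order of the samples is then given by the `sample_byte_format` field, if present.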