
Documentation

The documentation of a speech corpus summarizes all relevant information regarding the production and the intended usage of the corpus. It does not contain meta data or any kind of symbolic data directly related to the speech signals (annotations). The documentation consists of descriptive text (preferably in English), figures and optionally pictures.

However, the distinction between documentation and meta data in the above definition is often fuzzy. In many speech corpora, data that are essentially meta data can be found in the corpus documentation, and for a simple reason: in most cases these 'meta data' are constant for the entire speech corpus and are therefore not listed in every speaker profile or recording protocol. Furthermore, some authors define meta data much more broadly than is usual in practice. For instance, they also include parameters that describe the speech corpus as a whole (usually given in the corpus specifications), such as the number of speakers, the number of recorded items, technical specifications etc. Including all these data in the meta data set makes sense, but only if there are standardized ways to access them. Since such techniques are only now emerging, in this cookbook we follow the traditional way and include these data under the label 'documentation'.

We have mentioned documentation twice earlier in this cookbook. In the chapter Corpus Specification we listed it as a possible item to be specified beforehand, mainly in larger projects with many producing partners. In the chapter Collection we gave a few hints about what relevant information to log during the collection process, and how. In this chapter we merely give an overview of what we deem to be essential parts of any speech corpus documentation. As usual, this is not an exhaustive listing because we cannot foresee the special needs of future speech corpus productions.

In summary the corpus documentation consists of the following parts:

Introduction (= executive summary)
Copyrights and disclaimers
Version number and date
List of documentation files
Corpus description
- Numbers (speakers, recordings etc.)
- Structure
- Contents
- Terminology (file naming)
- Technical specifications of signal files
- Other parts of the corpus: dictionary, translations etc.
Recruitment
- Speaker profiles
- Recruitment technique
- Legal aspects
Recording
- Setup
- Script
- Technique
- Log file
Post-processing
Annotation
- Contents
- Procedure
- File formats
Meta data
- Contents
- File formats
Spoken texts, prompt files
Original corpus specification
Validation reports
Publications, internal reports
Comments
Corpus history
Known errors

In the following sections we will describe each of the items listed above.
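The list of parts can also serve as a simple completeness check while the documentation is being written. The following is only a minimal sketch, not part of the cookbook's own tooling: the names DOCUMENTATION_PARTS and missing_parts are hypothetical, and the sketch assumes that the parts already covered are tracked by their top-level titles exactly as given in the list.

```python
# Hypothetical outline: top-level documentation parts and their sub-items,
# transcribed from the list in this chapter.
DOCUMENTATION_PARTS = {
    "Introduction": [],
    "Copyrights and disclaimers": [],
    "Version number and date": [],
    "List of documentation files": [],
    "Corpus description": [
        "Numbers", "Structure", "Contents", "Terminology",
        "Technical specifications of signal files",
        "Other parts of the corpus",
    ],
    "Recruitment": ["Speaker profiles", "Recruitment technique", "Legal aspects"],
    "Recording": ["Setup", "Script", "Technique", "Log file"],
    "Post-processing": [],
    "Annotation": ["Contents", "Procedure", "File formats"],
    "Meta data": ["Contents", "File formats"],
    "Spoken texts, prompt files": [],
    "Original corpus specification": [],
    "Validation reports": [],
    "Publications, internal reports": [],
    "Comments": [],
    "Corpus history": [],
    "Known errors": [],
}


def missing_parts(covered):
    """Return the top-level parts not yet covered, in outline order.

    `covered` is any iterable of part titles; comparison ignores case
    and surrounding whitespace.
    """
    covered_normalized = {title.strip().lower() for title in covered}
    return [
        part for part in DOCUMENTATION_PARTS
        if part.lower() not in covered_normalized
    ]


if __name__ == "__main__":
    # Example: a draft that so far contains only three of the parts.
    draft = ["Introduction", "Recording", "Known errors"]
    for part in missing_parts(draft):
        print("still missing:", part)
```

Keeping the outline in a single structure means that the check and the table of contents cannot drift apart; the sub-items are stored but deliberately not checked here, since which of them apply depends on the individual corpus.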



BITS Projekt-Account 2004-06-01