The documentation of a speech corpus summarizes all relevant information regarding the production and the intended usage of the corpus. It does not contain meta data or any kind of symbolic data directly related to the speech signals (annotations). The documentation consists of descriptive text (preferably in English), figures and optionally pictures.
However, the distinction between documentation and meta data in the above definition is often fuzzy. In many speech corpora, data that are essentially meta data can be found in the corpus documentation, and for a simple reason: in most cases these 'meta data' are constant for the entire speech corpus and are therefore not listed in every speaker profile or recording protocol. Furthermore, some authors define meta data much more broadly than is usual in practice. For instance, they also include parameters that describe the speech corpus as a whole (usually given in the corpus specifications), such as the number of speakers, the number of recorded items, technical specifications etc. Including all these data in the meta data set makes sense, but only if there are standardized ways to access them. Since such techniques are only now emerging, in this cookbook we follow the traditional way and include these data under the label 'documentation'.
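To illustrate the point about constant values, the following minimal Python sketch shows one possible way to state corpus-wide constants once and combine them with per-recording meta data on access. All names and values (CORPUS_CONSTANTS, RECORDINGS, full_meta_data, the sampling rate, the recording and speaker IDs) are hypothetical and not taken from any particular corpus or standard.

```python
# Hypothetical corpus-wide constants: stated once (traditionally in the
# documentation) instead of being repeated in every recording protocol.
CORPUS_CONSTANTS = {
    "sampling_rate_hz": 16000,
    "microphone": "headset",
    "number_of_speakers": 2,
}

# Hypothetical per-recording meta data: only values that vary are listed.
RECORDINGS = {
    "rec001": {"speaker_id": "spk01"},
    "rec002": {"speaker_id": "spk02"},
}


def full_meta_data(recording_id):
    """Return the corpus constants merged with one recording's own entries."""
    merged = dict(CORPUS_CONSTANTS)           # copy, so the constants stay untouched
    merged.update(RECORDINGS[recording_id])   # add the recording-specific values
    return merged


if __name__ == "__main__":
    for rec_id in sorted(RECORDINGS):
        print(rec_id, full_meta_data(rec_id))
```

In such a layout the constants live in exactly one place, which mirrors why they traditionally end up in the documentation rather than in each recording protocol.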
We have mentioned documentation twice earlier in this cookbook. In
the chapter Corpus Specification we listed it as a possible item to be
specified beforehand, mainly in larger projects with many producing
partners. In the chapter Collection we gave a few hints about what
relevant information to log during the collection process, and how.
In this chapter we merely give an overview of what we deem to be
the essential parts of any speech corpus documentation. As usual, this
will not be an exhaustive listing, because we cannot foresee the special
needs of future speech corpus productions.
In summary the corpus documentation consists of the following parts: