Unfortunately, many speech corpora in the past were produced under a different paradigm: in most cases the goal was to produce data for a given application as quickly and as inexpensively as possible. No emphasis was placed on meta data; if it was considered at all, it was treated as part of the documentation. Consequently, meta data is often unparsable, unstructured and incomplete. Such a corpus quickly becomes virtually useless for re-use, simply because after a short while nobody is left who knows the exact properties of the corpus and the circumstances under which it was created.
Until recently no standard existed for the formal representation of meta data. The ISLE Metadata Initiative (IMDI)[3.1] project has started to define schemata and principles for the representation of meta data. The aim is to use meta data browsers to search online for relevant data in a distributed catalogue of speech and language resources.[3.2] Since it is hoped that ongoing and planned projects (e.g. the planned EU project INTERA) will describe more and more speech resources according to the IMDI standard, so that they can be browsed over the Internet, it is probably a good idea to include carefully designed meta data files in a speech corpus.[3.3][3.4]
In general, the term meta data covers many types of information about the broader category of language resources, of which speech corpora are only a sub-category. In the context of speech corpora, however, meta data can be restricted to three main types: recording protocols, speaker profiles and diverse comments. What this means in detail is outlined in the following sections.
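To make the contrast with unparsable, unstructured meta data concrete, the following is a minimal sketch of how the three types might be kept in a machine-readable form. It is not the IMDI schema: the class names, field names and sample values are all hypothetical and chosen only for illustration, and JSON is an assumed serialization format.

```python
import json
from dataclasses import dataclass, field, asdict


# All class and field names below are hypothetical illustrations,
# not taken from IMDI or from any other metadata standard.
@dataclass
class SpeakerProfile:
    speaker_id: str
    sex: str
    year_of_birth: int


@dataclass
class RecordingProtocol:
    session_id: str
    speaker_id: str  # refers to SpeakerProfile.speaker_id
    date: str  # ISO 8601 date, e.g. "2002-05-17"
    sampling_rate_hz: int


@dataclass
class CorpusMetaData:
    recordings: list = field(default_factory=list)
    speakers: list = field(default_factory=list)
    comments: list = field(default_factory=list)  # free-text remarks

    def to_json(self) -> str:
        # asdict() converts nested dataclasses (also inside lists) to dicts.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def example_meta_data() -> CorpusMetaData:
    """Build a tiny record with invented sample values."""
    meta = CorpusMetaData()
    meta.speakers.append(SpeakerProfile("spk001", "female", 1975))
    meta.recordings.append(
        RecordingProtocol("ses001", "spk001", "2002-05-17", 16000)
    )
    meta.comments.append("Background noise in session ses001.")
    return meta


if __name__ == "__main__":
    print(example_meta_data().to_json())
```

Stored this way, the meta data can be read back by a program (for example with `json.loads`) long after the people who made the recordings have left the project.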