
Importance of Meta Data

New speech corpora are constantly being produced and made accessible to scientists and developers, and the diversity of speech corpora is growing quickly. As a consequence, it is becoming more and more difficult for users to decide which corpus is best suited to their work. It is usually not possible to inspect the speech data itself beforehand, because speech data constitute an expensive resource. It should, however, always be possible to access the meta data, since they are a formal description of the underlying speech corpus and in themselves of little commercial value. Meta data therefore play an important role in the planning phase and in the acquisition of speech corpora.

Unfortunately, many speech corpora in the past were produced under a different paradigm: in most cases the goal was to produce data for a given application as quickly and as inexpensively as possible. No emphasis was placed on meta data; if they were considered at all, they were treated as part of the documentation. Consequently, meta data are often unparsable, unstructured and incomplete. Such a corpus quickly becomes virtually useless in terms of re-usability, simply because after a short while nobody is left who knows the exact properties of the corpus and the circumstances under which it was created.

Until recently, no standard existed for the formal representation of meta data. The ISLE Metadata Initiative (IMDI) project has started to define schemata and principles for the representation of meta data. The aim is to use meta data browsers to search online for relevant data in a distributed catalogue of speech and language resources. Since it is hoped that ongoing projects (e.g. the planned EU project INTERA) will describe more and more speech resources according to the IMDI standard and make them browsable over the Internet, it is probably a good idea to include carefully designed meta data files in a speech corpus.

In general, the term meta data refers to many types of information about language resources, the more general category of which speech corpora are only a sub-category. In the context of speech corpora, however, meta data can be restricted to three main types: recording protocols, speaker profiles and diverse comments. The following sections outline what each of these means in detail.
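
To make the idea of structured, parsable meta data more concrete, the following is a minimal sketch of how the three types named above might be kept in one machine-readable record. It is only an illustration under stated assumptions: every class name, field name and value is hypothetical, and none of them is taken from the IMDI schemata or from an existing corpus.

```python
import json
from dataclasses import dataclass, field, asdict

# All class names, field names and values are hypothetical illustrations;
# they do not follow the IMDI schemata or any existing corpus.


@dataclass
class RecordingProtocol:
    session_id: str
    sampling_rate_hz: int
    microphone: str


@dataclass
class SpeakerProfile:
    speaker_id: str
    sex: str
    year_of_birth: int


@dataclass
class CorpusMetaData:
    corpus_name: str
    recordings: list = field(default_factory=list)
    speakers: list = field(default_factory=list)
    comments: list = field(default_factory=list)


def to_json(meta):
    """Serialize the meta data record to a JSON string."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


def find_sessions(json_text, min_rate_hz):
    """Parse the JSON text and return the ids of all sessions
    recorded at or above the given sampling rate."""
    data = json.loads(json_text)
    return [
        rec["session_id"]
        for rec in data["recordings"]
        if rec["sampling_rate_hz"] >= min_rate_hz
    ]


meta = CorpusMetaData(
    corpus_name="ExampleCorpus",
    recordings=[
        RecordingProtocol("s001", 16000, "headset"),
        RecordingProtocol("s002", 48000, "table-top"),
    ],
    speakers=[SpeakerProfile("spk01", "f", 1970)],
    comments=["Session s002: background noise from air conditioning."],
)

text = to_json(meta)
print(find_sessions(text, 44100))  # prints ['s002']
```

Because the record can be read back without the speech data, a prospective user could run a query such as `find_sessions` on the meta data alone to judge whether the corpus suits the intended work.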


BITS Projekt-Account 2004-06-01