Ongoing Documentation, Logging

Ongoing documentation, or logging, is of paramount importance for the later usability of the speech corpus. All processes of the data collection must be documented in such a way that users of the speech corpus can understand every aspect that might matter for the later use of the data.

There are basically two ways to do the logging during speech data collection: on paper or online.

Logging on paper is easy and can be performed anywhere without computer hardware. However, in most cases the written data must later be transferred into machine-readable form, which means additional costs. It is much better to log online, either with a customized editor or directly into a database system via a Web server.

Practically all modern database systems allow data to be accessed and entered via a Web interface. The advantage of this method is that data from different processes can easily be linked together. For instance, you might use the same database system to schedule your recording sessions and to enter the required meta data about recordings and speakers. Take care that the basic rules of data protection are observed. See also section (p. ).
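
As an illustration of how scheduling and meta data can be linked in one database, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and helper functions are hypothetical and not part of any prescribed schema; a real project would derive them from its own recording and speaker protocol specifications. Speakers are stored under a pseudonymous code rather than a name, as one simple nod to data protection.

```python
import sqlite3


def create_logging_db(path=":memory:"):
    """Create a small logging database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the speaker link
    conn.executescript("""
        CREATE TABLE speaker (
            speaker_code TEXT PRIMARY KEY,  -- pseudonymous code, not a name
            sex          TEXT,
            age_group    TEXT,
            dialect      TEXT
        );
        CREATE TABLE session (
            session_id   INTEGER PRIMARY KEY,
            speaker_code TEXT NOT NULL REFERENCES speaker(speaker_code),
            scheduled_at TEXT NOT NULL,     -- ISO 8601 date and time
            status       TEXT NOT NULL DEFAULT 'scheduled',
            comment      TEXT               -- free-text comment
        );
    """)
    return conn


def add_speaker(conn, code, sex, age_group, dialect):
    """Enter the speaker meta data once; sessions refer to it by code."""
    with conn:  # commits on success
        conn.execute("INSERT INTO speaker VALUES (?, ?, ?, ?)",
                     (code, sex, age_group, dialect))


def schedule_session(conn, speaker_code, scheduled_at):
    """Schedule a recording session and return its id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO session (speaker_code, scheduled_at) VALUES (?, ?)",
            (speaker_code, scheduled_at))
    return cur.lastrowid


def log_recording(conn, session_id, comment=""):
    """Mark a scheduled session as recorded and attach a comment."""
    with conn:
        conn.execute(
            "UPDATE session SET status = 'recorded', comment = ? "
            "WHERE session_id = ?", (comment, session_id))


def sessions_with_speakers(conn):
    """Join scheduling data with speaker meta data."""
    return conn.execute("""
        SELECT s.session_id, s.status, p.speaker_code, p.dialect
        FROM session AS s
        JOIN speaker AS p ON p.speaker_code = s.speaker_code
        ORDER BY s.session_id
    """).fetchall()
```

Because both tables share the speaker code, the same entry serves the scheduling process and the later meta data files; nothing has to be typed twice.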

The following list gives the obligatory data to be logged (marked with an asterisk *) and other data of possible interest that may be logged during the collection phase:

Recording Protocol *
These data form the basis for the meta data files on each individual recording session or recording procedure, which have to be included in the final speech corpus. Follow the specifications of your recording protocol (section (p. )) or refer to section (p. ) for a basic discussion of meta data.
Speaker Protocol *
These data form the basis for the meta data files on each individual speaker participating in your speech corpus production. Follow the specifications of your speaker profiles (section (p. )) or refer to section (p. ) for a basic discussion of meta data.
Both protocols, recording and speaker, should contain codes and free-text comments as discussed in section (p. ).
Comments of Speakers
Questionnaires
Statistical Data
For instance, the number of recorded words in unsupervised recordings, S/N ratios, other technical conditions, covered dialects, or other required specifications (languages, locations, sex, age groups, etc.).
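
The logged items above can be kept as simple structured records from which the statistical data are derived. The following is only a sketch under assumptions: the class names, field names and the choice of statistics are hypothetical, and your own protocol specifications decide which fields are actually required. The mean S/N value is a plain average of the logged dB figures, meant as a rough indicator only.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class SpeakerEntry:
    """One speaker protocol entry (hypothetical fields)."""
    speaker_code: str
    sex: str
    age_group: str
    dialect: str
    comment: str = ""  # free-text comment


@dataclass
class RecordingEntry:
    """One recording protocol entry (hypothetical fields)."""
    session_id: str
    speaker_code: str
    snr_db: float      # logged S/N ratio in dB
    word_count: int    # number of recorded words
    comment: str = ""  # free-text comment


def coverage(speakers):
    """Count speakers per (sex, age group) cell and per dialect."""
    return {
        "sex_age": Counter((s.sex, s.age_group) for s in speakers),
        "dialect": Counter(s.dialect for s in speakers),
    }


def recording_summary(recordings):
    """Summarise word counts and S/N ratios over all recordings."""
    return {
        "recordings": len(recordings),
        "total_words": sum(r.word_count for r in recordings),
        "mean_snr_db": (mean(r.snr_db for r in recordings)
                        if recordings else None),
    }
```

Running such a summary regularly during the collection phase shows early whether a dialect or age group is under-represented, while there is still time to recruit further speakers.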

If you are working on a large data collection with many staff members or project partners at different locations, you might also consider an automated Web information system in which interested parties can monitor the progress of the collection and react to certain developments.
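
The core of such a progress monitor can be very small. The sketch below assumes, purely for illustration, that each location has a target number of recordings and that the number of completed recordings per location is available from the log; the function names and the warning threshold are hypothetical. A Web page would simply display the formatted report.

```python
def progress_report(targets, recorded, warn_below=0.5):
    """Compare completed recordings against targets per location.

    targets and recorded map a location name to a count; locations
    without any logged recordings count as zero.
    """
    report = {}
    for location, target in targets.items():
        if target <= 0:
            raise ValueError(f"target for {location!r} must be positive")
        done = recorded.get(location, 0)
        fraction = done / target
        report[location] = {
            "done": done,
            "target": target,
            "fraction": fraction,
            "behind": fraction < warn_below,  # flag slow locations
        }
    return report


def format_report(report):
    """Render the report as plain text, one location per line."""
    lines = []
    for location in sorted(report):
        r = report[location]
        flag = "  <-- behind" if r["behind"] else ""
        lines.append(
            f"{location}: {r['done']}/{r['target']} "
            f"({r['fraction']:.0%}){flag}")
    return "\n".join(lines)
```

For example, with targets of 100 recordings in one location and 50 in another, a location that has logged only 10 of its 50 recordings is flagged so that project partners can react in time.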

BITS Projekt-Account 2004-06-01