
What to validate

Basically every item included in the specifications of the speech corpus may be subject to validation. What is validated, and what counts as an error, has to be laid down in the contract or the corpus specifications. Typically, the following parts of the speech corpus are validated:

Documentation: consistency, completeness, structure
Meta data: completeness, parsability, contents (samples)
Signal data: completeness, technical quality, acoustical quality (samples), contents (samples)
Annotation data: completeness, parsability, contents (samples)
Media: readability (on different computer platforms)
Dictionary: completeness, quality (samples)

There is also a distinction between quantitative and qualitative validation results. A quantitative result might be, for instance, that more than 5% of the signal data are clipped. Validations of this type are usually carried out automatically (sometimes referred to as `formal' validation). If the documentation contains a description of the signal file format that is unclear or inconsistent, this would be a qualitative validation result. Validations of this type require manual work and are often carried out only on a randomly selected sample from the corpus.
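An automatic check of the clipping example could be sketched as follows. This is only an illustration under stated assumptions: the signals are already decoded into lists of signed 16 bit integer samples, a file counts as clipped as soon as one sample reaches full scale, and the 5% limit applies to the share of files. None of these definitions come from the text above; the names is_clipped, clipped_share and validate_clipping are hypothetical, and the actual definitions would have to be taken from the corpus specifications.

```python
def is_clipped(samples, bits=16):
    """Return True if any sample reaches the full-scale value of signed PCM.

    Assumption: a single full-scale sample is enough to call a file clipped.
    """
    lowest = -(2 ** (bits - 1))
    highest = 2 ** (bits - 1) - 1
    return any(s <= lowest or s >= highest for s in samples)


def clipped_share(signals, bits=16):
    """Return the fraction of signals (lists of samples) that are clipped."""
    if not signals:
        return 0.0
    return sum(1 for s in signals if is_clipped(s, bits)) / len(signals)


def validate_clipping(signals, limit=0.05, bits=16):
    """Return (passed, share); passed is False if more than `limit` are clipped."""
    share = clipped_share(signals, bits)
    return share <= limit, share


if __name__ == "__main__":
    # Four toy signals; the second and third contain full-scale samples.
    toy_signals = [
        [0, 120, -300],
        [32767, 32767, 100],
        [-32768, 5, 9],
        [1, 2, 3],
    ]
    passed, share = validate_clipping(toy_signals)
    print("passed:", passed, "clipped share:", share)
```

A real validation would read the samples from the signal files in the format described in the corpus documentation, and it might require a run of several consecutive full-scale samples before reporting clipping.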

Finally, there has to be an agreement on what is treated as an error and what is deemed to be within tolerance. For instance, if the specification demands a 50/50 gender distribution throughout the corpus, a tolerance of +/- X% also has to be specified.
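The tolerance check for the gender example can be sketched in a few lines. The function name gender_within_tolerance, the labels "female" and "male", and the tolerance of 5 percentage points used in the example call are assumptions made for illustration; the actual value of X has to come from the contract or the corpus specifications.

```python
from collections import Counter


def gender_within_tolerance(genders, target=0.5, tolerance=0.05, label="female"):
    """Check whether the share of `label` lies within target +/- tolerance.

    `genders` holds one gender label per speaker.
    Returns (passed, share) so that the observed share can be reported.
    """
    if not genders:
        raise ValueError("no speakers given")
    share = Counter(genders)[label] / len(genders)
    return abs(share - target) <= tolerance, share


if __name__ == "__main__":
    # Hypothetical speaker table: 52 female and 48 male speakers.
    speakers = ["female"] * 52 + ["male"] * 48
    passed, share = gender_within_tolerance(speakers, tolerance=0.05)
    print("passed:", passed, "female share:", share)
```

Returning the observed share together with the pass/fail flag lets the measured value be reported alongside the verdict.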

To get a better idea of which parts of a speech corpus are validated with which procedures or measures, please refer to [11] or to the SpeechDat example in part .


BITS Projekt-Account 2004-06-01