What to validate
Basically every item included in the specifications of the
speech corpus may be subject to validation. What will be validated and
what will be regarded as an error has to be laid down in the contract
or in the corpus specifications. Typically, the following parts of the
speech corpus are validated:
- Documentation: consistency, completeness, structure
- Meta data: completeness, parsability, contents (samples)
- Signal data: completeness, technical quality, acoustical quality
(samples), contents (samples)
- Annotation data: completeness, parsability, contents
(samples)
- Media: readability (on different computer platforms)
- Dictionary: completeness, quality (samples)
Then there is always the distinction between qualitative and quantitative
validation results.
A quantitative result might be, for instance, that more than 5% of the
signal data are clipped. Validations of this type are usually
carried out automatically (sometimes referred to as 'formal' validation).
If the documentation contains a
description of the signal file format that is unclear or inconsistent,
this would be a qualitative validation result. Validations of
this type require manual work and are often carried out only on a
randomly selected sample from the corpus.
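As an illustration of an automatic, quantitative check, the clipping example above could be sketched as follows. This is only a minimal sketch under stated assumptions: the signals are taken to be 16-bit PCM samples already loaded as lists of integers, "clipped" is taken to mean a run of at least three consecutive full-scale samples, and the 5% limit is applied to the fraction of signal files. The function names and the `min_run` parameter are hypothetical; the actual definitions have to come from the corpus specifications.

```python
from typing import Iterable, Sequence

# Assumption: 16-bit linear PCM, so full scale is the int16 range.
INT16_MIN, INT16_MAX = -32768, 32767


def is_clipped(samples: Sequence[int], min_run: int = 3) -> bool:
    """Return True if the signal contains at least `min_run` consecutive
    samples at (or beyond) the 16-bit full-scale limits."""
    run = 0
    for s in samples:
        if s <= INT16_MIN or s >= INT16_MAX:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False


def clipped_fraction(signals: Iterable[Sequence[int]], min_run: int = 3) -> float:
    """Fraction of signals flagged as clipped (0.0 for an empty corpus)."""
    total = 0
    clipped = 0
    for samples in signals:
        total += 1
        if is_clipped(samples, min_run):
            clipped += 1
    return clipped / total if total else 0.0


def clipping_check(signals: Iterable[Sequence[int]],
                   max_fraction: float = 0.05,
                   min_run: int = 3) -> bool:
    """Return True if the corpus passes, i.e. no more than `max_fraction`
    of the signals are flagged as clipped."""
    return clipped_fraction(signals, min_run) <= max_fraction
```

A call such as `clipping_check(all_signals)` would then yield the pass/fail part of the quantitative result, while `clipped_fraction` gives the figure to be reported.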
Finally, there has to be an agreement on what is treated as an error and
what is deemed to be within tolerance. For instance, if the
specification demands a 50/50% gender distribution throughout the
corpus, a tolerance of +/- X% also has to be specified.
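The tolerance rule can be made concrete with a small sketch. It assumes that the distribution is measured as the percentage of speakers per category and that the tolerance is given in absolute percentage points; the function name `check_distribution` and the example figures are hypothetical, and the value of X must be taken from the contract or the corpus specifications.

```python
from typing import Dict, Tuple


def check_distribution(counts: Dict[str, int],
                       targets_pct: Dict[str, float],
                       tolerance_pct: float) -> Dict[str, Tuple[float, bool]]:
    """For each category, return (actual percentage, within tolerance?).

    counts        -- number of speakers observed per category
    targets_pct   -- specified percentage per category, e.g. 50.0
    tolerance_pct -- allowed deviation in percentage points (the X above)
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no speakers counted")
    result = {}
    for category, target in targets_pct.items():
        actual = 100.0 * counts.get(category, 0) / total
        result[category] = (actual, abs(actual - target) <= tolerance_pct)
    return result


# Hypothetical example: 52 female and 48 male speakers, target 50/50.
report = check_distribution({"female": 52, "male": 48},
                            {"female": 50.0, "male": 50.0},
                            tolerance_pct=5.0)
```

With these made-up figures both categories deviate by 2 percentage points, so they would count as within tolerance for X = 5 but as errors for X = 1.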
To get a better idea of which parts of a speech corpus are validated
with which procedures or measures, please refer to
[11] or to the SpeechDat example in part
.
BITS Projekt-Account
2004-06-01