Reference and Validation
Identify the Reference
Check the Reference for completeness
Define the Validation Check List
- What is validated formally and to what extent?
- What is validated manually (percentage) and to what extent?
- What is expected in the validation report?
- Time schedule?
Set up a Validation Contract
Documentation
Check the provided documentation files for the following items:
Administrative Information
Contact for requests regarding the corpus
Number and type of media
Content of each medium
Copyright statement and intellectual property rights (IPR)
Validation date(s)
Validation person(s)/institution(s)
Technical Information
Layout of media: file system type and directory structure
File nomenclature: explanation of codes used; no white space in file names
Formats of signal and annotation files: if non-standard formats are used, a full description and tools to convert them into a standard format are required
Coding: PCM linear, Mu-Law or A-Law; if other codings must be used, they must be fully described
Compression: only widely supported compressions (e.g. zip, gzip) should be used
Sampling rate: rates other than 8000, 11025, 16000, 22050, 32000, 44100 and 48000 Hz should be reported
Valid bits per sample: values other than 8, 16 and 24 should be reported
Used bytes per sample
Multiplexed signals: exact de-multiplexing algorithm; tools
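The file name and signal parameter items above lend themselves to a small script. The following is a minimal sketch, assuming the signal files are PCM WAV files readable by Python's standard `wave` module; the function names are hypothetical and not part of any validation toolkit. Note that `wave` reports the used bytes per sample, so the number of valid bits may be smaller than the value derived here.

```python
import os
import re
import wave

# Values taken from the checklist above; anything else should be reported.
STANDARD_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000}
STANDARD_BITS = {8, 16, 24}


def check_file_name(name):
    """Return a list of findings for one file name (empty if none)."""
    findings = []
    if re.search(r"\s", name):
        findings.append(f"{name}: white space in file name")
    return findings


def check_signal_parameters(name, sampling_rate, bits_per_sample):
    """Return findings for a non-standard sampling rate or sample size."""
    findings = []
    if sampling_rate not in STANDARD_RATES:
        findings.append(f"{name}: non-standard sampling rate {sampling_rate} Hz")
    if bits_per_sample not in STANDARD_BITS:
        findings.append(f"{name}: non-standard sample size {bits_per_sample} bits")
    return findings


def check_wav_file(path):
    """Read a PCM WAV header and check file name and signal parameters."""
    name = os.path.basename(path)
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        # getsampwidth() returns the used bytes per sample (assumption:
        # all bits of the container are valid).
        bits = 8 * wav.getsampwidth()
    return check_file_name(name) + check_signal_parameters(name, rate, bits)
```

An empty result means nothing has to be reported for that file; non-empty results can go into the error listings of the validation report.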
Database Contents
Clearly stated purpose of the recordings
Speech type(s): multi-party conversations, human-human dialogues, human-machine dialogues, read sentences, connected and/or isolated digits, isolated words etc.
Instruction to speakers (full copy)
Linguistic Contents of Prompted Speech
Specification of the individual text items
Specification for the prompt sheet design or
Specification of the design of the speech prompts
Example prompt sheet or
Example sound file from the speech prompting
Linguistic Contents of Non-Prompted Speech
Multi-party conversations: number of speakers, topics discussed, type of setting (formal/informal)
Human-human dialogues: type of dialogue (problem solving, information seeking, chat etc.), relation between speakers, topic(s) discussed, type of setting, scenarios
Human-machine dialogues: domain(s), topic(s), dialogue strategy followed by the machine (system driven, mixed initiative), type of system (test, operational service, Wizard-of-Oz)
Speaker Information
Speaker recruitment strategies
Number of speakers
Distribution of speakers over sex, age, dialect regions
Description/definition of dialect regions
Recording platform and recording conditions
Recording platform
Position and type of microphone(s)
Company name and type id
Electret, dynamic, condenser
Directional properties
Mounting
Position of speaker(s) (distance to microphone)
Bandwidth (if other than zero to half of sampling rate)
Number of channels and channel separation
Acoustical environment
... plus for telephone recordings
Recording hardware, telephone link (analog, digital)
Network from where the call originated
Type of handset
... plus for recording in the automobile environment
Recording hardware
Type of vehicle
Average speed of vehicle
Status of windows (open/closed)
Type of pavement
Audio equipment playing during the recording
Annotation (for each of the contained annotations)
Unambiguous spelling standard used in annotations
Labeling symbols
List of non-standard spellings (dialectal variation, names etc.)
Distinction of homographs which are not homophones
Character set used in annotations
Any other language-dependent information (such as abbreviations etc.)
Annotation manual, guidelines, instructions
Description of quality assurance procedures
Selection of annotators
Training of annotators
Annotation tools used
Lexicon
Format
Text-to-phoneme procedure
Explanation or reference to the phoneme set
Phonological or higher order phenomena accounted for in the phonemic transcriptions
Statistical Information
Frequency of sub-word units: phonemes (diphones, triphones, syllables, ...)
Word frequency table
Others
Any other essential language-dependent information or convention
Indication of how many files were double-checked by the producer together with percentage of detected errors
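The word frequency table and the producer's double-check figure can be derived mechanically. The following is a minimal sketch, assuming orthographic transcriptions are available as plain strings and that words are separated by white space; labeling symbols such as noise markers would need corpus-specific handling that is not shown here. Both function names are hypothetical.

```python
from collections import Counter


def word_frequency_table(transcriptions):
    """Count white-space separated tokens over all transcriptions."""
    counts = Counter()
    for line in transcriptions:
        counts.update(line.split())
    # Most frequent first; ties are ordered alphabetically for a stable listing.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def error_percentage(files_checked, files_with_errors):
    """Percentage of double-checked files in which errors were detected."""
    if files_checked <= 0:
        raise ValueError("files_checked must be positive")
    return 100.0 * files_with_errors / files_checked
```

For example, `word_frequency_table(["one two two", "two three"])` lists `two` with a count of 3 first, followed by `one` and `three` with a count of 1 each.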
Automatic Validation
Media
Completeness
File naming
Readability, Empty Files
Signal Format
Annotation, Meta data, Lexicon Format
Parse Annotation for Content
Character codes
Cross Checks on Meta Data
Cross Checks on Summary Listings
Tools, Software
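Two of the automatic checks listed above, empty files and completeness, can be sketched as follows. This is an illustration only: it assumes that every signal file is expected to have an annotation file with the same base name in the same directory, and the extensions `.wav` and `.txt` as well as the function names are hypothetical placeholders for the conventions of the corpus at hand.

```python
import os


def find_empty_files(root):
    """Return the relative paths of all zero-length files below root."""
    empty = []
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(directory, name)
            if os.path.getsize(path) == 0:
                empty.append(os.path.relpath(path, root))
    return sorted(empty)


def find_unpaired_files(root, signal_ext=".wav", annotation_ext=".txt"):
    """Return (signals without annotation, annotations without signal).

    Files are identified by their path relative to root, without extension.
    """
    signals, annotations = set(), set()
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            stem, ext = os.path.splitext(os.path.join(directory, name))
            stem = os.path.relpath(stem, root)
            if ext.lower() == signal_ext:
                signals.add(stem)
            elif ext.lower() == annotation_ext:
                annotations.add(stem)
    return sorted(signals - annotations), sorted(annotations - signals)
```

Both functions only report findings; whether an empty or unpaired file counts as an error depends on what the documentation and the validation contract specify.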
Manual Validation
Organize the validator group and training
Define the validation contents
Select the data sample
Decide on the validation method
- One-pass Check
- Multi-pass Check
- One-pass Re-annotation
- Multi-pass Re-annotation
Select and test the validation tools
Organize Logistics
- Checking method: create copies with special markers
- Recruit only native speakers of same expertise level
- Define error schemes
- Database, server / client architecture
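Selecting the data sample for manual validation should be reproducible, so that the same items can be drawn again when the validation report is questioned. The following is a minimal sketch, assuming simple random sampling over a list of item identifiers; a real validation may call for stratified sampling (for example by speaker or item type), which is not shown. The function name, the `fraction` parameter and the `seed` parameter are hypothetical.

```python
import random


def select_validation_sample(items, fraction, seed=0):
    """Draw a reproducible random sample of the given fraction of items."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    items = list(items)
    if not items:
        return []
    # At least one item is always selected.
    size = max(1, round(fraction * len(items)))
    # A fixed seed makes it possible to re-draw exactly the same sample.
    rng = random.Random(seed)
    return sorted(rng.sample(items, size))
```

Recording the seed and the fraction in the validation report is one way to document how the sample was selected.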
Validation Report
Executive summary, overall result (one sentence)
List of all checks, results, methodology (error listings in appendix)
List of the used tools and programs
Manual validation techniques, selection, statistics
Other relevant observations
Comments