Authors | Karl Weilhammer |
Affiliation | BAS Bayerisches Archiv für Sprachsignale, Institut für Phonetik, Universität München |
Postal address | Schellingstr. 3, D-80799 München |
E-mail | schiel@phonetik.uni-muenchen.de, bas@phonetik.uni-muenchen.de |
Telephone | +49-89-2180-2758 |
Fax | +49-89-2800362 |
Corpus Version | 1.0 |
Date | 22.12.2003 |
Status | pre-final |
Comment | Validation of pre-final corpus |
Validation Guidelines | Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus "Regional Variants of German - Junior" (RVG-J, Bavaria) is in good order; however, some details should be improved in order to obtain a sound database.
The RVG-J Corpus (Regional Variants of German - Junior) was recorded in 2001 at the Institute of Phonetics and Speech Communication at the University of Munich, Germany.
The corpus contains both read and non-scripted German utterances. It comprises the original RVG prompts (telephone numbers, sentences, commands, digits, etc.) plus spellings, date and time expressions, and free form responses to questions, e.g. "What are you wearing?", "How did you get here?", etc.
The speakers were adolescents between 13 and 20 years of age, recruited in public schools in Munich and the suburbs. More than 95% of the speakers have German as their native language, and almost all of them attended school in Bavaria; 89 of them were male and 93 female. Speakers younger than 18 years were required to provide a waiver signed by their parents stating that they were allowed to participate in the recordings. The corpus can be used for the training of speech recognizers or for analyses of adolescent speech.
This document summarizes the results of an in-house validation of the speech corpus "Regional Variants of German - Junior".
The General Documentation directory contains the following files. They can be found under: doc/...
HANDBOOK.PDF: | Explanation of transliteration procedure. |
ISO8859-1.PDF: | Character set |
PROMPTS: | Directory with files containing the prompt texts. |
SAMPA.TXT: | Transcription symbols used in RVG-J. |
SAMPSTAT.TXT: | Basic signal statistics for every recording file. |
SUMMARY.TXT: | Automatically generated SpeechDat-conformant summary of recordings. |
Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok.
Number and type of media: ok.
Content of each medium: acceptable
Copyright statement and intellectual property rights (IPR): No Copyright-file in the directory, as explained in the documentation, but a Copyright statement is in the header of the README-file. Already repaired, ok.
Technical information:
Layout of media: Information about file system type and directory structure: Incomplete, already repaired, ok.
File nomenclature: Explanation of used codes (no white space in file names!): ok.
Formats of signals and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format: ok.
Coding: (PCM linear, Mu-Law or A-Law; if others, then fully described) ok.
Compression: Only widely supported compression formats such as zip or gzip should be used: n. a.
Sampling rate: (other than 8000, 11025, 16000, 22050, 32000, 44100, 48000 should be reported): 22050, ok.
Valid bits per sample: (others than 8, 16, 24, should be reported): 16, ok.
Used bytes per sample: 2, ok.
Multiplexed signals: (exact de-multiplexing algorithm; tools) Stereo WAV; not ideal, two separate files would be better; acceptable.
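The header values reported above (22050 Hz sampling rate, 2 bytes per sample, two channels) can be re-checked automatically. The following is a minimal Python sketch under stated assumptions, not the procedure used for this validation: it assumes plain PCM WAV files readable by Python's standard `wave` module, and the names `EXPECTED`, `read_wav_format` and `format_deviations` are illustrative only.

```python
import wave

# Values reported in this validation for RVG-J (assumed targets;
# adjust them for other corpora).
EXPECTED = {"sample_rate": 22050, "bytes_per_sample": 2, "channels": 2}


def read_wav_format(path):
    """Return the basic format parameters stored in a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "bytes_per_sample": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def format_deviations(path, expected=EXPECTED):
    """Return {parameter: (found, expected)} for every mismatch."""
    found = read_wav_format(path)
    return {
        key: (found[key], want)
        for key, want in expected.items()
        if found[key] != want
    }
```

An empty result from `format_deviations` means the file matches the reported format; any other result names the deviating parameter together with the value found and the value expected.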
Database contents:
Clearly stated purpose of the recordings: Only implicitly given in the title "Regional Variants of German". Already repaired, ok.
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok.
Instruction to speakers in full copy: ok.
Linguistic contents of prompted speech:
Specifications of the individual text items: Only prompts are given, exact (German) wording of questions is missing (B1-B3, C1-C6, D1-D3, L1-L9, O1-O9, T1-T3, X1-X3, Y1), acceptable.
Specification for the prompt sheet design or specification of the design of the speech prompts: Computer prompts, ok.
Example prompt sheet or example sound file from the speech prompting: n. a.
Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) n. a.
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n. a.
Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n. a.
Speaker information:
Speaker recruitment strategies: ok.
Number of speakers: ok.
Distribution of speakers over sex, age, dialect regions:
SEX: almost uniform distribution.
AGE: not uniformly distributed; ages 13-15 are highly frequent (~50), ages 11, 12 and 16-19 much less so (<14).
ACC: the great majority is from BY (what do the abbreviations BE, BW, BY, NN and NW stand for?).
Description/definition of dialect regions: No description found; presumably the German federal states (Bundesländer). Not ok.
Recording platform and recording conditions:
Recording platform: ok.
Position and type of microphones:
- Company name and type id: Beyerdynamic MCE10, Beyerdynamic NEM192, ok.
- Electret, dynamic, condenser: not given, not ok.
- Directional properties: not given, not ok.
- Mounting: collar of the speaker and headset, ok.
Position of speakers: (distance to microphone) Implicitly given, ok.
Bandwidth: (if other than zero to half of sampling rate) Not given, not ok.
Number of channels and channel separation: 2 channels, stereo WAV-files, ok.
Acoustical environment: ok.
Annotation (Orthographic transcription):
Unambiguous spelling standard used in annotations: ok.
Labeling symbols: n. a.
List of non-standard spellings (dialectal variation, names etc.): Dialectal variants are not given. Words that have been altered by reading errors are marked with a "*". These words do not occur in the lexicon LEXICON.TBL. It is not clear if the lexicon is based on the prompts or on the transcription. Nevertheless, ok.
Distinction of homographs which are no homophones: Not given.
Character set used in annotations: Table with character set in file ISO88591.PDF. No mapping from symbol to number in this diagram. ok.
Any other language-dependent information such as abbreviations etc.: Explanations in HANDBOOK.PDF, ok.
Annotation manual, guidelines, instructions: Format error in HANDBOOK.PDF: the top frame of the table on page 6 has been printed on page 5.
Description of quality assurance procedures: Either no quality assurance procedures were applied or they are not documented. Acceptable.
Selection of annotators: Not given, not ok!
Training of annotators: Not given, not ok!
Annotation tools used: ok.
Lexicon:
Only orthographic forms but no canonical forms could be found in the lexicon.
Format: ok.
Text-to-phoneme procedure: ok.
Explanation or reference to the phoneme set: ok.
Phonological or higher order phenomena accounted for in the phonemic transcriptions: ok.
Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): The file SAMPSTAT.TXT contains basic signal statistics for every recording file, but I find it very difficult to understand what kind of statistics it describes. No sub-word unit statistics could be found. Not ok.
Word frequency table: ok.
Others:
Any other essential language-dependent information or convention: n. a.
Indication of how many files were double-checked by the producer together with percentage of detected errors: Not given, not ok.
Status of documentation: acceptable.
The following list contains all validation steps with the methodology and results.
Completeness of signal, annotation and meta data files: ok
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: ok
- Annotation files: The signal byte order should be "lohi", but the SAM files state "HILO". Corrected.
Signal Format: ok
Meta information in signal files, meta files and annotation files: ok
Cross checks of summary listings: ok
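Two of the automatic checks listed above, the search for empty files and the rule that file names must not contain white space, can be sketched in a few lines. This is a hedged Python illustration and not the tool used for this validation; the function name `scan_corpus` is an assumption, and the corpus root directory is supplied by the caller.

```python
import os


def scan_corpus(root):
    """Walk a corpus tree and collect empty files and file names
    that contain white space. Returns two sorted lists of paths."""
    empty, bad_names = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # A file of size zero counts as empty.
            if os.path.getsize(path) == 0:
                empty.append(path)
            # Any white space character in the file name is reported.
            if any(ch.isspace() for ch in name):
                bad_names.append(path)
    return sorted(empty), sorted(bad_names)
```

Both lists being empty corresponds to the results "Empty files: none" and "Correctness of file names: ok", as far as the white space rule is concerned. Corpus-specific naming codes would need additional checks.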
Manual Validation was not carried out.
Prompts and annotations contain similar information. It would therefore be consistent to place the PROMPT/ directory on the same directory level as the ANNOT/ directory.
The channels of the recording files should be split into two separate files. The WAV header should be replaced by a NIST header.
The lexicon should be completed (canonical pronunciation is missing).
The format error on page 5/6 in the HANDBOOK.PDF file should be corrected.
The prompt directory should be moved out of the DOC-directory to the root level.
It might be a good idea to put the general README file into the DOC-directory and to provide a short README at the root level that briefly explains the file structure.
The abbreviations BE, BW, BY, NN and NW should be specified in the documentation.
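The recommendation to split the two recording channels into separate files could be carried out roughly as follows. This is a minimal Python sketch, assuming interleaved two-channel PCM WAV input readable by the standard `wave` module; the function name `split_stereo` and its parameters are illustrative. The sketch writes mono WAV files and does not perform the recommended conversion to NIST headers, which would need a separate tool.

```python
import wave


def split_stereo(src, left_dst, right_dst):
    """De-interleave a two-channel WAV file into two mono WAV files."""
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 2:
            raise ValueError("expected a two-channel file")
        width = wav.getsampwidth()   # bytes per sample, e.g. 2
        rate = wav.getframerate()    # e.g. 22050
        frames = wav.readframes(wav.getnframes())

    # Each frame holds one sample of channel 1 followed by one
    # sample of channel 2.
    frame_size = 2 * width
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for dst, data in ((left_dst, left), (right_dst, right)):
        with wave.open(dst, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```

The sample width and sampling rate are copied unchanged from the source file, so only the channel layout differs between input and output.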
The corpus is ok, although some details should be fixed before publication.