Authors | Florian Schiel, Tania Ellbogen, Karl Weilhammer |
Affiliation | BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München |
Postal address | Schellingstr. 3 D 80799 München |
Email | bas@phonetik.uni-muenchen.de |
|
Telephone | +49-89-2180-2758 |
Fax | +49-89-2800362 |
Corpus Version | 1.0 |
Date | 18.06.2003 |
Status | final |
Comment | |
Validation Guidelines | Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus Strange Corpus 1 Accents (SC1) has been validated against general principles of good practice. The validation covered completeness and formal checks.
This document summarizes the results of an in-house validation of the speech corpus SC1 (Version 1.0), which was released in 1995. The corpus was recorded in the years 1979 and 1991 at the University of Munich in Germany. The language of the corpus is German. 72 of the 88 speakers were not born in Germany and were educated in various countries. The corpus is intended for automatic accent detection, for testing the robustness of automatic speech recognition against different accents, and for the scientific investigation of accents in German. The corpus contains read speech of 88 different speakers. Each speaker read the German text "Nordwind und Sonne".
The General Documentation directory contains the following documentation files for the SC1 corpus, which can be found under: doc/...
README: | documentation, |
phondat.doc: | PhonDat Data Format - Description, |
sc1_ort.txt: | orthography of the corpus, |
sc1_spk.txt: | speaker information, |
seg_conv.txt: | conventions for transcription and segmentation |
Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok.
Number and type of media: 1 volume on CDROM, total size 169 MB. ok.
Content of each medium: 88 different speakers of the same German text ('Nordwind und Sonne'). 16 of these speakers are native Germans (L1, corpus T); the other 72 were born and educated in other countries (L2, corpus C). acceptable.
Copyright statement and intellectual property rights (IPR): ok.
Technical information:
Layout of media:
Information about file system type and directory structure:
The volumes are stored on CDROM with High Sierra File System
(HSFS) or ISO 9660 format, which can be used on all common
platforms. ok.
File nomenclature:
Explanation of used codes (no white space in file names!):
Two different name formats are used for sub-corpus T and sub-corpus C. ok.
<file id>.<ext>
file id: | see file sc1_spk.txt for reference |
ext: | 16 = Phondat 2 signal file |
s1 = Segmental information |
The length of the prefix is always equal to or less than 8. ok.
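The naming rules above (no white space, a file id of at most 8 characters, and one of the two listed extensions) can be checked mechanically. The following is a minimal sketch, not the tool used for this validation; the function name and the example file names are hypothetical.

```python
import re

# Extensions listed in this report: "16" (PhonDat 2 signal file)
# and "s1" (segmental information).
KNOWN_EXTENSIONS = {"16", "s1"}
MAX_ID_LENGTH = 8  # "equal to or less than 8" as stated above


def check_file_name(name):
    """Return a list of problems found in one file name (empty list = ok)."""
    problems = []
    if re.search(r"\s", name):
        problems.append("contains white space")
    file_id, sep, ext = name.rpartition(".")
    if not sep:
        # No dot at all: treat the whole name as the file id.
        problems.append("missing extension")
        file_id, ext = name, ""
    if len(file_id) > MAX_ID_LENGTH:
        problems.append("file id longer than 8 characters")
    if sep and ext not in KNOWN_EXTENSIONS:
        problems.append(f"unknown extension '{ext}'")
    return problems
```

Calling `check_file_name` on every entry of the signal directory and reporting the non-empty results would reproduce the "correctness of file names" step of a formal check.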
Formats of signals and annotation files:
If non-standard formats are used, it is common to give a full description or to convert into a standard format:
The format of the signal files is PhonDat 2 as described in the doc file
phondat.doc in this directory.
The string 'orthography' contains the German orthographic
representation of the utterance. German 'Umlaute' are coded
in 7 bit ASCII (see doc file phondat.doc for
details). not acceptable
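To illustrate what a 7 bit ASCII coding of 'Umlaute' means for a user of the orthography string, the sketch below decodes one common convention. It ASSUMES the LaTeX "german" style shorthands (a double quote followed by the base letter); the coding actually used in SC1 is defined in phondat.doc and may differ, so the mapping table is an assumption and not taken from the corpus documentation.

```python
# ASSUMPTION: LaTeX "german" style shorthands ("a -> ä, "s -> ß, ...).
# The authoritative coding for SC1 is the one described in phondat.doc.
UMLAUT_MAP = {
    '"a': "ä", '"o': "ö", '"u': "ü",
    '"A': "Ä", '"O': "Ö", '"U': "Ü",
    '"s': "ß",
}


def decode_umlauts(text):
    """Replace each assumed 7 bit ASCII shorthand by the real character."""
    for code, char in UMLAUT_MAP.items():
        text = text.replace(code, char)
    return text
```

With this assumed table, `decode_umlauts('st"arker')` yields `stärker`. Text without shorthands passes through unchanged.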
Coding:
(PCM linear, Mu-Law or A-LAW, if others, then fully described)
PCM linear, as given in phondat.doc. acceptable
Compression:
Just widely supported compressions like zip or gzip should be used.
n. a.
Sampling rate: 48 kHz. The speech data were digitally filtered to 8 kHz cut-off frequency and down-sampled to 16 kHz. ok.
Valid bits per sample: 16 bit. ok.
Used bytes per sample: 2. ok.
Multiplexed signals: (exact de-multiplexing algorithm; tools)
n. a.
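The signal parameters listed above fit together arithmetically: the 8 kHz cut-off is exactly half of the 16 kHz target rate, and going from 48 kHz to 16 kHz is an integer decimation by 3. The short sketch below only spells out this arithmetic with the figures from this report; the function names are illustrative.

```python
# Figures taken from the technical information above.
ORIGINAL_RATE_HZ = 48_000
TARGET_RATE_HZ = 16_000
CUTOFF_HZ = 8_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1


def decimation_factor(original_hz, target_hz):
    """Integer factor by which the sample rate is reduced."""
    if original_hz % target_hz != 0:
        raise ValueError("target rate is not an integer divisor of the original rate")
    return original_hz // target_hz


def nyquist_hz(rate_hz):
    """Highest frequency representable at the given sample rate."""
    return rate_hz / 2


def bytes_per_second(rate_hz, bytes_per_sample, channels):
    """Raw data rate of linear PCM audio, excluding any file header."""
    return rate_hz * bytes_per_sample * channels
```

For the values above this gives a decimation factor of 3, a Nyquist frequency of 8000 Hz (matching the stated cut-off), and a raw data rate of 32000 bytes per second of speech.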
Database contents:
Clearly stated purpose of the recordings:
- automatic accent detection
- test of robustness against different accents in automatic speech recognition
- scientific investigation of accents in German
The sub-corpus of 16 German speakers may be used as a reference or training
corpus for technical evaluations. ok.
Speech type(s): Read German text ('Nordwind und Sonne'). ok.
Instruction to speakers in full copy: Just a verbal instruction "read carefully but fluently". Incomplete and erroneous. not ok.
Linguistic contents of prompted speech:
Specifications of the individual text items: The text given in sc1_ort.txt was obviously not the text that was read by the subjects. not ok.
Specification for the prompt sheet design or specification of the design of the speech prompts: not ok.
Example prompt sheet or example sound file from the speech prompting: not ok.
Linguistic contents of non-prompted speech:
Multi-party: n. a.
Human-human dialogues: n. a.
Human-machine dialogues: n. a.
Speaker information:
Speaker recruitment strategies: no information. Not ok.
Number of speakers: 88 different speakers. Acceptable.
Distribution of speakers over sex, age, dialect regions: For all speakers it is documented whether they are female or male. The age is not given. For the 16 German native speakers, no dialect is given. The other 72 were born and educated in other countries, which determines their accent. acceptable.
Description/definition of dialect regions: No information given for the 16 German native speakers. Since the purpose of this corpus is to sample different accents, not ok.
Recording platform and recording conditions:
Recording platform: Corpus T: Digitally to DAT machine. Corpus C: Analog to tape machine Telefunken M15, later digitized to DAT tape. ok.
Position and type of microphones:
Corpus T: Sennheiser Microphone MKH 20 P48. Corpus C: Neumann U-87. ok.
No further specification is given. acceptable.
Position of speakers: distance to microphone not given. Not ok.
Bandwidth: (if other than zero to half of sampling rate) Not given. Microphone types are given, therefore acceptable.
Number of channels and channel separation: 1 channel. ok.
Acoustical environment: Studio quality. ok.
Annotation orthography (level I):
The level I annotation is not really an annotation; rather, it is the prompt text that had to be read by the speakers. In the case of sub-corpus T this is identical to what was uttered by the speakers, because the recording was repeated until the text was read without errors. For sub-corpus C the level I annotation represents only the prompt text, and no orthographic transcription exists.
Unambiguous spelling standard used in annotations: Standard orthography. acceptable.
Labeling symbols: characters, German 'Umlaute' are coded in 7 bit ASCII (see phondat.doc) or latex conventions. ok.
List of non-standard spellings (dialectal variation, names etc.): not given.
Distinction of homographs which are no homophones: not given.
Character set used in annotations: ASCII, with German 'Umlaute' coded in 7 bit ASCII. acceptable.
Any other language dependent information as abbreviations etc: not given.
Annotation manual, guidelines, instructions: see files README and seg_conv.txt. ok.
Description of quality assurance procedures: Only for sub-corpus T implicitly given by three layer annotation procedure. Problematic for sub-corpus C. not ok.
Selection of annotators: not ok.
Training of annotators: not ok.
Annotation tools used: not ok.
Annotation canonical pronunciation (level II):
The level II annotation is also not really an annotation. It is a mapping of the orthographic forms (of level I) onto canonical pronunciations. It therefore has the same problems as the level I annotation. It is ok for sub-corpus T, but useless and even confusing for sub-corpus C.
Unambiguous spelling standard used in annotations: Manual conversion from orthography to canonical pronunciation. ok.
Labeling symbols: German SAMPA. ok.
List of non-standard spellings (dialectal variation, names etc.): n. a.
Distinction of homographs which are no homophones: n. a.
Character set used in annotations: ASCII. ok.
Any other language dependent information as abbreviations etc: n. a.
Annotation manual, guidelines, instructions: Mapping procedure, no guidelines given. In this case acceptable.
Description of quality assurance procedures: Only for sub-corpus T implicitly given by three layer annotation procedure. Problematic for sub-corpus C. not ok.
Selection of annotators: not ok.
Training of annotators: not ok.
Annotation tools used: not ok.
Annotation manual transcription (level III):
Manual transcriptions are only given for sub-corpus T.
Unambiguous spelling standard used in annotations: Manual labeling. ok.
Labeling symbols: German SAMPA. ok.
List of non-standard spellings (dialectal variation, names etc.): n. a.
Distinction of homographs which are no homophones: n. a.
Character set used in annotations: ASCII. ok.
Any other language dependent information as abbreviations etc: n. a.
Annotation manual, guidelines, instructions: See file seg_conv.txt. Annotation guidelines are only in German. Since the corpus is in German, ok.
Description of quality assurance procedures: Implicitly given by three layer annotation procedure. No information given about quality assurance within this layer. not ok.
Selection of annotators: Experienced students of phonetics. acceptable.
Training of annotators: No information. not ok.
Annotation tools used: No information. not ok.
Lexicon:
For such a short text a separate lexicon is not essential. Therefore the level II annotation was considered as a lexicon.
Format: ASCII. ok.
Text-to-phoneme procedure: Manually. ok.
Explanation or reference to the phoneme set: Extended German SAMPA. ok.
Phonological or higher order phenomena accounted in the phonemic transcriptions: n. a.
Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): Only reasonable for manually segmented phones (level 3), but not found. not ok.
Word frequency table: Not given, since all speakers read the same story, acceptable.
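A frequency table of sub-word units, as requested above, could be derived from the manually segmented phone labels. The sketch below counts labels that have already been extracted into a list; reading them from the segmentation files is deliberately left out, because their format is described in the corpus documentation and is not reproduced here. The function name and the example labels are illustrative only.

```python
from collections import Counter


def phone_frequencies(labels):
    """Return (label, count, relative frequency) tuples, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]
```

For example, `phone_frequencies(["n", "O", "6", "t", "v", "I", "n", "t"])` lists the two labels that occur twice before the labels that occur once, each with its share of the 8 tokens.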
Others:
Any other essential language-dependent information or convention: not given.
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given. not ok.
Status of documentation: not ok.
The following list contains all validation steps with the methodology and results.
Completeness of signal, annotation and meta data files:
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: ok.
Rough inspection of some sound files by ear.
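The check for empty files listed above is easy to automate. The following minimal sketch walks a directory tree and reports every file of size zero; it is an illustration of the step, not the script used for this validation, and the root directory is whatever path the corpus is mounted under.

```python
from pathlib import Path


def find_empty_files(root):
    """Return a sorted list of all zero-length files below root."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )
```

An empty result corresponds to the finding "Empty files: none".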
The documentation does not mention that sub-corpus C is part of the PhonDat 2 corpus. It is clear neither what was read by the speakers nor what was annotated in the level 1 annotation:
"The orthography of the spoken text can be found in the file sc1_ort.txt (ASCII, with German 'Umlaute' coded in 7 bit ASCII).
Note that some of the speakers have not read the title of the story. Please refer to the signal file headers to get the exact orthography of what was spoken.
Also note that some of the speakers has repeated single words or phrases and/or have deleted words during reading. These are NOT marked in the orthographic or canonic strings of the headers."
This sentence is not necessary in the Documentation:
"This corpus will be extended to more countries and more material in 1996."
Parts of the documentation are unclear and even contradictory. We therefore suggest reformulating some passages to obtain a clear and consistent README file. The corpus consists of two parts that are similar but not identical. This fact should be reflected in the documentation. The speakers obviously had different prompt texts; we suggest adding the other versions of the prompt text to the file sc1_ort.txt and giving at least some hints in the README file as to which text was used in which recording.
It would obviously be a great improvement to provide at least correct orthographic transcripts for sub-corpus C. Phoneme segmentations for sub-corpus C would improve the corpus tremendously. For users of the corpus who cannot read German, it would be helpful to have the segmentation manual in English.
Since the PhonDat header format is not easily accessible on a Windows system or for inexperienced users, we suggest converting the headers to the NIST format, which is easily readable with a text editor.
With an update of the documentation and after the conversion of the PhonDat files into files with NIST headers, the corpus is in acceptable condition. Part T is very valuable. Since all recordings were repeated until they were flawless, the prompt text can be regarded as an orthographic transcription, on which the phone segmentations that exist for each audio file are based.
Without proper orthographic transcriptions, Part C is only suited for users who are mainly interested in the audio signals or who only need to know roughly what was read.
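To indicate what such a NIST header looks like, the sketch below builds a plain-text NIST SPHERE header for 16 kHz, 16 bit, single-channel linear PCM, i.e. the signal parameters stated in this report. It does not read PhonDat headers. The function name is hypothetical, the byte order of the sample data (`byte_format`) is an assumption that has to be checked against phondat.doc, and padding the header with blanks is likewise an assumption of this sketch.

```python
def build_nist_header(sample_count, sample_rate=16000, channels=1,
                      bytes_per_sample=2, byte_format="01", header_size=1024):
    """Build a fixed-size, human-readable NIST SPHERE header as bytes.

    byte_format "01" means low byte first, "10" high byte first.
    ASSUMPTION: the correct value for SC1 data is not stated in this report.
    """
    fields = [
        ("channel_count", "-i", str(channels)),
        ("sample_count", "-i", str(sample_count)),
        ("sample_rate", "-i", str(sample_rate)),
        ("sample_n_bytes", "-i", str(bytes_per_sample)),
        ("sample_sig_bits", "-i", str(bytes_per_sample * 8)),
        ("sample_byte_format", f"-s{len(byte_format)}", byte_format),
        ("sample_coding", "-s3", "pcm"),
    ]
    # First line: format tag; second line: header size, right-justified in 7 columns.
    lines = ["NIST_1A", f"{header_size:>7d}"]
    lines += [f"{name} {ftype} {value}" for name, ftype, value in fields]
    lines.append("end_head")
    data = ("\n".join(lines) + "\n").encode("ascii")
    if len(data) > header_size:
        raise ValueError("header fields do not fit into the header size")
    # Pad with blanks so that the sample data starts at a fixed offset.
    return data.ljust(header_size, b" ")
```

Writing the returned bytes followed by the raw samples produces a file whose first lines can be inspected in any text editor, which is the accessibility gain the suggestion above aims at.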