Authors: Florian Schiel, Angela Baumann
Affiliation: BAS Bayerisches Archiv für Sprachsignale, Institut für Phonetik, Universität München
Postal address: Schellingstr. 3, D 80799 München
E-Mail: schiel@phonetik.uni-muenchen.de, bas@phonetik.uni-muenchen.de
Telephone: +49-89-2180-2758
Fax: +49-89-2800362
Corpus Version: 3.0
Date: 03/07/2003
Status: final
Comment: The following validation results show a lack of essential information in documentation and annotation.
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, Bastard Verlag München, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook
The speech corpus SI100 has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Missing data reduces the usability of the corpus and may cause problems when the corpus is used for other applications.
This document summarises the results of an in-house validation of the speech corpus SI100, carried out in 2003 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The recordings took place at the same institute in 1995 and were closely monitored by a supervisor in a quiet studio environment. The language of the corpus is German. The corpus contains read speech of 101 different speakers (50 female, 50 male, 1 unknown). Each speaker read approx. 100 sentences from either the SZ sub-corpus or the CeBit sub-corpus. The sub-corpus SZ contains 544 sentences from newspaper articles ("Sueddeutsche Zeitung"). The sub-corpus CeBit contains 483 sentences from newspaper articles about the CeBit 1995. Each sub-corpus is divided into 5 parts of approx. 100 utterances each. Every speaker read only one part of one sub-corpus (with some exceptions), resulting in a total of approx. 10,100 recorded utterances (31.5 h of speech).
The General Documentation directory contains the following documentation files for the SI100 corpus which can be found under: doc/
README: general documentation
SI100_id.lis: list of speakers for the total corpus
SI100_#.lis: list of speaker ids for the single volume (# = 1,...,7)
SI100_ce.txt: texts of sub-corpus CeBit
SI100_sz.txt: texts of sub-corpus SZ
SI100_wo.txt: list of spoken words
SI100.lex: pronunciation lexicon
partitur/: BAS Partitur files
pardoc/: BAS Partitur files documentation
The following required contents of the documentation have been checked:
Administrative Information:
Validating person: Henk van den Heuvel, ok.
Date of validation: 10. Sept. 2002, ok.
Contact for requests regarding the corpus: ok.
Number and type of media: 7 volumes on CDROMs. ok.
Content of each medium: ca. 10,100 recorded utterances, total size 3.62 GB. ok.
Copyright statement and intellectual property rights (IPR): ok.
Technical information:
Layout of media: information about file system type and directory structure:
CDROMs with High Sierra File System (HSFS) or ISO 9660
format. ok.
Directory structure not explained: repairable.
File nomenclature: explanation of used codes (no white space in file names!):
<speaker_id><sentence # from sub-corpus>[c].nis. ok.
Formats of signals and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format:
The signal file format is NIST SPHERE. ok.
The partitur files can be found under the subdirectory doc/partitur/. This is a little confusing. We would recommend having a partitur/ directory and a data/ directory (for the signal files) on the same level as the doc/ directory. ok.
Coding: PCM. Given only in the .nis files themselves; it would be better to state it in the README: repairable.
Compression: only widely supported compressions like zip or gzip should be used: n. a.
Sampling rate: 16 kHz. The speech data were digitally filtered to 8 kHz cutoff frequency and down-sampled to 16 kHz. ok.
Valid bits per sample: 16 bit. ok.
Used bytes per sample: 2. ok.
Multiplexed signals: n.a.
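The technical figures above can be cross-checked against the stated corpus size with simple arithmetic. The following is only a sketch: it assumes single-channel PCM and ignores the NIST headers, and the report does not say whether the 31.5 h figure was measured or derived from the file sizes.

```python
# Rough consistency check of the stated corpus figures (a sketch, not part
# of the original validation).
SAMPLE_RATE_HZ = 16_000   # stated sampling rate
BYTES_PER_SAMPLE = 2      # stated used bytes per sample
CHANNELS = 1              # assumption: one channel (multiplexing is n.a.)


def pcm_bytes(duration_s: float) -> int:
    """Size in bytes of raw PCM data of the given duration."""
    return round(duration_s * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


total_seconds = 31.5 * 3600              # 31.5 h of speech
total_bytes = pcm_bytes(total_seconds)   # raw sample data only, no headers
mean_utterance_s = total_seconds / 10_100

print(f"{total_bytes / 1e9:.2f} GB of sample data")
print(f"{mean_utterance_s:.1f} s per utterance on average")
```

Under these assumptions the sample data alone come to roughly 3.63 GB, which is in line with the 3.62 GB quoted for the media.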
Database contents:
Clearly stated purpose of the recordings: Read articles (SZ, CeBit). Clear purpose not given. Not ok.
Speech type(s): read sentences. Prompting method not given. Not ok.
Instruction to speakers in full copy: Just a verbal instruction: 'read as carefully as possible and to read all punctuations as in a dictation task'. ok.
Linguistic contents of prompted speech:
Specifications of the individual text items: The file SI100_ce.txt contains the text prompts of the sub-corpus CeBit and the file SI100_sz.txt contains the prompts from newspaper articles of the 'Süddeutsche Zeitung'. Each utterance is numbered according to the file naming convention. ok.
Specification for the prompt sheet design or specification of the design of the speech prompts: Not given. Not ok.
Example prompt sheet or example sound file from the speech prompting: not given. Not ok.
Speaker information:
Speaker recruitment strategies: No information. Not ok.
Number of speakers: 101 speakers (50f/50m/1unknown). ok.
Distribution of speakers over sex, age, dialect regions: Indirectly given in SI100_id.lis. ok.
Description/definition of dialect regions: (see SI100_id.lis). 'Place of education' given. ok.
Recording platform and recording conditions:
Recording platform: not given. Not ok.
Position and type of microphones: Sennheiser Headset HMD 410. ok.
Position of speakers: distance to mouth approx. 3-5 cm. ok.
Bandwidth of microphones: digitally filtered to 8 kHz. ok.
Number of channels and channel separation: n.a.
Acoustical environment: 'dry' acoustics ('quiet office'). ok.
Annotation, orthography:
The recordings were repeated, until each prompt text was read correctly. Therefore the prompt texts can be regarded as orthographic annotations, although annotations in the sense of transcriptions have not been prepared.
Unambiguous spelling standard used in annotations: German orthography. ok
Labelling symbols: n. a.
List of non-standard spellings: Punctuation. ok.
Character set used in annotations: The prompt texts are given in two character sets: Umlaut coding in LaTeX and the ISO 8859-1 German character set. A little confusing, but ok.
Any other language dependent information as abbreviations etc.: Full forms of abbreviations are not given. Not ok.
Annotation manual, guidelines, instructions: Implicitly given in the recording instructions. ok.
Description of quality assurance procedures: Implicitly given in the recording instructions. ok.
Selection of annotators: n. a.
Training of annotators: n. a.
Annotation tools used: n. a.
Annotation, automatic phoneme segmentation:
In this section only the automatic phoneme segmentation (MAU) will be covered. The canonical pronunciation will be addressed in the next section together with the lexicon.
Unambiguous spelling standard used in annotations: n. a.
Labelling symbols: Extended German SAMPA. ok.
List of non-standard spellings: n. a.
Character set used in annotations: ok.
Any other language dependent information as abbreviations etc.: ok.
Annotation manual, guidelines, instructions: ok.
Description of quality assurance procedures: ok.
Selection of annotators: n. a.
Training of annotators: n. a.
Annotation tools used: ok.
Lexicon and annotation of canonical pronunciation:
The lexicon and the annotation of canonical pronunciations in the KAN tier of the partitur file are closely related and will therefore be covered in one section.
Format: German SAMPA, LaTeX umlauts, two-column TAB-separated list: orthography TAB pronunciation. ok.
Text-to-phoneme procedure: not given. Not ok.
Explanation or reference to the phoneme set: Extended German SAMPA. ok.
Phonological or higher order phenomena accounted in the phonemic transcriptions: 'Primary and secondary word accent are not coded in this dictionary'. acceptable.
Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): not given. acceptable.
Word frequency table: not given. Not ok.
Others:
Any other essential language-dependent information or convention: Umlaut coding. ok.
Indication of how many files were double-checked by the producer together with percentage of detected errors: only applicable for automatic segmentation. Not ok.
Status documentation: not acceptable, but repairable.
The following list contains all validation steps with the methodology and results.
Difference between the README file and the "/doc/SI100_id.lis" file: (repairable)
Differences between the "/doc/SI100_id.lis" file and the "/doc/SI100_*.lis" files and the structure of the file system. It seems as if the speaker ids were at some point changed to 4 characters. In this effort some parts of the corpus and the documentation seem to have been forgotten: (repairable)
The following speaker ids were only partially changed:
The links to these files in the directory "SI100_total" partially point into nowhere.
Differences between the "/doc/SI100_id.lis" file concerning the spoken parts of the sub-corpora and the really spoken parts:
The documentation did not properly explain the purpose and the meaning of the following three different orthographic representations:
The following corrections were made in the headers and the partitur files but not in the files SI100_sz.txt and SI100_ce.txt :
For the following files Errors in the NIST-header were detected:
ERROR: /bmnt/BAS/SI100_1/alsc/alsc283c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/chkr/chkr480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/erai/erai480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/itrc/itrc480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/jume/jume480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_2/rten/rten480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_2/sole/sole480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_4/anwi/anwi480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_4/bawe/bawe480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_6/bija/bija480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_6/flsc/flsc480c.nis : wrong size calculation
The first file is one byte too long, but the size of the header seems correct.
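A size check of this kind can be sketched as follows. This is not the tool used in the validation; it assumes the usual NIST SPHERE layout (a first line "NIST_1A", a second line giving the header size in bytes, then "name -type value" lines up to "end_head") and the standard fields sample_count, sample_n_bytes and channel_count.

```python
import os


def read_sphere_header(path):
    """Return (header_size, fields) for a NIST SPHERE file."""
    with open(path, "rb") as f:
        if f.readline().strip() != b"NIST_1A":
            raise ValueError(f"{path}: not a NIST SPHERE file")
        header_size = int(f.readline().decode("ascii").strip())
        f.seek(0)
        # latin-1 never fails to decode and keeps Umlauts in text fields
        header = f.read(header_size).decode("latin-1")
    fields = {}
    for line in header.split("\n")[2:]:
        if line.strip() == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) == 3:
            name, ftype, value = parts
            fields[name] = int(value) if ftype == "-i" else value
    return header_size, fields


def size_mismatch(path):
    """Actual minus expected file size in bytes (0 means consistent)."""
    header_size, fields = read_sphere_header(path)
    expected = header_size + (
        fields["sample_count"]
        * fields["sample_n_bytes"]
        * fields.get("channel_count", 1)
    )
    return os.path.getsize(path) - expected
```

A return value of 1 would correspond to a file that is one byte longer than its header claims.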
The word list SI100_wo.txt (6187 lines) and the lexicon SI100.lex (6186 lines) differ in length and are sorted differently because of the Umlaut coding (LaTeX versus ISO 8859-1 German). SI100_wo.txt uses German Umlauts and includes the words "<drittens" and "a", which are not included in SI100.lex. The word "drittens" was only found in SI100.lex.
Apart from the Umlaut coding and some spaces, the ORT tier in the partitur file and the orthography tags in the NIST header are identical.
The word list SI100_wo.txt was compared with a word list created from the ORT tiers of the partitur files. Interestingly there are more words in file SI100_wo.txt than in the word list created from the annotations.
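Such list comparisons can be reproduced with a few set operations once both sources use the same Umlaut coding. The sketch below assumes that the LaTeX coding is the German-style shorthand ("a, "o, "u, "s), which the report does not spell out, and that the lexicon is the TAB-separated list described earlier; the function names are illustrative only.

```python
# Assumed LaTeX shorthand for German Umlauts; adjust if the corpus differs.
LATEX_UMLAUTS = {
    '"a': "ä", '"o': "ö", '"u': "ü",
    '"A': "Ä", '"O': "Ö", '"U': "Ü", '"s': "ß",
}


def latex_to_unicode(word):
    """Replace LaTeX Umlaut shorthands by the corresponding characters."""
    for code, char in LATEX_UMLAUTS.items():
        word = word.replace(code, char)
    return word


def load_words(lines, tab_separated=False):
    """Collect the orthographic entries of a word list or lexicon."""
    words = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        entry = line.split("\t", 1)[0] if tab_separated else line
        words.add(latex_to_unicode(entry.strip()))
    return words


def compare(word_list, lexicon):
    """Return (only in word list, only in lexicon), both sorted."""
    return sorted(word_list - lexicon), sorted(lexicon - word_list)


wo = load_words(["<drittens\n", "a\n", "für\n"])
lex = load_words(['drittens\tdrIt@ns\n', 'f"ur\tfy:6\n'], tab_separated=True)
only_wo, only_lex = compare(wo, lex)
print(only_wo, only_lex)
```

The sample entries are made up for illustration; with the real files the lines would be read from SI100_wo.txt and SI100.lex using the appropriate encoding.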
The second tier of SI100.lex, which contains the pronunciations, was compared to a list compiled from the KAN tier. Again there are more entries in SI100.lex. Some pronunciations differ. In the partitur files the sound "R" is used, while in SI100.lex only "r" occurs. It was not checked, whether s
The tiers DBN, LHD, NCH, REP, SAM, SBF, SNB and SSB of the partitur files were checked and no error was found. The tier LBD has no entry. For the files whose speaker ids were changed (bija, kipp, miol, peko, step, waba, ziul and zuen), the tiers SPN and SRC differ from the actual file and path name. It is quite surprising that some partitur files include a MAU tier (bija/bija383c.par) and some do not (e.g. adhe329.par).
In a rough comparison of the KAN tiers in some partitur files and the respective sound files (by listening) the following errors were found:
In SI100.lex, two distinct pronunciations are specified for "-" and "\-":
In the file /bmnt/BAS/SI100_1/alsc/alsc283c.nis the speaker uttered "bInd@strIC", as specified in the prompt file. In NIST-header and partitur file "-" and "g@daNk@nstrIC" are annotated, which is not correct.
We found an inconsistent policy concerning white-space characters. This is a source of errors for automatic processing. We suggest using a single white-space character between words and no spaces at the end of a line (trailing spaces were found in SI100_3.lis).
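A white-space policy of this kind is easy to check and enforce automatically. The following sketch is illustrative only; note that normalise() also turns TABs into single spaces, so it should not be applied unchanged to TAB-separated files such as the lexicon.

```python
import re


def whitespace_issues(line):
    """List the white-space problems found in one line of text."""
    text = line.rstrip("\n")
    issues = []
    if text != text.rstrip():
        issues.append("trailing whitespace")
    if re.search(r"\S {2,}\S", text):
        issues.append("multiple spaces between words")
    return issues


def normalise(line):
    """Collapse runs of white space and strip both ends of the line."""
    return " ".join(line.split())


print(whitespace_issues("alsc  1 \n"))
print(normalise("alsc  1 \n"))
```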
The corpora were divided into different parts. In the case of CeBit the boundaries for part 5 are wrong in the README file. They are not "5 = 388-483", but "5 = 378-483".
The CeBit recordings 201-273 of speaker brda are not in part 2 as indicated in the file SI100_id.lis, but in part 3.
The speaker "jore" occurs twice in the file SI100_id.lis. The only difference between the two entries is that the speaker read sentences of part 1 and of part 5. This is impractical for automatic processing that is based on the speaker id. Dividing the corpus into different parts is not really necessary. For reasons of simplicity we would suggest removing the "part" information and the second entry in the speaker-id file.
The entry orthography in the NIST-header allows German Umlauts. The evaluator is not entirely sure whether this conforms to the specifications by NIST.
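Duplicate entries in a speaker list can be detected with a short script. The sketch below assumes that the speaker id is the first white-space-separated column of each line; the actual column layout of SI100_id.lis is not described in this report, and the sample lines are invented.

```python
from collections import Counter


def duplicate_speaker_ids(lines):
    """Return the speaker ids that occur on more than one line."""
    # Assumption: the id is the first column of every non-empty line.
    ids = [line.split()[0] for line in lines if line.strip()]
    return sorted(sid for sid, count in Counter(ids).items() if count > 1)


sample = ["jore f 1", "brda m 3", "", "jore f 5"]  # hypothetical columns
print(duplicate_speaker_ids(sample))
```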
The URL "http://www.icp.grenet.fr/Relator/standnist.html", which points to the documentation of the NIST-header does not exist anymore.
The file pardoc/PARSAMPA.HTM is not an HTML file but a simple ASCII text file. The appropriate name would be PARSAMPA.TXT.
There are no recordings for sentence 7 of the SZ corpus and sentence 303 of the CeBit corpus.
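Gaps of this kind can be found from the signal file names alone. The sketch below assumes the nomenclature given in the documentation, with a four-letter lower-case speaker id followed by the sentence number and an optional "c" (as in alsc283c.nis); the caller is expected to pass the file names of one sub-corpus only, since the sub-corpus cannot be derived reliably from the name here.

```python
import re

# Assumed pattern: <speaker_id><sentence number>[c].nis
FILE_NAME = re.compile(r"^([a-z]{4})(\d+)(c?)\.nis$")


def sentence_numbers(file_names):
    """Collect the sentence numbers encoded in the given file names."""
    numbers = set()
    for name in file_names:
        match = FILE_NAME.match(name)
        if match:
            numbers.add(int(match.group(2)))
    return numbers


def missing_sentences(file_names, first, last):
    """Sentence numbers in first..last for which no recording exists."""
    present = sentence_numbers(file_names)
    return [n for n in range(first, last + 1) if n not in present]


names = ["alsc001c.nis", "alsc002c.nis", "chkr004c.nis"]  # invented sample
print(missing_sentences(names, 1, 4))
```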
The documentation of the software is incomplete.
The use of "-" and "\-" to distinguish the pronunciations "g@daNk@nstrIC" and "bInd@strIC" is quite confusing. We would suggest to use at least in the partitur files more intuitive tags like "<Bindestrich>" and "<Gedankenstrich>".
There are three orthographic representations in the corpus. It makes sense to keep the actual prompts separate from a pseudo transcription. But it is rather impractical for the correction of errors and for maintenance to have the transcription both in the NIST-header and in the partitur file. We would suggest removing one of them, probably the one in the NIST header.
The report of the previous validation is a Word document. We suggest making an HTML version of this document available as well.
The directory structure on the CDs should be modified, such that the directories containing the data files are all in one directory. This data directory, the partitur directory and the doc directory should be all on the same level.
The meaning of the SI100_WO.TXT file should be explained in the README file.
The labels contained in the NIST headers should be documented in the README file.
It looks as if the SI100 corpus has gone through many changes that have not all improved the quality. We suggest that all redundancies be removed from the corpus. This makes error correction, maintenance and documentation much easier. After the correction of the errors described in this report and after a revision of the documentation the corpus will be a valuable speech resource again.
In this evaluation the script par2ags.pl was used to test whether the partitur files were formatted according to the partitur file conventions. It might be useful for further evaluations to have a proper partitur parser at hand that tests all dependencies within the file.
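As a starting point for such a parser, the sketch below checks two simple dependencies. It is not par2ags.pl and covers only a small part of the format; it assumes that every line has the form "KEY: value" with a three-character key, that the header ends with an LBD line, and that ORT and KAN entries start with a numeric word link.

```python
import re

LINE = re.compile(r"^([A-Z][A-Z0-9]{2}):\s*(.*)$")


def check_partitur(lines):
    """Return a list of problems found in the lines of one partitur file."""
    problems = []
    links = {"ORT": set(), "KAN": set()}
    in_body = False
    for number, raw in enumerate(lines, 1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        match = LINE.match(line)
        if not match:
            problems.append(f"line {number}: not in 'KEY: value' form")
            continue
        key, value = match.groups()
        if key == "LBD":          # assumed end-of-header marker
            in_body = True
            continue
        if in_body and key in links:
            fields = value.split(None, 1)
            if not fields or not fields[0].isdigit():
                problems.append(f"line {number}: {key} entry lacks a numeric link")
            else:
                links[key].add(int(fields[0]))
    if not in_body:
        problems.append("no LBD line: header is not terminated")
    # Every word should have both an orthographic and a canonical entry.
    for link in sorted(links["ORT"] ^ links["KAN"]):
        problems.append(f"word link {link}: present in only one of ORT and KAN")
    return problems


good = ["LHD: Partitur 1.2", "SAM: 16000", "LBD:",
        "ORT: 0 guten", "ORT: 1 Tag", "KAN: 0 gu:t@n", "KAN: 1 ta:k"]
print(check_partitur(good))
print(check_partitur(good[:-1]))
```

Further dependencies, for example between the MAU tier and the sampling rate in the header, could be added in the same style.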