Revalidation report for the SI1000 Database

Authors	Florian Schiel, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version
Date	13.09.2004
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the SI1000 Corpus:

Summary

The corpus contains read speech from 10 different speakers. Each speaker
read approx. 1000 sentences from a German newspaper corpus,
resulting in a total of approx. 10000 recorded utterances.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus SI1000 made in the year 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The recording took place at the Institut fuer Phonetik, University of Munich, Germany in 1994.

I.) Validation of Documentation

The general documentation directory (doc/) contains the following documentation files for the SI1000 corpus:

README	general documentation
SI1000id.lis	list of speakers for the total corpus
SI1000.txt	list of spoken utterances
phondat.doc	documentation about the file format PhonDat
SI1000.lex	pronunciation dictionary in Extended German SAM-PA
ext_sam.txt	description of Extended German SAM-PA
partitur/	BAS Partitur Files
pardoc/	documentation of the BAS Partitur Files

· Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: CDROMs

Content of each medium: no information, total size 2.2 GB of compressed data

· Technical information:

Layout of media: Information about file system type and directory structure:
The data of each speaker are collected in a subdirectory named after the speaker id.
The volumes are stored on CDROMs with the High Sierra File System (HSFS)
or in ISO 9660 format with Rock Ridge extensions.

File nomenclature: Explanation of used codes (no white space in file names!):
<speaker_id><sex>d<sentence # from corpus>.16 ok
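The naming scheme above can be checked mechanically. The following is a minimal sketch, not part of the corpus tooling: it assumes a two-letter speaker id, a one-letter sex code and a numeric sentence number, as suggested by names such as csmd101.shn elsewhere in this report. The pattern and the function name `parse_name` are hypothetical.

```python
import re

# Assumed layout: <speaker_id><sex>d<sentence number>.<extension>
# Two-letter speaker id and one-letter sex code are assumptions inferred
# from file names such as "csmd101.shn"; they are not stated in the report.
NAME_RE = re.compile(
    r"^(?P<speaker>[a-z]{2})(?P<sex>[a-z])d(?P<sentence>\d+)\.(?P<ext>\w+)$",
    re.IGNORECASE,
)


def parse_name(filename):
    """Split a signal file name into its parts, or return None if it does not fit."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return {
        "speaker": match["speaker"].upper(),
        "sex": match["sex"].lower(),
        "sentence": int(match["sentence"]),
        "ext": match["ext"],
    }
```

A name containing white space, such as "cs md101.16", does not match the pattern and yields None.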

Formats of signals and annotation files: If non standard formats are used it is common to give a full description or to convert into a standard format: ok

- PhonDat 2 format signal file

Coding: .16 (PhonDat format signal file with 16 kHz sampling rate)

Compression: n. a.

Sampling rate: 16kHz ok

Valid bits per sample: (values other than 8, 16, 24 should be reported): 16 bit ok

Used bytes per sample: 2 bytes/samp ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
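The signal parameters above (16 kHz, 2 bytes per sample, no multiplexing) determine the duration of a recording from its size. The sketch below shows the arithmetic only. The header size of a PhonDat file is not given in this report, so it is passed in as a parameter and must be taken from phondat.doc; the function name `duration_seconds` is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # from the report: 16 kHz sampling rate
BYTES_PER_SAMPLE = 2      # from the report: 16 bit, 2 bytes per sample


def duration_seconds(file_size_bytes, header_bytes):
    """Duration of a single-channel recording, given file and header size in bytes.

    header_bytes is an assumption supplied by the caller; this report
    does not state the PhonDat header length.
    """
    payload = file_size_bytes - header_bytes
    if payload < 0 or payload % BYTES_PER_SAMPLE != 0:
        raise ValueError("payload size does not fit 2-byte samples")
    samples = payload // BYTES_PER_SAMPLE
    return samples / SAMPLE_RATE_HZ
```

One second of signal therefore occupies 32000 payload bytes. A payload that is not a multiple of 2 bytes would hint at a truncated or corrupt file.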

· Database contents:

Clearly stated purpose of the recordings:
no information

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) read sentences ok

Instruction to speakers in full copy: no information

· Linguistic contents of prompted speech:

Specifications of the individual text items: Sentences from a German news paper corpus

Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

· Linguistic contents of non-prompted speech:

Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) ok

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: No information, not important for this corpus

Number of speakers: 10 (SI1000id.lis) ok

Distribution of speakers over sex, age, dialect regions: given (/doc/SI1000id.lis)
Description/definition of dialect regions: indirectly given (/doc/SI1000id.lis)

· Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones:
- Company name and type id: Sennheiser Headset HMD 410
- Electret, dynamic, condenser: no information
- Directional properties: no information
- Mounting: no information

Position of speakers: (distance to microphone) Distance to mouth approx. 3-5 cm

Bandwidth: (if other than zero to half of sampling rate) ok

Number of channels and channel separation: no information

Acoustical environment: no information

· Annotation (BAS Partitur Format Files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): given

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: ok (the PAR documentation can be found in the pardoc/ directory)

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: given

· Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: an indirect reference, ok (ext_sam.txt is a description and mapping of Extended German SAM-PA)

Phonological or higher order phenomena accounted in the phonemic transcriptions: ok

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

· Others:

Any other essential language-dependent information or convention: given.

Indication of how many files were double-checked by the producer together with percentage of detected errors: known errors can be found in the README file.

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: The following files are missing:

Speaker ID	Missing Files
BJ	077, 079, 086
BK	077, 079, 086
BW	077, 079, 086, 201, 1000
CK	077, 079, 086
CS	077, 079, 086
HT	077, 079, 086, 101, 401, 501, 601
MH	077, 079, 086, 601
PG	077, 079, 086, 601
SH	077, 079, 086, 101, 201, 601, 801
TA	001, 077, 079, 086
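The table above can be tallied with a short script. This is a sketch for illustration, not the procedure used in the validation: the dictionary copies the table, while the expected range 1 to 1000 per speaker and the helper name `find_missing` are assumptions.

```python
# Missing sentence numbers per speaker, copied from the table above.
MISSING = {
    "BJ": [77, 79, 86],
    "BK": [77, 79, 86],
    "BW": [77, 79, 86, 201, 1000],
    "CK": [77, 79, 86],
    "CS": [77, 79, 86],
    "HT": [77, 79, 86, 101, 401, 501, 601],
    "MH": [77, 79, 86, 601],
    "PG": [77, 79, 86, 601],
    "SH": [77, 79, 86, 101, 201, 601, 801],
    "TA": [1, 77, 79, 86],
}


def find_missing(present, expected=range(1, 1001)):
    """Return the expected sentence numbers that are absent from `present`.

    The default range 1..1000 is an assumption about the corpus design.
    """
    present = set(present)
    return [n for n in expected if n not in present]


per_speaker = {speaker: len(numbers) for speaker, numbers in MISSING.items()}
total_missing = sum(per_speaker.values())
# Sentence numbers that are missing for every single speaker.
missing_for_all = set.intersection(*(set(v) for v in MISSING.values()))

print(total_missing)            # 43
print(sorted(missing_for_all))  # [77, 79, 86]
```

Sentences 077, 079 and 086 are absent for all ten speakers, which suggests they were never part of the recorded set rather than lost for individual speakers; the report itself does not state the reason.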

Superfluous files: csmd101.shn, csmd401.shn, csmd701.shn

Completeness of meta data files: ok

Completeness of annotation files: ok.

Correctness of file names: ok.

Empty files / corrupt files: Files 101, 401 and 701 of speaker CS.

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents:

All /R/ phonemes were changed to /r/ in order to conform to the BAS guidelines.

III.) Manual Validation

Approximately 4 % of the signal data were checked against the corresponding PAR files. The error rate was very low. The errors found were misspoken words and punctuation marks that were not spoken.
Pauses, hesitations, and breathings are not annotated.
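The report does not say how the 4 % subset was selected. As one possible approach, a reproducible random sample could be drawn as sketched below; the function name `draw_sample`, the fixed seed and the use of uniform random sampling are all assumptions.

```python
import random


def draw_sample(items, fraction=0.04, seed=0):
    """Draw a reproducible random subset containing about `fraction` of `items`."""
    count = round(len(items) * fraction)
    rng = random.Random(seed)  # fixed seed so the same subset can be re-checked later
    return sorted(rng.sample(list(items), count))
```

For a corpus of approx. 10000 utterances this yields about 400 files for manual comparison.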

IV.) Other Relevant Observations

V.) Comments for Improvement

The errors found in the manual validation could not be repaired.

VI.) Result

The corpus is ok. The corpus, its history and its known errors are well documented and no data or documentation files are missing.