Revalidation report for the SC10 Database

Authors	Florian Schiel, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version	1.2
Date	07.07.2004
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the SC10 Corpus:

Summary

The speech corpus SC10 has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Missing data reduces the usability of the corpus and may cause problems when the corpus is used for other applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus SC10 carried out in 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus contains read and non-prompted German and mother tongue speech of 70 different speakers with 17 different mother tongues (L1) in a variety of speaking styles.

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the SC10 corpus which can be found under: doc/

README	general documentation
type_a.txt	list of type a sentences
type_b.txt	list of type b utterances
type_c.txt	list of type c story
type_f.txt	fairy tale for the re-telling type of speech
pardoc	documentation about the BAS Partitur File (BPF)
IPAChart96.pdf	IPA chart
IPANumberChart96.pdf	IPA chart with IPA numbers
SC10_lex.txt	canonical pronunciation dictionary
SC10_list.txt	word list
trllex_d.ps	documentation of the Verbmobil II transliteration format

· Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: 2 CDROM ok

Content of each medium: total size 736 MB. ok

· Technical information:

Layout of media: Information about file system type and directory structure:
CDROM with ISO9660 format ok

File nomenclature: Explanation of used codes (no white space in file names!):
<mother tongue><sex><speaker number within L1><type of speech a-f><number utterance> ok

Formats of signals and annotation files: If non standard formats are used it is common to give a full description or to convert into a standard format:
The signal file format is NIST SPHERE. ok

Coding: PCM linear ok

Compression: n. a.

Sampling rate: 16 kHz ok

Valid bits per sample: (others than 8, 16, 24, should be reported): 16 bits/samp ok

Used bytes per sample: 2 bytes/samp ok

Multiplexed signals: (exact de-multiplexing algorithm; tools)
n. a.
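
The signal parameters listed above can be checked automatically against the header of each NIST SPHERE file. The following is a minimal sketch, not the tool used for this validation. The function names parse_sphere_header and check_signal_format are hypothetical, and the expected values are copied from the entries above. The sketch assumes the usual SPHERE layout: an ASCII header that starts with "NIST_1A", then the header size, then "name -type value" lines terminated by "end_head".

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    for line in lines[2:]:           # lines[1] holds the header size in bytes
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:                        # "-sN": string value of length N
            fields[name] = value
    return fields


# Expected values taken from the technical information in this report.
EXPECTED = {"sample_rate": 16000, "sample_n_bytes": 2, "sample_sig_bits": 16}


def check_signal_format(fields: dict) -> list:
    """Return (field, expected, found) for every field that does not match."""
    return [(k, v, fields.get(k)) for k, v in EXPECTED.items()
            if fields.get(k) != v]
```

For a real file, the header bytes would be read in binary mode (the header is usually 1024 bytes long) and passed to parse_sphere_header; an empty list from check_signal_format means all three parameters agree with the documentation.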

· Database contents:

Clearly stated purpose of the recordings:
This corpus may be used for several tasks. ok

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy: no information, not ok

· Linguistic contents of prompted speech:

Specifications of the individual text items: ok

Specification for the prompt sheet design or specification of the design of the speech prompts: not given, not ok

Example prompt sheet or example sound file from the speech prompting: not given, not ok

· Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) ok

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) ok

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: No information, not important for this corpus

Number of speakers: 70 (48 female, 22 male) ok

Distribution of speakers over sex, age, dialect regions: indirectly given in the README file, not ok
Description/definition of dialect regions: No information

· Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones:
- Company name and type id: Sennheiser Headset HMD 420-6
- Electret, dynamic, condenser: no information
- Directional properties: no information
- Mounting: indirect information, the speaker wears a headset

Position of speakers: (distance to microphone) No information

Bandwidth: (if other than zero to half of sampling rate) ok

Number of channels and channel separation: no information

Acoustical environment: studio quality, no echo-cancelled environment

The L2 speaker was recorded in an acoustically insulated room with eye contact to the L1 speaker (who was not recorded).

· Annotation (Transliteration):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): trllex_d.ps

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: n.a.

Any other language dependent information as abbreviations etc: n.a.

Annotation manual, guidelines, instructions: ok

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: not given

· Annotation (BAS Partitur Format Files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): given

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: ok – pardoc, SC10_lex.txt, IPANumberChart96.pdf, IPAChart96.pdf (http://www.bas.uni-muenchen.de/Bas/BasFormatseng.html)

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: given

· Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: an indirect reference, ok (http://www.bas.uni-muenchen.de/Bas/BasFormatseng.html)

Phonological or higher order phenomena accounted in the phonemic transcriptions: ok

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

· Others:

Any other essential language-dependent information or convention: given.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: not ok

- itm1e001.nis is missing
- rsm2a004.nis is missing

In type d, speaker arm1 spoke some sentences twice; these files are named differently from the normal data files:
arm1dd02.nis
arm1dd03.nis
arm1dd04.nis
arm1dd05.nis
arm1dd06.nis
arm1dd07.nis

Completeness of meta data files: ok

Completeness of annotation files:

par files:
- all the annotation files of speaker aew1 are missing
- rsm2f014.par is missing
- There are no type a annotation files for speaker rsm2. Speaker rsm2 spoke the type a prompts, but this is not noted in the README file.
- fiw1e029.par is missing

There are no annotation files for "type d".

trl files:

There are two files of jpw1e:
- jpw1e000.trl
- jpw1e001.trl
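
A completeness check of this kind can be automated by comparing base names across file types. The sketch below is an illustration under assumptions, not the script used for this report: the function name missing_annotations is hypothetical, it assumes one .par file is expected per .nis signal file, and the listing is a toy example rather than the actual corpus contents.

```python
from pathlib import PurePosixPath


def missing_annotations(file_names, signal_ext=".nis", annot_ext=".par"):
    """Return base names that have a signal file but no annotation file."""
    signals = {PurePosixPath(n).stem for n in file_names
               if n.endswith(signal_ext)}
    annots = {PurePosixPath(n).stem for n in file_names
              if n.endswith(annot_ext)}
    return sorted(signals - annots)


# Toy listing for illustration only, not the actual corpus contents.
listing = ["fiw1e028.nis", "fiw1e028.par", "fiw1e029.nis", "rsm2f014.nis"]
print(missing_annotations(listing))  # ['fiw1e029', 'rsm2f014']
```

Swapping the two extensions reports the opposite case, annotation files without a matching signal file.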

Correctness of file names:

File names that deviate from the file nomenclature:
arm1dd02.nis
arm1dd03.nis
arm1dd04.nis
arm1dd05.nis
arm1dd06.nis
arm1dd07.nis
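
Deviating names such as these can be found with a pattern match against the file nomenclature. The pattern below is an assumption inferred from the file names cited in this report (two-letter mother tongue code, sex m or w, one-digit speaker number, speech type a-f, three-digit utterance number); the authoritative definition is the one in the README. The name parse_name is hypothetical.

```python
import re

# Assumed pattern, inferred from file names cited in this report.
NAME_RE = re.compile(
    r"^(?P<l1>[a-z]{2})(?P<sex>[mw])(?P<speaker>\d)"
    r"(?P<type>[a-f])(?P<utt>\d{3})\.(?P<ext>nis|par|trl)$"
)


def parse_name(file_name):
    """Return the name components as a dict, or None if the name deviates."""
    m = NAME_RE.match(file_name)
    return m.groupdict() if m else None


print(parse_name("itm1e001.nis"))  # components of a regular name
print(parse_name("arm1dd02.nis"))  # None: does not fit the assumed pattern
```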

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: ok

III.) Manual Validation

5% of the data and annotation files were checked against each other. The trl files were not checked. 9.9% of the data contained errors. Some kinds of noise were not tagged, and there were errors in the orthographic and SAMPA annotation.
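
This report does not state how the 5% subsample was selected. As an illustration only, the sketch below assumes simple random sampling with a fixed seed, so that the same files can be drawn again in a later revalidation. The function name draw_subsample and the seed value are hypothetical.

```python
import random


def draw_subsample(items, fraction=0.05, seed=0):
    """Draw a reproducible random subsample of about `fraction` of the items."""
    k = max(1, round(len(items) * fraction))
    # Sort first so the result does not depend on the input order.
    return sorted(random.Random(seed).sample(sorted(items), k))


# Toy base names for illustration only.
names = [f"utt{i:03d}" for i in range(200)]
sample = draw_subsample(names)
print(len(sample))  # 10 files, i.e. 5% of 200
```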

IV.) Other Relevant Observations

There is no speaker file for this corpus.

V.) Comments for Improvement

The revalidation was able to repair some data (speaker file, README, file names). The errors found in the manual validation could not be repaired. The SAMPA annotations and the noise markers should be revised.

VI.) Result

The corpus is ok. No data or documentation files are missing.