Revalidation report for the TAXI Database

Authors	Florian Schiel, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version	2.1
Date	18.06.2003
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the TAXI Corpus:

Summary

The speech corpus TAXI has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Note that missing data reduces the usability of the corpus and may cause problems when the corpus is used for further applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus TAXI, carried out in the year 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus was created in June 2001 in collaboration with the DFKI in Saarbrücken.

TAXI contains 94 recorded dialogues between a cab dispatcher and a client, recorded over public phone lines (network and GSM). The dispatcher always spoke German, while the clients always spoke English.

The corpus contains 94 recordings. Each recording contains a pre-dialogue part, the turns of the dispatcher and the client, and a hang-up part that runs from the last turn until both parties have hung up. Note that the existence of a recording does not imply that the recording went without errors.

I.) Validation of Documentation

The general documentation of the TAXI corpus consists of the following files, which can be found under doc/ (the README file is in the main directory):

README	general documentation
WORDS_DE.TXT	Word list, German part
WORDS_EN.TXT	Word list, English part
PRON_DE.LEX	Pronunciation dictionary, extended German SAM-PA, German part (spoken sentences AND translations)
SAMPA.txt	table of extended German SAMPA
PRON_EN.LEX	Pronunciation dictionary, English SAM-PA, English part (spoken sentences AND translations)
TRANS.TXT	Results of validation, transcription and translation
TRANSCRP.PDF	description of rules and conventions of SpeechDat transcription (German)
TRANSCRP_EN.PDF	description of rules and conventions of SpeechDat transcription (English)
BasFormatseng.html	Description of the BAS Partitur Format (see www.bas.uni-muenchen.de/Bas/BasFormatseng.html for an updated version of this document)
softw/	some tools that are distributed with the BAS corpora
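
The presence of the documentation files listed above can be checked mechanically. The following is a minimal sketch, assuming the layout described in this report (README in the main directory, all other files under doc/). The function name `missing_doc_files` and the corpus root passed to it are hypothetical and not part of the corpus distribution.

```python
from pathlib import Path

# File names copied from the documentation table in this report.
DOC_FILES = [
    "WORDS_DE.TXT", "WORDS_EN.TXT", "PRON_DE.LEX", "SAMPA.txt",
    "PRON_EN.LEX", "TRANS.TXT", "TRANSCRP.PDF", "TRANSCRP_EN.PDF",
    "BasFormatseng.html",
]

def missing_doc_files(corpus_root):
    """Return the expected documentation files that are absent or empty.

    ASSUMPTION: README sits in the corpus root, everything else in doc/.
    """
    root = Path(corpus_root)
    expected = [root / "README"] + [root / "doc" / name for name in DOC_FILES]
    return [str(p.relative_to(root)) for p in expected
            if not p.is_file() or p.stat().st_size == 0]
```

An empty result means that every listed file exists and is not empty; it says nothing about the quality of the contents.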

· Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: CDROM ok

Content of each medium: no information

· Technical information:

Layout of media: Information about file system type and directory structure:
CDROM- ASCII, German Umlauts are coded in ISO-8859 ok

File nomenclature: Explanation of used codes (no white space in file names!):
<4 digit recording number><Speaker marker><Turn number starting with '00'> ok

Formats of signals and annotation files: If non standard formats are used it is common to give a full description or to convert into a standard format: ok

- NIST file PCM coding
- NIST file ALAW coding

Coding: PCM, ALAW

Compression: n. a.

Sampling rate: 8kHz ok

Valid bits per sample: (others than 8, 16, 24, should be reported): ALAW coding: bits/samp, PCM coding, 16 bit ok

Used bytes per sample: 2 bytes/samp ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
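
The technical parameters above (8 kHz sampling rate, mono) can be compared with the header of each NIST file. The following is a minimal sketch of such a check. It assumes the usual NIST SPHERE header layout (a `NIST_1A` line, a header size line, `name -type value` lines and a closing `end_head`) and the common field names `sample_rate` and `channel_count`; whether the TAXI files carry exactly these fields is an assumption. The function names are hypothetical.

```python
def parse_sphere_header(raw):
    """Parse the ASCII header of a NIST SPHERE file (given as bytes)."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_bytes": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip padding or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return fields

def check_signal_format(fields):
    """Compare header fields with the values stated in the documentation."""
    problems = []
    if fields.get("sample_rate") != 8000:
        problems.append("sample_rate is not 8000 Hz")
    if fields.get("channel_count") != 1:
        problems.append("channel_count is not 1 (mono)")
    return problems

# Synthetic header for illustration only, not taken from the corpus.
example = (b"NIST_1A\n   1024\nsample_rate -i 8000\nchannel_count -i 1\n"
           b"sample_n_bytes -i 2\nsample_coding -s3 pcm\nend_head\n")
print(check_signal_format(parse_sphere_header(example)))
```

Coding-dependent fields such as `sample_n_bytes` and `sample_coding` are parsed but deliberately not checked here, because the expected values differ between the PCM and ALAW files.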

· Database contents:

Clearly stated purpose of the recordings:
no information

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy: The speakers had to follow specific conventions: To prevent overlap and to allow automatic segmentation by the recording server, each party had to press a button on the phone to signal to the other party that the turn was over.

· Linguistic contents of prompted speech:

Specifications of the individual text items: ok

Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

· Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) ok

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) ok

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: No information, not important for this corpus

Number of speakers: 94 (no further information about the speakers)
ok

Distribution of speakers over sex, age, dialect regions: not given
Description/definition of dialect regions: No information

· Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones: The data is recorded over public phone lines (network and GSM)
- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.

Position of speakers: (distance to microphone) No information

Bandwidth: (if other than zero to half of sampling rate) ok

Number of channels and channel separation: mono

Acoustical environment: City of Hanover

· Annotation (BAS Partitur Format Files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): given

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: ok – (WORDS_DE.TXT, WORDS_EN.TXT, SAMPA.txt, TRANS.TXT, TRANSCRP.PDF, TRANSCRP_EN.PDF, BasFormatseng.html)

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: given

· Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: an indirect reference, ok. (http://www.bas.uni-muenchen.de/Bas/BasFormatseng.html, SAMPA.txt)

Phonological or higher order phenomena accounted in the phonemic transcriptions: ok

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

· Others:

Any other essential language-dependent information or convention: given.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: There are 640 usable and 247 garbage turns.

The following sessions contain no usable turns:
0049
0053
0075
0083
0092
0094
0099
0127

Completeness of meta data files: ok

Completeness of annotation files: ok.

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: Unintelligible words (marked "**") are not annotated in the lexicons.
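
Two of the formal checks in this list, correctness of file names and detection of empty files, could be automated along the following lines. This is only a sketch and not the tool used for this validation. The pattern follows the nomenclature given in the documentation (4 digit recording number, speaker marker, turn number), but the exact shape of the speaker marker and the width of the turn number are assumptions, as are the example file names.

```python
import os
import re

# ASSUMPTION: the speaker marker is a single letter and the turn number has
# at least two digits; the report does not spell out either detail.
NAME_RE = re.compile(r"^(\d{4})([A-Za-z])(\d{2,})$")

def parse_turn_name(filename):
    """Split a turn file name into (recording, speaker, turn) or return None."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    m = NAME_RE.match(stem)
    if m is None:
        return None  # also rejects names that contain white space
    return m.group(1), m.group(2), int(m.group(3))

def check_files(paths):
    """Return (files with unexpected names, files of size zero)."""
    bad_names = [p for p in paths if parse_turn_name(p) is None]
    empty = [p for p in paths if os.path.isfile(p) and os.path.getsize(p) == 0]
    return bad_names, empty

# Hypothetical names for illustration only.
print(parse_turn_name("0049D00.nis"))   # ('0049', 'D', 0)
print(parse_turn_name("0049 D00.nis"))  # None
```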

III.) Manual Validation

For 10% of the 'usable' data, the data, the BAS files and the file TRANS.TXT were compared with each other. 9.2% of the data contained errors. Most errors were found in the client's turns. Some words and pauses were not annotated or were annotated incorrectly.
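
To make the size of the manual check concrete, the sketch below draws a reproducible 10% subsample from the 640 usable turns reported by the automatic validation. The report does not state how the subsample was selected or what the sampling unit was, so random selection per turn is an assumption, and the integer turn ids are placeholders.

```python
import random

USABLE_TURNS = 640   # from the automatic validation
GARBAGE_TURNS = 247  # from the automatic validation

def usable_share(usable, garbage):
    """Fraction of all recorded turns that are usable."""
    return usable / (usable + garbage)

def draw_sample(items, fraction=0.10, seed=0):
    """Draw a reproducible random subsample without replacement."""
    k = round(len(items) * fraction)
    return sorted(random.Random(seed).sample(items, k))

turn_ids = list(range(USABLE_TURNS))  # placeholder ids, not real file names
sample = draw_sample(turn_ids)
print(len(sample))                                          # 64
print(f"{usable_share(USABLE_TURNS, GARBAGE_TURNS):.1%}")   # 72.2%
```

Fixing the seed means that a later revalidation can check the same turns again.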

IV.) Other Relevant Observations

No information is given about the 'clients'.

V.) Comments for Improvement

There is no information about the speakers of this corpus. The revalidation was able to repair some data (lexicon, README). The errors found during the manual validation could not be repaired.

VI.) Result

The corpus is ok. The corpus is well documented and no data or documentation files are missing.