Revalidation report for the HEMPEL Database

Authors	Florian Schiel, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version	1.3
Date	22.07.2004
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the HEMPEL Corpus:

Summary

The speech corpus HEMPEL has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Note that missing data reduces the usability of the corpus and may cause problems when the corpus is used in further applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus HEMPEL, carried out in 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus was created in January 2003 in collaboration at the same institute.

Hempels Sofa is a collection of more than 3900 spontaneous speech items recorded as extra material during the German SpeechDat-II project. Speakers were asked to report what they had been doing during the last hour: "Was haben Sie in der letzten Stunde gemacht?" ("What did you do in the last hour?"). This item was recorded as the last item of the recording session. By then the speakers had become acquainted with the recording procedure, and they were quite relaxed because they knew that this item was the last to be recorded. This resulted in quite natural, colloquial speech, sometimes with a marked regional accent.

The Hempels Sofa spontaneous speech data collection consists of 3909 recordings from telephone calls, stored on 2 CD-ROMs in a format compatible with the SpeechDat(II) database exchange format as defined in the SpeechDat deliverable SD 1.3.1 V.4.3.

I.) Validation of Documentation

The general documentation for the HEMPEL corpus consists of the following files, which can be found in:

1.) the main directory

README	file describing the database structure
DISK.ID	Volume identifier for ISO 9660 file systems
COPYRIGH.TXT	Copyright text

2.) the directory /doc:

BASCORPO.PDF	LREC 2002 paper describing the corpus
ISO88591.PDF	ISO8859-1 (ISO Latin) code table
SAMPSTAT.TXT	SNR values
SD131V43.DOC	Database Exchange Format Specification
SD131V43_{1..7}.PDF	Database Exchange Format Specification in PDF
SD132V24.{DOC|PDF}	Orthographic and Transcription Conventions
SNR.TXT	Description of SNR computation by SPEX
SUMMARY.TXT	German summary file
TRANSCRP.{HTM|PDF}	the validation and transcription handbook

3.) the directory /table:

LEXICON.TBL	pronunciation dictionary
SPEAKER.TBL	speaker information file
SESSION.TBL	session information file

4.) the directory /index:

CONTENTS.LS	contents of the database

· Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: 2 CD-ROMs ok

Content of each medium: CD-ROM HEMPEL_01 contains sessions 1000-4499, CD-ROM HEMPEL_02 contains sessions 4500-5809
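
As a minimal sketch of how the distribution over the two media could be looked up, the following function maps a session number to its CD-ROM. The function name is hypothetical and not part of the corpus; only the session ranges are taken from the documentation above.

```python
def medium_for_session(session: int) -> str:
    """Return the CD-ROM label that holds the given session number.

    The ranges are those stated in the corpus documentation;
    the function itself is an illustrative helper.
    """
    if 1000 <= session <= 4499:
        return "HEMPEL_01"
    if 4500 <= session <= 5809:
        return "HEMPEL_02"
    raise ValueError(f"session {session} is outside the documented range 1000-5809")


print(medium_for_session(1844))  # HEMPEL_01
print(medium_for_session(4500))  # HEMPEL_02
```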

· Technical information:

Layout of media: Information about file system type and directory structure:
2 CD-ROMs (compatible with the SpeechDat II database)

File nomenclature: Explanation of used codes (no white space in file names!):
<a1><4 digit recording number><x1>< .dea | .deo> ok

Formats of signals and annotation files (if non-standard formats are used, it is common to give a full description or to convert into a standard format): ok, raw signal file

Coding: ALAW

Compression: n. a.

Sampling rate: 8kHz ok

Valid bits per sample (values other than 8, 16, 24 should be reported): 8 bit, ok.

Used bytes per sample: 1 byte/sample, ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
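
The file nomenclature and the signal parameters listed above can be checked with very little code. The sketch below is an illustration only, not the procedure used in this validation. It assumes that file names are lower case, as in the examples quoted in this report, and that the raw signal files have no header, so that the file size in bytes equals the number of samples. The names FILE_NAME, is_valid_name and duration_seconds are hypothetical.

```python
import re

SAMPLE_RATE_HZ = 8000   # sampling rate stated in the documentation
BYTES_PER_SAMPLE = 1    # 8-bit A-law, one byte per sample

# "a1" + 4-digit recording number + "x1" + ".dea" or ".deo"
FILE_NAME = re.compile(r"a1\d{4}x1\.de[ao]")


def is_valid_name(name: str) -> bool:
    """True if the whole file name follows the documented pattern."""
    return FILE_NAME.fullmatch(name) is not None


def duration_seconds(size_bytes: int) -> float:
    """Duration of a headerless signal file (assumption: no header bytes)."""
    return size_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


print(is_valid_name("a11844x1.deo"))   # True
print(is_valid_name("a1184x1.deo"))    # False, only three digits
print(duration_seconds(80000))         # 10.0
```

In practice the size would come from os.path.getsize on each signal file; it is passed in directly here so that the example runs without access to the corpus.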

· Database contents:

Clearly stated purpose of the recordings: ok. (BASCORPO.PDF)

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy: n.a.

· Linguistic contents of prompted speech:

Specifications of the individual text items: spontaneous speech, ok

Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

· Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) ok

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) ok

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: detailed information in the README file, ok.

Number of speakers: 3909 (more information about the speakers in SPEAKER.TBL), ok

Distribution of speakers over sex, age, dialect regions: ok, (SPEAKER.TBL)
Description/definition of dialect regions: ok, (SPEAKER.TBL)

· Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones: The data is recorded via a primary rate ISDN interface using Aculab Dialogics hardware and proprietary software.
- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.

Position of speakers: (distance to microphone) No information

Bandwidth: (if other than zero to half of sampling rate) no information

Number of channels and channel separation: 1 channel

Acoustical environment: home environment

· Annotation (SAM label file):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): given

Distinction of homographs that are not homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: ok (all recordings were transcribed according to the SpeechDat-II guidelines using the WWWTranscribe software, DOC/TRANSCRP.PDF)

Description of quality assurance procedures: not given

Selection of annotators: native speakers of German

Training of annotators: trained with material from SpeechDat-M and -II recordings

Annotation tools used: WWWTranscribe software

· Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: ok. (/DOC/PRONCONV)

Phonological or higher order phenomena accounted in the phonemic transcriptions: ok

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: ok

· Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: ok

Completeness of meta data files: ok

Completeness of annotation files: ok.

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: The following two words from the SAM files could not be found in the lexicon because they are misspelled in the annotation:
1.) Pensonierungsvorgang (in the file "a11844x1.deo")
2.) Vetriebsleitertagung (in the file "a11427x1.deo")
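
A cross check between annotation and lexicon of this kind can be sketched as follows. This is a simplified illustration, not the tool used in the validation. It assumes that the transcription text has already been extracted from each SAM file and that the lexicon words have been read from LEXICON.TBL; that tokens in square brackets are noise markers; and that a leading '*' or '~' is an annotation mark rather than part of the word. The function name oov_words and the sample transcription are invented for the example.

```python
import re


def oov_words(transcriptions, lexicon_words):
    """Return {file name: [words that are not in the lexicon]}."""
    known = {w.lower() for w in lexicon_words}
    missing = {}
    for name, text in transcriptions.items():
        # Assumption: tokens such as "[fil]" are noise markers, not words.
        tokens = [t for t in text.split() if not re.fullmatch(r"\[.*\]", t)]
        # Assumption: leading '*' or '~' are annotation marks; drop punctuation.
        words = [t.lstrip("*~").strip(".,?!") for t in tokens]
        bad = [w for w in words if w and w.lower() not in known]
        if bad:
            missing[name] = bad
    return missing


# Invented toy data; only the file name and the misspelled word come from this report.
transcriptions = {"a11844x1.deo": "ich war beim Pensonierungsvorgang [fil]"}
lexicon = ["ich", "war", "beim", "Pensionierungsvorgang"]

print(oov_words(transcriptions, lexicon))
# {'a11844x1.deo': ['Pensonierungsvorgang']}
```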

III.) Manual Validation

5% of the 'usable' data were checked manually by comparing the BAS files with the annotation SAM files. 11.79% of the checked data contained errors (23 errors out of 195): in these cases some of the speaker's words were not annotated, or background noise was not marked.
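
The figures above can be reproduced with a short calculation. The sketch assumes that the sample of 195 corresponds to 5% of the 3909 recordings, rounded down; the report itself only states the sample size and the number of errors.

```python
total_recordings = 3909
sample_size = total_recordings * 5 // 100   # 5% sample, rounded down (assumption)
errors = 23

error_rate = errors / sample_size

print(sample_size)           # 195
print(f"{error_rate:.2%}")   # 11.79%
```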

IV.) Other Relevant Observations

The annotations were made on the basis of the DUDEN lexicon. As a result, colloquial speech elements were sometimes annotated as mispronounced words and sometimes spelled exactly as in their dictionary entry.

V.) Comments for Improvement

The revalidation was able to repair some data (README, typos in the SAM files, LEXICON.TBL) and to update the annotation files. The errors found in the manual validation could not be repaired.

VI.) Result

The corpus is ok. The corpus is well documented and no data or documentation files are missing.