Validation report for the FORMTASK Database

Authors	Michael Medla, Susanne Beinrucker, Florian Schiel
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2180-5790
Corpus Version	2014-02-28 2.0
Date	2014
Status	finished
Comment	Aside from minor missing information, the corpus is in good to very good shape for scientific and technical usage.
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the FORMTASK Corpus:

Summary

The speech corpus FORMTASK has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. The corpus is complete and in good condition. No major problems have been identified.

Aside from minor missing information, the corpus is in good to very good shape for scientific and technical usage.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus FORMTASK carried out in 2014 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus was created in 2003 in a collaboration at the same institute.

FORMTASK is a collection of more than 17000 spontaneous speech items recorded as extra material during the German SpeechDat-II project. Speakers were asked to describe typical forms found in everyday life, e.g. public transport tickets or invoices. To this end, they were asked to answer the following questions: 1) What type of form is this? 2) What date is on the form? 3) What amount is on the form? 4) Where is the amount written on the form? The forms were presented in black and white on paper.

The FORMTASK speech database consists of 4366 recording sessions with a total of 17293 recorded audio files from telephone calls in a format compatible with the SpeechDat(II) database exchange format as defined in the SpeechDat deliverable SD 1.3.1 V.4.3.

I.) Validation of Documentation

The general documentation of the FORMTASK corpus consists of the following files, which can be found in:

1.) the main directory

README	File describing the database and its structure
DISK.ID	Directory ID for OS
COPYRIGH.TXT	Copyright text

2.) the directory /doc:

SAMPSTAT.TXT	SNR values
SD131V43.DOC	Database Exchange Format Specification
SD131V43_{1..7}.PDF	Database Exchange Format Specification in PDF
SD132V24.{DOC|PDF}	Orthographic and Transcription Conventions
SNR.TXT	Description of SNR computation by SPEX
TRANSCRP.PDF	The validation and transcription handbook

3.) the directory /table:

FORMTASK.TBL	Tab delimited text file with summary information
LEXICON.TBL	Pronunciation dictionary
SPEAKER.TBL	Speaker information file
SESSION.TBL	Session information file

4.) the directory /index:

CONTENT.LST	Content of the database

5.) the directory /images:

BERLINITICKET.PDF	Berlin public transport ticket
BUERKLIN.PDF	Invoice
KNOLLEBW.PDF	Austrian parking ticket
QUITTUNG.PDF	Newsstand receipt
UEBERWEISUNG.PDF	Money transfer

· Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: n.a.

Content of each medium: ok (file tree of master)

· Technical information:

Layout of media: Information about file system type and directory structure: n.a.

File nomenclature: Explanation of used codes (no white space in file names!):
<a1><4 digit session number><y[6-9]>< .dea | .deo> ok

Formats of signals and annotation files: If non standard formats are used it is common to give a full description or to convert into a standard format: ok, raw signal file

Coding: ALAW, ok

Compression: n.a.

Sampling rate: 8kHz, ok

Valid bits per sample: (others than 8, 16, 24, should be reported): 8 bit, ok

Used bytes per sample: 1 byte/sample, ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) single channel, signals were not changed after recording ok
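
The nomenclature and signal parameters listed above can be checked mechanically. The following is a minimal sketch, not part of the corpus tooling. It assumes that names are matched case-insensitively, that .dea holds the A-law signal and .deo the SAM label file (as in SpeechDat), and that signal files are headerless, so that one byte corresponds to one sample at 8 kHz. The function names are illustrative.

```python
import re

# Pattern from the documented nomenclature: "a1", 4-digit session number,
# item code "y6".."y9", extension ".dea" or ".deo".
# Case-insensitive matching is an assumption.
NAME_RE = re.compile(r"^a1(\d{4})y([6-9])\.de([ao])$", re.IGNORECASE)

SAMPLE_RATE_HZ = 8000   # documented sampling rate
BYTES_PER_SAMPLE = 1    # documented: 8-bit A-law, single channel


def parse_name(file_name):
    """Return (session, item, kind), or None if the name does not conform."""
    m = NAME_RE.match(file_name)
    if m is None:
        return None
    # Assumption: "a" marks the signal file, "o" the label file.
    kind = "signal" if m.group(3).lower() == "a" else "label"
    return int(m.group(1)), "y" + m.group(2), kind


def duration_seconds(num_bytes):
    """Duration of a raw (headerless) signal file of the given size."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
```

For example, a conforming raw signal file of 16000 bytes would correspond to two seconds of speech under these assumptions.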

· Database contents:

Clearly stated purpose of the recordings: ok

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) description of images/forms -> semi-spontaneous speech ok

Instruction to speakers in full copy: no

· Linguistic contents of prompted speech:

Specifications of the individual text items: spontaneous speech, ok

Specification for the prompt sheet design or specification of the design of the speech prompts: prompt sheet design, ok

Example prompt sheet or example sound file from the speech prompting: missing

· Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) n.a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: ok (detailed information in the README-file)

Number of speakers: ok, 4366 speakers

Distribution of speakers over sex, age, dialect regions: ok

Description/definition of dialect regions: ok

· Recording platform and recording conditions:

Recording platform: ISDN interface using Aculab Dialogics hardware and proprietary software

Position and type of microphones: no information
- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.

Position of speakers: (distance to microphone) no information

Bandwidth: (if other than zero to half of sampling rate) ISDN bandwidth 300-3300Hz ok

Number of channels and channel separation: 1 channel ok

Acoustical environment: 3 types (home, office, telephone booth)

Recording hardware, telephone link (analog, digital): ISDN interface using Aculab Dialogics hardware and proprietary software

Network from where the call originated: no information

Type of handset: ok

· Annotation (SAM label file):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): n.a. (The annotation of the slang words was made according to their canonical form in the dictionary. Sometimes the slang words were annotated as mispronounced words.)

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok (German SAMPA)

Any other language dependent information as abbreviations etc: n.a.

Annotation manual, guidelines, instructions: ok (all recordings were transcribed according to the SpeechDat-II guidelines using the WWWTranscribe software, DOC/TRANSCRP.PDF)

Description of quality assurance procedures: not given

Selection of annotators: native speakers of German

Training of annotators: trained with material from SpeechDat-M and -II recordings

Annotation tools used: WWWTranscribe software

· Lexicon:

Format: simple ASCII list with two columns ok

Text-to-phoneme procedure: manual transcription by experienced phonetician ok

Explanation or reference to the phoneme set: German SAM-PA ok

Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): not given

Word frequency table: not given

· Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: ok

Completeness of meta data files: ok

Completeness of annotation files: ok

Correctness of file names: ok

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: ok
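
The completeness and empty-file checks listed above could be automated along the following lines. This is a minimal sketch under stated assumptions, not the script used for this validation: it assumes that every signal file (.dea) must have a label file (.deo) with the same base name in the same directory, and the function name check_pairs is hypothetical.

```python
from pathlib import Path


def check_pairs(root):
    """Report signal/label files without a partner, and empty files."""
    root = Path(root)
    signals, labels, empty = set(), set(), []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix not in (".dea", ".deo"):
            continue
        if path.stat().st_size == 0:
            empty.append(path.name)
        # Key: path relative to the corpus root, without extension.
        key = path.relative_to(root).with_suffix("").as_posix().lower()
        (signals if suffix == ".dea" else labels).add(key)
    return {
        "signal_without_label": sorted(signals - labels),
        "label_without_signal": sorted(labels - signals),
        "empty": sorted(empty),
    }
```

A corpus passing the checks above would yield three empty lists.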

III.) Manual Validation

About two percent (~2.2%; ~376 files) of the audio files and their SAM-files were checked manually. A few mistakes on the orthographic level of some SAM files (12 files) were noticed (e.g. wrong annotation, wrong or too many noise markers). Also, not every session contains four files (two files).
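
The report does not state how the manually checked files were selected. As an illustration only, a reproducible uniform random subsample of the stated size could be drawn as sketched below; the uniform selection, the seed and the function name draw_sample are assumptions.

```python
import random


def draw_sample(items, fraction, seed=0):
    """Draw a reproducible random subsample of roughly `fraction` of items."""
    items = list(items)
    k = round(len(items) * fraction)
    # A fixed seed makes the selection repeatable for a later revalidation.
    return sorted(random.Random(seed).sample(items, k))


# Share of the corpus covered by the reported 376 manually checked files.
share = 376 / 17293  # about 0.0217, i.e. roughly 2.2 %
```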

IV.) Other Relevant Observations

Some folders and files are missing. SPEAKER.TBL, SESSION.TBL and LEXICON.TBL, which should be found in the TABLE-directory, and CONTENTS.LST from the INDEX-directory are not available. No relevant information about CD-ROMs can be found (no DISK.ID).

There are also a few inconsistencies in the README. It says that there is no information about the telephone handsets used (paragraph Speaker), but that information is given by the label PHM in the meta data files. Furthermore, the data structure described in the paragraph “CD-ROM-Structure” of the README differs from the structure in the master (in the master the BLOCK-directory is not in the DATA-directory).

*.DEO-files are encoded in ASCII and not in ISO-8859-1. For testing purposes a directory has been created using FORMTASK_utf8.tab (see /source). This showed that there must originally have been 17297 recordings; four of them were unusable and were not kept in the corpus, together with their label files. In the SAM-files as well as in CONTENTS.LST the path (\FIXED1DE\) was wrong: it pointed to a non-existent directory. This error was corrected manually.
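
Two of the observations above, the encoding of the label files and the wrong path, lend themselves to a simple automatic check. The sketch below is illustrative and not the procedure used here. Since every byte sequence is formally valid ISO-8859-1, it only tests whether a file is pure 7-bit ASCII. The function names are hypothetical.

```python
def non_ascii_offsets(data):
    """Return the byte offsets of all non-ASCII bytes (empty if pure ASCII)."""
    return [i for i, b in enumerate(data) if b >= 0x80]


def lines_with_path(text, needle="\\FIXED1DE\\"):
    """Return 1-based numbers of the lines that still contain the old path."""
    return [n for n, line in enumerate(text.splitlines(), start=1)
            if needle in line]
```

An empty result from both functions for every label file would confirm that the files are plain ASCII and that no reference to the old path remains.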

V.) Comments for Improvement

After the missing table files (see IV.) have been built with the given scripts, a few adjustments to the README are also necessary. As mentioned above, the data structure in the corpus differs from the one described in the original README. Furthermore, it is suggested to document information such as which files were deleted after the recording sessions or after being reviewed for the corpus.

VI.) Result

Aside from minor missing information, the corpus is in good to very good shape for scientific and technical usage.