Authors |
Michael Medla, Susanne Beinrucker, Florian Schiel |
Affiliation |
BAS Bayerisches Archiv für Sprachsignale |
Postal address |
Schellingstr. 3 |
|
schiel@phonetik.uni-muenchen.de |
Telephone |
+49-89-2180-2758 |
Fax |
+49-89-2180-5790 |
Corpus Version |
2014-02-28 2.0 |
Date |
2014 |
Status |
finished |
Comment |
Aside from minor missing information, the corpus is in good to very good shape for scientific and technical usage. |
Validation Guidelines |
Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus FORMTASK has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. The corpus is complete and in good condition. No major problems have been identified.
Aside from minor missing information, the corpus is in good to very good shape for scientific and technical usage.
This document summarizes the results of an in-house validation of the speech corpus FORMTASK made in the year 2014 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus was created in 2003 in collaboration at the same institute.
FORMTASK is a collection of more than 17000 spontaneous speech items recorded as extra material during the German SpeechDat-II project. Speakers were asked to describe typical forms found in everyday life, e.g. public transport tickets or invoices. To this end, they were to answer the following questions: 1) What type of form is this? 2) What date is on the form? 3) What amount is on the form? 4) Where is the amount written on the form? The forms were presented in black and white on paper.
The FORMTASK speech database consists of 4366 recording sessions with a total of 17293 recorded audio files from telephone calls in a format compatible with the SpeechDat(II) database exchange format as defined in the SpeechDat deliverable SD 1.3.1 V.4.3.
The General Documentation directory contains the following documentation files for the FORMTASK corpus which can be found under:
1.) the main directory
README |
File describing the database and its structure |
DISK.ID |
Directory ID for OS |
COPYRIGH.TXT |
Copyright text |
2.) the directory /doc:
SAMPSTAT.TXT |
SNR values |
SD131V43.DOC |
Database Exchange Format Specification |
SD131V43_{1..7}.PDF |
Database Exchange Format Specification in PDF |
SD132V24.{DOC|PDF} |
Orthographic and Transcription Conventions |
SNR.TXT |
Description of SNR computation by SPEX |
TRANSCRP.PDF |
The validation and transcription handbook |
3.) the directory /table:
FORMTASK.TBL |
Tab delimited text file with summary information |
LEXICON.TBL |
Pronunciation dictionary |
SPEAKER.TBL |
Speaker information file |
SESSION.TBL |
Session information file |
4.) the directory /index:
CONTENT.LST |
Content of the database |
5.) the directory /images:
BERLINITICKET.PDF |
Berlin public transport ticket |
BUERKLIN.PDF |
Invoice |
KNOLLEBW.PDF |
Austrian parking ticket |
QUITTUNG.PDF |
Newsstand receipt |
UEBERWEISUNG.PDF |
Money transfer |
· Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok
Number and type of media: n.a.
Content of each medium: ok (file tree of master)
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media: Information about file system type and directory structure: n.a.
File nomenclature: Explanation of used codes (no white space in file names!): <a1><4 digit session number><y[6-9]><.dea | .deo> ok
Formats of signals and annotation files: If non standard formats are used it is common to give a full description or to convert into a standard format: ok, raw signal file
Coding: ALAW, ok
Compression: n.a.
Sampling rate: 8 kHz, ok
Valid bits per sample: (values other than 8, 16, 24 should be reported): 8 bit, ok
Used bytes per sample: 1 byte/sample, ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) single channel, signals were not changed after recording ok
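The signal format listed above (raw, headerless A-law, 8 kHz, one byte per sample, single channel) can be read without special tools. The following is a minimal sketch, assuming the signal files are plain G.711 A-law as stated; the function names are illustrative and not part of the corpus or its tools.

```python
SAMPLE_RATE_HZ = 8000  # sampling rate as stated in the technical information


def alaw_to_linear(code: int) -> int:
    """Decode one G.711 A-law byte to a signed linear sample (16-bit range)."""
    code ^= 0x55                       # undo the even-bit inversion of A-law
    mantissa = (code & 0x0F) << 4      # 4-bit quantisation step
    segment = (code & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law a set sign bit denotes a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(raw: bytes) -> list[int]:
    """Decode a raw A-law byte string into a list of linear samples."""
    return [alaw_to_linear(b) for b in raw]


def duration_seconds(raw: bytes) -> float:
    """Duration of a headerless single-channel signal, 1 byte per sample."""
    return len(raw) / SAMPLE_RATE_HZ
```

Because the files carry no header, the duration follows directly from the file size: one second of speech corresponds to 8000 bytes.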
· Database contents:
Clearly stated purpose of the recordings: ok
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) description of images/forms -> semi-spontaneous speech ok
Instruction to speakers in full copy: no
· Linguistic contents of prompted speech:
Specifications of the individual text items: spontaneous speech, ok
Specification for the prompt sheet design or specification of the design of the speech prompts: prompt sheet design, ok
Example prompt sheet or example sound file from the speech prompting: missing
· Linguistic contents of non-prompted speech:
Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) n.a.
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.
Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.
· Speaker information:
Speaker recruitment strategies: ok (detailed information in the README-file)
Number of speakers: ok, 4366 speakers
Distribution of speakers over sex, age, dialect regions: ok
Description/definition of dialect regions: ok
· Recording platform and recording conditions:
Recording platform: ISDN interface using Aculab Dialogics hardware and proprietary software
Position and type of microphones: no information
- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.
Position of speakers: (distance to microphone) no information
Bandwidth: (if other than zero to half of sampling rate) ISDN bandwidth 300-3300Hz ok
Number of channels and channel separation: 1 channel ok
Acoustical environment: 3 types (home, office, telephone booth)
Recording hardware, telephone link (analog, digital): ISDN interface using Aculab Dialogics hardware and proprietary software
Network from where the call originated: no information
Type of handset: ok
· Annotation (SAM label file):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): n.a. (The annotation of the slang words was made according to their canonical form in the dictionary. Sometimes the slang words were annotated as mispronounced words.)
Distinction of homographs which are no homophones: n.a.
Character set used in annotations: ok (German SAMPA)
Any other language dependent information as abbreviations etc: n.a.
Annotation manual, guidelines, instructions: ok (all recordings were transcribed according to the SpeechDat-II guidelines using the WWW-Transcribe Software, DOC/TRANSCRP.PDF)
Description of quality assurance procedures: not given
Selection of annotators: native speakers of German
Training of annotators: trained with material from SpeechDat-M and -II recordings
Annotation tools used: WWWTranscribe software
· Lexicon:
Format: simple ASCII list with two columns ok
Text-to-phoneme procedure: manual transcription by experienced phonetician ok
Explanation or reference to the phoneme set: German SAM-PA ok
Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.
· Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): not given
Word frequency table: not given
· Others:
Any other essential language-dependent information or convention: n.a.
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given
Status of documentation: acceptable
The following list contains all validation steps with the methodology and results.
Completeness of signal files: ok
Completeness of meta data files: ok
Completeness of annotation files: ok
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents: ok
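The file-name and completeness checks listed above could be automated along the following lines. This is a sketch only: the regular expression is one reading of the nomenclature <a1><4 digit session number><y[6-9]><.dea | .deo> given under Technical information, the helper name `summarize` is hypothetical, and the scripts actually used for the validation are not reproduced here.

```python
import re
from collections import defaultdict

# Assumed reading of the nomenclature: "a1", four digits, "y" plus one of 6-9,
# then the extension ".dea" or ".deo" (matched case-insensitively).
NAME_RE = re.compile(r"^a1(\d{4})y([6-9])\.de([ao])$", re.IGNORECASE)


def summarize(names):
    """Return (bad_names, unpaired_items, short_sessions) for a file list.

    bad_names      -- names that do not match the nomenclature
    unpaired_items -- (session, item) pairs lacking either .dea or .deo
    short_sessions -- sessions with fewer than four distinct items
    """
    bad_names = []
    extensions = defaultdict(set)
    for name in names:
        match = NAME_RE.match(name)
        if match is None:
            bad_names.append(name)
            continue
        session, item, ext = match.group(1), match.group(2), match.group(3).lower()
        extensions[(session, item)].add(ext)

    unpaired_items = sorted(k for k, exts in extensions.items() if exts != {"a", "o"})

    items_per_session = defaultdict(set)
    for session, item in extensions:
        items_per_session[session].add(item)
    short_sessions = sorted(s for s, items in items_per_session.items() if len(items) < 4)
    return bad_names, unpaired_items, short_sessions
```

Applied to a directory listing, such a check reports names with white space or unexpected item codes, items for which one of the two files is missing, and sessions that do not contain all four items.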
About two percent (~2.2%; ~376 files) of the audio files and their SAM-files were checked manually. A few mistakes on the orthographic level were noticed in 12 SAM files (e.g. wrong annotation, wrong or too many noise markers). Also, not every session contains four files (two files).
Some folders and files are missing. SPEAKER.TBL, SESSION.TBL and LEXICON.TBL, which should be found in the INDEX-directory, and CONTENTS.LST from the INDEX-directory are not available. No relevant information about CD-ROMs can be found (no DISK.ID). There are also a few inconsistencies in the README. It says that there is no information about the telephone handsets used (paragraph Speaker), but that information is given by the label PHM in the meta data files. Furthermore, the data structure described in the README differs from the structure in the master: the paragraph “CD-ROM-Structure” shows a different layout (in the master the BLOCK-directory is not in the DATA-directory). *.DEO-files are encoded in ASCII and not in ISO-8859-1. For testing purposes a directory has been created using FORMTASK_utf8.tab (see /source). This showed that there must originally have been 17297 recordings; four of them were unusable and were not kept in the corpus, nor were their label files. In the SAM-files as well as in CONTENTS.LST the path (\FIXED1DE\) is wrong: it points to a non-existent directory. This error was corrected manually.
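The encoding finding for the *.DEO-files can be reproduced with a simple byte-level test. The sketch below is an illustration under stated assumptions, not the procedure used in the validation; the function name `classify_encoding` is hypothetical. Note that pure ASCII is a subset of ISO-8859-1, so the test can only show whether any non-ASCII bytes (e.g. German umlauts) occur in a file.

```python
def classify_encoding(data: bytes) -> str:
    """Roughly classify the encoding of a label file from its raw bytes."""
    if all(byte < 0x80 for byte in data):
        return "ascii"                      # also valid ISO-8859-1 and UTF-8
    try:
        data.decode("utf-8")                # strict decoding; raises on invalid input
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit (possibly ISO-8859-1)"
```

A file classified as "ascii" contains no umlauts or other characters outside the 7-bit range, which matches the observation reported above.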
After building the missing table files (see IV.) with the given scripts, a few adjustments to the README are also necessary. As mentioned above, the data structure in the corpus differs from the one described in the original README. Furthermore, it is suggested to add a note about information such as which files were deleted after the recording sessions or after being reviewed for the corpus.
Aside from minor missing information, the corpus is in good to very good shape for scientific and technical usage.