Revalidation Report for the PhonDat2 Database






Authors: Florian Schiel, Angela Baumann, Katerina Louka
Affiliation: BAS Bayerisches Archiv für Sprachsignale, Institut für Phonetik, Universität München
Postal address: Schellingstr. 3, D 80799 München
E-Mail: bas@phonetik.uni-muenchen.de
Telephone: +49-89-2180-2758
Fax: +49-89-2800362
Corpus Version: 2.5
Date: 25/06/2003
Status: final
Comment: The following validation results show a lack of essential information in documentation and annotation. The usability of the corpus for other applications may be reduced.
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, <Verlag>, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation Results of the PhonDat2 Corpus

Summary

The speech corpus of PhonDat2 has been validated against general principles of good practice. The
validation covered completeness, formal checks and manual checks of selected subsamples.
The missing data reduces the usability of the corpus, and problems may occur when using the
corpus for other applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus PhonDat2,
carried out in 2003 within the project 'BITS' by the Institute of Phonetics of the
Ludwig-Maximilians-University Munich. The corpus was produced at three different
sites in Germany, namely the University of Kiel, the University of Bonn and the University of
Munich. The language of the corpus is German. Recordings were made for a train inquiry system
and were carefully monitored by a supervisor in a quiet studio environment. Speakers read
prompts from a screen, which were formulated the way speakers would address an automatic dialog
system. The corpus contains the read speech of 16 different speakers, each of whom read
a set of 200 different sentences. All utterances were recorded with a variety of Sennheiser
microphones. Annotations were made for word segmentation (s0-files), phonological hand
segmentation (s1-files), automatic time alignment (s2-files) and prosodic labeling (pr-files).

I.) Validation of Documentation


The General Documentation directory contains the following documentation files for the PD2
corpus, which can be found under: docu/...

README: documentation
PD2_sprk.txt: speaker information
PD2_ort.txt: orthography of the corpus
PD2_can.txt: canonical forms of the corpus
phondat.doc: documentation about the file format PhonDat
seg_conv.txt: handbook for hand segmentation (only in German)
ext_sampa.txt: table containing the extended SAM-PA for German

The following required contents of the documentation have been checked:

- Administrative Information:

Validating person:
n.a.

Date of validation: n.a.

Contact for requests regarding the corpus: ok.

Number and type of media: 1 volume on CDROM. ok.

Content of each medium: 3200 recorded utterances on 1 volume, total size 493 MB. ok.

Copyright statement and intellectual property rights (IPR): ok.


- Technical information:

Layout of media: information about file system type and directory structure:
CDROM with High Sierra File System (HSFS) or ISO 9660 format. ok.
Root directory structure not given: repairable.

File nomenclature: explanation of the codes used (no white space in file names!):
<speaker_id><recording site id><sentence #><# of repetition>.<ext>. ok.

Formats of signals and annotation files (if non-standard formats are used, a full
description should be given or the data converted into a standard format):
the signal file format is PhonDat2 (see phondat.doc). This information would be better
placed in the README: repairable
The formats for annotation are s0 (word segmentation), s1 (phonological hand
segmentation), s2 (automatic time alignment) and pr (prosodic labeling).
The partitur files can be found under the subdirectory
docu/pardoc/. ok.

Coding: PCM. The coding is given in phondat.doc. This information would be better placed
in the README: repairable
PhonDat2, s0, s1, s2 and pr are proprietary file formats and cannot be considered standard
formats. However, all information in the above labelling files is also included in the BPF
files, which can be considered a standard format: ok.
The signal file format PhonDat2 is not a standard format and should be converted into
NIST: repairable

Compression (only widely supported compression schemes such as zip or gzip should be used): not used. ok.

Sampling rate: 16 kHz. The speech data were digitally filtered with a cutoff
frequency of 8 kHz and then downsampled to 16 kHz. This information is given in phondat.doc.
It would be better placed in the README: repairable

Valid bits per sample: 12 bit. The README wrongly states 16 bit, while the signal files
give 12 bit: repairable

Used bytes per sample: 2. ok.

Multiplexed signals: n.a.
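
The discrepancy between the 16 bit stated in the README and the 12 valid bits can be checked on the sample values themselves. The following is a minimal sketch only. It assumes that the samples have already been read as signed Python integers and that the 12 valid bits are stored right-aligned in the 2-byte word; neither assumption is confirmed by this report, and phondat.doc remains the reference for the actual layout. The function name is illustrative.

```python
def min_signed_bits(samples):
    """Smallest two's-complement word length that can hold every sample."""
    bits = 1
    for s in samples:
        # A non-negative value s needs s.bit_length() bits plus a sign bit;
        # a negative value s needs (-s - 1).bit_length() bits plus a sign bit.
        magnitude = s if s >= 0 else -s - 1
        bits = max(bits, magnitude.bit_length() + 1)
    return bits


# Hypothetical sample values, not taken from the corpus.
print(min_signed_bits([0, 2047, -2048]))  # stays within 12 bits
print(min_signed_bits([0, 2048]))         # exceeds 12 bits
```

If the value returned for a signal file is larger than 12, either the right-alignment assumption or the stated number of valid bits does not hold for that file.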


- Database contents:

Clearly stated purpose of the recordings: train query. ok.
                                                                 
Speech type(s): read sentences. Prompting method not given: repairable

Instruction to speakers in full copy: only a verbal instruction: 'read carefully but fluently as in a
real-life train inquiry to an automatic dialog system'. ok.


- Linguistic contents of prompted speech:

Specifications of the individual text items: (see PD2_ort.txt). This file contains all text
prompts that have been spoken, together with the utterance id. Coding of umlauts not given: repairable

Specification for the prompt sheet design or specification of the design of the speech prompts:
screen-prompted, read from a train query task. No specification of prompt design given. Not ok.

Example prompt sheet or example sound file from the speech prompting: n.a.

- Speaker information:

Speaker recruitment strategies: no information. Not ok.

Number of speakers: 16 speakers (6f/10m). Number of male and female speakers only indirectly given
in PD2_sprk.txt: repairable

Distribution of speakers over sex, age, dialect regions: no distribution over age and dialects.
Only a simple classification into 'old' (A) and 'young' (J) and the recording site (A=Kiel, N=Bonn,
D=Munich) is given (see PD2_sprk.txt). Not ok.

Description/definition of dialect regions: not given. Not ok.


- Recording platform and recording conditions:
       
Recording platform: not given. Not ok.

Position and type of microphones: various Sennheiser microphones, e.g. MKH 20 P48.
No further specification given. Not ok.

   - Electret, dynamic, condenser: not given. Not ok.
   - Directional properties: not given. Not ok.
   - Mounting: not given. Not ok.

Position of speakers: distance to microphone not given. Not ok.

Bandwidth of microphones:  not given. Not ok.

Number of channels and channel separation: n.a.

Acoustical environment: quiet studio conditions, but no further information about the individual
conditions in the three recording rooms. Not ok.


- Annotation 1 (s1-files 'Phonological hand segmentation'):

Unambiguous spelling standard used in annotations: not given. Not ok.
           
Labeling symbols: see seg_conv.txt (only in German). Acceptable.
Further markers are: '#c' (beginning of the first word), '#p' (pause), '#v' (mispronunciation),
'$' (ordinary segment), '##' (word boundary segment), '$#' (compound boundary segment),
'#.' or '#,' or '#?' or '#!' (punctuation). ok.

List of non-standard spellings: not given. Not ok.

Character set used in all annotations: not given: repairable

Any other language-dependent information such as abbreviations etc.: not given. ok.

Annotation manual, guidelines, instructions: instruction and guideline for annotators (see
seg_conv.txt). Only in German. Acceptable.

Description of quality assurance procedures: not given. Not ok.

Selection of annotators: experienced students of Phonetics. Names of transcribers not given. Not ok.

Training of annotators: annotators were already 'experienced'. ok.

Annotation tools used: not given. Not ok.


- Annotation 2 (s0-files 'Word segmentation'):
           
Labeling symbols: '#c' and '.' (for beginning and end of utterance).

Annotation manual, guidelines, instructions: not given. Not ok.

Description of quality assurance procedures: not given. Not ok.

Selection of annotators: presumably the same annotators as for annotation 1.
Not ok.

Annotation tools used: not given. Not ok.


- Annotation 3 (s2-files 'Automatic time alignment'):

Description of quality assurance procedures: not given. Not ok.

Annotation tools used: Viterbi alignment using standard HMM
techniques and some rule-based postprocessing with the software
segment-1.4. ok.


- Annotation 4 (pr-files 'Prosodic labeling'):
           
Labeling symbols: 'PA' (main accent of intonation phrase), 'NA' (additional secondary accent),
'B3' (intonational phrase boundary). Neither tones nor 'B2' or 'B9' were labelled. ok.

Annotation manual, guidelines, instructions: not given. Not ok.

Description of quality assurance procedures: not given. Not ok.

Selection of annotators: 5 students of the University of Braunschweig. No names given. Not ok.

Annotation tools used: not given. Not ok.


- Lexicon:

Format: 7-bit ASCII. Described in PD2_lex.txt, but only in German: repairable

Text-to-phoneme procedure: described in PD2_lex.txt, but only in German: repairable

Explanation or reference to the phoneme set: Extended German SAM-PA. ok.

Phonological or higher order phenomena accounted for in the phonemic transcriptions: described in
PD2_lex.txt, but only in German: repairable
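
The 7-bit ASCII format stated for the lexicon, and the open question of how umlauts are coded in the text files, can be checked with a simple byte scan. The following is a minimal sketch; the function name and the sample content are illustrative and not taken from the corpus.

```python
def non_ascii_positions(data: bytes):
    """Return (line number, byte value) for every byte outside 7-bit ASCII."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for byte in line:
            if byte > 0x7F:
                hits.append((lineno, byte))
    return hits


# Hypothetical content: an umlaut stored as a Latin-1 byte violates 7-bit ASCII.
sample = "Zug\nM\u00fcnchen\n".encode("latin-1")
print(non_ascii_positions(sample))       # one hit on line 2
print(non_ascii_positions(b"Muenchen"))  # no hits
```

An empty result for a file read in binary mode means that the file is pure 7-bit ASCII, so any umlauts in it must be represented by an ASCII transliteration.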


- Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): not given. ok.

Word frequency table: word counts in lexicon. Acceptable.


- Others:

Any other essential language-dependent information or convention: not given. ok.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given. Not ok.


Status documentation: not acceptable.
Items marked as 'repairable' in this report have been repaired in docu/README.

II.) Automatic Validation

The following list contains all validation steps with the methodology and results.

Completeness of signal, annotation and meta data files:

All signal files of one speaker are collected in a subdirectory named after the speaker id.

The word segmentation directories of the Munich and Kiel speakers contain more than the utterances 530-600 of the recordings.

Known errors, not fixed:
  1. tpon7040.s1 missing
  2. bmon5150.s1 missing
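
A completeness check of this kind can be sketched as follows. This is an illustration only, not the procedure used for this report. It assumes that every signal file is expected to have all four annotation types, which the report does not state; the `exts` parameter can be narrowed where that does not hold. The function name and the second stem in the example are hypothetical.

```python
ANNOTATION_EXTS = ("s0", "s1", "s2", "pr")


def missing_annotations(signal_stems, annotation_names, exts=ANNOTATION_EXTS):
    """List the expected annotation files that are absent.

    signal_stems:     base names of the signal files (without extension)
    annotation_names: names of the annotation files actually present
    """
    present = set(annotation_names)
    return sorted(
        f"{stem}.{ext}"
        for stem in signal_stems
        for ext in exts
        if f"{stem}.{ext}" not in present
    )


# Hypothetical directory listing in which one s1 file is absent.
stems = ["tpon7040", "tpon7050"]
found = [f"{s}.{e}" for s in stems for e in ANNOTATION_EXTS]
found.remove("tpon7040.s1")
print(missing_annotations(stems, found))
```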

Correctness of file names:
ok
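
A file name check against the nomenclature <speaker_id><recording site id><sentence #><# of repetition>.<ext> can be sketched with a regular expression. The field widths below (three letters, one letter, three digits, one digit) are an assumption derived only from the two file names listed above; the report does not specify them, and the function name is illustrative.

```python
import re

# Assumed field widths (not given in the report): 3-letter speaker id,
# 1-letter recording site id, 3-digit sentence number, 1-digit repetition.
NAME_RE = re.compile(
    r"(?P<speaker>[a-z]{3})(?P<site>[a-z])(?P<sentence>\d{3})(?P<repetition>\d)"
    r"\.(?P<ext>[a-z0-9]+)"
)


def parse_name(filename):
    """Split a file name into its fields, or return None if it does not conform."""
    match = NAME_RE.fullmatch(filename)
    return match.groupdict() if match else None


print(parse_name("tpon7040.s1"))
print(parse_name("tpon 7040.s1"))  # white space is not allowed, so no match
```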


Empty files: none
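
A check for empty files can be sketched as a walk over the corpus directory tree. This is a minimal illustration; the function name is hypothetical, and the root directory to pass in depends on where the medium is mounted.

```python
import os


def empty_files(root):
    """Return the relative paths of all zero-byte files below root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                found.append(os.path.relpath(path, root))
    return sorted(found)
```

An empty list corresponds to the result 'none' reported above.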


Annotation and Lexicon Contents: ok

       

III.) Manual Validation 

10% of the data and annotation files were checked against each other. 1.25% of the data contained errors.
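
Drawing a reproducible 10% subsample and computing an error percentage can be sketched as follows. The report does not describe how the subsample was selected, so the random selection, the seed and the function names are assumptions made for illustration. The report also does not state whether the 1.25% refers to the checked subsample or to the whole corpus; the last line only shows what the figure would mean if it referred to a subsample of 320 utterances.

```python
import random


def draw_subsample(names, fraction=0.10, seed=0):
    """Pick a reproducible random subsample of the given file names."""
    k = round(len(names) * fraction)
    return sorted(random.Random(seed).sample(sorted(names), k))


def error_rate(checked, erroneous):
    """Percentage of checked items found to contain errors."""
    return 100.0 * erroneous / checked


# Hypothetical utterance names standing in for the 3200 recordings.
names = [f"utt{i:04d}" for i in range(3200)]
print(len(draw_subsample(names)))  # size of a 10% subsample
print(error_rate(320, 4))          # 4 erroneous items out of 320 checked
```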

IV.) Other Relevant Observations

V.) Comments for Improvement

The revalidation was able to repair some data, but the most important information, such as the
specification of the microphones used and the recording platform, could not be found and therefore
could not be repaired. To avoid such serious omissions in future, producers of speech corpora should
first consider which information they tend to take for granted and are therefore likely to forget
to report. Such unreported information sometimes makes it impossible for the
user to work with the corpus for the desired application and, in the long
run, makes the corpus nearly worthless for other applications. Before producing a speech corpus, a
guideline such as the book 'The Production and Validation of Speech Corpora' can help to identify
all the information that needs to be documented.

VI.) Result

The results show a lack of essential information in documentation and annotation. Some data could
be repaired, but the most relevant data, such as the specification of the microphones used, the
recording platform and the exact acoustical environment, could not be repaired. These gaps reduce
the usability of the corpus for other applications.