Authors | Florian Schiel, Katerina Louka
Affiliation | BAS Bayerisches Archiv für Sprachsignale
Postal address | Schellingstr. 3
 | schiel@phonetik.uni-muenchen.de
Telephone | +49-89-2180-2758
Fax | +49-89-2800362
Corpus Version | 2.1
Date | 18.06.2003
Status | final
Comment |
Validation Guidelines | Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook
The speech corpus TAXI has been validated against general principles of good practice. The
validation covered completeness, formal checks and manual checks of selected subsamples.
Note that missing data reduces the usability of the corpus and may cause problems when
the corpus is used in further applications.
This document summarizes
the results of an in-house validation of the speech corpus TAXI made in
the
year 2004 within the project 'BITS' by the
TAXI contains 94 recorded dialogues between a cab dispatcher and a client, recorded over
public phone lines (network and GSM). The dispatcher always spoke German, while the
clients always spoke English.
The corpus contains 94 recordings. Each recording contains a pre-dialogue part, the turns
of the dispatcher and the client, and a hang-up part that runs from the last turn of the
speakers until both parties have hung up. Note that the existence of a recording does not
imply that the recording was made without errors.
The general documentation for the TAXI corpus consists of the following files, which can
be found under doc/ (the README file is located in the main directory):
README | general documentation
WORDS_DE.TXT | Word list, German part
WORDS_EN.TXT | Word list, English part
PRON_DE.LEX | Pronunciation dictionary, extended German SAM-PA, German part (spoken sentences AND translations)
SAMPA.txt | table of extended German SAMPA
PRON_EN.LEX | Pronunciation dictionary, English SAM-PA, English part (spoken sentences AND translations)
TRANS.TXT | Results of validation, transcription and translation
TRANSCRP.PDF | description of rules and conventions of SpeechDat transcription (German)
TRANSCRP_EN.PDF | description of rules and conventions of SpeechDat transcription (English)
BasFormatseng.html | Description of the BAS Partitur Format (see www.bas.uni-muenchen.de/Bas/BasFormatseng.html for an updated version of this document)
softw/ | some tools that are distributed with the BAS corpora
· Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok
Number and type of media: CDROM ok
Content of each medium: no information
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media: Information about file system type and directory structure: CDROM, ASCII; German umlauts are coded in ISO-8859 ok
File nomenclature: Explanation of used codes (no white space in file names!): <4 digit recording number><Speaker marker><Turn number starting with '00'> ok
Formats of signals and annotation files: If non-standard formats are used it is common to give a full description or to convert into a standard format: ok
- NIST file PCM coding
- NIST file ALAW coding
Coding: PCM, ALAW
Compression: n. a.
Sampling rate: 8 kHz ok
Valid bits per sample: (others than 8, 16, 24 should be reported): ALAW coding: bits/samp, PCM coding: 16 bit ok
Used bytes per sample: 2 bytes/samp ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
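The file nomenclature above can be checked mechanically. The following is only a minimal sketch: the report does not list the speaker markers or the width of the turn number, so a single-letter marker and a two-digit turn number are assumptions, and the function name `parse_turn_name` is hypothetical. The check is applied to the file name without its extension.

```python
import re

# Assumed pattern (not confirmed by the corpus documentation):
#   4-digit recording number, one-letter speaker marker, 2-digit turn number.
TURN_NAME = re.compile(r"(?P<recording>\d{4})(?P<speaker>[A-Za-z])(?P<turn>\d{2})")


def parse_turn_name(stem):
    """Return (recording, speaker, turn) or None if the stem does not fit the pattern."""
    match = TURN_NAME.fullmatch(stem)
    if match is None:
        return None
    return match["recording"], match["speaker"], int(match["turn"])


if __name__ == "__main__":
    # Hypothetical example names, not taken from the corpus.
    for stem in ["0049D00", "0049 D00", "49D00"]:
        print(stem, "->", parse_turn_name(stem))
```

Using `fullmatch` rejects names with white space or extra characters, which corresponds to the "no white space in file names" requirement.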
· Database contents:
Clearly stated purpose of the recordings: no information
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok
Instruction to speakers in full copy: The speakers had to follow specific conventions: to prevent overlap and to allow automatic segmentation by the recording server, each party had to press a button on their phone to signal to the other party that their turn was over.
· Linguistic contents of prompted speech:
Specifications of the individual text items: ok
Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.
Example prompt sheet or example sound file from the speech prompting: n.a.
· Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) ok
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) ok
Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.
· Speaker information:
Speaker recruitment strategies: No information, not important for this corpus
Number of speakers: 94 (no further information about the speakers) ok
Distribution of speakers over sex, age, dialect regions: not given
Description/definition of dialect regions: No information
· Recording platform and recording conditions:
Recording platform: ok
Position and type of microphones: The data was recorded over public phone lines (network and GSM)
- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.
Position of speakers: (distance to microphone) No information
Bandwidth: (if other than zero to half of sampling rate) ok
Number of channels and channel separation: mono
Acoustical environment: City of
· Annotation (BAS Partitur Format Files):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): given
Distinction of homographs which are not homophones: n.a.
Character set used in annotations: ok
Any other language-dependent information such as abbreviations etc.: given
Annotation manual, guidelines, instructions: ok (WORDS_DE.TXT, WORDS_EN.TXT, SAMPA.txt, TRANS.TXT, TRANSCRP.PDF, TRANSCRP_EN.PDF, BasFormatseng.html)
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: given
· Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or reference to the phoneme set: an indirect reference, ok (http://www.bas.uni-muenchen.de/Bas/BasFormatseng.html, SAMPA.txt)
Phonological or higher order phenomena accounted for in the phonemic transcriptions: ok
· Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables, ...): n.a.
Word frequency table: n.a.
· Others:
Any other essential language-dependent information or convention: given
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given
Status of documentation: acceptable
The following list contains all validation steps with the methodology and results.
Completeness of signal files: There are 640 usable and 247 garbage turns. The following sessions contain no usable turns:
0049
0053
0075
0083
0092
0094
0099
0127
Completeness of meta data files: ok
Completeness of annotation files: ok
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents: In the lexicons, unintelligible words (marked "**") are not annotated.
For 10% of the 'usable' data, the BAS files and the file TRANS.TXT were compared with each other. 9.2% of the data contained errors. Most errors were found in the clients' turns. Some words and pauses were not annotated or were annotated incorrectly.
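To put the completeness figures into proportion, the counts reported above can be combined in a short calculation. This is only an illustrative sketch: the variable names are made up for the example, and it assumes that the eight sessions without usable turns are among the 94 recordings of the corpus.

```python
# Figures taken from the validation steps above.
usable_turns = 640
garbage_turns = 247
recordings = 94  # assumption: the sessions listed below are counted among these
sessions_without_usable_turns = [
    "0049", "0053", "0075", "0083", "0092", "0094", "0099", "0127",
]

total_turns = usable_turns + garbage_turns
usable_share = usable_turns / total_turns
empty_session_share = len(sessions_without_usable_turns) / recordings

print(f"total turns: {total_turns}")
print(f"usable share of turns: {usable_share:.1%}")
print(
    f"sessions without usable turns: {len(sessions_without_usable_turns)} "
    f"of {recordings} ({empty_session_share:.1%})"
)
```

Under these assumptions, the corpus holds 887 turns in total, of which about 72.2% are usable, and about 8.5% of the recordings contribute no usable turn.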
No information is given about the 'clients'.
There is no information about the speakers of this corpus. The revalidation was able to
repair some data (lexicon, README). The errors found in the manual validation could not
be repaired.
The corpus is ok. The corpus is well
documented and no data
or documentation files are missing.