Authors | Florian Schiel, Katerina Louka |
Affiliation | BAS Bayerisches Archiv für Sprachsignale |
Postal address | Schellingstr. 3 |
| schiel@phonetik.uni-muenchen.de |
Telephone | +49-89-2180-2758 |
Fax | +49-89-2800362 |
Corpus Version | 1.3 |
Date | 22.07.2004 |
Status | final |
Comment | |
Validation Guidelines | Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus HEMPEL has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Note that missing data reduces the usability of a corpus and can cause problems when the corpus is used in further applications.
This document summarizes the results of an in-house validation of the speech corpus HEMPEL made in the year 2004 within the project 'BITS' by the
The General Documentation directory contains the following documentation files for the HEMPEL corpus, which can be found under:
1.) the main directory
README | file describing the database structure |
DISK.ID | Volume identifier for ISO 9660 file systems |
COPYRIGH.TXT | Copyright text |
BASCORPO.PDF | LREC 2002 paper describing the corpus |
ISO88591.PDF | ISO8859-1 (ISO Latin) code table |
SAMPSTAT.TXT | SNR values |
SD131V43.DOC | Database Exchange Format Specification |
SD131V43_{1..7}.PDF | Database Exchange Format Specification in PDF |
SD132V24.{DOC|PDF} | Orthographic and Transcription Conventions |
SNR.TXT | Description of SNR computation by SPEX |
SUMMARY.TXT | German summary file |
TRANSCRP.{HTM|PDF} | the validation and transcription handbook |
LEXICON.TBL | pronunciation dictionary |
SPEAKER.TBL | speaker information file |
SESSION.TBL | session information file |
CONTENTS.LS | contents of the database |
· Administrative Information:
Validating person: n.a.
Date of validation: n.a.
Contact for requests regarding the corpus: ok
Number and type of media: 2 CD-ROMs, ok
Content of each medium: CD-ROM HEMPEL_01 contains sessions 1000-4499, CD-ROM HEMPEL_02 contains sessions 4500-5809
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media: Information about file system type and directory structure: 2 CD-ROMs (compatible with the SpeechDat II database)
File nomenclature: Explanation of used codes (no white space in file names!): <a1><4 digit recording number><x1><.dea|.deo>, ok
Formats of signals and annotation files: If non-standard formats are used it is common to give a full description or to convert into a standard format: ok, raw signal file
Coding: ALAW
Compression: n.a.
Sampling rate: 8 kHz, ok
Valid bits per sample: (values other than 8, 16, 24 should be reported): 8 bit, ok
Used bytes per sample: 1 byte/sample, ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
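The file nomenclature and signal format listed above can be illustrated with a minimal sketch. It assumes that `a1` and `x1` are literal parts of the name (as in the file names cited later in this report, e.g. a11844x1.deo) and that the signal files are headerless A-law streams. The decoder follows the standard G.711 A-law expansion and is not taken from the corpus documentation; the names `NAME_RE`, `parse_name`, `alaw_to_linear` and `duration_seconds` are hypothetical.

```python
import re

# Pattern from the report: <a1><4 digit recording number><x1><.dea|.deo>.
# Assumption: "a1" and "x1" are literal, as in "a11844x1.deo".
NAME_RE = re.compile(r"a1([0-9]{4})x1\.(dea|deo)", re.IGNORECASE)

SAMPLE_RATE_HZ = 8000   # 8 kHz, as stated in the report
BYTES_PER_SAMPLE = 1    # 8-bit A-law, 1 byte per sample


def parse_name(filename):
    """Return (recording_number, extension), or None if the name does not match."""
    m = NAME_RE.fullmatch(filename)
    if m is None:
        return None
    return int(m.group(1)), m.group(2).lower()


def alaw_to_linear(code):
    """Expand one G.711 A-law byte (0..255) to a signed linear sample."""
    code ^= 0x55                  # undo the even-bit inversion
    t = (code & 0x0F) << 4        # 4-bit mantissa
    seg = (code & 0x70) >> 4      # 3-bit segment number
    if seg == 0:
        t += 8
    else:
        t = (t + 0x108) << (seg - 1)
    return t if code & 0x80 else -t


def duration_seconds(num_bytes):
    """Duration of a headerless signal file of the given size in bytes."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
```

For example, `parse_name("a11844x1.deo")` yields `(1844, "deo")`, and a signal file of 16000 bytes corresponds to two seconds of speech.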
· Database contents:
Clearly stated purpose of the recordings: ok (BASCORPO.PDF)
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok
Instruction to speakers in full copy: n.a.
· Linguistic contents of prompted speech:
Specifications of the individual text items: spontaneous speech, ok
Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.
Example prompt sheet or example sound file from the speech prompting: n.a.
· Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) ok
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) ok
Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.
· Speaker information:
Speaker recruitment strategies: detailed information in the README file, ok
Number of speakers: 3909 (more information about the speakers in SPEAKER.TBL), ok
Distribution of speakers over sex, age, dialect regions: ok (SPEAKER.TBL)
Description/definition of dialect regions: ok (SPEAKER.TBL)
· Recording platform and recording conditions:
Recording platform: ok
Position and type of microphones: The data was recorded via a primary rate ISDN interface using Aculab Dialogics hardware and proprietary software.
- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.
Position of speakers: (distance to microphone) no information
Bandwidth: (if other than zero to half of sampling rate) no information
Number of channels and channel separation: 1 channel
Acoustical environment: home environment
· Annotation (SAM label file):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): given
Distinction of homographs which are not homophones: n.a.
Character set used in annotations: ok
Any other language-dependent information such as abbreviations etc.: given
Annotation manual, guidelines, instructions: ok (all recordings were transcribed according to the SpeechDat-II guidelines using the WWWTranscribe software, DOC/TRANSCRP.PDF)
Description of quality assurance procedures: not given
Selection of annotators: native speakers of German
Training of annotators: trained with material from SpeechDat-M and -II recordings
Annotation tools used: WWWTranscribe software
· Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or reference to the phoneme set: ok (/DOC/PRONCONV)
Phonological or higher order phenomena accounted for in the phonemic transcriptions: ok
· Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables, ...): n.a.
Word frequency table: ok
· Others:
Any other essential language-dependent information or convention: n.a.
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given
Status of documentation: acceptable
The following list contains all validation steps with the methodology and results.
Completeness of signal files: ok
Completeness of meta data files: ok
Completeness of annotation files: ok
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents: The following two words from the SAM files could not be found in the lexicon because they are misspelled:
1.) Pensonierungsvorgang (in the file "a11844x1.deo")
2.) Vetriebsleitertagung (in the file "a11427x1.deo")
For 5% of the 'usable' data, the BAS files and the SAM annotation files were compared manually. 11.79% of the checked data contained errors (23 errors out of 195): some words of the speaker were not annotated, or background noise was not marked.
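The lexicon cross-check and the error rate reported above can be reproduced in a few lines. This is a sketch only: it works on word lists that have already been extracted from the SAM files and from LEXICON.TBL (parsing is left out because neither format is described in this report). The function names are hypothetical, and the sample lexicon entries, including the presumed correct spellings, are illustrative assumptions.

```python
def missing_from_lexicon(transcribed_words, lexicon_words):
    """Return the transcribed words that have no lexicon entry, sorted."""
    lexicon = set(lexicon_words)
    return sorted(set(transcribed_words) - lexicon)


def error_rate_percent(errors, checked):
    """Share of checked items that contained errors, in percent."""
    if checked <= 0:
        raise ValueError("checked must be positive")
    return 100.0 * errors / checked


# Illustrative data (assumption): the lexicon holds the presumed correct
# spellings, the transcripts contain the two misspellings named in the report.
lexicon = ["Pensionierungsvorgang", "Vertriebsleitertagung", "Tagung"]
transcript = ["Pensonierungsvorgang", "Vetriebsleitertagung", "Tagung"]

print(missing_from_lexicon(transcript, lexicon))
print(f"{error_rate_percent(23, 195):.2f}%")   # 23 errors in 195 checked items
```

Run on the figures from this report, the second line prints 11.79%, which matches the stated error rate.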
The annotations were based on the DUDEN lexicon. As a result, colloquial speech elements were sometimes annotated as mispronounced words and sometimes exactly as they are entered in the dictionary.
The revalidation repaired some data (README, typos in the SAM files, LEXICON.TBL) and updated the annotation files. The errors found in the manual validation could not be repaired.
The corpus is ok. It is well documented, and no data or documentation files are missing.
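The completeness and empty-file checks reported above could be automated roughly as follows. The sketch assumes that every recording consists of one `.dea` and one `.deo` file sharing the same base name, and that base names are unique across the directory tree. The function name `check_pairs` and the returned structure are hypothetical and not part of the BAS validation tools.

```python
from pathlib import Path


def check_pairs(root):
    """Find recordings lacking a .dea/.deo partner and files of size zero."""
    by_stem = {}   # base name -> set of extensions found for it
    empty = []     # names of zero-length files
    for path in Path(root).rglob("*"):
        ext = path.suffix.lower()
        if not path.is_file() or ext not in (".dea", ".deo"):
            continue
        # Assumption: base names are unique, so the stem alone is the key.
        by_stem.setdefault(path.stem.lower(), set()).add(ext)
        if path.stat().st_size == 0:
            empty.append(path.name)
    unpaired = sorted(s for s, exts in by_stem.items() if len(exts) < 2)
    return {"unpaired": unpaired, "empty": sorted(empty)}
```

A result with two empty lists corresponds to the "ok" and "none" entries in the validation list.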