Authors | Florian Schiel, Katerina Louka
Affiliation | BAS Bayerisches Archiv für Sprachsignale
Postal address | Schellingstr. 3
 | schiel@phonetik.uni-muenchen.de
Telephone | +49-89-2180-2758
Fax | +49-89-2800362
Corpus Version |
Date | 13.09.2004
Status | final
Comment |
Validation Guidelines | Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook
The corpus contains read speech from 10 different speakers. Each speaker read approx. 1000 sentences from a German newspaper corpus, resulting in a total of approx. 10000 recorded utterances.
This document summarizes the results of an in-house validation of the speech corpus SI1000 made in the year 2004 within the project 'BITS' by the
The General Documentation directory contains the following documentation files for the SI1000 corpus, which can be found under doc/:
README | general documentation
SI1000id.lis | list of speakers for the total corpus
SI1000.txt | list of spoken utterances
phondat.doc | documentation about the file format PhonDat
SI1000.lex | pronunciation dictionary in Extended German SAM-PA
ext_sam.txt | description of Extended German SAM-PA
partitur/ | BAS Partitur Files
pardoc/ | BAS Partitur Files documentation
· Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok
Number and type of media: CDROMs
Content of each medium: no information, total size 2.2 GB of compressed data
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media: Information about file system type and directory structure:
The data of the speakers are collected in subdirectories named with the speaker id.
The volumes are stored on CDROMs with High Sierra File System (HSFS) or ISO 9660 format with RockRidge extensions.
File nomenclature: Explanation of used codes (no white space in file names!):
<speaker_id><sex>d<sentence # from corpus>.16 ok
Formats of signals and annotation files: If non-standard formats are used it is common to give a full description or to convert into a standard format: ok
- PhonDat 2 format signal file
Coding: .16 (PhonDat format signal file with 16 kHz sampling rate)
Compression: n. a.
Sampling rate: 16 kHz ok
Valid bits per sample: (other than 8, 16, 24 should be reported): 16 bit ok
Used bytes per sample: 2 bytes/sample ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
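The file nomenclature above can be illustrated with a small parser. This is a minimal sketch, not a tool shipped with the corpus: the function name parse_file_name is hypothetical, and the assumed field widths (two-letter speaker id, one-letter sex code, numeric sentence number) are inferred from names such as csmd101.shn mentioned in this report rather than from a formal specification.

```python
import re

# Assumed shape, inferred from names such as "csmd101.shn" in this report:
# two-letter speaker id, one-letter sex code, literal "d",
# sentence number, file extension.
FILE_NAME = re.compile(
    r"^(?P<speaker>[A-Za-z]{2})"
    r"(?P<sex>[A-Za-z])"
    r"d(?P<sentence>\d+)"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_file_name(name):
    """Split a signal file name into its parts; return None if it does not match."""
    match = FILE_NAME.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["speaker"] = parts["speaker"].upper()   # report lists speaker ids in upper case
    parts["sentence"] = int(parts["sentence"])    # "077" -> 77
    return parts


if __name__ == "__main__":
    print(parse_file_name("csmd101.shn"))
```

For csmd101.shn the sketch yields speaker CS, sex code m, sentence 101 and extension shn. Names containing white space are rejected, in line with the rule stated above.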
· Database contents:
Clearly stated purpose of the recordings: no information
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) read sentences ok
Instruction to speakers in full copy: no information
· Linguistic contents of prompted speech:
Specifications of the individual text items: Sentences from a German newspaper corpus
Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.
Example prompt sheet or example sound file from the speech prompting: n.a.
· Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) ok
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.
Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.
· Speaker information:
Speaker recruitment strategies: No information, not important for this corpus
Number of speakers: 10 (SI1000id.lis) ok
Distribution of speakers over sex, age, dialect regions: given (/doc/SI1000id.lis)
Description/definition of dialect regions: indirectly given (/doc/SI1000id.lis)
· Recording platform and recording conditions:
Recording platform: ok
Position and type of microphones:
- Company name and type id: Sennheiser Headset HMD 410
- Electret, dynamic, condenser: no information
- Directional properties: no information
- Mounting: no information
Position of speakers: (distance to microphone) Distance to mouth approx. 3-5 cm
Bandwidth: (if other than zero to half of sampling rate) ok
Number of channels and channel separation: no information
Acoustical environment: no information
· Annotation (BAS Partitur Format Files):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): given
Distinction of homographs which are no homophones: n.a.
Character set used in annotations: ok
Any other language dependent information as abbreviations etc: given
Annotation manual, guidelines, instructions: ok (The PAR documentation can be found in the files in PARDOC)
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: given
· Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or reference to the phoneme set: an indirect reference, ok. (ext_sam.txt is a description and mapping of Extended German SAM-PA)
Phonological or higher order phenomena accounted for in the phonemic transcriptions: ok
· Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.
Word frequency table: n.a.
· Others:
Any other essential language-dependent information or convention: given.
Indication of how many files were double-checked by the producer together with percentage of detected errors: known errors can be found in the README file.
Status of documentation: acceptable
The following list contains all validation steps with the methodology and results.
Completeness of signal files: The following files are missing:
Speaker ID | Missing Files
BJ | 077, 079, 086
BK | 077, 079, 086
BW | 077, 079, 086, 201, 1000
CK | 077, 079, 086
CS | 077, 079, 086
HT | 077, 079, 086, 101, 401, 501, 601
MH | 077, 079, 086, 601
PG | 077, 079, 086, 601
SH | 077, 079, 086, 101, 201, 601, 801
TA | 001, 077, 079, 086
Superfluous files: csmd101.shn, csmd401.shn, csmd701.shn
Completeness of meta data files: ok
Completeness of annotation files: ok.
Correctness of file names: ok.
Empty files / corrupt files: The files "101, 401, 701" of the speaker with the id "CS".
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents:
All /R/ phonemes were changed to /r/ to conform to the BAS Guidelines.
Approximately 4% of the data and the corresponding PAR files were checked against each other. The error rate was very low. The errors were misspoken words and unspoken punctuation marks.
Pauses, hesitations, and breathing are not annotated.
The errors found in the manual validation could not be repaired.
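The completeness check of the signal files listed above can be sketched as follows. The sketch assumes that sentence numbers run from 1 to 1000 (suggested by the missing-file entries 001 and 1000, but not stated explicitly in this report) and that signal files carry the extension .16. The function names are hypothetical and do not describe the tooling actually used for this validation.

```python
import re
from pathlib import Path

# Assumed signal file name shape: <speaker_id><sex>d<sentence #>.16,
# with a two-letter speaker id and a one-letter sex code.
SIGNAL_NAME = re.compile(r"^[A-Za-z]{2}[A-Za-z]d(\d+)\.16$")


def missing_sentences(file_names, expected=range(1, 1001)):
    """Return the expected sentence numbers that have no signal file."""
    present = set()
    for name in file_names:
        match = SIGNAL_NAME.match(name)
        if match:
            present.add(int(match.group(1)))
    return [number for number in expected if number not in present]


def check_speaker_dir(directory):
    """Run the check on one speaker subdirectory (hypothetical layout)."""
    return missing_sentences(path.name for path in Path(directory).iterdir())
```

Files with other extensions, such as the .shn files reported as superfluous, are ignored by the pattern and therefore do not count as present.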
The corpus is ok. The corpus, its history and its known errors are well documented and no data or documentation files are missing.