Authors: Florian Schiel, Katerina Louka
Affiliation: BAS Bayerisches Archiv für Sprachsignale
Postal address: Schellingstr. 3
E-mail: schiel@phonetik.uni-muenchen.de
Telephone: +49-89-2180-2758
Fax: +49-89-2800362
Corpus Version: 1.2
Date: 07.07.2004
Status: final
Comment:
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook
The speech corpus SC10 has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Missing data reduces the usability of the corpus and may cause problems when the corpus is used for other applications.
This document summarizes the results of an in-house validation of the speech corpus SC10 carried out in 2004 within the project 'BITS'.
The general documentation directory doc/ contains the following documentation files for the SC10 corpus:
README: general documentation
type_a.txt: list of type a sentences
type_b.txt: list of type b utterances
type_c.txt: list of type c story
type_f.txt: fairy tale for the re-telling type of speech
pardoc: documentation about the BAS Partitur File (BPF)
IPAChart96.pdf: IPA chart
IPANumberChart96.pdf: IPA chart with IPA numbers
SC10_lex.txt: canonical pronunciation dictionary
SC10_list.txt: word list
trllex_d.ps: documentation of the Verbmobil II transliteration format
· Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok
Number and type of media: 2 CDROM, ok
Content of each medium: total size 736 MB, ok
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media (information about file system type and directory structure): CDROM with ISO9660 format, ok
File nomenclature (explanation of used codes; no white space in file names!): <mother tongue><sex><speaker number within L1><type of speech a-f><number utterance>, ok
Formats of signals and annotation files (if non-standard formats are used it is common to give a full description or to convert into a standard format): the signal file format is NIST SPHERE, ok
Coding: PCM linear, ok
Compression: n. a.
Sampling rate: 16 kHz, ok
Valid bits per sample (values other than 8, 16, 24 should be reported): 16 bits/sample, ok
Used bytes per sample: 2 bytes/sample, ok
Multiplexed signals (exact de-multiplexing algorithm; tools): n. a.
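The technical parameters listed above lend themselves to an automatic formal check. The following is a minimal sketch, not the tool used in this validation. It assumes the usual NIST SPHERE layout (an ASCII header that starts with `NIST_1A`, lists fields as `name -type value` and ends with `end_head`), a header that fits into the first 1024 bytes, and the common field names `sample_rate`, `sample_n_bytes` and `sample_sig_bits`. The function names are hypothetical.

```python
# Sketch of a formal check of signal file headers (assumed NIST SPHERE layout).
# The expected values are the ones stated in the technical information above.
EXPECTED = {
    "sample_rate": 16000,    # 16 kHz
    "sample_n_bytes": 2,     # 2 bytes/sample
    "sample_sig_bits": 16,   # 16 valid bits/sample
}


def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header found at the start of a SPHERE file."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    # The first line is the magic string, the second the header size.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, field_type, value = parts
        if field_type == "-i":
            fields[name] = int(value)
        elif field_type == "-r":
            fields[name] = float(value)
        else:  # string fields, e.g. "-s3 pcm"
            fields[name] = value
    return fields


def check_header(fields: dict) -> list:
    """Return one message per field that deviates from EXPECTED."""
    problems = []
    for name, expected in EXPECTED.items():
        actual = fields.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected}, found {actual}")
    return problems


def check_file(path: str) -> list:
    """Check a single signal file; assumes the header fits in 1024 bytes."""
    with open(path, "rb") as handle:
        return check_header(parse_sphere_header(handle.read(1024)))
```

An empty list returned by `check_file` means that the header agrees with the documented sampling rate and sample width.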
· Database contents:
Clearly stated purpose of the recordings: This corpus may be used for several tasks. ok
Speech type(s) (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.): ok
Instruction to speakers in full copy: no information, not ok
· Linguistic contents of prompted speech:
Specifications of the individual text items: ok
Specification for the prompt sheet design or specification of the design of the speech prompts: not given, not ok
Example prompt sheet or example sound file from the speech prompting: not given, not ok
· Linguistic contents of non-prompted speech:
Multi-party (number of speakers, topics discussed, type of setting - formal/informal): ok
Human-human dialogues (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios): ok
Human-machine dialogues (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz): n.a.
· Speaker information:
Speaker recruitment strategies: no information, not important for this corpus
Number of speakers: 70 (48 female, 22 male), ok
Distribution of speakers over sex, age, dialect regions: indirectly given in the README file, not ok
Description/definition of dialect regions: no information
· Recording platform and recording conditions:
Recording platform: ok
Position and type of microphones:
- Company name and type id: Sennheiser Headset HMD 420-6
- Electret, dynamic, condenser: no information
- Directional properties: no information
- Mounting: indirect information, the speaker wears a headset
Position of speakers (distance to microphone): no information
Bandwidth (if other than zero to half of sampling rate): ok
Number of channels and channel separation: no information
Acoustical environment: studio quality, no echo-cancelled environment. The L2 speaker was recorded in an acoustically insulated room with eye contact to the L1 speaker (who was not recorded).
· Annotation (Transliteration):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): trllex_d.ps
Distinction of homographs that are not homophones: n.a.
Character set used in annotations: n.a.
Any other language-dependent information such as abbreviations etc.: n.a.
Annotation manual, guidelines, instructions: ok
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: not given
· Annotation (BAS Partitur Format Files):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): given
Distinction of homographs that are not homophones: n.a.
Character set used in annotations: ok
Any other language-dependent information such as abbreviations etc.: given
Annotation manual, guidelines, instructions: ok (pardoc, SC10_lex.txt, IPANumberChart96.pdf, IPAChart96.pdf; http://www.bas.uni-muenchen.de/Bas/BasFormatseng.html)
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: given
· Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or reference to the phoneme set: an indirect reference, ok (http://www.bas.uni-muenchen.de/Bas/BasFormatseng.html)
Phonological or higher order phenomena accounted for in the phonemic transcriptions: ok
· Statistical information:
Frequency of sub-word units (phonemes, diphones, triphones, syllables, ...): n.a.
Word frequency table: n.a.
· Others:
Any other essential language-dependent information or convention: given
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given
Status of documentation: acceptable
The following list contains
all validation steps with the methodology and results.
Completeness of signal files: not ok
- itm1e001.nis is missing
- rsm2a004.nis is missing
In type d, speaker arm1 spoke some sentences twice; these files have names that differ from those of the normal data files:
arm1dd02.nis
arm1dd03.nis
arm1dd04.nis
arm1dd05.nis
arm1dd06.nis
arm1dd07.nis
Completeness of meta data files: ok
Completeness of annotation files:
par files:
- all annotation files of speaker aew1 are missing
- rsm2f014.par is missing
- There are no annotation files of type a for speaker rsm2. Speaker rsm2 spoke the prompts of type a, but this is not noted in the README file.
- fiw1e029.par is missing
- There are no annotation files for type d.
trl files:
There are two files for jpw1e:
- jpw1e000.trl
- jpw1e001.trl
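A completeness check of this kind can be partly automated by comparing file stems across extensions. The sketch below is an illustration under stated assumptions, not the procedure actually used: the function name is hypothetical, and it only finds annotation files that are missing for existing signal files. Missing signal files (such as itm1e001.nis) would have to be detected against the prompt lists instead.

```python
from pathlib import PurePosixPath


def missing_annotations(file_names, signal_ext=".nis", annotation_ext=".par"):
    """Return the annotation file names expected for the signal files but absent.

    file_names: iterable of file names or paths from the corpus media.
    """
    signal_stems = set()
    annotation_stems = set()
    for name in file_names:
        path = PurePosixPath(name)
        if path.suffix == signal_ext:
            signal_stems.add(path.stem)
        elif path.suffix == annotation_ext:
            annotation_stems.add(path.stem)
    # Every signal file should have an annotation file with the same stem.
    return sorted(stem + annotation_ext for stem in signal_stems - annotation_stems)
```

In practice the input list could be collected with `os.walk` over the mounted media. Calling the function with `annotation_ext=".trl"` would run the same comparison for the transliteration files.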
Correctness of file names:
File names that deviate from the file nomenclature:
arm1dd02.nis
arm1dd03.nis
arm1dd04.nis
arm1dd05.nis
arm1dd06.nis
arm1dd07.nis
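Deviations like the ones listed above can be found with a pattern match against the file nomenclature. The pattern below is an assumption inferred from the file names quoted in this report (two-letter mother tongue code, sex as m or w, one-digit speaker number, speech type a-f, three-digit utterance number); the corpus documentation remains the authoritative definition, and the function names are hypothetical.

```python
import re

# Assumed reading of the nomenclature
# <mother tongue><sex><speaker number within L1><type of speech a-f><number utterance>,
# inferred from names such as itm1e001.nis and jpw1e000.trl.
FILE_NAME = re.compile(
    r"(?P<mother_tongue>[a-z]{2})"
    r"(?P<sex>[mw])"
    r"(?P<speaker>\d)"
    r"(?P<speech_type>[a-f])"
    r"(?P<utterance>\d{3})"
    r"\.(?P<extension>nis|par|trl)"
)


def parse_file_name(name):
    """Return the name components as a dict, or None if the name deviates."""
    match = FILE_NAME.fullmatch(name)
    return match.groupdict() if match else None


def deviating_names(names):
    """Return the names that do not follow the assumed nomenclature."""
    return [name for name in names if parse_file_name(name) is None]
```

Under these assumptions a name such as arm1dd02.nis is reported as deviating, because a second letter appears where the three-digit utterance number is expected.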
Empty files: none
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents: ok
5% of the data and annotation files were checked against each other. The trl files were not checked. 9.9% of the data contained errors: some kinds of noise were not tagged, and there were errors in the orthographic and SAMPA annotation.
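The report does not state how the 5% subsample was selected. One reproducible way to draw such a subsample and to express the outcome as a percentage is sketched below. The seeded random selection, the function names and the parameters are assumptions made for illustration, not the documented procedure.

```python
import random


def draw_subsample(names, fraction=0.05, seed=0):
    """Draw a reproducible random subsample of the given file names."""
    ordered = sorted(names)  # sort first so the result depends only on the seed
    if not ordered:
        return []
    size = max(1, round(len(ordered) * fraction))
    return random.Random(seed).sample(ordered, size)


def error_rate_percent(checked, faulty):
    """Share of faulty items among the checked items, in percent."""
    return round(100.0 * faulty / checked, 1)
```

Fixing the seed lets a later revalidation check exactly the same files again.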
The revalidation was able to repair some data (speaker file, README, file names). The errors found in the manual validation could not be repaired. The SAMPA annotations and the noise markers should be revised.
The corpus is ok. No data or documentation files are missing.