Authors |
Florian Schiel, Katerina Louka |
Affiliation |
BAS Bayerisches Archiv für Sprachsignale |
Postal address |
Schellingstr. 3 |
|
schiel@phonetik.uni-muenchen.de |
Telephone |
+49-89-2180-2758 |
Fax |
+49-89-2800362 |
Corpus Version |
1.0 |
Date |
06.12.2004 |
Status |
final |
Comment |
|
Validation Guidelines |
Florian Schiel: The
Validation of Speech Corpora, Bastard Verlag,
2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
This document summarizes
the results of an in-house validation of the speech corpus SmartKom
made in
the
year 2004 within the project 'BITS' by the
The corpus contains 466 sessions. Each session
contains one recording of
approx. 4.5 min length with one person. Sessions are stored on numbered
DVDs.
interaction (HCI) in a number of different tasks (domains) and
technical setups (scenarios)
structured and validated against basic BAS guidelines.
The
General Documentation directory contains the following documentation
files for
the SmartKom corpus which can be found under: doc/
README |
general documentation |
DTD/
|
Document
type definitions for recording protocols and speaker profiles |
german-sampa.txt
|
Definition
of extended German SAM-PA as used in most German speech resources |
pardoc/ |
Copy of the BAS Partitur File
definition (HTML: start with "index.html") |
quicktime/
|
QuickTime
installation archive for Macintosh and Windows systems |
readme.ges |
Format description to
the 2D gesture labelling files *.ges |
readme.mar |
Format description to
the turn segment files *.mar |
readme.par |
Format description to
the BAS Partitur Format (BPF) files *.par |
readme.trl
|
Format description to
the transliteration files *.trl |
readme.trp |
Format description to
the prosodic segmentation files *.trp |
readme.ush |
Format description to
the user state label files *.ush |
readme.usm |
Format description to the user
state label files *.usm |
readme.woz.german |
subject instruction in German |
session-statistics |
Listing of all available
channels, annotations as well as some recording and speaker features |
sk_ger.lex |
Pronunciation dictionary
(SAM-PA) to all SK data |
techdocs/ |
Project reports in German |
trl-coding/ |
A copy of the English
version of the transliteration conventions in SmartKom |
webpages.zip |
webpages of the user interface |
· Administrative Information:
Validating person: n. a.
Date of
validation: n. a.
Contact
for requests regarding the corpus:
ok
Number and
type of media: DVD ok
Content of
each medium: no information
Copyright
statement and intellectual property rights (IPR): ok
· Technical information:
Layout of
media: Information
about file system type and directory structure:
DVD
DVD
nomenclature: dvd-<DVD
number><DVD version number>
The root
directory of each DVD contains the following:
readme.##.V |
specific Readme for each DVD |
data |
signal files of the sessions on
the DVD |
doc |
documents about the corpus
recording, annotation and pronunciation dictionary |
annot |
subdirectories for each
annotation type |
meta |
speaker profiles and recording
protocols to all SK recordings |
File
nomenclature: Explanation
of used codes (no white space in file names!):
<Type of Recording><Session Number><_><Technical scenario><Primary task><Recording Channel><_><Turn numbering><_><Speaker ID>.<extension> ok
Type of
Recording:
b : biometric data
w : Wizard-of-Oz
d : demo session
p : test session
v : evaluation session
Technical scenario:
p : Public
m : Mobile
h : Home
Primary task:
k : cinema
t : touristic planning
f : TV guide
r : restaurant
n : navigation
v : VCR programming
m : music jukebox
a : phone
x : fax
Recording channel:
a : clip-on microphone, channel 1 Sennheiser ME104
b : clip-on microphone, channel 2 Sennheiser ME104
h : headset microphone Sennheiser ME104
1-4 : microphone array 4 channels Sennheiser ME104
d : directional microphone Sennheiser ME 66
w : system output
p : playback background noise front
q : playback background noise back
t : tableau coordinates
s : SIVIT coordinates
i : infrared video of interaction area
m : front capture camera
l : left lateral capture
o : system display capture
g : synchronized video streams
Extensions:
.ags BPF represented as an annotation graph (XML)
.avi video file AVI (channels g,o)
.ges gesture labeling file
.mov video file DV (channels i,l,m)
.par BAS Partitur Format file (BPF)
.qt QuickTime file (master frame file)
.rpr recording session protocol
.spr speaker protocol file
.trl transliteration
.trp user state labeling file ('prosody')
.ush user state labeling file ('holistic')
.usm user state labeling file ('mimic')
.wav RIFF audio file (channels 1,2,3,4,a,b,d,h,p,q,w)
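The file nomenclature above can be decoded mechanically. The following is a minimal sketch, not a tool shipped with the corpus. It assumes that the session number consists of digits only, that the recording channel may be absent (e.g. for annotation files), and that turn numbering and speaker ID are optional alphanumeric fields; none of these details are stated in this report. The file name in the usage line is hypothetical.

```python
import re
from typing import Optional

# Code tables copied from the nomenclature listed above.
RECORDING_TYPES = {
    "b": "biometric data", "w": "Wizard-of-Oz", "d": "demo session",
    "p": "test session", "v": "evaluation session",
}
SCENARIOS = {"p": "Public", "m": "Mobile", "h": "Home"}
TASKS = {
    "k": "cinema", "t": "touristic planning", "f": "TV guide",
    "r": "restaurant", "n": "navigation", "v": "VCR programming",
    "m": "music jukebox", "a": "phone", "x": "fax",
}
CHANNELS = {
    "a": "clip-on microphone, channel 1",
    "b": "clip-on microphone, channel 2",
    "h": "headset microphone",
    "1": "microphone array, channel 1",
    "2": "microphone array, channel 2",
    "3": "microphone array, channel 3",
    "4": "microphone array, channel 4",
    "d": "directional microphone",
    "w": "system output",
    "p": "playback background noise front",
    "q": "playback background noise back",
    "t": "tableau coordinates",
    "s": "SIVIT coordinates",
    "i": "infrared video of interaction area",
    "m": "front capture camera",
    "l": "left lateral capture",
    "o": "system display capture",
    "g": "synchronized video streams",
}

# ASSUMPTIONS (not from the report): digits-only session number, optional
# channel, optional alphanumeric turn numbering and speaker ID.
FILE_NAME = re.compile(
    r"^(?P<rec_type>[bwdpv])"
    r"(?P<session>\d+)_"
    r"(?P<scenario>[pmh])"
    r"(?P<task>[ktfrnvmax])"
    r"(?P<channel>[abhd1-4wpqtsimlog])?"
    r"(?:_(?P<turn>[0-9A-Za-z]+))?"
    r"(?:_(?P<speaker>[0-9A-Za-z]+))?"
    r"\.(?P<ext>[a-z]+)$"
)


def parse_file_name(name: str) -> Optional[dict]:
    """Decode a file name into its fields, or return None if it does not match."""
    match = FILE_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["rec_type"] = RECORDING_TYPES[fields["rec_type"]]
    fields["scenario"] = SCENARIOS[fields["scenario"]]
    fields["task"] = TASKS[fields["task"]]
    if fields["channel"] is not None:
        fields["channel"] = CHANNELS[fields["channel"]]
    return fields


# Hypothetical file name, used only to illustrate the decoding.
print(parse_file_name("w001_pka.wav"))
```

A name that does not follow the pattern (for instance one containing white space) yields None, which makes the function usable as a simple nomenclature check.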
Formats of
signals and annotation files: If
non standard formats are used it is
common to give a full description or to convert into a standard format:
ok
- RIFF audio
file
- Video file AVI
- Video file DV
- QuickTime file
Coding: .wav, .avi, .mov, .qt
Compression:
n. a.
Sampling
rate: 16 kHz ok
Valid bits
per sample: (others than 8,
16, 24, should be reported): ALAW coding: bits/samp,
PCM coding, 16 bit ok
Used bytes
per sample: 2 bytes/samp ok
Multiplexed
signals: (exact
de-multiplexing algorithm;
tools) n.a.
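The signal parameters listed above (16 kHz sampling rate, 2 bytes per sample) can be checked per file. The sketch below uses Python's standard `wave` module and is only an illustration of such a check, not the procedure used in this validation. The function name `check_wav` is an assumption. The `wave` module reads PCM data only, so a file in another coding is reported as unreadable rather than inspected.

```python
import wave

EXPECTED_RATE_HZ = 16000        # sampling rate stated in this report
EXPECTED_BYTES_PER_SAMPLE = 2   # used bytes per sample stated in this report


def check_wav(path: str) -> list:
    """Return a list of problems found in one RIFF audio file (empty if none)."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getframerate() != EXPECTED_RATE_HZ:
                problems.append(f"sampling rate is {wav.getframerate()} Hz")
            if wav.getsampwidth() != EXPECTED_BYTES_PER_SAMPLE:
                problems.append(f"sample width is {wav.getsampwidth()} bytes")
            if wav.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError) as exc:
        # Raised for truncated files and for non-PCM content.
        problems.append(f"not readable as PCM RIFF audio: {exc}")
    return problems
```

Calling `check_wav` on every `.wav` file of a DVD and collecting the non-empty results gives a simple status listing for the signal files.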
· Database contents:
Clearly
stated purpose of the recordings:
Empirical Study of
Human-Computer interaction (README.doc,
/doc/papers/LREC2002-Overview.ps)
Speech
type(s): (multi-party
conversations, human-human dialogues, read sentences,
connected and/or isolated digits, isolated words etc.) ok
Instruction
to speakers in full copy: ok (more information under /doc/techdocs/ )
·
Linguistic
contents of prompted speech:
Specifications
of the individual text items:
n.a.
Specification
for the prompt sheet design or specification of the design
of the speech prompts: n.a.
Example
prompt sheet or example sound file from the speech prompting: n.a.
·
Linguistic
contents of non-prompted speech:
Multi-party:(number of speakers, topics discussed, type of
setting -
formal/informal) ok
Human-human
dialogues: (type of
dialogues, e.g. problem solving, information seeking, chat
etc., relation between speakers, topic(s) discussed, type of setting,
scenarios) n.a.
Human-machine
dialogues: (domain(s),
topic(s), dialogue strategy followed by the machine,
e.g. system driven, mixed initiative, type of system, e.g. test,
operational
service, Wizard-of-Oz) ok (README.DOC)
· Speaker information:
Speaker
recruitment strategies: ok
(more information under /doc/papers/ and /doc/techdocs/)
Number of
speakers: 461
ok
Distribution of speakers over sex, age, dialect regions: ok
(more information under: /doc/dtd/readme.spr)
Description/definition of
dialect
regions: ok (more information
under: /doc/dtd/readme.spr)
·
Recording
platform and recording conditions:
Recording
platform: ok
Position
and type of microphones: ok
- Company name and type id: Sennheiser ME104, Sennheiser ME 66
- Electret, dynamic, condenser: no information
- Directional properties: ok (readme.doc,
/doc/techdocs/TechDok-NR-07.ps)
- Mounting: ok (readme.doc, /doc/techdocs/TechDok-NR-07.ps )
Position
of speakers: (distance to
microphone) ok (readme.doc, /doc/techdocs/TechDok-NR-07.ps )
Bandwidth: (if
other than zero to half of sampling rate) ok
Number of
channels and channel separation:
ok (readme.doc)
Acoustical
environment: ok
(more information under /doc/techdocs/)
·
Annotation
(BAS Partitur Format Files):
Unambiguous
spelling standard used in annotations: ok
Labeling
symbols: ok
List of
non-standard spellings (dialectal variation, names etc.): given
Distinction
of homographs which are no homophones: n.a.
Character
set used in annotations: ok
Any other
language dependent information as abbreviations etc: given
Annotation
manual, guidelines, instructions:
ok – (readme.par, doc/pardoc/index.html,
doc/papers/Schiel-02-LREC-WS.ps)
Description
of quality assurance procedures:
no information
Selection
of annotators: no information
Training
of annotators: no information
Annotation
tools used:
no information
Annotation
(Orthographic transliteration):
Unambiguous
spelling standard used in annotations: ok
Labeling
symbols: ok
List of
non-standard spellings (dialectal variation, names etc.): ok
Distinction
of homographs which are no homophones: n.a.
Character
set used in annotations: ok
Any other
language dependent information as abbreviations etc: given
Annotation
manual, guidelines, instructions:
ok – (readme.trl, /doc/techdocs/TechDok-NR-02.ps,
/doc/papers/Beringer-01-verona.ps,
/doc/papers/Oppermann-01-EUROSPEECH.ps,
/doc/papers/Siepmann-01-ISCA.pdf)
Description
of quality assurance procedures:
no information
Selection
of annotators: no information
Training
of annotators: no information
Annotation
tools used:
no information
Annotation
(Annotation 2D Gesture):
Unambiguous
spelling standard used in annotations: n.a.
Labeling
symbols: ok
List of
non-standard spellings (dialectal variation, names etc.): n.a.
Distinction
of homographs which are no homophones: n.a.
Character
set used in annotations:
n.a.
Any other
language dependent information as abbreviations etc: n.a.
Annotation
manual, guidelines, instructions: ok
– (readme.ges, doc/techdocs/TechDok-NR-14.ps,
doc/papers/Steininger-London-01.ps,
doc/papers/Steininger-Verona-01.pdf)
Description
of quality assurance procedures:
no information
Selection
of annotators: no information
Training
of annotators: no information
Annotation tools used: no information
Annotation
(Prosodic labeling for User State):
Unambiguous
spelling standard used in annotations: n.a.
Labeling
symbols: ok
List of
non-standard spellings (dialectal variation, names etc.): ok
Distinction
of homographs which are no homophones: n.a.
Character
set used in annotations:
ok
Any other
language dependent information as abbreviations etc: n.a.
Annotation
manual, guidelines, instructions: ok
– (readme.trp, readme.trl, doc/techdocs/TechDok-NR-17.ps,
doc/papers/Steininger-02-LREC.pdf )
Description
of quality assurance procedures:
no information
Selection
of annotators: no information
Training
of annotators: no information
Annotation tools used: no information
Annotation
(User State - interesting emotional
and cognitive state):
Unambiguous
spelling standard used in annotations: n.a.
Labeling
symbols: ok
List of
non-standard spellings (dialectal variation, names etc.): n.a.
Distinction
of homographs which are no homophones: n.a.
Character
set used in annotations:
ok
Any other
language dependent information as abbreviations etc: n.a.
Annotation
manual, guidelines, instructions: ok
– (readme.ush, doc/papers/Steininger-02-LREC.pdf,
doc/papers/Steininger-02-LREC-WS.pdf )
Description
of quality assurance procedures:
no information
Selection
of annotators: no information
Training
of annotators: no information
Annotation
tools used:
no information
Annotation
(User States labeled without
the audio information ):
Unambiguous
spelling standard used in annotations: n.a.
Labeling
symbols: ok
List of
non-standard spellings (dialectal variation, names etc.): n.a.
Distinction
of homographs which are no homophones: n.a.
Character
set used in annotations:
n.a.
Any other
language dependent information as abbreviations etc: n.a.
Annotation
manual, guidelines, instructions: ok
– (readme.usm, doc/techdocs/TechDok-NR-17.ps,
doc/papers/Steininger-02-LREC.pdf,
doc/papers/Steininger-02-LREC-WS.pdf)
Description
of quality assurance procedures:
no information
Selection
of annotators: no information
Training
of annotators: no information
Annotation
tools used:
no information
· Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation
or reference to the phoneme set:
ok.
(/doc/german-sampa.txt)
Phonological
or higher order phenomena accounted in the phonemic
transcriptions: ok
·
Statistical
information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.
Word
frequency table: n.a.
· Others:
Any other
essential language-dependent information or convention: given.
Indication
of how many files were double-checked by the producer
together with percentage of detected errors: no information
Status of documentation: good
The following list contains
all validation steps with the methodology and results.
Completeness of signal
files: Refer to
"/doc/session-statistics"
Completeness of meta
data files: ok
Completeness of
annotation files: not ok.
Empty files: none
Status of signal,
annotation and meta data files:
ok
Cross checks
of meta information: ok
Cross checks
of summary listings: ok
Annotation and lexicon
contents: For the following entries the SAM-PA annotation
differs from the SAM-PA data in the lexicon, or the SAM-PA is incorrectly
annotated:
h"Und6t#Qaxt#Unt#z'i:ptsIC
v"e:g@#Unt#pl'Ets@#t"u:6
fi:6h'OxtsaIt@nUntaIn
S'OpIN#z"Ent6
rEstor'a:~#kategor"i:@n
Q'aIn#Unt#tsv"antsIC
axa
f'i:r#Unt#tsv"antsICst6
rEstor'a:~s
f'Ynf#Unt#tsv"antsICst@n
QIt'a:li@n
z'Eks#Unt#f"I6tsIC
das#fr'a:g@#Unt#Q'antv"O6t#Sp"i:l
St'at#InfO6matsj"o:n
b"ESt#Of#n'aInti:s
dr'aI#Unt#tsv"antsIC
n'OYn#Unt#n"OYntsIC
gro:sbrit'ani@n
tur'Ist@n#InfO6matsj"o:n
f'a:st#fu:t#restor"a:~
f'i:r#Unt#tsv"antsIC
SIfsQaUsflU
Q'aIn#Unt#tsv"antsICst6
gU6m'e:#rEstor"a:~
kant'a:t@#be:ve:f"aU#Q'aIn#hUnd6t#z"i:b@n#Unt#f"I6tsIC#kor'a:l
b"u:t@n#Unt#b'In@n#n'a:xrICt@n
S'Ifs#aUsfl"y:g@
S'Ifs#aUsfl"y:g@n
Q'aIn#Unt#n"OYntsIC
rEStor'a:~#kategor"i:
z'a:lomOn#Unt#di:#k"2:nIgIn#fOn#z'a:ba
rEstor'a:~m"E:sIC
tsv'aI#Unt#f"I6tsIC
d'e:Ii#z"o:p
f'Ynf#Unt#dr"aIsIC
rEStor'a:~#Qy:b6z"ICt
das#f"Ynft@#elem'Ent
kant'a:t@#be:ve:f"aU#Q'aIn#hUnd6t#z"i:b@n#Unt#f"I6tsIC
Q'axt#Unt#n"OYntsIC
f'i:r#Unt#dr"aIsIC
Q'axt#Unt#tsv"antsICst@n
rEstor'a:~#b@z"u:xs
ha:#Unt#'QEm
f'asnaxt#am#n"Eka:r
Q'axt#Unt#Q"axtsIC
QEm"e:#Unt#j'a:gua:r
rEstor'a:~#b@z"u:x@s
h"Und6t#zEks#Unt#f'I6tsIC
rEstor'a:~
t'It@l
rEStor'a:~#b@z"u:x
Q'aIn#Unt#z"i:ptsIC
fi:6#h'OxtsaIt@n#Unt#aIn#t'o:d@s#f"al
das#l"e:b@n#Ist#S'2:n
n"EkarSt'aIna:x
p'ark#anl"a:g@
S'Ifs#aUsfl"u:k
z'Ende#t"It@l
tsv'aI#Unt#tsv"antsIC
Q'aIn#Unt#dr"aIsICst6
f'Ynf#Unt#f"I6tsIC
draI#k'2:nICs#Str"a:s@
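A cross check of this kind, between the canonical pronunciations in the annotation files and the pronunciation dictionary, can be sketched as follows. This is an illustration only and not the script used for the validation. It assumes that the canonical pronunciation tier of a BPF file consists of lines of the form `KAN: <word index> <SAM-PA string>` (see readme.par for the authoritative description) and that the lexicon pronunciations have already been loaded into a set; the format of sk_ger.lex is not described in this report. The sample data are invented.

```python
def parse_kan_tier(bpf_lines):
    """Yield (word_index, pronunciation) pairs from KAN tier lines.

    ASSUMPTION: each KAN line has three whitespace-separated fields,
    'KAN:', a word index, and a SAM-PA string without white space.
    """
    for line in bpf_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "KAN:":
            yield parts[1], parts[2]


def pronunciations_missing_from_lexicon(bpf_lines, lexicon_prons):
    """Return the sorted annotated pronunciations that the lexicon lacks."""
    return sorted(
        {pron for _, pron in parse_kan_tier(bpf_lines) if pron not in lexicon_prons}
    )


# Invented sample data for illustration.
sample_bpf = [
    "ORT: 0 guten",
    "ORT: 1 Tag",
    "KAN: 0 g'u:t@n",
    "KAN: 1 t'a:g",
]
sample_lexicon = {"g'u:t@n", "t'a:k"}

print(pronunciations_missing_from_lexicon(sample_bpf, sample_lexicon))
```

Entries reported by such a check still need manual inspection, since a mismatch can originate either in the annotation or in the lexicon.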
10% of the 'usable' data were checked by comparing the audio files
with the par files. 6.52% of the checked data contained errors.
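The error rate of such a sample check is the number of checked items with errors divided by the number of checked items. The counts behind the reported figure are not given in this report. The numbers in the sketch below are an assumption chosen for illustration: 3 erroneous items out of 46 checked items happen to reproduce the reported percentage.

```python
def error_rate_percent(checked: int, with_errors: int) -> float:
    """Percentage of checked items that contained at least one error."""
    if checked <= 0:
        raise ValueError("at least one item must have been checked")
    return 100.0 * with_errors / checked


# ASSUMED counts, not taken from the report.
print(f"{error_rate_percent(46, 3):.2f}%")  # prints "6.52%"
```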
The revalidation was able to repair
some
data (lexicon, README). The errors found in
the
manual validation could not be repaired.
The corpus is ok. The corpus is well
documented and the error rate is very low.