BAS
Bavarian Archive for Speech Signals
SmartWeb Video Corpus - SVC
Last Update: 2014-03-04
Description
This multimodal corpus contains 99 recordings, each containing a
human-human-machine dialogue: one speaker (the one being recorded)
interacts with a human partner as well as with a dialogue system via a
smartphone (SmartWeb system).
The speaker uses a client-server based dialogue system (SmartWeb) for
spoken access to Internet content in natural environments (office,
hallway, street, park, cafe, ...).
Speech was captured over a Bluetooth headset and transferred via a UMTS
cellular line to the server; a second, collar-attached microphone was recorded
on a portable iRiver recorder to yield an undisturbed, high-quality
reference signal. The face of the speaker was captured by the built-in
face camera of the smartphone.
The speech signal was segmented into queries (automatically by the prompting
system) and a second time manually into turns, and transcribed according
to the Verbmobil transliteration standard.
The video signal was manually labelled as OnView / OffView and, in part,
spatially segmented for face detection.
The motivation for this corpus was to capture realistic multimodal
(speech + face) data in a realistic human-machine interaction, as well as to capture
as many OffTalk situations as possible (OffTalk being all speech uttered
by the speaker that is not intended as input to the system).
- number of dialogues / recorded speakers: 99
- number of segmented turns: 2218
- total duration: 971 min
- Vocabulary size (number of unique word tokens): 1643
- formats:
  - collar mic: WAV, 44.1 kHz, 16 bit
  - Bluetooth/UMTS channel: A-law, 8 kHz, 8 bit
  - video: 176x144, 24 bpp, 15 fps, 3GPP + MPEG-1
  - annotation: Verbmobil Transliteration (TRS), BAS Partitur Format (BPF), ATLAS Annotation Graph (XML)
- meta data: speaker and recording protocol (XML)
- segmentation: automatic segmentation into input queries by the prompting
system; manual segmentation into turns; OffTalk labelling; OffView labelling;
spatial segmentation of the face (partly manual)
- distribution: 5 DVD-R
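The Bluetooth/UMTS channel is listed above as A-law, 8 kHz, 8 bit. As a minimal sketch, assuming those files are headerless, single-channel G.711 A-law byte streams (this page does not say whether a header is present), the following Python expands A-law bytes to 16-bit linear PCM using the standard G.711 expansion. The function names are illustrative and are not part of any BAS tooling.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one G.711 A-law byte (0..255) to a signed 16-bit PCM value."""
    code ^= 0x55                     # undo the even-bit inversion applied by the encoder
    mantissa = (code & 0x0F) << 4    # 4-bit mantissa, moved to bits 4..7
    segment = (code & 0x70) >> 4     # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return value if code & 0x80 else -value


# Precompute all 256 possible values once; decoding is then a table lookup.
_ALAW_TABLE = [alaw_to_linear(b) for b in range(256)]


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte stream into a list of 16-bit PCM samples."""
    return [_ALAW_TABLE[b] for b in data]


def duration_seconds(data: bytes, sample_rate: int = 8000) -> float:
    """Duration of a raw A-law stream: one byte per sample at 8 kHz (assumed mono)."""
    return len(data) / sample_rate
```

If the files turn out to carry a header (for example a WAV or NIST container), the header would have to be skipped or parsed before the payload is passed to `decode_alaw`.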
Publication: Schiel, F., Mögele, H. (2008). Talking and Looking: the SmartWeb Multimodal Interaction Corpus. In: Proc. of LREC 2008, Marrakech, Morocco.
Audio examples
Recording i067/man-0000rec-110 Bluetooth Headset UMTS
bis <"ah> <h"as> wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmitt= <PP> <h"as> <P> bis% um wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmittel ?
Recording i067/man-0000rec-110 Collar Microphone High Quality (no UMTS transmission)
bis <"ah> <h"as> wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmitt= <PP> <h"as> <P> bis% um wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmittel ?
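The transcripts above use Verbmobil-style transliteration markup. As a rough sketch of how such a line might be reduced to plain word tokens, the Python below strips the markup from the example utterance. The interpretation of the markers is an assumption based on a common reading of the Verbmobil conventions and should be checked against the official transliteration manual: angle-bracket tags such as `<"ah>`, `<h"as>`, `<P>` and `<PP>` are treated as hesitations and pauses, a trailing `=` as a broken-off word, a trailing `%` as a hard-to-understand word, and `"a`, `"o`, `"u`, `"s` as ä, ö, ü, ß. The function name `clean_transliteration` is hypothetical.

```python
import re

# Assumed mapping of the ASCII umlaut notation to Unicode characters.
UMLAUTS = {'"a': "ä", '"o': "ö", '"u': "ü", '"A': "Ä", '"O': "Ö", '"U': "Ü", '"s': "ß"}

# Anything in angle brackets is assumed to be a non-lexical event (pause, hesitation).
TAG = re.compile(r"<[^>]*>")

EXAMPLE = (
    'bis <"ah> <h"as> wieviel Uhr fahren denn in der Nacht die "offentlichen '
    'Verkehrsmitt= <PP> <h"as> <P> bis% um wieviel Uhr fahren denn in der Nacht '
    'die "offentlichen Verkehrsmittel ?'
)


def clean_transliteration(line: str, keep_fragments: bool = False) -> list[str]:
    """Return the plain word tokens of one transliterated utterance."""
    words = []
    for token in TAG.sub(" ", line).split():
        # Broken-off words are assumed to be marked with '=' at either edge.
        if token.startswith("=") or token.endswith("="):
            if not keep_fragments:
                continue
            token = token.strip("=")
        token = token.rstrip("%")            # drop the 'hard to understand' marker
        for ascii_form, char in UMLAUTS.items():
            token = token.replace(ascii_form, char)
        if any(ch.isalpha() for ch in token):  # skip bare punctuation such as '?'
            words.append(token)
    return words


if __name__ == "__main__":
    print(" ".join(clean_transliteration(EXAMPLE)))
```

Dropping fragments by default keeps word counts comparable across utterances; pass `keep_fragments=True` when disfluencies are the object of study.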
Video examples
Recording i097.mpg Male age 32, indoor, Bluetooth Headset UMTS
Transcript i097.trl
Recording Protocol i097.rpr
Speaker Protocol AJAW.spr
Recording i100.mpg Female, age 25, with glasses, indoor, Bluetooth Headset UMTS
Transcript i100.trl
Recording Protocol i100.rpr
Speaker Protocol APDW.spr
Availability and Costs
Usable without restrictions, except that distribution to third parties is not permitted.
SmartWeb Video Corpus - SVC
6 DVD-R, ISO 9660 + shipping
Scientific: EUR 1,275.00 (ELRA members: EUR 635.00) + VAT
Commercial: EUR 2,275.00 (ELRA members: EUR 1,635.00) + VAT
(VAT does not apply to overseas orders or to non-German orders within the EU)
Questions and orders to:
Florian Schiel