BAS
Bavarian Archive for Speech Signals
SmartWeb Video Corpus - SVC

Last Update: 2014-03-04

Description

This multimodal corpus contains 99 recordings, each of a human-human-machine dialogue: one speaker (the one being recorded) interacts with a human partner as well as with a dialogue system via a smartphone (the SmartWeb system).
The speaker uses a client-server based dialogue system (SmartWeb) for spoken access to Internet content in a natural environment (office, hallway, street, park, cafe, ...).
Speech was captured over a Bluetooth headset and transferred via a UMTS cellular line to the server; a second, collar-attached microphone was recorded on a portable iRiver recorder to yield an undisturbed, high-quality reference signal. The face of the speaker was captured by the built-in face camera of the smartphone. The speech signal was segmented into queries (automatically, by the prompting system) and a second time manually into turns, and transcribed according to the Verbmobil transliteration standard. The video signal was labelled manually as OnView / OffView and, in part, spatially segmented for face detection.

The motivation for this corpus was to capture realistic multimodal (speech + face) data in a realistic human-machine interaction, and to capture as many OffTalk situations as possible (OffTalk being all speech uttered by the speaker that is not intended as input to the system).

number of dialogues / recorded speakers: 99
number of segmented turns: 2218
total duration: 971 min
Vocabulary size (number of unique word tokens): 1643
formats:
- collar mic: WAV, 44.1 kHz, 16 bit
- Bluetooth/UMTS channel: A-law, 8 kHz, 8 bit
- video: 176x144, 24bpp, 15fps, 3GPP + MPEG1
- Verbmobil Transliteration (TRS), BAS Partitur Format (BPF), ATLAS Annotation Graph (XML)
- meta data: speaker and recording protocol (XML)
segmentation: automatic segmentation into input queries by the prompting system; manual segmentation into turns; OffTalk labelling; OffView labelling; spatial segmentation of the face (partly manual)
distribution: 5 DVD-R
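
The Bluetooth/UMTS channel is listed above as 8-bit A-law at 8 kHz, which many tools cannot compare directly with the 16-bit collar microphone signal. The following is a minimal sketch of expanding A-law bytes to 16-bit linear PCM using the standard G.711 expansion. It assumes the samples are available as raw, headerless, mono A-law bytes; the container actually used on the DVDs is not described here, so any header handling is left out. The function names `alaw_to_linear`, `decode_alaw` and `write_wav` are illustrative and not part of the corpus tools.

```python
import struct
import wave


def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear sample (G.711)."""
    byte ^= 0x55                      # undo the even-bit inversion used by A-law
    sign = byte & 0x80                # a set sign bit means a positive sample
    exponent = (byte & 0x70) >> 4     # 3-bit segment number
    mantissa = byte & 0x0F            # 4-bit step within the segment
    if exponent == 0:
        sample = (mantissa << 4) + 8
    else:
        sample = ((mantissa << 4) + 0x108) << (exponent - 1)
    return sample if sign else -sample


def decode_alaw(data: bytes) -> list[int]:
    """Decode a sequence of raw A-law bytes to linear samples."""
    return [alaw_to_linear(b) for b in data]


def write_wav(path: str, samples: list[int], rate: int = 8000) -> None:
    """Write samples as 16-bit PCM WAV; mono is an assumption of this sketch."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

A typical use would be to read the channel data into `bytes`, call `decode_alaw`, and pass the result to `write_wav` before aligning it with the collar microphone recording.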

Corpus documentation (total SmartWeb corpus)

Publication: Schiel, F., Mögele, H. (2008). Talking and Looking: the SmartWeb Multimodal Interaction Corpus. In: Proc. of LREC 2008, Marrakech, Morocco.

Documentation Addendum DVD SVC (additional video annotation)

Audio examples

Recording i067/man-0000rec-110 Bluetooth Headset UMTS
bis <"ah> <h"as> wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmitt= <PP> <h"as> <P> bis% um wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmittel ?
Recording i067/man-0000rec-110 Collar Microphone High Quality (no UMTS transmission)
bis <"ah> <h"as> wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmitt= <PP> <h"as> <P> bis% um wieviel Uhr fahren denn in der Nacht die "offentlichen Verkehrsmittel ?
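
The transliterations above mix lexical words with annotation markers. The following is a rough sketch of reducing such a line to plain orthographic words. It rests on assumptions about the marker meanings that are not stated on this page: angle-bracketed items (such as `<P>`) are taken to be non-lexical events, a trailing `=` is taken to mark a broken-off word, a trailing `%` a hard-to-understand word, and `"a`, `"o`, `"u`, `"s` are taken to encode ä, ö, ü, ß. The function name `plain_words` is hypothetical; the Verbmobil conventions themselves remain the authoritative reference.

```python
import re

# Assumed ASCII encoding of German umlauts and sharp s in the transliteration.
UMLAUTS = {
    '"a': "ä", '"o': "ö", '"u': "ü",
    '"A': "Ä", '"O': "Ö", '"U': "Ü",
    '"s': "ß",
}


def plain_words(transliteration: str) -> list[str]:
    """Return the lexical words of one transliterated line."""
    # Remove angle-bracketed tags (assumed to be pauses, hesitations, noises).
    text = re.sub(r"<[^>]*>", " ", transliteration)
    words = []
    for token in text.split():
        if token.endswith("="):        # assumed: broken-off word, skipped
            continue
        token = token.rstrip("%")      # assumed: hard-to-understand marker
        for code, char in UMLAUTS.items():
            token = token.replace(code, char)
        # Keep only tokens that contain at least one letter (drops "?").
        if any(ch.isalpha() for ch in token):
            words.append(token)
    return words


example = ('bis <"ah> <h"as> wieviel Uhr fahren denn in der Nacht die '
           '"offentlichen Verkehrsmitt= <PP> <h"as> <P> bis% um wieviel Uhr '
           'fahren denn in der Nacht die "offentlichen Verkehrsmittel ?')
words = plain_words(example)
```

Whether broken-off words should be dropped or kept depends on the task; for vocabulary counts they would usually be excluded, which is the choice made in this sketch.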

Video examples

Recording i097.mpg Male age 32, indoor, Bluetooth Headset UMTS
Transcript i097.trl
Recording Protocol i097.rpr
Speaker Protocol AJAW.spr

Recording i100.mpg Female age 25 with glasses, indoor, Bluetooth Headset UMTS
Transcript i100.trl
Recording Protocol i100.rpr
Speaker Protocol APDW.spr

Availability and Costs

Usable without restrictions (except distribution to third parties).
SmartWeb Video Corpus - SVC
6 DVD-R Iso 9660 + Shipping
Scientific EUR 1.275,00 (ELRA Members EUR 635,00) + VAT
Commercial EUR 2.275,00 (ELRA Members EUR 1.635,00) + VAT

(VAT does not apply to overseas orders or to non-German orders within the EU)

Questions and orders to:

Florian Schiel
