Corpus Specification of WebCommand

WebCommand is a speech corpus for the development and validation of speech recognition algorithms for British English and French. The target application is a portable full-size touch screen controlled by voice commands, a so-called `Web Pad'. This device is intended primarily for communication, i.e. video phone, email and Internet access.

The pre-validation and the final validation were carried out by the producer itself. We generally recommend commissioning an independent third institution for both; in this case, in-house validation may be justified by the relatively small size of the corpus and the client's very constrained budget.

In the following, the corpus specification of WebCommand is presented as a checklist. The elements of this checklist have already been discussed in this order in chapter [*]. Elements that are not applicable to WebCommand are marked with `n.a.'.

Speaker Profiles: Speakers are native speakers of British English or French and at least 18 years old. Gender distribution is 50:50; all dialects are allowed; education level is not specified.
Number of Speakers: At least 40 speakers had to be recorded, 20 for British English and 20 for French. The numbers of male and female speakers should preferably be equal in each language.
Contents: The contents of the corpus were specified by the client in the form of a plain text command list. The text corpus was fixed, that is, all speakers recorded in one recording room spoke the same corpus of 130 command words. There are four text corpora in total: one for each of the two recording environments (see below) in each of the languages British English and French.
- Vocabulary: English: 163 words; French: 188 words
- Domain: Control commands and names
- Task: No task specified
- Phonologic Distribution: No distribution specified
Speaking Style:  
- Read Speech +
- Answering Speech -
- Command/Control Speech -
- Non Prompted Speech -
- Spontaneous Speech -
- Neutral/Emotional -
Recording Setup: On-site Recording
- Acoustical environment: Each speaker is to be recorded on-site in two different recording rooms, P and S, on different days. The acoustical background consisted only of the hum of the recording device, a regular Macintosh desktop PC approx. 50 cm from the speaker's head. The PCs were rated as rather quiet.
- Script: Speakers read prompts in their native language from the CRT display.
- Background noise: No artificial background noise specified.
- Microphones: The speaker wears an ear-free headset, a Beyerdynamic NEM 192; a second microphone, a Beyerdynamic MCE 10, is mounted on the upper left corner of a dummy laptop case that the speaker holds with both hands on his/her lap to simulate free speaking.
Technical Specifications:  
- Sampling Rate: 22050 Hz
- Sample Type and Width: sample type linear, not compressed
- Number of Channels: Two-channel recording; left channel: Beyerdynamic NEM 192; right channel: Beyerdynamic MCE 10
- Signal File Format: WAV stereo (RIFF)
- Annotation File Format: SAM annotation files according to SpeechDat specifications and a summarized annotation table for each recording block
- Meta Data File Format: Table SPEAKER.TBL maps the 4-digit speaker id to sex, age and mother tongue. Table SESSION.TBL maps the 4-digit session id to speaker id, place of recording, microphone types, channel mapping and environment. The file SUMMARY.TXT contains the SpeechDat-compliant summary of recordings: for each recording session, all individual recordings are listed on one line. If a recording is missing, a `-' is listed instead of the three-digit prompt number.
- Lexicon Format: Two-column plain text file: orthography and pronunciation coded in SAM-PA
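
The technical specifications above can be checked against an individual signal file. The following is a minimal sketch in Python, not part of the corpus tools: the function name `split_channels` is hypothetical, and the sketch assumes 16-bit samples, which the specification does not state (it only says linear, not compressed). It verifies the sampling rate and channel count and separates the interleaved stereo signal into the two microphone channels.

```python
import sys
import wave
from array import array

EXPECTED_RATE = 22050      # Hz, from the technical specification
EXPECTED_CHANNELS = 2      # left: headset, right: microphone on the laptop dummy


def split_channels(path):
    """Return (left, right) sample lists of a WebCommand-style stereo WAV file.

    Assumption: 16-bit linear PCM. The sample width is not given in the
    specification, so any other width is rejected instead of being guessed.
    """
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            raise ValueError(f"expected 2 channels, got {wav.getnchannels()}")
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError(f"expected 22050 Hz, got {wav.getframerate()}")
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit samples")
        frames = wav.readframes(wav.getnframes())

    samples = array("h")           # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":     # WAV data are little-endian
        samples.byteswap()

    # Stereo frames are interleaved: left, right, left, right, ...
    left = list(samples[0::2])
    right = list(samples[1::2])
    return left, right
```

The left/right assignment follows the channel list above; for real data the channel mapping recorded in SESSION.TBL should be treated as authoritative.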

Corpus Structure:  
- Structure: Recordings are stored in separate subdirectories for each combination of recording environment and language. The corpus contains 47 complete sessions (130 recordings per session). Care is taken that each speaker is recorded in complete sessions in each of the two recording rooms. Additional incomplete recording sessions are collected in the directories NOT_USED_FR (4 sessions) and NOT_USED_EN (7 sessions) respectively. Signal data are stored on DVD; a separate CD-ROM contains documentation, annotation files and pronunciation dictionaries.
- Terminology: Session names are coded as SES#*** where # codes the combination of environment and language and *** encodes the session number, e.g. SES6013 is the 13th recording session of a French speaker in room P. A mapping from speaker IDs to sessions, as well as the speaker profile, can be found in the file SESSION.TBL.
A recording file name is encoded as Q1#***YYY.WAV where YYY denotes the three-digit number of the text prompt (000-129), e.g. Q16013051.WAV contains the two microphone signals, in a WAV stereo file, of the 52nd prompt of the 13th recording session of French speakers in room P. The channel assignment for the microphones is stored in the file SESSION.TBL.
- Distribution Media: The corpus consists of two DVD-5 discs with a total size of 7.5 GByte plus a CD-ROM with the label files and documentation. One DVD holds the data of the British speakers, the other the data of the French speakers.
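
The naming scheme described under Terminology can be decoded mechanically. The sketch below is an illustration only: the names `RecordingName` and `parse_recording_name` are hypothetical, and it assumes that the environment/language code is a single digit and that the prompt number has three digits, as in the example Q16013051.WAV. Since only one code value (6 for French, room P) is documented here, the code is kept as an opaque digit rather than being mapped to a language or room.

```python
import re
from typing import NamedTuple

# Q1, one digit for the environment/language combination,
# three digits for the session number, three digits for the prompt number.
RECORDING_RE = re.compile(r"^Q1(\d)(\d{3})(\d{3})\.WAV$", re.IGNORECASE)


class RecordingName(NamedTuple):
    combination: str   # environment/language code, left uninterpreted
    session: int       # session number within that combination
    prompt: int        # zero-based prompt number, 0 to 129

    @property
    def session_name(self) -> str:
        """Session name in the SES#*** form used by the corpus."""
        return f"SES{self.combination}{self.session:03d}"


def parse_recording_name(filename: str) -> RecordingName:
    """Split a recording file name into its documented components."""
    match = RECORDING_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recording file name: {filename!r}")
    combination, session, prompt = match.groups()
    prompt_number = int(prompt)
    if not 0 <= prompt_number <= 129:
        raise ValueError(f"prompt number out of range: {prompt}")
    return RecordingName(combination, int(session), prompt_number)
```

For the example from the text, `parse_recording_name("Q16013051.WAV")` yields combination `"6"`, session 13 and prompt 51 (the 52nd prompt, since numbering starts at 000), and its `session_name` is `SES6013`.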

Release Plan 06.05.02 : Start of project, delivery of the prompts for both languages by ordering company.
01.07.02 : Database British English will be delivered to ordering company.
15.07.02 : Database French will be delivered to ordering company.

The client agrees that the corpus is offered to third parties via the national catalogue of the BAS and the international catalogue of the European Language Resources Association (ELRA) after a blocking period of one year. If ELDA acts as a broker to deliver the corpus to a third party, ELDA earns a commission of 20% of the agreed royalties. No discount is provided for research use or for ELRA members.

Documentation REPORT.TXT: main documentation including copyrights, history and error log (see section [*] for a complete listing)
SAMEXPORT.TXT: summary of annotation
SESSION.TBL: recording protocol: mapping of 4-digit session id to speaker id, place of recording, date of recording, microphone types, channel mapping, environment
SPEAKER.TBL: speaker protocol: mapping of 4-digit speaker id to sex, age and mother tongue
Documentation of SpeechDat annotation guidelines and format and pictures from the recording setup

