BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de


Hempels Sofa
Spontaneous Speech Data Collection
CD-ROM Database
Version 2.1

Copyright(C) 2003 by
Bavarian Archive for Speech Signals
University of Munich, Germany

Compiled by:
Chr. Draxler
Department of Phonetics and Speech Communication
University of Munich
Schellingstr. 3/II
D 80799 Munich
+49/89/2866 9968
+49/89/280 0362 fax
draxler@phonetik.uni-muenchen.de


The Hempels Sofa spontaneous speech data collection consists of 3903
recordings from telephone calls stored on 2 CD-ROMs in a format
compatible with the SpeechDat(II) database exchange format as defined
in the SpeechDat deliverable SD 1.3.1 V.4.3.

CD-ROM HEMPEL_01 contains sessions 1000-4499,
CD-ROM HEMPEL_02 contains sessions 4500-5809.


Introduction
------------

Hempels Sofa is a collection of more than 3900 spontaneous speech items
recorded as extra material during the German SpeechDat-II project.
Speakers were asked to report what they had been doing during the last
hour: "Was haben Sie in der letzten Stunde gemacht?".

This item was recorded as the last item of the recording session. By
then, speakers had become acquainted with the recording procedure and
were quite relaxed because they knew that this item was the last to be
recorded. This resulted in quite natural, colloquial speech, sometimes
with a marked regional accent.

The corpus collection is described in more detail in the LREC 2002
paper "Three New Corpora at the Bavarian Archive for Speech Signals -
and a First Step Towards Distributed Web-Based Recording" by C. Draxler
and F. Schiel. This paper is contained in this database in file
DOC/BASCORPO.PDF; it also contains links to related SpeechDat documents.

Note: the name of the corpus refers to the German proverbial phrase
"wie bei Hempels unter'm Sofa". This phrase is often used to indicate
that something is not well cleaned up -- not dirty, just in its everyday
state when one is not expecting visitors. I found the phrase
appropriate for this data collection because, when listening to the
recordings, one quite often gets the impression of sitting next to the
speaker on the sofa in an ordinary living room.


Speakers
--------

The speakers were recruited using the following recruiting strategies:

1) hierarchical recruitment within a large company: group leaders
   within the company were asked to distribute prompt sheets among
   their group members.

2) calls for participation in newspapers: newspapers with a marked
   regional distribution were asked to publish calls for participation.
   Speakers called a speech server (not toll-free) and were prompted to
   leave their postal address. The requested number of sheets was then
   sent to them, together with the toll-free number to call for the
   recording.

3) snowball recruitment: once speakers had called, they were sent a
   reward (a telephone card with a value of 6.- DM (~3 Euro)). Together
   with the thank-you letter they were sent extra prompt sheets and a
   recruitment sheet. For every speaker recruited in this way, the
   recruiting speaker was given an extra ticket in the final lottery,
   which was held when the recordings had ended.

Publishing calls for participation in local newspapers made it possible
to fine-tune the geographic distribution of speakers. In the snowball
recruitment, speakers were asked to recruit only speakers with a given
demographic profile (region, sex, age).

The total number of speakers included in the HEMPEL database is 3909,
and each speaker is stored using a unique session number (this number
corresponds to the original SpeechDat-II session number).

To determine the dialect region, speakers were asked in which federal
state they had entered school.
This information can easily be provided by the speakers:

------------------------------
Code | Federal state
------------------------------
BB   | Brandenburg
BE   | Berlin
BW   | Baden-Württemberg
BY   | Bayern
HB   | Bremen
HE   | Hessen
HH   | Hamburg
MV   | Mecklenburg-Vorpommern
NI   | Niedersachsen
NW   | Nordrhein-Westfalen
RP   | Rheinland-Pfalz
SH   | Schleswig-Holstein
SL   | Saarland
SN   | Sachsen
ST   | Sachsen-Anhalt
TH   | Thüringen
AT   | Austria
CH   | Switzerland
XX   | OTHER/UNKNOWN
------------------------------

Three different acoustical environments were defined: HOME, OFFICE and
telephone BOOTH (min. 2% of all speakers had to call from a booth).

Speakers were asked to use a fixed network phone (code: PSTN).
Information on the type of handset used is not available.

All demographic data for the speakers is given in the table files
SESSION.TBL and SPEAKER.TBL.


Recording
---------

The recordings were made via a primary rate ISDN interface using Aculab
Dialogics hardware and proprietary software. The signals were not
changed after recording, i.e. they come as 8 kHz 8 bit a-law encoded
ISDN data.

An SNR measure was computed for every signal file in the database. The
software for the calculation of the SNR value was developed by SPEX in
the framework of the SpeechDat-II project. A short description of the
computation can be found in the file DOC/SNR.TXT (kindly provided by
SPEX, Nijmegen, NL).


Transcript
----------

All recordings were transcribed according to the SpeechDat-II
guidelines using the WWWTranscribe software. The recordings were
listened to using closed headphones; each recording could be repeated
as often as wanted; no visual display of the speech signal was
presented.

Annotations contain an orthographic transcription augmented with a set
of four noise markers, plus markers for signal truncation,
mispronunciations and repairs, and incomprehensible parts of speech:

-----------------------------------------------------------------------
Marker | Usage
-----------------------------------------------------------------------
[fil]  | filled pause, hesitations, e.g. "äm", "mh", etc.
[int]  | intermittent non-articulatory noise, e.g. door slam
[spk]  | articulatory speaker noise, e.g. laughing, lip smack
[sta]  | stationary non-articulatory noise, e.g. line noise, background
       | traffic noise
*word  | mispronounced word
**     | incomprehensible speech
~word  | signal truncation at begin of utterance
word~  | signal truncation at end of utterance
-----------------------------------------------------------------------

The use of the annotation markers is described in DOC/TRANSCRP.PDF.

Notice: Slang words were annotated according to their canonical form in
the dictionary. For example, the word "hab" was annotated as "habe".
Sometimes slang words were annotated as mispronounced words.

The annotators were native speakers of German, and they had been
trained with material from SpeechDat-M and -II recordings.


BAS Partitur Format
-------------------

For each recording file there exists an equally named BAS Partitur
Format file with the extension '.PAR'. The format is described in
DOC/PARDOC.HTML.

The following tiers are contained: TR2, ORT, KAN and MAU
[Version 2.1: the MAU tier has been deleted due to inferior
segmentation quality].

The following items were mapped to the 'garbage phone' '':
  '**' '*word' '~word' 'word~'
'[fil]' was mapped to ''
'[spk]' was mapped to ''
'[int]' was mapped to '<#>'
'[sta]' was deleted.


Lexicon
-------

The lexicon for Hempels Sofa uses an extension of the German SAM-PA
phoneme alphabet.
This alphabet includes nasalized vowels, allows non-lengthened pure
free vowels (/a/, /e/, /i/, /o/, /u/, /y/, /2/), and uses /Q/ to
denote the glottal stop.

The canonical pronunciation ('citation form') in the lexicon was
compiled by hand by an experienced phonetician. The text-to-phoneme
rules used are documented at
www.bas.uni-muenchen.de/Bas/BasGermanPronunciation/index.html.
A copy of this document at the time of production can be found in
DOC/PRONCONV. The pronunciation was not double-checked by another
person.


CD-ROM Structure
----------------

/-- DISK.ID
/-- README
/-- COPYRIGH.TXT
/-- HEMPEL -- +- DOC-----+-- BASCORPO.PDF
              |          +-- ISO88591.PDF
              |          +-- SAMPSTAT.TXT
              |          +-- SD131V43[_1..7].{DOC | PDF}
              |          +-- SD132V24.{DOC | PDF}
              |          +-- SNR.TXT
              |          +-- SUMMARY.TXT
              |          +-- TRANSCRP.{HTM | PDF}
              |          +-- PRONCONV------+--index.html
              |          +-- Revalidation_HEMPEL.html
              |
              +- INDEX---+-- CONTENTS.LST
              |
              +- TABLE---+-- LEXICON.TBL    (Iso8859-1)
              |          +-- LEXICON.LATEX  (7bit ASCII LaTeX)
              |          +-- SESSION.TBL
              |          +-- SPEAKER.TBL
              |
              +- DATA ---+-- BLOCKyy-+                (with yy=[10..58])
                                     +-- SESyyzz --+  (with zz=[00..99])
                                                   + -- A1yyzzX1.DEA (signal file)
                                                   + -- A1yyzzX1.DEO (SAM label file)
                                                   + -- A1yyzzX1.PAR (BAS Partitur file)

The BLOCK directories contain the actual recordings. Each call is
written to a SES directory, where the 4-digit number in the directory
name identifies the session uniquely.

The signal and label files are held in the session directory; for each
signal file (extension .DEA) there is the corresponding SAM label file
(extension .DEO) and the BAS Partitur Format file (extension .PAR).
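
The naming scheme above can be turned into file paths mechanically. The
following Python sketch is illustrative only and not part of the
distribution: the function names and the default root directory
"HEMPEL" are assumptions, and the session range is taken from the
CD-ROM description at the top of this file. Not every number in that
range corresponds to an existing session, so callers should still check
that the files exist.

```python
from pathlib import PurePosixPath


def session_files(session: int, root: str = "HEMPEL") -> dict:
    """Build the expected paths of the three files of one session.

    Assumes the layout DATA/BLOCKyy/SESyyzz/A1yyzzX1.<ext>, where the
    4-digit session number is yyzz. The function name and the default
    root are hypothetical.
    """
    if not 1000 <= session <= 5809:
        raise ValueError(f"session {session} outside 1000-5809")
    ses = f"{session:04d}"
    block = ses[:2]  # yy: the first two digits select the BLOCK directory
    ses_dir = PurePosixPath(root) / "DATA" / f"BLOCK{block}" / f"SES{ses}"
    stem = f"A1{ses}X1"
    return {ext: str(ses_dir / f"{stem}.{ext}") for ext in ("DEA", "DEO", "PAR")}


def cdrom_for(session: int) -> str:
    """Return the CD-ROM label that holds a session number."""
    if not 1000 <= session <= 5809:
        raise ValueError(f"session {session} outside 1000-5809")
    return "HEMPEL_01" if session <= 4499 else "HEMPEL_02"
```

For example, session_files(4500)["PAR"] yields
"HEMPEL/DATA/BLOCK45/SES4500/A14500X1.PAR", and cdrom_for(4500) yields
"HEMPEL_02".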


Note: file name extension mappings
----------------------------------

.DEA     8 kHz 8 bit a-law encoded raw signal file
.DEO     ISO 8859-1 encoded SAM label file
.PAR     BAS Partitur Format file (see DOC/PARDOC.HTML)
.DOC     Microsoft Word format
.HTM[L]  HTML formatted document
.LST     ISO 8859-1 encoded log file
.PDF     Adobe Portable Document Format
.TXT     DOS-formatted ISO 8859-1 text file
.TBL     tab-delimited ISO 8859-1 table file
.LATEX   coding in LaTeX, 7bit ASCII


File and directory contents
---------------------------

Note: In general, all file names are in upper case. However, depending
on the operating system, file names may appear in upper or lower case.
Please check whether your mount command for CD media allows mounting
disks with upper case file names.

DISK.ID       Volume identifier for ISO 9660 file systems
README        This file describing the database structure
COPYRIGH.TXT  Copyright text

The following directories contain documentation and related
information:

DOC :   BASCORPO.PDF              LREC 2002 paper describing the corpus
        ISO88591.PDF              ISO 8859-1 (ISO Latin) code table
        SAMPSTAT.TXT              SNR values
        SD131V43.DOC              Database Exchange Format Specification
        SD131V43_{1..7}.PDF       Database Exchange Format Specification
                                  in PDF
        SD132V24.{DOC|PDF}        Orthographic and Transcription
                                  Conventions
        SNR.TXT                   Description of SNR computation by SPEX
        SUMMARY.TXT               German summary file
        TRANSCRP.{HTM|PDF}        the validation and transcription
                                  handbook
        PARDOC.HTML               Documentation of the BAS Partitur
                                  Format
        Revalidation_HEMPEL.html  Revalidation Report of BAS

INDEX : CONTENTS.LST  contents of the database
                      The order of fields in the table is
                      DIR SRC CCD SCD SEX AGE ACC LBO
                      and the fields are separated by tabs. The fields
                      are explained in SD131V43.

TABLE : contains the following DOS-formatted ISO 8859-1 files
        LEXICON.TBL   pronunciation dictionary
                      ORTHOGRAPHY FREQUENCY PRONUNCIATION
        SPEAKER.TBL   the speaker information file with the following
                      tab-separated fields. The fields are explained in
                      SD131V43.
                      SCD SEX AGE ACC
        SESSION.TBL   the session information file with the following
                      tab-separated fields. The fields are explained in
                      SD131V43.
                      SES RED RET SCD AGE SEX ACC REG ENV NET

HEMPEL_emuDB : contains the emuDB
        To load and view in R:
          install.packages("emuR") # if necessary
          library(emuR)
          handle = load_emuDB("/path/to/HEMPEL_emuDB")
          serve(handle)


History
-------

05/01/2003 : First edition 1.0
05/19/2003 : In-house validation (DOC/BASREPORT.TXT)
06/01/2003 : 1.1 repairs after in-house validation
02/17/2004 : 1.2 6 bugs in LEXICON.TBL fixed
07/26/2004 : 1.3 Repairs in the SAM files
                 One new entry in the LEXICON.TBL
                 Update of the README file.
07/22/2004 : 1.4 some repairs to documentation, annotations and lexicon
                 after in-house validation
                 (see DOC/Revalidation_HEMPEL.html)
12/13/2004 : 1.5 sessions ses1358 ses2445 ses2466 ses2540 ses2896
                 ses3292 deleted because they contain only noises
                 sessions ses1332 ses3174 bug fixes in the SAM label
                 file
                 LEXICON.TBL updated
                 BAS Partitur Format files added to each recording
24/08/2012 : 1.7 changed file names of sessions and data to upper case
                 to conform with the documentation
07/06/2013 : 1.8 BPF and LEXICON.LATEX contained French accented vowels
                 a and e (ISO 8859), which is against the BPF format;
                 changed to plain a and e chars in the words:
                 'a' 'Cafe' 'Poincare'.
23/02/2017 : 2   (BAS Clarin Repository Version 3). Converted corpus
                 into an emuR compatible emuDB. Replaced alaw audio
                 files (*DEA) with WAV files using sox. Converted BPF
                 annotation files (*PAR) into emuDB annotation files
                 (*_annot.json). The old BPF files continue to be
                 distributed.
21/03/2017 : 2.1 (BAS Clarin Repository Version 4): Deleted MAU tier
                 from BPF files as well as emuDB due to inferior
                 segmentation quality.
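

Example: reading SESSION.TBL
----------------------------

The table files described under "File and directory contents" are
tab-separated, DOS-formatted ISO 8859-1 text, so they can be read with
standard tools. The Python sketch below is illustrative only: the
function names are hypothetical, and it assumes that SESSION.TBL holds
one session per line with exactly the ten fields listed for it, with or
without a header row repeating the field names. The meaning and value
format of each field are defined in SD131V43, not here.

```python
import csv
import io

# Field order as listed for SESSION.TBL in this README.
SESSION_FIELDS = ["SES", "RED", "RET", "SCD", "AGE",
                  "SEX", "ACC", "REG", "ENV", "NET"]


def parse_session_table(text: str) -> list:
    """Parse tab-separated SESSION.TBL content into a list of dicts."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Skip blank lines and an optional header row (assumption).
        if not fields or fields == SESSION_FIELDS:
            continue
        if len(fields) != len(SESSION_FIELDS):
            raise ValueError(f"expected 10 fields, got {len(fields)}")
        rows.append(dict(zip(SESSION_FIELDS, fields)))
    return rows


def read_session_table(path: str) -> list:
    """Read SESSION.TBL from disk, decoding it as ISO 8859-1."""
    with open(path, encoding="iso-8859-1", newline="") as handle:
        return parse_session_table(handle.read())
```

A call such as read_session_table("HEMPEL/TABLE/SESSION.TBL") would then
return one dictionary per session, which can be filtered by any field,
for example by the environment field ENV.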