BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de

BITS LOGATOME CORPUS DVD-ROM Database

Copyright(C) 2005 by Bavarian Archive for Speech Signals
University of Munich, Germany

Version 1.5
Short Name: BAS BITS-LG Corpus
Date: 2006/04/27
Modification Date: 2010/01/29

The BITS synthesis corpus consists of two parts: a set of logatome
recordings for controlled diphone synthesis and a set of sentence
recordings for unit selection techniques. BITS stands for "BAS
Infrastructures for Technical Speech Processing" and was funded by the
German Ministry of Science and Education during 2003-2005.

This README file describes the BITS Logatome Corpus, which consists of
11036 recordings stored on 4 DVDs. Each DVD contains the recordings, the
annotation files and the meta data files of one of the four professional
speakers, as well as the documentation of the entire corpus.

Note that all documentation files are coded in Unicode UTF-8 if not
stated otherwise.

Table of contents
-----------------

1.) Introduction
2.) Speakers
  2.1.) Recruitment of Speakers
  2.2.) Speakers Profile
3.) Recording
4.) Recording Procedure
5.) Annotation
6.) File Nomenclatura
7.) Structure of each DVD
8.) Other Documentation Files
9.) Contact
10.) History

Introduction
------------

The BITS Logatome Corpus consists of a set of logatome recordings for
controlled diphone synthesis. Speech synthesis using concatenative
techniques is maturing to a point where standard procedures are being
implemented in a variety of products. However, because of the
considerable costs, most small and medium-sized companies as well as
university labs cannot afford to produce the required speech resources
on their own.
Although there are some public domain German diphone voices available
for research purposes (e.g. MBROLA), there is still a lack of publicly
available synthesis resources. The BITS synthesis corpus, recorded and
produced by BAS, fills this gap. The work was funded by the German
Ministry of Education and Science (grant no 01 IV B01).

Speakers
--------

Recruitment of speakers:
------------------------

45 speakers were invited to a casting. They were asked to read 90
logatomes that contained a subset of our diphone set, so that three
target sentences containing nearly all German phonemes could be
synthesised. Based on a ranking according to naturalness and
pleasantness, 10 speakers were selected as nominees. After an overall
evaluation - by specialists in speech synthesis and by the BITS group -
the best four speakers (two male and two female) were chosen for the
final recordings.

More information about the recruitment of the speakers can be found
under:
/DOC/HTML/Specification_logatome_corpus.pdf

Speakers Profile:
-----------------

Four professional speakers between 40 and 45 years of age were recorded.
All speakers were of German nationality and had foreign language
competence in at least English.

More information about the speakers can be found in the table
/DOC/SRPK.TBL.
SRPK.TBL contains a list that gives information about the speakers.
The list has the following columns (separated by tabs):

  ID          : speaker id (SES200[1-4])
  Sex         : M = male, W = female
  Age         : age of the speaker at the time of the recordings
  Name        : full name
  Nationality : nationality of the speaker
  Size        : height in cm
  Weight      : weight in kg
  ACC         : accent of the speaker, determined by the federal state
                in which the speaker started school
  Edu         : education of the speaker
  PoL         : current place of living
  Prof        : current occupation
  FL          : foreign languages
                  ENG - English
                  FR  - French
                  I   - Italian
                  EL  - Greek
  Smk         : smoker (y=yes, n=no, cas=casually)

Recording:
----------

The speech signal was recorded in three channels (headset microphone,
room microphone and laryngograph signal). The sampling rate is 48 kHz
with 16 bit quantization. All signals were recorded via a Yamaha 02R
digital sound mixer directly to hard disk using the multi-channel
recording software SpeechRecorder.

- Channel 0 : close talk microphone (Beyerdynamic NEM 192) positioned
              7 cm to the right of the mid-sagittal plane at the height
              of the upper lip.
- Channel 1 : laryngograph signal (LaryngoGraph PCLX)
- Channel 2 : large membrane condenser microphone (Neumann Type TLM 103)
              60 cm from the mouth.

The channels were separated into standard WAV format files; no further
processing was performed, in order to avoid any undesired degradation of
the signals.

Recording Procedure:
--------------------

The speaker was seated in an insulated room with low reverberation. The
positions of the chair and the room microphone were marked on the floor.
Before the recordings the speaker was asked to put on the headset
microphone and the laryngograph electrodes. During the session the
speech prompts were displayed through a window using the program
"SpeechRecorder". Three supervisors monitored the recording, and a
prompt was repeated until all three supervisors agreed on its quality.
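
As a quick sanity check of the signal format described in the Recording
section (48 kHz sampling rate, 16 bit quantization, one WAV file per
channel), the following Python sketch reads the header of a single audio
file with the standard library. The function name describe_wav and the
assumption that each file holds exactly one channel of plain PCM data
are illustrative and not part of the corpus tools.

```python
import wave

# Values taken from the Recording section of this README.
EXPECTED_RATE_HZ = 48000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit quantization


def describe_wav(path):
    """Return basic header information for one WAV file.

    Assumption (not stated explicitly in the README): every file holds
    exactly one channel of uncompressed PCM data.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        info = {
            "channels": wav.getnchannels(),
            "rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }
    # Compare the header against the format documented in the README.
    info["matches_readme"] = (
        info["channels"] == 1
        and info["rate_hz"] == EXPECTED_RATE_HZ
        and info["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH_BYTES
    )
    return info
```

A file for which "matches_readme" is False is not necessarily broken; it
only means that the header differs from the format assumed here.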

More information about the recording procedure can be found under:
/DOC/HTML/Preparation_and_Execution_of_Recordings_8_2.pdf

Annotation:
-----------

All logatomes were inspected manually using the software Praat, and the
two phonemes forming the target diphone were segmented phonetically. The
results of this procedure are stored in the directory ANNOT/SES####
(#### = speaker number) in three different file formats:

- BAS Partitur Format (*.par) with the following tiers:
    ORT : orthographic representation of the prompted logatome
    KAN : canonical pronunciation (SAM-PA) of the logatome
    SAP : segmentation of phonemes: preceding and trailing parts of
          the logatome are segmented into ''; in between there are
          2-4 segments describing the two phonemes (more than two
          segments are required if a plosive is involved, since
          plosives are segmented into closure and burst phase
          separately; see details below)

- Annotation Graphs (XML, *.ags)
    This is basically the same information in XML form. See the ag.dtd
    and metadata.dtd in directory DOC as well as
    http://agtk.sourceforge.net/ for details.

- TextGrid (praat, *.TextGrid)
    The original segmentation results in the Praat format.

Short description of segmentation procedure

Please note that the segmentation described here refers to phoneme
boundaries. What is segmented is not the classic diphone as used in
speech synthesis, but rather the two phonemes that form the diphone.
Users of the corpus have to apply their own specific cutting technique
to derive the final diphone segment from this segmentation (for instance
by using the classic 40/60 rule).

For the phonetic annotation all logatomes were segmented in a first pass
with MAUS into German SAM-PA. (More about the SAM-PA encoding used for
the annotation can be found under
/DOC/HTML/Conventions_for_segmentation_8_5e.pdf)
The logatomes were then pre-segmented according to their canonical form
using MAUS. This guaranteed that the logatome contained the diphone in
correct SAM-PA.
Only three boundaries were automatically presented to the segmenter:

- beginning of the diphone
- border between the two phonemes
- end of the diphone.

In a second pass a group of ten to twelve trained phoneticians manually
corrected the pre-segmented sentences and logatomes. After that, three
phoneticians whose segmentations were consistent with each other
corrected the segmentations in a third pass. In a last step all
segmentations were reviewed by the team supervisor.

The following rules of segmentation were used:

- the placement of boundaries is primarily based on auditory judgement.
- the boundaries of segments are always placed at positive
  zero-crossings of the oscillogram (only in SAP TextGrid tiers!).
- the placement of the boundaries should be checked against sonagram
  and oscillogram.
- within transitions in which both of two adjacent phonemes can be
  heard, the boundary is placed in the middle of this transition
  (50% rule).
- voiced (periodic) elements start with the first clearly identifiable
  glottal pulse.
- the boundaries of segments with low intensity (e.g. /h/, aspiration)
  are set where the signal can be clearly distinguished from the
  background noise. Noises of breathing - if clearly recognised - have
  to be cut off from the friction or aspiration.

Special labels aside from standard German SAM-PA:

  Q : glottal stop (SAM-PA: ?)
  ~ : preceding vowel is nasalized
  § : preceding phoneme contains an audible lip smack
  q : preceding vowel was glottalized

More information about the annotation of the corpus can be found under:
/DOC/HTML/Conventions_for_segmentation_8_5e.pdf

File Nomenclatura:
------------------

The names of both audio and annotation files consist of the following:

  LG####%%%%_$

with
  #### : speaker id 2001 - 2004
  %%%% : logatome id (see table /DOC/BITS-LG.TBL)
  $    : channel 0 - 2

File name extension mappings

  .TextGrid   Praat label file with interval tiers
  .wav        Audio file
  .txt        Text file
  .html       HTML file
  .par        BAS Partitur Format file

Structure of each DVD
---------------------

Each DVD contains the following:

  README : this file
  DATA/  : the recordings of one of the four speakers
  ANNOT/ : the annotation files of one of the four speakers
  DOC/   : the documentation files (start with DOC.HTML)
  PLAY/  : simple concatenation script (see README there)

The DOC/ directory contains the following:

  README.PAR    : brief documentation of the BAS Partitur Format (BPF)
  SPRK.TBL      : speaker profiles (see above)
  KNOWN-ERRORS  : list of known errors that cannot be fixed
  BITS-LG.TBL   : logatome list (see below)
  DOC.HTML      : start of main documentation
  HTML/         : main documentation files
  PUBLICATIONS/ : publications

Other Documentation Files
-------------------------

KNOWN-ERRORS - known errors that cannot be fixed

BITS-LG.TBL - logatome list
  This file contains a 5-column table describing the spoken logatomes
  of the corpus:

  where:
    : logatome id 0001-2795
    : logatome prompt during recording
    : SAM-PA transcript of the complete logatome
    : non-German language phoneme contained
        ENG    : one English phoneme
        FR     : one French phoneme
        ENG/FR : diphone English-French
        FR/ENG : diphone French-English

Contact
-------

For questions, remarks, bug reports etc.
please contact

  Florian Schiel
  schiel@bas-services.de
  +49-89-2180-5751

History
-------

15.03.06 : Version 1.0
27.04.06 : Version 1.1 : Documentation re-worked,
                         BAS Partitur Format files added
04.08.06 : Version 1.2 : Several minor bugs fixed,
                         AGS files added to annotation
09.08.06 : Version 1.3 : BPF tiers ORT and SAP contained 8-bit ASCII,
                         which does not conform to BPF.
                         Replaced by LaTeX codes.
25.08.06 : Version 1.4 : Bug fix in AGS file creation,
                         all *.ags files created anew
29.01.10 : Version 1.5 : Bug fix in annotation files
                         LG20031321_0.par and LG20031321_0.TextGrid:
                         the burst segment of the glottal stop /Q_b/
                         was missing -> fixed
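
As an illustration of the naming scheme LG####%%%%_$ described under
File Nomenclatura, the following Python sketch splits a file name into
speaker id, logatome id and channel. The names parse_bits_name, BitsName
and CHANNEL_DESCRIPTION are hypothetical helpers written for this
example; they are not part of the corpus or of the PLAY/ script. The
value ranges (speakers 2001-2004, logatomes 0001-2795, channels 0-2) are
taken from this README.

```python
import re
from typing import NamedTuple

# "LG", 4-digit speaker id, 4-digit logatome id, "_", 1-digit channel,
# followed by an optional file name extension such as .wav or .TextGrid.
_NAME_PATTERN = re.compile(r"^LG(\d{4})(\d{4})_(\d)(?:\.\w+)?$")

# Channel assignment as listed in the Recording section.
CHANNEL_DESCRIPTION = {
    0: "close talk microphone",
    1: "laryngograph signal",
    2: "large membrane condenser microphone",
}


class BitsName(NamedTuple):
    speaker: int
    logatome: int
    channel: int


def parse_bits_name(filename):
    """Split a name such as 'LG20031321_0.par' into its components."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("not a BITS-LG file name: %r" % filename)
    speaker, logatome, channel = (int(part) for part in match.groups())
    # Range checks follow the id ranges documented in this README.
    if not 2001 <= speaker <= 2004:
        raise ValueError("speaker id out of range: %d" % speaker)
    if not 1 <= logatome <= 2795:
        raise ValueError("logatome id out of range: %d" % logatome)
    if channel not in CHANNEL_DESCRIPTION:
        raise ValueError("channel out of range: %d" % channel)
    return BitsName(speaker, logatome, channel)
```

For the annotation file mentioned in the History section,
parse_bits_name("LG20031321_0.par") yields speaker 2003, logatome 1321
and channel 0. Names outside the documented ranges raise a ValueError
instead of being accepted silently.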