BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de


Infos to VM Data Sets
=====================

Version 2.3
F. Schiel 28.10.2003 / 02.03.2004

This document contains information regarding the usage of the German VM
speech corpus for ASR or other experiments where a defined distinction
between training, development and test sets is necessary. The subsets
defined here are not of any official nature; they are merely given as
guidance for future experiments and to allow authors to refer to defined
subsets of the German VM corpus.

The sets given here were newly defined in Nov 2003. Please be aware of
this if you have already worked with an earlier set definition. The older
set definition can still be downloaded from

  ftp://ftp.bas.uni-muenchen.de/pub/BAS/VM/SETS.20031112

-----------------------------------------------------------------------

Division and basic numbers
--------------------------

The VM corpus is divided into VM1 (recordings before 1997) and VM2
(recordings after 1996). The two sets differ in recording conditions and
tasks (see the general documentation of the VM corpora).
The training (TRAIN), development (DEV) and test (TEST) sets currently
used in our experiments on the VM corpus are a compromise between the
following constraints:

- Each speaker is allowed in only one set (hard constraint).
- For each speaker there must be at least one complete dialogue sequence
  of turns, to allow speaker adaptation algorithms to be applied (hard
  constraint).
- Speakers should be distributed equally across sexes in all sets (soft
  constraint).
- Recordings should be distributed equally across recording sites in all
  sets, to cover possible accent preferences at one site (soft
  constraint).
- The number of words in the DEV and TEST sets should be around 14000,
  to allow significant differences of 0.5% in the range of 95% word
  accuracy (p=0.005).

Basic numbers of the subsets of the total corpus:

SET    WORDS    TURNS  DIALOGS   LEX  SPEAK  TAKEN FROM VOLUMES
---------------------------------------------------------------------
DEV     26989    1222      125  2218     48  14.1 15.1 20.1 21.1 22.1
                                             24.1 29.1 30.1
TEST    24470    1223      126  1946     46  14.1 15.1 20.1 21.1 22.1
                                             24.1 29.1 39.1 48.1
TRAIN  438718   24435      962  9045    748  1.1 2.1 3.1 4.1 5.1 7.1
                                             12.1 14.1 15.1 20.1 21.1
                                             22.1 24.1 29.1 30.1 38.1
                                             39.1 48.1 49.1 53.1

Basic numbers for the subsets in VM1 and VM2:

VM1:
SET    WORDS    TURNS   LEX  SPEAK
-------------------------------------
DEV     15084     630  1537     35
TEST    14615     631  1342     33
TRAIN  285280   12600  6472    629

VM2:
SET    WORDS    TURNS   LEX  SPEAK
-------------------------------------
DEV     11905     592  1397     13
TEST     9855     592  1264     13
TRAIN  153438   11835  5238    119


Turn listings
-------------

Turn listings for all 6 subsets are stored in:

  VM<#>_<SET>

with:
  #   : 1|2 (VM1, VM2)
  SET : TRAIN, TEST, DEV

Format: TAB-separated fields, one turn per line.

You may obtain the 3 sets for the total corpus by concatenating the
corresponding subset lists.


Lexica
------

Lexica and word listings for all 6 subsets:

  VM<#>_<SET>.list
  VM<#>_<SET>.lex

Again, you may obtain lexica for the total corpus by concatenating and
sorting the corresponding subsets.
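The concatenation described above can be sketched in a few lines of
Python. The subset files are modelled here as in-memory lists of lines;
with the real corpus you would read the VM<#>_<SET> and VM<#>_<SET>.lex
files instead. The lexicon entries below are invented toy content, not
real VM lexicon lines:

```python
def concat_turn_lists(vm1_lines, vm2_lines):
    """Turn list for the total corpus: plain concatenation of VM1 + VM2."""
    return vm1_lines + vm2_lines

def merge_lexica(vm1_lex, vm2_lex):
    """Lexicon for the total corpus: concatenate, drop duplicates, sort."""
    return sorted(set(vm1_lex) | set(vm2_lex))

# Toy example (invented entries):
vm1_lex = ["Abend\ta:bEnt", "Montag\tmo:nta:k"]
vm2_lex = ["Montag\tmo:nta:k", "Termin\ttERmi:n"]
print(merge_lexica(vm1_lex, vm2_lex))
# Entries shared by VM1 and VM2 (here "Montag") appear only once.
```

Dropping duplicates is what makes the merged lexicon size smaller than
the sum of the VM1 and VM2 lexicon sizes in the tables above.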
Remarks
-------

- The DEV and TEST sets of VM1 are taken solely from the volume VM14.1.
  This has historical reasons: VM14.1 was the last official evaluation
  data set in the VM1 project.
- The low number of speakers in the DEV and TEST sets is a compromise:
  to test speaker adaptation techniques it is required that enough data
  from single speakers are contained in these sets. Therefore the total
  number of speakers is rather low.


Example ASR Results
-------------------

Using the subsets defined above we currently (Jan 2004) obtain the
following accuracies using an HTK recognizer and a bigram language model
trained solely on the training corpus:

Trained on VM1_TRAIN + VM2_TRAIN; tested on the VM1_DEV + VM2_DEV set
with lexicon VM1_DEV.lex + VM2_DEV.lex + VM1_TEST.lex + VM2_TEST.lex
(total: 2944 lexical entries):

Monophones:          WA = 64.27%  (52 iterations of HERest)
                     (512 mixtures per state)

Triphones crossword: WA = 64.51%  (37 iterations of HERest)
                     (8 mixtures per state, same number of parameters
                     as the monophone system)

Some more details for those who are interested:

- 12 standard MFCC + energy + velocity + acceleration (39 coefficients)
- Diagonal covariance matrices
- 3-5 states per phoneme
- 43 phoneme classes (extended German SAMPA) + garbage + voice garbage
  + silence + laugh + breath (48 models)
- Models initialized using the MAU tier of the BPF from one third of
  TRAIN
- Re-estimation and splitting of mixtures after 6 iterations on the
  total TRAIN set; testing after every two iterations on DEV
- Weight of the language model fixed to 6.5 (option -s); beam search
  width 100.0

No testing on TEST has been done so far.


History
-------

21.12.03 : Version 2.1 : Corrections of pronunciations in vm_ger.lex
                         transferred to the lex and list sets
29.01.04 : Version 2.2 : Evaluation of monophones on the new sets
02.03.04 : Version 2.3 : Evaluation with crossword triphones added
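The word accuracy (WA) figures quoted in the results section follow the
usual HTK definition as reported by HResults: WA = (N - S - D - I) / N,
where N is the number of words in the reference transcription and S, D,
I are the substitution, deletion and insertion counts. A minimal sketch;
the error counts in the example are invented for illustration, not the
actual counts of the experiments above:

```python
def word_accuracy(n_ref, substitutions, deletions, insertions):
    """HTK-style word accuracy in percent: WA = (N - S - D - I) / N.

    Unlike plain correctness, insertions count against the score,
    so WA can in principle even become negative."""
    return 100.0 * (n_ref - substitutions - deletions - insertions) / n_ref

# Invented example at roughly the scale of the DEV set (~27000 words):
print(f"WA = {word_accuracy(26989, 6000, 2000, 1642):.2f}%")
```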