=== IMPORTANT ===
This document VALRESPO.TXT contains the original validation report by SPEX
plus comments and descriptions of the corrections and other actions by the
database producers. SPEX comments begin with "=>", responses to these with
"<*>"
=== IMPORTANT ===

SPEX / Dept. of Language and Speech
University of Nijmegen
Erasmusplein 1
NL-6525 HT Nijmegen
The Netherlands

SUBJECT: Validation German Speaker Verification corpus
AUTHORS: Henk van den Heuvel, Eric Sanders
VERSION: 1.0
DATE   : July 24, 2001

This speech database was validated by SPEX, Nijmegen, the Netherlands, to
assess its compliance with the SpeechDat format and content specifications,
as documented in Deliverables 1.1.3, 1.2.3, 1.3.1, 1.3.2 and 1.3.3 of the
project. The validation results are contained in this document.

In the validation procedure we systematically check a list of validation
criteria for a range of subjects. In the following sections we evaluate
these criteria one by one. Validation results that call for attention
because of deviations from the SpeechDat specifications are marked by =>.
They can be easily extracted/'grepped' in this way.

The following subjects were validated:
 1 DOCUMENTATION
 2 DATABASE STRUCTURE, CONTENTS AND FILE NAMES
 3 ITEMS
 4 SAMPLED DATA FILES
 5 ANNOTATION FILES
 6 LEXICON
 7 SPEAKERS
 8 ENVIRONMENTS
 9 TRANSCRIPTION
The document is concluded by
10 SUMMARY

====================================================================
1. DOCUMENTATION

- File DESIGN.DOC; & deliverables SD131 and SD132 can be handy
  OK, a DESIGN.DOC is provided together with SpeechDat deliverables
  SD113, SD131, SD132, SD133.
- Language of doc file: English
  OK
- Contact person: name, address, affiliation
  OK, cover page
- Number of CDs
  OK, section 1. There is one DVD disk.
=> We recommend mentioning the number of speakers and number of calls per
=> speaker in section 1 already.
<*> ok, done.
- Contents of each CD OK, section 1 - The directory structure of the CDs OK, section 1.3 - Description of all the items in the corpus OK, sections 1.2, 3, and 10 => Section 10.4.2 (just above table 28) uses 'row' where 'column' is intended. <*> checked for all occurrences of "row" and corrected them accordingly - Prompting . linguistic specification (and motivation) for the prompting material (in case of additional optional items) . connection of sheet items to item numbers on CD . sheet example . items must be spread over the sheet to prevent list effects (e.g. three yes/no questions immediately after another are not allowed) OK, section 2.3 => A reference to section 8 is in place. <*> ok, done. - Naming conventions for directories and files OK, sections 1.2 and 1.3 - Speaker recruitment OK, section 2.2 => It is unclear how a list of potentially interested speakers was made. => It is also unclear which strategies were used to recruit sufficient speakers. <*> no information on this available - Speaker demographics . which regions, how many of each . motivation for selection of regions . which age groups, how many of each . sexes: males, females, also children?; how many of each. . how many calls by each speaker OK, section 4 => There are some errors in the speaker tables in section 4. => See section 7 of this report. <*> the table given in this section is the target distribution; the real distribution is given in the tables SPEAKER.TBL and of course the annotation files. The documentation text was modified slightly to indicate that the table given is the target distribution. 
- Analysis of frequency of occurrence of the sub-word units represented in the phonetically rich sentences (either of phones, biphones, triphones) OK, section 10.5 - Recording platform and telephone link description (which part is digital) OK, section 2.1 - Number calls from quiet environment and from noisy environment (proportion should be 70% and 30% respectively) OK, section 6.1 - Number of calls over fixed network and over GSM network (proportion should be 50% and 50% respectively) OK, section 6.2 - Signal characteristics (number of bits per sample; bandwidth; coding type; compression procedures) OK, section 1.1 - The format of the speech files (A-law, 8 bit, 8 kHz, uncompressed) OK, section 1.1 - The format of the annotation files (SAM label files) OK, section 1.4 - Annotation . procedure . quality assurance . character set used for annotation (transcription) (ISO-8859) . annotations symbols for non-speech acoustic events must be mentioned at least for Filled Pause, Speaker Noise, Stationary Noise, Intermittent Noise. . list of symbols used to denote word truncations, mispronunciations not understandable speech, specific GSM distortions. . case sensitivity of transcriptions OK, section 2.4 - Lexicon information . Procedures to obtain phonemic forms from orthographic input (lexicon generation and lay out) . (Reference to) SAMPA symbols used . case sensitivity of entries (matching the transcriptions) OK, section => /a~/ is mentioned but not present in the lexicon. => /Z/ is not included in the lexicon. => The acronym IPSK is not explained => ! it is essential that the 15% erroneous entries were also manually corrected ! Please add this information. The lexicon is not acceptable if manual correction was not carried out. <*> ok, done. - Only one spelling of each word is allowed. Therefore a list of normalised spellings for words with alternative spellings should be included (SPELLALT.DOC). Otherwise a statement why such a list is not necessary. 
OK, section 5 tells how this was dealt with. - Information on test (set) specification => Information on subdivision of the database into a train and test is missing. => Typically the first five sessions of each speaker recorded with the => preferred hand set should be used for training/enrollment. <*> for VERIF1DE, first five sessions for every speaker have been <*> used as test material (750 sessions), the remaining 15 plus all material <*> recorded during repair sessions (3684) as training material. - Indication of how many of the files were double checked by the producer together with percentage of detected errors => This information is missing in section 2.4 <*> information is not available - The validation report made by SPEX (VALREP.TXT) is referred to => OK, section 1.3. It should be referred to as VALREP.TXT => The last line in section 1.3 is not correct and not acceptable. => Comments are to be included in a separate document. <*> ok, done. Responses to VALREP.TXT are in this document. ========================================================================== 2. DATABASE STRUCTURE CONTENTS AND FILE NAMES - Directory / subdirectory conventions Format of directory tree should be \\\ . data base: defined as <#> can be FIXED, MOBIL, VERIF <#> is 0 for SpeechDat(M) and 1 for SpeechDat is the ISO two-letter code for the language . block : defined as BLOCK where is a progressive number from 00 to 99. Block numbers are unique over all CDs. They correspond to the first two digits of below. . session: defined as SES where is the session code also appearing in file name OK - All text files should be in MS-DOS format () at line ends OK - A README.TXT file should be in the root describing all (documentation) files on the CD-ROM. OK - A file containing a shortened version of the volume name (11 chars max.) should be in the root directory. The name of this file is DISK.ID. This file supplies the volume label to UNIX systems that cannot read the physical volume label. 
Example of contents: FIXED1EN_01. OK - A copyright statement should be present in the file COPYRIGH.TXT (root) OK - Documentation should be in \\DOC . DESIGN.DOC . TRANSCRIP.DOC (optional) . SPELLALT.DOC (optional) . SAMPALEX.PS . ISO8859<1,2>.PS . SUMMARY.TXT . SAMPSTAT.TXT OK, optional files are absent. TRANSCRIP.DOC is provided as TRANSCRIP.PDF. Extra are SD113V33.DOC, SD131V43.DOC, SD132V24.DOC, SD133V19.DOC. - The contents list (CONTENTS.LST) is in \\INDEX OK - Tables should be in \\TABLE . SPEAKER.TBL . LEXICON.TBL . REC_COND.TBL . SESSION.TBL OK - Index files should be in \\INDEX Only CONTENTS.LST and the speaker list files are mandatory. => The speaker list files are missing <*> ok, speaker list files have now been created - The speaker index files should obey the following nomenclature .LST e.g. C10023EN.LST => The speaker list files are missing <*> ok, speaker list files have now been created - The item index files (if present) obey the nomenclature .LST e.g. A1ENN3.LST (see below for item_code) Not provided - Prompt sheet files (optional) should be in \\PROMPT Not provided - All sessions indicated in the documentation SUMMARY.TXT are present on the CDs OK - File naming conventions All file names should obey the following pattern: DDNNNNCC.LLF DD : database identification code For SpeechDat : A1 = fixed net, B1 = mobile, C1 = speaker verification NNNN : session code 0000 to 9999 CC : item code; first character is item type identifier, second character is item number LL : ISO-639 language code (with extensions) F : speech file type A is for A-law O is for Orthographic label file OK - Correct item codes should be used: A1-2 : common application words B1 : sequence of isolated digits C3 : credit card number C4 : PIN code L1-3 : spelled names/words O1,7-8: forename & surname S0-9 : phonetically rich sentences OK - NNNN in filenames is not in conflict with BLOCK and SES numbers in pathname OK - Contents lowest level subdirectories should be of one call only OK 
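
The file naming pattern DDNNNNCC.LLF described above can be checked mechanically. The following is a minimal sketch, not the tooling used by SPEX: the function names parse_name and expected_dirs are hypothetical, and the directory names BLOCKnn and SESnnnn are assumed from the description of the directory tree in this section (block number = first two digits of the session code).

```python
import re
from typing import NamedTuple

# DDNNNNCC.LLF, following the field descriptions in this section.
FILENAME_RE = re.compile(
    r"^(?P<db>[A-Z][0-9])"      # DD  : database identification code (A1, B1, C1)
    r"(?P<session>[0-9]{4})"    # NNNN: session code 0000 to 9999
    r"(?P<item>[A-Z][0-9])"     # CC  : item type identifier + item number
    r"\.(?P<lang>[A-Z]{2})"     # LL  : ISO-639 language code
    r"(?P<ftype>[AO])$"         # F   : A = A-law speech, O = orthographic label
)


class SpeechDatName(NamedTuple):
    db: str
    session: str
    item: str
    lang: str
    ftype: str


def parse_name(filename: str) -> SpeechDatName:
    """Split a file name into its fields; raise ValueError if it does not match."""
    m = FILENAME_RE.match(filename.upper())
    if m is None:
        raise ValueError(f"not a DDNNNNCC.LLF file name: {filename!r}")
    return SpeechDatName(**m.groupdict())


def expected_dirs(name: SpeechDatName) -> tuple:
    """Assumed layout: BLOCK<first two session digits> and SES<session code>."""
    return (f"BLOCK{name.session[:2]}", f"SES{name.session}")


if __name__ == "__main__":
    n = parse_name("C11001C2.DEA")
    print(n, expected_dirs(n))
```

A check that NNNN does not conflict with the BLOCK and SES numbers in the path then reduces to comparing expected_dirs() with the last two path components of each file.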
- Empty (i.e. zero-length) files are not permitted OK - Missing items per speaker Check with documentation (SUMMARY.TXT) OK, there are no items missing - File match: For each label file there must be one speech file and vice versa. OK - Part of the corpus is designed for training and a smaller part for testing. => A subdivision in train and test set is missing. <*> ok, C1TRNDE.SES and C1TSTDE.SES have now been provided. - All table files, and index files should report the field names as the first row in the files using tabs as in the data records following. OK - The contents of the database as given in CONTENTS.LST should comprise . CD-ROM volume name (VOL:) . full pathname (DIR:) . speech file name (SRC:) . corpus code (CCD:) . corpus repetition (CRP:) . speaker code (SCD:) . speaker sex (SEX:) . speaker age (AGE:) . speaker accent (ACC:) . orthographic transcription of uttered item (LBO:) The first line should be a header specifying the information in each record. This file must be supplied as an ASCII TAB delimited file. OK - The contents of the SUMMARY.TXT files should comprise: . The full directory name where speech and label files are to be found . the session number . a string of typically N codes. Each item present is represented by its code. If the item is missing, a '--' should appear. . recording date . recording time of first item . optional comment text . all these fields are separated by spaces . Note: The contents of the SUMMARY.TXT file are not CD-dependent OK ====================================================================== 3. ITEMS - 1 sequence of 10 isolated digit (code B1) . each sequence must include all digits . optional are hash and star OK, 11 digits per prompt are used to include both variants of '2'. - 2 connected digits (code C3-4) - 16 digit credit card number . read . set of 150 . unique number for each speaker . if there is a checksum then formula must be provided - 6 digit PIN code . read . set of 150 . 
unique number for each speaker . digits must appear numerically on the sheet, not as words OK, 150 different credit card numbers are used, 20 times each OK, 150 different PIN codes are used, 20 times each - 3 spelled names (code L1-3) . L1 is forename & surname spelling of O1 from set of 150, always the same for a speaker . other two are names/words (set of 40) . these two are read . equal balance of all vocabulary letters artificial words can be used to enforce this balance . average length at least 7 letters OK, L1 linked to O1, set of 150 L2-3 are a list of 150 words; 40 words per speaker; each words appears 40 times. Average spelling length: 7.37 letters per prompt - 2 common application words (code A1-2) . read . set of 30 should be used, 25 of which are fixed for all . same 2 words to be used in all calls by a speaker OK, 30 different application words are found; each word appears 200 times Same two words are used in all calls by a speaker => There is no equivalent for 'Program', as was included in the => SpeechDat collections. <*> this cannot be changed. - 10 phonetically rich sentences (code S0-09) . read . 4 common sentences for all speakers drawn from set of 250 (S0-3) . 6 speaker specific sentences drawn from 5 sets of 120 sentences per set (S4-9) OK Common sentences are in S0-3. For the speaker specific sentences (S4-9) we found 599 different sentences. => The following 33 speakers had 119 different sentences instead of 120: 0002 0007 0012 0017 0022 0027 0032 0037 0042 0047 0052 0057 0062 0067 0072 0077 0082 0087 0092 0097 0102 0107 0112 0117 0122 0127 0132 0137 0142 0147 <*> this cannot be changed - 3 person names (code O1,7-8) . O1 is always the same fore-surname combination for a speaker (from set of 150) . O7-8: forename & surname combination from set of 10 (selected for each speaker from 150 FDB set) . 
since 2*20 = 40 names are recorded per speaker, there should be 4 examples of each name per speaker OK O1 is linked to L1; set of 150 O7-8 are from a set of 10 for each speaker The following completeness checks are performed on obligatory SpeechDat items only: 1. Structurally missing items None of the items are structurally missing. 2. Incidentally missing items a. files that are not there There are no missing files b. files with empty transcriptions in the LBO label field (effectively missing files) 28 effectively missing files were detected according to the following distribution over the items: A1: 2 A2: 4 C4: 1 L2: 1 O1: 1 O8: 3 S0: 2 S1: 1 S2: 3 S3: 2 S5: 3 S6: 1 S7: 2 S8: 1 S9: 1 The distribution per item (per speaker) is: A1: 2 A2: 4 B1: 0 C3: 0 C4: 1 L1: 0 L2: 1 L3: 0 O1: 1 O7: 0 O8: 3 S0: 2 S1: 1 S2: 3 S3: 2 S4: 0 S5: 3 S6: 1 S7: 2 S8: 1 S9: 1 c. corrupted speech files If we regard utterances which have only truncated or mispronounced words as corrupted files, and merge these with the effectively missing files then the following distribution emerges : 16 A1 30 A2 1 C4 1 L2 1 L3 2 O1 2 O7 7 O8 2 S0 1 S1 3 S2 4 S3 2 S4 4 S5 1 S6 2 S7 1 S8 1 S9 (This will not be used to reject or approve a database but it will be supplied as supplementary information.) d. files containing truncation and mispronunciation marks (*,**,~ are counted in the transcriptions of the individual items to get an idea of distorted speech data. This will not be used to reject or approve a database but it will be supplied as supplementary information.) A1: 33 A2: 39 B1: 72 C3: 59 C4: 44 L1: 73 L2: 83 L3: 88 O1: 39 O7: 50 O8: 65 S0: 39 S1: 32 S2: 49 S3: 65 S4: 74 S5: 92 S6: 108 S7: 99 S8: 108 S9: 74 3. Overall conclusion SpeechDat has the following criteria for missing items: . For SDB databases all items must be 95% complete for at least 120 speakers, . As missing files are counted: absent files, and files containing non-speech events only. . 
There will be no further comparison of prompt and transcription text
    in order to decide if a file is effectively missing. As a consequence:
    If there is some speech in the transcription, then the file will NOT be
    considered missing, even if it is in fact useless.
   OK, for 150 speakers an item may be missing 0.05 * 150 = 7.5, i.e. at
   most 7, times. The listing under b. (see above) shows that all items
   fulfil this criterion.

===========================================================================
4. SAMPLED DATA FILES

1 Coding
  . A-law, 8 bit, 8 kHz, no compression
  OK

2 Sample distribution
  Several sample statistics are generated: File length, clipping rate,
  mean sample value, Signal-to-Noise Ratio (SNR).
  Statistics were generated on file level by the producer of the database,
  using SPEX software. The results were delivered to SPEX. SPEX compiled
  histograms on the basis of these results. These histograms are presented
  below, both on file level and on directory (call) level.
  The histograms are presented as they are and not further interpreted by
  SPEX. On the basis of these data the user of the database should be able
  to decide which acoustic quality is still acceptable for the application
  at hand.
  Statistics on the acoustics of individual speech files can be retrieved
  from file \DOC\SAMPSTAT.TXT.
  The columns in SAMPSTAT.TXT have the following meaning:

  file max min #samples cliprate mean snr
  C11001C2.DEA:16384:-13056:80000: 0.00: -4.28: 35.89

2.1 File length
  We calculated the length of the files in seconds in order to trace
  spurious recordings, i.e. files of extraordinary length.
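
The SAMPSTAT.TXT column layout shown above, combined with the 8 kHz sampling rate, is sufficient to recover the duration of each file. The sketch below is an illustration only and not the SPEX software; the names SampStat, parse_sampstat_line and duration_seconds are hypothetical, and the colon-separated layout is assumed from the single example record given above.

```python
from typing import NamedTuple

SAMPLE_RATE_HZ = 8000  # 8 kHz, as stated for the speech files in this report


class SampStat(NamedTuple):
    file: str
    max_sample: int
    min_sample: int
    n_samples: int
    clip_rate: float
    mean: float
    snr: float


def parse_sampstat_line(line: str) -> SampStat:
    """Parse one record: file:max:min:#samples:cliprate:mean:snr.

    Records with empty values raise ValueError, so they can be reported
    instead of silently ending up in the lowest histogram bin.
    """
    fields = [f.strip() for f in line.split(":")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    name, mx, mn, n, clip, mean, snr = fields
    return SampStat(name, int(mx), int(mn), int(n),
                    float(clip), float(mean), float(snr))


def duration_seconds(rec: SampStat) -> float:
    """File length in seconds = number of samples / sampling rate."""
    return rec.n_samples / SAMPLE_RATE_HZ


if __name__ == "__main__":
    rec = parse_sampstat_line(
        "C11001C2.DEA:16384:-13056:80000: 0.00: -4.28: 35.89")
    print(rec.file, duration_seconds(rec))
```

For the example record, 80000 samples at 8000 samples per second correspond to a file length of 10 seconds.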
  Duration distribution over all items:

  Length (s)   #Occurrences
   0 -  1 :     777
   2 -  3 :    1462
   3 -  4 :    8643
   4 -  5 :    8314
   5 -  6 :   15858
   6 -  7 :    7101
   7 -  8 :    9168
   8 -  9 :    1637
   9 - 10 :    1623
  10 - 11 :    1522
  11 - 12 :    2278
  12 - 13 :    1164
  13 - 14 :     943
  14 - 15 :     669
  15 - 16 :    1240
  16 - 17 :     119
  17 - 18 :      69
  18 - 19 :      54
  19 - 20 :      39
  20 - 21 :      27
  21 - 22 :      21
  22 - 23 :      13
  23 - 24 :       4
  24 - 25 :     255

  Duration distribution over calls/directories:

  Length (s)   #Occurrences
   0 -  1 :      37
   4 -  5 :      21
   5 -  6 :     662
   6 -  7 :    1447
   7 -  8 :     556
   8 -  9 :     277

=> 37 sessions have an extremely short average file duration below 1 sec.
=> See comments for SNR.
<*> missing SAMPSTAT entries have been provided

2.2 min-max samples
  We provide a histogram with clipping ratios. The clipping ratio is
  defined as the number of samples in a file that are equal to the
  maximum/minimum value, divided by the number of all samples in the file.
  The histogram, then, is an overview of how many files were found in a
  set of clipping rate intervals.

  Clip distribution for all items:

  Clipping       Occurrences
  rate (in %)
  There are no clipped files

  Clip distribution over calls/directories:

  Clipping       Occurrences
  rate (in %)
  There are no clipped files

2.3 Mean values
  We computed the mean sample value of each item in each call. We provide
  a histogram with mean values below. The histogram, then, is an overview
  of how many files were found in a set of mean sample value intervals.
  This overview can be used to trace files with large DC-offsets.
Mean distribution over all items: Mean Occurrences -310 - -300 : 1 -280 - -270 : 1 -270 - -260 : 1 -250 - -240 : 2 -230 - -220 : 1 -220 - -210 : 1 -210 - -200 : 2 -200 - -190 : 1 -170 - -160 : 1 -130 - -120 : 2 -110 - -100 : 4 -100 - -90 : 1 -90 - -80 : 4 -80 - -70 : 195 -70 - -60 : 16 -60 - -50 : 52 -50 - -40 : 130 -40 - -30 : 354 -30 - -20 : 1302 -20 - -10 : 4059 -10 - 0 : 28423 0 - 10 : 26009 10 - 20 : 876 20 - 30 : 354 30 - 40 : 239 40 - 50 : 346 50 - 60 : 269 60 - 70 : 88 70 - 80 : 184 80 - 90 : 46 90 - 100 : 6 100 - 110 : 8 110 - 120 : 9 120 - 130 : 3 130 - 140 : 1 140 - 150 : 2 150 - 160 : 1 160 - 170 : 1 180 - 190 : 1 190 - 200 : 1 200 - 210 : 1 220 - 230 : 1 270 - 280 : 1 Mean distribution over calls/directories: Mean Occurrences -110 - -100 : 1 -80 - -70 : 9 -50 - -40 : 4 -40 - -30 : 11 -30 - -20 : 65 -20 - -10 : 197 -10 - 0 : 1362 0 - 10 : 1255 10 - 20 : 32 20 - 30 : 11 30 - 40 : 12 40 - 50 : 14 50 - 60 : 12 60 - 70 : 4 70 - 80 : 11 None of the sessions has an alarming average sample value. 2.4 Signal to Noise Ratio We split each signal file into contiguous windows of 10 ms and computed the Mean Square (energy) in each window. The mean sample value over the complete file was subtracted from each individual sample value before MS was computed. 5% of the windows that contained the lowest energy were assumed to contain line noise. In this way the signal to noise ratio could be calculated for each file by dividing the mean energy over all windows by the mean energy of the 5% sample mentioned above. The result was multiplied by 10*log for scaling. 
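
The SNR estimate described above can be sketched as follows. This is an illustration under stated assumptions, not the SPEX software: the function name snr_db is hypothetical, samples are assumed to be linear values already decoded from A-law, a trailing partial window is dropped, at least one window is always used for the noise estimate, and "multiplied by 10*log" is read as 10*log10 of the energy ratio.

```python
import math
from typing import Sequence


def snr_db(samples: Sequence[float],
           sample_rate: int = 8000,
           window_ms: int = 10,
           noise_fraction: float = 0.05) -> float:
    """Estimate SNR in dB from the lowest-energy windows of one file."""
    win = sample_rate * window_ms // 1000      # 80 samples per 10 ms at 8 kHz
    n_windows = len(samples) // win            # assumption: drop partial window
    if n_windows == 0:
        raise ValueError("file shorter than one analysis window")

    # The mean over the complete file is subtracted before computing energy.
    mean = sum(samples) / len(samples)

    # Mean Square (energy) per contiguous window.
    energies = []
    for i in range(n_windows):
        chunk = samples[i * win:(i + 1) * win]
        energies.append(sum((s - mean) ** 2 for s in chunk) / win)

    # The 5% of windows with the lowest energy are taken as line noise.
    n_noise = max(1, int(n_windows * noise_fraction))
    noise = sum(sorted(energies)[:n_noise]) / n_noise
    signal = sum(energies) / n_windows

    if noise == 0:
        # Assumption: a zero noise floor is reported as infinite SNR.
        return math.inf
    return 10 * math.log10(signal / noise)


if __name__ == "__main__":
    loud = [100, -100] * 40    # one 10 ms window, mean square 10000
    quiet = [1, -1] * 40       # one 10 ms window, mean square 1
    print(round(snr_db(loud * 36 + quiet * 4), 2))
```

Because the "signal" energy is the mean over all windows, including the noise windows, the estimate cannot fall below 0 dB; this is consistent with the 0 - 5 bin being the lowest bin in the histograms that follow.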
SNR distribution over all items: SNR occurrences 0 - 5 : 778 5 - 10 : 3 10 - 15 : 28 15 - 20 : 208 20 - 25 : 1076 25 - 30 : 3703 30 - 35 : 9230 35 - 40 : 15394 40 - 45 : 16968 45 - 50 : 10697 50 - 55 : 3798 55 - 60 : 751 60 - 65 : 168 65 - 70 : 64 70 - 75 : 65 75 - 80 : 24 80 - 85 : 13 85 - 90 : 12 90 - 95 : 7 95 - 100 : 2 100 - 105 : 4 105 - 110 : 6 110 - 115 : 1 SNR distribution over calls/directories: SNR occurrences 0 - 5 : 37 15 - 20 : 4 20 - 25 : 40 25 - 30 : 131 30 - 35 : 458 35 - 40 : 774 40 - 45 : 871 45 - 50 : 495 50 - 55 : 157 55 - 60 : 22 60 - 65 : 5 65 - 70 : 3 70 - 75 : 3 => 37 sessions have an average SNR of 0. They are from 2 speakers: SES1400 SES1401 SES1402 SES1403 SES1404 SES1406 SES1407 SES1408 SES1409 SES1410 SES1411 SES1412 SES1414 SES1415 SES1416 SES1417 SES1418 SES1419 SES2600 SES2601 SES2602 SES2603 SES2604 SES2605 SES2606 SES2607 SES2608 SES2609 SES2610 SES2611 SES2612 SES2613 SES2614 SES2615 SES2616 SES2617 SES2618 => The reason is that the SAMPSTAT.TXT file has empty values for these sessions. <*> missing SAMPSTAT entries have been provided =========================================================================== 5. ANNOTATION FILE - Each line must be delimited by OK - Mandatory (SAM) mnemonics: LHD: SAM, 5.10 DBN: SPEECHDAT__Speaker_Verification (??) VOL: VERIF1_ SES: DIR: SRC: CCD: CRP: REP: RED: RET: SAM: 8000 < = sampling freq.> BEG: END: SNB: 1 < = number of bytes per sample> SBF: < = sample byte order, meaningless with single bytes> SSB: 8 < = number of significant bits per sample> QNT: A-LAW < = quantisation> SCD: SEX: M/F/UNKNOWN AGE: or DOB: ACC: ! mnemo is not SAM HLT: TRD: RCC: REG: SNL: ENV: PHM: ! mnemo is not SAM NET: ! mnemo is not SAM LBD: LBR: , , [gain], [minimum value], [maximum value], LBO: , [centre sample], , EXT: 80 chars on one line> ELF: - Optional (SAM) mnemonics (may be omitted or left empty) TYP: orthographic TXF: CMT: NCH: 1 < = number of channels recorded> ARC: ! mnemo is not SAM SHT: ! 
mnemo is not SAM
  CMP:
  EXP:
  SYS:
  DAT:
  SPA:
  DSC: < = discontinuity marker>
  EDU: ! mnemo is not SAM
  SOC: ! mnemo is not SAM
  STR:
  ASS: ! mnemo is not SAM
- Order restrictions:
  . LHD and TYP are first
  . LBR and LBO come after LBD
  . ELF is end of file keyword
  OK
- All mnemonics should be SAM mnemonics or explicitly defined in
  documentation
  OK
- No illegal mnemonics used
  OK
- There are no mnemonics missing
  OK
- All files must contain the same mnemonics. This holds as well for the
  optional mnemonics.
- No line may exceed 80 chars
  OK
- No illegal field values should appear
=> LHD: SAM, 5.2 is used (SAM 6.0 removed the 80 char. line length constraint)
<*> ok, corrected.
=> The obligatory labels HLT, TRD, SNL have empty values.
<*> this cannot be changed
=> The path after DIR misses the first backslash to indicate the root dir.
<*> ok, corrected.
=> END: has no value in 55 sessions:
   0357 0358 0359 1175 1176 1177 1178 1179 1400 1401 1402 1403 1404 1405
   1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419
   2199 2454 2455 2456 2457 2458 2459 2600 2601 2602 2603 2604 2605 2606
   2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619
<*> ok, corrected
=> END is also empty in the files:
   C10386L2.DEO C10569A2.DEO C10569O8.DEO C10569S2.DEO C10569S3.DEO
<*> ok, corrected
=> LBR:, and LBO, have missing END values in the same directories/files
<*> ok, corrected
=> PHM: has MOBILE as only value.
<*> ok, corrected
=> HOME, OFFICE, CAR STOPPED, PUBLIC PLACE, STREET, MOVING VEHICLE
=> Used are: QUIET, NOISY
<*> this cannot be changed
=> RED: The month is written in three capital letters, whereas only the
=> first letter should be a capital.
<*> ok, corrected
=> RED: 1900 is used where 2000 is meant.
=> 874 sessions are concerned.
<*> ok, corrected
=> RED: Wrong formats are used in 4 sessions:
   SES1170 SES1616 SES1857 SES2439
<*> format is correct, the date is 28/Feb/2000 which is a valid date
- Each lowest subdirectory does not refer to multiple sheet ids.
OK - For spontaneous speech LBR should contain a mnemonic word. There is no spontaneous speech - The CRP field should distinguish the different repetitions of EXACTLY the same word (from LBR) uttered by a speaker. OK - Transliterations is case-sensitive unless specified otherwise. ( In general lower case is used also at sentence beginning Only exception: proper names and spelled words, ZIP codes, acronyms and abbreviations In the latter case blanks should be used in between the letters. German is the only exception to this convention. ) OK - Punctuation marks should not be used in the transliterations OK - Digits must appear in full orthographic form OK - In principle only the following symbols are allowed to indicate non-speech acoustic events: [fil] [spk] [sta] [int] Other symbols (and language equivalents) must be mentioned in the documentation OK - Asterisks should be used to indicate mispronunciations OK - Double asterisks should be used for not understandable parts OK - Tildes should be used to indicate truncations OK - & should be used for typical GSM phenomena like fading, frame distortions Not used - Assessment of speech items in terms of SNR, presence of additional noise, adherence to prompting text is provided (optional) Not used ======================================================================== 6. LEXICON - Check lexicon existence (\TABLE\LEXICON.TBL) OK - The entries should be alphabetically ordered OK - Used SAMPA symbols are provided in \DOC\SAMPALEX.PS OK - In transcriptions only SAMPA symbols are allowed OK - All SAMPA phoneme symbols should be covered. OK The symbol /Z/ is missing - Phoneme symbols must be separated by blanks OK - A line in the lexicon should have the following format [ ] [] [TAB] is ASCII 9. OK - Each line is delimited by OK - Alternative transcriptions are optional. 
They may follow the first transcription, separated by [TAB] or have a separate entry (only in case also frequency information is supplied) Alternative transcriptions are not provided - Orthographic entries are as a rule split by spaces only, not by apostrophes, and not by hyphens. OK - Words with * or ~ should not appear in the lexicon OK - The lexicon should be complete . Check for undercompleteness (are all words in lexicon) . Check for overcompleteness (Undercompleteness is worse than overcompleteness. Overcompleteness cannot be a reason for rejection) OK - Lexicon contents should be taken from actual utterances (from LBO), so the entries should exactly match the transcriptions. OK - Optional information: stress, word/morphological/syllabic boundaries. But, if provided, then it should follow the SpeechDat conventions. Not provided ========================================================================== 7. SPEAKERS - Check existence speaker database file with static speaker info (SPEAKER.TBL) OK - Obligatory information in SPEAKER.TBL: 1. unique number (speaker/caller) SCD (or less preferably SES) 2. sex SEX 3. age AGE or DOB 4. accent ACC OK - Check existence session table with dynamic speaker info (SESSION.TBL) OK - Obligatory information in SESSION.TBL: 1. Session number SES 2. Speaker code SCD 3. health HLT (HEALTHY/SORE-THROAT/COLD/OTHER) 4. tiredness TRD (FRESH/TIRED) => HLT and TRD are empty <*> this cannot be changed - Optional information: . native language NLN . education level EDL . smoking habits SMK (SMOKER/NON-SMOKER) . socio-economic status SOC . stress STR (TENSE/RELAXED) Not used - Each line is delimited by OK - Each field is separated by [TAB] (ASCII 9) OK - An additional set of speaker list files in directory INDEX is obligatory. 
=> Speaker list files are missing
<*> ok, speaker list files have been added
- For 120 speakers there must be 20 calls present
  OK, there are 150 speakers with 20 calls each
- Between successive calls there must be a time interval of at least
  3 days
=> Some minor violations were found:
   SPK : SES1  DATE1           SES2  DATE2
   0011: 0200, 19/JAN/1900 <>  0219, 21/JAN/1900
   0124: 2461, 20/NOV/1999 <>  2463, 22/NOV/1999
<*> this cannot be changed
- Balance of sexes
  . How many males, how many females, should match specification in
    documentation file
  . Imbalance may not exceed 5% (Each sex must be represented between
    45-55%)
  OK, there is a perfect balance of 75 male and 75 female speakers.
- Balance of dialect regions
  . which dialect regions and how many of each should match specification
    in documentation file
  . at least 8 speakers from each dialect region
  . at least 2 dialect regions
  The following distribution of speakers over accent regions was found by
  automatic inspection of the label files:

  15 Baden-Württemberg
  19 Bayern
  13 Brandenburg
   9 Hessen
   3 Hochdeutsch
   8 Mecklenburg-Vorpommern
  11 Niedersachsen
  25 Nordrhein-Westfalen
   3 OTHER
   6 Rheinland-Pfalz
   2 Saarland
  10 Sachsen
   9 Sachsen-Anhalt
  10 Schleswig-Holstein
   7 Thüringen

  The SPEAKER.TBL file gives the same distribution as reported above.
=> Three regions have fewer than 8 speakers: Rheinland-Pfalz, Saarland, Thüringen
=> There are small deviations from section 4.1 of DESIGN.DOC.
<*> this cannot be changed; DESIGN.DOC has been modified to explain deviations
- Balance of ages
  . which age groups and how many of each should match specification in
    documentation file
  . Criteria
     < 16 : >=  1%  strongly recommended
    16-30 : >= 20%  ,,
    31-45 : >= 20%  ,,
    46-60 : >= 15%  ,,
  OK, The following distribution of speakers ages was found by automatic
  inspection of the label files:

  00-15 : 24
  16-30 : 56
  31-45 : 39
  46-60 : 24
  61-99 :  7
  other :  0

  The SPEAKER.TBL file gives the same distribution as reported above.
=> There are deviations from section 4.2 of DESIGN.DOC.
<*> DESIGN.DOC has been changed to explain deviations

=======================================================================
8. RECORDING CONDITIONS

- Check existence recording conditions table (\TABLE\REC_COND.TBL)
  OK
- Information in REC_COND.TBL
  Minimum set
  . recording conditions code RCC
  . region code REG
  . environment ENV
  . telephone network NET
  . telephone model PHM
  OK, SNL should be left out, since it is empty.
=> The meaning of the recording codes is not explained.
=> It seems that a recording code is a speaker label. This is obviously
=> not the correct use of the label. The appropriate use is to attach a
=> code to each combination of ENV and NET.
<*> recording codes now match SpeechDat specifications; there are 60 different
<*> recording codes (15 regions * 2 environments * 2 handsets)
- 70% of calls must be from quiet environment (check on mnemonic ENV)
  (HOME, OFFICE, CAR STOPPED);
  30% from a noisy environment (PUBLIC PLACE, STREET, MOVING VEHICLE)
  10% of speakers (=12) may have 1 call from wrong environment
  2.5% of speakers (=3) may have 2 calls from wrong environment
  OK, all speakers called 14 times from a QUIET environment and 6 times
  from a NOISY environment.
- 50% of calls must be from fixed network (check on mnemonic NET);
  50% of calls must be from GSM network
  10% of speakers (=12) may have 1 call from wrong network
  2.5% of speakers (=3) may have 2 calls from wrong network
  OK, all speakers called 10 times over the fixed network and 10 times
  over the cellular network.

=============================================================================
9. TRANSCRIPTION

This validation was carried out by taking 1000 of the mandatory short items
and 1000 of the mandatory long items from the corpus of 3000 calls. The
transcriptions in the label files for these samples were checked by
listening to the corresponding speech files and correcting the
transcription if necessary.
In case of doubt nothing was corrected. This check was performed by a native speaker of the language. Short items are: - names (O7-8) - application words (A1-2) Long items are: - isolated digit string (B1) - connected digits (C3-4) - spelled words (L1-3) - phonetically rich sentences (S0-9) - The evaluation comprised the following guidelines: . Two types of errors were distinguished: speech and non-speech transcription errors . Non-speech refers to [fil] [spk] [sta] [int] only . For non-speech all symbols were mapped to one during validation. i.e. If a non-speech symbol was at the proper location then it was validated as correct (regardless if it was the correct non-speech symbol or not). . Only noise deletions in the transcription were counted as wrong, not noise insertions . the given transcription is given the benefit of the doubt; only obvious errors are corrected. . Errors were only determined on item level, not on word level . For speech a maximum of 5% of the validated items (=files) may contain a transcription error . For non-speech a maximum of 20% of the validated items (=files) may contain a transcription error. RESULTS 1. Long items Transcription errors with respect to speech were found in 10 items. This amounts to 1.0%, which is well below the criterion of 5%. Errors in the transcription of non-speech were found in 60 items. This amounts to 6.0% of the items, which is well below the criterion of 20%. 2. Short items Errors with respect to the transcription of speech were found in 10 items. This amounts to 1.0%, which is well below the criterion of 5%. Errors in the transcription of non-speech were found in 61 items. This amounts to 6.1% of the items, which is well below the criterion of 20%. 3. Overall result When we take the long and short item sets together then we find errors with respect to the transcription of speech in 20 items. This amounts to 1.0%, which is well below the 5% criterion. 
Errors in the transcription of non-speech were found in 121 items. This
amounts to 6.1%, which is well below the 20% criterion.

==========================================================================
10. SUMMARY

The German speaker verification database in SpeechDat style was validated by
SPEX. The essential requirements for such a database are met. The most
important remarks following from the validation are:
- the database should not be called a SpeechDat database, but a database
  according to SpeechDat specifications.
- the speaker list files are missing
- SAMPSTAT.TXT misses the data for 2 speakers
- no enrolment/test set partitioning indicated
- many small errors in the label files
- the speaker information fields HLT (health) and TRD (tiredness) are missing.
- Three regions have fewer than 8 speakers: Rheinland-Pfalz, Saarland,
  Thüringen.

Below we give a brief overview of our findings for this database.
The subsections follow the order of the various topics in the previous
sections of the report.

1. Documentation
The documentation is clear and complete. The information about the quality
check of the lexicon should be more extensive. It is recommended to include
a train and test set partitioning of the database. Some smaller comments
are listed in section 1 of this report.

2. Data base structure and file names
The database structure is well in agreement with the SpeechDat conventions.
However, the speaker list files are missing.

3. Items
The database contains all required items and in sufficient quantities.

4. Sampled data files
The speech data files are in the correct format (A-law). A file with
acoustic characteristics of each file (SAMPSTAT.TXT) is delivered.
Histograms of a number of acoustic characteristics of the files (duration,
mean sample value, clipping rate, SNR) were generated and included in
section 4 of this report. Acoustical details of individual files can be
looked up in the SAMPSTAT.TXT file.
The information for 37 sessions (2 speakers) is missing in the SAMPSTAT.TXT
file.

5. Label files
All obligatory mnemonics are used. There are (unacceptably) many errors in
the values attached to the mnemonics.

6. Lexicon
The lexicon has the correct format, contains all and only SAMPA symbols for
the phone transcriptions, and it is complete.

7. Speakers
The speaker information fields HLT (health) and TRD (tiredness) are missing.
Three regions have fewer than 8 speakers: Rheinland-Pfalz, Saarland,
Thüringen.

8. Recording conditions
All recording requirements were fulfilled. The meaning of the recording code
in the recording table is unclear.

9. Transcription
When we take the long and short item sets together then we find errors with
respect to the transcription of speech in 20 items. This amounts to 1.0%,
which is well below the 5% criterion.
Errors in the transcription of non-speech were found in 121 items. This
amounts to 6.1%, which is well below the 20% criterion.

=========================================================================
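
Note on the transcription figures: the percentages reported in section 9 can
be recomputed from the item counts given there. The following is a minimal
sketch only, not the tooling used by SPEX; the names ItemSet, error_rate and
evaluate are hypothetical and introduced purely for illustration. It assumes
that "a maximum of 5%" and "a maximum of 20%" are inclusive limits. The
overall non-speech rate comes out as 6.05%, which the report states as 6.1%.

```python
# Sketch only: recompute the section 9 error percentages from the reported
# item counts. ItemSet, error_rate and evaluate are hypothetical names.
from dataclasses import dataclass

SPEECH_MAX_PCT = 5.0       # criterion for speech transcription errors
NON_SPEECH_MAX_PCT = 20.0  # criterion for non-speech transcription errors


@dataclass(frozen=True)
class ItemSet:
    name: str
    n_items: int            # number of validated items (= files)
    speech_errors: int      # items with a speech transcription error
    non_speech_errors: int  # items with a non-speech transcription error


def error_rate(errors: int, n_items: int) -> float:
    """Percentage of validated items that contain an error."""
    return 100.0 * errors / n_items


def evaluate(item_set: ItemSet) -> dict:
    """Compute both error rates and compare them with the criteria."""
    speech = error_rate(item_set.speech_errors, item_set.n_items)
    non_speech = error_rate(item_set.non_speech_errors, item_set.n_items)
    return {
        "set": item_set.name,
        "speech_pct": speech,
        "non_speech_pct": non_speech,
        # Assumption: the limits are inclusive ("a maximum of").
        "passes": speech <= SPEECH_MAX_PCT and non_speech <= NON_SPEECH_MAX_PCT,
    }


# Counts as reported in section 9 of this report.
long_items = ItemSet("long", 1000, 10, 60)
short_items = ItemSet("short", 1000, 10, 61)
overall = ItemSet("overall", 2000, 20, 121)

for item_set in (long_items, short_items, overall):
    result = evaluate(item_set)
    print(
        f"{result['set']:8s} speech {result['speech_pct']:.2f}%  "
        f"non-speech {result['non_speech_pct']:.2f}%  "
        f"passes: {result['passes']}"
    )
```

The same check can be rerun on a different validation sample by changing
only the counts passed to ItemSet.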