BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de

COPYRIGHT University of Munich 1995. All rights reserved.
This corpus and software may not be disseminated further - not even
partly - without written permission of the copyright holders.

Additional Copyright Holders
Fa. Siemens, Munich, 1995.

======================================================================
Siemens_100     SI100     Version 4.1     F. Schiel 09.03.95 / 18.10.20
======================================================================

Siemens_100 Corpus - General Documentation

This directory contains documentation files for the SI100 corpus.

README                  : this file
SI100_id.lis            : list of speakers for the total corpus
SI100_ce.txt            : texts of subcorpus CeBit
SI100_sz.txt            : texts of subcorpus SZ
SI100_wo.txt            : list of spoken words
SI100.lex               : pronunciation lexicon
pardoc/                 : BAS Partitur Files Docu
Revalidation_SI100.html : BAS revalidation of version 3.0 in 03/07/2003
                          this file is in most parts obsolete!

Unless pure ASCII, all documentation files, including lists, are encoded
in ISO-8859.

Siemens_100 Corpus - Description

The corpus contains read speech of 101 different speakers (50 female,
50 male, 1 unknown). Each speaker has read approx. 100 sentences from
either the SZ subcorpus or the CeBit subcorpus. The language is German.

The subcorpus SZ contains 544 sentences from newspaper articles
("Sueddeutsche Zeitung"). The subcorpus CeBit contains 483 sentences
from newspaper articles about the CeBit 1995.

Each subcorpus is divided into 5 parts of approx. 100 utterances each.
Every speaker read only one part of one subcorpus (with some
exceptions), resulting in a total of 10,387 recorded utterances
(31.5 h of speech).
The recording took place at the Institut fuer Phonetik, University of
Munich, Germany, in 1995.

Siemens_100 Corpus - Speaker Information

The file SI100_id.lis contains a list, ordered alphabetically by speaker
id, that gives information about the part and the subcorpus of each
speaker. The same can be found in the files SI100_sz.lis and
SI100_ce.lis for the two subcorpora. The ordered lists have 11 columns:

Corpus             : 1 = Subcorpus SZ
                     2 = Subcorpus CeBit
Part               : part of the subcorpus
                     SZ    : 1 = 1-103
                             2 = 104-220
                             3 = 221-341
                             4 = 342-435
                             5 = 436-544
                     CeBit : 1 = 1-100
                             2 = 101-200
                             3 = 201-273
                             4 = 274-377
                             5 = 388-483
ID                 : speaker id, 4 chars
Sex                : m = male, w = female
Name               : full name
Birth              : date of birth
Size               : body height in cm
Weight             : weight in kg
Place of Education : place of residence during school period
Place of Living    : current place of residence
Profession         : current occupation

A single 'x' as an entry means 'no information available'.
A single '*' in the column 'Place of Living' means that the current
place of residence is the same as the place of residence during the
school period ('Place of Education').

Siemens_100 Corpus - Spoken Texts

The orthography of the subcorpora can be found in the files SI100_sz.txt
and SI100_ce.txt respectively (UTF-8). Each utterance is numbered
according to the file naming convention. These texts are identical to
the header information in the signal files (see there for additional
information about the format). Note the special spelling rules (see
below).

Siemens_100 Corpus - Recording Situation

The task for the speaker was to read as carefully as possible and to
read all punctuation marks aloud, as in a dictation task. If an error
occurred, the recording was interrupted by the supervisor and the
sentence was repeated.

The recording conditions are as follows:

Microphone    : Sennheiser Headset HMD 410
                Distance to mouth approx.
                3-5 cm
Room          : 'dry' acoustics ('quiet office'), no noise
Sampling rate : 48 kHz
Resolution    : 16 Bit

The speech data were digitally filtered to 8 kHz cutoff frequency and
downsampled to 16 kHz.

Siemens_100 Corpus - File structure

Note that on some platforms all characters in filenames are always
upper case (e.g. DOS).

Each speaker has read approx. 100 sentences from either the subcorpus SZ
or the subcorpus CeBit (see above). Each sentence is stored in a
separate file. The file naming convention is as follows:

    <speaker id><utterance number>[c].nis
                                   ^
                                   |
                                   indicating CeBit Subcorpus

for example:

step046c.nis contains the 46th utterance from the CeBit corpus spoken
             by the speaker STEP (see SI100_id.lis for more information
             about the speaker).
flsc099.nis  contains the 99th utterance from the SZ subcorpus spoken
             by the speaker FLSC.

The total length of the prefix is always less than or equal to 8
characters. The suffix is always 'nis' to denote a NIST SPHERE format
signal file with 16 kHz sampling rate.

The format of the signal files is NIST SPHERE as described at the URL
http://www.icp.grenet.fr/Relator/standnist.html
or directly at NIST (e.g. ftp://jaguar.ncsl.nist.gov/pub/).

The key 'orthography' contains the German orthographic representation
of the utterance. The utterance does not always start with a capital
letter; it does so only if its first word is capitalized anyway (e.g. a
noun, as nouns are capitalized in German). Punctuation marks are
separated from words and counted as words (because they were spoken as
words!), except the hyphen in compound words (e.g. 'Büro-Utensilien').

German 'Umlaute' are coded as follows:

Umlaut   LaTeX   ISO 8859-1 hex
===============================
ae       "a      E4
ue       "u      FC
oe       "o      F6
Ae       "A      C4
Ue       "U      DC
Oe       "O      D6
ss       "s      DF

This coding gives the text a 'natural' view if displayed with an
ISO 8859-1 German character set.

Siemens_100 Corpus - BAS Partitur Format Files

All information accompanying the signal is summed up in the
corresponding Partitur File (same prefix but extension 'par') in the
subdirectory DATA.
The BAS Partitur Format (BPF) is an open structure that allows the easy
description and processing of information aligned to a speech signal.
The BPF is currently developed at the BAS with the support of many
partners. Please refer to the following URL for a detailed description:
http://www.phonetik.uni-muenchen.de/Bas/BasFormatseng.html

A copy of the on-line documentation at the time of CDROM production can
be found in the files in PARDOC. Use a standard WWW browser to read
these documents.

In the SI100 Corpus the following tiers are used:

- ORT : orthography
- KAN : canonic pronunciation derived from DOC/SI100.LEX
- MAU : phonetic-phonemic segmentation
- MAS : syllabification of phonetic segmentation

Note that German Umlauts are coded in UTF-8 within the BPF files.

Siemens_100 Corpus - Spelling rules

Sentences were spoken as if to a dictation software. To indicate the
different pronunciations of ambiguous punctuation signs, the following
rules applied:

'-'  is spoken as 'StrIC'
'\-' is spoken as 'bInd@StrIC'
'~-' is spoken as 'minUs'

Siemens_100 Corpus - Pronunciation Dictionary

A dictionary containing the 'canonical' pronunciation for each spoken
word of the corpus is contained in the file SI100.lex. The file
comprises a two-column, TAB-separated list with the orthography (UTF-8)
in the first and the pronunciation (SAM-PA) in the second column.

To access the dictionary directly from the texts contained in the NIST
header of the signal files or from the text corpora SI100_ce.txt or
SI100_sz.txt, you must take care to transform the Umlauts accordingly.

The pronunciation is coded in extended German SAM-PA as used in most
German speech resources. Primary and secondary word accent are coded as
preceding ' and " in this dictionary. See the file 'PARDOC/BasSAMPA'
for a detailed description of the phoneme set used.

Siemens_100 Corpus - EMU DB

The subdir SI100_emuDB/ contains a simple Emu database of the corpus
with annotation layers ORT, KAN and MAU.
For details about the EMU-SDMS see:
https://ips-lmu.github.io/EMU.html

Siemens_100 Corpus - History

...
15.09.99 : Version 2.1
           Errors in pronunciation dictionary fixed: 8-Bit chars and
           undefined phoneme sequences, wrong canonical pronunciations
           in abbreviations
25.11.99 : Version 2.2
           BAS Partitur Files inserted
10.09.02 : SPEX produces QQC report. See 'SI100-S0025_QQC.doc'.
02.01.03 : Version 3.0
           Changed word 'gründen' to 'Gründen' in utterance 480 of
           corpus CeBit.
           Erased 'gründen' from dictionary.
           Erased the following corrupt signal files:
             Speaker  No
             chkr     381
             chkr     437
             chkr     438
             chkr     439
             chkr     440
           Added docu of the pronunciation dictionary.
           Changed file names ziu* to ziul* in NIST and PAR
           Changed file names kip* to kipp* in NIST and PAR
           Changed file names bij* to bija* in NIST and PAR
           Changed file names mol* to miol* in NIST and PAR
           Changed file names pko* to peko* in NIST and PAR
           Changed file names steph* to step* in NIST and PAR
           Changed file names wab* to waba* in NIST and PAR
           Changed file names zuenk* to zuen* in NIST and PAR
           Changed id name tihg to till in SI100_id.lis
           Corrected SI100_id.lis: anse deleted, gira and mamu added,
           erased multiple TABs, transformed all birth dates into the
           form 'DD.MM.YY'.
           Erased redundant doc files SI100_sz.lis and SI100_ce.lis.
           With update 3.0 all errors found in the SPEX QQC report are
           fixed.
17.09.03 : Version 3.1
           Changed all /R/ phonemes in the canonical pronunciation
           dictionary to /r/ to conform to the BAS Guidelines
           (www.bas.uni-muenchen.de/Bas/BasGermanPronunciation/)
14.06.04 : Version 3.2
           Removed names of speakers from speaker list SI100_id.lis
12.07.11 : Version 3.3
           Changed word 'a' with apostrophe into simple 'a'
18.10.20 : Version 4.0
           Corrected SAMPA /Q/ to /?/ in KAN tier of *.par and in
           second column of DOC/SI100.LEX
           Converted LaTeX Umlauts to UTF-8 in ORT tier of *.par and in
           first column of DOC/SI100.LEX
           Created emuDB based on *.par using the BAS WebService MAUS.
           Created new BPF files based on *.par using pipe MAUS_PHO2SYL.
           Moved *.nis files from speaker subdirs to data (flatten).
           Moved *.par files to data dir.
           Found bugs:
           - recordings stsc438, itrc479c, flsc380c, flsc381c are
             truncated - removed
           - recordings stsc436, stsc437, stsc448, udha104, udha214,
             udha215 are empty - removed
           - large number (25) of recordings have sample count 0;
             removed.
20.10.20 : added SI100_emuDB : this emuDB does not contain all
           recordings due to some errors in the signal and annotation
           files; mostly NIST SPHERE files with a correct header but
           sample count 0.
26.10.20 : added new *.par files based on maus 5.76 and pho2syl 1.31 to
           subdir data; removed doc/partitur; note that the canonic
           pronunciation in the KAN tier is not created by G2P but
           derived from DOC/SI100.LEX.
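
Siemens_100 Corpus - Example: Parsing File Names

The file naming convention from the section 'File structure' can be
sketched in Python as follows. This is only an illustration and not
part of the corpus software. The names parse_name and Recording are
made up for this example, and the pattern (4 letters, 3 digits, optional
'c') is an assumption inferred from the examples step046c.nis and
flsc099.nis and from the 4-character speaker ids. Matching ignores case
because some platforms store file names in upper case.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the README examples (not an official spec):
# 4-letter speaker id, 3-digit utterance number, optional 'c' for CeBit,
# then the extension 'nis' (signal) or 'par' (BAS Partitur File).
_NAME_RE = re.compile(r"([a-z]{4})(\d{3})(c?)\.(nis|par)", re.IGNORECASE)


class Recording(NamedTuple):
    """Hypothetical container for the parts of a corpus file name."""
    speaker: str      # speaker id as listed in SI100_id.lis
    utterance: int    # utterance number within the subcorpus
    subcorpus: str    # "CeBit" if the name carries the 'c' flag, else "SZ"


def parse_name(filename: str) -> Optional[Recording]:
    """Split a file name into its parts; return None if it does not match."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        return None
    speaker, number, flag, _extension = match.groups()
    subcorpus = "CeBit" if flag else "SZ"
    return Recording(speaker.lower(), int(number), subcorpus)
```

With this sketch, parse_name("step046c.nis") yields speaker 'step',
utterance 46 and subcorpus 'CeBit', while a name such as "README" that
does not follow the convention yields None.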
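
Siemens_100 Corpus - Example: Umlaut Conversion for Dictionary Lookup

The section 'Pronunciation Dictionary' asks the user to transform
Umlauts before looking up words in SI100.lex. A minimal sketch of that
step is given below, assuming that the input text uses the LaTeX-style
coding from the Umlaut table in the section 'File structure' and that
the lexicon is a UTF-8, TAB-separated, two-column list. The function
names latex_to_unicode and load_lexicon are made up for this example.
The replacement is deliberately naive: a double quote that is a genuine
quotation mark followed by one of the letters a, o, u, A, O, U or s
would be converted as well.

```python
from typing import Dict, Iterable

# Mapping taken from the Umlaut table in the 'File structure' section.
LATEX_TO_CHAR = {
    '"a': "ä", '"u': "ü", '"o': "ö",
    '"A': "Ä", '"U': "Ü", '"O': "Ö",
    '"s': "ß",
}


def latex_to_unicode(text: str) -> str:
    """Replace LaTeX-style Umlaut codes (e.g. '"u') by the real character."""
    for latex, char in LATEX_TO_CHAR.items():
        text = text.replace(latex, char)
    return text


def load_lexicon(lines: Iterable[str]) -> Dict[str, str]:
    """Build an orthography -> pronunciation map from TAB-separated lines.

    Lines without a TAB (including empty lines) are skipped.
    """
    lexicon = {}
    for line in lines:
        orthography, tab, pronunciation = line.rstrip("\n").partition("\t")
        if tab:
            lexicon[orthography] = pronunciation
    return lexicon
```

A possible use, assuming the lexicon file is in the current directory:
open SI100.lex with encoding="utf-8", pass the file object to
load_lexicon, and look up latex_to_unicode(word) for each word of an
utterance.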