Revalidation report for the SI1000 Database

Authors

Florian Schiel, Katerina Louka

Affiliation  

BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München

Postal address

Schellingstr. 3
D 80799 München

E-mail

schiel@phonetik.uni-muenchen.de
bas@phonetik.uni-muenchen.de

Telephone

+49-89-2180-2758

Fax

+49-89-2800362

Corpus Version


Date

13.09.2004

Status

final

Comment

 

Validation Guidelines

Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook 

Validation results of the SI1000 Corpus:

Summary

The corpus contains read speech from 10 different speakers. Each speaker
read approx. 1000 sentences from a German newspaper corpus,
resulting in a total of approx. 10000 recorded utterances.


Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus SI1000, carried out in 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The recordings were made at the Institut fuer Phonetik, University of Munich, Germany, in 1994.

I.) Validation of Documentation

The general documentation directory doc/ contains the following documentation files for the SI1000 corpus:

README          general documentation
SI1000id.lis    list of speakers for the total corpus
SI1000.txt      list of spoken utterances
phondat.doc     documentation about the file format PhonDat
SI1000.lex      pronunciation dictionary in Extended German SAM-PA
ext_sam.txt     description of Extended German SAM-PA
partitur/       BAS Partitur Files
pardoc/         BAS Partitur Files documentation

·         Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: CDROMs

Content of each medium: no information; total size 2.2 GB of compressed data

Copyright statement and intellectual property rights (IPR): ok

·         Technical information:

Layout of media (information about file system type and directory structure):
The data of each speaker are collected in a subdirectory named after the speaker id.
The volumes are stored on CDROMs with High Sierra File System (HSFS)
or ISO 9660 format with Rock Ridge extensions.

File nomenclature (explanation of used codes; no white space in file names!):
<speaker_id><sex>d<sentence # from corpus>.16 ok

Formats of signals and annotation files (if non-standard formats are used, a full description or a conversion into a standard format is expected): ok

- PhonDat 2 format signal file

Coding: .16 (PhonDat format signal file with 16 kHz sampling rate)

Compression: n. a.

Sampling rate: 16 kHz ok

Valid bits per sample (values other than 8, 16, 24 should be reported): 16 bit ok

Used bytes per sample: 2 bytes/sample ok

Multiplexed signals (exact de-multiplexing algorithm; tools): n.a.
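
The file nomenclature given above can be checked mechanically. The following Python sketch splits a signal file name into its parts. The exact pattern is an assumption inferred from file names that appear in this report (e.g. csmd101.shn): a two-letter speaker id, a one-letter sex code, the literal "d", a three- or four-digit sentence number, and an extension. The names parse_signal_name and SignalName are hypothetical and not part of any BAS tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: <speaker_id><sex>d<sentence #>.<ext>
# (two-letter speaker id and one-letter sex code are assumptions).
NAME_RE = re.compile(
    r"^(?P<speaker>[a-z]{2})(?P<sex>[a-z])d(?P<sentence>\d{3,4})\.(?P<ext>[a-z0-9]+)$"
)


class SignalName(NamedTuple):
    speaker: str   # speaker id in upper case, e.g. "CS"
    sex: str       # sex code as found in the file name
    sentence: int  # sentence number from the corpus
    ext: str       # file extension without the dot


def parse_signal_name(name: str) -> Optional[SignalName]:
    """Return the parsed parts of a file name, or None if it does not match."""
    match = NAME_RE.match(name.lower())
    if match is None:
        return None
    return SignalName(
        match["speaker"].upper(),
        match["sex"],
        int(match["sentence"]),
        match["ext"],
    )
```

A name containing white space, such as "cs md101.16", does not match the pattern and yields None, which mirrors the "no white space in file names" requirement.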

·         Database contents:

Clearly stated purpose of the recordings:
no information

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) read sentences ok

Instruction to speakers in full copy:  no information

·         Linguistic contents of prompted speech:

Specifications of the individual text items: Sentences from a German newspaper corpus

Specification for the prompt sheet design or specification of the design of the speech prompts:  n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

·         Linguistic contents of non-prompted speech:

Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) ok

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios)  n.a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

·         Speaker information:

Speaker recruitment strategies: No information, not important for this corpus

Number of speakers: 10 (SI1000id.lis) ok

Distribution of speakers over sex, age, dialect regions: given (/doc/SI1000id.lis)

Description/definition of dialect regions: indirectly given (/doc/SI1000id.lis)

·         Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones:
- Company name and type id: Sennheiser Headset HMD 410
- Electret, dynamic, condenser: no information
- Directional properties:  no information
- Mounting:  no information

Position of speakers (distance to microphone): distance to mouth approx. 3-5 cm

Bandwidth: (if other than zero to half of sampling rate) ok

Number of channels and channel separation:  no information

Acoustical environment: no information

 

·         Annotation (BAS Partitur Format Files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): given

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: ok (the PAR documentation can be found in the files in pardoc/)

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: given

·         Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: an indirect reference, ok (ext_sam.txt is a description and mapping of Extended German SAM-PA)

Phonological or higher order phenomena accounted in the phonemic transcriptions: ok

·         Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

·         Others:

Any other essential language-dependent information or convention: given.

Indication of how many files were double-checked by the producer together with percentage of detected errors:  known errors can be found in the README file.

          Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: The following files are missing:

Speaker ID    Missing files
BJ            077, 079, 086
BK            077, 079, 086
BW            077, 079, 086, 201, 1000
CK            077, 079, 086
CS            077, 079, 086
HT            077, 079, 086, 101, 401, 501, 601
MH            077, 079, 086, 601
PG            077, 079, 086, 601
SH            077, 079, 086, 101, 201, 601, 801
TA            001, 077, 079, 086
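
A completeness check of this kind could be sketched in Python as follows. This is only an illustration, not the procedure actually used for this validation. The function name missing_sentences is hypothetical, and the expected numbering 1 to 1000 in the usage example is an assumption based on the sentence numbers listed above.

```python
from typing import Dict, Iterable, List, Set


def missing_sentences(present: Dict[str, Set[int]],
                      expected: Iterable[int]) -> Dict[str, List[int]]:
    """Map each speaker id to the sorted sentence numbers that have no file.

    `present` holds, per speaker, the sentence numbers found on the medium.
    Speakers with a complete set are left out of the result.
    """
    expected_set = set(expected)
    report: Dict[str, List[int]] = {}
    for speaker in sorted(present):
        missing = sorted(expected_set - present[speaker])
        if missing:
            report[speaker] = missing
    return report


# Usage with made-up input that reproduces the entry for speaker TA:
found = {"TA": set(range(1, 1001)) - {1, 77, 79, 86}}
print(missing_sentences(found, range(1, 1001)))  # {'TA': [1, 77, 79, 86]}
```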

Superfluous files: csmd101.shn, csmd401.shn, csmd701.shn
 
Completeness of meta data files: ok

Completeness of annotation files: ok.

Correctness of file names: ok.

Empty files / corrupt files: the files 101, 401 and 701 of the speaker with the id "CS".
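
Empty files can be located with a few lines of Python. The sketch below only finds zero-byte files; detecting corrupt signal files would need a format-specific check that is not shown here. The function name find_empty_files is hypothetical, and the directory passed to it is whatever path the corpus has been copied to.

```python
from pathlib import Path
from typing import List


def find_empty_files(root: str) -> List[Path]:
    """Return all zero-byte regular files below `root`, sorted by path."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )
```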

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: 

All /R/ phonemes were changed to /r/ in order to conform to the BAS Guidelines.
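
The /R/ to /r/ mapping can be illustrated with a minimal Python sketch. It assumes that a transcription is already available as a list of individual SAM-PA symbols; how the symbols are read from the lexicon or the Partitur files is not shown, and the function name normalize_r is hypothetical.

```python
from typing import List


def normalize_r(phonemes: List[str]) -> List[str]:
    """Replace the symbol "R" by "r"; all other symbols stay unchanged."""
    return ["r" if symbol == "R" else symbol for symbol in phonemes]


print(normalize_r(["R", "a:", "t"]))  # ['r', 'a:', 't']
```

Working on whole symbols instead of raw strings avoids touching an upper-case R that might be part of a longer symbol.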


III.) Manual Validation

Approximately 4 % of the data were checked against the corresponding PAR files. The error rate was very low. The errors found were misspoken words and punctuation marks that were not spoken.
Pauses, hesitations, and breathing are not annotated.
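
A subset of roughly 4 % can be drawn reproducibly as in the following Python sketch. The report does not state how the checked utterances were selected; uniform random sampling with a fixed seed is an assumption made for illustration, and the function name select_for_manual_check is hypothetical.

```python
import random
from typing import List, Sequence


def select_for_manual_check(names: Sequence[str],
                            fraction: float = 0.04,
                            seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of about `fraction` of `names`."""
    if not names:
        return []
    # At least one item, never more than the number of available names.
    count = min(len(names), max(1, round(len(names) * fraction)))
    rng = random.Random(seed)  # fixed seed so the selection can be repeated
    return sorted(rng.sample(list(names), count))
```

For a list of 10000 utterance names, the default fraction yields 400 names.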

IV.) Other Relevant Observations

V.) Comments for Improvement

The errors found in the manual validation could not be repaired.

VI.) Result

The corpus is ok. The corpus, its history and its known errors are well documented, and no data or documentation files are missing.