Validation report for the PHATTSESSIONSZ Database

Authors	Michael Medla
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2180-5790
Corpus Version	1.1.0
Date	2014
Status
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the PHATTSESSIONSZ Corpus:

Summary

The speech corpus PHATTSESSIONSZ has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples.

Introduction and Corpus Description

The Ph@ttSessionz speech database version 2.0.0 contains recordings of 1019 adolescent speakers of German (mostly from the age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 46 locations in Germany (and one in Austria). The speech material recorded is a superset of the German SpeechDat-II and RVG-I corpora. A session consists of up to 138 recording items, with both read and non-scripted speech. The read speech material comprises isolated digits, digit sequences, numbers, time and date expressions, spellings, person, company and geographical names, and phonetically rich sentences. The non-scripted speech consists of short and long text production items. The short text production items are questions on the current date or prompts for descriptions, e.g. of how to get from home to the train station, or of the speaker's clothes; for the long items the speaker was asked to talk about the last holidays, the favorite subject at school, etc.

I.) Validation of Documentation

The general documentation for the PHATT_11 corpus consists of the following files, which can be found in:

1.) the main directory

README.TXT	File describing the database and its structure
DISK.ID	Directory ID for OS
COPYRIGHT.TXT	Copyright text

2.) the directory /doc:

060628_MPre_UG_EN01.pdf	M-Audio MobilePre USB user guide
at3031_english.pdf	Audio-Technica AT3031 data sheet
HTML	Description of the recording procedure
Opus54_DB_E.pdf	Beyerdynamic Opus 54 data sheet
SAMPALEX.PDF	German SAMPA table
TRANSCRIPTION.PDF	Validation and transcription handbook for Ph@ttSessionz

3.) the directory /table:

LEXICON.TBL	Pronunciation dictionary
METADATA.TBL	Speaker information table
PH110TRN.TBL	SpeechDat FDB training set
PH110TST.TBL	SpeechDat FDB test set

4.) the directory /index:

CONTENTS.TBL	Content of the database

5.) the directory /source

DEFTEST.PL	Perl script defining SpeechDat FDB test sets

6.) the directory /doc/html

manual_eng.html	Starting page of instructions for recording staff
manual_eng.zip	Compressed archive file of description
manual_files	Description file in HTML

· Administrative Information:

Validating person: n.a.

Date of validation: n.a.

Contact for requests regarding the corpus: ok

Number and type of media: n.a./not given

Content of each medium: ok (file tree of master)

· Technical information:

Layout of media: Information about file system type and directory structure: ok

File nomenclature: Explanation of used codes (no white space in file names!): <aaa><4 digit number><2 or 3 character items>_<recording channel 0 | 1><.par | .wav>, ok

Formats of signals and annotation files: If non standard formats are used it is common to give a full description or to convert into a standard format: ok (waveform audio file, BAS partitur file)

Coding: Signed Integer PCM, ok

Compression: n.a./none

Sampling rate: 22050 Hz, ok

Valid bits per sample: (others than 8, 16, 24, should be reported): ok, 16-bit

Used bytes per sample: 2 ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) stereo to mono
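
The formal checks on file nomenclature and signal format listed above can be automated. The following is a minimal sketch, not the tool used for this validation. It assumes that "aaa" in the naming scheme stands for three letters and that the item code is alphanumeric; the names NAME_RE, parse_name and check_wav_header are illustrative only.

```python
import re
import wave

# Naming scheme from the report:
# <aaa><4 digit number><2 or 3 character item>_<channel 0|1><.par|.wav>
# Assumption: "aaa" is three letters and the item code is alphanumeric.
NAME_RE = re.compile(
    r"^(?P<prefix>[A-Za-z]{3})(?P<number>\d{4})(?P<item>[A-Za-z0-9]{2,3})"
    r"_(?P<channel>[01])\.(?P<ext>par|wav)$"
)


def parse_name(filename):
    """Return the name components as a dict, or None if the name does not conform."""
    match = NAME_RE.match(filename)
    return match.groupdict() if match else None


def check_wav_header(path, rate=22050, sample_width=2):
    """Return a list of deviations from the documented signal format."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(
                f"sampling rate {wav.getframerate()} Hz, expected {rate}"
            )
        if wav.getsampwidth() != sample_width:
            problems.append(
                f"{wav.getsampwidth()} bytes per sample, expected {sample_width}"
            )
    return problems
```

For example, parse_name("AAA481344_1.wav") splits the name into the prefix "AAA", the number "4813", the item "44" and the channel "1", while a name containing white space is rejected. An empty list from check_wav_header means that the header matches the documented 22050 Hz and 2 bytes per sample.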

· Database contents:

Clearly stated purpose of the recordings: not given

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) read and non-scripted speech, ok

Instruction to speakers in full copy: not given

· Linguistic contents of prompted speech:

Specifications of the individual text items: n.a.

Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.

Example prompt sheet or example sound file from the speech prompting: not given

· Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) n.a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: ok

Number of speakers: ok, 1019 speakers

Distribution of speakers over sex, age, dialect regions: ok

Description/definition of dialect regions: ok (information given in METADATA.TBL)

· Recording platform and recording conditions:

Recording platform: n.a. (PC, but no more information is available)

Position and type of microphones:
- Company name and type id: ok (Beyerdynamic Opus54, AudioTechnica AT3031)
- Electret, dynamic, condenser: condenser, ok
- Directional properties: ok (description in instruction file for recording staff)
- Mounting: ok (description in instruction file for recording staff)

Position of speakers: (distance to microphone) n.a.

Bandwidth: (if other than zero to half of sampling rate)

Number of channels and channel separation: 1 channel for each microphone, ok

Acoustical environment: office/classroom

Recording hardware, telephone link (analog, digital): M-Audio MobilePre USB audio interface (A/D converter)

Network from where the call originated: none

Type of handset: none

· Annotation (BAS partitur files)

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): n.a.

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok, German SAMPA

Any other language dependent information as abbreviations etc: n.a.

Annotation manual, guidelines, instructions: ok (TRANSCRIPTION.PDF)

Description of quality assurance procedures: ok (TRANSCRIPTION.PDF)

Selection of annotators: phonetics and linguistics students

Training of annotators: ok

Annotation tools used: WWWTranscribe software

· Lexicon:

Format: UTF-8 file with three columns, ok

Text-to-phoneme procedure: n.a.

Explanation or reference to the phoneme set: German SAMPA

Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: not given

· Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors: n.a.

Status of documentation:

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: two audio files were missing; see IV.

Completeness of meta data files: ok

Completeness of annotation files: ok

Correctness of file names: ok

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: none (no summary listing)

Annotation and lexicon contents: ok
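
A completeness check of the kind listed above can be sketched as follows. This is an illustration under stated assumptions, not the script used for this validation: it assumes that the part of a file name before the first underscore identifies one recording, and that every annotated recording should have one signal file per channel (0 and 1). The function name missing_signals is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def missing_signals(filenames, channels=("0", "1")):
    """Map each annotated recording to the channels whose .wav file is absent.

    Assumption: the part of the name before the first underscore identifies
    one recording, and each annotated recording needs one .wav per channel.
    """
    annotated = set()
    signals = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        recording, _, channel = path.stem.partition("_")
        suffix = path.suffix.lower()
        if suffix == ".par":
            annotated.add(recording)
        elif suffix == ".wav":
            signals[recording].add(channel)

    report = {}
    for recording in sorted(annotated):
        absent = [c for c in channels if c not in signals[recording]]
        if absent:
            report[recording] = absent
    return report
```

Given a listing in which a partitur file exists for a recording but neither channel has a signal file, the function reports that recording with both channels as absent; recordings with complete signals do not appear in the result.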

III.) Manual Validation

60 files were reviewed: 40 audio files (covering both microphones) and 20 partitur files were checked for consistency. In this sample, noise markers that were used in the orthographic and canonical transcription appear to be transcribed as words by MAUS.

IV.) Other Relevant Observations

All audio files are symbolic links to the original audio files in Ph@ttSessionz. Two files were missing: AAA481344_1.wav and AAA481344_0.wav. The corresponding partitur file was available, as were the symbolic links, but the links pointed to non-existent targets. This has since been repaired: both AAA481344_1.wav and AAA481344_0.wav are now linked into the database.
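
Because the signal files are symbolic links, links without an existing target can be found mechanically. The following is a minimal sketch for a POSIX file system; the function name dangling_links is illustrative and this is not the procedure used for this validation.

```python
import os


def dangling_links(root):
    """Return sorted paths of symbolic links under root whose target does not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows the link.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

An empty result for the corpus root indicates that every symbolic link resolves to an existing file or directory.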

In the BAS partitur files the label DIR points to non-existent directories. The same speaker ID is used for different speakers (cf. label SPN in partitur files; SPN: 4 and DIR: DATA/4). This appears to be an error.
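
Reuse of a speaker ID can be detected by reading the header entries of each partitur file and grouping them by SPN value. The sketch below assumes that header entries have the form "KEY: value" and that the header ends at the LBD: line. It further assumes, purely for illustration, that the first seven characters of a file name (three letters plus the four digit number) identify the recordings of one speaker; this is not stated in the corpus documentation. The function names are hypothetical.

```python
from collections import defaultdict


def read_header(text):
    """Collect 'KEY: value' header entries of a partitur file up to the LBD: line."""
    header = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'KEY: value' line
        if key.strip() == "LBD":
            break  # end of header, label tiers follow
        header[key.strip()] = value.strip()
    return header


def shared_speaker_ids(headers_by_file, session_of=lambda name: name[:7]):
    """Return SPN values that occur in more than one session.

    Assumption: session_of(name) yields a key that is unique per speaker.
    """
    sessions = defaultdict(set)
    for name, header in headers_by_file.items():
        if "SPN" in header:
            sessions[header["SPN"]].add(session_of(name))
    return {spn: sorted(s) for spn, s in sessions.items() if len(s) > 1}
```

Any SPN value returned by shared_speaker_ids is attached to more than one session key and is therefore a candidate for the inconsistency described above; whether it is a genuine error still has to be confirmed against METADATA.TBL.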

V.) Comments for Improvement

The information given by the labels in the BAS partitur files should be consistent with the requirements for meta data of BAS speech corpora. E.g. the label SPN in the BAS partitur files is used for several different speakers instead of identifying exactly one speaker.

VI.) Result

The reported inconsistencies/errors have been fixed in version 2.0.0 (2015). The corpus is in a usable form and good condition for automated processing.

The corpus is suitable for publication according to BAS and CLARIN standards.