Revalidation report for the Strange Corpus 1 (SC1) Database (1995)

Authors Florian Schiel, Tania Ellbogen, Karl Weilhammer
Affiliation   BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München
Postal address Schellingstr. 3
D 80799 München
E-mail bas@phonetik.uni-muenchen.de
Telephone +49-89-2180-2758
Fax +49-89-2800362
Corpus Version 1.0
Date 18.06.2003
Status final
Comment  
Validation Guidelines Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook 

Validation results of the SC1 Corpus:

Summary

The speech corpus Strange Corpus 1 Accents (SC1) has been validated against general principles of good practice. The validation covered completeness and formal checks.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus SC1 (Version 1.0), which was released in 1995. The corpus was recorded in the years 1979 and 1991 at the University of Munich in Germany. The language of the corpus is German. 72 of the 88 speakers were not born in Germany and were educated in various countries. The corpus was intended for automatic accent detection, for testing the robustness of automatic speech recognition against different accents, and for the scientific investigation of accents in German. It contains read speech from 88 different speakers, each of whom read the German text "Nordwind und Sonne".

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the SC1 corpus, which can be found under: doc/...

README: documentation,
phondat.doc: PhonDat Data Format - Description,
sc1_ort.txt: orthography of the corpus,
sc1_spk.txt: speaker information,
seg_conv.txt: conventions for transcription and segmentation

Status of documentation: not ok.

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal, annotation and meta data files:

Correctness of file names: ok

Empty files: none

Status of signal, annotation and meta data files: ok.
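The report does not describe the tools used for these automatic checks. As a minimal sketch of how the empty-file and file-name checks could be reproduced, assuming the corpus is available as a directory tree, the following Python functions list zero-byte files and files whose names do not match a given pattern. The function names and the pattern passed in are hypothetical; the actual SC1 naming convention has to be taken from the corpus documentation.

```python
import re
from pathlib import Path


def find_empty_files(root):
    """Return all regular files below root that have a size of zero bytes."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )


def find_bad_names(root, pattern):
    """Return all regular files below root whose file name does not
    fully match the regular expression `pattern` (hypothetical; supply
    the naming convention documented for the corpus)."""
    rx = re.compile(pattern)
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and not rx.fullmatch(p.name)
    )
```

Both functions return a sorted list of paths, so an empty list corresponds to the results "Empty files: none" and "Correctness of file names: ok".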

III.) Manual Validation

Rough inspection of some sound files by ear.

IV.) Other Relevant Observations

The documentation does not mention that sub-corpus C is part of the PhonDat 2 corpus. It is clear neither what the speakers read nor what was annotated in the level 1 annotation:

"The orthography of the spoken text can be found in the file sc1_ort.txt (ASCII, with German 'Umlaute' coded in 7 bit ASCII).
Note that some of the speakers have not read the title of the story. Please refer to the signal file headers to get the exact orthography of what was spoken.
Also note that some of the speakers has repeated single words or phrases and/or have deleted words during reading. These are NOT marked in the orthographic or canonic strings of the headers."

The following sentence is not necessary in the documentation:

"This corpus will be extended to more countries and more material in 1996."

V.) Comments for Improvement

Parts of the documentation are unclear and even contradictory. We therefore suggest reformulating some passages to obtain a clear and consistent README file. The corpus consists of two parts that are similar but not identical, and the documentation should reflect this. The speakers evidently had different prompt texts. We suggest adding the other versions of the prompt text to the file sc1_ort.txt and indicating in the README file, at least roughly, which text was used in which recording.

Providing at least correct orthographic transcripts for sub-corpus C would be a great improvement, and phoneme segmentations for sub-corpus C would improve the corpus tremendously. For users of the corpus who cannot read German, an English version of the segmentation manual would be helpful.

Since the PhonDat header format is not easily accessible on a Windows system or for inexperienced users, we suggest converting the headers to the NIST format, which is easily readable with a text editor.
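To illustrate why a NIST header is readable in a text editor, the following Python sketch builds a header in the NIST SPHERE layout: an ASCII block that starts with "NIST_1A" and the header size, lists one "name type value" entry per line, ends with "end_head", and is padded to the declared size. It is a sketch under assumptions, not a PhonDat converter: the function name is hypothetical, the field values would have to be read from the PhonDat header (not shown here), and padding with spaces is an assumption.

```python
def build_nist_header(fields, header_size=1024):
    """Build a NIST SPHERE style ASCII header from a dict of fields.

    Integers are written with the type flag -i, floats with -r and
    everything else as a string with -sN, where N is the string length.
    """
    lines = []
    for name, value in fields.items():
        if isinstance(value, bool):
            raise TypeError(f"boolean value not supported for {name}")
        if isinstance(value, int):
            lines.append(f"{name} -i {value}")
        elif isinstance(value, float):
            lines.append(f"{name} -r {value:f}")
        else:
            text = str(value)
            lines.append(f"{name} -s{len(text)} {text}")

    # Two fixed lines (magic string and header size), then the fields,
    # then the end marker.
    head = f"NIST_1A\n{header_size:>7d}\n" + "\n".join(lines) + "\nend_head\n"
    raw = head.encode("ascii")
    if len(raw) > header_size:
        raise ValueError("fields do not fit into the requested header size")
    # Pad to the declared size (padding with spaces is an assumption).
    return raw + b" " * (header_size - len(raw))


# Illustrative values only; they are not taken from the SC1 corpus.
example = build_nist_header({
    "sample_rate": 16000,
    "channel_count": 1,
    "sample_n_bytes": 2,
})
```

The signal samples would follow directly after the padded header, so the first `header_size` bytes of such a file can be inspected with any text editor.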

VI.) Result

With an update of the documentation and after conversion of the PhonDat files into files with NIST headers, the corpus is in acceptable condition. Part T is very valuable. Since all recordings were repeated until they were flawless, the prompt text can be regarded as an orthographic transcription on which the phone segmentations that exist for each audio file are based.

Without proper orthographic transcriptions, Part C is suited only for users who are mainly interested in the audio signals or who need to know only roughly what was read.