BAS Validation Report SIGNUM

BAS Validation Report for the SIGNUM Database

Authors	Florian Schiel
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-Mail	bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-21805790
Corpus Version	0.9
Date	27/04/2009
Status	finished
Comment	All minor recommendations listed in this validation report have been amended by the authors in version 1.0 of the corpus. With regard to this validation procedure the corpus is now error-free.
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, IPS LMU München, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation Results of the SIGNUM Corpus

Summary

The video sign language corpus SIGNUM has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples.
The corpus SIGNUM is in very fine condition. Documentation is complete and all documented files are found and functional.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the sign language corpus SIGNUM (video) produced by the Institute of Man/Machine Interaction, located at the RWTH Aachen University in Germany.
The corpus was produced in 2007/2008 (exact date not given in the documentation) in Aachen, Germany. The sign language of the corpus is German. The aim was to record a base vocabulary (450 signs) and a selection of sentences based on this vocabulary (780 sentences) for benchmarking a sign language recognition system. The associated research project SIGNUM, including the corpus production, was funded by the DFG (German Research Council; research contract number not given).
The corpus contains signed German speech of 25 different signers. Each signer signed a corpus of 450 basic signs and 780 continuous sentences once. One signer (the reference signer) performed this corpus three times.
The corpus contains no audio files, only the full body frontal video of the signer in front of a neutral blue background. Because of the size of the corpus (980GB) it is distributed on hard disc. All the following absolute file paths start in the root .../signum of the hard disc.

I) Validation of Documentation

The General Documentation directory /html contains the following documentation files:
index.html : start of HTML documentation,
html : further referenced documentation files
images : images used in the HTML documentation
videos : flash video clips referenced in the HTML documentation
papers : copies of published work about SIGNUM (PDF)
The HTML documentation uses only relative paths for its internal references and can therefore be moved to a different location without any loss. To do so, move the start file index.html together with all the subdirectories of /html listed above to the new location.
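Whether a set of HTML pages really is relocatable can be checked mechanically. The following is a minimal Python sketch, not the procedure used in this validation: it collects all href and src attributes of a page and reports those that are not relative. The function name non_relative_links and the sample page are hypothetical. Note that links to external web sites or mailto: addresses are reported too and have to be judged by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the values of all href and src attributes of a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def non_relative_links(html_text):
    """Return, in document order, all links that are not relative paths."""
    collector = LinkCollector()
    collector.feed(html_text)
    offending = []
    for link in collector.links:
        parsed = urlparse(link)
        # A scheme (http:, file:), a host part (//host/...) or a leading
        # slash ties the link to a fixed location.
        if parsed.scheme or parsed.netloc or link.startswith("/"):
            offending.append(link)
    return offending


# Hypothetical sample page, for illustration only.
sample = (
    '<a href="html/signs.html">signs</a>'
    '<img src="/images/a.png">'
    '<a href="http://example.org/">external</a>'
)
print(non_relative_links(sample))
```

Applied to every .html file below /html, an empty result for all internal references would support the statement that the documentation can be moved freely.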
The HTML documentation includes example flash video clips of basic signs and sentences. We found that some browsers have problems with the embedded flash video player; therefore an additional direct link to the flash player was added to the embedding HTML page, allowing users with dysfunctional or outdated browsers to view the example videos.
Aside from the HTML documentation no other documentation or meta data files are contained in the corpus.

The following required contents of the documentation have been checked:

- Administrative Information:

Validating person: not given

Date of validation: not given

Contact for requests regarding the corpus: ok.

Number and type of media: 929GByte on hard disc. ok.

Content of each medium: n.a.

Copyright statement and intellectual property rights (IPR): ok.

- Technical information:

Layout of media: Information about file system type and directory structure: file system type not given (NTFS), dir structure ok.

File nomenclature: Explanation of used codes (no white space in file names!): ok.

Formats of signal and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format: JPEG picture. ok.
Coding: n.a.

Compression: Just widely supported compressions like zip or gzip should be used: JPEG implies (lossy) compression, grade of compression quality not given.

Sampling rate: 30fps. ok.

Valid bits per pixel: 24. ok.

Image size: 776x578. ok.

Multiplexed signals: n.a.

- Database contents:

Clearly stated purpose of the recordings: sign language recognition; no specifics given. ok.

Speech type(s): basic signs and sentences read from screen. ok.

Instruction to speakers in full copy: ok.

- Linguistic contents of prompted speech:

Specifications of the individual text items: detailed description of basic signs, compounds and sentences. ok.
Specification for the prompt sheet design or specification of the design of the speech prompts:
screen-prompted, no specification of prompt design given.

Example prompt sheet or example sound file from the speech prompting: n.a.

- Speaker information:

Speaker recruitment strategies: not given.

Number of speakers: 25 speakers. ok.

Distribution of signers over sex, age, dialect regions: not given; only implicitly in the signer data profiles.
Signer profiles contain sufficient information but are only accessible in the HTML documentation. A simple ASCII table should be added to the corpus, so that automatic processing is feasible.
Description/definition of dialect regions: probably n.a.

- Recording platform and recording conditions:

Recording platform: n.a.

Position and type of camera: ok.

Position of speakers: ok.

Number of channels and channel separation: n.a.

Acoustical environment: n.a.

- Annotation:
The mapping of file names and content can only be seen with the help of the search function within the HTML documentation. For automatic processing this is not acceptable. The mapping is stored in JavaScript tables in German and English: /html/sentences.js ...
A simple ASCII table should be provided in the corpus as well.
Unambiguous spelling standard used in annotations:
"The annotation follows the specifications of the Aachener Glossenumschrift, developed by the Deaf Sign Language Research Team (DESIRE) at the RWTH Aachen University." It is unclear whether this standard is widely used.

Labeling symbols: n.a.

List of non-standard spellings: probably n.a.

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: not given; probably 7-BIT ASCII/HTML

Any other language-dependent information such as abbreviations etc.: ok.

Annotation manual, guidelines, instructions: n.a.

Description of quality assurance procedures: n.a.

Selection of annotators: n.a.

Training of annotators: n.a.

Annotation tools used: n.a.

- Lexicon: n.a.

- Statistical information:

Frequency of sub-word units: phonemes: n.a.

Word frequency table: n.a.

Status documentation: repairable

II.) Formal Validation

The following contains all validation steps together with the methodology and results.

Mountability-Check: mountability on three OS: Windows, Macintosh and Linux:
Windows XP, Internet Explorer : ok
Windows XP, FireFox : ok
Macintosh, Safari : documentation: flashplayer does not work
Macintosh, FireFox: ok
Linux (SuSE 11.3): FireFox: ok
Linux (SuSE 11.3): SeaMonkey: ok
Linux (SuSE 11.3): Opera: documentation: flashplayer does not work
Linux (SuSE 11.3): Konqueror: documentation: flashplayer does not work
Mountability ok for all OS

Completeness of signal, annotation and meta data files:

Signal files JPEG format:
Since the number of JPEG pictures per recording is variable, we only check the number of required directories and whether these directories contain JPEG files.
Required directories: 27 * (780+450) = 33210
Counted: 33210 ok
Required files: 5970450
Counted: 5970450 ok
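
The directory count can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions, not the script used for this validation: it assumes that each recording is one directory holding its JPEG frames directly, and that frames carry the extension .jpg or .jpeg (the former is suggested by the mplayer example in section V). The factor 27 is taken from the report; it presumably corresponds to 24 signers recorded once plus three passes of the reference signer. The function name count_recordings is hypothetical.

```python
import os

SESSIONS = 27                    # factor used in this report
ITEMS_PER_SESSION = 780 + 450    # sentences + basic signs
EXPECTED_DIRS = SESSIONS * ITEMS_PER_SESSION   # 33210


def count_recordings(root):
    """Return (directories containing JPEG files, total JPEG files)."""
    dirs = 0
    files = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        jpegs = [f for f in filenames if f.lower().endswith((".jpg", ".jpeg"))]
        if jpegs:
            # Assumption: a directory with at least one frame is a recording.
            dirs += 1
            files += len(jpegs)
    return dirs, files
```

A call such as count_recordings(".../signum") would then be compared against EXPECTED_DIRS and against the required file count given above. Directories without any JPEG file are not counted and would show up as a shortfall.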

Correctness of file names:
Checked for dir names syntax: ok
Checked for filename syntax: ok
Checked for dot-file: none
Checked for superfluous files: none

Empty files: none
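
The checks for file names, dot-files and empty files lend themselves to a single pass over the directory tree. The following Python sketch illustrates the idea. The name rule SAFE_NAME is a hypothetical stand-in that only excludes white space and unusual characters; the actual SIGNUM naming scheme is defined in the corpus documentation and is not reproduced here. The function name scan_tree is likewise an assumption.

```python
import os
import re

# Hypothetical rule: letters, digits, '_', '-' and '.' only (no white space).
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def scan_tree(root):
    """Return lists of paths with bad names, dot-files and empty files."""
    problems = {"bad_name": [], "dot_file": [], "empty_file": []}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not SAFE_NAME.fullmatch(name):
                problems["bad_name"].append(os.path.join(dirpath, name))
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.startswith("."):
                problems["dot_file"].append(path)
            if os.path.getsize(path) == 0:
                problems["empty_file"].append(path)
    return problems
```

Three empty lists correspond to the results reported above. Detecting superfluous files would additionally require the list of expected names, which this sketch does not attempt.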

III.) Manual Validation

Does not apply, because the corpus contains no manual labelling/tagging.

IV.) Other Relevant Observations

The usage of NTFS via USB on LINUX is quite unstable. The system crashed during the validation procedure.
A recommendation could be added to the documentation that users of LINUX systems should consider copying the corpus to a more stable file system such as Reiser or ext3 before working with it.

V.) Comments for Improvement

A machine-readable speaker table and content table (mapping from content to basic sign and sentence numbers) should be added to the corpus.
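Such a table can be as simple as a tab-separated text file. The Python sketch below shows one possible layout; the column names, the identifier format and the placeholder annotations are all hypothetical and are not taken from the corpus. The real entries would have to be exported from the JavaScript tables of the HTML documentation.

```python
import csv
import io


def write_content_table(rows, out):
    """Write (item_id, item_type, annotation) rows as tab-separated text."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["item_id", "item_type", "annotation"])  # hypothetical header
    for item_id, item_type, annotation in rows:
        writer.writerow([item_id, item_type, annotation])


# Placeholder rows for illustration only; not real corpus content.
example_rows = [
    ("sign-0001", "basic_sign", "PLACEHOLDER_GLOSS_A"),
    ("sent-0001", "sentence", "PLACEHOLDER_GLOSS_A PLACEHOLDER_GLOSS_B"),
]

buffer = io.StringIO()
write_content_table(example_rows, buffer)
print(buffer.getvalue(), end="")
```

A speaker table could follow the same pattern, with one row per signer and one column per profile field.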
The following minor information might be useful and could therefore be added to the HTML documentation:

a corpus version number on the starting HTML page (1.0?)
file system type of medium: NTFS
grade of compression quality
information about speaker recruitment strategy
more information about 'Aachener Glossenumschrift' and its relation to other systems
hint about possible problems when mounting NTFS under LINUX
example commands on how to play JPEG sequences or to encode JPEG sequences into other video formats, for example using 'mplayer':
play: mplayer "mf://*.jpg" -mf fps=30
encode to MPEG4: mencoder "mf://*.jpg" -mf fps=30 -o con0066.avi -ovc lavc -lavcopts vcodec=mpeg4
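
For converting many recordings, the mencoder call above can be wrapped in a small script. The following Python sketch builds exactly the argument list shown above and runs it inside one frame directory. The function names are hypothetical, and the sketch assumes that mencoder is installed and on the PATH. The pattern mf://*.jpg is expanded by mencoder itself relative to the working directory, so no shell is involved.

```python
import subprocess


def mencoder_command(output_file, fps=30):
    """Return the mencoder call from this report as an argument list."""
    return [
        "mencoder", "mf://*.jpg",
        "-mf", "fps=%d" % fps,
        "-o", output_file,
        "-ovc", "lavc",
        "-lavcopts", "vcodec=mpeg4",
    ]


def encode_directory(frame_dir, output_file, fps=30):
    """Encode all JPEG frames in frame_dir; raises if mencoder fails."""
    subprocess.run(mencoder_command(output_file, fps), cwd=frame_dir, check=True)
```

Since a relative output_file would be written into the frame directory itself, an absolute path outside the corpus is the safer choice, for example on a separate working disc.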

VI.) Result

The corpus SIGNUM is in very fine condition. Documentation is complete and all documented files are found and functional. Some meta information (speaker table, content mapping to file numbering) should be added in machine readable form to simplify further usage of the corpus.