First draft of a speech corpus specification

Results of the first BITS workshop on speech synthesis corpora, 10/08 - 10/09 2002

Version 1.3


F. Schiel

The following list of properties, structures and procedures is the result of an exhaustive discussion at the first workshop on speech synthesis corpora organised by the BITS project in Munich. It is a starting point for a first draft specification. It does not imply that all of the following will be implemented in the final corpus production; rather, it should be regarded as an 'ideal' aim for the BITS project to pursue. Progress will be reported to the workshop participants on a regular basis ('newsletter'), and we hope for lively discussions and comments on the mailing list as well.

  I. General Issues

    - The corpus will consist of the following parts (in order of decreasing priority):

1. Core corpus for:

- diphone synthesis

- unit selection synthesis techniques.

(annotated etc. in such a way that it is usable with state-of-the-art synthesis systems)

2. 'Goodie' corpora:

- goodie1: several hours from a 'core' speaker with studio quality

speaking on different topics and in different styles

(style: controlled to conversational, up to dialogues where only the voice of the speaker is recorded)

might only be transcribed or annotated in parts

Florian, Alex, Mark and Gregor (and possibly others) will lead an
ongoing discussion about this via the mailing list.

- goodie2: huge amounts of existing recordings of one speaker (broadcasting, if possible from a core speaker)
3. Dictionary for core corpus

4. Annotations

BAS Partitur Format (BPF), to allow easy extension with new features in the future

- The corpus is aimed at broad usage (for example email reading)
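
As an illustration of why a tier-based format such as BPF is easy to extend, the following Python sketch reads BPF-style lines of the form 'KEY: content' and groups them by tier, so that a tier it has never seen is simply carried along. This is a minimal sketch, not a BPF reference implementation: the function name parse_bpf, the sample text and the tier name XYZ are assumptions made up for this example, and the sample is not a complete or validated BPF file.

```python
from collections import defaultdict


def parse_bpf(text):
    """Split BPF-style text into header entries and tier lines.

    Assumes every non-empty line looks like 'KEY: content' and that a
    line with the key 'LBD' ends the header.
    """
    header = {}
    tiers = defaultdict(list)
    in_header = True
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only; labels such as 'gu:t@n' contain colons.
        key, _, content = line.partition(":")
        key = key.strip()
        content = content.strip()
        if in_header:
            if key == "LBD":
                in_header = False
            else:
                header[key] = content
        else:
            # Keep the whitespace-separated fields of each tier entry as-is.
            tiers[key].append(content.split())
    return header, dict(tiers)


# Illustrative sample only; 'XYZ' is a hypothetical new tier.
SAMPLE = """\
LHD: Partitur 1.2
SAM: 16000
LBD:
ORT: 0 guten
ORT: 1 morgen
KAN: 0 gu:t@n
KAN: 1 mO6g@n
XYZ: 0 hypothetical-new-feature
"""

header, tiers = parse_bpf(SAMPLE)
for name, entries in tiers.items():
    print(name, entries)
```

Because the reader does not hard-code tier names, adding a new annotation level later would not require changes to tools that only use the older tiers.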

  II. Speaker selection

    - Professional or semi-professional speakers

    - Pre-selection: according to demo recordings, check of spectral consistency

Recording under controlled conditions (to judge reading abilities)

Selection (from 10-12):

Target sentences (3) and a mini-corpus that contains the diphones for those targets.

(Mini-corpus should be checked or designed by an expert!)

Record the targets (for the prosody only!) and the mini-corpus for the selected 10-12 speakers

Segment into half-phones

Send the material to helpers to synthesise the targets

Helpers might be: B. Bozkurt, D. Hirschfeld, IMS, maybe others

Rank the synthesised targets.

Also run the aligner on the signals to check whether automatic segmentation works.

(Maybe try to get additional funding from A and CH for that and use the existing techniques)

- It would be good if the selected speakers also provided material for the 'goodie2' part
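
The requirement that the mini-corpus contains the diphones of the targets can be checked mechanically once phonemic transcriptions are available. A minimal Python sketch, assuming each utterance is given as a list of phone labels and that a silence symbol '_' pads both ends (the function names, the padding convention and the example phone strings are assumptions of this example, not decisions of the workshop):

```python
def diphones(phones):
    """Return the set of adjacent phone pairs in one utterance.

    Silence ('_') is added at both ends so that utterance-initial and
    utterance-final transitions are counted as well (an assumption).
    """
    padded = ["_"] + list(phones) + ["_"]
    return {(a, b) for a, b in zip(padded, padded[1:])}


def missing_diphones(targets, mini_corpus):
    """Diphones needed by the targets that the mini-corpus does not contain."""
    needed = set().union(*(diphones(t) for t in targets))
    available = set().union(*(diphones(s) for s in mini_corpus))
    return needed - available


# Made-up example data.
targets = [["h", "a", "l", "o:"]]
mini_corpus = [["h", "a", "t"], ["b", "a", "l"]]

missing = missing_diphones(targets, mini_corpus)
print(sorted(missing))
```

An empty result would mean that every diphone of the target sentences occurs at least once in the mini-corpus; it says nothing about the quality of those instances, which is why the expert check is still needed.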

  III. Text / Unit selection, Structure

- 3 types of texts:

List of logatoms (maybe logatoms with more than one diphone): approx. 3000 (max)

Under controlled conditions (phonetics) and flat prosody

To be used for diphone synthesis only, one m and one f speaker only (low prio)

(Dissertation of Th. Portele for a set of German diphones)

List of sentences containing diphones: approx. 700 (max)

Under controlled conditions: one recording with flat prosody only (one m and one f speaker);
recording with natural prosody by all 4 core speakers (high prio)

To be used for diphone synthesis and as a subset of the unit selection

Unclear where to get the list, maybe in collaboration with IMS

List of sentences for unit selection: approx. 5000 (max)

Recording with natural prosody, one m and one f speaker.

Control for mispronunciation (including stress; maybe mark up stress in the text) and for technical issues

Unclear where to get this list, again maybe in collaboration with IMS

- Domains? Weather, stock reports, traffic reports, sports? Very limited content; topics that make it easy to speak with emotion
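
One possible way to arrive at a sentence list of limited size (for example the approx. 700 diphone sentences) from a larger candidate pool is greedy set cover selection: repeatedly take the sentence that adds the most diphones not yet covered. The workshop did not decide on a selection method; the sketch below only illustrates this common heuristic, and the function names, sentence ids and phone lists are hypothetical.

```python
def diphones(phones):
    """Set of adjacent phone pairs within one sentence."""
    return {(a, b) for a, b in zip(phones, phones[1:])}


def greedy_select(sentences, max_sentences):
    """Pick sentences greedily until all diphones are covered or the limit is hit.

    sentences: dict mapping a sentence id to its list of phone labels.
    Returns (chosen ids in selection order, diphones still uncovered).
    """
    units = {sid: diphones(p) for sid, p in sentences.items()}
    uncovered = set().union(*units.values())
    chosen = []
    while uncovered and len(chosen) < max_sentences:
        # Sorting the ids first makes tie-breaking deterministic.
        best = max(sorted(units), key=lambda sid: len(units[sid] & uncovered))
        gain = units[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Made-up candidate pool.
pool = {
    "s1": ["a", "b"],
    "s2": ["c", "d"],
    "s3": ["a", "b", "c"],
}
chosen, uncovered = greedy_select(pool, max_sentences=700)
print(chosen, uncovered)
```

Greedy selection does not guarantee the smallest possible list, and it ignores prosodic context and domain balance, so a list produced this way would still need manual review.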

  IV. Recording Setup

  V. Technical Specs

  VI. Annotation / Segmentation

1. Units

SmartKom/Verbmobil phoneme set as a starting point; the list will be circulated through the mailing list for comments

For all vowels and continuous sounds: at 50%

Diphthongs: at the beginning of the formant change

Plosives: right before the burst

- Then use this material for iterative training of half-phones and distribute both segmentations

- Syllable segments as an independent tier in the annotation; boundaries selected from the existing boundaries by certain rules (Hirschfeld); also insert a marker here for stress derived from the dictionary.
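
The half-phone boundary rules listed under Units (50% for vowels and continuous sounds, formant change for diphthongs, right before the burst for plosives) could be applied to an existing phone segmentation roughly as in the Python sketch below. It is only a sketch under several assumptions: the label sets are placeholders and not the SmartKom/Verbmobil set, every label outside these two sets is treated as a midpoint case, the '_1'/'_2' suffixes are a hypothetical naming scheme, and the burst or formant-change position has to be supplied from the signal (field split) because it cannot be computed from segment times alone.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder label classes, for illustration only.
PLOSIVES = {"p", "b", "t", "d", "k", "g"}
DIPHTHONGS = {"aI", "aU", "OY"}


@dataclass
class Segment:
    label: str
    start: int                   # first sample of the phone
    end: int                     # sample just after the phone
    split: Optional[int] = None  # signal-derived boundary (burst / formant change)


def half_phones(seg):
    """Split one phone segment into two half-phone segments."""
    if seg.label in PLOSIVES or seg.label in DIPHTHONGS:
        if seg.split is None:
            raise ValueError(f"{seg.label}: boundary must be set from the signal")
        cut = seg.split
    else:
        # Vowels and continuous sounds: boundary at 50% of the segment.
        cut = seg.start + (seg.end - seg.start) // 2
    if not seg.start < cut < seg.end:
        raise ValueError(f"{seg.label}: boundary {cut} lies outside the segment")
    return [
        (seg.label + "_1", seg.start, cut),
        (seg.label + "_2", cut, seg.end),
    ]


vowel_halves = half_phones(Segment("a:", 1000, 1200))
plosive_halves = half_phones(Segment("t", 0, 800, split=600))
print(vowel_halves)
print(plosive_halves)
```

Segments produced this way could serve as the starting material for the iterative half-phone training mentioned above; plosives and diphthongs without a supplied boundary are rejected instead of being split at a guessed position.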

2. Prosody

3. Linguistic annotation

    VII. Other Things in the final distribution:

Dictionary: Consistent pronunciation dictionary for all words contained in the corpus, including syllable boundaries and stress markers. Covering core and goodie1. Maybe an automatically generated pronunciation for goodie2.

Spoken texts (possibly validated for errors = transliteration) for core and goodie1.

Texts for goodie2, but not validated for errors. Remark: Recordings of one of the BITS speakers, even without transcription, are better than nothing.
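
Whether the dictionary really covers all words contained in the corpus is easy to verify automatically. A minimal Python sketch, assuming the spoken texts are available as plain strings with punctuation already removed and the dictionary as a mapping from orthographic word to pronunciation; the function name, the lower-casing of words and the example entries (with '.' as syllable boundary and ' as stress marker) are assumptions of this example:

```python
def missing_words(transcripts, dictionary):
    """Return the sorted list of transcript words that have no dictionary entry.

    Words are compared in lower case, which is a simplification
    (it conflates, for example, German nouns and homographic verbs).
    """
    words = {w.lower() for line in transcripts for w in line.split()}
    known = {w.lower() for w in dictionary}
    return sorted(words - known)


# Made-up example data.
transcripts = ["guten Morgen", "guten Abend"]
dictionary = {
    "guten": "'gu:.t@n",
    "Morgen": "'mO6.g@n",
}

gaps = missing_words(transcripts, dictionary)
print(gaps)
```

Running such a check over core and goodie1 before distribution would list every word that still needs a pronunciation entry.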

    VIII. Rough Schedule:

    IX. Other comments:

- Priority on the female voices