First draft of a speech corpus specification
Result from the first BITS workshop on speech synthesis corpora 10/08 - 10/09 2002
The following list of properties, structure and procedures is the result of an exhaustive discussion at the first workshop on speech synthesis corpora, organised by the BITS project in Munich. It is a starting point for a first draft specification. It does not imply that all of the following will be implemented in the final corpus production; rather, it should be regarded as an 'ideal' aim that the BITS project should follow. Progress of the efforts will be reported to the participants of the workshop on a regular basis ('newsletter'), and we hope for lively discussion and comments on that mailing list as well.
- The corpus will consist of the following parts (with decreasing priority):
1. Core corpus for:
- diphone synthesis
- unit selection synthesis techniques.
(annotated etc. so that it is usable with state-of-the-art techniques)
2. 'Goodie' corpora:
- goodie1: several hours from a 'core' speaker with studio quality,
speaking on different topics and in different styles
(style: controlled to conversational, up to dialogues where only the voice of the speaker is recorded)
might be transcribed or annotated only in part
Alex, Mark and Gregor (and possibly others) will lead an
ongoing discussion about this via the mailing list.
- goodie2: large amounts of existing recordings of one speaker
(broadcast material, if possible from a core speaker)
3. Dictionary for core corpus
BAS Partitur Format (BPF) to allow easy extension to new features in the future
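As an illustration of why BPF extends easily, a minimal sketch of such a file (the word, the sample values and the tier selection are invented for illustration; consult the BAS documentation for the authoritative tier list):

```
LHD: Partitur 1.3
SAM: 48000
LBD:
ORT: 0 hallo
KAN: 0 halo:
MAU: 0 23999 0 h
```

LHD and SAM form the header (format version and sampling rate), LBD: opens the body, ORT holds the orthography and KAN the canonical SAMPA pronunciation linked to word 0, and each MAU line gives a segment as start sample, duration, word link and phone label. Because every tier is self-describing, new tiers (e.g. half-phones or prosody) can be added later without breaking existing tools.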
- The corpus is aimed at broad usage (for example email reading)
- Professional or semi-professional speakers
- Pre-selection: according to demo recordings, check of spectral consistency
Recording under controlled conditions (to judge reading abilities)
Selection (from 10-12):
Target sentences (3) and a mini-corpus that contains the diphones for those targets.
(Mini-corpus should be checked or designed by an expert!)
Record targets (for the prosody only!) and mini-corpus for selected 10-12 speakers
Segment in half-phones
Send the material to helpers to synthesise the targets
Helpers might be: B. Bozkurt, D. Hirschfeld, IMS, maybe others
Rank the synthesised targets.
Also run the aligner on the signals to check whether automatic segmentation works.
Core: Number of speakers: 2 m + 2 f (possibly + 2 children aged 12)
Multilinguality: speakers should also speak French or English
Accents (Austrian, Swiss): no accent for the core (if feasible).
(Maybe try to get additional funding from Austria and Switzerland for that and use the existing techniques)
- It would be good if the selected speakers also provided material for the 'goodie2' part
Text / Unit selection, Structure
- 3 types of texts:
List of logatomes (maybe logatomes with more than one diphone): approx. 3000 (max)
Under controlled conditions (phonetics) and with flat prosody
To be used for diphone synthesis only; one male and one female speaker only (low priority)
(cf. the dissertation of Th. Portele for a set of German diphones)
List of sentences containing diphones: approx. 700 (max)
One recording under controlled conditions with flat prosody only (one male and one female speaker),
one recording with natural prosody (all 4 core speakers, high priority)
To be used for diphone synthesis and as a subset of the unit-selection material
Unclear where to get the list; maybe in collaboration with IMS
List of sentences for unit selection: approx. 5000 (max)
recording with natural prosody, one male and one female speaker.
Control for false pronunciation (including stress, maybe marked-up stress in the text) and for technical quality
Unclear where to get this list; again maybe in collaboration with IMS
- Domains? Weather, stock reports, traffic reports, sports? Very limited content; topics that make it easy to speak with emotion
Recording
Head-mounted microphone 15 cm in front of the nose or eyes (fixed distance!)
EGG recorded at 48 kHz
If feasible, use a high-quality microphone (B&K 4165, lower cutoff frequency 2 Hz) and apply a linear-phase high-pass filter afterwards
Additional large-diaphragm microphone at 30 cm with windscreen
Studio: soundproof box
One control person within the studio; one outside for technical control
Prompting: from a tilted TFT screen; speaker oriented at a 32.8 degree angle to avoid reflection
Core: recording unit is the sentence and each sentence is repeated when an error occurs; goodie: recording unit should be a paragraph
Computer enforced recording protocol
Core: Recording in a short period to avoid quality changes if feasible
Sampling rate: 48 kHz (advantage: better timing with regard to EGG signals)
Post-processing: linear-phase high-pass at 50 Hz; no down-sampling
File formats: NIST
Gain range: -18 dB average, parallel recording with an additional -6 dB on the fourth channel
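The 50 Hz linear-phase high-pass for post-processing could be realised as a symmetric windowed-sinc FIR filter (symmetric taps give exactly linear phase). A dependency-free Python sketch; the tap count and the Hamming window are our assumptions, not part of this specification:

```python
import cmath
import math

def highpass_taps(cutoff_hz=50.0, fs=48000, numtaps=3201):
    """Design a linear-phase FIR high-pass (windowed sinc + spectral inversion).

    numtaps is odd so the centre tap falls on an integer sample; the
    symmetric taps guarantee linear phase.  3201 taps at 48 kHz give a
    transition band of roughly 50 Hz with a Hamming window (an assumed
    design choice, not taken from the spec).
    """
    fc = cutoff_hz / fs            # normalised cutoff in cycles/sample
    m = numtaps - 1
    lp = []
    for n in range(numtaps):
        k = n - m / 2              # sample index relative to the centre tap
        s = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / m)   # Hamming window
        lp.append(s * w)
    g = sum(lp)                    # normalise the low-pass to unity DC gain
    lp = [t / g for t in lp]
    hp = [-t for t in lp]          # spectral inversion: delta minus low-pass
    hp[m // 2] += 1.0
    return hp

def gain_at(taps, f, fs=48000):
    """Magnitude of the filter's frequency response at f Hz."""
    return abs(sum(t * cmath.exp(-2j * math.pi * f * n / fs)
                   for n, t in enumerate(taps)))
```

Because the taps are symmetric, every frequency is delayed by exactly (numtaps - 1) / 2 samples, so the speech and EGG channels stay time-aligned after compensating that constant delay.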
Annotation / Segmentation
half-phones; maybe automatic division at 40%/60% within each phone
Coding: SAMPA with 'l' and 'r' prefixes, for instance 'l_a:' and 'r_a:'
The SmartKom/Verbmobil phoneme set as a starting point; circulate the list through the mailing list for comments
Automatic segmentation: HMM for phones, iteratively speaker-adapted model
Manual correction = insertion of half-phone boundaries for 1 hour of material:
For all vowels and continuant sounds: at 50%
Diphthongs: at the beginning of the formant change
Plosives: right before the burst
- Then use this material for iterative training of half-phones and distribute both segmentations
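The 40/60 division and the l_/r_ label prefixes described above can be sketched as follows (the function name and the (start, duration, label) tuple layout are our assumptions, not part of the spec):

```python
def to_half_phones(segments, split=0.4):
    """Split phone segments into half-phones with 'l_'/'r_' SAMPA prefixes.

    segments: list of (start, duration, label) tuples in samples.
    The left half-phone covers the first `split` fraction of the phone
    (40% by default), the right half-phone the remainder.
    """
    halves = []
    for start, dur, label in segments:
        left = int(round(dur * split))
        halves.append((start, left, "l_" + label))
        halves.append((start + left, dur - left, "r_" + label))
    return halves
```

For manually corrected material, the hand-placed boundary would simply replace the computed `left` offset for that phone.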
- Syllable segments as an independent tier in the annotation; boundaries selected from the existing boundaries by certain rules (Hirschfeld); also insert a stress marker here, derived from the dictionary.
Produce a reliable, verified (manually or otherwise), ToBI-style
prosodic annotation for at least one f and one m speaker (maybe more).
Ideally, this annotation would follow the GToBI annotation manual (cf.
Verbmobil or IMS); alternatively, a reduced set of prosodic labels,
such as ToBI-Lite, which possibly can be generated more reliably than
a full set. One possible starting point: automatic labelling of GToBI from Norbert (IMS); BITS would collaborate with Norbert to evaluate and correct these data manually, providing two tiers for at least one female and one male speaker (maybe more).
Adding other prosodic labeling tiers by using tools from other
sources (e.g., Batliner's prosodic labels [cf. Verbmobil]; Mixdorff's
Fujisaki parameter extraction tool; Möhler's Painte parameterization
tool; etc.); help is needed there from partners.
3. Linguistic annotation
POS tags already available from the text selection (for instance STTS)
Chunk parser results (delivered from DFKI)
VII. Other Things in the final distribution:
Dictionary: consistent pronunciation dictionary for all words contained in the corpus, including syllable boundaries and stress markers. Covering core and goodie1. Maybe an automatically generated pronunciation for goodie2.
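One common convention for such entries is '.' as the syllable separator and the SAMPA primary-stress marker '"' at the start of the stressed syllable. A small Python sketch of how an entry in that form could be decomposed (the entry format and the function name are assumptions, not part of this specification):

```python
def parse_pron(pron):
    """Split a pronunciation like 'ha."lo:' into its syllables and
    return the index of the primary-stressed syllable (None if unmarked)."""
    parts = pron.split(".")
    stressed = next((i for i, s in enumerate(parts) if s.startswith('"')), None)
    return [s.lstrip('"') for s in parts], stressed
```

Keeping boundaries and stress machine-readable like this lets the same dictionary drive both the syllable tier and the stress markers in the annotation.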
Spoken texts (possibly validated for errors, i.e. a transliteration) for core and goodie1.
Texts for goodie2, but not validated for errors. Remark: recordings of one of the BITS speakers even without transcription are better than nothing.
VIII. Rough Schedule:
text corpora selection and phoneme set
in parallel recruiting speakers for the pre-selection (40-50)
in parallel development of annotation / recording tools
negotiations with speakers to ensure they are willing to give their voice to BITS. Sign the contract beforehand, covering everything.
pre-recordings of selected 10-12 speakers for 3 target sentences
manual labelling and segmentation in half-phones of this data and distribution to interested partners.
Partners do not change boundaries and deliver the resulting synthesised sentences back. Then, depending on time and money, BITS does a double-blind WWW test and every partner can rank the results.
Selection of the final 4 speakers; select two fall-back speakers.
Start the recording of the core.
Do one voice first (including segmentation), distribute it, get feedback etc., do correction to the process, then start with the next voice.
IX. Other comments:
- Priority on the female voices