5th Speech Production Seminar
Abstracts

Paper No.: 03

A MULTICHANNEL ARTICULATORY SPEECH DATABASE AND ITS APPLICATION FOR AUTOMATIC SPEECH RECOGNITION


Alan A. Wrench
Department of Speech and Language Sciences, Queen Margaret University College, Edinburgh, UK
William J. Hardcastle
Department of Speech and Language Sciences, Queen Margaret University College, Edinburgh, UK

The MOCHA (Multi-CHannel Articulatory) database is being created to provide a resource for training speaker-independent continuous ASR systems and for general coarticulatory studies. The planned dataset includes 40 speakers of English, each reading up to 460 TIMIT sentences (British version). The articulatory channels currently include Electromagnetic Articulograph (EMA) sensors attached directly to the vermilion border of the upper and lower lips, the lower incisor (jaw), the tongue tip (5-10 mm from the tip), the tongue blade (approximately 2-3 cm posterior to the tongue tip sensor), the tongue dorsum (approximately 2-3 cm posterior to the tongue blade sensor) and the soft palate (approximately 10-20 mm from the edge of the hard palate). A Laryngograph provides voicing information and an Electropalatograph (EPG) provides tongue-palate contact data.

This paper describes work in progress using this database to determine a set of articulatory parameters optimised for the task of automatic phone recognition. The articulatory feature vector is created by applying principal components analysis to the data provided by EPG and EMA, supplemented with a voicing energy feature extracted from the Laryngograph. The results show relatively poor recognition performance when compared with a standard acoustic feature vector. Dental stops are the only phonetic category in which the articulatory feature vector outperforms the acoustic standard.

Examination of the phone-level behaviour of the recogniser indicates several areas in which it may be improved. The strong tendency of the system to delete schwa and other central vowels may indicate a many-to-one mapping problem, since the tongue can be in many positions and still define a uniform acoustic tube; it may also indicate that a targetless schwa has no distinct gesture associated with it, which calls into question its legitimacy as a distinct segment. That is to say, the acoustic percept of a schwa may in some instances simply be a result of coarticulation between adjacent consonant segments. The failure to improve voiced/voiceless consonant discrimination suggests that an additional instrument to measure glottal opening may be required. More generally, principal components analysis is probably not the best method for dimension reduction: it accounts for the variance in the data, which naturally favours tongue movement, but it is not optimised for discriminating between phone classes. Linear discriminant analysis would be a better choice.

Two speakers from the MOCHA database, along with MATLAB display macros, can be found at http://www.cstr.ed.ac.uk/artic/
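
As an illustration of the feature construction described above, the following Python sketch applies principal components analysis to concatenated EMA and EPG frames, appends a voicing-energy feature, and contrasts this with the supervised linear discriminant analysis alternative suggested in the abstract. The channel counts, target dimensionality, frame-level phone labels and data values are placeholders chosen for the example, not details taken from the paper.

```python
# Minimal sketch of the feature-vector construction; shapes and dimensions are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Assumed per-frame raw measurements (placeholders for real MOCHA channels):
#   EMA: 7 sensors x (x, y) midsagittal coordinates = 14 values
#   EPG: 62 tongue-palate contact electrodes (binary)
#   Laryngograph: 1 voicing-energy value
n_frames = 5000
ema = rng.normal(size=(n_frames, 14))           # stand-in for EMA trajectories
epg = rng.integers(0, 2, size=(n_frames, 62))   # stand-in for EPG contact patterns
voicing = rng.random(size=(n_frames, 1))        # stand-in for voicing energy

# PCA over the concatenated EMA+EPG frames: it maximises retained variance,
# which favours large tongue movements rather than phone discrimination.
raw = np.hstack([ema, epg])
pca = PCA(n_components=16)                       # output dimensionality chosen arbitrarily here
artic_pca = pca.fit_transform(raw)

# Final articulatory feature vector: PCA components plus the voicing feature.
features_pca = np.hstack([artic_pca, voicing])
print(features_pca.shape)                        # (5000, 17)

# The alternative raised in the abstract: LDA projects onto directions that
# separate phone classes rather than directions that explain variance.
# Per-frame phone labels are assumed to come from a forced alignment.
phone_labels = rng.integers(0, 44, size=n_frames)  # placeholder labels
lda = LinearDiscriminantAnalysis(n_components=16)
artic_lda = lda.fit_transform(raw, phone_labels)
features_lda = np.hstack([artic_lda, voicing])
```

In this sketch the only difference between the two reductions is that LDA uses the phone labels during fitting, which is the sense in which it is "optimised for discriminating between phone classes" while PCA is not.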