Face, Speech, and Acoustics  



Abstract

Nick Campbell (ATR, Kyoto)

The sound of a smiling voice?

This talk will address some issues of spoken language communication from the perspective of non-verbal content. It will argue that if we are to produce speech synthesis able to model all aspects of human spoken interaction, then we should model not just the lexical and syntactic structures but also the "grunts" that constitute a large part of informal spoken human communication. We have recently begun analysing a large corpus of interactive speech (drawn from 1000 hours of recordings of spontaneous natural conversations) and find that more than half of the sounds cannot be adequately synthesised by conventional means. They typically take the form of simple syllables, often repeated many times, but are prosodically extremely complex. They also make considerable use of voice quality to signal different paralinguistic effects. Our findings support many of the claims made by David Crystal in the 1960s and 1970s, but we are now in a better position to model these sounds acoustically, by making use of improved algorithms for the inverse filtering of continuous running speech. In reporting work in progress, we will describe our first attempts to categorise these sounds for synthesis, and present a hybrid model of concatenative speech synthesis which combines non-verbal speech sounds, selected by higher-level prosodic features (including voice quality as a prosodic component), with more traditional unit-selection methods for the lexical content.
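The general idea of unit selection with voice quality treated as a prosodic feature can be sketched as follows. This is a minimal illustrative toy, not the ATR system: the `Unit` fields, the `breathiness` measure, and all weights are hypothetical, chosen only to show how a voice-quality term can enter the target and join costs alongside pitch and duration.

```python
# Hypothetical sketch of unit selection with voice quality as a prosodic
# feature. All names, features, and weights are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Unit:
    phone: str          # segment label
    f0: float           # mean fundamental frequency (Hz)
    duration: float     # seconds
    breathiness: float  # crude voice-quality measure in [0, 1]


def target_cost(target: Unit, cand: Unit, w=(1.0, 1.0, 2.0)) -> float:
    """Weighted distance between desired and candidate prosody;
    voice quality is weighted alongside pitch and duration."""
    return (w[0] * abs(target.f0 - cand.f0) / 100.0
            + w[1] * abs(target.duration - cand.duration)
            + w[2] * abs(target.breathiness - cand.breathiness))


def join_cost(a: Unit, b: Unit) -> float:
    """Penalise pitch and voice-quality discontinuities at the join."""
    return abs(a.f0 - b.f0) / 100.0 + abs(a.breathiness - b.breathiness)


def select(targets, inventory):
    """Viterbi-style search: pick one candidate per target position,
    minimising accumulated target plus join costs."""
    cands = [[u for u in inventory if u.phone == t.phone] for t in targets]
    best = [(target_cost(targets[0], u), [u]) for u in cands[0]]
    for t, col in zip(targets[1:], cands[1:]):
        best = [min(((c + join_cost(path[-1], u) + target_cost(t, u),
                      path + [u]) for c, path in best),
                    key=lambda x: x[0])
                for u in col]
    return min(best, key=lambda x: x[0])[1]


# Usage: with a breathy target, the search prefers breathy candidates
# even when a modal-voice unit with matching pitch is available.
inventory = [Unit("a", 120, 0.10, 0.1), Unit("a", 118, 0.10, 0.9),
             Unit("n", 115, 0.08, 0.1), Unit("n", 119, 0.08, 0.85)]
targets = [Unit("a", 120, 0.10, 0.9), Unit("n", 118, 0.08, 0.9)]
chosen = select(targets, inventory)
print([u.breathiness for u in chosen])
```

The point of the sketch is only that voice quality participates in both costs, so the search trades a small pitch mismatch for a consistent voice quality across the selected sequence.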



Last modified: Fri Oct 25 16:15:59 CEST 2002