Section: REPORTS
SENSORIMOTOR ADAPTATION IN SPEECH PRODUCTION
Human
subjects are known to adapt their motor behavior to a shift of the
visual field brought about by wearing prism glasses over their eyes.
The analog of this phenomenon was studied in the speech domain. By use
of a device that can feed back transformed speech signals in real time,
subjects were exposed to phonetically sensible, online perturbations of
their own speech patterns. It was found that speakers learn to adjust
their production of a vowel to compensate for feedback alterations that
change the vowel's perceived phonetic identity; moreover, the effect
generalizes across phonetic contexts and to different vowels.
When
human subjects are asked to reach to a visual target while wearing
displacing prisms over their eyes, they are observed to miss the target
initially, but to adapt rapidly such that within a few movements their
reaching appears once again to be rapid and natural. Moreover, when the
displacing prisms are subsequently removed, subjects are observed to
show an aftereffect; in particular, they miss the target in the
direction opposite to the displacement. This basic result has provided
an important tool for investigating the nature of the sensorimotor
control system and its adaptive response to perturbations ( 1).
The
experiment described in this report is based on an analogy between
reaching movements in limb control and articulatory movements in speech
production. Although reaching and speaking are qualitatively very
different motor acts, they nonetheless share the similarity of having
sensory goals--reaching movements are made to touch or grasp a target,
and articulatory movements are made to produce a desired acoustic
pattern. It is therefore reasonable to ask whether the speech motor
control system might also respond adaptively to alterations of sensory
feedback ( 2).
However, beyond the intrinsic interest of speech motor control and the
importance of discovering commonalities between different effector
systems, there are also advantages to studying sensorimotor adaptation
in the speech domain. Whereas in arm movement research there is little
agreement as to the nature of the underlying discrete units of complex
movements (and indeed there is controversy as to whether or not such
discrete units exist), in speech there is substantial evidence
regarding an underlying discrete control system. In particular, the
disciplines of phonology and phonetics have provided linguistic and
psychological evidence for the existence of discrete units such as
syllables ( 3), phonemes ( 4), and features ( 5).
There are still major controversies, however, regarding the role of
such discrete units in the online control of speech production ( 6).
An important reason for the lack of agreement is methodological; in
particular, there is no agreed-upon methodology for decomposing
articulatory and acoustic records into segments that might be
identified with underlying control structures. Thus, while linguistic
and psychological evidence has provided useful hypotheses as to the
putative discrete control structures underlying speech motor control,
it has proven difficult to evaluate these hypotheses directly in
experiments on speech motor control.
Our
research provides a new line of attack on this problem. In an
adaptation paradigm, we can expose subjects to acoustic perturbations
of their articulatory output in one linguistic context and ask whether
any adaptation that is found transfers to another linguistic context.
For example, if the formants of the vowel [Epsilon] are altered in the
context of "pep," we can ask whether adaptation generalizes to
[Epsilon] in the context of "set" or in the context of "forget." We can
also ask whether adaptation is observed for other vowels. Such
manipulations provide a direct probe of the putative hierarchical,
segmental control of speech production.
We
built an apparatus to alter subjects' feedback in real time (Fig. 1).
The apparatus allows us to shift formant frequencies independently so
as to impose arbitrary perturbations on the speech signal within the
two-dimensional (F1, F2) formant space ( 7-9).
This apparatus was used in an experiment in which a subject whispered
4220 prompted words over approximately 2 hours. The experiment
consisted of the following: a 10-min acclimation phase; a 17-min
baseline phase; a 20-min ramp phase; a 1-hour training phase; and a
17-min test phase. During the ramp phase, the feedback heard by the
subject was increasingly altered, reaching a maximal alteration
strength at which it was held for the duration of the training and test
phases ( 10).
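The phase structure described above can be sketched as a simple schedule. The phase durations are taken from the text; the linear shape of the ramp is an assumption made for illustration (the report states only that the alteration increased during the ramp phase), and the names `PHASES`, `phase_at`, and `alteration_strength` are hypothetical.

```python
# Phase durations in minutes, as stated in the report.
PHASES = [
    ("acclimation", 10),
    ("baseline", 17),
    ("ramp", 20),
    ("training", 60),
    ("test", 17),
]


def phase_at(minute):
    """Return (phase name, minutes elapsed within that phase)."""
    start = 0
    for name, duration in PHASES:
        if minute < start + duration:
            return name, minute - start
        start += duration
    raise ValueError("minute is past the end of the session")


def alteration_strength(minute):
    """Fraction of the maximal feedback shift in force at a given minute.

    ASSUMPTION: the ramp is linear; the report does not give its shape.
    """
    name, elapsed = phase_at(minute)
    if name in ("acclimation", "baseline"):
        return 0.0
    if name == "ramp":
        return elapsed / dict(PHASES)["ramp"]
    return 1.0  # held at maximum through the training and test phases


if __name__ == "__main__":
    total = sum(duration for _, duration in PHASES)
    print("session length (min):", total)
    for minute in (0, 27, 37, 47, 110):
        print(minute, phase_at(minute)[0], alteration_strength(minute))
```

The durations sum to 124 min, consistent with the stated "approximately 2 hours." In the experiment this strength would apply only to training words produced with feedback; testing words were not altered.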
During
the experiment, the subject was prompted to produce words randomly
selected from two different sets: a set of training words (in which
adaptation was induced) and a set of testing words (in which carryover
of the training word adaptation was measured). Test and training words
were interspersed with one another throughout the experiment. However,
only when the subject produced training words was he exposed to the
altered feedback. The training words were all bilabial
consonant-vowel-consonants (CVC) with [Epsilon] as the vowel ("pep,"
"peb," "bep," and "beb") and the subject produced them while hearing
either feedback of his whispering or masking noise that blocked his
auditory feedback ( 11).
The set of testing words was divided into two subsets, each designed to
assess different types of carryover of the training word adaptation.
Three of the testing words--"peg," "gep," and "teg"--were included to
determine if the adaptation of [Epsilon] in the bilabial training word
context carried over to [Epsilon] in different word contexts. The
remaining testing words--"pip," "peep," "pap," and "pop"--were included
to determine if the adaptation of [Epsilon] caused similar production
changes in other vowels.
Eight
male Massachusetts Institute of Technology (MIT) students participated
in the study. All were native speakers of North American English and
all were naive to the purpose of the study ( 12).
Each was run in the adaptation experiment and also in a control
experiment that was identical to the adaptation experiment except that
no feedback perturbations were introduced.
Figure
2 shows the feedback transformations and resulting compensation and
adaptation for a single subject. The diamonds show mean formant
positions of the subject's productions of the vowels [i], [l],
[Epsilon], [ae], and [a], as measured in a pretest procedure several
days before the actual adaptation experiment. Formants were shifted
along the path linking the positions of these vowels (dotted line) ( 13).
Formants were shifted in one direction along this path for half the
subjects; they were shifted in the opposite direction for the other
subjects. The formant shifts were large enough that if the subject
produced [Epsilon], he heard either [i] or [a], depending on the
direction of shift.
For the
subject in Fig. 2, formants were shifted toward [i]. Formants were
shifted in proportion to the spacing between vowels on the path: If the
subject produced [Epsilon] his formants were shifted so he heard [i];
if he produced [--] he heard [ae]; and if he produced [l] he heard
[Epsilon]. Position B (Fig. 2) corresponds to the mean vowel formants
for the training words produced by the subject in the baseline phase of
the adaptation experiment. B' shows the formants presented to the
subject as a result of the altered feedback.
The
arrow labeled "compensation" is the subject's compensation to the
altered feedback: The arrow shows that, in response to hearing B as B',
the subject has, by the test phase of the experiment, changed his
production of B to T. The arrow labeled "altered feedback" shows that
the altered feedback causes the subject to hear the production change
as a shift from B' to T'. The arrow shows that, by the experiment's test
phase, the subject now hears his formants at T', which are close to the
baseline, B. The subject has thus compensated for the altered feedback.
The arrow labeled "adaptation" shows how much of the compensation is
retained when the feedback is blocked by noise (in this case, about 72%
is retained).
The analysis of mean compensation and mean adaptation across subjects is shown in Fig. 3 ( 14). The figure shows that the majority of subjects significantly compensated (P < 0.006) and adapted (P < 0.023) ( 15).
The figure also shows other features commonly seen in adaptation
experiments in the reaching domain: compensation varies across
subjects, each subject compensates more than he adapts, and subjects
who tend to compensate more also tend to adapt more.
Figure
4 shows mean generalization for the test words--a ratio expressing the
fraction of the adaptation of [Epsilon] in the training words that
carried over to the vowel production in a testing word ( 16).
Adaptation to the training set affected the production of the vowels in
test words containing the same vowel but in different consonant
contexts (Fig. 4A). Overall, there is significant generalization of the
training word adaptation to these test words (P < 0.040) ( 17).
However, the apparently greater mean generalization to "peg" than to
"gep" and "teg" is not statistically significant. This lack of
significance is traceable to coarticulatory influences that caused
imperfect estimates of steady-state vowel formants of [Epsilon] in
"gep" and "teg".
Adaptation to the training set affected the production of the vowels in words containing different vowels (Fig. 4B) ( 18).
Again, there is overall significant generalization of the training word
adaptation to these test words (P < 0.013), but again, the apparent
differences in mean generalization among the words are not
statistically significant.
In
summary, our experimental results show that control of the production
of vowels adapts to perturbations of auditory feedback. This adaptation
is analogous to the adaptation seen in the control of reaching.
Moreover, the generalization observed for [Epsilon] in the testing
words provides direct evidence that the testing and the training words
share a common representation of the production of [Epsilon]; it is of
course natural to hypothesize that this common representation is the
phoneme [Epsilon]. Finally, the significant generalization to "pip" and
"pap" considered together shows that the adaptation of a vowel can
spread not only across contexts but also to other vowels. This suggests
that the control process underlying the production of the trained vowel
is partially shared in the control of the productions of other vowels;
moreover, it is natural to attempt to identify these control structures
with the featural decompositions studied in phonology.
(*) To whom correspondence should be addressed. E-mail: houde@phy.ucsf.edu
[A]
Present address: University of California San Francisco, Keck Center,
513 Parnassus Avenue, S-877, San Francisco, CA 94143-0732, USA.
DIAGRAM:
Fig. 1. The apparatus used in the experiments. CVC words were prompted
on the personal computer (PC) video monitor. Subjects were instructed
to whisper the word; we used whispered speech to minimize the effects
of bone conduction which are strong in voiced speech. While the subject
whispered, the speech signal was picked up by a microphone and sent to
a digital signal processing (DSP) board in the PC. The DSP board
processed successive intervals of the subject's speech into
synthesized, formant-altered feedback with only a 16-ms processing
delay [such a delay is nondisruptive; see reference to DAF in ( 2)].
Each interval was first analyzed into a 64-channel, 4 kHz-wide
magnitude spectrum from which formants (which are generally peaks in
the spectrum) were estimated (all graphs are schematic plots of
magnitude versus frequency). The frequencies of the three lowest
frequency formants (F1, F2, and F3) were then shifted to implement a
desired feedback alteration (as explained below). The shifted formants
were then used to synthesize formant-altered whispered speech. This
synthesized speech was fed back to the subject via earphones at
sufficient volume that he essentially heard only the synthesized
feedback of his whispering.
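As a rough illustration of the analysis stage in this caption, the sketch below picks formant candidates as local maxima of a 64-channel magnitude spectrum spanning 4 kHz. This is naive peak picking, not the estimator used on the DSP board, which the caption does not specify. The uniform channel spacing, the 0-to-4-kHz range, and the function name `estimate_formants` are assumptions.

```python
CHANNELS = 64
BANDWIDTH_HZ = 4000.0
# ASSUMPTION: channels are uniformly spaced and cover 0 to 4 kHz.
HZ_PER_CHANNEL = BANDWIDTH_HZ / CHANNELS  # 62.5 Hz per channel


def estimate_formants(magnitudes, count=3):
    """Return center frequencies (Hz) of the lowest-frequency local maxima.

    A channel counts as a peak if it is strictly greater than its lower
    neighbor and at least as large as its upper neighbor. At most `count`
    peaks are returned, lowest frequency first.
    """
    if len(magnitudes) != CHANNELS:
        raise ValueError("expected a 64-channel magnitude spectrum")
    peaks = []
    for k in range(1, CHANNELS - 1):
        if magnitudes[k] > magnitudes[k - 1] and magnitudes[k] >= magnitudes[k + 1]:
            peaks.append((k + 0.5) * HZ_PER_CHANNEL)
    return peaks[:count]


if __name__ == "__main__":
    # Hypothetical spectrum: flat floor with four isolated peaks.
    spectrum = [1.0] * CHANNELS
    for channel, level in ((8, 5.0), (28, 4.0), (40, 3.0), (55, 2.0)):
        spectrum[channel] = level
    print(estimate_formants(spectrum))
```

A practical estimator would need to handle noise and merged peaks, which is one reason the caption says formants are only "generally" peaks in the spectrum.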
GRAPH: Fig. 2. Altered feedback and resulting compensation and adaptation for a single subject (subject OB).
GRAPHS:
Fig. 3. Mean compensation (top) and adaptation (bottom) for all
subjects (designated CW through AH) in the adaptation (black bars) and
control (white bars) experiments.
GRAPHS:
Fig. 4. Mean generalization for the analyzable testing words in the
experiment. Shown are (A) words with the same vowel ([Epsilon]) used in
the training words, but different consonants; and (B) words with
different vowels.
REFERENCES AND NOTES
(1.)
H. V. Helmholtz, Treatise on Physiological Optics, vol. 3 (1867)
(Optical Society of America, Rochester, NY, 1925); G. M. Stratton,
Psychol. Rev. 3, 611 (1896); I. Kohler, Acta Psychol. 11, 176 (1955);
R. Held, J. Nerv. Ment. Dis. 132, 26 (1961); for a review, see R. B.
Welch, Perceptual Modification: Adapting to Altered Sensory
Environments (Academic Press, New York, 1978).
(2.)
The studies reported in W. E. Cooper [Speech Perception and Production
(Ablex Publishing, Norwood, NJ, 1979)] showed the interdependence of
speech perception and production: repetitive hearing of voiceless
consonants decreases the perceived voice-onset time (VOT) of test
stimuli and also decreases the VOT of produced consonants. Masking
noise feedback increases speech volume [E. Lombard, Ann. Maladies
Oreille Larynx Nez Pharynx 37, 101 (1911), as cited in H. Lane and B.
Tranel, J. Speech Hear. Res. 14, 677 (1971) ]. Investigations of
delayed auditory feedback (DAF) show that delays of 30 ms can disrupt
speech [B. S. Lee, J. Acoust. Soc. Am. 22, 639 (1950); see A. J. Yates,
Psychol. Bull. 60, 213 (1963) for a review]. Frequency translations of
the spectrum of the auditory feedback have also been shown to affect
speech [V. L. Gracco, et al., J. Acoust. Soc. Am. 95, 2821 (1994)].
Recent investigations of pitch perturbations have shown adaptive
responses by speakers to alterations in their pitch frequency [H.
Kawahara, ibid. 94, 1883 (1993)].
(3.)
C. W. Eriksen, M. D. Pollack, W. E. Montague, J. Exp. Psychol. 84, 502
(1970); S. T. Klapp, W. G. Anderson, R. W. Berrian, ibid. 100, 368
(1973); S. Sternberg, S. Monsell, R. L. Knoll, C. E. Wright, in
Information Processing in Motor Control and Learning, G. E. Stelmach,
Ed. (Academic Press, New York, 1978), pp. 117-152.
(4.)
R. Wells, Yale Sci. 26, 9 (1951); S. Shattuck-Hufnagel, in Sentence
Processing: Psycholinguistic Studies Presented to Merrill Garrett, W.
E. Cooper and E. C. T. Walker, Eds. (Erlbaum, Hillsdale, NJ, 1979), pp.
295-342; F. Ferreira, J. Mem. Lang. 30, 210 (1991); A. S. Meyer, ibid.,
p. 69; G. S. Dell and P. G. O'Seaghdha, Cognition 42, 287 (1992).
(5.)
N. Chomsky and M. Halle, The Sound Pattern of English (MIT Press,
Cambridge, MA, 1968); G. N. Clements, Phonol. Yearb. 2, 225 (1985).
(6.)
For reviews of these issues, see W. J. M. Levelt [Speaking: From
Intention to Articulation (MIT Press, Cambridge, MA, 1989)], A. S.
Meyer [Cognition 42, 181 (1992)], and R. A. Mowrey and I. R. A. McKay
[J. Acoust. Soc. Am. 88, 1299 (1990)].
(7.)
As Fig. 1 shows, we actually perturbed speech sounds in the three
dimensions F1, F2, and F3. However, because F3 shows small variation
across the vowel sounds we studied, our perturbations acted principally
on only F1 and F2.
(8.)
Given that subjects show substantial variation in the location of their
vowels within this space, we collected baseline data for each subject
that allowed us to tailor the transformations to individual subjects.
(9.)
Details on the implementation of the feedback transformations and
methods of data analysis are provided below and in J. F. Houde, thesis,
Massachusetts Institute of Technology, Cambridge, MA (1997).
(10.)
The gradual introduction of the feedback perturbation was intended to
reduce a subject's awareness of it. Indeed, postexperiment interviews
revealed that all subjects claimed to be unaware that their feedback
was altered during the experiment.
(11.)
A sound-pressure level of 60 dB was sufficient to block subjects' ability to hear their own whispering.
(12.)
Informed consent was obtained from all subjects after the nature and possible consequences of the study were explained.
(13.)
Feedback transformations were defined geometrically with respect to a
subject's [i]-[a] path. The subject's unaltered formant frequencies
were represented as a point in formant space. This point was then
rerepresented in terms of two measures: (i) path deviation--the
distance to the nearest point on the [i]-[a] path, and (ii) path
projection--the position on the [i]-[a] path of this nearest point. The
feedback transformation then shifted only the point's path projection;
the point's path deviation was preserved.
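The geometric definition in this note can be sketched in code, under several assumptions: the [i]-[a] path is treated as a piecewise-linear curve through the vowel positions, path projection is measured as arc length along it, and path deviation is kept signed so that the shifted point can be placed on the same side of the path. The function names and the formant values in the example are hypothetical, not data from the study.

```python
import math


def _cumulative_lengths(path):
    """Arc length from path[0] to each vertex (vertices must be distinct)."""
    lengths = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        lengths.append(lengths[-1] + math.hypot(x1 - x0, y1 - y0))
    return lengths


def decompose(point, path):
    """Return (path projection, signed path deviation) of a formant point."""
    px, py = point
    cum = _cumulative_lengths(path)
    best = None
    for i, ((x0, y0), (x1, y1)) in enumerate(zip(path, path[1:])):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        # Parameter of the nearest point on this segment, clamped to its ends.
        t = ((px - x0) * dx + (py - y0) * dy) / (seg_len ** 2)
        t = max(0.0, min(1.0, t))
        dist = math.hypot(px - (x0 + t * dx), py - (y0 + t * dy))
        if best is None or dist < best[0]:
            # Positive deviation: the point lies to the left of the path direction.
            cross = dx * (py - y0) - dy * (px - x0)
            sign = 1.0 if cross >= 0 else -1.0
            best = (dist, cum[i] + t * seg_len, sign * dist)
    return best[1], best[2]


def recompose(projection, deviation, path):
    """Place a point at a given arc length, offset along the local normal."""
    cum = _cumulative_lengths(path)
    projection = max(0.0, min(cum[-1], projection))
    for i, ((x0, y0), (x1, y1)) in enumerate(zip(path, path[1:])):
        if projection <= cum[i + 1]:
            dx, dy = x1 - x0, y1 - y0
            seg_len = math.hypot(dx, dy)
            t = (projection - cum[i]) / seg_len
            ux, uy = dx / seg_len, dy / seg_len
            # The left-hand unit normal of direction (ux, uy) is (-uy, ux).
            return (x0 + t * dx - deviation * uy, y0 + t * dy + deviation * ux)


def shift_feedback(point, path, shift):
    """Shift only the path projection; the path deviation is preserved."""
    projection, deviation = decompose(point, path)
    return recompose(projection + shift, deviation, path)


if __name__ == "__main__":
    # Hypothetical (F1, F2) vowel positions in Hz along an [i]-[a] path.
    vowel_path = [(300.0, 2300.0), (450.0, 2000.0), (600.0, 1800.0),
                  (750.0, 1700.0), (850.0, 1200.0)]
    produced = (610.0, 1850.0)
    print(decompose(produced, vowel_path))
    print(shift_feedback(produced, vowel_path, -300.0))
```

The sketch is exact only when the nearest path point lies in the interior of a segment; points whose nearest path point is a vertex are handled approximately. Whether the shift was specified in hertz of arc length or in units of vowel spacing is not stated in this note.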
(14.)
Mean compensation measures how much a subject's mean training word
vowel formant change (test phase - baseline) countered the shift of the
feedback transformation. It was measured as: (path projection of mean
vowel formant change)/(-path projection shift of transform) [see (13)
for explanation of path projection]. This ratio is 1.0 for perfect
compensation. Mean adaptation measures how much compensation was
retained in the absence of feedback. Thus, mean adaptation was
calculated with the same ratio shown above, except it used only formant
data collected when the subject whispered with feedback blocked by
noise. (In the control experiment, because feedback was not altered,
mean compensation and adaptation for each subject were calculated with
respect to the feedback alteration used in the adaptation experiment.)
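The ratio defined in this note can be written out as a short function. This is a minimal sketch: inputs are assumed to be scalar path projections already averaged over the relevant trials, and the function name and the numbers in the example are hypothetical.

```python
def mean_compensation(baseline_projection, test_projection, transform_shift):
    """(path projection of mean vowel formant change) / (-shift of transform).

    Returns 1.0 for perfect compensation and 0.0 for no change in production.
    Mean adaptation uses the same ratio, computed from trials in which
    feedback was blocked by masking noise.
    """
    if transform_shift == 0:
        raise ValueError("the ratio is undefined for a zero feedback shift")
    change = test_projection - baseline_projection
    return change / (-transform_shift)


if __name__ == "__main__":
    # Hypothetical values: feedback shifted by +200 along the path.
    # Trials with feedback: production moved by -100 against the shift.
    print(mean_compensation(500.0, 400.0, 200.0))
    # Trials with masking noise: production moved by -60.
    print(mean_compensation(500.0, 440.0, 200.0))
```

The sign convention makes a production change opposite to the feedback shift count as positive compensation.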
(15.)
Analysis-of-variance tests of path projection changes (test phase -
baseline) across subjects in the adaptation and control experiments
were computed from formant data collected when subjects whispered while
hearing feedback (for the compensation analysis) or while hearing was
blocked by masking noise (for the adaptation analysis). The interaction
of experiment type (adaptation versus control) and path projection
changes was used to judge significance.
(16.)
For a given test word, mean generalization was computed as: (mean test
word relative adaptation)/(mean training word relative adaptation),
where relative adaptation was computed by subtracting adaptation seen
in the control experiment from that seen in the adaptation experiment.
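The generalization measure in this note amounts to a ratio of control-corrected adaptations. The sketch below assumes each adaptation value is already a per-word mean; the function names and example values are hypothetical.

```python
def relative_adaptation(adaptation_experiment, control_experiment):
    """Adaptation in the adaptation experiment minus that seen in the control."""
    return adaptation_experiment - control_experiment


def mean_generalization(test_adapt, test_control, train_adapt, train_control):
    """(test word relative adaptation) / (training word relative adaptation)."""
    denominator = relative_adaptation(train_adapt, train_control)
    if denominator == 0:
        raise ValueError("training words show no relative adaptation")
    return relative_adaptation(test_adapt, test_control) / denominator


if __name__ == "__main__":
    # Hypothetical values: a testing word carries over part of the
    # training word adaptation.
    print(mean_generalization(0.25, 0.05, 0.45, 0.05))
```

A value of 1.0 would mean that the testing word shows as much relative adaptation as the training words; a value of 0.0 would mean no carryover.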
(17.)
Tests of significant generalization were based on computing the
significance of test word adaptations, which were computed the same way
as the training word adaptation significance tests described in (15).
(18.)
We had technical problems estimating the formants of whispered [i] and
[a]; thus, productions of "peep" and "pop" were excluded from our
results.
(19.)
We thank J. Perkell, K. Stevens, R. Held, and P. Sabes for helpful discussions.
15 September 1997; accepted 29 December 1997
By
John F. Houde(*)[A] and Michael I. Jordan, Department of Brain and
Cognitive Sciences, Massachusetts Institute of Technology, Cambridge,
MA 02139, USA.