X-ray films of speech

This page presents a few x-ray films of short sentences, taken from a much larger corpus of x-ray material. See the end of this web page for more information about the source and history of the recordings.

Why do x-ray films constitute an important speech resource?
For many years (before the increasing availability of fast real-time MRI scans) x-ray filming was the only imaging technique that allowed all the most important speech articulators (jaw, tongue, lips, soft palate, larynx) to be captured in a single view at a frame rate that is reasonably adequate for speech.
However, in most countries it is no longer considered ethically acceptable to make cineradiographic recordings of speech with healthy subjects, unless there is some clear clinical indication for the recordings.
Thus existing x-ray films form an irreplaceable source of information on speech.

The following aids to navigation have been included in the films:
(1) A spectrogram of the utterance has been inserted into each film. A cursor moves along the time axis of the spectrogram to indicate the time point of the film frame currently visible. Linking movies with spectrograms is a big advantage of digital media, since understanding the relationship between speech movements and the resulting acoustics is one of the central issues in phonetics.
(2) A text track has been inserted into each film giving the utterance in both normal orthography and phonetic transcription (the latter using the SAMPA system). Moving brackets mark the word (in the orthography) and the sound (in the phonetic transcription) being produced in the frame currently visible.
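The two navigation aids above come down to simple index arithmetic. The following is a minimal Python sketch of how a cursor position and a bracketed label could be derived from the current frame number. It is not the method actually used to build the films: the function names, the pixel coordinates and the word boundaries are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

def cursor_x(frame, n_frames, x_left, x_right):
    """Pixel column of the spectrogram cursor for a 0-based frame index.

    Assumes the spectrogram spans exactly the duration of the clip,
    so that the first frame maps to x_left and the last to x_right.
    """
    if n_frames < 2:
        return x_left
    fraction = frame / (n_frames - 1)
    return round(x_left + fraction * (x_right - x_left))

def current_label(frame, start_frames, labels):
    """Label (word or sound) whose interval contains `frame`.

    start_frames holds the first frame of each label, in ascending order.
    """
    i = bisect_right(start_frames, frame) - 1
    return labels[max(i, 0)]

# Hypothetical values, NOT measured from the films.
words = ["Is", "that", "biography"]
word_starts = [0, 20, 45]

print(cursor_x(50, 101, 40, 440))             # 240
print(current_label(30, word_starts, words))  # that
```

The same lookup works for the phonetic track if the list of words is replaced by a list of SAMPA symbols with their own start frames.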

Overview of the utterances
Speaker 1 (female)
F1 “Is that biography?”
F2 “Lorraine just left”

Speaker 2 (male)
M1 “It’s ten below outside”
M2 “He’s a lousy singer”

These four clips are shown immediately below, followed by notes on points of interest.

Following these notes there are four additional films from the male speaker (without further comments, and with no text track in the movie):
M3 “They are a nomadic tribe”
M4 “McCarthy was a madman”
M5 “He likes Tom Sawyer the best”
M6 “Tom likes Greco-Roman art”

Hint: Expanding the films to full-screen makes it easier to use the slider to look in detail at the movements.
To make it easier to compare films and refer to the notes, each film can be opened in a separate tab if desired.

F1. “Is that biography?” (/ I z D { t b aI Q g r @ f i /)

Open movie l80_11 in separate tab

F2. “Lorraine just left” (/ l @ r eI n dZ V s t l e f t /)

Open movie l80_28 in separate tab

M1. “It’s ten below outside” (/ I t s t E n b @ l @U aU t s aI d /)

Open movie l77_04 in separate tab

M2. “He’s a lousy singer” (/ h i: z @ l aU z i s I N @` /)

Open movie l77_30 in separate tab
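For readers more used to IPA than to SAMPA, the transcriptions above can be converted mechanically, because the symbols are separated by spaces. The sketch below is a minimal Python illustration. The table covers only a subset of SAMPA (symbols that occur on this page), and the function name is an assumption made for this example, not part of any existing tool.

```python
# Partial SAMPA-to-IPA table: only symbols that differ from their IPA form
# and that occur in the transcriptions on this page.
SAMPA_TO_IPA = {
    "I": "ɪ", "i:": "iː", "E": "ɛ", "{": "æ", "Q": "ɒ", "V": "ʌ", "@": "ə",
    "aI": "aɪ", "eI": "eɪ", "aU": "aʊ", "@U": "əʊ",
    "D": "ð", "N": "ŋ", "dZ": "dʒ",
}

def sampa_to_ipa(transcription):
    """Convert a space-separated SAMPA string (optionally wrapped in /.../) to IPA.

    Symbols missing from the table are passed through unchanged.
    """
    symbols = transcription.strip("/ ").split()
    return "".join(SAMPA_TO_IPA.get(symbol, symbol) for symbol in symbols)

print(sampa_to_ipa("/ l @ r eI n dZ V s t l e f t /"))
print(sampa_to_ipa("/ I t s t E n b @ l @U aU t s aI d /"))
```

Passing unknown symbols through unchanged keeps the plain consonants (t, b, s and so on) intact without listing each of them in the table.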

Notes on the utterances
(1)
The two utterances of the female speaker (F1 and F2) give an illustration of how different the articulation of the ‘same’ sound can be.
Contrast the /r/ sound in “biography” with the /r/ sound in “Lorraine”.
In the former case the /r/ is barely visible; in the latter case there is an extremely clear retroflex articulation (tongue tip curled back).
This also gives a nice idea of the mobility of the tongue tip.

(2)
The sound sequences /tb/ (from “that biography”; F1) and /nb/ (from “ten below”; M1) are two classic cases where considerable temporal overlap in the movements of the articulators may be expected.
It is worth looking at these utterances frame by frame.
In “that biography” the lip closure for /b/ follows very closely after tongue tip closure for /t/.
There is thus a period of simultaneous closure at both tongue and lips. Then the tongue closure is released, followed by release of the lip closure.
In “ten below” there is a similar pattern of movement by tongue and lips.
An additional consideration is the movement of the soft palate. At the point where the lips close, the soft palate is still lowered (it had to be lowered for the /n/, of course), so in effect something like an /m/ is articulated briefly, until the soft palate rises to close the velopharyngeal port and allow air pressure to build up for the /b/.

(3)
A further note on velar timing:
Nasal sounds occur in three utterances (“Lorraine”, F2; “ten”, M1; “singer”, M2).
Generally, lowering of the velum occurs well before the segment that is actually labelled as nasal, usually around the beginning of the vowel preceding the nasal. Movement of the velum during the nasal segment itself is therefore mostly upward, which in a sense is the opposite of what one might expect.

(4)
Utterance M2 illustrates characteristic differences in tongue tip configuration for alveolar consonants.
For the fricatives /z, s/ the constriction is made with the tongue blade (laminal); for the lateral /l/ it is made with the tongue tip (apical). This again shows the mobility of the tongue tip. Note also that although the fricatives and the lateral have nominally the same place of articulation, the position of the jaw is much lower for /l/.

(5)
Also in utterance M2, the last vowel of “singer” illustrates the r-colouring that occurs in North American dialects of English. This is generally considered to be a very unusual vowel, since it involves two simultaneous constrictions in the vocal tract: one in the palatal region (as also found in high vowels like /i/) and one in the pharyngeal region (as in vowels like /a/).

Additional examples (same male speaker)
M3. “They are a nomadic tribe”

Open movie l77_08 in separate tab

M4. “McCarthy was a madman”

Open movie l77_12 in separate tab

M5. “He likes Tom Sawyer the best”

Open movie l77_13 in separate tab

M6. “Tom likes Greco-Roman art”

Open movie l77_14 in separate tab

Historical background
These films have had a tortuous history. They were originally recorded on 35mm film in 1974 by Dr. Claude Rochette at the Departement de Radiologie de l'Hotel Dieu de Quebec, Quebec, Canada.
In the early 1990s they were transferred by Drs. Kevin Munhall (Queen's University, Kingston, Canada), Eric Vatikiotis-Bateson and Yohichi Tohkura (then at ATR Labs, Kyoto, Japan) to analog video disk, as the original films were in danger of disintegration. To do this, the audio track had to be processed to remain in sync with the frame rate of the video disk, which is slower than that of the original films (30 vs. 50 fps). This is why the audio track sounds somewhat strange: when the films are played at 30 fps the pitch is correct but the tempo of the utterance is slower than the original.
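The effect of the frame-rate change on timing can be worked out directly. The short Python sketch below assumes that every original frame was kept one-to-one in the transfer (the 30 and 50 fps figures come from the text above; the one-to-one assumption and the function names are ours). Under that assumption, playback at 30 fps stretches every duration by a factor of 50/30, so time measurements taken from the video must be rescaled to recover the original speech timing.

```python
ORIGINAL_FPS = 50    # frame rate of the original films
VIDEODISK_FPS = 30   # frame rate of the video disk copy

def slowdown_factor(original_fps=ORIGINAL_FPS, playback_fps=VIDEODISK_FPS):
    """How much longer an utterance lasts in playback than in reality."""
    return original_fps / playback_fps

def playback_time(frame, playback_fps=VIDEODISK_FPS):
    """Time in seconds at which a frame appears when played from the video."""
    return frame / playback_fps

def original_time(frame, original_fps=ORIGINAL_FPS):
    """Time in seconds at which the same frame was originally recorded."""
    return frame / original_fps

n_frames = 100  # illustrative sentence length in frames
print(round(slowdown_factor(), 3))        # 1.667
print(original_time(n_frames))            # 2.0
print(round(playback_time(n_frames), 3))  # 3.333
```

In other words, a sentence of 100 frames lasted 2 seconds when spoken but takes about 3.3 seconds in the films as shown.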
In 2000, Phil Hoole of Munich University transferred fourteen out of the total of twenty-five films from the video disk to computer. The digitization was performed without compression, i.e. the full frame rate and resolution were retained. (The assistance of Dr. Marc Batschkus (Multimedia Lerncenter Medizin, Klinikum Grosshadern, Munich) in transferring the films is gratefully acknowledged.) Each film consists of roughly 30 short sentences (typically about 100 frames per sentence). The image area containing the x-ray information is about 400 vertical by 500 horizontal pixels in size. Each sentence in the film has been stored as a separate file. The data is currently stored as MATLAB-compatible files, since we use MATLAB as our main processing and display environment.
However, in order to make the films more generally available, especially for didactic purposes, we converted most of the material first to QuickTime and more recently to MP4 format.
Video processing notes: Noise suppression using an adaptive Wiener filter (mainly to reduce the effect of the grain in the film). Gamma correction to emphasize low-intensity regions. Sorenson codec (for the QuickTime format).
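To make the two image-processing steps more concrete, here is a minimal pure-Python sketch of a local-statistics adaptive Wiener filter followed by gamma correction. The notes above do not give the window size, the noise estimate, the gamma value or the software used, so every parameter here (3x3 window, noise variance taken as the mean local variance, gamma of 0.5) is an assumption for illustration only, and the code is not the processing actually applied to the films.

```python
def local_stats(img, radius):
    """Local mean and variance in a (2*radius+1)^2 window, clipped at borders."""
    h, w = len(img), len(img[0])
    means = [[0.0] * w for _ in range(h)]
    variances = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            m = sum(window) / len(window)
            means[y][x] = m
            variances[y][x] = sum((v - m) ** 2 for v in window) / len(window)
    return means, variances

def adaptive_wiener(img, radius=1, noise_var=None):
    """Smooth strongly where local variance is low, weakly near edges."""
    means, variances = local_stats(img, radius)
    if noise_var is None:
        # Assumed noise estimate: the average of all local variances.
        flat = [v for row in variances for v in row]
        noise_var = sum(flat) / len(flat)
    out = []
    for y, row in enumerate(img):
        out_row = []
        for x, value in enumerate(row):
            var = variances[y][x]
            denom = max(var, noise_var)
            gain = max(var - noise_var, 0.0) / denom if denom > 0 else 0.0
            out_row.append(means[y][x] + gain * (value - means[y][x]))
        out.append(out_row)
    return out

def gamma_correct(img, gamma=0.5, max_value=255.0):
    """Gamma below 1 expands the dark end of the intensity range."""
    return [[max_value * (v / max_value) ** gamma for v in row] for row in img]

# Tiny synthetic image: flat background of 100 with one bright "grain" pixel.
image = [[100.0] * 5 for _ in range(5)]
image[2][2] = 200.0
denoised = adaptive_wiener(image)
enhanced = gamma_correct(denoised)
print(denoised[2][2] < image[2][2])  # True: the isolated outlier is attenuated
```

The filter pulls each pixel towards its local mean by an amount that depends on how far the local variance exceeds the noise estimate, which is why isolated grain is reduced while strong edges (such as the outline of the tongue or jaw) are largely preserved.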

References:
https://www.queensu.ca/psychology/speech-perception-and-production-lab/x-ray-database
Munhall, K.G., Vatikiotis-Bateson, E., & Tohkura, Y. (1994). Manual for the X-ray film database. ATR Technical Report, TR-H-116.
Munhall, K.G., Vatikiotis-Bateson, E., & Tohkura, Y. (1995). X-ray Film database for speech research. Journal of the Acoustical Society of America, 98, 1222-1224.