Follow the setup instructions given here, i.e. download R and RStudio, create a directory on your computer where you will store files on this course, make a note of the directory path, create an R project that accesses this directory, and install all indicated packages.
For this and subsequent tutorials, access the tidyverse
,magrittr
, emuR
, and wrassp
libraries:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'emuR'
##
## The following object is masked from 'package:base':
##
## norm
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
The following makes use of an extract of read speech from the Kiel corpus.
Either download the zip archive, unzip the archive and move kielread_emuDB
to the folder emu_databases
or execute the following code:
a = "https://www.phonetik.uni-muenchen.de/~jmh/"
b = "lehre/Rdf/kielread_emuDB.zip"
down_path = paste(a, b, sep="/")
download.file(down_path,
file.path(targetDir, "kielread.zip"))
unzip(file.path(targetDir, "kielread.zip"), exdir = targetDir)
# Load the database
kielread_DB = load_emuDB(file.path(targetDir, "kielread_emuDB"))
The discrete cosine transformation (DCT) is used in speech research for quantifying the shape of one-dimensional signals. Typically these signals are parameters such as the fundamental frequency, RMS, or any formant frequency as a function of time. As will become clear in the next module, the DCT can also be used for parameterising the shapes of spectra.
In DCT analysis, any one-dimensional signal is decomposed into a set of 1/2 cycle cosine waves of increasing frequency \((k = 0, 0.5, 1, 1.5...)\) cycles, such that if these cosine waves are summed, the original signal is exactly reconstructed (this summation forms part of DCT synthesis which is discussed below). The output of the dct()
function of the emuR
package is the amplitude of the cosine waves at these frequencies which are also known as DCT coefficients. The number of DCT coefficients returned by the emuR
function dct()
is always the same as the signal length. For example, here is a signal with 8 values:
## [1] 8
The dct()
function applied to this signal also gives 8 values:
## [1] 2.8284271 4.7152829 3.2471766 -12.7283981 -7.0710678 -0.6144048
## [7] 7.8393778 -3.5883835
which are DCT coefficients 0 to 7 also sometimes known as DCT-0, DCT-1, DCT-2...DCT-7
or \(k_0\), \(k_1\), \(k_2\)…\(k_7\) respectively. As already stated, the above values are the amplitudes of cosine waves of increasing 1/2 cycle frequencies. The lowest few DCT coefficients are especially useful in the analysis of speech because they provide important information about the signal’s shape. More specifically:
This can be illustrated further with the data frame bridge.df
that contains a hypothetical count of the number of cars crossing a bridge between midnight and midday. A plot of the number of cars as a function of time separately for the three days is:
bridge.long.df =
bridge.df %>%
# for the columns of the days of the week
pivot_longer(cols = c("Mon", "Tue", "Wed"),
# make a new column called day
names_to = "day",
# make a new column of the no. of cars
values_to = "count")
bridge.long.df %>%
ggplot +
# plot the number of cars as a function of time
aes(y = count, x = times) +
geom_point() +
geom_line() +
xlab("time of day") +
ylab("number of cars")+
# separately per day
facet_wrap(~day)
The figure shows:
## # A tibble: 3 × 2
## day `mean(count)`
## <chr> <dbl>
## 1 Mon 38.5
## 2 Tue 38.5
## 3 Wed 38.5
lm()
to fit a straight line separately to each day:bridge.long.df %>%
ggplot +
aes(y = count, x = times) +
geom_point() +
geom_line() +
# fit linear regression
geom_smooth(method="lm", formula = y ~ x) +
facet_wrap(~day)
bridge.long.df %>%
ggplot +
aes(y = count, x = times) +
geom_point() +
geom_line() +
# fit a parabola
geom_smooth(method="lm", formula = y ~ poly(x, 2)) +
facet_wrap(~day)
The above plot suggests that the parabolic fit is best for the Wednesday data.
bridge.long.df %>%
ggplot +
aes(y = count, x = times) +
geom_point() +
geom_line() +
# fit a cubic
geom_smooth(method="lm", formula = y ~ poly(x, 3)) +
facet_wrap(~day)
it seems that the cubic fit to Monday and Tuesday is about the same and better than for the Wednesday data.
Summarising across the above, it can be expected that:
The calculation of the above coefficients using the dct()
function applied to the separate columns (i.e. separately to Mon, Tue, Wed in bridge.df
) confirms the above predictions.
bridge.df %>%
# calculate DCT coeffs. k0...k3 separately per day
# the 2nd argument is 3 to calculate k0...k3
summarise(Mon = dct(Mon, 3),
Tue = dct(Tue, 3),
Wed = dct(Wed, 3)) %>%
# prepend a column to show the DCT coeff. number
add_column(k = 0:3, .before = "Mon") %>%
# round to 2 dec. places for ease of viewing
round(., 2) %>%
ungroup()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## k Mon Tue Wed
## 1 0 54.39 54.39 54.39
## 2 1 29.54 -29.54 0.07
## 3 2 -26.13 -26.13 -54.71
## 4 3 -24.65 24.65 0.93
As already stated, if all the cosine waves into which a signal is decomposed were summed, then the original signal would be exactly reconstructed. The summation of the cosine waves associated with DCT coefficients is known as DCT synthesis (or also as an inverse DCT).
The same function dct()
but with the additional argument fit=T
applies first DCT analysis followed by DCT synthesis: that is, with the argument fit=T
the dct()
function first carries out DCT analysis (a decomposition into cosine waves as described above) followed immediately by DCT synthesis in which the corresponding cosine waves are summed. Omitting the number of coefficients, m
, causes the signal to be decomposed into a maximum number of N cosine waves (where N is the length of the signal to which the DCT has been applied) which are then summed. This summation of the maximum number of cosine waves is given by dct(x, fit=T)
without the second argument m
. As shown here, this summation of all the cosine waves into which a signal is decomposed just returns the same signal to which the dct has been applied:
## [1] -4 10 25 2 -20 1 -8 10
## [1] -4 10 25 2 -20 1 -8 10
But if only the first m
cosine waves are summed, then a smoother version of the signal is reconstructed. For example, for m=3
, the signal is reconstructed from cosine waves at frequencies 0, 0.5, 1, 1.5 cycles: that is, the resulting signal is the summation of a flat line, a sloping line, a parabola, and a cubic. In the following, the signals from the bridge
data are smoothed with m=3
:
bridgesmooth.df =
bridge.df %>%
mutate(Mon.smooth = dct(Mon, m = 3,fit=T),
Tue.smooth = dct(Tue, m = 3,fit=T),
Wed.smooth = dct(Wed, m = 3,fit=T))
Alternatively and equivalently:
bridgesmooth.df =
bridge.df %>%
mutate(across(Mon:Wed, ~
dct(.x, m = 3, fit=T),
.names = "{.col}.smooth"))
Here is a plot of the raw (the points) and smooth (the blue trajectory) data for Monday:
bridgesmooth.df %>%
ggplot +
aes(times) +
# raw data
geom_point(aes(y = Mon)) +
# dct-smoothed
geom_line(aes(y = Mon.smooth), linewidth=0.9, colour = "turquoise3")
Higher values of m
in the above produce a smoothed version of a signal that is increasingly closer to the original, raw signal.
Here are few more technical details about the DCT. As discussed above, in DCT analysis a signal of length N data points is decomposed into a set of N cosine waves of increasing frequency from 0, 0.5, 1…N-0.5 cycles. The cosine wave at zero cycles is a horizontal line (because all sinusoids at 0 frequency are horizontal lines) and is derived as follows:
# length of signal
N = length(sig.dct)
# a vector for the time axis
n = 0:(N-1)
# the frequencies 0, 0.5, 1...N-0.5 cycles
k = seq(0, by=.5, length=N)
# the cosine wave with frequency 0 cycles
j = 0
k0 = (sig.dct[j+1]/sqrt(2)) * cos(2 * pi * k[j+1] * (n+0.5)/N)
plot(0:(N-1), k0, type="b")
The formula for cosine waves of higher frequencies is the same, but without the constant 1/sqrt(2)
. Thus:
# Cosine wave at 0.5 cycles
j = 1
k1 = (sig.dct[j+1]) * cos(2 * pi * k[j+1] * (n+0.5)/N)
# Cosine wave at 1 cycle
j = 2
k2 = (sig.dct[j+1]) * cos(2 * pi * k[j+1] * (n+0.5)/N)
# Cosine wave at 1.5 cycles
j = 3
k3 = (sig.dct[j+1]) * cos(2 * pi * k[j+1] * (n+0.5)/N)
Any of the above can be plotted as before: e.g., here is the cosine wave at 1.5 cycles:
Adding up all these cosine waves at equal points on the time axis is DCT synthesis (an inverse DCT). Therefore, the following:
## [1] -0.9585961 9.6464424 15.8608556 6.9914251 -8.9914251 -14.3461370
## [7] -3.1611611 10.9585961
gives the same as:
## [1] -0.9585961 9.6464424 15.8608556 6.9914251 -8.9914251 -14.3461370
## [7] -3.1611611 10.9585961
The purpose here is to consider the extent to which the DCT can be used to distinguish between a monophthong and three diphthongs in Standard German based on the shape of F2 as a function of time. The formant data are taken from the same kielread
corpus analysed in earlier modules, i.e. from the productions of a Standard German male and female speaker. These data are accessed below and also time- and speaker-normalised. The commands are given without detailed comment since all this was covered in an earlier module.
# identify the wav files
kiel_wav_paths = list.files(file.path(targetDir, "kielread_emuDB"),
pattern = ".*wav$",
recursive = T,
full.names = T)
# identify the speaker
n = nchar(kiel_wav_paths)
speaker = substring(kiel_wav_paths, n-11, n-9)
# a logical vector identifying the male speaker
temp.m = speaker == "K67"
# run the formant tracker over that speaker's `.wav` files
forest(kiel_wav_paths[temp.m], gender = "m")
# the same for the female speaker but with `gender` set to `f`
temp.f = speaker == "K68"
forest(kiel_wav_paths[temp.f], gender = "f")
# add the formants to the database
add_ssffTrackDefinition(kielread_DB,
"FORMANTS", "fm", "fms")
# query 8 vowel types
v.s = query(kielread_DB, "Phonetic = i:|e:|a:|o:|u:|aI|aU|OY")
# get their formants
v.fm = get_trackdata(kielread_DB, v.s, "FORMANTS")
# time-normalise them to 11 data points
v.fmt = normalize_length(v.fm, N = 11)
# identify the speaker
v.fmt %<>% mutate(speaker = substring(v.fmt$bundle, 1, 3))
# calculate the mean and sd per speaker
# for Lobanov speaker normalisation
mean_sd <- v.fmt %>%
# Identify the vowels
filter(labels %in% c("i:", "e:", "a:", "o:", "u:")) %>%
# For each speaker:
group_by(speaker) %>%
# compute mean & sd of F1 and F2
summarise(m1 = mean(T1),
s1 = sd(T1),
m2 = mean(T2),
s2 = sd(T2)) %>%
ungroup
# over-write v.fmt with `%<>%`
v.fmt %<>%
# 1. Add the mean & sd values from above
left_join(mean_sd, by = "speaker") %>%
# 2. Create two new columns, F1 and F2
# which are computed from the
# raw formant values in columns T1 and T2
mutate(F1 = (T1 - m1) / s1,
F2 = (T2 - m2) / s2)
The concern here is with testing whether the DCT can be used to distinguish between a tense monophthong [a:] and three diphthongs [aI, OY, aU] (in e.g., mein, Häuser, Haus) based on the F2-time shapes.
The first step as always is to look at the data. Here is a plot of F2 as a function of time separately by speaker and vowel category.
v.mean.df = v.fmt %>%
# four vowels
filter(labels %in% c("a:", "aI", "aU", "OY")) %>%
# for each label, speaker and normalised time point
group_by(labels, speaker, times_norm) %>%
# calculate mean F2
summarise(F2 = mean(F2)) %>%
ungroup()
## `summarise()` has grouped output by 'labels', 'speaker'. You can override using
## the `.groups` argument.
F2 aggregate by speaker:
# save plot as plot.F2.raw
v.mean.df %>%
ggplot +
# plot F2 as a functino of normalised time
# colour-coded by vowel type
aes(y = F2, x = times_norm, col=labels, group=labels) +
geom_smooth() +
# separately by speaker
facet_wrap(~speaker)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
From the above:
The DCT coefficients for all the vowels the trackdata object are calculated as follows:
v.dct = v.fmt %>%
# for each segment i.e. sl_rowIdx
# labels and speaker are included in order
# to be able to identify them in the resulting data-frame
group_by(sl_rowIdx, labels, speaker) %>%
# calculate DCT coeffs. k0...k3 separately for F2
# the 2nd argument is 3 to calculate k0...k3
summarise(F2dct = dct(F2, 3)) %>%
# add a column named k showing which DCT-coefficient
mutate(k = 0:3) %>%
ungroup()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'sl_rowIdx', 'labels', 'speaker'. You can
## override using the `.groups` argument.
This is to check that the output is correct, e.g. the DCT coefficients of F2 of the 100th segment:
## [1] 0.49879822 -0.03924975 -0.06469523 -0.01266024
should be the same as:
## [1] 0.49879822 -0.03924975 -0.06469523 -0.01266024
The task is now to plot all data points from the four vowel categories in the plane of \(k_0 \times k_1\). For this purpose, the data-frame is first reorganised to include DCT coefficients as columns:
This in turn means that v.dctwide
should now have the same number of observations (segments) as the segment list v.s
from which the formant data were originally derived (i.e. one set of coefficients per vowel):
## [1] TRUE
The plot in the \(k_0 \times k_1\) space is as follows:
v.dctwide %>%
filter(labels %in% c("a:", "aI", "OY", "aU")) %>%
ggplot +
aes(y = `0`, x = `1`, col = labels) +
geom_point() +
facet_wrap(~speaker) +
xlab("k1") +
ylab("k0")
This above plot shows a reasonably good separation between the vowel types in accordance with the predictions. Here is the same plot but with 95% confidence ellipses and with the data pooled across the two speakers:
v.dctwide %>%
filter(labels %in% c("a:", "aI", "OY", "aU")) %>%
ggplot +
aes(y = `0`, x = `1`, col = labels) +
geom_point() +
# facet_wrap(~speaker) +
stat_ellipse() +
xlab("k1") +
ylab("k0")
The task here is to show how DCT-smoothed formant trajectories can be plotted as a function of time. The following calculates DCT-smoothed trajectories up to \(k_3\) for every vowel token:
This shows DCT-smoothed F2 (the blue line) for e.g. the 100th segment (which is an [e:]). The raw values are the black data points:
# choose segment number
j = 100
# the label
lab = v.fmt %>%
filter(sl_rowIdx == j) %>%
pull(labels) %>% unique
v.fmt %>%
filter(sl_rowIdx == j) %>%
ggplot +
aes(times_norm) +
# raw data
geom_point(aes(y = F2)) +
# dct-smoothed
geom_line(aes(y = F2.smooth),
linewidth=0.9, colour = "turquoise3") +
xlab("Normalised (proportional) time") +
ylab("F2 (Standard deviations)") +
ggtitle(lab)
The following is an ensemble plot of two diphthongs showing the raw F2 in the first row and DCT-smoothed F2 below that:
# the raw data
plot.raw =
v.fmt %>%
filter(labels %in% c("aI", "aU")) %>%
ggplot +
# a plot of all the data
aes(y = F2, x = times_norm,
col=labels, group_by=interaction(labels, sl_rowIdx)) +
geom_line() +
facet_wrap(~speaker) +
# remove legend
theme(legend.position="none") +
xlab("") +
ylab("F2 standard deviations") +
ggtitle("Raw")
# the DCT-smoothed daata
plot.smooth =
v.fmt %>%
filter(labels %in% c("aI", "aU")) %>%
ggplot +
# a plot of all the data
aes(y = F2.smooth, x = times_norm, col=labels, group_by=interaction(labels, sl_rowIdx)) +
geom_line() +
facet_wrap(~speaker) +
xlab("Normalised (proportional time)") +
ylab("F2 standard deviations") +
# legend at top and remove legend title
theme(legend.position="top",
legend.title=element_blank()) +
ggtitle("DCT-smoothed")
# raw data in row 1, DCT-smoothed data in row 2
grid.arrange(plot.raw, plot.smooth, nrow = 2)