1 Preliminaries
2 An introduction to the discrete cosine transformation
- 2.1 DCT analysis
- 2.2 DCT synthesis
3 Application of DCT to speech signals
- 3.1 DCT analysis of a monophthong and three diphthongs
- 3.2 DCT smoothing of F2

1 Preliminaries

1.1 General setup

Follow the setup instructions given here, i.e. download R and RStudio, create a directory on your computer where you will store files on this course, make a note of the directory path, create an R project that accesses this directory, and install all indicated packages.

For this and subsequent tutorials, access the tidyverse,magrittr, emuR, and wrassp libraries:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(emuR)

## 
## Attaching package: 'emuR'
## 
## The following object is masked from 'package:base':
## 
##     norm

library(wrassp)
sourceDir = "./testsample"
targetDir = "./emu_databases"

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

1.2 A simple data frame for illustrating the DCT

pfad = "https://www.phonetik.uni-muenchen.de/~jmh/lehre/Rdf/"
df = "bridge.df.txt"
bridge.df = read.table(file.path(pfad, df))

1.3 An excerpt of the Kiel corpus

The following makes use of an extract of read speech from the Kiel corpus.

Either download the zip archive, unzip the archive and move kielread_emuDB to the folder emu_databases or execute the following code:

a = "https://www.phonetik.uni-muenchen.de/~jmh/"
b = "lehre/Rdf/kielread_emuDB.zip"
down_path = paste(a, b, sep="/")
download.file(down_path, 
              file.path(targetDir, "kielread.zip"))
unzip(file.path(targetDir, "kielread.zip"), exdir = targetDir)
# Load the database
kielread_DB = load_emuDB(file.path(targetDir, "kielread_emuDB"))

2 An introduction to the discrete cosine transformation

The discrete cosine transformation (DCT) is used in speech research for quantifying the shape of one-dimensional signals. Typically these signals are parameters such as the fundamental frequency, RMS, or any formant frequency as a function of time. As will become clear in the next module, the DCT can also be used for parameterising the shapes of spectra.

2.1 DCT analysis

In DCT analysis, any one-dimensional signal is decomposed into a set of 1/2 cycle cosine waves of increasing frequency \((k = 0, 0.5, 1, 1.5...)\) cycles, such that if these cosine waves are summed, the original signal is exactly reconstructed (this summation forms part of DCT synthesis which is discussed below). The output of the dct() function of the emuR package is the amplitude of the cosine waves at these frequencies which are also known as DCT coefficients. The number of DCT coefficients returned by the emuR function dct() is always the same as the signal length. For example, here is a signal with 8 values:

sig = c(-4, 10, 25, 2, -20, 1, -8, 10)
length(sig)

## [1] 8

The dct() function applied to this signal also gives 8 values:

sig.dct = dct(sig)
sig.dct

## [1]   2.8284271   4.7152829   3.2471766 -12.7283981  -7.0710678  -0.6144048
## [7]   7.8393778  -3.5883835

which are DCT coefficients 0 to 7 also sometimes known as DCT-0, DCT-1, DCT-2...DCT-7 or \(k_0\), \(k_1\), \(k_2\)…\(k_7\) respectively. As already stated, the above values are the amplitudes of cosine waves of increasing 1/2 cycle frequencies. The lowest few DCT coefficients are especially useful in the analysis of speech because they provide important information about the signal’s shape. More specifically:

\(k_0\) is proportional the signal’s mean.
\(k_1\) is proportional the signal’s linear slope.
\(k_2\) is proportional the signal’s curvature.
\(k_3\) is proportional the signal’s skew.

This can be illustrated further with the data frame bridge.df that contains a hypothetical count of the number of cars crossing a bridge between midnight and midday. A plot of the number of cars as a function of time separately for the three days is:

bridge.long.df =
  bridge.df %>%
# for the columns of the days of the week
  pivot_longer(cols = c("Mon", "Tue", "Wed"), 
               # make a new column called day
               names_to = "day", 
               # make a new column of the no. of cars
               values_to = "count") 
bridge.long.df %>%
  ggplot +
  # plot the number of cars as a function of time
  aes(y = count, x = times) +
  geom_point() +
  geom_line() +
  xlab("time of day") +
  ylab("number of cars")+
  # separately per day
  facet_wrap(~day)

The figure shows:

The values averaged across the time points for each day is about the same. In fact, the mean per day is exactly the same:

bridge.long.df %>%
  group_by(day) %>%
  summarise(mean(count)) %>%
  ungroup()

## # A tibble: 3 × 2
##   day   `mean(count)`
##   <chr>         <dbl>
## 1 Mon            38.5
## 2 Tue            38.5
## 3 Wed            38.5

Across the time points, the linear slope is mostly falling for Monday, mostly rising for Tuesday, and more or less zero for Wednesday. This becomes a bit clearer by re-plotting the data and using lm() to fit a straight line separately to each day:

bridge.long.df %>%
  ggplot +
  aes(y = count, x = times) +
  geom_point() +
  geom_line() +
  # fit linear regression
  geom_smooth(method="lm", formula = y ~ x) + 
  facet_wrap(~day)

The more any of these plots look like a parabola, the greater the curvature. The extent of the curvature can be seen by fitting a polynomial regression of the form \(y = x^2\):

bridge.long.df %>%
  ggplot +
  aes(y = count, x = times) +
  geom_point() +
  geom_line() +
  # fit a parabola
  geom_smooth(method="lm", formula = y ~ poly(x, 2)) + 
  facet_wrap(~day)

The above plot suggests that the parabolic fit is best for the Wednesday data.

Curves that have a large skew are typically well modelled by a cubic curve \(y = x^3\). From the following:

bridge.long.df %>%
  ggplot +
  aes(y = count, x = times) +
  geom_point() +
  geom_line() +
  # fit a cubic
  geom_smooth(method="lm", formula = y ~ poly(x, 3)) + 
  facet_wrap(~day)

it seems that the cubic fit to Monday and Tuesday is about the same and better than for the Wednesday data.

Summarising across the above, it can be expected that:

\(k_0\): should be the same for all three.
\(k_1\) should be opposite in sign for Monday and Tuesday (because the slopes are falling and rising respectively) and close to zero for Wednesday.
\(k_2\) should have the same sign for all days (because the parabola is the same way round in all cases) and should show the greatest difference from zero for Wednesday.
\(k_3\) should show a greater difference from zero for both Monday and Tuesday (because these are more skewed) than for Wednesday; \(k_3\) should have the same value for Monday and Tuesday but be opposite in sign (because the skew is in different directions for the otherwise identical distributions on these days).

The calculation of the above coefficients using the dct() function applied to the separate columns (i.e. separately to Mon, Tue, Wed in bridge.df) confirms the above predictions.

bridge.df %>%
  # calculate DCT coeffs. k0...k3 separately per day
  # the 2nd argument is 3 to calculate k0...k3
  summarise(Mon = dct(Mon, 3),
            Tue = dct(Tue, 3),
            Wed = dct(Wed, 3)) %>%
  # prepend a column to show the DCT coeff. number
   add_column(k = 0:3, .before = "Mon") %>%
  # round to 2 dec. places for ease of viewing
  round(., 2) %>%
  ungroup()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

##   k    Mon    Tue    Wed
## 1 0  54.39  54.39  54.39
## 2 1  29.54 -29.54   0.07
## 3 2 -26.13 -26.13 -54.71
## 4 3 -24.65  24.65   0.93

2.2 DCT synthesis

As already stated, if all the cosine waves into which a signal is decomposed were summed, then the original signal would be exactly reconstructed. The summation of the cosine waves associated with DCT coefficients is known as DCT synthesis (or also as an inverse DCT).

The same function dct() but with the additional argument fit=T applies first DCT analysis followed by DCT synthesis: that is, with the argument fit=T the dct() function first carries out DCT analysis (a decomposition into cosine waves as described above) followed immediately by DCT synthesis in which the corresponding cosine waves are summed. Omitting the number of coefficients, m, causes the signal to be decomposed into a maximum number of N cosine waves (where N is the length of the signal to which the DCT has been applied) which are then summed. This summation of the maximum number of cosine waves is given by dct(x, fit=T) without the second argument m. As shown here, this summation of all the cosine waves into which a signal is decomposed just returns the same signal to which the dct has been applied:

sig

## [1]  -4  10  25   2 -20   1  -8  10

dct(sig, fit=T)

## [1]  -4  10  25   2 -20   1  -8  10

But if only the first m cosine waves are summed, then a smoother version of the signal is reconstructed. For example, for m=3, the signal is reconstructed from cosine waves at frequencies 0, 0.5, 1, 1.5 cycles: that is, the resulting signal is the summation of a flat line, a sloping line, a parabola, and a cubic. In the following, the signals from the bridge data are smoothed with m=3:

bridgesmooth.df =
  bridge.df %>%
  mutate(Mon.smooth = dct(Mon, m = 3,fit=T),
         Tue.smooth = dct(Tue, m = 3,fit=T), 
         Wed.smooth = dct(Wed, m = 3,fit=T))

Alternatively and equivalently:

bridgesmooth.df =
  bridge.df %>%
 mutate(across(Mon:Wed, ~ 
                 dct(.x, m = 3, fit=T), 
               .names = "{.col}.smooth"))

Here is a plot of the raw (the points) and smooth (the blue trajectory) data for Monday:

bridgesmooth.df %>%
  ggplot + 
  aes(times) +
  # raw data
  geom_point(aes(y = Mon)) +
  # dct-smoothed
  geom_line(aes(y = Mon.smooth), linewidth=0.9, colour = "turquoise3")

Higher values of m in the above produce a smoothed version of a signal that is increasingly closer to the original, raw signal.

Here are few more technical details about the DCT. As discussed above, in DCT analysis a signal of length N data points is decomposed into a set of N cosine waves of increasing frequency from 0, 0.5, 1…N-0.5 cycles. The cosine wave at zero cycles is a horizontal line (because all sinusoids at 0 frequency are horizontal lines) and is derived as follows:

# length of signal
N = length(sig.dct)
# a vector for the time axis
n = 0:(N-1)
# the frequencies 0, 0.5, 1...N-0.5 cycles
k = seq(0, by=.5, length=N)
# the cosine wave with frequency 0 cycles
j = 0
k0 = (sig.dct[j+1]/sqrt(2)) * cos(2 * pi * k[j+1] * (n+0.5)/N)
plot(0:(N-1), k0, type="b")

The formula for cosine waves of higher frequencies is the same, but without the constant 1/sqrt(2). Thus:

# Cosine wave at 0.5 cycles
j = 1
k1 =  (sig.dct[j+1]) * cos(2 * pi * k[j+1] * (n+0.5)/N)
# Cosine wave at 1 cycle
j = 2
k2 =  (sig.dct[j+1]) * cos(2 * pi * k[j+1] * (n+0.5)/N)
# Cosine wave at 1.5 cycles
j = 3
k3 =  (sig.dct[j+1]) * cos(2 * pi * k[j+1] * (n+0.5)/N)

Any of the above can be plotted as before: e.g., here is the cosine wave at 1.5 cycles:

plot(0:(N-1), k3, type="b")

Adding up all these cosine waves at equal points on the time axis is DCT synthesis (an inverse DCT). Therefore, the following:

k0 + k1 + k2 + k3

## [1]  -0.9585961   9.6464424  15.8608556   6.9914251  -8.9914251 -14.3461370
## [7]  -3.1611611  10.9585961

gives the same as:

dct(sig, m=3, fit=T)

## [1]  -0.9585961   9.6464424  15.8608556   6.9914251  -8.9914251 -14.3461370
## [7]  -3.1611611  10.9585961

3 Application of DCT to speech signals

The purpose here is to consider the extent to which the DCT can be used to distinguish between a monophthong and three diphthongs in Standard German based on the shape of F2 as a function of time. The formant data are taken from the same kielread corpus analysed in earlier modules, i.e. from the productions of a Standard German male and female speaker. These data are accessed below and also time- and speaker-normalised. The commands are given without detailed comment since all this was covered in an earlier module.

# identify the wav files
kiel_wav_paths = list.files(file.path(targetDir, "kielread_emuDB"), 
                            pattern = ".*wav$", 
                            recursive = T, 
                            full.names = T)
# identify the speaker
n = nchar(kiel_wav_paths)
speaker = substring(kiel_wav_paths, n-11, n-9)
# a logical vector identifying the male speaker
temp.m = speaker == "K67"
# run the formant tracker over that speaker's `.wav` files
forest(kiel_wav_paths[temp.m], gender = "m")
# the same for the female speaker but with `gender` set to `f`
temp.f = speaker == "K68"
forest(kiel_wav_paths[temp.f], gender = "f")
# add the formants to the database
add_ssffTrackDefinition(kielread_DB, 
                        "FORMANTS", "fm", "fms")
# query 8 vowel types
v.s = query(kielread_DB, "Phonetic = i:|e:|a:|o:|u:|aI|aU|OY")
# get their formants
v.fm = get_trackdata(kielread_DB, v.s, "FORMANTS")
# time-normalise them to 11 data points
v.fmt = normalize_length(v.fm, N = 11)
# identify the speaker
v.fmt %<>% mutate(speaker = substring(v.fmt$bundle, 1, 3))
# calculate the mean and sd per speaker
# for Lobanov speaker normalisation
mean_sd <- v.fmt %>% 
  # Identify the vowels
  filter(labels %in% c("i:", "e:", "a:", "o:", "u:")) %>% 
  # For each speaker:
  group_by(speaker) %>% 
  # compute mean & sd of F1 and F2
  summarise(m1 = mean(T1),
            s1 = sd(T1),
            m2 = mean(T2),
            s2 = sd(T2)) %>%
  ungroup

# over-write v.fmt with `%<>%`
v.fmt %<>% 
  # 1. Add the mean & sd values from above
  left_join(mean_sd, by = "speaker") %>%
  # 2. Create two new columns, F1 and F2
  # which are computed from the 
  # raw formant values in columns T1 and T2 
  mutate(F1 = (T1 - m1) / s1,
         F2 = (T2 - m2) / s2)

3.1 DCT analysis of a monophthong and three diphthongs

The concern here is with testing whether the DCT can be used to distinguish between a tense monophthong [a:] and three diphthongs [aI, OY, aU] (in e.g., mein, Häuser, Haus) based on the F2-time shapes.

The first step as always is to look at the data. Here is a plot of F2 as a function of time separately by speaker and vowel category.

v.mean.df = v.fmt %>%
  # four vowels
  filter(labels %in% c("a:", "aI", "aU", "OY")) %>%
  # for each label, speaker and normalised time point
  group_by(labels, speaker, times_norm) %>%
  # calculate mean F2
  summarise(F2 = mean(F2)) %>%
  ungroup()

## `summarise()` has grouped output by 'labels', 'speaker'. You can override using
## the `.groups` argument.

F2 aggregate by speaker:

# save plot as plot.F2.raw
v.mean.df %>%
  ggplot +
  # plot F2 as a functino of normalised time
  # colour-coded by vowel type
  aes(y = F2, x = times_norm, col=labels, group=labels) +
  geom_smooth() +
  # separately by speaker
  facet_wrap(~speaker)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

From the above:

\(k_0\) should distinguish between [aI] and [OY] (since the mean of [aI] across all times points is higher than that of [OY])
\(k_1\) should distinguish [aI, OY] with a rising slope from [aU] with a falling slope and from [a:] with no slope.

The DCT coefficients for all the vowels the trackdata object are calculated as follows:

v.dct = v.fmt %>%
  # for each segment i.e. sl_rowIdx
  # labels and speaker are included in order
  # to be able to identify them in the resulting data-frame
  group_by(sl_rowIdx, labels, speaker) %>%
  # calculate DCT coeffs. k0...k3 separately for F2
  # the 2nd argument is 3 to calculate k0...k3
  summarise(F2dct = dct(F2, 3)) %>%
  # add a column named k showing which DCT-coefficient
  mutate(k = 0:3) %>%
  ungroup()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'sl_rowIdx', 'labels', 'speaker'. You can
## override using the `.groups` argument.

This is to check that the output is correct, e.g. the DCT coefficients of F2 of the 100th segment:

v.fmt %>%
  filter(sl_rowIdx == 100) %>%
  pull(F2) %>% dct(., 3)

## [1]  0.49879822 -0.03924975 -0.06469523 -0.01266024

should be the same as:

v.dct %>%
  filter(sl_rowIdx == 100) %>%
  pull(F2dct)

## [1]  0.49879822 -0.03924975 -0.06469523 -0.01266024

The task is now to plot all data points from the four vowel categories in the plane of \(k_0 \times k_1\). For this purpose, the data-frame is first reorganised to include DCT coefficients as columns:

v.dctwide = 
  v.dct %>%
  pivot_wider(names_from = k, 
              values_from = F2dct)

This in turn means that v.dctwide should now have the same number of observations (segments) as the segment list v.s from which the formant data were originally derived (i.e. one set of coefficients per vowel):

nrow(v.s) == nrow(v.dctwide)

## [1] TRUE

The plot in the \(k_0 \times k_1\) space is as follows:

v.dctwide %>%
  filter(labels %in% c("a:", "aI", "OY", "aU")) %>%
  ggplot +
  aes(y = `0`, x = `1`, col = labels) +
  geom_point() +
  facet_wrap(~speaker) +
  xlab("k1") +
  ylab("k0")

This above plot shows a reasonably good separation between the vowel types in accordance with the predictions. Here is the same plot but with 95% confidence ellipses and with the data pooled across the two speakers:

v.dctwide %>%
  filter(labels %in% c("a:", "aI", "OY", "aU")) %>%
  ggplot +
  aes(y = `0`, x = `1`, col = labels) +
  geom_point() +
  # facet_wrap(~speaker) +
  stat_ellipse() +
  xlab("k1") +
  ylab("k0")

3.2 DCT smoothing of F2

The task here is to show how DCT-smoothed formant trajectories can be plotted as a function of time. The following calculates DCT-smoothed trajectories up to \(k_3\) for every vowel token:

v.fmt  %<>%
  group_by(sl_rowIdx) %>%
  mutate(F2.smooth = dct(F2, m = 3,fit=T)) %>%
  ungroup()

This shows DCT-smoothed F2 (the blue line) for e.g. the 100th segment (which is an [e:]). The raw values are the black data points:

# choose segment number
j = 100
# the label
lab = v.fmt %>%
  filter(sl_rowIdx == j) %>% 
  pull(labels) %>% unique
v.fmt %>%
  filter(sl_rowIdx == j) %>%
  ggplot + 
  aes(times_norm) +
  # raw data
  geom_point(aes(y = F2)) +
  # dct-smoothed
  geom_line(aes(y = F2.smooth), 
            linewidth=0.9, colour = "turquoise3") +
  xlab("Normalised (proportional) time") +
  ylab("F2 (Standard deviations)") +
  ggtitle(lab)

The following is an ensemble plot of two diphthongs showing the raw F2 in the first row and DCT-smoothed F2 below that:

# the raw data
plot.raw =
  v.fmt %>%
  filter(labels %in% c("aI", "aU")) %>%
  ggplot + 
  # a plot of all the data
  aes(y = F2, x = times_norm, 
      col=labels, group_by=interaction(labels, sl_rowIdx)) +
  geom_line() +  
  facet_wrap(~speaker) +
  # remove legend
  theme(legend.position="none") +
  xlab("") + 
  ylab("F2 standard deviations") +
ggtitle("Raw")

# the DCT-smoothed daata
plot.smooth = 
  v.fmt %>%
  filter(labels %in% c("aI", "aU")) %>%
  ggplot + 
  # a plot of all the data
  aes(y = F2.smooth, x = times_norm, col=labels, group_by=interaction(labels, sl_rowIdx)) +
  geom_line() +
  facet_wrap(~speaker) +
  xlab("Normalised (proportional time)") + 
  ylab("F2 standard deviations") +
   # legend at top and remove legend title
  theme(legend.position="top", 
        legend.title=element_blank()) +
ggtitle("DCT-smoothed")
# raw data in row 1, DCT-smoothed data in row 2
grid.arrange(plot.raw, plot.smooth, nrow = 2)

The discrete cosine transformation

Jonathan Harrington & Johanna Cronenberg

2023