1 Preliminaries
- 1.1 General setup
- 1.2 An excerpt of the Kiel corpus
2 Fricative spectra
3 DCT coefficients and spectral analysis
4 Mel-scaled DCT coefficients using a filter bank
5 Mel-scaled DCT coefficients without using a filter bank

1 Preliminaries

1.1 General setup

Follow the setup instructions given here, i.e. download R and RStudio, create a directory on your computer where you will store files on this course, make a note of the directory path, create an R project that accesses this directory, and install all indicated packages.

For this and subsequent tutorials, access the tidyverse,magrittr, emuR, and wrassp libraries:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

library(emuR)

## 
## Attaching package: 'emuR'
## 
## The following object is masked from 'package:base':
## 
##     norm

library(wrassp)
sourceDir = "./testsample"
targetDir = "./emu_databases"

# Mel filter bank analysis
if(!any(.packages(all.available = TRUE) == "seewave")) 
  install.packages("seewave")
library(seewave)

## 
## Attaching package: 'seewave'

## The following object is masked from 'package:emuR':
## 
##     mel

## The following object is masked from 'package:lubridate':
## 
##     duration

## The following object is masked from 'package:readr':
## 
##     spec

# function for calculating Mel filter bank
a = "https://www.phonetik.uni-muenchen.de/~jmh/"
b = "lehre/Rdf/MelFilterBank_Emu.txt"
source(paste(a, b, sep="/"))

1.2 An excerpt of the Kiel corpus

The following makes use of an extract of read speech from the Kiel corpus.

Either download the zip archive, unzip the archive and move kielread_emuDB to the folder emu_databases or execute the following code:

a = "https://www.phonetik.uni-muenchen.de/~jmh/"
b = "lehre/Rdf/kielread_emuDB.zip"
down_path = paste(a, b, sep="/")
if (!file.exists(file.path(targetDir, "kielread.zip"))) {
  download.file(down_path, 
                file.path(targetDir, "kielread.zip"))
}
unzip(file.path(targetDir, "kielread.zip"), exdir = targetDir)

# Load the database
kielread_DB = load_emuDB(file.path(targetDir, "kielread_emuDB"))

2 Fricative spectra

The following repeats the calculating of fricative spectra at the temporal midpoint of [s, ʃ, x] from the Kiel read database as described in the previous module.

As discussed in the earlier module, the sampling frequency of the Kiel corpus was $fs = 16000$ Hz which results in spectra being applied with an $N = 512$ point window for a frequency resolution of 40 Hz. The spectra then have $(N/2) + 1 = 257$ spectral components equally spaced between 0 Hz and the Nyquist frequency at $fs/2 = 8000$ Hz. The frequency resolution i.e. interval between spectral components of the spectrum is, as also discussed in that module, $fs/N = 16000/512 = 31.25$ Hz.

# Derive spectra for the database
add_ssffTrackDefinition(kielread_DB,
                        # choose a track name
                        "spectrum",
                        onTheFlyFunctionName = "dftSpectrum")

# Segment list of the three fricatives
fric.s = query(kielread_DB, "Phonetic = s|S|x")
# Spectral data at the fricatives' temporal midpoint
fric.dft = get_trackdata(kielread_DB, 
                         fric.s, 
                         "spectrum", 
                         cut = .5)
# add a column for the speaker
fric.dft %<>% 
  mutate(speaker = substring(fric.dft$bundle, 1, 3))

3 DCT coefficients and spectral analysis

3.1 DCT coefficients of spectrum

The task here is to calculate the DCT of a spectrum. The purpose of doing so is because as discussed in this earlier module the DCT coefficients encode the shape properties of a signal. In this case, the shape is concerned with whether e.g. a spectrum is rising or falling or has most of its information concentrated in an upper or lower part of the spectrum, etc.. Such shape characteristics provide critical information about phonetic quality.

As a reminder, the ensemble plot of all the spectra from these fricatives shows evident shape differences between [s, ʃ, x]. The plot was created as follows:

# convert to long format
fric.dft.long = convert_wideToLong(fric.dft, calcFreqs = T)
fric.dft.long %>% 
  ggplot +
  # plot intensity (track_value) against frequency
  aes(x = freq, 
      y = track_value, 
      group = sl_rowIdx, 
      col = labels) +
  geom_line() +
  facet_wrap(~speaker) +
  xlab("Frequency (Hz)") +
  ylab("Intensity (dB)")

Evidently, [s] has rising spectra while those of [x] are falling; and [ʃ] has most of its energy concentrated in the central part of the spectrum.

Following the analysis from an earlier module these differences suggest that:

[s, x] should differ on DCT1 i.e. $k_1$ given that these fricative types differ predominantly in spectral slope.
[ʃ] vs. [s, x] should differ in curvature, i.e. $k_2$ given that the spectrum of [ʃ] has the greatest resemblance to a parabolic shape over the frequency range in question.

The following calculates for each of the 361 fricative spectra the DCT coefficients $k_0$ through to $k_3$ as separate columns:

dctvals = fric.dft.long %>% 
  group_by(sl_rowIdx, labels, speaker) %>%
  summarise(dctspec = dct(track_value, 3)) %>%
  mutate(k = 0:3) %>%
  pivot_wider(names_from = k, 
              values_from = dctspec) %>%
ungroup()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'sl_rowIdx', 'labels', 'speaker'. You can
## override using the `.groups` argument.

A plot of the fricatives in the plane of $k_1 \times k_2$ shows a very effective distinction for both speakers:

dctvals %>% 
  ggplot + 
  aes(y = `2`, x = `1`, col=labels) +
  geom_point() +
  facet_wrap(~speaker) +
  xlab("k1") +
  ylab("k2")

Notice in partuclar that the absolute values on $k_2$ are much greater for [ʃ] because of its greater curvature; and that [s, x] have opposite sign values on $k_1$ because their spectra are characterised by rising/falling linear slopes respectively.

The DCT-coefficients of a spectrum are also sometimes known as cepstral coefficients.

3.2 DCT-smoothed spectrum

As also discussed in an earlier module, a DCT-smoothed signal can be created by summing just the first few DCT-coefficients into which the signal was decomposed. The following calculates a DCT-smoothed spectrum for each separate fricative spectrum ($n = 361$) by summing the cosine waves corresponding to the DCT coefficients up to $k = 3$.

fric.dft.long %<>% 
  group_by(sl_rowIdx) %>%
  mutate(dctsmooth = dct(track_value, m = 3, fit = T)) %>%
  ungroup()

The following overlays the raw and DCT-smoothed spectrum for the 10th segment:

j = 10
fric.dft.long %>%
  filter(sl_rowIdx == j) %>%
  ggplot +
  aes(x = freq) +
  geom_line(aes(y = track_value)) +
  geom_line(aes(y = dctsmooth), col="red") +
  xlab("Frequency (Hz)") +
  ylab("Intensity (dB)")

An important point to note from this kind of plot is that, whereas the raw spectrum has a very jagged short-term variation, the DCT-smoothed spectrum does not. The short-term variation of the raw spectrum is due to the random noise source created by the turbulent airstream at the fricative’s point of constriction that creates random noise distributed throughout the spectrum. The DCT-smoothed spectrum on the other hand shows mostly the resonances (in increasing detail if DCT smoothing is calculated based on an increasing number of DCT coefficients) of the vocal tract. Thus, as discussed in further detail below with respect to a voiced signal, the DCT essentially separates the components in the spectrum due to the source (in this case the fricative noise) from those due to the vocal tract filter (i.e. vocal tract shape).

3.3 DCT and the separation of the source from the filter

Some spectral data of voiced vowels will be analysed to consider in further detail this source-filter separation when applying a DCT to a voiced signal. The following extracts spectral data at the temporal midpoint for all [ɛ] vowels in the database:

# segment list of [E] vowels
v.e = query(kielread_DB, "Phonetic = E")
# sampled speech data from the first segment, 
# 512 points at the temporal midpoint
v.dft5 = get_trackdata(kielread_DB, 
                       v.e, 
                       "spectrum", 
                       cut = .5)

Next, the DCT coefficients are calculated for all vowels. This time, instead of calculating just the first few DCT coefficients (e.g. up to $k = 3$), all DCT coefficients are derived (which can be done by omitting the second argument m to the dct() function). That is, 257 coefficients $k = 0, 1, 2, ... 256$ are derived in each case (257 because there are 257 spectral components given that the spectra were derived from time signals with window lengths of $N = 512$):

# convert to long format
v.dft5.long = convert_wideToLong(v.dft5, calcFreqs = T)

# calculate DCT coefficients 
v.dct5 = v.dft5.long %>% 
  group_by(sl_rowIdx, labels) %>%
  # store these as the variable A
  summarise(A = dct(track_value)) %>%
  # provide the coefficient number 
  mutate(k = 0:256) %>%
  ungroup()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'sl_rowIdx', 'labels'. You can override
## using the `.groups` argument.

Next, all 257 DCT coefficients from $k = 1$ all the way through to $k = 256$ are plotted for the first vowel ($k = 0$ is excluded because it often has a high value that depends on the overall signal amplitude):

v.dct5 %>%
  # choose the first segment
  filter(sl_rowIdx == 1) %>%
  # exclude the d.c. offset i.e. k = 0
  filter(k != 0) %>%
  ggplot +
  aes(y = A, x = k) +
  geom_line() +
  geom_vline(xintercept = 106, lty = 2, col = "red") +
  xlab("DCT coefficient number") + 
  ylab("Amplitude")

The lower components i.e. around $k = 0$ to $k = 5$ on the left of the plot are due to the overall long-term, global shape of spectrum that are caused by the shape of the vocal tract. Thus in a spectrum of /u/, $k_1$ will be high because of the strong downward linear slope in the spectrum; /a/ might have a high degree of curvature brought about by the proximity of F1 and F2 in the spectrum and therefore a high value of $k_2$. The general point is this: DCT coefficients that say something about the global shape of spectrum and that are also informative about the shape of the vocal tract filter (supralaryngeal vocal tract shape) i.e. about phonetic quality are on the left of the plot at low values of $k$.

In addition, there is a spike at $k = 106$ that has been marked by a red line. This is due to the short-term, saw-tooth like variation in the spectrum that is caused by the fundamental frequency and its harmonics.

There is a direct relationship between this spike and the fundamental frequency.

The 106th DCT component corresponds to a cosine wave with a frequency 53 cycles (because the DCT calculates cosine waves at half-cycle frequencies). This means that the DCT has picked up a component in the spectrum that repeats itself 53 times between 0 and 8000 Hz: this repetition is of course due to the fundamental frequency and its evenly spaced harmonics. The interval between these repetitions must therefore be $8000/53 \approx 151$ Hz which is the estimated fundamental frequency of the speech signal. (Another equivalent way to get the fundamental frequency is with $fs/k$ where $fs$ is the sampling frequency in Hz and $k$ the DCT-component at which the spike occurs: thus for this case, $16000/106 \approx 151$ Hz). To see whether this is so, the following extracts the fundamental frequency for this vowel at the temporal midpoint:

# f0 at temporal midpoint
v.f0 = get_trackdata(kielread_DB,
                     v.e, 
                     cut = .5,
                     onTheFlyFunctionName = "ksvF0")
# f0 at the temporal midpoint for
# segment 1
v.f0 %>%
  filter(sl_rowIdx == 1) %>% pull(T1)

which gives 149 Hz, i.e. a value which is very close the estimated f0 from the 106th DCT coefficient.

Perhaps more importantly, a DCT-smoothed spectrum that excludes the DCT coefficients from $k = 106$ and above should also exclude any effects of the fundamental frequency and its harmonics. Here for example, DCT-smoothed spectra are calculated with coefficients up to $k = 20$, i.e. well below the coefficient that is responsible for the fundamental frequency:

v.dft5.long %<>% 
  group_by(sl_rowIdx) %>%
  mutate(specsmooth = dct(track_value, m = 20, fit = T)) %>%
  ungroup()

The following now overlays the raw and DCT-smoothed spectrum:

j = 10
v.dft5.long %>%
  filter(sl_rowIdx == j) %>%
  ggplot +
  aes(x = freq) +
  geom_line(aes(y = track_value)) +
  geom_line(aes(y = specsmooth), col = "red") +
  xlab("Frequency (Hz)") +
  ylab("Intensity (dB)")

Notice how the DCT-smoothed spectrum shows the formant frequencies far more clearly, precisely because the jagged saw-tooth like effects of the fundamental frequency are not present. The same plot but in the frequency range up to 4 kHz shows this even more clearly:

j = 10
v.dft5.long %>%
  filter(sl_rowIdx == j) %>%
  filter(freq < 4000) %>%
  ggplot +
  aes(x = freq) +
  geom_line(aes(y = track_value)) +
  geom_line(aes(y = specsmooth), col = "red") +
  xlab("Frequency (Hz)") +
  ylab("Intensity (dB)")

The three peaks from left-to-right in the DCT-smoothed spectrum are due to F1-F3 respectively.

4 Mel-scaled DCT coefficients using a filter bank

The mel scale is a warping of the frequency axis to correspond more closely with the way in which frequency is resolved by the human ear. The scale, like the Bark scale, is more or less linear up to 1000 Hz and then quasi-logarithmic thereafter. The relationship between the mel and Hz scales can be seen here (note that 1000 mel = 1000 Hz):

plot(0:8000, mel(0:8000), type="l",
     ylab = "Frequency (mel)", xlab = "Frequency (Hz)")

In much of the speech technology literature, mel-scaled DCT coefficients are derived from summed energy in frequency bands that are spaced at approximately equal intervals in frequency along the mel scale. The function MelFilterBank_Emu() can be used to derive such a filter bank. Its three arguments are:

the sampling frequency fs (set to 16000 Hz by default)
N, the number of points in the speech signal used to calculate spectra (default 512)
m, the number of filters.

The mel filter bank with default arguments can be seen here:

mfilters.df = MelFilterBank_Emu()
mfilters.df %>%
  ggplot +
  aes(y = A, x = freq, col = factor(fbank), group = fbank) + 
  geom_line() +
  ylab("Amplitude") +
  xlab("Frequency (Hz)")

As the plot shows:

the filters all extend between 0 and 1.
they are triangular which means that they extend at an equal frequency rate from 0 to 1 and from 1 to 0.
they overlap with each other such that the left edge of filter $n$ is at the centre frequency of filter $n - 1$ and the right edge of filter $n$ is at the centre frequency of filter $n + 1$ . This can be seen more clearly by looking at three of the filters from $n = 6$ to 8:

mfilters.df %>%
  filter(fbank %in% c(6:8)) %>%
  ggplot +
  aes(y = A, x = freq, col = factor(fbank), group = fbank) + 
  geom_line() +
  ylab("Amplitude") +
  xlab("Frequency (Hz)")

The output of MelFilterBank_Emu() includes:

fbank: an index from 1 to m showing which row belongs to which filter.
k: the frequency in cycles up to the Nyquist frequency. So for N is 512 this extends in integer steps up to k = 257, as length(unique(mfilters.df$k)) shows.
freq: the frequency in Hz of these spectral components. In other words k * fs/N (where in this case fs/N = 16000/512).
Melcen: The centre frequencies of the mel filters in mel. These increase in roughly equal steps as shown by:

# subtract difference in mel between
# successive filters
diff(unique(mfilters.df$Melcen))

## [1] 288.7049 255.2990 249.4526 270.0471 256.5963 260.6272 252.6985 258.8986
## [9] 262.4829

Hzcen: The centre frequencies of the mel filters in Hz
A: the amplitude of the filter from 0 to 1

The filter bank approach works by multiplying a filter with the amplitudes of a spectrum in its frequency range and then summing the result. For example, the third mel filter extends over these frequencies in Hz:

melfreqs_3 = 
  mfilters.df %>% 
  filter(fbank == 3) %>%
  pull(freq)
melfreqs_3

##  [1]  406.25  437.50  468.75  500.00  531.25  562.50  593.75  625.00  656.25
## [10]  687.50  718.75  750.00  781.25  812.50  843.75  875.00  906.25  937.50
## [19]  968.75 1000.00 1031.25

For e.g. the 10th segment in the fricative data, the dB values in this frequency range are:

specdb10 =
fric.dft.long %>%
filter(sl_rowIdx == 10) %>%
  filter(freq %in% melfreqs_3) %>%
  pull(track_value)
specdb10

##  [1] 31.409519 26.510557 31.657766 32.830517 27.674707  4.983821 16.149040
##  [8] 20.581833 23.021132 29.017111 27.256666 22.598461 22.633463 23.081146
## [15] 24.814157 26.193884 23.641523 17.458786 24.660589 16.843718 30.093287

The above dB values have to be multiplied with the amplitude values of the third mel filter which are shown here:

melamps_3 = 
  mfilters.df %>% 
  filter(fbank == 3) %>%
  pull(A)
melamps_3

##  [1] 0.00000000 0.11111111 0.22222222 0.33333333 0.44444444 0.55555556
##  [7] 0.66666667 0.77777778 0.88888889 1.00000000 0.90909091 0.81818182
## [13] 0.72727273 0.63636364 0.54545455 0.45454545 0.36363636 0.27272727
## [19] 0.18181818 0.09090909 0.00000000

The result of this multiplication in then summed. However, multiplying and summing decibel values is inappropriate because decibels are logarithms: that is, the decibels must first be converted back into power values. This can be done as follows:

specpower10 = 10^(specdb10/20)
specpower10

##  [1] 37.194263 21.160594 38.272631 43.805217 24.195543  1.774970  6.418773
##  [8] 10.692805 14.159782 28.239405 23.058620 13.487239 13.541699 14.257957
## [15] 17.406357 20.403008 15.208142  7.463444 17.101313  6.953219 31.964236

The power values can now be multiplied with the amplitude values of the third mel filter; these multiplications are then summed:

summed_output = 
  sum(specpower10 * melamps_3)
summed_output

## [1] 171.6141

The final step is to convert the above back to decibels:

20 * log(summed_output, base=10)

## [1] 44.69106

The above now has to be done for all of the (in this case) 10 mel filters and for each segment.

The first step is to convert decibels to power values in the fricative data to which the mel filter bank is to be applied:

fric.dft.long %<>% 
  mutate(power = 10^(track_value/20))

The next step is to use the left_join() function to join the filter bank with the spectral object. The join is done by frequency of the spectral component. These will uniquely occur in $freq of the spectral object, as long fs and N in building the mel filter bank were set to the same values with which the spectral data was derived (so in this case to 16000 Hz and 512 points respectively).

both.df = left_join(fric.dft.long, 
                    mfilters.df, 
                    by = "freq")

## Warning in left_join(fric.dft.long, mfilters.df, by = "freq"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 6 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

The summed energy values in the mel scale filter bank can now be calculated:

energysum.df = both.df %>% 
  # for each unique filter bank number
  # and for each segment; also include
  # the fricative and speaker labels:
  group_by(fbank, sl_rowIdx, labels, speaker) %>%
  # multiple the power with the amplitude
  # of the filter
  mutate(output = power * A) %>%
  # and sum the output
  summarise(energypower = sum(output)) %>%
  # and convert back to decibels
  mutate(dB = 20 * log(energypower, base=10)) %>%
  ungroup()

## `summarise()` has grouped output by 'fbank', 'sl_rowIdx', 'labels'. You can
## override using the `.groups` argument.

# there should be 3610 observations. This
# is because there are 361 fricative segments
# and 10 summed energy values for each 
# of the mel filters
dim(energysum.df)

## [1] 3610    6

A plot of the summed energy values in the filter banks for all 361 segments can be seen as follows:

energysum.df %>%
  ggplot +
  aes(y = dB, x = fbank, 
      col = labels, group = sl_rowIdx) +
  geom_line() +
  ylab("Intensity (dB)") +
  xlab("Mel filter bank number")

The final step is to calculate the DCT-coefficients of the above data i.e. to apply the dct() function to the mel scaled, filtered output to give mel-scaled dct-coefficients (which are more or less the same as mel-scaled cepstral coefficients):

melcepstra.df = energysum.df %>% 
  # for each segment; also include
  # the fricative label and speaker
  group_by(sl_rowIdx, labels, speaker) %>%
  # calculate DCT coefficients 0 to 3
  summarise(coeffs = dct(dB, 3)) %>%
  # include the coefficient number
  mutate(k = 0:3) %>%
  # arrange the DCT coefficients in columns
  pivot_wider(names_from = k, 
              values_from = coeffs) %>%
ungroup()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'sl_rowIdx', 'labels', 'speaker'. You can
## override using the `.groups` argument.

A plot in the plane of $k_1 \times k_2$ should separate the fricatives for the same reason as before: [s] has rising mel-scaled filter-bank spectra, for [x] they are level or falling, while [ʃ] shows the greatest degree of curvature:

melcepstra.df %>% 
  ggplot + 
  aes(y = `2`, x = `1`, col = labels) +
  geom_point() +
  facet_wrap(~speaker) +
  ylab("k2") + 
  xlab("k1")

5 Mel-scaled DCT coefficients without using a filter bank

It is also possible to obtain mel-scaled DCT coefficients without going to the bother of building a filter bank (and of converting dB to power and back to dB) as above. This approach involves interpolating equal points in amplitude along a mel-warped frequency scale.

The first step is to convert the Hz scale into mel:

fric.dft.long %<>%
  mutate(fmel = mel(freq))

As a result of this, a spectrum can now be plotted on the mel scale. For example, for the first segment:

fric.dft.long %>%
  filter(sl_rowIdx == 1) %>%
  ggplot +
  aes(y = track_value, x = fmel) +
  geom_line() +
  geom_point() +
  ylab("Intensity") +
  xlab("Frequency (mel)")

The DCT cannot be applied to this spectrum because the spectral components are not spaced at equal intervals on the mel frequency scale (or rather applying a DCT to this spectrum would be the same as applying it the spectrum on the Hz scale). Therefore, the way forward is to use linear interpolation to provide 257 equally spaced points on the mel scale. This can be straightforwardly done using the approx() function, thus:

  fric.dft.long %<>%
  group_by(sl_rowIdx) %>%
  # create a mel spectrum with 257 points at
  # equal intervals on the mel scale
  mutate(Ai = approx(fmel, track_value, n=257)$y, 
         fmeli = approx(fmel, track_value, n=257)$x)

The points equally spaced on the mel axis are shown in red. As is apparent they track exactly the same spectrum. The only difference is where they are positioned on the spectrum:

j = 10
fric.dft.long %>%
  filter(sl_rowIdx==10) %>%
  ggplot + 
  coord_cartesian(xlim = c(0, 750)) +
  aes(x = fmel) +
  geom_line(aes(y = track_value)) +
  geom_point(aes(y = track_value, x=fmel), size=4)+
  geom_point(aes(y = Ai, x = fmeli), col="red") +
  ylab("Intensity") +
  xlab("Frequency (mel)")

In the above figure, the large black points are those from the original mel spectrum; the red points are the interpolated ones that are equally spaced in frequency on the mel axis.

The mel-scaled DCT-coefficients can now be calculated in exactly the same way as before, as long as these are applied to the rescaled amplitude values:

melcepstra2.df = fric.dft.long  %>% 
  # for each segment; also include
  # the fricative label and speaker
  group_by(sl_rowIdx, labels, speaker) %>%
  # calculate DCT coefficients 0 to 3
  # for the interpolated points 
  summarise(coeffs = dct(Ai, 3)) %>%
  # include the coefficient number
  mutate(k = 0:3) %>%
  # arrange the DCT coefficients in columns
  pivot_wider(names_from = k, 
              values_from = coeffs) %>%
ungroup()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'sl_rowIdx', 'labels', 'speaker'. You can
## override using the `.groups` argument.

A plot of these mel-scaled coefficients in the plane of $k_1$ × $k_2$ shows a distribution that is almost the same as the one derived with the filter bank method earlier:

melcepstra2.df %>% 
  ggplot + 
  aes(y = `2`, x = `1`, col = labels) +
  geom_point() +
  facet_wrap(~speaker) +
  ylab("k2") + 
  xlab("k1")

An advantage of this second approach is that it is easier to try out DCT parameterisations in certain frequency ranges. For example, suppose the aim is now to calculate DCT coefficients but only above frequencies of 500 Hz which is 607.5 mel (as shown by mel(500)). In this case:

melcepstra3.df = fric.dft.long  %>%
  # select frequencies above 607.5 mel before
  # calculating DCT coefficients
  filter(fmeli > 607.5) %>%
  # for each segment; also include
  # the fricative label and speaker
  group_by(sl_rowIdx, labels, speaker) %>%
  # calculate DCT coefficients 0 to 3
  # for the interpolated points 
  summarise(coeffs = dct(Ai, 3)) %>%
  # include the coefficient number
  mutate(k = 0:3) %>%
  # arrange the DCT coefficients in columns
  pivot_wider(names_from = k, 
              values_from = coeffs) %>%
ungroup()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'sl_rowIdx', 'labels', 'speaker'. You can
## override using the `.groups` argument.

The resulting plot is given by:

melcepstra3.df %>% 
  ggplot + 
  aes(y = `2`, x = `1`, col = labels) +
  geom_point() +
  facet_wrap(~speaker) +
  ylab("k2") + 
  xlab("k1") +
  ggtitle("mel-scaled DCT coefficients derived from spectra > 500 Hz")

The above plot gives perhaps the best separation between the three fricative types so far.

Parameterising spectra with the DCT

Jonathan Harrington & Johanna Cronenberg

2023