Follow the setup instructions given here, i.e. download R and RStudio, create a directory on your computer where you will store files on this course, make a note of the directory path, create an R project that accesses this directory, and install all indicated packages.
For this and subsequent tutorials, access the tidyverse
,magrittr
, emuR
, and wrassp
libraries:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'emuR'
##
## The following object is masked from 'package:base':
##
## norm
The following makes use of an extract of read speech from the Kiel corpus.
Either download the zip archive, unzip the archive and move kielread_emuDB
to the folder emu_databases
or execute the following code:
The discrete Fourier transform (DFT) is essentially the same as the discrete cosine transformation (DCT) with the difference that a time signal is decomposed into a set of sinusoids (rather than a set of cosine waves) with variable phase. Another difference is that the decomposition with the DFT is into whole-cycle (rather than half-cycle) sinusoids. But the overall principles of analysis and synthesis are the same. Thus, a DFT decomposes a time signal of length \(N\) into a set of \(N\) sinusoids with frequencies \(k = 0, 1, 2, ... (N-1)\) cycles in such a way that if these sinuoids are summed, then the original signal is exactly reconstructed.
For various computational reasons, the algorithm for carrying out a DFT is a fast Fourier transform (FFT). A DFT and FFT both give the same output but the FFT accomplishes the transformation more quickly, as long as the number of points, \(N\), in the signal to which the FFT is applied is a power of 2. In this simple example, an FFT is applied to a signal of length \(N = 8\):
## [1] 8
and this returns 8 sinusoids of frequencies \(k = 0,1 ... 7\) cycles whose amplitudes and phases are compactly expressed as a complex or (imaginary) number involving \(i\) which is the square root of minus 1.
## [1] 8
## [1] 16.00000+ 0.00000i 28.02082-33.70711i -41.00000+ 1.00000i
## [4] 3.97918+32.29289i -30.00000+ 0.00000i 3.97918-32.29289i
## [7] -41.00000- 1.00000i 28.02082+33.70711i
The amplitudes and phases of the sinusoids can be obtained with the following functions:
## [1] 16.00000 43.83304 41.01219 32.53713 30.00000 32.53713 41.01219 43.83304
## [1] 0.0000000 -0.8772575 3.1172072 1.4481927 3.1415927 -1.4481927 -3.1172072
## [8] 0.8772575
The phase varies between ±\(\pi\). The emuR function cr()
can be used to plot the sinusoids as follows:
par(mfrow = c(8, 1))
par(mar = c(1, 2, 1, 1))
for(j in 1:8) {
cr(sig.amp[j], j-1, sig.phase[j], 8, main = paste(j-1, "cycles"), axes = F)
axis(side = 2)
}
It is apparent from the above plot that frequencies greater than \(N/2\) cycles at \(k = 5, 6, 7\) cycles are identical to those at \(k = 3, 2, 1\) cycles respectively. This is the result of aliasing that can be best understood by drawing e.g. sinusoids at \(k = 5\) and \(k = 3\) cycles for \(N = 8\) points in the signal:
The 5-cycle sinusoid – meaning that 5 revolutions around the circle have to be completed given 8 points – derives identical amplitude values: the only difference is that they are on opposite sides of the circle compared with the amplitude values from those of the 3-cycle sinusoid. This derivation of identical sinusoids from Fourier analysis is known as aliasing. Aliasing occurs for frequencies above \(k = N/2\); this frequency \(k = N/2\) is known as the critical or Nyquist frequency (named after the Swedish engineer Harry Nyquist).
In speech processing, only the frequencies up to the Nyquist frequency are retained. Thus for an FFT applied to an \(N\) point signal, the retained frequencies in the spectrum are at \(k = 0, 1, 2, ... N/2\) cycles. The number of such retained frequencies is always \(N/2 + 1\) (thus 5 retained frequencies for \(N = 8\)).
The d.c. offset is the sinusoid at \(k = 0\) cycles which is always a straight line. The reason why is evident here:
i.e. the point is derived by jumping up and down 8 times on the circle but without moving around the circle. The d.c. offset is equal to the sum of the signal values to which the FFT was applied, thus:
## [1] 16
## [1] 16
The type of spectrum used in speech analysis is a plot of the sinsuoids’ amplitudes as a function of frequency up to the Nyquist frequency. Thus for the above case, the spectrum is:
Notice that the above plot (as is typical for most spectral plots) joins the points by straight lines. This is misleading because there are never any spectral components between the integer number of cycles. The spectrum is therefore really one of:
par(mfrow = c(1, 2))
plot(0:4, sig.amp[1:5],
ylab = "Amplitude",
xlab = "Frequency (number of cycles)")
plot(0:4, sig.amp[1:5],
ylab = "Amplitude",
xlab = "Frequency (number of cycles)",
type = "h")
The y-axis is typically in decibels which are more faithfully associated with perceived loudness than the amplitude of air-pressure variations recorded at the microphone. Decibel is 10 times the base-10 logarithm of the power relative to a reference sound. The power is just the amplitude squared and the reference sound in many types of computer analysis is assumed to be zero (which corresponds in fact to a reference sound with an amplitude of +1 or -1, i.e. a power of 1 unit). So to get to decibels:
# signal length
N = length(sig)
# Power
sig.power = sig.amp^2
# Bels
sig.Bel = log(sig.power, base = 10)
# Decibels
sig.dB = 10 * sig.Bel
# or in one step from amplitudes
sig.dB = 20 * log(sig.amp, base = 10)
# discard all dB intensities over the Nyquist frequency
sig.dB = sig.dB[1:((N/2)+1)]
The x-axis is usually in Hertz (Hz), i.e. cycles per second. This can be derived only if the sampling frequency is known, i.e. the number of data points in the time signal per second. Suppose then the sampling frequency for sig
to which the FFT has been applied is fs
= 20000 Hz. Then, the observable frequencies in Hz up to the Nyquist frequency for an \(N\) point signal is given by:
## [1] 0 2500 5000 7500 10000
Finally, then the dB-spectrum is:
An important concept in spectra is the frequency resolution which is the interval on the frequency axis between spectral components and given by the sampling frequency divided by the signal length, thus:
## [1] 2500
From this it is clear that when an FFT is applied to an \(N\) point signal with a sampling frequency of \(fs\) Hz, then the spectrum will have spectral components between 0 Hz and \(fs/2\) Hz (the Nyquist frequency) spaced at equal intervals of \(fs/N\) Hz on the frequency axis (and no other spectral components).
Fourier synthesis is the sum of the sinusoids into which the signal was decomposed using Fourier analysis but also normalised by the signal length. The emuR cr()
function literally adds up all the points of the 8 sinusoids shown earlier for the same time value. The other way to do Fourier synthesis is to use the fft()
function but with inverse = T
. This gives an output with a zero-valued complex number that can be discarded using the Re()
function (= take the real part only). Thus:
# add up the sinusoids and normalise for the signal length
cr(sig.amp, 0:7, sig.phase, 8, values = T, plotf = F) / 8
## [1] -4 10 25 2 -20 1 -8 10
## [1] -4 10 25 2 -20 1 -8 10
## [1] -4 10 25 2 -20 1 -8 10
The analysis of the spectral data in this case will be based on an [ɛ] vowel from the Kiel Corpus of Read Speech produced by a male speaker. The following extracts \(N = 512\) points of the waveform from the temporal midpoint of the first [ɛ] vowel:
# segment list of [E] vowels
v.e = query(kielread_DB, "Phonetic = E")
# sampled speech data from the first segment, 512 points at the temporal midpoint
v.sam512 = get_trackdata(kielread_DB,
v.e[1,],
"MEDIAFILE_SAMPLES",
cut = .5,
npoints = 512)
##
## INFO: parsing 1 wav segments/events
##
|
| | 0%
|
|======================================================================| 100%
The sample rate is 16 kHz as shown by e.g.:
## [1] 16000
so that the \(N = 512\) points of signal extend over a duration of \(1000 \cdot 512/16000 = 32\) ms. A plot of the data is shown here and vertical lines have been placed to extend over 4 pitch periods:
v.sam512 %>%
ggplot +
aes(y = T1, x = times_rel) +
geom_line() +
geom_vline(xintercept = c(4.15, 30.91),
color = "red")
The four pitch periods have a duration of \(30.91 - 4.15 = 26.76\) ms. Thus one pitch period has a duration of approximately \(26.76/4 = 6.69\) ms. From this, the fundamental frequency can be estimated at \(1000/6.69 \approx 149\) Hz. This will become relevant in looking at the spectrum of the signal.
Fourier analysis operates under the assumption that the signal has an infinite duration i.e. that the above signal of the waveform repeats along the time axis indefinitely. When the FFT is applied to the waveform then it is also applied to a rectangular window which has a value of 1 for the duration of just one repetition of the waveform and is zero elsewhere (which removes all the other repeated waveforms). The purpose of applying a raised cosine window is to minimise the effects on the spectrum of this rectangular window. Below is the waveform multiplied with a raised cosine window known as a Hann window:
# function has default arguments for Hann;
# set `a` and `b` to 0.54 and 0.46
# for a Hamming window
coswin = function(N = 512, a = 0.5, b = 0.5) {
n = 0:(N-1)
a - b * cos(2 * pi * n/(N-1))
}
plot(coswin(), type = "l")
The waveform after applying the Hann window to the waveform is superimposed in black on the original waveform in grey.
# create T1h with Hann-windowed data
v.sam512 %<>%
mutate(T1h = T1 * coswin())
v.sam512 %>%
ggplot +
aes(times_rel) +
# original waveform
geom_line(aes(y = T1), colour = "slategray") +
# Hann windowed waveform
geom_line(aes(y = T1h), size = 0.9) +
ylab("Amplitude") +
xlab("time (ms)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The spectrum can once again be calculated with the fft()
function. The number of points in the signal, \(N\), should be a power of 2 (i.e., one of 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024…):
# number of points in the signal
N = 512
v.spec512 = v.sam512 %>%
# apply fft and get the spectral amplitudes
mutate(amp = Mod(fft(T1h))) %>%
# convert the amplitudes to intensity
mutate(dB = 20 * log(amp, base = 10)) %>%
# keep the spectral components up to the
# Nyquist frequency
head(n = (N/2)+1)
The sample rate is 16000 Hz. So the frequency in Hz up to the Nyquist frequency is at intervals of \(fs/N = 16000/512 = 31.25\) Hz.
v.spec512 %<>%
# 257 is (N/2)+1
# Create a column showing the frequency in Hz
mutate(Hz = seq(0, by = 31.25, length.out = 257))
Finally, a plot of the spectrum is:
The connection to the fundamental frequency can be seen by plotting frequencies in the lower part of the spectrum:
v.spec512 %>%
filter(Hz < 1000) %>%
ggplot +
aes(y = dB, x = Hz) +
geom_point() +
geom_line() +
geom_vline(xintercept = 593.75,
color = "red")
The red line marks the 4th harmonic (the four peak in the spectrum) which has a frequency of 593.75 Hz. From this, the fundamental frequency can be estimated at \(593.75/4 \approx 148.5\) Hz which is in very close agreement to the value estimated from the waveform earlier.
In voiced speech, the signal drops off at -12dB/octave (per doubling of frequency) but radiation at lips causes the signal to be boosted by +6dB per octave so that the net effect is a -6dB/octave drop off. This combined effect is often corrected with pre-emphasis by giving the signal a +6 dB/octave boost. For various reasons to do with the relationship between time and frequency, a 6dB/octave boost can be produced by differencing which can be done by subtracting a time-shifted version of the signal from itself. This can be accomplished using the shift()
function in emuR which shifts circularly (so that the last element becomes the first):
## [1] -4 10 25 2 -20 1 -8 10
## [1] 10 -4 10 25 2 -20 1 -8
## [1] -14 14 15 -23 -22 21 -9 18
Here then is a comparison of the spectra derived from waveforms without (as above) and with differencing:
v.spec512d = v.sam512 %>%
# Differencing
mutate(T1hd = T1h - shift(T1h, 1)) %>%
# calculate FFT thereof
mutate(amp = Mod(fft(T1hd))) %>%
# Intensity
mutate(dB = 20 * log(amp, base = 10)) %>%
head(n = (N/2)+1)
# Add the spectrum of the differenced signal
# as an extra column to the spectral data obtained earlier:
v.spec512 %<>%
mutate(dBboost = v.spec512d$dB)
A plot of the spectra with and without pre-emphasis is shown below. The component at 0 Hz has been removed because for various reasons, pre-emphasis causes it to have an infinite value. As shown below, the result of pre-emphasis is to tilt the spectrum so that intensity of low frequencies is reduced and that of high frequencies increased.
v.spec512 %>%
# remove d.c. offset i.e. spectral component at 0 Hz
slice(-1) %>%
ggplot +
aes(x = Hz) +
# original spectrum
geom_line(aes(y = dB)) +
# pre-emphasised signal
geom_line(aes(y = dBboost),
col = "red")
The task now is to calculate spectra for the entire databases using the function dftSpectrum()
from the wrassp
signal processing package; and then to compare the spectra of some German voiceless fricatives.
The arguments to the dftSpectrum()
function are given by:
## $listOfFiles
## NULL
##
## $optLogFilePath
## NULL
##
## $beginTime
## [1] 0
##
## $centerTime
## [1] FALSE
##
## $endTime
## [1] 0
##
## $resolution
## [1] 40
##
## $fftLength
## [1] 0
##
## $windowShift
## [1] 5
##
## $window
## [1] "BLACKMAN"
##
## $bandwidth
## [1] 0
##
## $toFile
## [1] TRUE
##
## $explicitExt
## NULL
##
## $outputDirectory
## NULL
##
## $forceToLog
## useWrasspLogger
##
## $verbose
## [1] TRUE
The most important of these for the present purposes are
resolution. This, as its name suggests, is the frequency resolution between spectral components and which is set by default to 40 Hz. As shown earlier, the sampling frequency for the database under investigation, kielread_dB
, is \(fs = 16000\) Hz. Using the formula given earlier of resolution
= \(fs/N\), this means that the window length for a resolution of 40 Hz is \(N = fs/40 = 400\) points. However, the FFT only operates on windows that a are a power of 2. The next power of 2 window up from 400 points is 512. For this reason, if dftSpectrum()
is applied to the Kiel corpus, then this will be done on signals of length \(N = 512\) points resulting in 257 spectral components equally spaced between 0 Hz and the 8000 Hz (the Nyquist frequency).
windowShift. This is set to 5 ms by default. It means that an FFT of 512 points (in this case) will be applied to the left edge (onset) of the speech signal, then the FFT will be re-applied at +5 ms from the signal start, then at +10 ms and so on to the end of the signal.
window. This is set by default to “BLACKMAN” which has a very similar effect on the speech signal as the Hann window discussed earlier.
Spectral data can be calculated for the entire database using the add_ssffTrackDefinition()
as described in an earlier module. This is done with the following command which stores the spectral data with extension .dft
in the same bundle as the location of the waveform:
add_ssffTrackDefinition(kielread_DB,
# choose a track name
"spectrum",
onTheFlyFunctionName = "dftSpectrum")
The defaults can be changed by saving the parameters from the formals()
function, manipulating them, and then passing that to the argument onTheFlyParams
. For example, the following increases the frequency resolution to 250 Hz. In this case, the FFT will be applied to signals of length \(N = 16000/250 = 64\) points resulting in 33 equally spaced spectral components between 0 Hz and the Nyquist frequency. In order that the .dft
files from the preceding calculation with the frequency resolution of 40 Hz are not overwritten, a new file extension is chosen (dft2
) for storing this spectral data with a frequency resolution of 250 Hz.
The task here is to compare some spectra at the temporal midpoint of the fricatives [s, ʃ, x] extracted from the Kielread corpus. The expectation is that the energy should be concentrated in the highest frequency region for [s], then in a mid-frequency region for [ʃ]. Since [x] has an articulatory configuration that is quite like the back vowel [u], the energy (as for [u]) should be predominantly in the lower part of the spectrum.
# Segment list of the three fricatives
fric.s = query(kielread_DB, "Phonetic = s|S|x")
# Spectral data at the fricatives' temporal midpoint
fric.dft = get_trackdata(kielread_DB,
fric.s,
"spectrum",
cut = .5)
# add a column for the speaker
fric.dft %<>%
mutate(speaker = substring(fric.dft$bundle, 1, 3))
The spectral data should have 257 columns beginning with T
(i.e. T1
, T2
… T257
) given that, as calculated earlier, if the resolution is 40 Hz and the sampling frequency is 16000 Hz, then the FFT is being applied over a 512 point window resulting in 257 equally spaced spectral components between 0 Hz and 8000 Hz (the Nyquist frequency):
## [1] 257
The spectra calculated with a frequency resolution of 250 Hz should, by contrast, have 33 spectral components also equally spaced between 0 Hz and 8000 Hz (33 because \(N = 64\) in this case, hence \(N/2 + 1 = 33\) spectral components):
## [1] 33
As is clear from the above, the spectral components are all arranged in columns. But for plotting purposes, it is more convenient to arrange them in a long format. The emuR function convert_wideToLong()
will do this when applied to spectral data. More specifically, with the argument calcFreqs = T
, convert_wideToLong()
creates a variable (column) called track_name
containing the spectral columns (T1
to T257
); a variable (column) called track_values
containing the spectral components; and another variable (column) freq
containing the frequency in Hz at which these spectral components occur:
Alternatively, the same can be achieved using the following tidyverse
code:
# sample rate
fs = unique(fric.dft$sample_rate)
# number of spectral components
N = fric.dft %>%
select(starts_with("T", ignore.case = F)) %>%
ncol()
fric.dft.long2 <- fric.dft %>%
pivot_longer(cols = c(starts_with("T", ignore.case = F)),
names_to = "track_name",
values_to = "track_value") %>%
mutate(freq = rep(seq(0, fs/2, length.out = N),
times = nrow(fric.dft)))
The following plots the spectrum at the temporal midpoint of the 10th segment (corresponding to the 10th row of fric.s
from which these spectral data were derived):
j = 10
fric.dft.long %>%
filter(sl_rowIdx == j) %>%
ggplot +
aes(x = freq, y = track_value) +
geom_line() +
xlab("Frequency (Hz)") +
ylab("Intensity (dB)")
To do the same but in a particular frequency range (here 2-6 kHz):
j = 10
fric.dft.long %>%
filter(sl_rowIdx == j) %>%
filter(freq > 2000 & freq < 6000) %>%
ggplot +
aes(x = freq, y = track_value) +
geom_line() +
xlab("Frequency (Hz)") +
ylab("Intensity (dB)")
The following makes a plot of all the fricative spectra separately for the two speakers. Note that group = sl_rowIdx
is used in order to ensure that there is one spectrum per segment:
fric.dft.long %>%
ggplot +
aes(x = freq,
y = track_value,
group = sl_rowIdx,
col = labels) +
geom_line() +
facet_wrap(~speaker) +
xlab("Frequency (Hz)") +
ylab("Intensity (dB)")
The following can be used to calculate an ensemble-average per speaker and label type:
fric.dft.long.mean = fric.dft.long %>%
group_by(labels, freq, speaker) %>%
summarise(track_value = mean(track_value)) %>%
ungroup()
## `summarise()` has grouped output by 'labels', 'freq'. You can override using
## the `.groups` argument.
ggplot(fric.dft.long.mean) +
aes(x = freq,
y = track_value,
col = labels) +
facet_wrap(~speaker) +
geom_line() +
xlab("Frequency (Hz)") +
ylab("Intensity (dB)")
From the above, it is clear enough that [s] has a rising spectrum which is distinguished from [ʃ, x] which have falling spectra with increasing frequency; and [ʃ, x] are distinguished from each other first because the overall dB level of [ʃ] is greater (which is to be expected since [ʃ, x] are sibilant and non-sibilant fricatives respectively; and [ʃ] also has a greater concentration of energy in the central frequency region in comparison with the other two fricatives.
There is no reason to apply pre-emphasis to the fricative spectra. Nevertheless, an example of how this can be done is given below. It makes use of the wrassp
function afdiff()
which must first be applied to the waveforms, before the spectral data are calculated. This in turn requires identifying the paths of the waveforms in the same manner as done in an earlier module. By default, using the method below causes the first differenced waveforms to be stored in the same bundles in which the original waveforms are located. They are stored by default with extension .dwav
if first differencing is carried out on waveforms whose extension in .wav
(as in this case):
# identify the waveforms
kielread_wav_paths =
list.files(file.path(targetDir, "kielread_emuDB"),
pattern = ".*wav$",
recursive = T,
full.names = T)
# first difference the waveforms: the output
# will be stored as files with extension .dwav
afdiff(kielread_wav_paths)
# calculate spectral data on these.
# First locate the first differenced waveforms
kielread_wavd_paths =
list.files(file.path(targetDir, "kielread_emuDB"),
pattern = ".*dwav$",
recursive = T,
full.names = T)
# Calculate spectral data and store the output
# of the spectral data with extension .dftdiff
dftSpectrum(kielread_wavd_paths,
explicitExt = "dftdiff")
# Add the spectral data of the first differenced
# data to the database.
add_ssffTrackDefinition(kielread_DB,
# choose a name
name = "dftdiff",
columnName = "dft",
fileExtension = "dftdiff")
The following repeats many of the earlier commands in this section in order to obtain spectra derived from the first differenced signal averaged by fricative category and speaker.
# get spectral data of the 1st differenced signal
fric.dftdiff = get_trackdata(kielread_DB,
fric.s,
"dftdiff",
cut = .5)
# add a column for the speaker
fric.dftdiff %<>%
mutate(speaker = substring(fric.dftdiff$bundle, 1, 3))
# convert to long format
fric.dftdiff.long = convert_wideToLong(fric.dftdiff,
calcFreqs = T)
# remove the spectral component at 0 Hz
fric.dftdiff.long %>%
filter(track_name != "T1") %>%
ggplot +
aes(x = freq,
y = track_value,
group = sl_rowIdx, col = labels) +
geom_line() +
facet_wrap(~speaker)
# average by fricative category, frequency and speaker
fric.dftdiff.long.mean = fric.dftdiff.long %>%
group_by(labels, freq, speaker) %>%
summarise(track_value = mean(track_value)) %>%
ungroup()
## `summarise()` has grouped output by 'labels', 'freq'. You can override using
## the `.groups` argument.
# plot the result
ggplot(fric.dftdiff.long.mean) +
aes(x = freq,
y = track_value,
col = labels) +
facet_wrap(~speaker) +
geom_line()