If not already done, carry out parts 1-5 of the setup, as described here.
Load these libraries:
library(tidyverse)
library(gridExtra)
Load these objects:
urla = "https://www.phonetik.uni-muenchen.de/studium_lehre"
urlb = "lehrmaterialien/R_speech_processing/Rdf"
url = paste(urla, urlb, sep="/")
int = read.table(file.path(url, "intdauer.txt"),
                 stringsAsFactors = T)
vdata = read.table(file.path(url, "vdata.txt"),
                   stringsAsFactors = T)
asp = read.table(file.path(url, "asp.df.txt"),
                 stringsAsFactors = T)
coronal = read.table(file.path(url, "coronal.txt"),
                     stringsAsFactors = T)
Plotting data is an indispensable extension of the summary statistics described in the previous module. The plotting function to be used is ggplot(), which forms part of the ggplot2 library. 'gg' stands for grammar of graphics, which refers to the idea that all types of plots can be assembled from a small number of reusable functions.
The type of plot that is needed in speech science depends on the combination(s) of numerical and categorical data that are being plotted.
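Plotting one numerical variable against one categorical variable is a very common type of analysis in speech science. A boxplot can be used to provide data that is relevant for answering questions of this kind. In the example below, the question is whether VOT (voice onset time) differs between the stops coded in C.
The data-frame is asp. Its columns are:
VOT: the numerical variable.
C: the categorical variable, i.e. the stop consonant.
Word: which word the stop is taken from.
Vpn: Versuchsperson, i.e. an identifier for the speaker.
Stress: whether the stop is from a lexically stressed or unstressed syllable.
In most types of ggplots there are three obligatory components:
ggplot(): receives the data-frame.
aes(): specifies which variables are mapped to the axes.
geom_: specifies the type of plot. Thus in this case, geom_boxplot().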
For the data in question, the three components are shown below. Notice that they are separated by a + sign, thus:
ggplot(asp) +
  aes(x = C, y = VOT) +
  geom_boxplot()
Alternatively, the data-frame can be piped to the ggplot() function. This is especially useful because the data-frame can be filtered or otherwise manipulated prior to plotting.
asp %>%
  ggplot() +
  aes(x = C, y = VOT) +
  geom_boxplot()
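As an illustration of filtering before plotting, here is a minimal sketch. The data frame toy and all of its values (including the Stress codes "s" and "u") are hypothetical stand-ins invented for this example; they are not taken from asp.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for asp (values invented for illustration)
toy <- data.frame(
  C = factor(c("k", "k", "t", "t", "k", "t")),
  VOT = c(60, 45, 50, 38, 70, 55),
  Stress = factor(c("s", "u", "s", "u", "s", "s"))
)

p <- toy %>%
  # keep only the rows coded "s" before the data reach ggplot()
  filter(Stress == "s") %>%
  ggplot() +
  aes(x = C, y = VOT) +
  geom_boxplot()
p
```

Because the filter is applied inside the pipe, toy itself is left unchanged and the same data frame can be reused for other plots.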
A boxplot gives the following information:
1. The horizontal line in the box is the median.
2. The vertical extent of the box is the interquartile range (IQR).
3. The whiskers extend to the highest/lowest values that lie within 1.5 * IQR of the upper/lower edge of the box.
4. The points are all the remaining values not within 1-3.
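The quantities listed above can also be computed directly. The following is a minimal sketch on a small made-up vector (the values in x are purely illustrative and are not taken from asp), assuming the default quartile definition of quantile():

```r
# Hypothetical VOT-like values in ms (illustration only)
x <- c(12, 15, 18, 20, 22, 25, 27, 30, 33, 80)

# lower quartile, median, upper quartile
q <- quantile(x, c(0.25, 0.5, 0.75))
# interquartile range = vertical extent of the box
iqr <- unname(q[3] - q[1])

# whiskers: most extreme values still within 1.5 * IQR of the box edges
lower.whisker <- min(x[x >= q[1] - 1.5 * iqr])
upper.whisker <- max(x[x <= q[3] + 1.5 * iqr])

# points: everything beyond the whiskers
outliers <- x[x < lower.whisker | x > upper.whisker]

q
c(lower.whisker, upper.whisker)
outliers
```

In this example only the value 80 lies beyond the whiskers, so it is the only value that would be drawn as a separate point.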
In this case, both variables are numerical and the question concerns the extent to which one numerical variable can be predicted from another.
The data-frame is int. Its columns are:
Dauer: duration (German: Dauer), a numerical variable.
dB: a numerical variable in decibels.
Vpn is an additional factor for the speaker.
The relationship between the variables can be shown with the functions geom_point() for a so-called scatter-plot or with geom_line() for a line-plot or with both.
Here are the plots:
# Scatter-plot
int %>%
  ggplot() +
  aes(x = Dauer, y = dB) +
  geom_point()
# Line-plot
int %>%
  ggplot() +
  aes(x = Dauer, y = dB) +
  geom_line()
# Both:
int %>%
  ggplot() +
  aes(x = Dauer, y = dB) +
  geom_line() +
  geom_point()
Vertical and horizontal reference lines can be added to plots with geom_vline() and geom_hline().
int %>%
  ggplot() +
  aes(x = Dauer, y = dB) +
  geom_point() +
  # add vertical line at 150 ms
  geom_vline(xintercept = 150) +
  # add horizontal line at 35 dB
  geom_hline(yintercept = 35)
There are various instances in speech science and especially social studies of speech concerning how factors (categorical variables) are connected. Examples are questions whose outcomes fall into categories such as yes or no, or creaky vs. non-creaky.
The relationship between the variables can be shown with the function geom_bar().
The question here is: Is the choice of fricative, s or sh (= IPA [ʃ]), influenced by region?
The data-frame is coronal. Its columns are:
Fr: the fricative (s or sh).
Region: the region.
Vpn is an additional factor for the speaker.
Socialclass is another factor showing the speaker’s social class.
The function geom_bar() counts the number of occurrences (in this case of s and sh) separately. Because the count is shown on the y-axis, the distinction between the two fricatives is made by colour differences with the fill argument, thus:
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar()
The bars can be placed side by side instead of stacked on top of each other:
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position = "dodge")
Notice that the barplot is a graphical representation of one of the summary statistics for categorical variables, i.e.
coronal %>%
  group_by(Region, Fr) %>%
  summarise(count = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
## Region Fr count
## <fct> <fct> <int>
## 1 R1 s 58
## 2 R1 sh 23
## 3 R2 s 59
## 4 R2 sh 21
## 5 R3 s 66
## 6 R3 sh 13
so that row 1 of the above is represented by the red far left bar in the above plot. The number of times that the levels of a single factor occur is given by:
coronal %>%
  group_by(Fr) %>%
  summarise(count = n())
## # A tibble: 2 × 2
## Fr count
## <fct> <int>
## 1 s 183
## 2 sh 57
corresponding to:
coronal %>%
  ggplot() +
  aes(x = Fr) +
  geom_bar()
Any of the above can be converted to proportions with the additional argument position = "fill" within geom_bar():
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position = "fill")
The above graph shows the proportion of s and sh within each region. Numerically, the proportions are given by:
coronal %>%
  group_by(Region, Fr) %>%
  summarise(count = n()) %>%
  mutate(prop = count/sum(count))
## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups: Region [3]
## Region Fr count prop
## <fct> <fct> <int> <dbl>
## 1 R1 s 58 0.716
## 2 R1 sh 23 0.284
## 3 R2 s 59 0.738
## 4 R2 sh 21 0.262
## 5 R3 s 66 0.835
## 6 R3 sh 13 0.165
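The proportions add up to 1 within each region because, as the message in the output says, the result of summarise() is still grouped by Region, so sum(count) is the total per region. A minimal sketch that makes this explicit, using a made-up data frame toy (hypothetical values, not the coronal data) and the .groups argument of summarise():

```r
library(dplyr)

# Hypothetical data (illustration only)
toy <- data.frame(
  Region = c("R1", "R1", "R1", "R2", "R2"),
  Fr     = c("s",  "s",  "sh", "s",  "sh")
)

props <- toy %>%
  group_by(Region, Fr) %>%
  # drop only the last grouping variable (Fr): still grouped by Region
  summarise(count = n(), .groups = "drop_last") %>%
  # sum(count) is therefore the total within each Region
  mutate(prop = count / sum(count)) %>%
  ungroup()
props
```

Setting .groups explicitly also suppresses the message about grouped output.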
The same general principles apply as outlined above when there are multiple variables. Here is a guide as to which plot to use:
geom_boxplot(): if there is only one numerical variable.
geom_point(), geom_line(): if there are two numerical variables.
geom_bar(): if there are only factors.
A common way to extend the above to multiple variables is to code the additional variables by colour in the first two cases and by using the facet_wrap() function for all three. Some examples are given below.
geom_boxplot() and one numerical variable
The following shows how F2 varies by vowel and consonantal context using colour-coding:
vdata %>%
  ggplot() +
  aes(y = F2, x = V, col = Cons) +
  geom_boxplot()
Alternatively, to fill the boxes with colours:
vdata %>%
  ggplot() +
  aes(y = F2, x = V, fill = Cons) +
  geom_boxplot()
This is a plot of the same information, but with the vowels shown separately by consonantal context:
vdata %>%
  ggplot() +
  aes(y = F2, x = V) +
  geom_boxplot() +
  facet_wrap(~Cons)
The number of rows and columns can be specified with the nrow = and ncol = arguments to facet_wrap(). For example, to redraw the above but with 3 rows and 1 column:
vdata %>%
  ggplot() +
  aes(y = F2, x = V) +
  geom_boxplot() +
  facet_wrap(~Cons, nrow = 3, ncol = 1)
The above can be extended to three variables. Here the additional categorical variable, Tense, is coded by colour:
vdata %>%
  ggplot() +
  aes(y = F2, x = V, col = Tense) +
  geom_boxplot() +
  facet_wrap(~Cons)
The same information as above but this time with combinations of Tense and Cons in their own panels:
vdata %>%
  ggplot() +
  aes(y = F2, x = V) +
  geom_boxplot() +
  facet_wrap(Tense~Cons)
The following is the same as above but with a colour-coding for Rate (thus a display of 4 variables):
vdata %>%
  ggplot() +
  aes(y = F2, x = V, col = Rate) +
  geom_boxplot() +
  facet_wrap(Tense~Cons)
This is the same information as above, but with the variables Tense, Cons, Rate in their own panels:
vdata %>%
  ggplot() +
  aes(y = F2, x = V) +
  geom_boxplot() +
  facet_wrap(Tense~Cons+Rate)
geom_point() and two numerical variables
Displaying additional variables using colours or with facet_wrap() also applies to the other plotting functions considered so far. Here is a plot of the first two formants on the y- and x-axes but colour-coded separately by vowel:
vdata %>%
  ggplot() +
  aes(y = F1, x = F2, col = V) +
  geom_point()
Reversing the axes causes the vowels to be positioned as in the cardinal vowel space, i.e. with phonetic height on the vertical axis and phonetic backness on the horizontal axis (thus with I in the top left, U in the top right, and A at the bottom of the display):
vdata %>%
  ggplot() +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
As above, but with separate panels for tense and lax vowels (note the greater separation between the vowels in the tense case):
vdata %>%
  ggplot() +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse() +
  facet_wrap(~Tense)
In order to supply text (of e.g. the vowels) rather than points, use geom_text():
vdata %>%
  ggplot() +
  aes(y = F1, x = F2, col = V) +
  geom_text(aes(y = F1, x = F2, label = V)) +
  scale_x_reverse() +
  scale_y_reverse() +
  facet_wrap(~Tense)
geom_bar() and all variables are categorical
Additional factors are best represented with facet_wrap(), given that colour-coding is often already used when there are two factors. The following plot shows the proportion of s vs. sh choices by region but additionally separated by social class:
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass)
The proportions can be calculated numerically as follows (always put the variable that is colour-coded, in this case Fr, last):
coronal %>%
  group_by(Region, Socialclass, Fr) %>%
  summarise(count = n()) %>%
  mutate(prop = count/sum(count))
## `summarise()` has grouped output by 'Region', 'Socialclass'. You can override using the `.groups` argument.
## # A tibble: 17 × 5
## # Groups: Region, Socialclass [9]
## Region Socialclass Fr count prop
## <fct> <fct> <fct> <int> <dbl>
## 1 R1 LM s 27 0.675
## 2 R1 LM sh 13 0.325
## 3 R1 UM s 26 0.867
## 4 R1 UM sh 4 0.133
## 5 R1 W s 5 0.455
## 6 R1 W sh 6 0.545
## 7 R2 LM s 21 0.808
## 8 R2 LM sh 5 0.192
## 9 R2 UM s 18 1
## 10 R2 W s 20 0.556
## 11 R2 W sh 16 0.444
## 12 R3 LM s 16 0.727
## 13 R3 LM sh 6 0.273
## 14 R3 UM s 29 0.935
## 15 R3 UM sh 2 0.0645
## 16 R3 W s 21 0.808
## 17 R3 W sh 5 0.192
Within the R environment, any plot can be saved as follows:
# Save the plot as `p1`
p1 = vdata %>%
  ggplot() +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
No plot is visibly created with the above. To see the resulting plot, enter p1 on its own:
p1
With this method, a plot can be constructed in pieces. For example:
p1 = vdata %>%
  ggplot() +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
p1
Modify the plot to separate by Tense:
p1 + facet_wrap(~Tense)
Alternatively:
p2 = p1 + facet_wrap(~Tense)
p2
Saving plots outside the R environment can be done with ggsave(). The following saves the above plot p2 as p2.png with a width and height of 15 cm in the project directory:
ggsave(filename = "p2.png",
plot = p2,
width = 15,
height = 15,
units = "cm")
Several plots can be combined into a single figure with the grid.arrange() function from the gridExtra package. The following combines these three separate plots.
# Scatter-plot
i1 = int %>%
  ggplot() +
  aes(x = Dauer, y = dB) +
  geom_point()
# Line-plot
i2 = int %>%
  ggplot() +
  aes(x = Dauer, y = dB) +
  geom_line()
# Both:
i3 = int %>%
  ggplot() +
  aes(x = Dauer, y = dB) +
  geom_line() +
  geom_point()
# combine all three plots
grid.arrange(i1, i2, i3)
The arguments nrow= and ncol= can be used to define the number of rows and columns in the combination plot. For example, to replot the above as 1 row and three columns:
grid.arrange(i1, i2, i3, nrow = 1, ncol = 3)
The following rather inconveniently shows the panels from left to right in the order LM, UM, W, which are lower-middle class, upper-middle class, and working class.
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass)
A rearrangement of the above so that the progression is in alignment with the social class (i.e., with working class, lower-middle class, and then upper-middle class from left to right) can be most easily done by creating a new factor with the levels arranged in that order, thus:
# make a new factor Sclass
# with levels "W", "LM", "UM"
coronal <- coronal %>%
  mutate(Sclass = factor(Socialclass,
                         levels = c("W", "LM", "UM"))
  )
Now plot with the variable Sclass:
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Sclass)
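The effect of the levels argument of factor() can be checked with levels(). A minimal sketch with a made-up vector of class codes (hypothetical values, not the coronal data):

```r
# Hypothetical social-class codes (illustration only)
sc <- factor(c("LM", "W", "UM", "W", "LM"))
# default: levels are in alphabetical order
levels(sc)

sclass <- factor(sc, levels = c("W", "LM", "UM"))
# levels are now in the order that was specified
levels(sclass)

# the data themselves are unchanged; only the order of the levels differs
all(as.character(sc) == as.character(sclass))
```

Since facet_wrap() arranges the panels in the order of the factor levels, changing the level order is enough to change the order of the panels.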
Use xlab(), ylab() and ggtitle() to set the titles of the x-axis, the y-axis, and the plot:
asp %>%
  ggplot() +
  aes(y = VOT, x = C) +
  geom_boxplot() +
  xlab("Place of articulation") +
  ylab("Duration (ms)") +
  ggtitle("VOT in two stops")
The function labs() can be used to specify all labels together:
coronal %>%
  ggplot() +
  aes(x = Region, fill = Fr) +
  geom_bar(position = "fill") +
  labs(x = "Region",
       y = "Proportion",
       title = "Proportion of /s/ and /ʃ/",
       subtitle = "By region")
The visible range of numerical variables on the x- and y-axes can be set in two ways:
xlim(), ylim() (and/or scale_x_continuous(limits = c()) and/or scale_y_continuous(limits = c())): this removes data points that lie outside the limits and gives a warning. Using this approach influences the way that superimposed elements such as regression lines appear in the figure.
coord_cartesian(xlim = c(), ylim = c()): this only zooms into the specified range, so no data points are removed, there is no warning, and there is no influence on any superimposed elements on the figure. Further details are given here.
# Without limits:
int %>%
  ggplot() +
  aes(x = dB, y = Dauer) +
  geom_point()
# with limits specified by coord_cartesian()
int %>%
  ggplot() +
  aes(x = dB, y = Dauer) +
  geom_point() +
  coord_cartesian(xlim = c(10,40),
                  ylim = c(30,280))
# as above but using xlim() and ylim()
int %>%
  ggplot() +
  aes(x = dB, y = Dauer) +
  geom_point() +
  xlim(10, 40) +
  ylim(30, 280)
## Warning: Removed 2 rows containing missing values (geom_point).
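The practical consequence of removing data points can be seen whenever a statistic is computed from the data. The following is a minimal sketch of the difference, using a made-up data frame d and a boxplot rather than the int data; layer_data() is used here only to read back the median that ggplot2 computed.

```r
library(ggplot2)

# Hypothetical data (illustration only): one group with one extreme value
d <- data.frame(g = "a", y = c(1, 2, 3, 4, 100))

p <- ggplot(d) +
  aes(x = g, y = y) +
  geom_boxplot()

# coord_cartesian() zooms: the median is still computed from all 5 values
p.zoom <- p + coord_cartesian(ylim = c(0, 10))
layer_data(p.zoom)$middle

# ylim() removes the value 100 first (with a warning),
# so the median is computed from the remaining 4 values
p.lim <- p + ylim(0, 10)
layer_data(p.lim)$middle
```

The two plots cover the same visible range, but the median line is drawn at 3 in the first and at 2.5 in the second.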
By default, ggplot2 always uses the same choice of colours. There are many other colour possibilities, whose names are accessible with:
colours()
The most consistent way of changing colours is with the functions scale_colour_manual() or scale_fill_manual(), with the latter being used to change filled colours.
American or British spelling of color/colour can be used in most cases, thus colors() instead of colours() and scale_color_manual() instead of scale_colour_manual() etc.
Thus, different colours for the earlier boxplot in which boxes were made for F2 of each vowel and then colour-coded by consonant category can be accomplished as follows (always make sure that enough colours are available, thus a minimum of three in this case, because there are three levels in the factor Cons):
farben <- c("darkgoldenrod1", "lightblue", "red")
vdata %>%
  ggplot() +
  aes(y = F2, x = V, col = Cons) +
  geom_boxplot() +
  scale_colour_manual(values = farben)
The same with user-defined filled colours:
# farben has already been defined above
vdata %>%
  ggplot() +
  aes(y = F2, x = V, fill = Cons) +
  geom_boxplot() +
  scale_fill_manual(values = farben)
Analogously for the earlier barplot:
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  scale_fill_manual(values = farben)
The plotting symbols can be specified using pch. For example, for open diamonds:
int %>%
  ggplot() +
  aes(x = dB, y = Dauer) +
  geom_point(pch=5)
A list of plotting symbols can be found here.
Lines can be drawn in different linetypes. For example, for a dashed line, specify linetype = 2, and for a dotted line, linetype = 3. The argument lwd can be used to change the line thickness: for example, lwd = 2 gives a thicker line than the default.
int %>%
  ggplot() +
  aes(x = dB, y = Dauer) +
  geom_line(linetype=3, lwd=2)
A list of linetypes can be found here.
The default font size is 11 or less. For presentations, this is often too small (the font size should instead be between 16 and 24 points). Changing the font size is done inside the theme() function. Some of the possibilities are as follows:
1. text changes the font size of all the text.
2. title changes the font size of the plot title, axes titles, and legends.
3. axis.title changes the font size of the titles on the x- and y-axes. To control x- and y-axes separately, use axis.title.x and axis.title.y.
4. axis.text controls the font size of the tick labels. (There is also: axis.text.x and axis.text.y).
Each of 1-4 must be set to element_text(size = n), where n is the desired font size. Some examples:
# make all text size 20
asp %>%
  ggplot() +
  aes(x = C, y = VOT) +
  geom_boxplot() +
  xlab("Place") +
  ylab("Duration (ms)") +
  ggtitle("Voice onset time") +
  theme(text = element_text(size = 20))
# as above except:
# axis titles size 14,
# text for tick marks, size 24
asp %>%
  ggplot() +
  aes(x = C, y = VOT) +
  geom_boxplot() +
  xlab("Place") +
  ylab("Duration (ms)") +
  ggtitle("Voice onset time") +
  theme(
    text = element_text(size = 20),
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 24))
# all text has font size 26 except:
# the title (including the Fr title) has size 12
# the axis titles have size 26
# the text of the ticks has size 18
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  ggtitle("Barcharts for [s] and [ʃ]") +
  theme(
    text = element_text(size = 26),
    title = element_text(size = 12),
    axis.title = element_text(size = 26),
    axis.text = element_text(size = 18))
More information on the variables that can be set with theme() is here.
# remove the legend title
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(legend.title = element_blank())
# remove the legend completely
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(legend.position = "none")
# remove the axis tick labels
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(axis.text = element_blank())
# remove the axis titles
coronal %>%
  ggplot() +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(axis.title = element_blank())
There are many other ways of modifying plots. For further details:
vignette("ggplot2-specs")