If not already done, carry out parts 1-5 of the setup, as described here.
Load these libraries:
library(tidyverse)
library(gridExtra)
Load these objects:
urla = "https://www.phonetik.uni-muenchen.de/studium_lehre"
urlb = "lehrmaterialien/R_speech_processing/Rdf"
url = paste(urla, urlb, sep="/")
int = read.table(file.path(url, "intdauer.txt"),
stringsAsFactors = T)
vdata = read.table(file.path(url, "vdata.txt"),
stringsAsFactors = T)
asp = read.table(file.path(url, "asp.df.txt"),
stringsAsFactors = T)
coronal = read.table(file.path(url, "coronal.txt"),
stringsAsFactors = T)
Plotting data is an indispensable extension of the summary statistics described in the previous module. The plotting function to be used is ggplot(), which forms part of the ggplot2 library. ‘gg’ stands for grammar of graphics, which refers to the idea that all types of plots can be assembled from a small number of reusable functions.
The type of plot that is needed in speech science depends on the combination(s) of numerical and categorical variables being plotted.
This is a very common type of analysis in speech science. It concerns questions such as:
A boxplot can be used to provide data that is relevant for answering questions of this kind. In the example below, the question is:
Data-frame: asp. Its variables are:
VOT
C
Word: which word the stop is taken from.
Vpn: Versuchsperson, i.e. an identifier for the speaker.
Stress: whether the stop is from a lexically stressed or unstressed syllable.
In most types of ggplots there are three obligatory components: ggplot(), aes(), and a function whose name begins with geom_. Thus in this case, the last of these is geom_boxplot().
For the data in question, the three components are shown below. Notice that they are separated by a + sign, thus:
ggplot(asp) +
aes(x = C, y = VOT) +
geom_boxplot()
Alternatively, the data-frame can be piped to the ggplot() function. This is especially useful because the data-frame can be filtered or otherwise manipulated prior to plotting.
asp %>%
ggplot() +
aes(x = C, y = VOT) +
geom_boxplot()
A boxplot gives the following information:
1. The horizontal line in the box is the median.
2. The vertical extent of the box is the interquartile range (IQR).
3. The whiskers extend to the highest/lowest values that lie no further than 1.5 * IQR from the edges of the box.
4. The points are all the remaining values not covered by 1-3.
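The four items above can be checked numerically. The following is a minimal sketch in base R; the vector x is made up for this illustration (it is not one of the data sets of this module), and the quartiles assume the default definition used by quantile().

```r
# Hypothetical values, invented only for this illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

med <- median(x)                      # 1. line inside the box
q <- quantile(x, c(0.25, 0.75))       # lower and upper edges of the box
iqr <- unname(q[2] - q[1])            # 2. vertical extent of the box

# 3. whiskers: most extreme values no further than 1.5 * IQR from the box
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[2]) + 1.5 * iqr
inside <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)

# 4. points: every value beyond the whiskers
points_beyond <- x[x < lower_fence | x > upper_fence]

med            # 5.5
iqr            # 4.5
whiskers       # 1 9
points_beyond  # 30
```

Here the value 30 lies more than 1.5 * IQR above the box, so it would be drawn as a separate point and the upper whisker would stop at 9.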
In this case, the question concerns the extent to which one numerical variable can be predicted from another. Examples are:
Data-frame: int. Its variables are:
Dauer
dB
Vpn is an additional factor for the speaker.
The relationship between the variables can be shown with the function geom_point() for a so-called scatter-plot, with geom_line() for a line-plot, or with both.
Here are the plots:
# Scatter-plot
int %>%
ggplot +
aes(x = Dauer, y = dB) +
geom_point()
# Line-plot
int %>%
ggplot +
aes(x = Dauer, y = dB) +
geom_line()
# Both:
int %>%
ggplot +
aes(x = Dauer, y = dB) +
geom_line() +
geom_point()
Vertical and horizontal reference lines can be added to plots with geom_vline() and geom_hline().
int %>%
ggplot +
aes(x = Dauer, y = dB) +
geom_point() +
# add vertical line at 150 ms
geom_vline(xintercept = 150) +
# add horizontal line at 35 dB
geom_hline(yintercept = 35)
There are various instances in speech science, and especially in social studies of speech, that concern how factors (categorical variables) are connected. Examples are:
yes or no.
creaky vs. non-creaky.
The relationship between the variables can be shown with the function
geom_bar().
The question here is: Is the choice of fricative – s or sh (= IPA [ʃ]) – influenced by region?
Data-frame: coronal. Its variables are:
Fr
Region
Vpn is an additional factor for the speaker.
Socialclass is another factor showing the speaker’s social class.
The function geom_bar() counts the number of occurrences (in this case of s and sh) separately. Because the count is shown on the y-axis, the two fricatives are distinguished by colour with the fill argument, thus:
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar()
The bars can be placed side by side instead of stacked on top of each other:
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position = "dodge")
Notice that the barplot is a graphical representation of one of the summary statistics for categorical variables, i.e.
coronal %>%
group_by(Region, Fr) %>%
summarise(count = n()) %>%
ungroup()
## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
## Region Fr count
## <fct> <fct> <int>
## 1 R1 s 58
## 2 R1 sh 23
## 3 R2 s 59
## 4 R2 sh 21
## 5 R3 s 66
## 6 R3 sh 13
so that row 1 of the above is represented by the red far left bar in the above plot. A graphical representation of the number of times that levels of a factor occur is given by:
coronal %>%
group_by(Fr) %>%
summarise(count = n())
## # A tibble: 2 × 2
## Fr count
## <fct> <int>
## 1 s 183
## 2 sh 57
corresponding to:
coronal %>%
ggplot +
aes(x = Fr) +
geom_bar()
Any of the above can be converted to proportions with the additional argument position = "fill" within geom_bar():
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position = "fill")
The above graph shows the proportion of s and sh within each region. Numerically, the proportions are given by:
coronal %>%
group_by(Region, Fr) %>%
summarise(count = n()) %>%
mutate(prop = count/sum(count))
## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.
## # A tibble: 6 × 4
## # Groups: Region [3]
## Region Fr count prop
## <fct> <fct> <int> <dbl>
## 1 R1 s 58 0.716
## 2 R1 sh 23 0.284
## 3 R2 s 59 0.738
## 4 R2 sh 21 0.262
## 5 R3 s 66 0.835
## 6 R3 sh 13 0.165
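The proportions add up to 1 within each region because summarise() removes only the last grouping variable (here Fr), so the subsequent sum(count) is still computed separately per Region. The following is a minimal sketch of this behaviour, assuming dplyr 1.0.0 or later; the data-frame toy and its values are invented for the illustration and are not part of the coronal data.

```r
library(dplyr)

# Hypothetical observations, invented only for this illustration
toy <- data.frame(
  Region = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Fr     = c("s", "s", "s", "sh", "s", "s", "sh", "sh", "sh")
)

props <- toy %>%
  group_by(Region, Fr) %>%
  # "drop_last" is what summarise() does here by default;
  # stating it explicitly suppresses the message about grouped output
  summarise(count = n(), .groups = "drop_last") %>%
  # the result is still grouped by Region, so sum(count) is per region
  mutate(prop = count / sum(count)) %>%
  ungroup()

props
```

In this toy example, region A has proportions 0.75 and 0.25 and region B has 0.4 and 0.6, i.e. each region sums to 1.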
The same general principles apply as outlined above when there are multiple variables. Here is a guide as to which plot to use:
geom_boxplot(): if there is only one numerical variable.
geom_point(), geom_line(): if there are two numerical variables.
geom_bar(): if there are only factors.
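The guide above can be spelled out as a small decision rule. The following base R sketch uses a hypothetical helper, suggest_geom(), which is not part of ggplot2 and was written only for this illustration; the data-frame toy and its column names are likewise invented.

```r
# Hypothetical helper: not part of ggplot2, written only to spell out the guide
suggest_geom <- function(df, vars) {
  # count how many of the chosen variables are numerical
  n_numeric <- sum(vapply(df[vars], is.numeric, logical(1)))
  if (n_numeric == 0) {
    "geom_bar()"
  } else if (n_numeric == 1) {
    "geom_boxplot()"
  } else if (n_numeric == 2) {
    "geom_point() or geom_line()"
  } else {
    NA_character_   # the guide above does not cover this case
  }
}

# Hypothetical data-frame, invented only for this illustration
toy <- data.frame(
  dur  = c(120, 95, 140, 110),
  db   = c(30, 28, 35, 31),
  stop = factor(c("k", "t", "k", "t")),
  spk  = factor(c("S1", "S1", "S2", "S2"))
)

suggest_geom(toy, c("dur", "stop"))   # "geom_boxplot()"
suggest_geom(toy, c("dur", "db"))     # "geom_point() or geom_line()"
suggest_geom(toy, c("stop", "spk"))   # "geom_bar()"
```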
A common way to extend the above to multiple variables is to code the additional variables by colour in the first two cases and by using the facet_wrap() function for all three. Some examples are given below.
geom_boxplot() and one numerical variable
The following shows how F2 varies by vowel and consonantal context using colour-coding:
vdata %>%
ggplot +
aes(y = F2, x = V, col = Cons) +
geom_boxplot()
Alternatively, to fill the boxes with colours:
vdata %>%
ggplot +
aes(y = F2, x = V, fill = Cons) +
geom_boxplot()
This is a plot of the same information, but with the vowel space shown separately by consonantal context:
vdata %>%
ggplot +
aes(y = F2, x = V) +
geom_boxplot() +
facet_wrap(~Cons)
The number of rows and columns can be specified with the nrow = and ncol = arguments to facet_wrap(). For example, to redraw the above but with 3 rows and 1 column:
vdata %>%
ggplot +
aes(y = F2, x = V) +
geom_boxplot() +
facet_wrap(~Cons, nrow = 3, ncol = 1)
The above can be extended to three variables. Here the additional categorical variable, Tense, is coded by colour:
vdata %>%
ggplot +
aes(y = F2, x = V, col = Tense) +
geom_boxplot() +
facet_wrap(~Cons)
The same information as above but this time with combinations of Tense and Cons in their own panels:
vdata %>%
ggplot +
aes(y = F2, x = V) +
geom_boxplot() +
facet_wrap(Tense~Cons)
The following is the same as above but with a colour-coding for Rate (thus a display of 4 variables):
vdata %>%
ggplot +
aes(y = F2, x = V, col = Rate) +
geom_boxplot() +
facet_wrap(Tense~Cons)
This is the same information as above, but with the variables Tense, Cons, Rate in their own panels:
vdata %>%
ggplot +
aes(y = F2, x = V) +
geom_boxplot() +
facet_wrap(Tense~Cons+Rate)
geom_point() and two numerical variables
Displaying additional variables using colours or with facet_wrap() also applies to the other plotting functions considered so far. Here is a plot of the first two formants on the y- and x-axes, colour-coded by vowel:
vdata %>%
ggplot +
aes(y = F1, x = F2, col = V) +
geom_point()
Reversing the axes places the vowels in relation to their position in the cardinal vowel space, i.e. with phonetic height on the vertical axis and phonetic backness on the horizontal axis (thus with I in the top left, U in the top right, and A at the bottom of the display):
vdata %>%
ggplot +
aes(y = F1, x = F2, col = V) +
geom_point() +
scale_x_reverse() +
scale_y_reverse()
As above, but with separate panels for tense and lax vowels (note the greater separation between the vowels in the tense case):
vdata %>%
ggplot +
aes(y = F1, x = F2, col = V) +
geom_point() +
scale_x_reverse() +
scale_y_reverse() +
facet_wrap(~Tense)
In order to supply text (of e.g. the vowels) rather than points, use geom_text():
vdata %>%
ggplot +
aes(y = F1, x = F2, col = V) +
geom_text(aes(y = F1, x = F2, label = V)) +
scale_x_reverse() +
scale_y_reverse() +
facet_wrap(~Tense)
geom_bar() and all variables are categorical
Additional factors are best represented with facet_wrap(), given that colour-coding is often already used when there are two factors. The following plot shows the proportion of s vs. sh choices by region but additionally separated by social class:
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass)
The proportions can be calculated numerically as follows (always put the variable that is colour-coded, in this case Fr, last):
coronal %>%
group_by(Region, Socialclass, Fr) %>%
summarise(count = n()) %>%
mutate(prop = count/sum(count))
## `summarise()` has grouped output by 'Region', 'Socialclass'. You can override using the `.groups` argument.
## # A tibble: 17 × 5
## # Groups: Region, Socialclass [9]
## Region Socialclass Fr count prop
## <fct> <fct> <fct> <int> <dbl>
## 1 R1 LM s 27 0.675
## 2 R1 LM sh 13 0.325
## 3 R1 UM s 26 0.867
## 4 R1 UM sh 4 0.133
## 5 R1 W s 5 0.455
## 6 R1 W sh 6 0.545
## 7 R2 LM s 21 0.808
## 8 R2 LM sh 5 0.192
## 9 R2 UM s 18 1
## 10 R2 W s 20 0.556
## 11 R2 W sh 16 0.444
## 12 R3 LM s 16 0.727
## 13 R3 LM sh 6 0.273
## 14 R3 UM s 29 0.935
## 15 R3 UM sh 2 0.0645
## 16 R3 W s 21 0.808
## 17 R3 W sh 5 0.192
Within the R environment, any plot can be saved as follows:
# Save the plot as `p1`
p1 = vdata %>%
ggplot +
aes(y = F1, x = F2, col = V) +
geom_point() +
scale_x_reverse() +
scale_y_reverse()
No plot is visibly created with the above. To see the resulting plot, enter p1 on its own:
p1
With this method, a plot can be constructed in pieces. For example:
p1 = vdata %>%
ggplot +
aes(y = F1, x = F2, col = V) +
geom_point() +
scale_x_reverse() +
scale_y_reverse()
p1
Modify the plot to separate by Tense:
p1 + facet_wrap(~Tense)
Alternatively:
p2 = p1 + facet_wrap(~Tense)
p2
Saving plots outside the R environment can be done with ggsave(). The following saves the above plot p2 as p2.png with a width and height of 15 cm in the project directory:
ggsave(filename = "p2.png",
plot = p2,
width = 15,
height = 15,
units = "cm")
Several plots can be combined into one figure with the grid.arrange() function from the gridExtra package. The following combines three separate plots.
# Scatter-plot
i1 = int %>%
ggplot +
aes(x = Dauer, y = dB) +
geom_point()
# Line-plot
i2 = int %>%
ggplot +
aes(x = Dauer, y = dB) +
geom_line()
# Both:
i3 = int %>%
ggplot +
aes(x = Dauer, y = dB) +
geom_line() +
geom_point()
# combine all three plots
grid.arrange(i1, i2, i3)
The arguments nrow = and ncol = can be used to define the number of rows and columns in the combination plot. For example, to replot the above as 1 row and 3 columns:
grid.arrange(i1, i2, i3, nrow = 1, ncol = 3)
The following rather inconveniently shows the progression from left to right of LM, UM, W, which are lower-middle class, upper-middle class, and working class.
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass)
A rearrangement of the above so that the progression is in alignment with the social class (i.e., with working class, lower-middle class, and then upper-middle class from left to right) can be most easily done by creating a new factor with the levels arranged in that order, thus:
# make a new class Sclass
# with levels "W", "LM", "UM"
coronal <- coronal %>%
mutate(Sclass = factor(Socialclass,
levels = c("W", "LM", "UM"))
)
Now plot with the variable Sclass:
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Sclass)
Use xlab(), ylab() and ggtitle() to label the x-axis, the y-axis, and the plot as a whole:
asp %>%
ggplot +
aes(y = VOT, x = C) +
geom_boxplot() +
xlab("Place of articulation") +
ylab("Duration (ms)") +
ggtitle("VOT in two stops")
The function labs() can be used to specify all labels together:
coronal %>%
ggplot +
aes(x = Region, fill = Fr) +
geom_bar(position = "fill") +
labs(x = "Region",
y = "Proportion",
title = "Proportion of /s/ and /ʃ/",
subtitle = "By region")
The visible range of numerical variables on the x- and y-axes can be set as follows.
xlim(), ylim() (and/or scale_x_continuous(limits = c()) and/or scale_y_continuous(limits = c())): these remove the data points outside the limits and give a warning. Because the points are removed before anything else is computed, this approach influences the way that superimposed elements such as regression lines appear in the figure.
coord_cartesian(xlim = c(), ylim = c()): sets the same visible range but with no warning and no influence on any superimposed elements in the figure. Further details are given here.
# Without limits:
int %>%
ggplot +
aes(x = dB, y = Dauer) +
geom_point()
# with limits specified by coord_cartesian()
int %>%
ggplot +
aes(x = dB, y = Dauer) +
geom_point() +
coord_cartesian(xlim = c(10,40),
ylim = c(30,280))
# as above but using xlim() and ylim()
int %>%
ggplot +
aes(x = dB, y = Dauer) +
geom_point() +
xlim(10, 40) +
ylim(30, 280)
## Warning: Removed 2 rows containing missing values (geom_point).
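The difference between the two approaches is easiest to see when the plot involves a computed statistic. The following is a minimal sketch that uses the median of a boxplot rather than a regression line; the data-frame toy is invented for the illustration, and layer_data() is used here only to read back the statistics that ggplot2 has computed.

```r
library(ggplot2)

# Hypothetical values, invented only for this illustration
toy <- data.frame(g = "a", y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30))

p <- ggplot(toy) +
  aes(x = g, y = y) +
  geom_boxplot()

# ylim() discards the value 30 before the boxplot statistics are computed
# (this is what triggers the warning about removed rows)
p_scale <- p + ylim(0, 10)

# coord_cartesian() only zooms in: the statistics still use all ten values
p_coord <- p + coord_cartesian(ylim = c(0, 10))

# `middle` is the median that would be drawn inside the box
median_scale <- suppressWarnings(layer_data(p_scale, 1))$middle
median_coord <- layer_data(p_coord, 1)$middle

median_scale   # 5
median_coord   # 5.5
```

Both plots show the same visible range, but only the coord_cartesian() version keeps the median of the full data.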
By default, ggplot2 always uses the same palette of colours. There are many other colour possibilities, whose names are also accessible with:
colours()
The most consistent way of changing colours is with the functions scale_colour_manual() or scale_fill_manual(), with the latter being used to change filled colours.
Either the American or the British spelling of color/colour can be used in most cases, thus colors() instead of colours(), scale_color_manual() instead of scale_colour_manual(), etc.
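Before passing colour names to scale_colour_manual() or scale_fill_manual(), it can be worth checking that R recognises them. The following is a small sketch in base R; the object name candidate_colours is hypothetical and chosen only for this illustration.

```r
# Colour names recognised by R
all_colours <- colours()

# Hypothetical object name; any vector of colour names can be checked this way
candidate_colours <- c("darkgoldenrod1", "lightblue", "red")

# TRUE for every name that R recognises
candidate_colours %in% all_colours

# Both spellings return the same vector of names
identical(colours(), colors())
```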
Thus, different colours for the earlier boxplot in which boxes were made for F2 of each vowel and then colour-coded by consonant category can be accomplished as follows (always make sure that enough colours are available – thus a minimum of three in this case, because there are three levels in the factor Cons):
farben <- c("darkgoldenrod1", "lightblue", "red")
vdata %>%
ggplot +
aes(y = F2, x = V, col = Cons) +
geom_boxplot() +
scale_colour_manual(values = farben)
The same with user-defined filled colours:
# farben has already been defined above
vdata %>%
ggplot +
aes(y = F2, x = V, fill = Cons) +
geom_boxplot() +
scale_fill_manual(values = farben)
Analogously for the earlier barplot:
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass) +
scale_fill_manual(values = farben)
The plotting symbols can be specified using pch. For example for open diamonds:
int %>%
ggplot +
aes(x = dB, y = Dauer) +
geom_point(pch=5)
A list of plotting symbols can be found here.
Lines can be drawn in different linetypes. For example, for a dashed line, specify linetype = 2. The argument lwd can be used to increase the line thickness. For example, for a dotted line, specify linetype = 3 and for double thickness, lwd=2.
int %>%
ggplot +
aes(x = dB, y = Dauer) +
geom_line(linetype=3, lwd=2)
A list of linetypes can be found here.
The default font size is 11 or less. For presentations, this is often too small (the font size should instead be between 16 and 24 points). Changing the font size is done inside the theme() function. Some of the possibilities are as follows:
1. text changes the font size of all the text.
2. title changes the font size of the plot title, axis titles, and legend titles.
3. axis.title changes the font size of the titles on the x- and y-axes. To control the x- and y-axes separately, use axis.title.x and axis.title.y.
4. axis.text controls the font size of the tick labels on the axes. (There is also: axis.text.x and axis.text.y.)
All of 1-4 must be set to element_text(size = n), where n is the desired font size. Some examples:
# make all text size 20
asp %>%
ggplot +
aes(x = C, y = VOT) +
geom_boxplot() +
xlab("Place") +
ylab("Duration (ms)") +
ggtitle("Voice onset time") +
theme(text = element_text(size = 20))
# as above except:
# axis titles size 14,
# text for tick marks, size 24
asp %>%
ggplot +
aes(x = C, y = VOT) +
geom_boxplot() +
xlab("Place") +
ylab("Duration (ms)") +
ggtitle("Voice onset time") +
theme(
text = element_text(size = 20),
axis.title = element_text(size = 14),
axis.text = element_text(size = 24))
# all text has font size 26 except:
# the title (including the Fr title) has size 12
# the axis titles have size 26
# the text of the ticks has size 18
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass) +
ggtitle("Barcharts for [s] and [ʃ]") +
theme(
text = element_text(size = 26),
title = element_text(size = 12),
axis.title = element_text(size = 26),
axis.text = element_text(size = 18))
More information on the variables that can be set with theme() is here.
# remove the legend title
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass) +
theme(legend.title = element_blank())
# remove the legend completely
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass) +
theme(legend.position = "none")
# remove the axis tick labels
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass) +
theme(axis.text = element_blank())
# remove the axis titles
coronal %>%
ggplot +
aes(fill = Fr, x = Region) +
geom_bar(position="fill") +
facet_wrap(~Socialclass) +
theme(axis.title = element_blank())
There are many other ways of modifying plots. For further details:
vignette("ggplot2-specs")