1 The influence of categorical variables (factors) on numerical variables: boxplots
2 The relationship between two numerical variables: scatter and lineplots
3 The relationship between two factors: barplots
4 Extension to multiple variables
5 Saving plots
6 Make a combined plot from separate plots
7 Changing the plot order of categories
8 Further modifications to plots

If not already done, carry out parts 1-5 of the setup, as described here.

Load these libraries:

library(tidyverse)
library(gridExtra)

Load these objects:

urla = "https://www.phonetik.uni-muenchen.de/studium_lehre"
urlb = "lehrmaterialien/R_speech_processing/Rdf"
url = paste(urla, urlb, sep="/")
int = read.table(file.path(url, "intdauer.txt"),
                 stringsAsFactors = TRUE)
vdata = read.table(file.path(url, "vdata.txt"),
                   stringsAsFactors = TRUE)
asp = read.table(file.path(url, "asp.df.txt"),
                 stringsAsFactors = TRUE)
coronal = read.table(file.path(url, "coronal.txt"),
                     stringsAsFactors = TRUE)

Plotting data is an indispensable extension of the summary statistics described in the previous module. The plotting function to be used is ggplot(), which forms part of the ggplot2 library. ‘gg’ stands for grammar of graphics, which refers to the idea that all types of plots can be assembled from a small number of reusable functions.

The type of plot that is needed in speech science depends on the combination(s) of numerical and categorical data that are being plotted.

1 The influence of categorical variables (factors) on numerical variables: boxplots

This is a very common type of analysis in speech science. It concerns questions such as:

Is the second formant frequency of /u/ different in the dialects A, B, C?
- Numerical variable: F2
- Factor: dialect with three levels, A, B and C
Is the tongue dorsum position in syllable-final laterals different between English and German (e.g., German Kiel vs. English Keele)?
- Numerical variable: position of tongue dorsum in /l/
- Factor: language consisting of two levels, English and German.

A boxplot can be used to provide data that is relevant for answering questions of this kind. In the example below, the question is:

To what extent is voice onset time different in German syllable-initial /t, k/?
- Numerical variable: VOT
- Factor: place of articulation with two levels, /t/ and /k/
The data-frame for answering this question is asp.
- The voice onset time data is in the variable VOT
- The factor is C.
There are additional factors:
- Word: which word the stop is taken from.
- Vpn: Versuchsperson i.e. an identifier for the speaker.
- Stress: whether the stop is from a lexically stressed or unstressed syllable.

In most types of ggplots there are three obligatory components:

the data-frame, which is passed to the function ggplot()
the variables to be plotted, which are passed to the function aes()
the type of plot to be made. This is the name of the plot type, typically preceded by geom_. Thus in this case, geom_boxplot().

For the data in question, the three components are shown below. Notice that they are joined by a + sign, thus:

ggplot(asp) +
  aes(x = C, y = VOT) +
  geom_boxplot()

Alternatively, the data-frame can be piped to the ggplot() function. This is especially useful because the data-frame can be filtered or otherwise manipulated prior to plotting.

asp %>%
  ggplot() +
  aes(x = C, y = VOT) +
  geom_boxplot()
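
As a minimal sketch of filtering before plotting, the following keeps only some of the rows and then pipes the result to ggplot(). The data-frame asp_toy and its values are invented for illustration, and the level names "stressed" and "unstressed" are assumptions (the levels of Stress in asp may be named differently).

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for asp: the values are invented
asp_toy <- data.frame(
  C      = factor(c("k", "k", "k", "k", "t", "t", "t", "t")),
  VOT    = c(62, 70, 55, 48, 50, 58, 41, 37),
  Stress = factor(c("stressed", "stressed", "unstressed", "unstressed",
                    "stressed", "stressed", "unstressed", "unstressed"))
)

# Keep only the stops from stressed syllables, then make the boxplot
p_stressed <- asp_toy %>%
  filter(Stress == "stressed") %>%
  ggplot() +
  aes(x = C, y = VOT) +
  geom_boxplot()
p_stressed
```

Because the filtering happens inside the pipe, asp_toy itself is left unchanged.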

A boxplot gives the following information:

1. The horizontal line in the box is the median.
2. The box extends from the first to the third quartile, so its vertical extent is the interquartile range (IQR).
3. The whiskers extend to the highest and lowest values that lie within 1.5 * IQR of the edges of the box.
4. The points are all the remaining values not covered by 1-3.
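
The components of a boxplot can also be worked out by hand, which may help to show what each one means. The sketch below uses base R and a small vector of invented VOT values. It assumes the default quantile() method, which is believed to be what geom_boxplot() uses by default.

```r
# Invented VOT values (ms), for illustration only
vot <- c(20, 25, 30, 35, 40, 45, 50, 120)

# First quartile, median, third quartile
q <- quantile(vot, probs = c(0.25, 0.5, 0.75))
box_median <- q[["50%"]]
iqr <- q[["75%"]] - q[["25%"]]

# Values beyond these limits are drawn as separate points
lower_limit <- q[["25%"]] - 1.5 * iqr
upper_limit <- q[["75%"]] + 1.5 * iqr

# The whiskers end at the most extreme values inside the limits
inside <- vot[vot >= lower_limit & vot <= upper_limit]
whiskers <- range(inside)

# Everything else is plotted as an individual point
outliers <- vot[vot < lower_limit | vot > upper_limit]

box_median  # 37.5
iqr         # 17.5
whiskers    # 20 50
outliers    # 120
```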

2 The relationship between two numerical variables: scatter and lineplots

In this case, the question concerns the extent to which one numerical variable can be predicted from another. Examples are:

Does fundamental frequency increase with rising sub-glottal pressure?
- Numerical variable 1: fundamental frequency.
- Numerical variable 2: sub-glottal pressure.
Do longer open vowels have a lower jaw position?
- Numerical variable 1: duration.
- Numerical variable 2: jaw height.
The question to be resolved here is: Is there a relationship between vowel duration and peak intensity?
- Numerical variable 1: duration.
- Numerical variable 2: intensity.
The data-frame for answering this question is int.
- Duration is the numerical variable Dauer.
- Intensity is the numerical variable dB.
- Vpn is an additional factor for the speaker.

The relationship between the variables can be shown with the functions geom_point() for a so-called scatter-plot or with geom_line() for a line-plot or with both.

Here are the plots:

# Scatter-plot
int %>%
  ggplot + 
  aes(x = Dauer, y = dB) +
  geom_point()

# Line-plot
int %>%
  ggplot +  
  aes(x = Dauer, y = dB) + 
  geom_line()

# Both:
int %>%
  ggplot + 
  aes(x = Dauer, y = dB) + 
  geom_line() + 
  geom_point()

Vertical and horizontal reference lines can be added to plots with geom_vline() and geom_hline().

int %>% 
  ggplot +
  aes(x = Dauer, y = dB) +
  geom_point() + 
  # add vertical line at 150 ms
  geom_vline(xintercept = 150) + 
  # add horizontal line at 35 dB
  geom_hline(yintercept = 35)

3 The relationship between two factors: barplots

There are many questions in speech science, and especially in social studies of speech, that concern how factors (categorical variables) are related. Examples are:

Are Munich speakers more likely to use an alveolar as opposed to uvular /r/ than speakers from Cologne or Hamburg?
- Factor 1: Place of articulation with two levels: alveolar vs. uvular.
- Factor 2: Dialect with three categories: Munich vs. Cologne vs. Hamburg.
Are speakers from Augsburg more likely to have /ʃ/ in words like passt or Augsburg than Munich speakers?
- Factor 1: City with two levels: Augsburg, Munich.
- Factor 2: Place of articulation with two levels: /ʃ/ or /s/.
Are smokers more likely to produce vowels with creaky voice than non-smokers?
- Factor 1: Smoker with categories yes or no.
- Factor 2: voice_quality with categories creaky vs. non-creaky.

The relationship between the variables can be shown with the function geom_bar().

The question here is: Is the choice of fricative – s or sh (= IPA [ʃ]) – influenced by region?

The data-frame is coronal.
- The two fricatives are in the factor Fr
- The regions are in the factor Region
- Vpn is an additional factor for the speaker
- Socialclass is another factor showing the speaker’s social class

The function geom_bar() counts the number of occurrences of each category (in this case of s and sh) separately. Because the count is shown on the y-axis, the two fricatives are distinguished by colour using the fill argument, thus:

coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar()

The bars can be placed side by side instead of stacked on top of each other:

coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position = "dodge")

Notice that the barplot is a graphical representation of one of the summary statistics for categorical variables, i.e.

coronal %>%
  group_by(Region, Fr) %>%
  summarise(count = n()) %>%
  ungroup()

## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.

## # A tibble: 6 × 3
##   Region Fr    count
##   <fct>  <fct> <int>
## 1 R1     s        58
## 2 R1     sh       23
## 3 R2     s        59
## 4 R2     sh       21
## 5 R3     s        66
## 6 R3     sh       13

so that row 1 of the above is represented by the red far left bar in the above plot. Similarly, the number of times that each level of a single factor occurs is given by:

coronal %>%
  group_by(Fr) %>%
  summarise(count = n())

## # A tibble: 2 × 2
##   Fr    count
##   <fct> <int>
## 1 s       183
## 2 sh       57

which corresponds to the following barplot:

coronal %>%
  ggplot +
  aes(x = Fr) +
  geom_bar()
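
The correspondence between the bars and the counts can be checked directly. The following is a minimal sketch with an invented data-frame coronal_toy (not the real coronal data). It uses layer_data() from ggplot2 to extract the values that geom_bar() computes, and the dplyr function count() as a shorthand for group_by() followed by summarise(count = n()), except that the count column is named n.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for coronal: the values are invented
coronal_toy <- data.frame(
  Fr     = factor(c("s", "s", "sh", "s", "sh", "s")),
  Region = factor(c("R1", "R1", "R1", "R2", "R2", "R2"))
)

p_fr <- coronal_toy %>%
  ggplot() +
  aes(x = Fr) +
  geom_bar()

# Bar heights as computed by geom_bar()
bar_counts <- layer_data(p_fr)$count

# The same counts from dplyr
tab_counts <- coronal_toy %>% count(Fr)

bar_counts    # 4 2
tab_counts$n  # 4 2
```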

Any of the above can be converted to proportions with the additional argument position = "fill" within geom_bar():

coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position = "fill")

The above graph shows the proportion of s and sh within each region. Numerically, the proportions are given by:

coronal %>%
  group_by(Region, Fr) %>%
  summarise(count = n()) %>%
  mutate(prop = count/sum(count))

## `summarise()` has grouped output by 'Region'. You can override using the `.groups` argument.

## # A tibble: 6 × 4
## # Groups:   Region [3]
##   Region Fr    count  prop
##   <fct>  <fct> <int> <dbl>
## 1 R1     s        58 0.716
## 2 R1     sh       23 0.284
## 3 R2     s        59 0.738
## 4 R2     sh       21 0.262
## 5 R3     s        66 0.835
## 6 R3     sh       13 0.165
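
The calculation of proportions relies on the output of summarise() still being grouped by Region, so that sum(count) is taken within each region. A minimal sketch that states this explicitly with the .groups argument (which also suppresses the message shown above) is given below. The data-frame coronal_toy is invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for coronal: the values are invented
coronal_toy <- data.frame(
  Region = c("R1", "R1", "R1", "R1", "R2", "R2", "R2", "R2"),
  Fr     = c("s", "s", "s", "sh", "s", "sh", "sh", "sh")
)

props <- coronal_toy %>%
  group_by(Region, Fr) %>%
  # drop only the last grouping variable (Fr); keep grouping by Region
  summarise(count = n(), .groups = "drop_last") %>%
  # sum(count) is therefore calculated per region
  mutate(prop = count / sum(count)) %>%
  ungroup()

props$prop  # 0.75 0.25 0.25 0.75
```

The proportions sum to 1 within each region, which is what position = "fill" displays.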

4 Extension to multiple variables

The same general principles apply as outlined above when there are multiple variables. Here is a guide as to which plot to use:

geom_boxplot(): if there is only one numerical variable.
geom_point(), geom_line(): if there are two numerical variables.
geom_bar(): if there are only factors.

A common way to extend the above to multiple variables is to code the additional variables by colour in the first two cases and by using the facet_wrap() function for all three. Some examples are given below.

4.1 geom_boxplot() and one numerical variable

The following shows how F2 varies by vowel and consonantal context using colour-coding:

vdata %>%
  ggplot +
  aes(y = F2, x = V, col = Cons) +
  geom_boxplot()

Alternatively, to fill the boxes with colours:

vdata %>%
  ggplot +
  aes(y = F2, x = V, fill = Cons) +
  geom_boxplot()

This is a plot of the same information, but with a separate panel for each consonantal context:

vdata %>%
  ggplot +
  aes(y = F2, x = V) +
  geom_boxplot() + 
  facet_wrap(~Cons)

The number of rows and columns can be specified with nrow = and ncol = arguments to facet_wrap(). For example, to redraw the above but with 3 rows and 1 column:

vdata %>%
  ggplot +
  aes(y = F2, x = V) +
  geom_boxplot() + 
  facet_wrap(~Cons, nrow = 3, ncol = 1)

The above can be extended to three factors. Here the additional categorical variable, Tense, is coded by colour:

vdata %>%
  ggplot +
  aes(y = F2, x = V, col = Tense) +
  geom_boxplot() + 
  facet_wrap(~Cons)

The same information as above but this time with combinations of Tense and Cons in their own panels:

vdata %>%
  ggplot +
  aes(y = F2, x = V) +
  geom_boxplot() + 
  facet_wrap(Tense~Cons)

The following is the same as above but with colour-coding for Rate (thus a display of four factors: V, Cons, Tense, and Rate):

vdata %>%
  ggplot +
  aes(y = F2, x = V, col = Rate) +
  geom_boxplot() + 
  facet_wrap(Tense~Cons)

This is the same information as above, but with each combination of Tense, Cons, and Rate in its own panel:

vdata %>%
  ggplot +
  aes(y = F2, x = V) +
  geom_boxplot() + 
  facet_wrap(Tense~Cons+Rate)

4.2 geom_point() and two numerical variables

Displaying additional variables using colours or with facet_wrap() also applies to the other plotting functions considered so far. Here is a plot of the first two formants on the y- and x-axes but colour-coded separately by vowel:

vdata %>%
  ggplot +
  aes(y = F1, x = F2, col = V) +
  geom_point()

Reversing the axes causes the vowels to be arranged as in the cardinal vowel space, i.e. with phonetic height on the vertical axis and phonetic backness on the horizontal axis (thus with I in the top left, U in the top right, and A at the bottom of the display):

vdata %>%
  ggplot +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()

As above, but with separate panels for tense and lax vowels (note the greater separation between the vowels in the tense case):

vdata %>%
  ggplot +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse() +
  facet_wrap(~Tense)

In order to supply text (of e.g. the vowels) rather than points, use geom_text():

vdata %>%
  ggplot +
  aes(y = F1, x = F2, col = V) +
  geom_text(aes(y = F1, x = F2, label = V)) +
  scale_x_reverse() +
  scale_y_reverse() +
  facet_wrap(~Tense)

4.3 geom_bar() and all variables are categorical

Additional factors are best represented with facet_wrap(), given that colour-coding is often already used when there are two factors. The following plot shows the proportion of s vs. sh choices by region but additionally separated by social class:

coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass)

The proportions can be calculated numerically as follows (always put the variable that is colour-coded, in this case Fr, last):

coronal %>%
  group_by(Region, Socialclass, Fr) %>%
  summarise(count = n()) %>%
  mutate(prop = count/sum(count))

## `summarise()` has grouped output by 'Region', 'Socialclass'. You can override using the `.groups` argument.

## # A tibble: 17 × 5
## # Groups:   Region, Socialclass [9]
##    Region Socialclass Fr    count   prop
##    <fct>  <fct>       <fct> <int>  <dbl>
##  1 R1     LM          s        27 0.675 
##  2 R1     LM          sh       13 0.325 
##  3 R1     UM          s        26 0.867 
##  4 R1     UM          sh        4 0.133 
##  5 R1     W           s         5 0.455 
##  6 R1     W           sh        6 0.545 
##  7 R2     LM          s        21 0.808 
##  8 R2     LM          sh        5 0.192 
##  9 R2     UM          s        18 1     
## 10 R2     W           s        20 0.556 
## 11 R2     W           sh       16 0.444 
## 12 R3     LM          s        16 0.727 
## 13 R3     LM          sh        6 0.273 
## 14 R3     UM          s        29 0.935 
## 15 R3     UM          sh        2 0.0645
## 16 R3     W           s        21 0.808 
## 17 R3     W           sh        5 0.192

5 Saving plots

Within the R environment, any plot can be saved as follows:

# Save the plot as `p1`
p1 = vdata %>%
  ggplot +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()

No plot is visibly created with the above. To see the resulting plot, enter p1 on its own:

p1

With this method, a plot can be constructed in pieces. For example:

p1 = vdata %>%
  ggplot +
  aes(y = F1, x = F2, col = V) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
p1

Modify the plot to separate by Tense:

p1 + facet_wrap(~Tense)

Alternatively:

p2 = p1 + facet_wrap(~Tense)
p2

Saving plots outside the R environment can be done with ggsave(). The following saves the above plot p2 as p2.png with a width and height of 15 cm in the project directory:

ggsave(filename = "p2.png", 
       plot = p2, 
       width = 15, 
       height = 15, 
       units = "cm")
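
The file format is inferred by ggsave() from the extension of the file name, so the same call can produce, for example, a PDF file. The following minimal sketch uses an invented data-frame and writes to R's temporary directory (an assumption made here so that nothing is written to the project directory).

```r
library(ggplot2)

# Invented formant values, for illustration only
toy <- data.frame(
  F2 = c(2200, 900, 1400),
  F1 = c(300, 320, 750),
  V  = c("I", "U", "A")
)

p_toy <- ggplot(toy) +
  aes(x = F2, y = F1, col = V) +
  geom_point()

# The extension .pdf selects the PDF format
out_file <- file.path(tempdir(), "p_toy.pdf")
ggsave(filename = out_file,
       plot = p_toy,
       width = 15,
       height = 15,
       units = "cm")

# Check that the file has been written
file.exists(out_file)
```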

6 Make a combined plot from separate plots

This can be done with the grid.arrange() function from the gridExtra package. The following combines three separate plots.

# Scatter-plot
i1 = int %>%
  ggplot + 
  aes(x = Dauer, y = dB) +
  geom_point() 

# Line-plot
i2 = int %>%
  ggplot +  
  aes(x = Dauer, y = dB) + 
  geom_line()

# Both:
i3 = int %>%
  ggplot + 
  aes(x = Dauer, y = dB) + 
  geom_line() + 
  geom_point()

# combine all three plots
grid.arrange(i1, i2, i3)

The arguments nrow = and ncol = can be used to define the number of rows and columns in the combined plot. For example, to replot the above as one row and three columns:

grid.arrange(i1, i2, i3, nrow = 1, ncol = 3)

7 Changing the plot order of categories

The following plot rather inconveniently orders the panels from left to right as LM, UM, W (lower-middle class, upper-middle class, and working class), because factor levels are sorted alphabetically by default.

coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass)

A rearrangement of the above so that the progression is in alignment with the social class (i.e., with working class, lower-middle class, and then upper-middle class from left to right) can be most easily done by creating a new factor with the levels arranged in that order, thus:

# make a new factor Sclass
# with levels "W", "LM", "UM"
coronal <- coronal %>%
  mutate(Sclass = factor(Socialclass, 
                         levels = c("W", "LM", "UM"))
         )

Now plot with the variable Sclass:

coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Sclass)
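
The effect of the levels argument can be seen without any plotting. The following base R sketch uses an invented vector of social class labels. It also shows one thing to watch out for: any value that does not match one of the supplied levels exactly is turned into NA.

```r
# Invented social class labels, for illustration only
sc <- c("LM", "UM", "W", "W", "LM")

# Default: levels in alphabetical order
levels(factor(sc))  # "LM" "UM" "W"

# User-defined order of levels; the values themselves are unchanged
sclass <- factor(sc, levels = c("W", "LM", "UM"))
levels(sclass)      # "W" "LM" "UM"

# A misspelt level ("Um" instead of "UM") silently produces NA
bad <- factor(sc, levels = c("W", "LM", "Um"))
sum(is.na(bad))     # 1
```

It is therefore worth checking for missing values after creating a factor in this way.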

8 Further modifications to plots

8.1 Axis labels

Use xlab(), ylab(), and ggtitle() to label the x-axis, the y-axis, and the plot as a whole:

asp %>%
  ggplot+ 
  aes(y = VOT, x = C) + 
  geom_boxplot() + 
  xlab("Place of articulation") + 
  ylab("Duration (ms)") + 
  ggtitle("VOT in two stops")

The function labs() can be used to specify all labels together:

coronal %>%
  ggplot + 
  aes(x = Region, fill = Fr) + 
  geom_bar(position = "fill") + 
  labs(x = "Region", 
       y = "Proportion", 
       title = "Proportion of /s/ and /ʃ/",
       subtitle = "By region")

8.2 Setting the range

The visible range of numerical variables on the x- and y-axes can be set as follows.

with xlim(), ylim() (and/or scale_x_continuous(limits = c()) and/or scale_y_continuous(limits = c())). This removes data points that lie outside the limits and gives a warning. Using this approach influences the way that superimposed elements such as regression lines appear in the figure, because they are calculated from the remaining data points only.
with coord_cartesian(xlim = c(), ylim = c()): as above, but the data points are only hidden from view and not removed, so there is no warning and no influence on any superimposed elements in the figure. Further details are given here.

# Without limits:
int %>%
  ggplot + 
  aes(x = dB, y = Dauer) + 
  geom_point()

# with limits specified by coord_cartesian()
int %>%
  ggplot +
  aes(x = dB, y = Dauer) + 
  geom_point() + 
  coord_cartesian(xlim = c(10,40), 
                  ylim = c(30,280))

# as above but using xlim() and ylim()
int %>%
  ggplot + 
  aes(x = dB, y = Dauer) + 
  geom_point() + 
  xlim(10, 40) + 
  ylim(30, 280)

## Warning: Removed 2 rows containing missing values (geom_point).
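
The difference between the two approaches matters as soon as the plot involves a calculation. The following is a minimal sketch with invented values in which the median of a boxplot is extracted with layer_data() from ggplot2: with ylim() the extreme value is removed before the median is calculated, whereas with coord_cartesian() it is only hidden from view.

```r
library(ggplot2)

# Invented data with one extreme value
toy <- data.frame(g = "a", y = c(1, 2, 3, 4, 100))

p_base <- ggplot(toy) +
  aes(x = g, y = y) +
  geom_boxplot()

# ylim() removes the value 100 before the statistics are computed
# (the resulting warning is suppressed here)
med_ylim <- suppressWarnings(layer_data(p_base + ylim(0, 10))$middle)

# coord_cartesian() zooms in: the statistics use all five values
med_zoom <- layer_data(p_base + coord_cartesian(ylim = c(0, 10)))$middle

med_ylim  # 2.5
med_zoom  # 3
```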

8.3 Colours

By default, ggplot2 always makes use of the same set of colours. Many other colours are available, and their names can be listed with:

colours()

The most consistent way of changing colours is with the functions scale_colour_manual() or scale_fill_manual(), with the latter being used to change filled colours.

Either the American or the British spelling (color/colour) can be used in most cases: thus colors() instead of colours(), scale_color_manual() instead of scale_colour_manual(), etc.

Thus, different colours for the earlier boxplot in which boxes were made for F2 of each vowel and then colour-coded by consonant category can be accomplished as follows (always make sure that enough colours are available – thus a minimum of three in this case, because there are three levels in the factor Cons):

farben <- c("darkgoldenrod1", "lightblue", "red")
vdata %>%
  ggplot +
  aes(y = F2, x = V, col = Cons) +
  geom_boxplot() +
  scale_colour_manual(values = farben)

The same with user-defined filled colours:

# farben has already been defined above
vdata %>%
  ggplot +
  aes(y = F2, x = V, fill = Cons) +
  geom_boxplot() +
  scale_fill_manual(values = farben)

Analogously for the earlier barplot:

coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  scale_fill_manual(values = farben)
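
If the vector of colours is named, then scale_fill_manual() and scale_colour_manual() match the colours to the factor levels by name rather than by position. The following is a minimal sketch with an invented data-frame coronal_toy. The vector farben_fr is a hypothetical name introduced here, and its elements are deliberately given in a different order from the levels of Fr.

```r
library(ggplot2)

# Hypothetical stand-in for coronal: the values are invented
coronal_toy <- data.frame(
  Fr     = factor(c("s", "s", "sh", "s", "sh", "s")),
  Region = factor(c("R1", "R1", "R1", "R2", "R2", "R2"))
)

# Named colours: the names must be the levels of Fr
farben_fr <- c(sh = "red", s = "lightblue")

p_named <- ggplot(coronal_toy) +
  aes(fill = Fr, x = Region) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = farben_fr)
p_named

# The colours that have been assigned to the bars
bars <- layer_data(p_named)
unique(bars$fill)
```

With named colours, a given level keeps its colour even if the order of the levels is later changed.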

8.4 Plotting symbols and linetype

The plotting symbol can be specified using pch (or, equivalently, shape). For example, for open diamonds:

int %>%
  ggplot + 
  aes(x = dB, y = Dauer) + 
  geom_point(pch=5)

A list of plotting symbols can be found here.

Lines can be drawn in different linetypes. For example, linetype = 2 gives a dashed line and linetype = 3 a dotted line. The argument lwd sets the line thickness, with larger values giving thicker lines. The following draws a thick dotted line with linetype = 3 and lwd = 2:

int %>%
  ggplot + 
  aes(x = dB, y = Dauer) + 
  geom_line(linetype=3, lwd=2)

A list of linetypes can be found here.

8.5 Font size

The default font size is 11 or less. For presentations, this is often too small (the font size should instead be between 16 and 24 points). Changing the font size is done inside the theme() function. Some of the possibilities are as follows:

1. text changes the font size of all the text.
2. title changes the font size of the plot title, axis titles, and legend titles.
3. axis.title changes the font size of the titles of the x- and y-axes. To control the x- and y-axes separately, use axis.title.x and axis.title.y.
4. axis.text changes the font size of the tick labels along the axes. (There is also: axis.text.x and axis.text.y).

Each of 1-4 must be set to element_text(size = n) where n is the desired font size. Some examples:

# make all text size 20
asp %>%
  ggplot + 
  aes(x = C, y = VOT) + 
  geom_boxplot() + 
  xlab("Place") + 
  ylab("Duration (ms)") + 
  ggtitle("Voice onset time") + 
  theme(text = element_text(size = 20))

# as above except:
# axis titles size 14, 
# text for tick marks, size 24
asp %>%
  ggplot + 
  aes(x = C, y = VOT) + 
  geom_boxplot() + 
  xlab("Place") + 
  ylab("Duration (ms)") + 
  ggtitle("Voice onset time") + 
  theme(
    text = element_text(size = 20), 
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 24))

# all text has font size 26 except:
# the title (including the Fr title) has size 12
# the axis titles have size 26
# the text of the ticks has size 18
coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  ggtitle("Barcharts for [s] and [ʃ]") + 
  theme(
    text = element_text(size = 26), 
    title = element_text(size = 12), 
    axis.title = element_text(size = 26),
    axis.text = element_text(size = 18))

More information on the variables that can be set with theme() is here.

8.6 Removing axes, the legend, and their associated text

# remove the legend title
coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(legend.title = element_blank())

# remove the legend completely
coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(legend.position = "none")

# remove the axis tick labels
coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(axis.text = element_blank())

# remove the axis titles
coronal %>%
  ggplot +
  aes(fill = Fr, x = Region) +
  geom_bar(position="fill") +
  facet_wrap(~Socialclass) +
  theme(axis.title = element_blank())

8.7 Other specifications

There are many other ways of modifying plots. For further details:

vignette("ggplot2-specs")

An introduction to plotting data in R

Jonathan Harrington, Johanna Cronenberg
