1 Analysis of variance: general considerations.
2 Relationship to t-test.
- 2.1 ANOVA and a paired t-test.
- 2.2 ANOVA and an unpaired t-test.
3 Running an ANOVA with more than one factor.

If not already done, carry out parts 1-5 of the setup, as described here.

Load these libraries and data-frames:

library(tidyverse)
library(broom)
library(afex)
library(emmeans)
urla = "https://www.phonetik.uni-muenchen.de/"
urlb = "studium_lehre/lehrmaterialien/R_speech_processing/Rdf/"
url = paste0(urla, urlb)
e.df = read.table(file.path(url, "e.txt"), 
                  stringsAsFactors = T)
form.df = read.table(file.path(url, "form.df.txt"),
                     stringsAsFactors = T)
dg.df = read.table(file.path(url, "dg.txt"), 
                   stringsAsFactors = T)
ssb.df = read.table(file.path(url, "ssb.txt"), 
                   stringsAsFactors = T)

1 Analysis of variance: general considerations.

An analysis of variance (ANOVA) is an extension of a t-test to cases involving several levels e.g.:

‘Do /p, t, k/ differ in voice onset time?’. Factor: place of articulation, three levels /p, t, k/.

and/or to multiple factors e.g.

‘Do /p, b/ in stressed and unstressed syllables differ in voice onset time?’ which is the same as: ‘Is voice onset time affected by stop voicing and stress’. Factor 1: stop voicing, levels /p, b/; factor 2: stress, levels ‘stressed’, ‘unstressed’.

An ANOVA is used to calculate the relationship between two type of variation:

the systematic variation between levels.
the random variation within levels.

This is schematically outlined in the example figure below of first formant (F1) frequencies of three vowels (i.e. three levels) /ɪ,ɛ,a/

F1 varies systematically between the three vowels. This systematic variation comes about for acoustic-phonetic reasons by which F1 increases with phonetic vowel height through /ɪ,ɛ,a/ respectively.
F1 varies within each level (in this case vowel): that is, there is random F1-variation in /ɪ/, random F1-variation in /ɛ/, and random F1-variation in /a/.

In an ANOVA, the F-ratio, which is the ratio of these two quantities, is tested for significance.

\[ Fratio = \frac{systematic-variation-BETWEEN-the-levels}{random-variation-WITHIN-the-levels} \]

If there is a high degree of systematic separation between the vowels, then the numerator in the above equation has a high value (causing an increase in the F-ratio).
If the random variation within /ɪ/ and/or within /ɛ/, and/or within /a/ is high, then the the denominator is also high (causing a decrease in the F-ratio).

Thus, the greater the systematic differences between the vowels, and the less the random variation within vowels, the greater the F-ratio.

ANOVA then tests whether the F-ratio is significantly greater than chance (significantly greater than 1) using an F-test.

2 Relationship to t-test.

As when running a t-test, it is essential to understand how within and between subjects factors are different

2.1 ANOVA and a paired t-test.

Recall from here that a paired t-test was used when for a within-subject factor. When testing whether F2 was influenced by the within-subject factorStress, the command was:

form.df %>%
  pivot_wider(names_from = Stress, 
              values_from = F2) %>%
  mutate(F2diff = stressed - unstressed) %>%
  pull(F2diff) %>%
  t.test()

## 
##  One Sample t-test
## 
## data:  .
## t = 4.3543, df = 11, p-value = 0.001147
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  134.0163 407.9837
## sample estimates:
## mean of x 
##       271

which was reported as:

There was a significant influence of stress on F2 (t[11] = 4.4, p < 0.01).

The corresponding command with an ANOVA is:

form.aov = 
  aov_4(F2 ~ Stress  + (Stress|Subject), data = form.df)
form.aov

## Anova Table (Type 3 tests)
## 
## Response: F2
##   Effect    df      MSE        F  ges p.value
## 1 Stress 1, 11 23241.00 18.96 ** .411    .001
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

# or with dot notation
form.df %>%
  aov_4(F2 ~ Stress  + (Stress|Subject), .)

## Anova Table (Type 3 tests)
## 
## Response: F2
##   Effect    df      MSE        F  ges p.value
## 1 Stress 1, 11 23241.00 18.96 ** .411    .001
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

(Stress|Subject) is the syntax for indicating that Stress is a within-subjects factors in the aov_4() function.

The most important parts of the output are:

two types of degrees of freedom.
an F-value.
a significance level and probability value.

This should be reported as:

There was a significant influence of stress on F2 (F[1, 11] = 18.96, p < 0.01).

The choice of p < 0.01 is because of the ** which means that the probability under p.value is between 0.001 and 0.01. The p.value in this output is rounded to three decimal places. To get the actual value, pass the output of aov_4() to anova() i.e. in this case:

anova(form.aov)

## Anova Table (Type 3 tests)
## 
## Response: F2
##        num Df den Df   MSE     F     ges   Pr(>F)   
## Stress      1     11 23241 18.96 0.41137 0.001147 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-value obtained from the ANOVA is the t-statistic squared from the paired t-test. Thus:

# square of the t-statistic
(-4.3543)^2

## [1] 18.95993

# (same as the F-value. other than rounding differences).

Note also that (rounding differences apart) the probability values from the t-statistic (p-value = 0.001147) and from Pr(>F) in the ANOVA are the same.

The output of the above ANOVA also gives a column ges which is generalised-eta-squared. It can be used to give an indication of the size of the effect found. More details on this statistic can be found here. The paper suggests that a generalised-eta-squared value of over 0.26 is large, and less than 0.02 is small. A small effect might be one in which e.g. a study showed a significant change in F2 of only 5 Hz. This effect size is very small in relation to the typically value of F2 (between 220 and 3000 Hz) and is probably meaningless even though significant (because F2 changes less than 10 Hz are probably not perceptible). In this hypothetical case, the ges values is likely to be small.

2.2 ANOVA and an unpaired t-test.

Recall also from here that a unpaired t-test was used for a between-subject factor. When testing whether F2 was influenced by the between-subject factor Sprache, the command was:

e.df %>%
t.test(F2 ~ Sprache, .) %>%
  tidy()

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     167.     2032.     1865.      2.26  0.0344      21.1     13.5      321.
## # ℹ 2 more variables: method <chr>, alternative <chr>

and reported as:

F2 was significantly influenced by language (t[21.1] = 2.3, p < 0.05).

The corresponding command with an ANOVA is:

e.df %>%
aov_4(F2 ~ Sprache + (1|Vpn), data = .)

## Contrasts set to contr.sum for the following variables: Sprache

## Anova Table (Type 3 tests)
## 
## Response: F2
##    Effect    df      MSE      F  ges p.value
## 1 Sprache 1, 25 34738.38 5.36 * .177    .029
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

and can be reported as:

F2 was significantly influenced by language (F[1,25] = 5.36, p < 0.05).

Information is reported Contrasts set to contr.sum for the following variables: Sprache. This is a type of coding of the levels that is commonly used in ANOVAs often known as deviation coding: for this example, it compares the mean F2 for each level of Sprache ito the overall mean of the F2 in entire data-frame. More details are given here.

Different results are produced for the unpaired t-test and ANOVA. This is because in an unpaired t-test, the variances of the levels are not assumed to be equal (resulting in fractional degrees of freedom) whereas in an ANOVA they are (hence integer degrees of freedom). The same result is obtained if the t-test is run under the assumption of equal variance which as explained in the final note here is given by:

e.df %>%
t.test(F2 ~ Sprache, ., var.equal=T)

## 
##  Two Sample t-test
## 
## data:  F2 by Sprache
## t = 2.3149, df = 25, p-value = 0.02912
## alternative hypothesis: true difference in means between group D and group E is not equal to 0
## 95 percent confidence interval:
##   18.4301 315.7681
## sample estimates:
## mean in group D mean in group E 
##        2031.672        1864.573

The t-statistic squared, degrees of freedom, and probability value from this unpaired t-test under the assumption of equal variance are now the same as those obtained with the above ANOVA.

3 Running an ANOVA with more than one factor.

The question to be resolved for the data-frame dg is whether F2 is influenced by region (Region) and speaker sex (Gen). Here then there are two (between-subject) factors: Region has three levels and Gen has two levels.

There are therefore two reasons why this problem has to be accomplished with an ANOVA (and why a t-test is insufficient):

there is more than one factor.
one of the factors has more than two levels.

summary(dg.df)

##        F2       Region Gen         Vpn    
##  Min.   :1082   A:20   m:30   S1     : 1  
##  1st Qu.:1518   B:20   w:30   S10    : 1  
##  Median :1780   C:20          S11    : 1  
##  Mean   :1828                 S12    : 1  
##  3rd Qu.:2179                 S13    : 1  
##  Max.   :2682                 S14    : 1  
##                               (Other):54

The steps in carrying out the ANOVA are:

identify as before whether these are between- or within-factors.
make a plot of the data.
assess whether there are interactions (which is dealt with separately below).
run aov_4()
if there is an interaction, run so-called post-hoc tests.

1-4 are considered below; 5 is further explored at the end of this document.

3.1 Between and within factors.

Both are likely to be between factors, because they deal with attributes of the speaker. For the two-factor case, this can be most easily seen as follows:

dg.df %>%
mutate(Factors = interaction(Region, Gen)) %>%
  select(Vpn, Factors) %>%
  table()

##      Factors
## Vpn   A.m B.m C.m A.w B.w C.w
##   S1    1   0   0   0   0   0
##   S10   1   0   0   0   0   0
##   S11   0   1   0   0   0   0
##   S12   0   1   0   0   0   0
##   S13   0   1   0   0   0   0
##   S14   0   1   0   0   0   0
##   S15   0   1   0   0   0   0
##   S16   0   1   0   0   0   0
##   S17   0   1   0   0   0   0
##   S18   0   1   0   0   0   0
##   S19   0   1   0   0   0   0
##   S2    1   0   0   0   0   0
##   S20   0   1   0   0   0   0
##   S21   0   0   1   0   0   0
##   S22   0   0   1   0   0   0
##   S23   0   0   1   0   0   0
##   S24   0   0   1   0   0   0
##   S25   0   0   1   0   0   0
##   S26   0   0   1   0   0   0
##   S27   0   0   1   0   0   0
##   S28   0   0   1   0   0   0
##   S29   0   0   1   0   0   0
##   S3    1   0   0   0   0   0
##   S30   0   0   1   0   0   0
##   S31   0   0   0   1   0   0
##   S32   0   0   0   1   0   0
##   S33   0   0   0   1   0   0
##   S34   0   0   0   1   0   0
##   S35   0   0   0   1   0   0
##   S36   0   0   0   1   0   0
##   S37   0   0   0   1   0   0
##   S38   0   0   0   1   0   0
##   S39   0   0   0   1   0   0
##   S4    1   0   0   0   0   0
##   S40   0   0   0   1   0   0
##   S41   0   0   0   0   1   0
##   S42   0   0   0   0   1   0
##   S43   0   0   0   0   1   0
##   S44   0   0   0   0   1   0
##   S45   0   0   0   0   1   0
##   S46   0   0   0   0   1   0
##   S47   0   0   0   0   1   0
##   S48   0   0   0   0   1   0
##   S49   0   0   0   0   1   0
##   S5    1   0   0   0   0   0
##   S50   0   0   0   0   1   0
##   S51   0   0   0   0   0   1
##   S52   0   0   0   0   0   1
##   S53   0   0   0   0   0   1
##   S54   0   0   0   0   0   1
##   S55   0   0   0   0   0   1
##   S56   0   0   0   0   0   1
##   S57   0   0   0   0   0   1
##   S58   0   0   0   0   0   1
##   S59   0   0   0   0   0   1
##   S6    1   0   0   0   0   0
##   S60   0   0   0   0   0   1
##   S7    1   0   0   0   0   0
##   S8    1   0   0   0   0   0
##   S9    1   0   0   0   0   0

Because for all speakers there is a single 1 and remaining zeros, these are between-subject factors (i.e. a speaker is e.g. from region A and is male and therefore not from the other two regions nor female).

3.2 A plot of the data.

dg.df %>%
  ggplot+
  aes(y = F2, x = Region, col = Gen) +
  geom_boxplot()

In an analysis with two factors, three questions are considered:

is F2 influenced by Gen?
is F2 influenced by Region?
is there an interaction between Gen and Region?

For the first two questions, the plot suggests:

F2 is higher for females than for males
F2 is about the same in A and B and both of these have a higher F2 than C.

3.3 Checking for an interaction.

The third question - which is to do with the interaction - means either:

is the difference between regions the same for males and females?

and:

is the difference between males and females the same in all three regions?

The plot suggests that the answer to this third question is ‘no’. This is because the male-female difference is much less in region C than in either of the other two regions. Therefore, the statistics are likely to show a significant interaction. No interaction between the factors would occur if the male-female differences in regions A, B, and C were about the same (which is not the case).

3.4 Running `aov_4()` with two factors.

The syntax is much as before except that two factors A and B are now specified as A * B (or for an equivalent result as A + B). The specification for the subject factor follows from the earlier examples:

If (as in this example) both are between factors: (1|Vpn).
If one factor is within: (A|Vpn).
If both factors are within: (A + B|Vpn) or (A * B|Vpn)

For the present example, therefore:

dg.df %>% 
  aov_4(F2 ~ Region * Gen + (1|Vpn), .)

## Contrasts set to contr.sum for the following variables: Region, Gen

## Anova Table (Type 3 tests)
## 
## Response: F2
##       Effect    df      MSE          F  ges p.value
## 1     Region 2, 54 23312.68 119.64 *** .816   <.001
## 2        Gen 1, 54 23312.68 106.15 *** .663   <.001
## 3 Region:Gen 2, 54 23312.68  12.08 *** .309   <.001
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1

The results show:

F2 was significantly influenced by Region (F[2,54] = 119.64, p < 0.001).
F2 was significantly influenced by Gen (F[1,54] = 106.15, p < 0.001).
There was a significant (F[2,54] = 12.08, p< 0.001) interaction between these factors.

All of these results were expected from the plot.

AN ANOVA can only be run if for each speaker there is:

ONE value for EACH level of each within factor.
ONE value for ONE level of each between factor.

This makes it possible to calculate in advance exactly how many observations are input to an ANOVA, given the number of speakers, and a knowledge of the factors (whether between or within) and the number of levels. For example, suppose there are 12 speakers, 7 old and 5 young who produce three different words. There is, then, one between factor (Age: two levels, young, old) and one within factor (Word: three levels, word1, word2, word3). Therefore there must be 12 x 1 (Between) x 3 (within) = 36 observations in total (not more, not less) for the ANOVA to be applicable. Phonetics experiments usually deviate from this principle because they typically collect multiple repetitions of the same factor or factor combinations. An example of this type of data is in the data-frame ssb.df. Here younger and older speakers (between factor Alter: two levels) produced swoop, used who'd (within factor Wort: three levels) up to 10 times and the second formant frequency (F2) was measured.

ssb.df %>%
  mutate(both = interaction(Wort, Alter)) %>%
  select(Vpn, both) %>%
  table()

##       both
## Vpn    swoop.alt used.alt who'd.alt swoop.jung used.jung who'd.jung
##   arkn        10       10        10          0         0          0
##   elwi         9       10        10          0         0          0
##   frwa        10       10        10          0         0          0
##   gisa        10       10        10          0         0          0
##   jach         0        0         0         10        10         10
##   jeny         0        0         0         10        10         10
##   kapo         0        0         0         10        10         10
##   mapr        10       10        10          0         0          0
##   nata        10       10        10          0         0          0
##   rohi         0        0         0         10        10         10
##   rusy         0        0         0         10        10         10
##   shle         0        0         0         10        10         10

Data such as these have to be aggregated (averaged) so that there is only one (and not 10) value for each level of the within factor (and one value for one of the levels of the between factor):

ssb.m.df =
  ssb.df %>%
  group_by(Alter, Wort, Vpn) %>%
  summarise(F2 = mean(F2)) %>%
  ungroup()

## `summarise()` has grouped output by 'Alter', 'Wort'. You can override using the
## `.groups` argument.

Now the data are ready for the ANOVA because there is one value for each level of the within factor per speaker:

ssb.m.df %>%
  mutate(both = interaction(Wort, Alter)) %>%
  select(Vpn, both) %>%
  table()

##       both
## Vpn    swoop.alt used.alt who'd.alt swoop.jung used.jung who'd.jung
##   arkn         1        1         1          0         0          0
##   elwi         1        1         1          0         0          0
##   frwa         1        1         1          0         0          0
##   gisa         1        1         1          0         0          0
##   jach         0        0         0          1         1          1
##   jeny         0        0         0          1         1          1
##   kapo         0        0         0          1         1          1
##   mapr         1        1         1          0         0          0
##   nata         1        1         1          0         0          0
##   rohi         0        0         0          1         1          1
##   rusy         0        0         0          1         1          1
##   shle         0        0         0          1         1          1

In fact, the aov_4() function will do the averaging, as shown here:

ssb.df %>%
aov_4(F2 ~  Wort * Alter + (Wort|Vpn), .)

## Warning: More than one observation per design cell, aggregating data using `fun_aggregate = mean`.
## To turn off this warning, pass `fun_aggregate = mean` explicitly.

## Contrasts set to contr.sum for the following variables: Alter

## Anova Table (Type 3 tests)
## 
## Response: F2
##       Effect          df  MSE         F  ges p.value
## 1      Alter       1, 10 4.13  14.88 ** .552    .003
## 2       Wort 1.37, 13.72 0.62 78.51 *** .574   <.001
## 3 Alter:Wort 1.37, 13.72 0.62   9.89 ** .145    .004
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
## 
## Sphericity correction method: GG

and give the warning “More than one observation per design cell, aggregating data using fun_aggregate = mean.” Because aov_4() has taken care of the averaging, the above is exactly equivalent to:

ssb.m.df %>%
aov_4(F2 ~  Wort * Alter + (Wort|Vpn), .)

## Contrasts set to contr.sum for the following variables: Alter

## Anova Table (Type 3 tests)
## 
## Response: F2
##       Effect          df  MSE         F  ges p.value
## 1      Alter       1, 10 4.13  14.88 ** .552    .003
## 2       Wort 1.37, 13.72 0.62 78.51 *** .574   <.001
## 3 Alter:Wort 1.37, 13.72 0.62   9.89 ** .145    .004
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
## 
## Sphericity correction method: GG

An ANOVA can also not be run if there is missing data. Suppose for example, we remove observation 5 from the averaged data: then speaker mapr will only have produced used and who'd but not swoop. This is evident from the zero in mapr for swoop:

ssb.m.df %>%
  # remove observation 5
  slice(-5) %>%
  mutate(both = interaction(Wort, Alter)) %>%
  select(Vpn, both) %>%
  table()

##       both
## Vpn    swoop.alt used.alt who'd.alt swoop.jung used.jung who'd.jung
##   arkn         1        1         1          0         0          0
##   elwi         1        1         1          0         0          0
##   frwa         1        1         1          0         0          0
##   gisa         1        1         1          0         0          0
##   jach         0        0         0          1         1          1
##   jeny         0        0         0          1         1          1
##   kapo         0        0         0          1         1          1
##   mapr         0        1         1          0         0          0
##   nata         1        1         1          0         0          0
##   rohi         0        0         0          1         1          1
##   rusy         0        0         0          1         1          1
##   shle         0        0         0          1         1          1

If aov_4() is run on this data with the missing value, then it removes all of that speaker’s data so that there are once again no missing values:

 ssb.m.df %>%
  # remove observation 5
  slice(-5) %>%
  aov_4(F2 ~  Wort * Alter + (Wort|Vpn),  .)

## Warning: Missing values for 1 ID(s), which were removed before analysis:
## mapr
## Below the first few rows (in wide format) of the removed cases with missing data.
##      Vpn Alter swoop     used    who.d
## # 8 mapr   alt    NA 13.09001 9.250057

## Contrasts set to contr.sum for the following variables: Alter

## Anova Table (Type 3 tests)
## 
## Response: F2
##       Effect          df  MSE         F  ges p.value
## 1      Alter        1, 9 4.58  12.29 ** .534    .007
## 2       Wort 1.30, 11.73 0.68 71.77 *** .563   <.001
## 3 Alter:Wort 1.30, 11.73 0.68   9.76 ** .149    .006
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
## 
## Sphericity correction method: GG

On a different point, the output also reports (in black) “Sphericity correction method: GG”. The abbreviation “GG” stands for Greenhouse-Geisser. This correction only ever applies if a within-factor has more than two levels (as in the case of Wort here). The correction is to compensate for the different variances of the within-factor levels. The compensation is done by changing the degrees of freedom often to a fractional value (which is why Wort and Alter:Wort both have degrees of freedom [1.37, 13.72]) and adjusting the probabilities upwards (so that it is slightly more difficult to achieve significance). Note that this correction never applies to any between-factor.

3.5 An introduction to the `emmeans()` function and post-hoc tests.

Following a significant interaction, further tests should be carried out to test which combination(s) of levels is/are significant. The package of choice for doing so is emmeans. As its name suggests, the post-hoc test is by default based on estimated marginal means that are also often called least squares or adjusted means.

Estimated marginal means are based on so-called marginal means (MM) which are the means of the levels of factors. As explained in further details here they are used to correct for differences in the number of observations between levels. In a simple data-frame such as the following modelled on an example from here:

x = c(4, 6, 2, 7, 3, 5, 4, 2, 3)
dialect = c(rep("A", 3), rep("B", 3), rep("C", 3))
x.df = data.frame(x, dialect)
x.df

##   x dialect
## 1 4       A
## 2 6       A
## 3 2       A
## 4 7       B
## 5 3       B
## 6 5       B
## 7 4       C
## 8 2       C
## 9 3       C

the data-frame is ‘balanced’ because there are three observations for each level of dialect. In this special case, the arithmetic and marginal mean (MM) are the same. The arithmetic mean is just the mean of the numerical variable across all observations:

x.df %>% 
  pull(x) %>%
  mean()

## [1] 4

and the MM is the mean of each of the three means, one per level:

x.df %>%
  # means of "A", "B", "C"
  group_by(dialect) %>%
  summarise(m = mean(x)) %>%
  # mean of these three means
  pull(m) %>%
  mean()

## [1] 4

That is, both are 4. Now consider the same calculation but after removing observation 4. In this case, the data-frame is ‘imbalanced’ because whereas levels A and C each have three observations, B now has only two. Because this data-frame is imbalanced, the arithmetic mean and MM will be different:

# same calculations as above but remove obs. 4
x.df %>% 
  slice(-4) %>%
  pull(x) %>%
  # Arithmetic mean
  mean()

## [1] 3.625

x.df %>%
  slice(-4) %>%
  group_by(dialect) %>%
  summarise(m = mean(x)) %>%
  pull(m) %>%
  # MM
  mean()

## [1] 3.666667

In post-hoc tests with the EMM package, the marginal means are estimated (hence EMM) by compensating for such imbalances. According to the vignette, using EMMs “is a much fairer way to compute marginal means, in that they are not biased by imbalances in the data. We are, in a sense, estimating what the marginal means would be, had the experiment been balanced’. For this reason, EMMs are typically based on a model of the marginal means (and not on the actual means themselves).

For the present example, there are nine tests to be carried out.

Gender: Do males and females differ within each of the three regions? (3 tests).

male vs. female in region A
male vs. female in region B
male vs. female in region C

Region: Do regions differ within each sex? (6 tests).

region A vs. region B in males
region A vs. region C in males
region B vs. region C in males
region A vs. region B in females
region A vs. region C in females
region B vs. region C in females

These nine tests can be carried out as follows

dg.df %>% 
  # Anova
  aov_4(F2 ~ Region * Gen + (1|Vpn), .) %>%
  # emmeans to test for interaction
  emmeans(., ~ Region * Gen) %>%
  # the results of the post-hoc tests
  pairs(., simple = "each", combine=T)

## Contrasts set to contr.sum for the following variables: Region, Gen

##  Gen Region contrast estimate   SE df t.ratio p.value
##  m   .      A - B          46 68.3 54   0.674  1.0000
##  m   .      A - C         465 68.3 54   6.808  <.0001
##  m   .      B - C         419 68.3 54   6.135  <.0001
##  w   .      A - B         176 68.3 54   2.572  0.1162
##  w   .      A - C         925 68.3 54  13.550  <.0001
##  w   .      B - C         750 68.3 54  10.978  <.0001
##  .   A      m - w        -603 68.3 54  -8.828  <.0001
##  .   B      m - w        -473 68.3 54  -6.930  <.0001
##  .   C      m - w        -142 68.3 54  -2.087  0.3747
## 
## P value adjustment: bonferroni method for 9 tests

The first command with emmeans() calculates the estimated marginal means and their 95% confidence intervals. The second command with pairs() runs the required nine tests. The first line in the output shows the difference between regions A and B for male speakers. The last line shows the difference between male and female speakers for region C.

An appropriate summary of the output might be as follows.

Post-hoc Bonferroni corrected tests showed a significant difference between regions A and C (p < 0.001) and between regions B and C (p < 0.001) for both male and female speakers. There were no differences between regions A and B for either male nor female speakers. The post-hoc tests also shows significant differences between male and female speakers within regions A (p < 0.001) and B (p < 0.001) but not within C.

Some further information on the output and emmeans() is as follows.

Bonferroni correction for multiple tests. The information in the last line of the output P value adjustment: bonferroni method for 9 tests means that the probabilities have been adjusted to take into account that multiple tests have been run. The adjustment is made because running multiple tests increases the likelihood that one or more of them will be significant by chance i.e., of wrongly rejecting the null hypothesis (sometimes known as a Type I error). The correction is to multiply the resulting probability by the number of tests that are made - or equivalently to reset the level at which the null hypothesis is rejected from 1-α (e.g. 0.05) to (1-α)/9 (0.0055) for 9 tests as in the above example. Everything else remains the same. Notice that when the same test is run without Bonferroni correction which can be done with adjust=none, then the resulting probabilities are 1/9th of those as when Bonferroni corrected:

dg.df %>% 
aov_4(F2 ~ Region * Gen + (1|Vpn), .) %>%
  # emmeans to test for interaction
  emmeans(., ~ Region * Gen) %>%
  # no Bonferroni correction
  pairs(., simple = "each", combine=T, adjust="none")

## Contrasts set to contr.sum for the following variables: Region, Gen

##  Gen Region contrast estimate   SE df t.ratio p.value
##  m   .      A - B          46 68.3 54   0.674  0.5034
##  m   .      A - C         465 68.3 54   6.808  <.0001
##  m   .      B - C         419 68.3 54   6.135  <.0001
##  w   .      A - B         176 68.3 54   2.572  0.0129
##  w   .      A - C         925 68.3 54  13.550  <.0001
##  w   .      B - C         750 68.3 54  10.978  <.0001
##  .   A      m - w        -603 68.3 54  -8.828  <.0001
##  .   B      m - w        -473 68.3 54  -6.930  <.0001
##  .   C      m - w        -142 68.3 54  -2.087  0.0416

As discussed here, the Bonferroni test is quite conservative meaning that it is often more difficult to achieve significance than with other post-hoc tests. It can, therefore, be helpful to try other methods for post-hoc testing, such as adjust=mvt or adjust=holm. There is no unambiguous answer as to which post-hoc test is the correct one to use.

The arguments to pairs().

Leaving out the last two arguments causes all the possible pairwise tests to be listed. In this case, there are 15 such tests (15 = 6!/4!2! where ! is factorial; 6! because there are 3 levels in Dialect x 2 levels in Gen):

dg.df %>% 
aov_4(F2 ~ Region * Gen + (1|Vpn), .) %>%
  # emmeans to test for interaction
  emmeans(., ~ Region * Gen) %>%
  # no Bonferroni correction
  pairs(.)

## Contrasts set to contr.sum for the following variables: Region, Gen

##  contrast  estimate   SE df t.ratio p.value
##  A m - B m       46 68.3 54   0.674  0.9841
##  A m - C m      465 68.3 54   6.808  <.0001
##  A m - A w     -603 68.3 54  -8.828  <.0001
##  A m - B w     -427 68.3 54  -6.256  <.0001
##  A m - C w      322 68.3 54   4.722  0.0002
##  B m - C m      419 68.3 54   6.135  <.0001
##  B m - A w     -649 68.3 54  -9.502  <.0001
##  B m - B w     -473 68.3 54  -6.930  <.0001
##  B m - C w      276 68.3 54   4.048  0.0022
##  C m - A w    -1068 68.3 54 -15.636  <.0001
##  C m - B w     -892 68.3 54 -13.065  <.0001
##  C m - C w     -142 68.3 54  -2.087  0.3094
##  A w - B w      176 68.3 54   2.572  0.1221
##  A w - C w      925 68.3 54  13.550  <.0001
##  B w - C w      750 68.3 54  10.978  <.0001
## 
## P value adjustment: tukey method for comparing a family of 6 estimates

However, combinations in which levels of both factors are different such as comparing male-B with female-C are not usually needed. Such combinations are removed with simple="each":

dg.df %>% 
aov_4(F2 ~ Region * Gen + (1|Vpn), .) %>%
  # emmeans to test for interaction
  emmeans(., ~ Region * Gen) %>%
  # no Bonferroni correction
  pairs(., simple="each")

## Contrasts set to contr.sum for the following variables: Region, Gen

## $`simple contrasts for Region`
## Gen = m:
##  contrast estimate   SE df t.ratio p.value
##  A - B          46 68.3 54   0.674  0.7797
##  A - C         465 68.3 54   6.808  <.0001
##  B - C         419 68.3 54   6.135  <.0001
## 
## Gen = w:
##  contrast estimate   SE df t.ratio p.value
##  A - B         176 68.3 54   2.572  0.0340
##  A - C         925 68.3 54  13.550  <.0001
##  B - C         750 68.3 54  10.978  <.0001
## 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## 
## $`simple contrasts for Gen`
## Region = A:
##  contrast estimate   SE df t.ratio p.value
##  m - w        -603 68.3 54  -8.828  <.0001
## 
## Region = B:
##  contrast estimate   SE df t.ratio p.value
##  m - w        -473 68.3 54  -6.930  <.0001
## 
## Region = C:
##  contrast estimate   SE df t.ratio p.value
##  m - w        -142 68.3 54  -2.087  0.0416

to give 9 tests. The last argument combine=T gives the appropriate Bonferroni correction (thus changing the probabilities) but is in all other respects the same.

Analysis of variance

Jonathan Harrington, Johanna Cronenberg