1 When the population mean and standard error have to be estimated: the t-distribution
2 Setting a confidence interval with the t-distribution
3 A t-test of the difference between means: general considerations
- 3.1 Between and within-subjects factors
- 3.2 Significance testing
4 Applying t-tests
- 4.1 A paired t-test
- 4.2 The unpaired t-test
5 If the data are not normally distributed

If not already done, carry out parts 1-5 of the setup, as described here.

Load these libraries/data-frames:

library(tidyverse)
library(broom)
urla = "https://www.phonetik.uni-muenchen.de/"
urlb = "studium_lehre/lehrmaterialien/R_speech_processing/Rdf/"
url = paste0(urla, urlb)
e.df = read.table(file.path(url, "e.txt"), 
                  stringsAsFactors = T)
form.df = read.table(file.path(url, "form.df.txt"), 
                  stringsAsFactors = T)

1 When the population mean and standard error have to be estimated: the t-distribution

In all the examples from the normal distribution, the population means and standard error for the mean score from a number of dice could be calculated from theoretical considerations.

But in most cases this is not possible and these parameters have to be estimated. In such cases, the probability distribution used for calculating probabilities is the so-called t-distribution. Its shape resembles that of the normal distribution but depends on the size of the sample used to estimate these parameters: the smaller the sample, the heavier the tails. In the t-distribution, the sample size enters through the number of degrees of freedom. Here is a plot of the t-distribution with 3 (red) and 10 (blue) degrees of freedom with the normal curve (black) overlaid for comparison. All three curves are centered on 0 and drawn in standard form (the normal curve with a mean of 0 and a standard deviation of 1):

ggplot() +
  # t-distribution with 3 degrees of freedom
  geom_function(fun = dt, 
                args = list(df = 3), col = "red") +
  # ... with 10
  geom_function(fun = dt, col = "blue",
                args = list(df = 10)) +
  xlim(-4, 4) + 
  ylab("Probability density") +
  xlab("Sample mean") +
  geom_function(fun = dnorm,
                args = list(mean = 0, sd = 1))
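
The way the t-distribution approaches the normal distribution can also be checked numerically. The following is a minimal sketch that compares the 97.5% quantile of the t-distribution with that of the normal distribution; the degrees of freedom chosen here are arbitrary illustrations and not taken from the data of this module.

```r
# degrees of freedom to compare (arbitrary illustrative values)
dfs = c(3, 10, 30, 100)
# 97.5% quantile of the t-distribution for each of these
t.crit = qt(0.975, df = dfs)
# the same quantile of the standard normal distribution
z.crit = qnorm(0.975)
# side-by-side comparison
data.frame(df = dfs, t = round(t.crit, 3), normal = round(z.crit, 3))
```

The t quantiles shrink towards the normal quantile as the degrees of freedom increase, which corresponds to the heavier tails of the red (3 degrees of freedom) than of the blue (10 degrees of freedom) curve.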

2 Setting a confidence interval with the t-distribution

A confidence interval can be set using the t-distribution on the assumption that the samples are drawn at random and that there is no evidence to show that they are not normally distributed. For example, the following are 15 F1 (first formant frequency) values in producing [a] by 15 standard German female speakers:

F1 = c(994, 877, 897, 753, 752, 959, 696, 651, 943, 709, 700, 686, 641, 819, 966)
F1

##  [1] 994 877 897 753 752 959 696 651 943 709 700 686 641 819 966

In order to obtain a confidence interval, the population mean and standard error are needed. It can be shown that when they are unknown, then they can be best estimated from the mean and standard error of the sample.

# estimated population mean
F1.mean = mean(F1)
# estimated standard error
F1.SE = sd(F1)/sqrt(15)

The sample of 15 F1 values is, as far as statistical testing is concerned, exactly analogous to obtaining a sample by throwing 15 dice in the earlier module. Recall from that module that the standard error, SE, was calculated from σ/√N where N is the sample size: hence the division by √15 when working out the standard error of the mean, whether when throwing 15 dice together or, as in this case, when estimating the standard error of the mean from a sample of 15 F1 values.

A 95% confidence interval can now be set with the qt() function (rather than the qnorm() function used with the normal distribution). The degrees of freedom must also be specified: for a single sample, this is always 1 less than the sample size. Thus:

lower = F1.mean + F1.SE * qt(0.025, 14)
lower

## [1] 733.3541

upper = F1.mean + F1.SE * qt(0.975, 14)
upper

## [1] 872.3792

# or in a single line:
F1.mean + F1.SE * qt(c(0.025, 0.975), 14)

## [1] 733.3541 872.3792
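
The same steps can be wrapped in a small function so that the confidence level and sample size are not hard-coded. This is only a sketch: t_ci() is a hypothetical helper written for illustration and is not part of R or of any package used in this module.

```r
F1 = c(994, 877, 897, 753, 752, 959, 696, 651, 943, 709, 700, 686, 641, 819, 966)

# t_ci(): hypothetical helper returning the lower and upper confidence limits
t_ci = function(x, level = 0.95) {
  n = length(x)
  # estimated standard error of the mean
  SE = sd(x) / sqrt(n)
  # probability in each tail, e.g. 0.025 for a 95% interval
  p.tail = (1 - level) / 2
  mean(x) + SE * qt(c(p.tail, 1 - p.tail), df = n - 1)
}

ci95 = t_ci(F1)
ci99 = t_ci(F1, level = 0.99)
ci95
ci99
```

With the default level of 0.95 this reproduces the interval calculated above; raising the level to 0.99 gives a wider interval.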

The 95% confidence interval can also be obtained using the t.test() function thus:

t.test(F1)

## 
##  One Sample t-test
## 
## data:  F1
## t = 24.772, df = 14, p-value = 5.811e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  733.3541 872.3792
## sample estimates:
## mean of x 
##  802.8667

Thus, based on this sample, the mean F1 lies between 733 Hz and 872 Hz with a probability of 95%.

In carrying out this simple but effective test, use has been made of a random sample (in this case of 15 values) to make a probability statement, not about the sample, but about the population from which the sample has been drawn. Thus the implication of the finding is that the mean F1 for female speakers of standard German producing [a] lies between 733 Hz and 872 Hz with a probability of 95%. This statement is of course only true to the extent that:

the sample is random.
it follows the normal distribution.
it is representative of female standard German speakers producing [a].

Details on testing whether a sample follows the normal distribution can be found here.
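
One widely used check, which may or may not be the one described in the linked material, is the Shapiro-Wilk test available in base R as shapiro.test(). The sketch below applies it to the 15 F1 values from above.

```r
F1 = c(994, 877, 897, 753, 752, 959, 696, 651, 943, 709, 700, 686, 641, 819, 966)

# Shapiro-Wilk test: the null hypothesis is that the sample
# comes from a normal distribution
sw = shapiro.test(F1)
sw$p.value
```

A p-value at or above 0.05 means that the test provides no evidence against normality; it does not prove that the data are normally distributed.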

3 A t-test of the difference between means: general considerations

A t-test is most commonly applied to work out whether the difference between two means is significant. In a t-test, there is:

one numerical variable and
one categorical variable or factor that consists of two unique categories or levels.

Two examples of data that can be analysed with a t-test are as follows.

Do English and German differ in their F2 (second formant frequency) of [i]?

head(e.df)

##         F2 Sprache Vpn
## 1 1904.296       E  S1
## 2 1938.509       E  S2
## 3 1519.615       E  S3
## 4 1904.047       E  S4
## 5 1891.154       E  S5
## 6 1938.889       E  S6

With respect to these data the question becomes: Is F2 influenced by Sprache? F2 is clearly numeric and Sprache is a factor with two levels, as can be seen from the output of summary():

summary(e.df)

##        F2       Sprache      Vpn    
##  Min.   :1490   D:15    S1     : 1  
##  1st Qu.:1837   E:12    S10    : 1  
##  Median :1947           S11    : 1  
##  Mean   :1957           S12    : 1  
##  3rd Qu.:2112           S13    : 1  
##  Max.   :2266           S14    : 1  
##                         (Other):21

Is F2 of [i] affected by Stress?

head(form.df)

##     F2   Stress Subject
## 1 2577 stressed      S1
## 2 2122 stressed      S2
## 3 2192 stressed      S3
## 4 2581 stressed      S4
## 5 2227 stressed      S5
## 6 2481 stressed      S6

In this case, F2 is the numeric variable and Stress is the factor with two levels:

summary(form.df)

##        F2              Stress      Subject  
##  Min.   :1728   stressed  :12   S1     : 2  
##  1st Qu.:2047   unstressed:12   S10    : 2  
##  Median :2150                   S11    : 2  
##  Mean   :2176                   S12    : 2  
##  3rd Qu.:2333                   S2     : 2  
##  Max.   :2581                   S3     : 2  
##                                 (Other):12

Thus both e.df and form.df can be analysed with a t-test.

3.1 Between and within-subjects factors

There are two types of t-test, unpaired and paired, that depend on whether the categorical variable is a so-called between-subjects or within-subjects factor.

Between-subjects factor

A between-subjects factor is when each subject has a value for only one of its levels. Sprache of e.df is a between-subjects factor for this reason:

e.df %>% 
  select(Vpn, Sprache) %>%
  table()

##      Sprache
## Vpn   D E
##   S1  0 1
##   S10 0 1
##   S11 0 1
##   S12 0 1
##   S13 1 0
##   S14 1 0
##   S15 1 0
##   S16 1 0
##   S17 1 0
##   S18 1 0
##   S19 1 0
##   S2  0 1
##   S20 1 0
##   S21 1 0
##   S22 1 0
##   S23 1 0
##   S24 1 0
##   S25 1 0
##   S26 1 0
##   S27 1 0
##   S3  0 1
##   S4  0 1
##   S5  0 1
##   S6  0 1
##   S7  0 1
##   S8  0 1
##   S9  0 1

That is, a subject is EITHER English OR German but never both.

Within-subjects factor

A within-subjects factor is when the subject has a value for more than one (and typically all of) its levels. For this reason, Stress of form.df is a within-subjects factor:

form.df %>% 
  select(Subject, Stress) %>%
  table()

##        Stress
## Subject stressed unstressed
##     S1         1          1
##     S10        1          1
##     S11        1          1
##     S12        1          1
##     S2         1          1
##     S3         1          1
##     S4         1          1
##     S5         1          1
##     S6         1          1
##     S7         1          1
##     S8         1          1
##     S9         1          1

That is, for each subject there is an F2 value for stressed AND unstressed.

In speech science, between-subjects factors are quite easy to identify: these are ones that typically deal with attributes of the subject such as whether the subject is male OR female, whether the subject is old OR young, whether the subject is a smoker OR non-smoker etc. By contrast, linguistic, prosodic, or phonetic categories are often (but by no means always!) within-subjects factors. For example:

in analysing vowels, there are measurements for a subject on both [e] AND [a] (within-subjects factor Vowel, levels e, a).
in analysing prosody, there are measurements for a subject at a slow AND fast rate of speech (within-subjects factor Rate, levels slow, fast).
in analysing grammatical categories, there are measurements for a subject on both content AND function words (within-subjects factor Category, levels content, function).
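
Whether a factor is between-subjects or within-subjects can also be checked by counting how many distinct levels occur per subject. The following is a minimal sketch: the two toy data-frames and the helper count_levels() are hypothetical and only mimic the structure of form.df and e.df.

```r
# hypothetical toy data: 3 subjects, each measured in both stress conditions
toy.within = data.frame(
  Subject = rep(c("S1", "S2", "S3"), each = 2),
  Stress = rep(c("stressed", "unstressed"), times = 3))

# hypothetical toy data: 4 subjects, each belonging to one language only
toy.between = data.frame(
  Vpn = c("S1", "S2", "S3", "S4"),
  Sprache = c("E", "E", "D", "D"))

# count_levels(): hypothetical helper giving the number of
# distinct factor levels observed for each subject
count_levels = function(df, subject, fac) {
  tapply(as.character(df[[fac]]), df[[subject]],
         function(x) length(unique(x)))
}

within.n = count_levels(toy.within, "Subject", "Stress")
between.n = count_levels(toy.between, "Vpn", "Sprache")
within.n
between.n
```

A count of 1 for every subject indicates a between-subjects factor; a count greater than 1 indicates a within-subjects factor.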

3.2 Significance testing

This involves setting up a null hypothesis H0, an alternative hypothesis H1, and one or more so-called alpha levels which are used to accept or reject the null hypothesis. With regard to the first of the earlier examples:

H0: F2 is unaffected by language (English vs. German)
H1: F2 is affected by language (English vs. German)
α levels: 0.95, 0.99, 0.999.

The t-test works out the probability that the difference between the two sample means could be zero. If this probability is less than 0.05 (i.e. less than 1 - α for α = 0.95), then H0 is rejected and H1 accepted. If this probability is greater than or equal to 0.05, then H0 is retained.

A common practice in speech science is (as above) to choose three α levels of 0.95, 0.99, 0.999, which give significance levels or thresholds of 1 - c(0.95, 0.99, 0.999), i.e. 0.05, 0.01, 0.001. If three such levels are set, then:

Probabilities above or equal to 0.05 are non-significant.
Probabilities below 0.05 and above or equal to 0.01 are significant at the 0.05 level (and reported as p < 0.05).
Probabilities below 0.01 and above or equal to 0.001 are significant at the 0.01 level (and reported as p < 0.01).
Probabilities below 0.001 are significant at the 0.001 level (and reported as p < 0.001).
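
The four rules above can be sketched as a small function. sig_label() is a hypothetical helper written for illustration; it uses cut() from base R with intervals that are closed on the left so that, for example, a probability of exactly 0.05 counts as non-significant.

```r
# sig_label(): hypothetical helper mapping probabilities to reporting labels
sig_label = function(p) {
  as.character(cut(p,
      # intervals [0, 0.001), [0.001, 0.01), [0.01, 0.05), [0.05, 1]
      breaks = c(0, 0.001, 0.01, 0.05, 1),
      labels = c("p < 0.001", "p < 0.01", "p < 0.05", "NS"),
      right = FALSE, include.lowest = TRUE))
}

labels = sig_label(c(0.0001, 0.001147, 0.03443, 0.05, 0.2))
labels
```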

4 Applying t-tests

The procedure for applying t-tests or indeed for carrying out any statistical test is to:

obtain a preliminary assessment of differences using a figure.
carry out the test.

For all the reasons given here, a boxplot is an appropriate figure when concerned (as with a t-test) with the effects of a factor on a numerical variable.

4.1 A paired t-test

This is the t-test when the two-level categorical variable is within-subject, as in the case of Stress in form.df. The test is whether the mean of the differences between the two levels, calculated separately by subject, is different from zero.

There are two appropriate figures for such data. The first shows the within-subject differences between stressed and unstressed as one line per subject:

form.df %>%
  ggplot() +
  aes(y = F2, x = Stress, col = Subject, group = Subject) +
  geom_point() +
  geom_line()

If the majority of the lines (one per subject) is flat, then the difference is unlikely to be significant. In the above example, on the other hand, most lines slope downwards i.e. show a difference for most subjects such that F2 stressed is greater than F2 unstressed.

The second type of figure is a boxplot of the within-subject differences between stressed and unstressed:

form.df %>%
  pivot_wider(names_from = Stress, 
              values_from = F2) %>%
  mutate(F2diff = stressed - unstressed) %>%
  ggplot() +
  aes(y = F2diff) +
  geom_boxplot()

In this plot, the F2 value of unstressed has been subtracted from the F2 value of stressed separately for each subject – so the boxplot consists of 12 values (because there are 12 subjects). If the interquartile range of the boxplot includes zero, then the difference is likely to be non-significant. In this case however, the interquartile range does not include zero and this suggests that the t-test may well be significant.

The paired t-test calculates a confidence interval for the within-subject difference between stressed and unstressed that can be used to assess the probability that this mean difference (across all subjects) is zero:

form.df %>%
  pivot_wider(names_from = Stress, 
              values_from = F2) %>%
  mutate(F2diff = stressed - unstressed) %>%
  pull(F2diff) %>%
  t.test()

## 
##  One Sample t-test
## 
## data:  .
## t = 4.3543, df = 11, p-value = 0.001147
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  134.0163 407.9837
## sample estimates:
## mean of x 
##       271
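
A one-sample t-test on the within-subject differences is equivalent to calling t.test() on the two sets of values with paired = TRUE. The sketch below illustrates this with hypothetical F2 values for five subjects; these numbers are invented for illustration and are not taken from form.df.

```r
# hypothetical F2 values (Hz) for 5 subjects, in the same subject order
stressed = c(2500, 2300, 2250, 2600, 2400)
unstressed = c(2200, 2150, 2100, 2250, 2300)

# one-sample t-test on the per-subject differences
one.sample = t.test(stressed - unstressed)
# paired t-test on the two vectors
paired = t.test(stressed, unstressed, paired = TRUE)

c(one.sample$statistic, paired$statistic)
c(one.sample$p.value, paired$p.value)
```

The equivalence only holds if the two vectors are ordered identically by subject, which is why the pivot_wider() step is used in this module to line up the two values for each subject.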

The test gives the following information.

mean of x.

This is the mean of stressed - unstressed calculated separately for each subject:

form.df %>%
  pivot_wider(names_from = Stress, 
              values_from = F2) %>%
  mutate(F2diff = stressed - unstressed) %>%
  summarise(mean = mean(F2diff))

## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1   271

95 percent confidence interval.

This is the 95% confidence interval for the mean within-subject difference between stressed and unstressed. It states that this mean difference (specifically stressed - unstressed) lies between 134.0163 and 407.9837 Hz with a probability of 95%. Importantly, since this range does not include zero, the probability that this difference could be zero is less than 0.05 (and so significant at an α level of 0.95).

t-statistic, degrees of freedom, probability.

df: The degrees of freedom are related to the sample size – in this case to the number of subjects. There are 12 subjects and the number of degrees of freedom is one less than this.
The t-statistic is the distance, measured in numbers of standard errors, between zero and the mean of the distribution.

The (estimated population) standard error, SE, is given by σ/√N, where σ is the (estimated population) standard deviation and N is 12 (because there are 12 subjects). The t-statistic is then given by (271 - 0)/SE. The manual calculation of these quantities for these data is as follows:

form.df %>%
  pivot_wider(names_from = Stress, 
              values_from = F2) %>%
  mutate(F2diff = stressed - unstressed) %>%
  summarise(m = mean(F2diff), 
    SE = sd(F2diff)/sqrt(12), 
    tstat = m/SE)

## # A tibble: 1 × 3
##       m    SE tstat
##   <dbl> <dbl> <dbl>
## 1   271  62.2  4.35

p-value: This is the probability that the difference between the means could be zero. The probability in this case is well below 0.05 and significant at the 0.01 level.
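
The p-value can itself be derived from the t-statistic and the degrees of freedom using pt(), the cumulative distribution function of the t-distribution. The following sketch uses the rounded values printed in the test output above, so the result agrees with the reported p-value only up to rounding.

```r
# values taken from the t.test() output above
tstat = 4.3543
dof = 11

# two-sided probability: area in both tails beyond |t|
p = 2 * pt(-abs(tstat), df = dof)
p
```

The factor of 2 is needed because the alternative hypothesis is two-sided (true mean is not equal to 0).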

The results are reported by:

rounding the t-statistic to two decimal places and (usually) ignoring the sign.
stating the degrees of freedom.
choosing the appropriate level of significance (see above), i.e. NS (non-significant) or p < 0.05 or p < 0.01 or p < 0.001.

The report in this case is either:

There was a significant influence of stress on F2 (t[11] = 4.35, p < 0.01)

or:

F2 was significantly influenced by stress (t[11] = 4.35, p < 0.01)

or:

the F2 difference between stressed and unstressed was significant (t[11] = 4.35, p < 0.01).
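
The bracketed part of such a report can be assembled from the object returned by t.test(). The sketch below uses a hypothetical helper, report_t(), which is not part of R or broom, and applies it to the one-sample test of the 15 F1 values from earlier in this module.

```r
F1 = c(994, 877, 897, 753, 752, 959, 696, 651, 943, 709, 700, 686, 641, 819, 966)

# report_t(): hypothetical helper that formats a t-test result
report_t = function(tt) {
  p = tt$p.value
  # choose the significance label according to the three thresholds
  level = if (p < 0.001) "p < 0.001" else
          if (p < 0.01) "p < 0.01" else
          if (p < 0.05) "p < 0.05" else "NS"
  sprintf("t[%s] = %.2f, %s",
          # degrees of freedom, to at most one decimal place
          format(round(unname(tt$parameter), 1)),
          # t-statistic to two decimal places, sign ignored
          abs(unname(tt$statistic)),
          level)
}

report = report_t(t.test(F1))
report
```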

Further details on Null hypothesis testing and p values are covered here (in German).

If you have loaded the package library(broom), then the output of a t-test can be displayed in a more readable form using the tidy() function after applying the t-test, thus:

form.df %>%
  pivot_wider(names_from = Stress, 
              values_from = F2) %>%
  mutate(F2diff = stressed - unstressed) %>%
  pull(F2diff) %>%
  t.test() %>%
  tidy()

## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method     alternative
##      <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>      <chr>      
## 1      271      4.35 0.00115        11     134.      408. One Sampl… two.sided

4.2 The unpaired t-test

This is appropriate when the categorical variable is a between-subjects factor as established earlier for the data-frame e.df. First a plot:

e.df %>%
  ggplot() +
  aes(y = F2, x = Sprache) +
  geom_boxplot()

The F2 values for the German speakers are generally greater than those of the English speakers. The corresponding t-test is a test of whether the difference between (i) the F2-mean for the English speakers and (ii) the F2-mean for the German speakers is significantly different from zero. The difference between the two means:

e.df %>%
  group_by(Sprache) %>%
  # calculate F2-means, one per language
  summarise(F2 = mean(F2)) %>%
  # extract these F2-means
  pull(F2) %>%
  # and subtract one from the other
  diff()

## [1] -167.0991

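is 167.0991 Hz in magnitude (the sign is negative because diff() subtracts the German mean, which comes first, from the English mean) and, as the t-test below shows, significantly different from 0 Hz.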

This t-test is accomplished with:

t.test(F2 ~ Sprache, data = e.df)

## 
##  Welch Two Sample t-test
## 
## data:  F2 by Sprache
## t = 2.2613, df = 21.101, p-value = 0.03443
## alternative hypothesis: true difference in means between group D and group E is not equal to 0
## 95 percent confidence interval:
##   13.46719 320.73097
## sample estimates:
## mean in group D mean in group E 
##        2031.672        1864.573

Or more simply still using the so-called dot notation where the dot stands for the data-frame (and here also using the tidy() function):

e.df %>%
  t.test(F2 ~ Sprache, .) %>%
  tidy()

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     167.     2032.     1865.      2.26  0.0344      21.1     13.5      321.
## # ℹ 2 more variables: method <chr>, alternative <chr>

The corresponding report is:

F2 was significantly influenced by language (t[21.1] = 2.26, p < 0.05).

There was a significant influence of language on F2 (t[21.1] = 2.26, p < 0.05).

The F2 difference between English and German speakers was significant (t[21.1] = 2.26, p < 0.05).

For complex reasons whose explanation is beyond the scope of this module, the unpaired t-test in R gives fractional degrees of freedom (21.1 in this case). This comes about because by default the unpaired t-test is carried out under the assumption that the F2-variance (square of the standard deviation) for the German data and the F2-variance for the English data are unequal. It is possible to carry out the unpaired t-test under the assumption that the variances are equal. This would be: t.test(F2 ~ Sprache, data = e.df, var.equal = T), in which case the number of degrees of freedom is equal to the number of observations (assuming 1 per subject) minus 2 (27 - 2 = 25 in this case).
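
For readers who want to see where the fractional degrees of freedom come from, the sketch below computes them by hand using the Welch-Satterthwaite approximation and compares the result with t.test(). The data are simulated and hypothetical: only the group sizes (15 and 12) mirror e.df, and the means and standard deviations passed to rnorm() are invented.

```r
set.seed(1)
# hypothetical simulated F2 values (Hz) for two groups of speakers
g1 = rnorm(15, mean = 2000, sd = 150)
g2 = rnorm(12, mean = 1850, sd = 250)

# squared standard error of each group mean
v1 = var(g1) / length(g1)
v2 = var(g2) / length(g2)
# Welch-Satterthwaite approximation to the degrees of freedom
welch.df = (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

# default (unequal variances) and equal-variance t-tests
welch = t.test(g1, g2)
pooled = t.test(g1, g2, var.equal = TRUE)

c(manual = welch.df,
  welch = unname(welch$parameter),
  pooled = unname(pooled$parameter))
```

The manually calculated value matches the degrees of freedom of the default test, while the equal-variance test gives 15 + 12 - 2 = 25.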

5 If the data are not normally distributed

Most parameters in speech are likely to follow the normal distribution and if they do not, then this is typically because of outliers. There are various diagnostic tests for whether the data are normally distributed that have been discussed repeatedly and so will not be reviewed here (see instead e.g. here for a summary). The analysis with t-tests can always be recomputed using a Wilcoxon test (the signed rank test for paired data, the rank sum test for unpaired data), which does not assume that the data are normally distributed. The syntax is the same as when using t-tests. For the paired t-test analysed above:

form.df %>%
  pivot_wider(names_from = Stress, 
              values_from = F2) %>%
  mutate(F2diff = stressed - unstressed) %>%
  pull(F2diff) %>% 
  wilcox.test()

## Warning in wilcox.test.default(.): cannot compute exact p-value with ties

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  .
## V = 75, p-value = 0.005338
## alternative hypothesis: true location is not equal to 0

which can be reported as:

There was a significant influence of stress on F2 (Wilcoxon signed rank test, p < 0.01).

For the unpaired t-test analysed above:

e.df %>%
  wilcox.test(F2 ~ Sprache, .)

## 
##  Wilcoxon rank sum exact test
## 
## data:  F2 by Sprache
## W = 131, p-value = 0.04687
## alternative hypothesis: true location shift is not equal to 0

which can be reported as:

F2 was significantly influenced by language (Wilcoxon rank sum test, p < 0.05).
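
The effect that a single outlier can have on the two kinds of test is illustrated in the sketch below. The within-subject differences are hypothetical values invented for this illustration; the last value has been replaced by an extreme outlier.

```r
# hypothetical within-subject differences (Hz), all positive and similar
d = c(12, 15, 9, 14, 11, 13, 10, 16)
# the same differences with the last value replaced by an outlier
d.out = c(12, 15, 9, 14, 11, 13, 10, 400)

# the t-test is strongly affected by the outlier because the
# outlier inflates the standard error
t.clean = t.test(d)$p.value
t.outlier = t.test(d.out)$p.value

# the Wilcoxon signed rank test depends only on the ranks and
# signs of the differences
w.clean = wilcox.test(d)$p.value
w.outlier = wilcox.test(d.out)$p.value

c(t.clean = t.clean, t.outlier = t.outlier,
  w.clean = w.clean, w.outlier = w.outlier)
```

In this constructed example the t-test is no longer significant once the outlier is present, whereas the result of the Wilcoxon signed rank test is unchanged because the ranks and signs of the differences are the same in both cases.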

t-tests of the differences between means

Jonathan Harrington, Johanna Cronenberg