If not already done, carry out parts 1-5 of the setup, as described here.
Load these libraries/data-frames:
library(tidyverse)
library(broom)
urla = "https://www.phonetik.uni-muenchen.de/"
urlb = "studium_lehre/lehrmaterialien/R_speech_processing/Rdf/"
url = paste0(urla, urlb)
e.df = read.table(file.path(url, "e.txt"),
stringsAsFactors = T)
form.df = read.table(file.path(url, "form.df.txt"),
stringsAsFactors = T)
In all the examples from the normal distribution, the population mean and standard error for the mean score from a number of dice could be calculated from theoretical considerations.
But in most cases, this is not possible and these parameters have to be estimated. In such cases, the probability distribution that is used for calculating probabilities is the so-called t-distribution. This resembles the normal distribution, but its shape depends on the sample size that is used to estimate these parameters. In the t-distribution, the sample size enters through the number of degrees of freedom. Here is a plot of the t-distribution with 3 (red) and 10 (blue) degrees of freedom and the overlaid normal curve (black) for comparison. The mean and standard error are in all cases 0 and 1 respectively:
ggplot() +
# t-distribution with 3 degrees of freedom
geom_function(fun = dt,
args = list(df = 3), col = "red") +
# ... with 10
geom_function(fun = dt, col = "blue",
args = list(df = 10)) +
xlim(-4, 4) +
ylab("Probability density") +
xlab("Sample mean") +
geom_function(fun = dnorm,
args = list(mean = 0, sd = 1))
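How the t-distribution approaches the normal distribution as the degrees of freedom increase can also be checked numerically. The following is a minimal sketch, not part of the original material: the degrees of freedom 30 and 100 and the names dfs, q.t and q.norm are chosen here only for illustration. It compares the 97.5% quantile of the t-distribution with that of the normal distribution:

```r
# degrees of freedom to compare: 3 and 10 are those plotted above,
# 30 and 100 are extra illustrative values
dfs = c(3, 10, 30, 100)
# 97.5% quantile of the t-distribution for each number of degrees of freedom
q.t = qt(0.975, df = dfs)
# the same quantile for the standard normal distribution
q.norm = qnorm(0.975)
data.frame(df = dfs, q.t = q.t, q.norm = q.norm)
```

The t quantiles are larger than the normal quantile and shrink towards it as the degrees of freedom increase, which is why confidence intervals based on small samples are wider.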
A confidence interval can be set using the t-distribution on the assumption that the samples are drawn at random and that there is no evidence to show that they are not normally distributed. For example, the following are 15 F1 (first formant frequency) values in producing [a] by 15 standard German female speakers:
## [1] 994 877 897 753 752 959 696 651 943 709 700 686 641 819 966
In order to obtain a confidence interval, the population mean and standard error are needed. It can be shown that when they are unknown, then they can be best estimated from the mean and standard error of the sample.
The sample of 15 F1 values is, as far as statistical testing is concerned, exactly analogous to obtaining a sample by throwing 15 dice in the earlier module. Recall from that module that the standard error, SE, was calculated from σ/√N where N is the sample size: thus there is division by √15 when working out the standard error of the mean either when throwing 15 dice together or, as in this case, when estimating the standard error of the mean from a sample of 15 F1 values.
A 95% confidence interval can now be set with the qt() function (rather than the qnorm() function used when dealing with the normal distribution). The degrees of freedom need to be specified: these are always 1 less than the sample size. Thus:
## [1] 733.3541
## [1] 872.3792
## [1] 733.3541 872.3792
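The code that produced the three outputs above is not shown. The following is a minimal sketch of one way to obtain them, assuming that the 15 values listed earlier are stored in a vector called F1 (the name that appears in the t.test() output). The names m, SE, dof, lower and upper are chosen here for illustration only:

```r
# the 15 F1 values listed above
F1 = c(994, 877, 897, 753, 752, 959, 696, 651,
       943, 709, 700, 686, 641, 819, 966)
# sample mean, used as the estimate of the population mean
m = mean(F1)
# estimated standard error of the mean: standard deviation / sqrt(N)
SE = sd(F1) / sqrt(length(F1))
# degrees of freedom: one less than the sample size
dof = length(F1) - 1
# lower and upper limits of the 95% confidence interval
lower = m + qt(0.025, df = dof) * SE
upper = m + qt(0.975, df = dof) * SE
c(lower, upper)
```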
The 95% confidence interval can also be obtained by applying the t.test() function to the vector of F1 values, i.e. t.test(F1):
##
## One Sample t-test
##
## data: F1
## t = 24.772, df = 14, p-value = 5.811e-13
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 733.3541 872.3792
## sample estimates:
## mean of x
## 802.8667
Thus based on this sample, the mean F1 lies between 733 Hz and 872 Hz with a probability of 95%.
In carrying out this simple but effective test, use has been made of a random sample (in this case of 15 values) to make a probability statement, not about the sample, but instead about the population from which the sample has been drawn. Thus the implication of the finding is that the mean F1 for female speakers of standard German producing [a] lies between 733 Hz and 872 Hz with a probability of 95%. This statement is of course only true to the extent that the assumptions stated earlier hold: the sample was drawn at random and there is no evidence that the values are not normally distributed.
Details on testing whether a sample follows the normal distribution can be found here.
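What "with a probability of 95%" means in practice can be illustrated by simulation. The sketch below is not part of the original analysis. It assumes a hypothetical normally distributed population (mean 800 Hz, standard deviation 125 Hz, both invented values), repeatedly draws random samples of 15 values, and counts how often the 95% confidence interval from t.test() contains the population mean:

```r
set.seed(1)              # arbitrary seed, for reproducibility only
mu = 800                 # hypothetical population mean (Hz)
sigma = 125              # hypothetical population standard deviation (Hz)
n = 15                   # sample size, as in the F1 example
nrep = 2000              # number of simulated samples
covered = replicate(nrep, {
  x = rnorm(n, mean = mu, sd = sigma)
  ci = t.test(x)$conf.int
  # TRUE if the interval contains the population mean
  ci[1] <= mu && mu <= ci[2]
})
# proportion of intervals that contain the population mean
mean(covered)
```

Under these assumptions the proportion should come out close to 0.95.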
A t-test is most commonly applied to work out whether the difference between two means is significant. In a t-test, there is a numeric variable and a categorical variable (factor) with two levels.
Two examples of data that can be analysed with a t-test are as follows.
## F2 Sprache Vpn
## 1 1904.296 E S1
## 2 1938.509 E S2
## 3 1519.615 E S3
## 4 1904.047 E S4
## 5 1891.154 E S5
## 6 1938.889 E S6
With respect to these data the question becomes: Is F2 influenced by Sprache? F2 is clearly numeric and Sprache is a factor with two levels, as can be seen from the output of summary():
## F2 Sprache Vpn
## Min. :1490 D:15 S1 : 1
## 1st Qu.:1837 E:12 S10 : 1
## Median :1947 S11 : 1
## Mean :1957 S12 : 1
## 3rd Qu.:2112 S13 : 1
## Max. :2266 S14 : 1
## (Other):21
Is F2 influenced by Stress?
## F2 Stress Subject
## 1 2577 stressed S1
## 2 2122 stressed S2
## 3 2192 stressed S3
## 4 2581 stressed S4
## 5 2227 stressed S5
## 6 2481 stressed S6
In this case, F2 is the numeric variable and Stress is the factor with two levels:
## F2 Stress Subject
## Min. :1728 stressed :12 S1 : 2
## 1st Qu.:2047 unstressed:12 S10 : 2
## Median :2150 S11 : 2
## Mean :2176 S12 : 2
## 3rd Qu.:2333 S2 : 2
## Max. :2581 S3 : 2
## (Other):12
Thus both e.df and form.df can be analysed with a t-test.
There are two types of t-test, unpaired and paired, that depend on whether the categorical variable is a so-called between-subjects or within-subjects factor.
A between-subjects factor is when each subject has a value for only one of its levels. Sprache of e.df is a between-subjects factor for this reason:
## Sprache
## Vpn D E
## S1 0 1
## S10 0 1
## S11 0 1
## S12 0 1
## S13 1 0
## S14 1 0
## S15 1 0
## S16 1 0
## S17 1 0
## S18 1 0
## S19 1 0
## S2 0 1
## S20 1 0
## S21 1 0
## S22 1 0
## S23 1 0
## S24 1 0
## S25 1 0
## S26 1 0
## S27 1 0
## S3 0 1
## S4 0 1
## S5 0 1
## S6 0 1
## S7 0 1
## S8 0 1
## S9 0 1
That is, a subject is EITHER English OR German but never both.
A within-subjects factor is when the subject has a value for more than one (and typically all) of its levels. For this reason, Stress of form.df is a within-subjects factor:
## Stress
## Subject stressed unstressed
## S1 1 1
## S10 1 1
## S11 1 1
## S12 1 1
## S2 1 1
## S3 1 1
## S4 1 1
## S5 1 1
## S6 1 1
## S7 1 1
## S8 1 1
## S9 1 1
That is, for each subject there is an F2 value for stressed AND unstressed.
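Cross-tabulations of subject against factor like the two above can be produced with table(). The sketch below is an illustration only: toy.df is an invented data-frame (its column Lang and all its values are hypothetical) and levels.per.subject() is a hypothetical helper that counts how many levels of a factor are observed for each subject. A count of 1 for every subject points to a between-subjects factor, a count of 2 to a within-subjects factor:

```r
# invented toy data: three subjects, each recorded in both stress conditions
toy.df = data.frame(
  Subject = rep(c("S1", "S2", "S3"), each = 2),
  Stress  = rep(c("stressed", "unstressed"), times = 3),
  Lang    = rep(c("E", "D", "E"), each = 2))
# hypothetical helper: number of factor levels observed per subject
levels.per.subject = function(subject, fac) {
  rowSums(table(subject, fac) > 0)
}
# Stress: every subject has both levels (within-subjects)
within.check = with(toy.df, levels.per.subject(Subject, Stress))
# Lang: every subject has only one level (between-subjects)
between.check = with(toy.df, levels.per.subject(Subject, Lang))
within.check
between.check
```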
In speech science, between-subjects factors are quite easy to identify: these are ones that typically deal with attributes of the subject such as whether the subject is male OR female, whether the subject is old OR young, whether the subject is a smoker OR non-smoker etc. By contrast, linguistic, prosodic, or phonetic categories are often (but by no means always!) within-subjects factors. For example:
- Vowel (levels e, a).
- Rate (levels slow, fast).
- Category (levels content, function).
Carrying out a t-test involves setting up a null hypothesis H0, an alternative hypothesis H1, and one or more so-called alpha levels which are used to accept or reject the null hypothesis. With regard to the first of the earlier examples:
- H0: F2 is unaffected by language (English vs. German)
- H1: F2 is affected by language (English vs. German)
- α levels: 0.95, 0.99, 0.999.
The t-test works out the probability that the difference between the two sample means could be zero. If this probability is less than 0.05 (i.e. less than 1 - α for α = 0.95), then H0 is rejected and H1 accepted. If this probability is greater than 0.05, then H0 is accepted.
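The comparison of a probability against the thresholds derived from the α levels can be sketched as follows. This is an illustration only: the p-value of 0.03 is invented, and the names alpha, thresholds, passed and reject.H0 are chosen here for convenience:

```r
# the three alpha levels listed above
alpha = c(0.95, 0.99, 0.999)
# corresponding thresholds: 0.05, 0.01, 0.001
thresholds = 1 - alpha
# hypothetical p-value from some t-test
p = 0.03
# thresholds that the p-value falls below
passed = thresholds[p < thresholds]
passed
# H0 is rejected if the p-value is below at least one threshold
reject.H0 = length(passed) > 0
reject.H0
```

With this invented p-value only the 0.05 threshold is passed, so H0 would be rejected at the α level of 0.95 but not at 0.99 or 0.999.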
A common practice in speech science is (as above) to choose three α levels of 0.95, 0.99, 0.999 which give significance levels or thresholds of 1 - c(0.95, 0.99, 0.999), i.e. 0.05, 0.01, 0.001. If three such levels are set, then:
The procedure for applying t-tests or indeed for carrying out any statistical test is to:
For all the reasons given here, a boxplot is an appropriate figure when concerned (as with a t-test) with the effects of a factor on a numerical variable.
This is the t-test when the two-level categorical variable is within-subject, as in the case of Stress in form.df. The test is whether the mean of the differences between the two levels, calculated separately by subject, is different from zero.
There are two appropriate plots for such data. The first shows the within-subject differences between stressed and unstressed as one line per subject:
form.df %>%
ggplot() +
aes(y = F2, x = Stress, col = Subject, group = Subject) +
geom_point() +
geom_line()
If the majority of the lines (one per subject) are flat, then the difference is unlikely to be significant. In the above example, on the other hand, most lines slope downwards, i.e. show a difference for most subjects such that F2 stressed is greater than F2 unstressed.
The second type of figure is a boxplot of the within-subject differences between stressed and unstressed:
form.df %>%
pivot_wider(names_from = Stress,
values_from = F2) %>%
mutate(F2diff = stressed - unstressed) %>%
ggplot() +
aes(y = F2diff) +
geom_boxplot()
In this plot, the F2 value of unstressed has been subtracted from the F2 value of stressed separately for each subject – so the boxplot consists of 12 values (because there are 12 subjects). If the interquartile range of the boxplot includes zero, then the difference is likely to be non-significant. In this case however, the interquartile range does not include zero and this suggests that the t-test may well be significant.
The paired t-test calculates a confidence interval for the within-subject difference between stressed and unstressed that can be used to assess the probability that this mean difference (across all subjects) is zero:
form.df %>%
pivot_wider(names_from = Stress,
values_from = F2) %>%
mutate(F2diff = stressed - unstressed) %>%
pull(F2diff) %>%
t.test()
##
## One Sample t-test
##
## data: .
## t = 4.3543, df = 11, p-value = 0.001147
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 134.0163 407.9837
## sample estimates:
## mean of x
## 271
The test gives the following information.
mean of x: This is the mean of stressed - unstressed calculated separately for each subject:
form.df %>%
pivot_wider(names_from = Stress,
values_from = F2) %>%
mutate(F2diff = stressed - unstressed) %>%
summarise(mean = mean(F2diff))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 271
95 percent confidence interval: This is the 95% confidence interval for the mean within-subject difference between stressed and unstressed. It states that this mean difference (specifically stressed - unstressed) lies between 134.0163 and 407.9837 Hz with a probability of 95%. Importantly, since this range does not include zero, the probability that this difference could be zero is less than 0.05 (and so significant at an α level of 0.95).
df: The degrees of freedom are related to the sample size – in this case to the number of subjects. There are 12 subjects and the number of degrees of freedom is one less than this.
The t-statistic is the distance, measured in numbers of standard errors, between zero and the mean of the distribution (here the mean difference of 271 Hz).
The (estimated population) standard error, SE, is given by σ/√N, where σ is the (estimated population) standard deviation and N is 12 (because there are 12 subjects). The t-statistic is then given by (271 - 0)/SE. The manual calculation of these quantities for these data is as follows:
form.df %>%
pivot_wider(names_from = Stress,
values_from = F2) %>%
mutate(F2diff = stressed - unstressed) %>%
summarise(m = mean(F2diff),
SE = sd(F2diff)/sqrt(12),
tstat = m/SE)
## # A tibble: 1 × 3
## m SE tstat
## <dbl> <dbl> <dbl>
## 1 271 62.2 4.35
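The p-value and the confidence interval reported by the test can also be recovered from these quantities. The following is a minimal sketch that takes the mean difference, the t-statistic and the degrees of freedom from the output above. The names m, tstat, dof, SE, p and ci are chosen here for illustration:

```r
m = 271            # mean difference, from the output above
tstat = 4.3543     # t-statistic, from the output above
dof = 11           # degrees of freedom: 12 subjects - 1
# standard error implied by the mean and the t-statistic
SE = m / tstat
# two-sided p-value: area in both tails of the t-distribution beyond |t|
p = 2 * pt(-abs(tstat), df = dof)
# 95% confidence interval: mean plus/minus the 97.5% t-quantile times SE
ci = m + qt(c(0.025, 0.975), df = dof) * SE
p
ci
```

Because the t-statistic is entered here in rounded form, the results agree with the t.test() output only up to rounding.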
The results are reported by:
The report in this case is either:
or:
or:
Further details on Null hypothesis testing and p values are covered here (in German).
If you have loaded the broom package with library(broom), then the output of a t-test can be displayed in a more readable form using the tidy() function after applying the t-test, thus:
form.df %>%
pivot_wider(names_from = Stress,
values_from = F2) %>%
mutate(F2diff = stressed - unstressed) %>%
pull(F2diff) %>%
t.test() %>%
tidy()
## # A tibble: 1 × 8
## estimate statistic p.value parameter conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 271 4.35 0.00115 11 134. 408. One Sampl… two.sided
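The one-sample test on the within-subject differences used in this section gives the same result as calling t.test() on the two sets of values with the argument paired = TRUE. The sketch below illustrates this with invented F2 values for five hypothetical subjects (not the form.df data):

```r
# invented F2 values (Hz) for five hypothetical subjects,
# in the same subject order in both vectors
stressed   = c(2400, 2250, 2310, 2500, 2180)
unstressed = c(2150, 2100, 2290, 2200, 2050)
# one-sample t-test on the within-subject differences
one.sample = t.test(stressed - unstressed)
# paired t-test on the two vectors
paired = t.test(stressed, unstressed, paired = TRUE)
# t-statistics and p-values are identical
c(one.sample$statistic, paired$statistic)
c(one.sample$p.value, paired$p.value)
```

The paired form requires that the two vectors list the subjects in the same order, which is why the examples above compute the difference per subject first.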
This is appropriate when the categorical variable is a between-subjects factor, as established earlier for the data-frame e.df. First a plot:
The F2 values for the German speakers are generally greater than those of the English speakers. The corresponding t-test is a test of whether the difference between (i) the F2-mean for the English speakers and (ii) the F2-mean for the German speakers is significantly different from zero. The difference between the two means:
e.df %>%
group_by(Sprache) %>%
# calculate F2-means, one per language
summarise(F2 = mean(F2)) %>%
# extract these F2-means
pull(F2) %>%
# and subtract one from the other
diff()
## [1] -167.0991
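is 167.0991 Hz in magnitude (the output is negative because diff() subtracts the first mean, German, from the second, English) and, as the t-test shows below, significantly different from 0 Hz.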
This t-test is accomplished with t.test(F2 ~ Sprache, data = e.df):
##
## Welch Two Sample t-test
##
## data: F2 by Sprache
## t = 2.2613, df = 21.101, p-value = 0.03443
## alternative hypothesis: true difference in means between group D and group E is not equal to 0
## 95 percent confidence interval:
## 13.46719 320.73097
## sample estimates:
## mean in group D mean in group E
## 2031.672 1864.573
Or more simply still using the so-called dot notation where the dot stands for the data-frame (and here also using the tidy() function), i.e. e.df %>% t.test(F2 ~ Sprache, data = .) %>% tidy():
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 167. 2032. 1865. 2.26 0.0344 21.1 13.5 321.
## # ℹ 2 more variables: method <chr>, alternative <chr>
The corresponding report is:
or
or
For complex reasons whose explanation is beyond the scope of this module, the unpaired t-test in R gives fractional degrees of freedom (21.1 in this case). This comes about because by default the unpaired t-test is carried out under the assumption that the F2-variance (square of the standard deviation) for the German data and the F2-variance for the English data are unequal. It is possible to carry out the unpaired t-test under the assumption that the variances are equal. This would be: t.test(F2 ~ Sprache, data = e.df, var.equal=T) in which case the number of degrees of freedom is equal to the number of observations (assuming 1 per subject) minus 2 (27 - 2 = 25 in this case).
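The difference in degrees of freedom between the two versions of the unpaired test can be seen in the following sketch. It uses an invented stand-in for e.df with the same column names (toy.df and all its F2 values are hypothetical), so the numbers do not correspond to the results above:

```r
# invented data: 5 hypothetical German (D) and 4 hypothetical English (E) speakers
toy.df = data.frame(
  F2 = c(2050, 1980, 2120, 2010, 2075, 1900, 1850, 1930, 1875),
  Sprache = factor(c("D", "D", "D", "D", "D", "E", "E", "E", "E")))
# default: Welch test, variances not assumed to be equal
welch = t.test(F2 ~ Sprache, data = toy.df)
# variances assumed to be equal
pooled = t.test(F2 ~ Sprache, data = toy.df, var.equal = TRUE)
# degrees of freedom: typically fractional for Welch,
# number of observations minus 2 (here 9 - 2 = 7) for equal variances
welch$parameter
pooled$parameter
```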
Most parameters in speech are likely to follow the normal distribution and if they do not, then this is typically because of outliers. There are various diagnostic tests for whether the data are normally distributed that have been discussed repeatedly and so will not be reviewed here (see instead e.g. here for a summary). The analysis with t-tests can always be recomputed using a Wilcoxon test (the signed rank test for paired data, the rank sum test for unpaired data), which makes no assumptions about the data being normally distributed. The syntax is the same as when using t-tests. For the paired t-test analysed above:
form.df %>%
pivot_wider(names_from = Stress,
values_from = F2) %>%
mutate(F2diff = stressed - unstressed) %>%
pull(F2diff) %>%
wilcox.test()
## Warning in wilcox.test.default(.): cannot compute exact p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: .
## V = 75, p-value = 0.005338
## alternative hypothesis: true location is not equal to 0
which can be reported as:
- There was a significant influence of stress on F2 (Wilcoxon signed rank test, p < 0.01).
For the unpaired t-test analysed above, the corresponding test can be obtained with wilcox.test(F2 ~ Sprache, data = e.df):
##
## Wilcoxon rank sum exact test
##
## data: F2 by Sprache
## W = 131, p-value = 0.04687
## alternative hypothesis: true location shift is not equal to 0
which can be reported as:
- F2 was significantly influenced by language (Wilcoxon rank sum test, p < 0.05).
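The two forms of the Wilcoxon test can be tried out on a small scale as follows. This is a sketch with invented values only (none of the numbers come from form.df or e.df, and toy.df is a hypothetical stand-in): the signed rank test is applied to within-subject differences, and the rank sum test to a numeric variable split by a two-level between-subjects factor:

```r
# paired case: invented F2 values (Hz) for five hypothetical subjects
stressed   = c(2400, 2250, 2310, 2500, 2180)
unstressed = c(2150, 2100, 2290, 2200, 2050)
# Wilcoxon signed rank test on the within-subject differences
signed = wilcox.test(stressed - unstressed)
# unpaired case: invented data for two hypothetical speaker groups
toy.df = data.frame(
  F2 = c(2050, 1980, 2120, 2010, 2075, 1900, 1850, 1930, 1875),
  Sprache = factor(c("D", "D", "D", "D", "D", "E", "E", "E", "E")))
# Wilcoxon rank sum test with the same formula syntax as t.test()
ranksum = wilcox.test(F2 ~ Sprache, data = toy.df)
signed$p.value
ranksum$p.value
```

The invented values contain no ties, so exact p-values are computed and no warning of the kind shown above is issued.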