If not already done, carry out parts 1-5 of the setup, as described here.
Load this library:
library(tidyverse)
Load these objects:
urla = "https://www.phonetik.uni-muenchen.de/studium_lehre/"
urlb = "lehrmaterialien/R_speech_processing/Rdf"
url = paste0(urla, urlb)
int = read.table(file.path(url, "intdauer.txt"),
                 stringsAsFactors = T)
vdata = read.table(file.path(url, "vdata.txt"),
                   stringsAsFactors = T)
This module is about summary statistics, which are almost always calculated before any statistical test is carried out. Summary statistics provide a general overview of the data. Different types of summary statistics are applied to numeric and categorical data. The important differences between these two types of data should be reviewed here before continuing with this module.
Most of the functions for summary statistics are evident in the output of the summary() function, e.g. as applied to the data-frame vdata:
summary(vdata)
## X Y F1 F2
## Min. :41.27 Min. :-9.110 Min. : 0.0 Min. : 0
## 1st Qu.:52.39 1st Qu.: 1.323 1st Qu.: 300.0 1st Qu.:1073
## Median :55.94 Median : 4.500 Median : 366.0 Median :1422
## Mean :56.44 Mean : 4.573 Mean : 407.3 Mean :1405
## 3rd Qu.:59.74 3rd Qu.: 7.590 3rd Qu.: 509.8 3rd Qu.:1742
## Max. :78.77 Max. :19.890 Max. :1114.0 Max. :2746
##
## dur V Tense Cons Rate Subj
## Min. : 41.64 %:426 -:1503 K:1002 a:1470 bk:415
## 1st Qu.:109.81 A:432 +:1479 P: 985 b:1512 ck:435
## Median :133.22 E:425 T: 995 fs:436
## Mean :141.47 I:424 hp:436
## 3rd Qu.:166.39 O:428 ht:421
## Max. :307.20 U:423 mh:420
## Y:424 ta:419
Some further details are as follows:
# create a vector of values
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
# mean
mean(vec)
## [1] 6
# variance
var(vec)
## [1] 87.25
# standard deviation
sd(vec)
## [1] 9.340771
# minimum
min(vec)
## [1] -8
# maximum
max(vec)
## [1] 19
# range
range(vec)
## [1] -8 19
# first quartile (25% quantile)
quantile(vec, 0.25)
## 25%
## 0
# third quartile
quantile(vec, 0.75)
## 75%
## 12
# median or 2nd quartile
median(vec)
## [1] 4
# as above
quantile(vec, 0.5)
## 50%
## 4
# interquartile range
IQR(vec)
## [1] 12
# as above
quantile(vec, 0.75) - quantile(vec, 0.25)
## 75%
## 12
The mean is the sum of the values divided by the number of values:
sum(vec)/length(vec)
## [1] 6
# the same
mean(vec)
## [1] 6
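The median is the middle value when the values are sorted in ascending order, i.e. the value at position (N-1)/2 + 1, where N is the length of the vector. The length of the vector in this example is 9:
length(vec)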
## [1] 9
so the middle ordered value is ((9-1) * 0.5) + 1 = 5. Now sort the vector in ascending order and find the 5th value:
vec.sort = sort(vec)
vec.sort[5]
## [1] 4
# the same
median(vec)
## [1] 4
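The position formula above gives a whole number only when the vector has an odd number of values. As a minimal sketch of the even case, the following adds one extra value to vec (the value 25 and the names vec.even and med.by.hand are made up for illustration). The position then falls halfway between two ordered values, and median() returns their average.

```r
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
# hypothetical 10th value, so that the length is even
vec.even = c(vec, 25)
# position of the "middle" value: (10-1)/2 + 1 = 5.5
pos = (length(vec.even) - 1)/2 + 1
vec.even.sort = sort(vec.even)
# average of the 5th and 6th ordered values
med.by.hand = mean(vec.even.sort[c(floor(pos), ceiling(pos))])
med.by.hand
# the same
median(vec.even)
```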
The median is less sensitive to outliers than the mean. For example:
# make another vector, the same as `vec`
vec2 = vec
# make the first value of `vec2` an outlier
vec2[1] = 20000
vec
## [1] 10 12 -8 4 2 0 19 -3 18
vec2
## [1] 20000 12 -8 4 2 0 19 -3 18
# mean of vec and vec2: big difference
mean(vec)
## [1] 6
mean(vec2)
## [1] 2227.111
# median of vec and vec2: the same
median(vec)
## [1] 4
median(vec2)
## [1] 4
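The same contrast holds for measures of spread. As a rough sketch using the same two vectors, the standard deviation, which is based on the mean, is greatly inflated by the outlier, whereas the interquartile range, which is based on the quartiles, changes only slightly.

```r
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
vec2 = vec
vec2[1] = 20000
# standard deviation: strongly affected by the outlier
sd(vec)
sd(vec2)
# interquartile range: only slightly affected
IQR(vec)
IQR(vec2)
```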
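The first quartile can be found in the same way as the median: it is the value at position ((9-1) * 0.25) + 1 = 3 of the sorted vector:
vec.sort
## [1] -8 -3 0 2 4 10 12 18 19
vec.sort[3]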
## [1] 0
# the same
quantile(vec, 0.25)
## 25%
## 0
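The position rule generalises to any proportion p: the position is (N-1) * p + 1, and when this is not a whole number the result is interpolated linearly between the two neighbouring ordered values. This is the default behaviour of quantile() (its type 7). The helper quantile_at() below is a hypothetical name used only to illustrate the calculation; it is not part of R.

```r
# hypothetical helper: quantile from the position (N-1) * p + 1
quantile_at = function(x, p) {
  x.sort = sort(x)
  pos = (length(x) - 1) * p + 1
  lo = floor(pos)
  hi = ceiling(pos)
  # linear interpolation between the neighbouring ordered values
  x.sort[lo] + (pos - lo) * (x.sort[hi] - x.sort[lo])
}
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
# whole-number position (3): no interpolation needed
quantile_at(vec, 0.25)
# position 1.8: interpolate between the 1st and 2nd ordered values
quantile_at(vec, 0.1)
# the same
quantile(vec, 0.1)
```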
The variance measures how widely the values are spread around the mean:
vec
## [1] 10 12 -8 4 2 0 19 -3 18
# the values in `vec2` have a smaller variance than those in `vec`
# because they cluster more tightly around the mean
vec2 = c(10, 12, -1, 4, 2, 0, 10, -1, 11)
var(vec)
## [1] 87.25
var(vec2)
## [1] 30.19444
# the variance of this must be zero
var(rep(1, 10))
## [1] 0
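As a sketch of what var() computes: the variance is the sum of the squared deviations from the mean divided by N-1, where N is the length of the vector. The names dev and var.by.hand are chosen here only for illustration.

```r
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
# deviation of each value from the mean
dev = vec - mean(vec)
# sum of squared deviations divided by N-1
var.by.hand = sum(dev^2) / (length(vec) - 1)
var.by.hand
# the same
var(vec)
```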
The standard deviation is the square root of the variance:
sqrt(var(vec))
## [1] 9.340771
# or
(var(vec))^0.5
## [1] 9.340771
# the same:
sd(vec)
## [1] 9.340771
There are numerous ways of applying any of the above functions to numeric variables of data-frames. Here are 4 of them used to calculate the mean of the numeric variable F1 in the data-frame vdata:
mean(vdata$F1)
## [1] 407.2877
with(vdata, mean(F1))
## [1] 407.2877
# pull(F1) extracts the F1 values from the data-frame
vdata %>% pull(F1) %>% mean()
## [1] 407.2877
# this makes another data-frame with the variable F1mean
vdata %>% summarise(F1mean = mean(F1))
## F1mean
## 1 407.2877
By far the most useful of these for plotting and applying statistical functions will be the last, involving summarise(), because it can be applied to several variables at once:
# mean of F1 and mean of F2. These summary variables
# are stored in mF1 and mF2; but you can choose other names
vdata %>%
  summarise(mF1 = mean(F1), mF2 = mean(F2))
## mF1 mF2
## 1 407.2877 1404.922
# as above but include the standard deviations and interquartile ranges
vdata %>%
  summarise(mF1 = mean(F1),
            mF2 = mean(F2),
            sF1 = sd(F1),
            sF2 = sd(F2),
            iF1 = IQR(F1),
            iF2 = IQR(F2))
## mF1 mF2 sF1 sF2 iF1 iF2
## 1 407.2877 1404.922 145.7893 481.802 209.75 668.5
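When the same functions are applied to several variables, the repetition can be avoided with the dplyr function across(). The following is a minimal sketch on a small made-up data-frame (toy, with invented F1 and F2 values) rather than on vdata. The output names such as F1_m are formed from the variable name and the name given to each function in the list.

```r
library(dplyr)
# made-up data for illustration only
toy = data.frame(F1 = c(300, 350, 700, 650),
                 F2 = c(2100, 2000, 1200, 1300))
# mean and standard deviation of F1 and of F2 in one step
toy.sum = toy %>%
  summarise(across(c(F1, F2), list(m = mean, s = sd)))
names(toy.sum)
toy.sum$F1_m
```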
The function mutate() that was discussed in an earlier module produces the same numerical output as summarise(). The main difference is that whereas the output from summarise() is in a separate data-frame consisting of just the newly created variables, mutate() attaches the new variables to an existing data-frame. For example, the earlier command:
vdata %>%
  summarise(mF1 = mean(F1), mF2 = mean(F2))
## mF1 mF2
## 1 407.2877 1404.922
could also be accomplished with:
mdata = vdata %>%
  mutate(mF1 = mean(F1), mF2 = mean(F2))
head(mdata)
## X Y F1 F2 dur V Tense Cons Rate Subj mF1 mF2
## 1 52.99 4.36 313 966 106.9 % - P a bk 407.2877 1404.922
## 2 53.61 3.65 322 2058 86.0 I - T a bk 407.2877 1404.922
## 3 55.14 10.44 336 1186 123.4 Y - K a bk 407.2877 1404.922
## 4 53.06 4.75 693 2149 119.2 E - T a bk 407.2877 1404.922
## 5 52.74 6.46 269 2008 196.3 Y + K a bk 407.2877 1404.922
## 6 53.30 4.70 347 931 77.5 Y - P a bk 407.2877 1404.922
from which it becomes clear that the mean F1 and mean F2 values have been attached to every observation of the data-frame vdata. In general, summarise() is the preferred function when (as the name suggests) providing summary statistics across several observations, i.e. when applying summary functions like mean(), median(), var() and so on. mutate() should instead be used when a separately calculated value is needed for each observation.
Consider in this regard a common transformation in speech science of converting numeric (especially formant) data into so-called z-scores which show how many standard deviations a value is from the mean of the distribution. Applied to the earlier example of the vector of numbers, this is given by the following:
(vec - mean(vec))/sd(vec)
## [1] 0.4282302 0.6423453 -1.4988056 -0.2141151 -0.4282302 -0.6423453 1.3917481
## [8] -0.9635179 1.2846905
The same can be applied to a data-frame using mutate(). In this case, z-scores are calculated for F1 and the output is additionally stored as a new data-frame:
# store the output in the new data-frame `zdata`
zdata =
  vdata %>%
  mutate(F1z = (F1 - mean(F1))/sd(F1))
The above could also be done with summarise() but mutate() is, for the reason stated above, the preferred function in this case. Finally, the above can also be accomplished by writing a function and then passing that function to mutate(). Here is a function called zscore() for computing z-scores from any vector:
zscore = function(x) {
  (x - mean(x))/sd(x)
}
Here are the z-scores for vec:
vec
## [1] 10 12 -8 4 2 0 19 -3 18
zscore(vec)
## [1] 0.4282302 0.6423453 -1.4988056 -0.2141151 -0.4282302 -0.6423453 1.3917481
## [8] -0.9635179 1.2846905
Here are the z-scores for F1 and for F2 of vdata:
zdata =
  vdata %>%
  mutate(zF1 = zscore(F1), zF2 = zscore(F2))
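A quick way to check a z-score transformation is that the result should have a mean of zero and a standard deviation of one. The sketch below does this for vec and also compares zscore() with the base R function scale(), which by default centres on the mean and divides by the standard deviation.

```r
zscore = function(x) {
  (x - mean(x))/sd(x)
}
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
z = zscore(vec)
# the mean is zero (up to rounding error)
round(mean(z), 10)
# the standard deviation is one
sd(z)
# scale() returns a one-column matrix; as.vector() turns it into a vector
all.equal(z, as.vector(scale(vec)))
```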
The simplest way to count how often each level of a categorical variable in a data-frame occurs is with the functions group_by() and n() inside summarise(). For example, to make a table showing how often the separate levels of the factor V in the data-frame vdata occur:
vdata %>%
  group_by(V) %>%
  summarise(count = n()) %>%
  ungroup()
## # A tibble: 7 × 2
## V count
## <fct> <int>
## 1 % 426
## 2 A 432
## 3 E 425
## 4 I 424
## 5 O 428
## 6 U 423
## 7 Y 424
# do the same but store it as a new data-frame:
Vcount =
  vdata %>%
  group_by(V) %>%
  summarise(count = n()) %>%
  ungroup()
Vcount
## # A tibble: 7 × 2
## V count
## <fct> <int>
## 1 % 426
## 2 A 432
## 3 E 425
## 4 I 424
## 5 O 428
## 6 U 423
## 7 Y 424
from which it becomes immediately apparent that V is made up of 7 levels.
The above code with group_by() returns an object called a tibble, as shown here:
class(Vcount)
## [1] "tbl_df" "tbl" "data.frame"
A tibble is a simplified version of a data-frame and the difference between the two will be largely unimportant for the subsequent statistical analyses to be considered.
Conversion back to a data-frame is possible as follows:
Vcount.df = Vcount %>% as.data.frame()
class(Vcount.df)
## [1] "data.frame"
The function as_tibble() can be used to convert a data-frame to a tibble. More information about tibbles is in chapter 10, R for Data Science.
It’s always a good idea to use the function ungroup() after using the group_by() function because otherwise the result can remain grouped, and this grouping is carried over silently into subsequent calculations, which can influence them in unexpected ways. Using ungroup() in the above example was in fact unnecessary because there was only one grouping variable V, but when there is more than one grouping variable (as in the example below), ungroup() should always be used.
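The reason is that summarise() removes only the last grouping variable, so a result based on two or more grouping variables is still grouped by the remaining ones. The sketch below shows this on a small made-up data-frame (toy, with invented values). The function group_vars() lists the variables by which a data-frame is currently grouped, and the argument .groups = "drop" of summarise() is an alternative to a final ungroup().

```r
library(dplyr)
# made-up data for illustration only
toy = data.frame(V = c("A", "A", "A", "I", "I"),
                 Cons = c("K", "K", "P", "K", "P"))
# without ungroup(): the result is still grouped by `V`
g1 = toy %>%
  group_by(V, Cons) %>%
  summarise(count = n())
group_vars(g1)
# with .groups = "drop": no grouping remains
g2 = toy %>%
  group_by(V, Cons) %>%
  summarise(count = n(), .groups = "drop")
group_vars(g2)
```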
A table of levels and their frequency of occurrence for the combination of any factors can be made by passing the names of the factors to the group_by() function together, thus:
vdata %>%
  group_by(V, Cons) %>%
  summarise(count = n()) %>%
  ungroup()
## `summarise()` has grouped output by 'V'. You can override using the `.groups` argument.
## # A tibble: 21 × 3
## V Cons count
## <fct> <fct> <int>
## 1 % K 142
## 2 % P 142
## 3 % T 142
## 4 A K 147
## 5 A P 141
## 6 A T 144
## 7 E K 144
## 8 E P 140
## 9 E T 141
## 10 I K 142
## # … with 11 more rows
For example, the fourth line means that there are 147 observations which have level A (of factor V) and level K (of factor Cons). The first few rows above show that there are three consonant levels per vowel level. The function n_distinct() can be used to count how many levels of one factor occur relative to any other. For example:
vdata %>%
  group_by(V) %>%
  summarise(count = n_distinct(Cons)) %>%
  ungroup()
## # A tibble: 7 × 2
## V count
## <fct> <int>
## 1 % 3
## 2 A 3
## 3 E 3
## 4 I 3
## 5 O 3
## 6 U 3
## 7 Y 3
means that for each of the 7 vowel levels in factor V there are three consonant levels (of factor Cons). The following:
vdata %>%
  group_by(V, Cons) %>%
  summarise(count = n_distinct(Subj)) %>%
  ungroup()
## `summarise()` has grouped output by 'V'. You can override using the `.groups` argument.
## # A tibble: 21 × 3
## V Cons count
## <fct> <fct> <int>
## 1 % K 7
## 2 % P 7
## 3 % T 7
## 4 A K 7
## 5 A P 7
## 6 A T 7
## 7 E K 7
## 8 E P 7
## 9 E T 7
## 10 I K 7
## # … with 11 more rows
means that for each combination of consonant and vowel levels, there are 7 subject levels.
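For simple frequency tables like those above, dplyr also provides count(), which combines group_by(), summarise() with n(), and ungroup() in one step. The following is a minimal sketch on a made-up data-frame (toy, with invented values); the name argument sets the name of the count column, which is otherwise n.

```r
library(dplyr)
# made-up data for illustration only
toy = data.frame(V = c("A", "A", "A", "I", "I"),
                 Cons = c("K", "K", "P", "K", "P"))
# the long way, as used above
long.way = toy %>%
  group_by(V, Cons) %>%
  summarise(count = n()) %>%
  ungroup()
# the short way
short.way = toy %>%
  count(V, Cons, name = "count")
short.way
```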
The group_by() function can also be used to calculate summary statistics for numeric variables separately for the levels of a factor. For example, instead of calculating mean F1 for the entire data-frame, separate F1 means might instead be calculated for each of the 7 vowel levels in V. This is done as follows:
vdata %>%
  # for each level in `V`
  group_by(V) %>%
  # calculate the mean of F1
  summarise(mF1 = mean(F1)) %>%
  ungroup()
## # A tibble: 7 × 2
## V mF1
## <fct> <dbl>
## 1 % 424.
## 2 A 645.
## 3 E 426.
## 4 I 311.
## 5 O 434.
## 6 U 304.
## 7 Y 302.
The above can be extended to include any number of factors. Thus to calculate mean F1 for each level when combining consonant and vowel factors:
cvmeans =
  vdata %>%
  group_by(Cons, V) %>%
  summarise(mF1 = mean(F1)) %>%
  ungroup()
## `summarise()` has grouped output by 'Cons'. You can override using the `.groups` argument.
cvmeans
## # A tibble: 21 × 3
## Cons V mF1
## <fct> <fct> <dbl>
## 1 K % 410.
## 2 K A 637.
## 3 K E 419.
## 4 K I 307.
## 5 K O 431.
## 6 K U 296.
## 7 K Y 296.
## 8 P % 440.
## 9 P A 646.
## 10 P E 436.
## # … with 11 more rows
The meaning of the 4th line in the above is: the mean of F1 in /KI/ (i.e. the mean of observations for which Cons is K and for which V is I) is 307 Hz.
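The function group_by() can be combined with mutate() in the same way, in which case the calculation is carried out separately within each group but a value is returned for every observation. The sketch below, on a made-up data-frame (toy, with two invented speakers s1 and s2), converts F1 to z-scores separately for each speaker, so that the mean and standard deviation are those of the speaker rather than of the whole data-frame.

```r
library(dplyr)
# made-up data for illustration only: three F1 values per speaker
toy = data.frame(Subj = rep(c("s1", "s2"), each = 3),
                 F1 = c(300, 500, 700, 400, 600, 800))
toy.z = toy %>%
  # for each speaker
  group_by(Subj) %>%
  # z-score using that speaker's own mean and standard deviation
  mutate(zF1 = (F1 - mean(F1))/sd(F1)) %>%
  ungroup()
toy.z$zF1
```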