1 Summary statistics for numeric variables
2 Summary statistics for categorical variables (factors)
3 Summary statistics for numeric variables grouped by factor.

If not already done, carry out parts 1-5 of the setup, as described here.

Load this library:

library(tidyverse)

Load these objects:

urla = "https://www.phonetik.uni-muenchen.de/studium_lehre/"
urlb = "lehrmaterialien/R_speech_processing/Rdf"
url = paste0(urla, urlb)
int = read.table(file.path(url, "intdauer.txt"),
                 stringsAsFactors = TRUE)
vdata = read.table(file.path(url, "vdata.txt"),
                   stringsAsFactors = TRUE)

This module is about summary statistics, which are almost always calculated before doing any statistical test. Summary statistics are used to provide a general overview of the data. Different types of summary statistics are applied to numeric and categorical data. The important differences between these two types of data should be reviewed here before continuing with this module.

1 Summary statistics for numeric variables

Most of the functions for summary statistics are evident in the output of the summary() function, e.g. as applied to the data-frame vdata:

summary(vdata)

##        X               Y                F1               F2      
##  Min.   :41.27   Min.   :-9.110   Min.   :   0.0   Min.   :   0  
##  1st Qu.:52.39   1st Qu.: 1.323   1st Qu.: 300.0   1st Qu.:1073  
##  Median :55.94   Median : 4.500   Median : 366.0   Median :1422  
##  Mean   :56.44   Mean   : 4.573   Mean   : 407.3   Mean   :1405  
##  3rd Qu.:59.74   3rd Qu.: 7.590   3rd Qu.: 509.8   3rd Qu.:1742  
##  Max.   :78.77   Max.   :19.890   Max.   :1114.0   Max.   :2746  
##                                                                  
##       dur         V       Tense    Cons     Rate     Subj    
##  Min.   : 41.64   %:426   -:1503   K:1002   a:1470   bk:415  
##  1st Qu.:109.81   A:432   +:1479   P: 985   b:1512   ck:435  
##  Median :133.22   E:425            T: 995            fs:436  
##  Mean   :141.47   I:424                              hp:436  
##  3rd Qu.:166.39   O:428                              ht:421  
##  Max.   :307.20   U:423                              mh:420  
##                   Y:424                              ta:419

Some further details are as follows:

# create a vector of values
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
# mean 
mean(vec)

## [1] 6

# variance
var(vec)

## [1] 87.25

# standard deviation
sd(vec)

## [1] 9.340771

# minimum
min(vec)

## [1] -8

# maximum
max(vec)

## [1] 19

# range
range(vec)

## [1] -8 19

# first quartile (25% quantile)
quantile(vec, 0.25)

## 25% 
##   0

# third quartile
quantile(vec, 0.75)

## 75% 
##  12

# median or 2nd quartile
median(vec)

## [1] 4

# as above
quantile(vec, 0.5)

## 50% 
##   4

# interquartile range
IQR(vec)

## [1] 12

# as above
quantile(vec, 0.75) - quantile(vec, 0.25)

## 75% 
##  12

The mean is the sum of values divided by the number of values:

sum(vec)/length(vec)

## [1] 6

# the same
mean(vec)

## [1] 6

The median or 0.5 quantile is the middle value after the values have been sorted. When the number of values N is odd, the position of the middle value in the sorted vector is (N-1)/2 + 1; when N is even, the median is the mean of the two middle values. The length of the vector in this example is 9:

length(vec)

## [1] 9

so the position of the middle value is ((9-1) * 0.5) + 1 = 5. Now sort the vector in ascending order and find the 5th value:

vec.sort = sort(vec)
vec.sort[5]

## [1] 4

# the same
median(vec)

## [1] 4

The median is less sensitive to outliers than the mean. For example:

# make another vector, the same as `vec`
vec2 = vec
# make the first value of `vec2` an outlier
vec2[1] = 20000
vec

## [1] 10 12 -8  4  2  0 19 -3 18

vec2

## [1] 20000    12    -8     4     2     0    19    -3    18

# mean of vec and vec2: big difference
mean(vec)

## [1] 6

mean(vec2)

## [1] 2227.111

# median of vec and vec2: the same
median(vec)

## [1] 4

median(vec2)

## [1] 4

The 25% or 0.25 quantile is obtained in the same way as the median, but this time multiply by 0.25 to find which sorted value is needed: ((9-1) * 0.25) + 1 = 3. So the 0.25 quantile is:

vec.sort

## [1] -8 -3  0  2  4 10 12 18 19

vec.sort[3]

## [1] 0

# the same
quantile(vec, 0.25)

## 25% 
##   0
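
The position rule used above for the median and the 0.25 quantile can be written as a small function. The following is only a sketch: the name quantile_by_hand() is made up for illustration, and it assumes the linear-interpolation rule that quantile() uses by default (its type 7), in which a fractional position is resolved by interpolating between the two neighbouring sorted values.

```r
# Hypothetical helper (not part of the module): quantile by linear interpolation,
# the rule that quantile() uses by default (type 7)
quantile_by_hand = function(x, p) {
  x.sort = sort(x)
  # (possibly fractional) position of the quantile among the sorted values
  h = (length(x) - 1) * p + 1
  lo = floor(h)
  hi = ceiling(h)
  # interpolate between the two neighbouring sorted values
  x.sort[lo] + (h - lo) * (x.sort[hi] - x.sort[lo])
}

vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
# positions 3 and 5 are whole numbers, so no interpolation is needed
quantile_by_hand(vec, 0.25)
quantile_by_hand(vec, 0.5)

# with 8 values the median position is 4.5,
# i.e. halfway between the 4th and 5th sorted values
vec8 = vec[-9]
quantile_by_hand(vec8, 0.5)
# the same
median(vec8)
```

The helper is only meant to make the arithmetic visible; in practice quantile() and median() should be used directly.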

The variance is a measure of how dispersed the values are.

vec

## [1] 10 12 -8  4  2  0 19 -3 18

# the values in `vec2` have a smaller variance than those in `vec`
# because they cluster more tightly around the mean
vec2 = c(10, 12, -1, 4, 2, 0, 10, -1, 11)
var(vec)

## [1] 87.25

var(vec2)

## [1] 30.19444

# the variance of this must be zero
var(rep(1, 10))

## [1] 0
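
To make the idea of dispersion concrete, the variance can also be calculated step by step. This is a minimal sketch; the object names sq.dev and var.by.hand are chosen here for illustration. Note that var() calculates the sample variance, which divides by N - 1 rather than by N.

```r
vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
# squared deviations of each value from the mean
sq.dev = (vec - mean(vec))^2
# sample variance: sum of squared deviations divided by N - 1
var.by.hand = sum(sq.dev) / (length(vec) - 1)
var.by.hand
# the same
var(vec)
```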

The standard deviation is the square root of the variance:

sqrt(var(vec))

## [1] 9.340771

# or
(var(vec))^0.5

## [1] 9.340771

# the same:
sd(vec)

## [1] 9.340771

There are numerous ways of applying any of the above functions to numeric variables of data-frames. Here are 4 of them used to calculate the mean of the numeric variable F1 in the data-frame vdata:

mean(vdata$F1)

## [1] 407.2877

with(vdata, mean(F1))

## [1] 407.2877

# pull(F1) extracts the F1 values from the data-frame
vdata %>% pull(F1) %>% mean()

## [1] 407.2877

# this makes another data-frame with the variable F1mean
vdata %>% summarise(F1mean = mean(F1))

##     F1mean
## 1 407.2877

By far the most useful of these for plotting and applying statistical functions will be the last involving summarise() because it can be applied to several variables at once:

# mean of F1 and mean of F2. These summary variables
# are stored in mF1 and mF2; but you can choose other names
vdata %>%
  summarise(mF1 = mean(F1), mF2 = mean(F2))

##        mF1      mF2
## 1 407.2877 1404.922

# as above but include the standard deviations and interquartile ranges
vdata %>%
  summarise(mF1 = mean(F1), 
            mF2 = mean(F2),
            sF1 = sd(F1),
            sF2 = sd(F2),
            iF1 = IQR(F1), 
            iF2 = IQR(F2))

##        mF1      mF2      sF1     sF2    iF1   iF2
## 1 407.2877 1404.922 145.7893 481.802 209.75 668.5
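
When the same functions are applied to several variables, the repetition in the command above can be reduced with the dplyr function across(). The sketch below uses a small made-up data-frame, toy, as a stand-in for vdata so that it runs on its own; the suffixes m, s and i are arbitrary names chosen here.

```r
library(dplyr)

# toy data-frame standing in for `vdata` (values are made up for illustration)
toy = data.frame(F1 = c(300, 350, 700, 650),
                 F2 = c(2200, 2100, 1200, 1300))

# apply mean(), sd() and IQR() to both F1 and F2 in one step;
# the output variables are named F1_m, F1_s, F1_i, F2_m, F2_s, F2_i
toy.sum =
  toy %>%
  summarise(across(c(F1, F2), list(m = mean, s = sd, i = IQR)))
toy.sum
```

The explicit form with one argument per statistic is easier to read for two variables; across() tends to pay off when there are many.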

The function mutate() that was discussed in an earlier module produces the same numerical output as summarise(). The main difference is that whereas the output from summarise() is in a separate data-frame consisting of just the newly created variable, mutate() attaches the new variable to an existing data-frame. For example, the earlier command:

vdata %>%
  summarise(mF1 = mean(F1), mF2 = mean(F2))

##        mF1      mF2
## 1 407.2877 1404.922

could also be accomplished with:

mdata = vdata %>%
  mutate(mF1 = mean(F1), mF2 = mean(F2))
head(mdata)

##       X     Y  F1   F2   dur V Tense Cons Rate Subj      mF1      mF2
## 1 52.99  4.36 313  966 106.9 %     -    P    a   bk 407.2877 1404.922
## 2 53.61  3.65 322 2058  86.0 I     -    T    a   bk 407.2877 1404.922
## 3 55.14 10.44 336 1186 123.4 Y     -    K    a   bk 407.2877 1404.922
## 4 53.06  4.75 693 2149 119.2 E     -    T    a   bk 407.2877 1404.922
## 5 52.74  6.46 269 2008 196.3 Y     +    K    a   bk 407.2877 1404.922
## 6 53.30  4.70 347  931  77.5 Y     -    P    a   bk 407.2877 1404.922

from which it becomes clear that the mean F1 and mean F2 values have been attached to every observation of the data-frame vdata. In general, summarise() is the preferred function when (as the name suggests) providing summary statistics across several observations i.e. when applying summary functions like mean(), median(), var() and so on. mutate() should instead be used when a separately calculated value is needed for each observation.

Consider in this regard a common transformation in speech science of converting numeric (especially formant) data into so-called z-scores which show how many standard deviations a value is from the mean of the distribution. Applied to the earlier example of the vector of numbers, this is given by the following:

(vec - mean(vec))/sd(vec)

## [1]  0.4282302  0.6423453 -1.4988056 -0.2141151 -0.4282302 -0.6423453  1.3917481
## [8] -0.9635179  1.2846905

The same can be applied to a data-frame using mutate(). In this case, z-scores are calculated for F1 and the output is additionally stored as a new data frame:

# store the output in the new data-frame `zdata`
zdata = 
  vdata %>%
  mutate(F1z = (F1 - mean(F1))/sd(F1))

The above could also be done with summarise(), but for the reason stated above mutate() is the preferred function in this case. Finally, the same result can be obtained by writing a function and then passing that function to mutate(). Here is a function called zscore() for computing z-scores from any vector:

zscore = function(x) {
  (x - mean(x))/sd(x)
}

Here is the z-score for vec:

vec

## [1] 10 12 -8  4  2  0 19 -3 18

zscore(vec)

## [1]  0.4282302  0.6423453 -1.4988056 -0.2141151 -0.4282302 -0.6423453  1.3917481
## [8] -0.9635179  1.2846905

Here are the z-scores for F1 and for F2 of vdata:

zdata = 
  vdata %>%
  mutate(zF1 = zscore(F1), zF2 = zscore(F2))
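
A quick way to check a z-score transformation is to confirm that the result has a mean of 0 and a standard deviation of 1. The sketch below repeats the definition of zscore() so that it runs on its own, and also compares it with the base R function scale(), which with its default arguments carries out the same centring and scaling.

```r
zscore = function(x) {
  (x - mean(x))/sd(x)
}

vec = c(10, 12, -8, 4, 2, 0, 19, -3, 18)
z = zscore(vec)
# z-scores have a mean of 0 (up to rounding error) ...
mean(z)
# ... and a standard deviation of 1
sd(z)
# scale() returns a one-column matrix, hence as.numeric()
all.equal(z, as.numeric(scale(vec)))
```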

2 Summary statistics for categorical variables (factors)

The simplest way to obtain this kind of information for a categorical variable in a data-frame is with the functions group_by() and n() inside summarise(). For example, to make a table to show how often the separate levels of the factor V in the data-frame vdata occur:

vdata %>%
  group_by(V) %>%
  summarise(count = n()) %>%
  ungroup()

## # A tibble: 7 × 2
##   V     count
##   <fct> <int>
## 1 %       426
## 2 A       432
## 3 E       425
## 4 I       424
## 5 O       428
## 6 U       423
## 7 Y       424

# do the same but store it as a new data-frame:
Vcount = 
  vdata %>%
  group_by(V) %>%
  summarise(count = n()) %>%
  ungroup()
Vcount

## # A tibble: 7 × 2
##   V     count
##   <fct> <int>
## 1 %       426
## 2 A       432
## 3 E       425
## 4 I       424
## 5 O       428
## 6 U       423
## 7 Y       424

from which it becomes immediately apparent that:

the factor V is made up of 7 levels
the levels are “%” (=/ø/ as in German ‘Söhne’), “A”, “E”, “I”, “O”, “U”, “Y”
each level occurs over 400 times in the data-frame.

The output of the above code with group_by() returns an object called a tibble as shown here:

class(Vcount)

## [1] "tbl_df"     "tbl"        "data.frame"

A tibble is a simplified version of a data-frame and the difference between the two will be largely unimportant for the subsequent statistical analyses to be considered.

Conversion back to a data-frame is possible as follows:

Vcount.df = Vcount %>% as.data.frame()
class(Vcount.df)

## [1] "data.frame"

The function as_tibble() can be used to convert a data-frame to a tibble. More information about tibbles is in chapter 10, R for Data Science.
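
A table of counts like the one made earlier for V can also be obtained more compactly. The following sketch, which uses a small made-up data-frame toy instead of vdata, compares the group_by() and n() approach with the dplyr shortcut count() and with the base R function table(); which of these to use is largely a matter of taste.

```r
library(dplyr)

# toy data-frame with one factor (values are made up for illustration)
toy = data.frame(V = factor(c("A", "I", "A", "U", "I", "A")))

# long form, as used in this module
long =
  toy %>%
  group_by(V) %>%
  summarise(count = n()) %>%
  ungroup()

# short form: count() groups, counts, and ungroups in one step
short = toy %>% count(V, name = "count")
short

# base R equivalent, returned as a table rather than a data-frame
table(toy$V)
```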

It’s always a good idea to use the function ungroup() after using the group_by() function because otherwise the grouping stays attached to the data-frame and can influence subsequent calculations in unexpected ways. Using ungroup() in the above example was in fact unnecessary because there was only one grouping variable, V, and summarise() by default removes the last grouping variable from its output. When there is more than one grouping variable (as in the example below), ungroup() should always be used.
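
The kind of unexpected result that a leftover grouping can cause is sketched below with a small made-up data-frame toy. After summarise() the output is still grouped by the first grouping variable, so a later sum() is calculated within each level of V rather than across all rows.

```r
library(dplyr)

# toy data-frame with two factors (values are made up for illustration)
toy = data.frame(V    = factor(c("A", "A", "A", "I", "I", "I")),
                 Cons = factor(c("K", "K", "P", "K", "P", "P")))

# without ungroup(): the result is still grouped by V
grouped =
  toy %>%
  group_by(V, Cons) %>%
  summarise(count = n())
group_vars(grouped)

# so sum(count) is calculated separately within each level of V
within.V = grouped %>% mutate(total = sum(count))
within.V

# with ungroup(): sum(count) is calculated across all rows
overall = grouped %>% ungroup() %>% mutate(total = sum(count))
overall
```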

A table of levels and their frequency of occurrence for the combination of any factors can be made by passing the names of the factors to the group_by() function together, thus:

vdata %>%
  group_by(V, Cons) %>%
  summarise(count = n()) %>%
  ungroup()

## `summarise()` has grouped output by 'V'. You can override using the `.groups` argument.

## # A tibble: 21 × 3
##    V     Cons  count
##    <fct> <fct> <int>
##  1 %     K       142
##  2 %     P       142
##  3 %     T       142
##  4 A     K       147
##  5 A     P       141
##  6 A     T       144
##  7 E     K       144
##  8 E     P       140
##  9 E     T       141
## 10 I     K       142
## # … with 11 more rows

For example, the fourth line means that there are 147 observations which have level A (of factor V) and level K (of factor Cons). The first few rows above show that there are three consonant levels per vowel level. The function n_distinct() can be used to count how many levels of one factor occur within each level of another. For example:

vdata %>% 
  group_by(V) %>% 
  summarise(count = n_distinct(Cons)) %>%
  ungroup()

## # A tibble: 7 × 2
##   V     count
##   <fct> <int>
## 1 %         3
## 2 A         3
## 3 E         3
## 4 I         3
## 5 O         3
## 6 U         3
## 7 Y         3

means that for each of the 7 vowel levels in factor V there are three consonant levels (of factor Cons). The following:

vdata %>% 
  group_by(V, Cons) %>% 
  summarise(count = n_distinct(Subj)) %>% 
  ungroup()

## `summarise()` has grouped output by 'V'. You can override using the `.groups` argument.

## # A tibble: 21 × 3
##    V     Cons  count
##    <fct> <fct> <int>
##  1 %     K         7
##  2 %     P         7
##  3 %     T         7
##  4 A     K         7
##  5 A     P         7
##  6 A     T         7
##  7 E     K         7
##  8 E     P         7
##  9 E     T         7
## 10 I     K         7
## # … with 11 more rows

means that for each combination of consonant and vowel levels, there are 7 subject levels.

3 Summary statistics for numeric variables grouped by factor.

The group_by() function can also be used to calculate summary statistics for numeric variables separately for the levels of a factor. For example, instead of calculating mean F1 for the entire data-frame, separate F1 means might instead be calculated for each of the 7 vowel levels in V. This is done as follows:

vdata %>%
  # for each level in `V`
  group_by(V) %>%
  # calculate the mean of F1
  summarise(mF1 = mean(F1)) %>%
  ungroup()

## # A tibble: 7 × 2
##   V       mF1
##   <fct> <dbl>
## 1 %      424.
## 2 A      645.
## 3 E      426.
## 4 I      311.
## 5 O      434.
## 6 U      304.
## 7 Y      302.

The above can be extended to include any number of factors. Thus to calculate mean F1 for each combination of consonant and vowel levels:

cvmeans = 
  vdata %>%
  group_by(Cons, V) %>%
  summarise(mF1 = mean(F1)) %>%
  ungroup()

## `summarise()` has grouped output by 'Cons'. You can override using the `.groups` argument.

cvmeans

## # A tibble: 21 × 3
##    Cons  V       mF1
##    <fct> <fct> <dbl>
##  1 K     %      410.
##  2 K     A      637.
##  3 K     E      419.
##  4 K     I      307.
##  5 K     O      431.
##  6 K     U      296.
##  7 K     Y      296.
##  8 P     %      440.
##  9 P     A      646.
## 10 P     E      436.
## # … with 11 more rows

The meaning of the 4th line in the above is: the mean of F1 in /KI/ (i.e. the mean of observations for which Cons is K and for which V is I) is 307 Hz.
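
The message printed by summarise() mentions a `.groups` argument. As a sketch of one way to use it, setting `.groups = "drop"` returns an ungrouped result directly, which has the same effect as the final ungroup() and also suppresses the message. The example uses a small made-up data-frame toy rather than vdata, and then looks up the mean for one combination of levels with filter().

```r
library(dplyr)

# toy data-frame (values are made up for illustration)
toy = data.frame(Cons = factor(c("K", "K", "P", "P")),
                 V    = factor(c("A", "A", "A", "I")),
                 F1   = c(600, 700, 650, 300))

# .groups = "drop" returns an ungrouped result
toy.means =
  toy %>%
  group_by(Cons, V) %>%
  summarise(mF1 = mean(F1), .groups = "drop")
toy.means

# no grouping variables remain
group_vars(toy.means)

# look up the mean for one combination of levels
KA = toy.means %>% filter(Cons == "K", V == "A")
KA
```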

Summary statistics

Jonathan Harrington, Johanna Cronenberg
