If not already done, carry out parts 1-5 of the setup, as described here.

Load these libraries:

library(tidyverse)

The module is concerned with an introduction to probability theory which forms the basis for most types of statistical tests applied to speech data. Most of these tests are based on the idea that the variables being tested statistically follow or approximate a type of probability distribution known as the normal distribution.

1 Approximation to the normal distribution

1.1 Sample mean and population mean

The normal or Gaussian distribution describes a pattern that is followed when randomly occurring observations of an event are repeated. The approximation to the normal distribution can be demonstrated by:

  1. taking a random sample of observations.
  2. averaging the sample.

and then

  1. repeating 1. and 2 very many times.

The greater the number of repetitions in 3, the closer the approximation to the normal distribution.

An approximation towards the normal distribution can be demonstrated with:

  1. throwing a number of dice.
  2. averaging the score: this known as a sample mean.
  3. repeating 1. and 2.

In R, a random throw of an unbiased six-sided dice can be accomplished with:

throw = sample(1:6, 1, replace=T)
throw
## [1] 5

To throw, say, N = 10 dice all at the same time:

N = 10
throw10 = sample(1:6, N, replace=T)
throw10
##  [1] 5 4 2 5 3 6 2 5 2 6

To follow step 2. above and get the average or sample mean:

N = 10
throw10 = sample(1:6, N, replace=T)
# sample mean
mean(throw10)
## [1] 4.3

For step 3, repeat step 2 a certain number of times. The following does the repetition k = 50 times with the aid of a so-called for-loop and stores the output in a data-frame, results.df:

results = NULL
k = 50
for (j in 1:k) {
  N = 10
  throw10 = sample(1:6, N, replace=T)
  results = c(results, mean(throw10))
}
results.df = data.frame(mean_scores = results)
results.df %>% head()
##   mean_scores
## 1         3.7
## 2         2.9
## 3         4.5
## 4         3.5
## 5         3.9
## 6         3.4
results.df %>% nrow()
## [1] 50

The output of results is 50 values and each value is a sample mean when throwing 10 dice at the same time. Notice that each time the above code snippet is executed, the output will be different: this is because throwing dice is a random process.

What is the most likely sample mean when 10 dice are thrown and averaged? The possible range of the sample means is from 1 to 6. But neither 1 nor 6 seem very likely because this would require throwing 10 dice all with a score of 1 or all with a score of 6. The most likely sample mean is in fact given by the mean of the six equiprobable outcomes of throwing an unbiased dice i.e. mean(1:6) which is 3.5 and also known as the population mean (and printed with the Greek letter as μ = 3.5). Notice that this does not imply that 3.5 actually has to occur in the sample of 50 means; but instead that sample means clustered around 3.5 are most likely to occur.

1.2 Histograms, the standard error, and the normal curve

A histogram of these 50 sample means should show a clustering around the population mean of 3.5 and a progressive falling off towards the lowest scores of 1 and 6:

# the argument color = "white" puts white spaces
# between the bars; binwidth specifies the bar width
results.df %>%
  ggplot +
  aes(x = mean_scores) + 
  geom_histogram(color = "white", 
                 binwidth=.25)

The above histogram counts how many occurrences there are per bar width. In order to establish the relationship to the normal distribution, the y-axis needs to be probability density. When this axis is used, the histogram shape is the same, but the sum of the areas of the bar areas is 1:

results.df %>%
  ggplot + 
  aes(x = mean_scores) + 
  geom_histogram(aes(y = after_stat(density)),
                 color = "white",
                 binwidth = .25) +
  ylab("Probability density") +
  xlab("Sample mean")

The relationship for any bar between probability density, count, and the bar width is: probability_density = count/(n * bindwidth) where count is the height of bar, n is the total number of observations (50 in this case), and binwidth is the width of a bar. Thus if a bar has a height (count) of 5, its probability density for these data is given by 5/(nrow(results.df) * 0.25).

The above histogram approximates ever more closely the normal distribution as the number of repetitions, k (step 3 above), increases. Any normal distribution of sample means can be derived by knowing the population mean (μ = 3.5 in this case, as established earlier) and the standard error of the mean (often abbreviated as SE or SEM) which for this example is given by:

SE = sd(1:6) * sqrt(5/6) / sqrt(10)
SE
## [1] 0.5400617

The above calculation of the standard error of the mean in R is SE = σ/√N where σ is the population standard deviation which, for this example, is sd(1:6) * sqrt(5/6) as above. The reason for the extra term sqrt(5/6) is because the sd() function in R divides by m-1 where m is the number of equiprobable outcomes (i.e. 6 for a six-sided dice). Another example: the SE of the sample mean obtained from throwing 20 octagons (sides numbered 1 to 8) together and averaging the score is sd(1:8) * sqrt(7/8)/ sqrt(20).

Once the population mean and standard error have been computed, the corresponding normal distribution can be plotted. The normal curve of the sample means obtained by throwing 10 dice together and averaging is:

mu = 3.5
SE = sd(1:6) * sqrt(5/6) / sqrt(10)
ggplot() +
  geom_function(fun = dnorm, 
                args = list(mean = mu, sd = SE))  +
  xlim(1, 6) + 
  ylab("Probability density") +
  xlab("Sample mean")

Some important points to note about any normal distribution:

  1. The area under the normal curve is 1.
  2. The tails at the left and right edge go on forever i.e. extend to ±infinity ( ±∞ ).
  3. The normal distribution is symmetrical about the population mean which has the highest value of probability density.

The normal distribution can be overlaid on the actual sample means as follows:

results.df %>%
  ggplot + 
  aes(x = mean_scores) +
  geom_histogram(aes(y = after_stat(density)),
                 color = "white",
                 binwidth = .25) +
  geom_function(fun = dnorm,
                args = list(mean = mu, sd = SE))  +
  xlim(1, 6) +
  ylab("Probability density") +
  xlab("Sample mean")
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

The normal distribution is the resulting histogram that would emerge if step 3. above – the number of times the 10 dice are thrown together and averaged – were repeated not 50 but an infinite number of times. Obviously, it’s not possible in practice to draw the outcome of repeating something indefinitely. However, if step 3. were repeated e.g. 20,000 instead of 50 times, then the distribution of the 20,000 sample means should be a lot closer to the normal distribution than for the 50 sample means. And this, of course, can be simulated in R. Here the width of the bins is made much smaller to show more clearly the close approximation to the normal distribution:

# throw 10 dice together, average the score, and repeat
# 20,000 times
results = NULL
k = 20000
for (j in 1:k) {
  N = 10
  throw10 = sample(1:6, N, replace=T)
  results = c(results, mean(throw10))
}
# store as a data-frame
results2.df = data.frame(mean_scores = results)
# verify there are 20,000 observations
nrow(results2.df)
## [1] 20000
# make a histogram of the sample means
# and overlay the normal distribution
results2.df %>%
  ggplot + 
  aes(x = mean_scores) +
  geom_histogram(aes(y =after_stat(density)), 
                 color = "white",
                 binwidth = .1) +
  geom_function(fun = dnorm, 
                args = list(mean = mu, sd = SE), 
                col = "blue", 
                lwd = 1.5)  +
  ylab("Probability density") +
  xlab("Sample mean")

2 Calculating probability

2.1 Proportional areas under the normal curve

The area under the normal curve is used to calculate probability which varies between 0 (never happens) and 1 (always happens). The area under any normal curve is always 1. This is the graphical equivalent to the statement: the probability that a sample mean occurs between ±infinity is 1. The proportional area under a normal curve can be used to answer questions such as:

If ten dice are rolled and averaged to give a sample mean:

  1. what is the probability of a sample mean being less than 2.5?
  2. what is the probability of a sample mean being greater than 3.7?
  3. what is the probability of a sample mean being greater than 4 and less than 4.7?

The proportions under the normal curve associated with these areas graphically and numerically are:

# Question 1:
# p(sample_mean < 2.5) read as: the probability that the
# sample mean is less than 2.5
ggplot() +
  geom_area(aes(c(1, 6)), 
            stat = "function", 
            fun = dnorm, 
            args = list(mean = mu, sd = SE),
            fill = "darkgreen", 
            xlim = c(1, 2.5)) + 
  geom_function(fun = dnorm, 
                args = list(mean = mu, sd = SE))  +
  ylab("Probability density") +
  xlab("Sample mean")

The function pnorm() calculates the area under the normal curve from -∞ (minus infinity) to the desired value (or quantile) which in this case is 2.5: that is, the probability of obtaining a sample mean of 2.5 or less when 10 dice are thrown together and averaged is:

# p(sample_mean < 2.5)
pnorm(2.5, mu, SE)
## [1] 0.03203875

which is also the area shown under the normal curve in the preceding figure.

# Question 2:
# p(sample_mean > 3.7) 
ggplot() +
  geom_area(aes(c(1, 6)), 
            stat = "function", 
            fun = dnorm, 
            args = list(mean = mu, sd = SE),
            fill = "darkgreen", 
            xlim = c(3.7, 6)) + 
  geom_function(fun = dnorm, 
                args = list(mean = mu, sd = SE))  +
  ylab("Probability density") +
  xlab("Sample mean")

Because pnorm() calculates from -∞ to a quantile, and because the area under the normal curve is always one, then the corresponding area/probability above a quantile is given by subtracting from one, thus:

# p(sample_mean > 3.7)
1 - pnorm(3.7, mu, SE)
## [1] 0.355569

So the green area shown in the preceding figure is just over 35% of the area under the normal curve.

# Question 3:
# the probability that the sample mean lies between 4 and 4.7
# p(4 < sample_mean < 4.7) 
ggplot() +
  geom_area(aes(c(1, 6)), 
            stat = "function", 
            fun = dnorm, 
            args = list(mean = mu, sd = SE),
            fill = "darkgreen", 
            xlim = c(4, 4.7)) + 
  geom_function(fun = dnorm, 
                args = list(mean = mu, sd = SE))  +
  ylab("Probability density") +
  xlab("Sample mean")

# p(4 < sample_mean < 4.7)
pnorm(4.7, mu, SE) - pnorm(4, mu, SE)
## [1] 0.164127

2.2 Setting a confidence interval

Statistics often makes use of a so-called confidence interval which defines the interval within which an event is likely to occur with a certain probability. If the probability is set to 0.95, then this is known as a 95% confidence interval. For the preceding example, setting a 95% confidence interval is equivalent to finding a solution to the following question:

  • If 10 dice are thrown together and averaged, within what range is the sample mean likely to occur with a probability of 0.95?

Or equivalently:

  • Find two values a and b within which the above sample mean is likely to occur with a probability of 0.95.

The 95% confidence interval is the extent of the green shading on x-axis in the figure shown below. The figure was drawn such that the white areas on either side comprise an area of 0.025 (hence a total area between them of 0.05):

ggplot() +
  geom_area(aes(c(1, 6)), 
            stat = "function", 
            fun = dnorm, 
            args = list(mean = mu, sd = SE),
            fill = "darkgreen", 
            xlim = c(2.441498, 4.558502)) + 
  geom_function(fun = dnorm, 
                args = list(mean = mu, sd = SE)) +
  ylab("Probability density") +
  xlab("Sample mean") +
  ggtitle("95% confidence interval")

The edges of the shaded area (x-axis boundaries between white and green) can be calculated with the qnorm() function as follows:

qnorm(0.975, mu, SE)
## [1] 4.558502
qnorm(0.025, mu, SE)
## [1] 2.441498

The following confirms that the shaded area is 0.95 (apart from a rounding error):

pnorm(4.558502, mu, SE) - pnorm(2.441498, mu, SE)
## [1] 0.9500001

The 95% confidence interval can now be stated as follows. If 10 dice are thrown and their scores averaged, then this average score or sample mean occurs between 2.441498 and 4.558502 with a probability of 0.95.