If not already done, carry out parts 1-5 of the setup, as described here.

Load these libraries and this data frame:

library(tidyverse)
library(broom)
library(gridExtra)

urla = "https://www.phonetik.uni-muenchen.de/"
urlb = "studium_lehre/lehrmaterialien/R_speech_processing/Rdf/"
url = paste0(urla, urlb)
reg.df = read.table(file.path(url, "reg.df.txt"), 
                    stringsAsFactors = T)

1 Correlation and the relationship between two numeric variables

Both correlation and linear regression estimate the extent to which there is a linear, predictable relationship between two continuous (numeric) variables. Linear means that the scatter of data points can be approximated, i.e. modelled, by a straight line. Predictable means that one of the variables can be reliably estimated from the other. The data-frame reg.df consists of 100 observations of three numerical variables A, B, C, together with a factor Vpn identifying the participant:

summary(reg.df)
##        A                  B                 C                Vpn    
##  Min.   :-2.47845   Min.   :-2.3548   Min.   :-2.1375   S1     : 1  
##  1st Qu.:-0.60341   1st Qu.:-0.2543   1st Qu.: 0.1743   S10    : 1  
##  Median :-0.07297   Median : 0.4314   Median : 0.9232   S100   : 1  
##  Mean   :-0.02041   Mean   : 0.3796   Mean   : 0.8804   S11    : 1  
##  3rd Qu.: 0.60800   3rd Qu.: 1.1156   3rd Qu.: 1.5659   S12    : 1  
##  Max.   : 2.59640   Max.   : 3.0582   Max.   : 3.6475   S13    : 1  
##                                                         (Other):94

The issue to be considered is whether there is such a linear, predictable relationship in the data-frame reg.df between the following pairs of numerical variables: A and B, A and C, and B and C.

Here is a plot of the above pairs of variables:

p1 = reg.df %>%
  ggplot() +
  aes(y = A, x = B) +
  geom_point()
p2 = reg.df %>%
  ggplot() +
  aes(y = A, x = C) +
  geom_point()
p3 = reg.df %>%
  ggplot() +
  aes(y = B, x = C) +
  geom_point()
grid.arrange(p1, p2, p3, ncol=3)

The middle and rightmost plots show a clear negative and positive slope respectively.

The leftmost plot between A and B also has a positive slope, but the relationship between the two variables is much weaker.

The correlation coefficient expresses the strength of the linear relationship between variables and can be computed in R with the function cor(). The correlation coefficient, r, varies between ±1. A value of zero means that the variables are completely uncorrelated and that one variable cannot be predicted from the other. By contrast, the closer the values are to -1 or +1, the more the variables are correlated and mutually predictable: at values of -1 or +1, one variable is entirely predictable from the other. From this point of view, the r values for the middle and right plots should be further from zero than for the leftmost plot; and the middle and right plots should differ in the sign of r (because the variables are negatively associated in the middle plot and positively associated in the right plot). The following verifies that this is so:

# left plot
reg.df %>%
  summarise(cor(A, B))
##   cor(A, B)
## 1 0.1138534
# middle plot
reg.df %>%
  summarise(cor(A, C))
##    cor(A, C)
## 1 -0.5072916
# right plot
reg.df %>%
  summarise(cor(B, C))
##   cor(B, C)
## 1 0.6554501
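
As a minimal sketch of the boundary cases (using a made-up vector rather than reg.df), the correlation of a variable with an exact linear function of itself is exactly +1 or -1, depending on the sign of the slope:

# made-up vector for illustration only
x = 1:10
# exact positive linear relationship: r is exactly 1
cor(x, 2 * x + 3)
## [1] 1
# exact negative linear relationship: r is exactly -1
cor(x, -0.5 * x + 1)
## [1] -1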

2 Regression

Both (linear) regression and correlation measure the strength of association between two numerical variables. In regression, this is done by calculating which straight line best fits the scatter of data points. The function for linear regression is lm() and the line of best fit can be drawn through a scatter of points using the geom_smooth() function, thus:

r1 = ggplot(reg.df) + 
  aes(y = A, x = B) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F, color = "blue") 
  
r2 = ggplot(reg.df) + 
  aes(y = A, x = C) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F, color = "blue")

r3 = ggplot(reg.df) + 
  aes(y = B, x = C) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F, color = "blue")

grid.arrange(r1, r2, r3, ncol=3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

The plots are consistent with the results from the correlation analysis: the fitted line has a negative slope in the middle plot, a positive slope in the right plot, and only a shallow positive slope in the left plot, where the correlation was weakest.

The argument se in the geom_smooth() function controls whether the standard error is shown. If se is TRUE (the default, obtained by omitting the se argument), then the line is drawn with a 95% confidence interval (shown by shading), as here for the A ~ C relationship:

ggplot(reg.df) + 
  aes(y = A, x = C) + 
  geom_point() + 
  geom_smooth(method = "lm", color = "blue")
## `geom_smooth()` using formula 'y ~ x'

The geom_smooth() function gives a message that the formula y ~ x is being used. This message goes away if the formula is specified explicitly. For a straight-line regression:

ggplot(reg.df) + 
  aes(y = A, x = C) + 
  geom_point() + 
  geom_smooth(method = "lm", color = "blue", formula = y ~ x)

In straight-line regression of the above kind, a slope m and an intercept k are calculated in the equation \(y = mx + k\). In this equation, m and k are the coefficients. For the regression of A on C (i.e. where A is y and C is x in the above equation), the coefficients are obtained from the lm() function which calculates the regression line. Thus:

reg.df %>%
  lm(A ~ C, .) %>%
  coef()
## (Intercept)           C 
##   0.3831549  -0.4584051

which for the middle plot above gives a slope of m = -0.4584051 and an intercept of k = 0.3831549.

When a straight-line regression is calculated, the points in the scatter are modelled by corresponding points on the regression line. Thus for every black point there is a corresponding fitted red point with the same x-axis value but positioned on the regression line itself, as shown in this plot:

reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  ggplot() + 
  aes(y = A, x = C) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue", formula = y ~ x) +
  # add the fitted values
  geom_point(aes(y = .fitted, x = C), col = "red")

The slope and intercept of the regression line are calculated by minimising a quantity known as SSE (the Sum of the Squared Error) which is the sum of the squared vertical distances between the red and black points. SSE is given by:

reg.df %>%
  lm(A ~ C, .) %>%
  deviance()
## [1] 72.22293

which is the same as the sum of the squared distances between the actual (black) and fitted (red) values in the above plot:

reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  summarise(sum((A - .fitted)^2))
## # A tibble: 1 × 1
##   `sum((A - .fitted)^2)`
##                    <dbl>
## 1                   72.2

The closer SSE is to zero, the better the fit of the straight line to the points, so that if SSE were actually zero, the straight line would pass exactly through all the points.
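
As a minimal sketch with made-up, perfectly collinear points (not part of reg.df), the fitted line passes through every point and the SSE returned by deviance() is zero (up to rounding error):

# five made-up points lying exactly on the line y = 2x + 1
perfect = data.frame(x = 1:5, y = 2 * (1:5) + 1)
perfect %>%
  lm(y ~ x, .) %>%
  # SSE: essentially zero for a perfect fit
  deviance()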

A further point to note is that, in contrast to the correlation coefficient, the slope of the regression does not have to lie within the range ±1 and often does not. For example, if the values of A are made much larger by (in this case) multiplying by 1000, then the coefficients also become correspondingly larger:

reg.df %>%
  mutate(Alarge = 1000 * A) %>%
  lm(Alarge ~ C, .) %>%
  coef()
## (Intercept)           C 
##    383.1549   -458.4051

Regression can be used to predict (in this case) new values of A from C. For example, to predict the values of A when C is -3 and -4:

reg.df %>%
  lm(A ~ C, .) %>%
  predict(., data.frame(C = c(-3, -4)))
##        1        2 
## 1.758370 2.216775
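
These predictions are simply \(mx + k\) using the coefficients calculated earlier, as the following quick check (reusing coef() as above) verifies:

cfs = reg.df %>%
  lm(A ~ C, .) %>%
  coef()
# slope * C + intercept for C = -3 and C = -4:
# the same values as returned by predict() above
cfs[2] * c(-3, -4) + cfs[1]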

The predicted values always lie on the regression line itself, as shown in the plot below:

ggplot(reg.df) + 
  aes(y = A, x = C) + 
  geom_point() +
  # add the two new points to the plot
  geom_point(aes(x = -3, y = 1.758370), col = "red") +
  geom_point(aes(x = -4, y = 2.216775), col = "red") +
  # note the need to use fullrange=T to extend
  # the regression line
  geom_smooth(method = "lm", se = F, 
              color = "blue", fullrange=T) 
## `geom_smooth()` using formula 'y ~ x'

The slope calculated with lm() and the correlation coefficient (which as discussed above is bounded by ±1) are related by \(m = r \cdot \frac{sd(y)}{sd(x)}\) where these four quantities are respectively:

  • the slope from lm(), m
  • the correlation coefficient, r
  • the standard deviation of y
  • the standard deviation of x.

Thus for this example:

reg.df %>%
  #correlation coefficient
  summarise(r = cor(A, C), 
            # sd of A
            sdy = sd(A), 
            # sd of C
            sdx = sd(C)) %>%
  # slope = r * sd(y)/sd(x)
  mutate(slope = r * sdy/sdx)
##            r       sdy      sdx      slope
## 1 -0.5072916 0.9911202 1.096818 -0.4584051

which gives the same slope of -0.4584051, as calculated with lm() above.

3 Testing the significance of the regression slope/correlation coefficient

A statistical test can be carried out to determine whether the slope of the regression line differs significantly from zero (the null hypothesis being that it is zero). If the slope of the regression line (or the correlation coefficient) is zero, then there is no association between the two variables, which means that the values of one variable cannot be predicted from the other. There are three equivalent ways of testing for significance. The one most commonly used is the third method, as outlined below.

3.1 Testing whether the regression line slope is different from zero

Testing whether the slope is zero can be done as follows:

reg.df %>%
  lm(A ~ C, .) %>%
  summary() %>%
  tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept)    0.383    0.110       3.47 0.000766    
## 2 C             -0.458    0.0787     -5.83 0.0000000719

The slope is in line 2 (the intercept is also tested, but that is not generally important). The result gives the slope value itself (column 2), the t-statistic (column 4) and the probability that the slope could be zero (last column: p < 0.001). So the conclusion is that there is a significant (t = 5.83, p < 0.001) linear relationship between A and C (i.e. A can be predicted from C significantly better than chance).
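
An equivalent way of looking at the same result (a sketch using base R's confint() function, which is not otherwise used in this module) is via the 95% confidence interval of the slope: if this interval does not include zero, the slope differs significantly from zero.

reg.df %>%
  lm(A ~ C, .) %>%
  # 95% confidence intervals for the intercept and the slope
  confint()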

3.2 Testing whether the correlation coefficient is different from zero

Since the slope in regression is the same as the correlation coefficient multiplied by a constant (by sdy/sdx as described above), exactly the same result is obtained by testing whether the correlation between the two variables is significantly different from zero. This can be done with the cor.test() function (with a slightly different syntax), thus:

reg.df %>%
  cor.test(~ A + C, .) %>%
  tidy()
## # A tibble: 1 × 8
##   estimate statistic     p.value parameter conf.low conf.high method alternative
##      <dbl>     <dbl>       <dbl>     <int>    <dbl>     <dbl> <chr>  <chr>      
## 1   -0.507     -5.83     7.19e-8        98   -0.640    -0.345 Pears… two.sided

The information from the above from left to right is:

  • column 1: the correlation coefficient.
  • column 2: the t-statistic: the same as from the regression.
  • column 3: the p-value: the same as for the regression.
  • column 4: the degrees of freedom: two less than the number of observations.
  • columns 5 and 6: the 95% confidence interval for the correlation coefficient. This states that the correlation coefficient lies between -0.64 and -0.345 with a probability of 95%. Because this range does not include zero, the result is significant.
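
The following sketch shows why the two tests must agree: the t-statistic can be recovered from the correlation coefficient with the standard formula \(t = r \cdot \frac{\sqrt{n-2}}{\sqrt{1-r^2}}\), where n is the number of observations:

reg.df %>%
  summarise(r = cor(A, C), n = n()) %>%
  # t computed from r: the same -5.83 as from lm() and cor.test()
  mutate(t = r * sqrt(n - 2) / sqrt(1 - r^2))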

3.3 Testing whether \(R^2\) is different from zero

The preferred method for reporting the statistical summary is to include a quantity known as \(R^2\) (R-squared), which is the correlation coefficient squared. Since the correlation coefficient varies between ±1, \(R^2\) must vary between 0 and 1 (the square of a negative number being positive). The value of \(R^2\) can be seen by leaving out the tidy() function after the lm() and summary() commands, thus:

reg.df %>%
  lm(A ~ C, .) %>%
  summary() 
## 
## Call:
## lm(formula = A ~ C, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44509 -0.56414  0.00729  0.54727  2.05738 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.38315    0.11030   3.474 0.000766 ***
## C           -0.45841    0.07866  -5.827 7.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8585 on 98 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2498 
## F-statistic: 33.96 on 1 and 98 DF,  p-value: 7.191e-08

Use Multiple R-squared when calculating a regression of the form y ~ x with just two variables. The other quantity, Adjusted R-squared, is preferred in multiple regression with more than two variables, because R-squared is artificially inflated as more predictors are added to the regression model and Adjusted R-squared compensates for this.

The penultimate line of the summary output above gives an \(R^2\) value of 0.2573, which is the same as the correlation coefficient squared:

reg.df %>%
  summarise(cor(A, C)^2)
##   cor(A, C)^2
## 1   0.2573448
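
For reference, the Adjusted R-squared of 0.2498 in the same output can be reproduced from Multiple R-squared with the standard correction \(1 - (1 - R^2)\frac{n-1}{n-p-1}\), where n is the number of observations and p the number of predictors; a quick sketch:

s = reg.df %>%
  lm(A ~ C, .) %>%
  summary()
# adjusted R-squared computed from the correction (n = 100, p = 1)
1 - (1 - s$r.squared) * (100 - 1) / (100 - 1 - 1)
# the same value as s$adj.r.squared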

The F-statistic of 33.96 in the last line is the t-statistic squared, i.e. (-5.83)^2. The results show a significant (\(R^2\) = 0.2573, F[1, 98] = 33.96, p < 0.001) linear association between A and C (note again that the p-value is exactly the same as in the other two ways of reporting the statistic given above).
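
A quick check of this relationship between the two statistics (a sketch based on the tidy() output shown earlier):

reg.df %>%
  lm(A ~ C, .) %>%
  summary() %>%
  tidy() %>%
  # keep just the row for the slope C
  filter(term == "C") %>%
  # its squared t-statistic equals the F-statistic of 33.96
  mutate(t_squared = statistic^2)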

\(R^2\) is also equal to SSR/SSY, where SSR is the sum of squares accounted for by the regression and SSY is the total sum of squares; these are related by SSY = SSR + SSE.

SSE was calculated earlier from:

SSE = reg.df %>%
  lm(A ~ C, .) %>%
  deviance()
SSE
## [1] 72.22293

SSY is the sum of the squared differences between the actual values of y (in this case A) and the mean of y:

SSY = reg.df %>%
  summarise(sum ((A - mean(A))^2))
SSY              
##   sum((A - mean(A))^2)
## 1             97.24961

Therefore SSR is:

SSR = SSY - SSE
SSR
##   sum((A - mean(A))^2)
## 1             25.02668

and so \(R^2\) is:

SSR/SSY
##   sum((A - mean(A))^2)
## 1            0.2573448

4 Conditions for applying tests of significance

Statistical tests based on linear regression are usually only valid if there is not more than one pair of observations per subject or participant. This is because this type of regression presupposes that the observations are independent of each other and this is not the case if more than one observation pair is taken from the same subject. If there are multiple observations per subject, then there is certainly no harm in drawing a regression line through the scatter using the techniques described in this module, but a different method such as mixed models should instead be used to test for statistical significance.
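
This condition can be checked directly in reg.df by counting the number of observations per subject in the Vpn column (a quick sketch; as the summary at the start of the module suggests, each subject should contribute exactly one observation):

reg.df %>%
  count(Vpn) %>%
  # the largest number of observations contributed by any one subject
  summarise(max(n))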

Beyond the above consideration (which is by far the most important one in phonetics and speech science), statistical tests based on linear regression are also only valid if the distribution of the so-called residuals fulfils various conditions. The residuals are the (vertical) differences between the actual and fitted values (the latter being on the regression line). The residuals are in the .resid column after applying lm() and then augment():

reg.df %>%
  lm(A ~ C, .) %>%
  augment()
## # A tibble: 100 × 8
##          A      C .fitted  .resid   .hat .sigma    .cooksd .std.resid
##      <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>      <dbl>      <dbl>
##  1  1.71    1.54  -0.321   2.03   0.0136  0.837 0.0393         2.38  
##  2 -0.653   1.43  -0.271  -0.382  0.0125  0.862 0.00127       -0.448 
##  3 -0.175   0.429  0.187  -0.361  0.0117  0.862 0.00106       -0.423 
##  4 -0.0540  1.45  -0.280   0.226  0.0127  0.863 0.000451       0.265 
##  5 -0.726   0.389  0.205  -0.931  0.0120  0.858 0.00724       -1.09  
##  6  0.122  -0.330  0.535  -0.413  0.0223  0.862 0.00270       -0.486 
##  7 -0.354   1.84  -0.459   0.105  0.0177  0.863 0.000138       0.124 
##  8 -1.10    0.324  0.234  -1.33   0.0126  0.852 0.0155        -1.56  
##  9 -0.0555  1.01  -0.0807  0.0251 0.0101  0.863 0.00000444     0.0294
## 10  2.07   -0.166  0.459   1.61   0.0192  0.847 0.0352         1.90  
## # … with 90 more rows

The following verifies that the residuals are just the difference between the raw (A) and fitted (.fitted) values:

reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  # calculate the residuals manually
  mutate(residuals = A - .fitted) %>%
  # subtract these from the given `.resid` column
  summarise(residuals - .resid) %>%
  # there should be 100 zeros
  table()
## residuals - .resid
##   0 
## 100

The conditions on the residuals are discussed in detail here.

Some of these conditions are as follows. The first is that the mean of the residuals should be (close to) zero:

reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  summarise(mean(.resid))
## # A tibble: 1 × 1
##   `mean(.resid)`
##            <dbl>
## 1      -7.83e-17

which is very clearly fulfilled in this case.

A second condition is that the residuals should not be correlated with the x-variable (in this case C). This can be tested with cor.test():

reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  cor.test(~ .resid + C, .) %>%
  tidy()
## # A tibble: 1 × 8
##   estimate statistic p.value parameter conf.low conf.high method     alternative
##      <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr>      <chr>      
## 1 3.01e-16  2.98e-15    1.00        98   -0.196     0.196 Pearson's… two.sided

There is no evidence of such a correlation in this case, given that the correlation coefficient is almost zero.

A further condition is that the residuals should be normally distributed. This can be tested with the Shapiro-Wilk test:

reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  # extract the residuals
  pull(.resid) %>%
  # test whether these depart from a normal distribution
  shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.99624, p-value = 0.9952

A p-value less than 0.05 would be evidence that the residuals are not normally distributed (the high p-value here gives no evidence against normality). If the residuals are not normally distributed, this might be because of outliers, which can be identified in a so-called QQ-plot, in which the greater the deviation of the points from a straight line, the less likely it is that the residuals follow a normal distribution:

reg.df %>%
  lm(A ~ C, .) %>%
  plot(which = 2)

The above plot flags observations 1, 43, and 73 as those with the most extreme residuals, i.e. as possible outliers. Thus, if the residuals were not normally distributed, the statistical tests (and plots) could be carried out once again after removing these outliers from the data frame (using the slice() function), as sketched below.
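
Here is a sketch of how that re-analysis might look, assuming that rows 1, 43 and 73 of reg.df are indeed the observations flagged in the plot:

reg.df %>%
  # drop the flagged observations (assumed to be rows 1, 43, 73)
  slice(-c(1, 43, 73)) %>%
  lm(A ~ C, .) %>%
  summary() %>%
  tidy()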

Another condition is that the residuals should not be auto-correlated. A good example of an auto-correlated signal is a periodic speech signal, because the same pattern (a pitch period) is repeated (as a result of which successive pitch periods are auto-correlated, i.e. correlated with each other). One way to test for auto-correlation is with the acf() function, which computes the correlation of a signal with itself at lags 0, 1, 2, … (the value at lag 0 is always 1 and can be ignored: it is the correlation of the signal with itself with no shift):

par(mfrow=c(1,1))
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  # extract the residuals
  pull(.resid) %>%
  acf()

If (apart from lag 0) the auto-correlation values (especially those at lags 1 and 2) lie within the blue dashed lines, then it can usually be assumed that there is no auto-correlation (and so the test can proceed).
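
By way of contrast, here is a sketch with a made-up periodic signal (standing in for a pitch-periodic speech waveform rather than regression residuals): its acf shows large values well outside the blue lines at many non-zero lags, i.e. clear auto-correlation.

# a made-up sinusoid with a period of 20 samples
periodic = sin(2 * pi * (1:200) / 20)
acf(periodic)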

A final condition is that the residuals should not vary systematically with the fitted values. This means that when the residuals are plotted as a function of the fitted values, there should be no discernible pattern in the data; in particular, the red line in the plot below should be more or less flat about the horizontal line at zero.

reg.df %>%
  lm(A ~ C, .) %>%
  plot(which = 1)