If not already done, carry out parts 1-5 of the setup, as described here.
Load these libraries and data-frames:
library(tidyverse)
library(broom)
library(gridExtra)
= "https://www.phonetik.uni-muenchen.de/"
urla = "studium_lehre/lehrmaterialien/R_speech_processing/Rdf/"
urlb = paste0(urla, urlb)
url = read.table(file.path(url, "reg.df.txt"),
reg.df stringsAsFactors = T)
Both correlation and linear regression are estimates of the extent to which there is a linear, predictable relationship between two continuous (numeric) variables. Linear means that the scatter of data points can be approximated, i.e. modelled, by a straight line. Predictable means that one of the variables can be reliably estimated from the other. The data-frame reg.df consists of 100 observations and three numerical variables A, B, C:
summary(reg.df)
## A B C Vpn
## Min. :-2.47845 Min. :-2.3548 Min. :-2.1375 S1 : 1
## 1st Qu.:-0.60341 1st Qu.:-0.2543 1st Qu.: 0.1743 S10 : 1
## Median :-0.07297 Median : 0.4314 Median : 0.9232 S100 : 1
## Mean :-0.02041 Mean : 0.3796 Mean : 0.8804 S11 : 1
## 3rd Qu.: 0.60800 3rd Qu.: 1.1156 3rd Qu.: 1.5659 S12 : 1
## Max. : 2.59640 Max. : 3.0582 Max. : 3.6475 S13 : 1
## (Other):94
The issue to be considered is whether there is such a linear, predictable relationship in the data-frame reg.df between the following pairs of numerical variables: A and B; A and C; B and C.
Here is a plot of the above pairs of variables:
p1 = reg.df %>%
  ggplot() +
  aes(y = A, x = B) +
  geom_point()
p2 = reg.df %>%
  ggplot() +
  aes(y = A, x = C) +
  geom_point()
p3 = reg.df %>%
  ggplot() +
  aes(y = B, x = C) +
  geom_point()
grid.arrange(p1, p2, p3, ncol=3)
The middle and rightmost plots show a clear negative and positive slope respectively: as A increases, C decreases (middle, negative slope); as B increases, C increases (right, positive slope). The leftmost plot, of A against B, also has a positive slope, but the relationship between the two variables is much weaker: as A increases, B tends to increase, but only weakly (left, weakly positive slope).
The correlation coefficient expresses the strength of the linear relationship between variables and can be computed in R with the function cor(). The correlation coefficient, r, varies between ±1. A value of zero means that the variables are completely uncorrelated and that one variable cannot be predicted from the other. By contrast, the closer the values are to -1 or +1, the more the variables are correlated and mutually predictable: at values of -1 or +1, one variable is entirely predictable from the other. From this point of view, the r values for the middle and right plots should be further from zero than the r value for the leftmost plot; and the r values for the middle and right plots should differ in sign (because the variables are negatively associated in the middle plot and positively associated in the right plot). The following verifies that this is so:
# left plot
reg.df %>%
  summarise(cor(A, B))
## cor(A, B)
## 1 0.1138534
# middle plot
reg.df %>%
  summarise(cor(A, C))
## cor(A, C)
## 1 -0.5072916
# right plot
reg.df %>%
  summarise(cor(B, C))
## cor(B, C)
## 1 0.6554501
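As a brief aside (a made-up illustration, not part of reg.df), two variables that are related by an exact straight line have a correlation of exactly +1 or -1:
# made-up vectors: an exact linear relationship gives r = +1 or r = -1
x = 1:10
# y increases exactly with x
cor(x, 2 * x + 3)
# y decreases exactly with x
cor(x, -2 * x + 3)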
Both (linear) regression and correlation measure the strength of association between two numerical variables. In regression, this is done by calculating which straight line best fits the scatter of data points. The function for linear regression is lm()
and the line of best fit can be drawn in a scatter of points using the geom_smooth()
function, thus:
r1 = ggplot(reg.df) +
  aes(y = A, x = B) +
  geom_point() +
  geom_smooth(method = "lm", se = F, color = "blue")
r2 = ggplot(reg.df) +
  aes(y = A, x = C) +
  geom_point() +
  geom_smooth(method = "lm", se = F, color = "blue")
r3 = ggplot(reg.df) +
  aes(y = B, x = C) +
  geom_point() +
  geom_smooth(method = "lm", se = F, color = "blue")
grid.arrange(r1, r2, r3, ncol=3)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The plot is consistent with the results from the correlation analysis (note in particular the much flatter regression line for the A ~ B relationship on the left than for the B ~ C relationship on the right). The term se in the geom_smooth() function is the standard error. If se is set to TRUE (which can also be done simply by omitting the se argument, since TRUE is its default), then the line is drawn with a 95% confidence interval (shown by shading), as shown here for the A ~ C relationship:
ggplot(reg.df) +
aes(y = A, x = C) +
geom_point() +
geom_smooth(method = "lm", color = "blue")
## `geom_smooth()` using formula 'y ~ x'
The geom_smooth() function prints a message that the formula y ~ x is being used. This message goes away if the formula is specified explicitly. For a straight-line regression:
ggplot(reg.df) +
aes(y = A, x = C) +
geom_point() +
geom_smooth(method = "lm", color = "blue", formula = y ~ x)
In straight-line regression of the above kind, a slope m and an intercept k are calculated in the equation \(y = mx + k\). In this equation, m and k are the coefficients. For the regression of A on C (i.e. where A is y and C is x in the above equation), the coefficients are obtained from the lm() function, which calculates the regression line. Thus:
reg.df %>%
  lm(A ~ C, .) %>%
  coef()
## (Intercept) C
## 0.3831549 -0.4584051
which for the middle plot above gives a slope of m = -0.4584051 and an intercept of k = 0.3831549.
When a straight-line regression is calculated, the points in the scatter are modelled by corresponding points on the regression line. Thus for every black point there is a corresponding fitted red point with the same x-axis value but positioned on the regression line itself, as shown in this plot:
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  ggplot() +
  aes(y = A, x = C) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue", formula = y ~ x) +
  # add the fitted values
  geom_point(aes(y = .fitted, x = C), col = "red")
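As a small cross-check (a sketch using the coefficients calculated above), the fitted values in the .fitted column are nothing other than m * C + k:
# the fitted (red) points are just m*C + k, using the coefficients from above
cfs = coef(lm(A ~ C, data = reg.df))
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  mutate(byhand = cfs["(Intercept)"] + cfs["C"] * C) %>%
  # the largest discrepancy should be (numerically) zero
  summarise(max(abs(.fitted - byhand)))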
The slope and intercept of the regression line are calculated by minimising a quantity known as SSE (the Sum of the Squared Error) which is the sum of the squared vertical distances between the red and black points. SSE is given by:
reg.df %>%
  lm(A ~ C, .) %>%
  deviance()
## [1] 72.22293
which is the same as the sum of the squared distances between the actual (black) and fitted (red) values in the above plot:
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  summarise(sum((A - .fitted)^2))
## # A tibble: 1 × 1
## `sum((A - .fitted)^2)`
## <dbl>
## 1 72.2
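The same value can also be obtained from the residuals directly (a brief sketch using base R's resid() function):
# SSE as the sum of the squared residuals
fit = lm(A ~ C, data = reg.df)
sum(resid(fit)^2)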
The closer SSE is to zero, the better the fit of the straight line to the points – so that if SSE is actually zero, then the straight line would pass exactly through all the points.
A further point to note is that, in contrast to the correlation coefficient, the slope of the regression does not have to lie within the range ±1 and often does not. For example, if the values of A are made much larger by (in this case) multiplying them by 1000, then the coefficients also become correspondingly larger:
reg.df %>%
  mutate(Alarge = 1000 * A) %>%
  lm(Alarge ~ C, .) %>%
  coef()
## (Intercept) C
## 383.1549 -458.4051
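By contrast, the correlation coefficient is unaffected by this rescaling, because it is standardised by the standard deviations of both variables. A quick check (results not shown; both values come out the same as cor(A, C) above):
# the correlation coefficient does not change when A is rescaled
reg.df %>%
  mutate(Alarge = 1000 * A) %>%
  summarise(cor(A, C), cor(Alarge, C))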
Regression can be used to predict (in this case) new values of A from C. For example, to predict the values of A when C is -3 and -4:
reg.df %>%
  lm(A ~ C, .) %>%
  predict(., data.frame(C = c(-3, -4)))
## 1 2
## 1.758370 2.216775
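The same predictions can be calculated by hand from the slope and intercept obtained earlier (a minimal sketch):
# y = m*x + k for C = -3 and C = -4
cfs = coef(lm(A ~ C, data = reg.df))
cfs["(Intercept)"] + cfs["C"] * c(-3, -4)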
The predicted values always lie on the regression line itself, as shown in the plot below:
ggplot(reg.df) +
aes(y = A, x = C) +
geom_point() +
# add the two new points to the plot
geom_point(aes(x = -3, y = 1.758370), col = "red") +
geom_point(aes(x = -4, y = 2.216775), col = "red") +
# note the need to use fullrange=T to extend
# the regression line
geom_smooth(method = "lm", se = F,
color = "blue", fullrange=T)
## `geom_smooth()` using formula 'y ~ x'
The slope calculated with lm() and the correlation coefficient (which, as discussed above, is bounded by ±1) are related by \(m = r \cdot \frac{sd(y)}{sd(x)}\), where these four quantities are respectively: m, the slope calculated with lm(); r, the correlation coefficient; sd(y), the standard deviation of y (here A); and sd(x), the standard deviation of x (here C). Thus for this example:
reg.df %>%
  # correlation coefficient
  summarise(r = cor(A, C),
            # sd of A
            sdy = sd(A),
            # sd of C
            sdx = sd(C)) %>%
  # r * sd(y)/sd(x)
  mutate(slope = r * sdy/sdx)
## r sdy sdx slope
## 1 -0.5072916 0.9911202 1.096818 -0.4584051
which gives the same slope of -0.4584051, as calculated with lm()
above.
A statistical test can be carried out to determine whether the slope of the regression line differs significantly from zero (the null hypothesis is that it is zero). If the slope of the regression line (or the correlation coefficient) is zero, then there is no linear association between the two variables, which means that the values of one variable cannot be predicted from the other. There are three equivalent ways of testing for significance; the one most commonly used is the third method, as outlined below.
Testing whether the slope is zero can be done as follows:
reg.df %>%
  lm(A ~ C, .) %>%
  summary() %>%
  tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.383 0.110 3.47 0.000766
## 2 C -0.458 0.0787 -5.83 0.0000000719
The slope is in row 2 (the intercept is also tested, but that is not generally important). The result gives the slope value itself (column 2), the t-statistic (column 4) and the probability of obtaining such a slope if the true slope were zero (last column: here p < 0.001). So the conclusion is that there is a significant (t = -5.83, p < 0.001) linear relationship between A and C (i.e. A can be predicted from C at a level significantly greater than chance).
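An equivalent way of looking at the same result (a sketch using base R's confint() function) is through the 95% confidence interval of the slope: if the slope differs significantly from zero, this interval should not include zero.
# 95% confidence intervals for the intercept and the slope
reg.df %>%
  lm(A ~ C, .) %>%
  confint()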
Since the slope in regression is the same as the correlation coefficient multiplied by a constant (by sdy/sdx as described above), exactly the same result is obtained by testing whether the correlation between the two variables is significantly different from zero. This can be done with the cor.test()
function (with a slightly different syntax), thus:
reg.df %>%
  cor.test(~ A + C, .) %>%
  tidy()
## # A tibble: 1 × 8
## estimate statistic p.value parameter conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr>
## 1 -0.507 -5.83 7.19e-8 98 -0.640 -0.345 Pears… two.sided
The information in the above output, from left to right, is: the correlation coefficient (estimate), the t-statistic (statistic), the p-value for the test that the correlation is zero (p.value), the degrees of freedom (parameter), the lower and upper bounds of the 95% confidence interval of the correlation coefficient (conf.low, conf.high), the type of correlation (method: Pearson), and the type of test (alternative: two-sided).
The preferred method for reporting the statistical summary is to include a quantity known as \(R^2\) (R-squared), which is the correlation coefficient squared. Since the correlation coefficient varies between ±1, \(R^2\) must vary between 0 and 1 (since the square of a negative number is positive). The value of \(R^2\) can be seen by leaving out the tidy() function and just applying summary() to the output of the lm() command, thus:
reg.df %>%
  lm(A ~ C, .) %>%
  summary()
##
## Call:
## lm(formula = A ~ C, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44509 -0.56414 0.00729 0.54727 2.05738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.38315 0.11030 3.474 0.000766 ***
## C -0.45841 0.07866 -5.827 7.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8585 on 98 degrees of freedom
## Multiple R-squared: 0.2573, Adjusted R-squared: 0.2498
## F-statistic: 33.96 on 1 and 98 DF, p-value: 7.191e-08
Use Multiple R-squared when calculating a regression of the form y ~ x with just two variables. The other quantity, Adjusted R-squared, is preferred in multiple regression with more than two variables (because the more predictors there are in the regression model, the more R-squared can be artificially inflated; Adjusted R-squared compensates for this).
The information in the penultimate line gives an \(R^2\) value of 0.2573 which is the same as correlation coefficient squared:
reg.df %>%
  summarise(cor(A, C)^2)
## cor(A, C)^2
## 1 0.2573448
The F-statistic of 33.96 in the last line is the t-statistic squared, i.e. (-5.83)^2. The results showed a significant (\(R^2\) = 0.2573, F[1, 98] = 33.96, p < 0.001) linear association between A and C (note again that the probability value in this way of reporting the statistic is exactly the same as in the other two ways given above).
\(R^2\) is also equal to SSR/SSY, where SSR (the regression sum of squares) and SSY (the total sum of squares) are related to SSE by SSY = SSR + SSE.
SSE = reg.df %>%
  lm(A ~ C, .) %>%
  deviance()
SSE
## [1] 72.22293
SSY is the sum of the squared differences between the actual values of y (in this case A) and the mean of y:
SSY = reg.df %>%
  summarise(sum((A - mean(A))^2))
SSY
## sum((A - mean(A))^2)
## 1 97.24961
Therefore SSR is:
SSR = SSY - SSE
SSR
## sum((A - mean(A))^2)
## 1 25.02668
and so \(R^2\) is:
SSR/SSY
## sum((A - mean(A))^2)
## 1 0.2573448
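Equivalently, since SSY = SSR + SSE, \(R^2\) can also be computed as 1 - SSE/SSY, using the SSE and SSY objects defined above:
# the same R^2 via 1 - SSE/SSY
1 - SSE/SSY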
Statistical tests based on linear regression are usually only valid if there is not more than one pair of observations per subject or participant. This is because this type of regression presupposes that the observations are independent of each other and this is not the case if more than one observation pair is taken from the same subject. If there are multiple observations per subject, then there is certainly no harm in drawing a regression line through the scatter using the techniques described in this module, but a different method such as mixed models should instead be used to test for statistical significance.
Beyond the above consideration (which is by far the most important one in phonetics and speech science), statistical tests based on linear regression are also only valid if the distribution of the so-called residuals fulfils various conditions. The residuals are the (vertical) differences between the actual and fitted values (the latter being on the regression line). The residuals are in the .resid column after applying lm() and then augment():
reg.df %>%
  lm(A ~ C, .) %>%
  augment()
## # A tibble: 100 × 8
## A C .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.71 1.54 -0.321 2.03 0.0136 0.837 0.0393 2.38
## 2 -0.653 1.43 -0.271 -0.382 0.0125 0.862 0.00127 -0.448
## 3 -0.175 0.429 0.187 -0.361 0.0117 0.862 0.00106 -0.423
## 4 -0.0540 1.45 -0.280 0.226 0.0127 0.863 0.000451 0.265
## 5 -0.726 0.389 0.205 -0.931 0.0120 0.858 0.00724 -1.09
## 6 0.122 -0.330 0.535 -0.413 0.0223 0.862 0.00270 -0.486
## 7 -0.354 1.84 -0.459 0.105 0.0177 0.863 0.000138 0.124
## 8 -1.10 0.324 0.234 -1.33 0.0126 0.852 0.0155 -1.56
## 9 -0.0555 1.01 -0.0807 0.0251 0.0101 0.863 0.00000444 0.0294
## 10 2.07 -0.166 0.459 1.61 0.0192 0.847 0.0352 1.90
## # … with 90 more rows
The following verifies that the residuals are just the difference between the raw (A) and fitted (.fitted) values:
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  # calculate the residuals manually
  mutate(residuals = A - .fitted) %>%
  # subtract these from the given `.resid` column
  summarise(residuals - .resid) %>%
  # there should be 100 zeros
  table()
## residuals - .resid
## 0
## 100
The conditions on the residuals are discussed in detail here.
Some of these are as follows.
First, the mean of the residuals should be zero (or very close to it):
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  summarise(mean(.resid))
## # A tibble: 1 × 1
## `mean(.resid)`
## <dbl>
## 1 -7.83e-17
which is very clearly fulfilled in this case.
Second, the residuals and the predictor (in this case C) should not be correlated:
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  cor.test(~ .resid + C, .) %>%
  tidy()
## # A tibble: 1 × 8
## estimate statistic p.value parameter conf.low conf.high method alternative
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr> <chr>
## 1 3.01e-16 2.98e-15 1.00 98 -0.196 0.196 Pearson's… two.sided
and there is no evidence that they are in this case, given that the correlation coefficient is almost zero.
Third, the residuals should be normally distributed; this can be tested with the shapiro.test() function:
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  # extract the residuals
  pull(.resid) %>%
  # test whether these deviate from a normal distribution
  shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.99624, p-value = 0.9952
A p-value less than 0.05 would be evidence that the residuals are not normally distributed (the high p-value in this case gives no evidence that they deviate from normality). If they are not normally distributed, then this might be because of outliers, which can be identified in a so-called quantile-quantile (QQ) plot: the greater the deviation of the points from a straight line, the less likely it is that the residuals follow a normal distribution:
reg.df %>%
  lm(A ~ C, .) %>%
  plot(which = 2)
The above plot flags observations 1, 43 and 73 as potential outliers. Thus, if the residuals were not normally distributed, the statistical tests (and plots) could be carried out once again after removing these outliers from the data frame (using the slice() function).
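For example, this is how the outlying rows could be dropped and the model refitted (a sketch only; the output is not shown here):
# drop rows 1, 43 and 73 with slice() and refit the regression
reg.df %>%
  slice(-c(1, 43, 73)) %>%
  lm(A ~ C, .) %>%
  summary() %>%
  tidy()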
Fourth, the residuals should not be auto-correlated. A good example of an auto-correlated signal is a periodic speech signal, because the same pattern (a pitch period) is repeated (as a result of which successive pitch periods are auto-correlated, i.e. correlated with each other). One way to test for autocorrelation is with the acf() function, which computes the correlation of the signal with itself at lags 0, 1, 2, … (the value at lag 0 is always 1 and can be ignored, since it is the correlation of the signal with itself at no shift):
par(mfrow = c(1, 1))
reg.df %>%
  lm(A ~ C, .) %>%
  augment() %>%
  # extract the residuals
  pull(.resid) %>%
  acf()
If (apart from lag 0) the autocorrelation values (especially those at lags 1 and 2) are within the blue lines, then it can usually be assumed that there is no auto-correlation (and so the test can proceed).
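As an alternative to inspecting the acf() plot, a formal test for first-order autocorrelation of the residuals is the Durbin-Watson test; this is a sketch assuming the lmtest package is installed (it is not loaded above), with a statistic near 2 indicating no autocorrelation:
# Durbin-Watson test (assumes the lmtest package is installed)
library(lmtest)
reg.df %>%
  lm(A ~ C, .) %>%
  dwtest()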
Finally, the variance of the residuals should be more or less constant (this is known as homoscedasticity). This means that when the residuals are plotted as a function of the fitted values, there should be no discernible pattern in the data; in particular, the red line in the plot below should be more or less flat about the horizontal line at zero.
reg.df %>%
  lm(A ~ C, .) %>%
  plot(which = 1)