If not already done, carry out parts 1-5 of the setup, as described here.
Load these libraries and data-frames:
library(tidyverse)
library(broom)
urla = "https://www.phonetik.uni-muenchen.de/"
urlb = "studium_lehre/lehrmaterialien/R_speech_processing/Rdf/"
url = paste0(urla, urlb)
vh.df = read.table(file.path(url, "vh.df.txt"),
stringsAsFactors = T)
pvp = read.table(file.path(url, "pvp.txt"),
stringsAsFactors = T)
sz = read.table(file.path(url, "sz.txt"),
stringsAsFactors = T)
sigmoid = function(x, k = 0, m = 1) {
exp(m * x + k) / (1 + exp(m * x + k))
}
A common use of logistic regression in phonetics and speech science is in perception experiments concerned with the influence of a changing numerical variable on the choice between two categories. For example, listeners hear a synthetic vowel and have to decide whether it is /i/ or /u/. The synthetic vowels differ in a range of F2 values between low (500 Hz) and high (2500 Hz). The question is whether listeners' choice between /i/ and /u/ is influenced by F2. Suppose that vowels have been synthesised at 11 steps between F2 = 850 and F2 = 2500.
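These steps could, for example, be generated as follows (the vector name F2 and the assumption of equally spaced steps are illustrative only):
# 11 equally spaced F2 values (Hz) between 850 and 2500
F2 = seq(850, 2500, length.out = 11)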
and suppose that 5 synthetic vowels are presented at each step. These synthetic vowels are presented in randomised order. Upon hearing each vowel, a listener has to judge whether it was /i/ or /u/ (this is known as a forced choice test). At F2 = 500 Hz, a listener might label all 5 synthetic stimuli as /u/ and at F2 = 2500 Hz as /i/ (i.e., not /u/). Between these two extremes, a listener is increasingly likely to label more steps with /i/; and at some F2 value there is often a sudden jump in preferred choices from /u/ to /i/. This is especially so if the two categories are phonologically different (and they are in English, as in beet and boot). Thus, the number of times a listener labelled the five vowels at each F2 step as /u/ might look like the counts shown in the sketch below.
Since 5 vowels were presented per step, the number of /i/ responses is just 5 minus the number of /u/ responses.
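For example, with the /u/ counts stored in a vector (the vector names u.responses and i.responses are illustrative):
# number of /u/ responses at each of the 11 F2 steps
u.responses = c(5, 5, 5, 4, 4, 3, 1, 0, 0, 0, 0)
# number of /i/ responses: 5 minus the number of /u/ responses
i.responses = 5 - u.responses
i.responses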
## [1] 0 0 0 1 1 2 4 5 5 5 5
These three variables can be put into a data-frame, and the proportion of /i/-responses (in other words: not /u/-responses) can then be calculated per step: this is the number of /i/-responses divided by the total number of responses (which is always 5).
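A minimal sketch of both steps, assuming the vectors F2, u.responses and i.responses from above (the data-frame name ui.df is illustrative):
ui.df = data.frame(F2 = F2,
                   u = u.responses,
                   i = i.responses)
# proportion of /i/-responses per step: /i/ count divided by the 5 responses
ui.df$prop.i = ui.df$i / 5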
A plot of the proportion of /i/-responses as a function of F2 can now be made.
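One way to produce such a plot, using the hypothetical ui.df from the sketch above:
ui.df %>%
  ggplot() +
  aes(y = prop.i, x = F2) +
  geom_point() +
  ylab("Proportion of /i/-responses")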
The issue to be determined now is whether the proportion of /i/-responses (on the y-axis) is influenced by F2 (on the x-axis). The figure just created suggests that this must be so. This is because for low values of F2 the proportion of /i/-responses is zero (i.e. they are all /u/-responses) whereas for high F2 values, the subject always responds with /i/. Thus F2 obviously does have an influence on the subject’s choice between /i/ and /u/.
Perhaps the way to quantify this association statistically would be to fit a straight line to the data using the linear regression techniques from the last module. However this is not appropriate for two main reasons:
A sigmoid function can be plotted with the function sigmoid() inside stat_function() as follows:
As shown by args(sigmoid), any sigmoid is uniquely defined by just two parameters: the intercept and the slope. Lower slope values cause the S-shape to become flatter. Here are four sigmoids with slopes 1 (the default), 0.5, 0.25, and 0 in black, red, green, and blue:
ggplot() +
xlim(-10, 10) +
stat_function(fun = sigmoid) +
stat_function(fun = sigmoid, args=list(m = .5), col="red") +
stat_function(fun = sigmoid, args=list(m = .25), col="green") +
stat_function(fun = sigmoid, args=list(m = 0), col="blue")
so that at slope 0, the sigmoid is completely flat and becomes a horizontal straight line (at 0.5, given the default intercept of 0).
Logistic regression fits a sigmoid to two variables in which the dependent variable (y in y ~ x) is a factor with two levels, from which proportions are derived. Logistic regression also tests whether the two variables follow a sigmoid function. Essentially, this involves determining whether the sigmoidal relationship between the two variables is significantly different from a sigmoid with no slope and just an intercept, like the blue line shown above. If not, then there is no significant association between the two variables (because one cannot be predicted from the other) and they do not follow any kind of sigmoidal relationship.
The steps in carrying out logistic regression will be illustrated using the data frame vh.df:
## Year Vowel
## Min. :1950 high: 82
## 1st Qu.:1960 low :138
## Median :1971
## Mean :1976
## 3rd Qu.:1993
## Max. :2005
## Year Vowel
## 1 1950 high
## 2 1950 high
## 3 1950 high
## 4 1950 high
## 5 1950 high
## 6 1950 high
The data-frame consists of two factors, Year and Vowel. The first of these, Year, is the year in which the word lost was recorded and produced in Standard Southern British English. The second, Vowel, consists of two levels, high and low. These are choices made by a transcriber when listening to these words. If the vowel in lost was produced as /lost/ with a high vowel, then the observation is labelled high. Otherwise, if the vowel in lost was produced as /lɔst/ with a low vowel, the observation is labelled low. The issue to be considered is whether the choice between high and low was influenced by Year. As always, it is a good idea to start with a plot of the data. Here the plot should be of the proportion with which either high or low was produced. The first step is to count the number of occurrences of high and low per year.
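These counts can be obtained with group_by() and summarise(), for example:
vh.df %>%
  group_by(Year, Vowel) %>%
  summarise(n = n())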
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 3
## # Groups: Year [6]
## Year Vowel n
## <int> <fct> <int>
## 1 1950 high 30
## 2 1950 low 5
## 3 1960 high 18
## 4 1960 low 21
## 5 1971 high 15
## 6 1971 low 26
## 7 1980 high 13
## 8 1980 low 20
## 9 1993 high 4
## 10 1993 low 32
## 11 2005 high 2
## 12 2005 low 34
From this it is clear that in 1950 there were 30 occurrences of high and 5 occurrences of low. The proportion with which high occurred in 1950 is therefore 30/35 = 0.8571429 and the proportion with which low occurred is one minus this value, i.e. 1 - 0.8571429 = 0.1428571. It is only necessary to calculate one set of proportions, since the proportions of high and low are predictable from each other. The final line below chooses low:
vh.df %>%
# Warning: it's very important to order the
# factors as group_by(x, y). Therefore, for
# Vowel ~ Year, the order is (Year, Vowel)
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == levels(Vowel)[2])
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups: Year [6]
## Year Vowel n prop
## <int> <fct> <int> <dbl>
## 1 1950 low 5 0.143
## 2 1960 low 21 0.538
## 3 1971 low 26 0.634
## 4 1980 low 20 0.606
## 5 1993 low 32 0.889
## 6 2005 low 34 0.944
There are two levels to the dependent variable, Vowel.
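They can be listed with, for example:
levels(vh.df$Vowel)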
## [1] "high" "low"
The command levels(Vowel)[2] chooses the second, which is low:
## [1] "low"
The reason for choosing the second is that this ensures that the proportions and the overlaid sigmoid in the resulting plot interpret the vertical axis in the same way (for both, 0 to 1 corresponds to increasing proportions of low).
The plot without the sigmoid consists of the same commands passed to ggplot:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == levels(Vowel)[2]) %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
ylab("Proportion of 'low' responses")
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The function glm(family = binomial) combined with coef() calculates the best fitting sigmoid that relates Vowel and Year. The term family = binomial tells the glm() function that the relationship is logistic, in which the dependent variable is a choice between one of two categories (high vs. low). In these commands, the fitted sigmoid's coefficients are stored as the vector km.
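One way to fit the model and extract its coefficients:
km = vh.df %>%
  glm(Vowel ~ Year, family = binomial, data = .) %>%
  coef()
km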
## (Intercept) Year
## -138.11741886 0.07026313
The function glm() actually fits a straight line to so-called log-odds data from which a sigmoid is then derived (in other words, glm() does not calculate the sigmoid directly from proportions). The odds of low are the number of occurrences of low divided by the number of occurrences of high. The log-odds are then just the logarithm of this ratio, thus:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
group_by(Vowel) %>%
pivot_wider(names_from = Vowel,
values_from = n) %>%
mutate(Odds = low/high,
Log_odds = log(Odds))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## Year high low Odds Log_odds
## <int> <int> <int> <dbl> <dbl>
## 1 1950 30 5 0.167 -1.79
## 2 1960 18 21 1.17 0.154
## 3 1971 15 26 1.73 0.550
## 4 1980 13 20 1.54 0.431
## 5 1993 4 32 8 2.08
## 6 2005 2 34 17 2.83
The intercept and slope of this best fitting straight line are given by km calculated earlier, thus:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
group_by(Vowel) %>%
pivot_wider(names_from = Vowel,
values_from = n) %>%
mutate(Odds = low/high, Log_odds = log(Odds)) %>%
ggplot() +
aes(y = Log_odds, x = Year) +
geom_point() +
geom_abline(slope = km[2], intercept = km[1])
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The intercept of -138.11741886 is the theoretical log-odds when x (in this case Year) is zero; the corresponding proportion, given by sigmoid(0, -138.11741886, 0.07026313), is an extremely small value. (The intercept is indeed at year 0, i.e. the birth of Christ, which is statistically correct but of course absurd given that the problem is to do with the choice between two vowel qualities between 1950 and 2005). The slope of 0.07026313 is the change in the log-odds of low (relative to high) from one year to the next. Further details on interpreting these coefficients are given here.
Thus the above shows that the sigmoid that best fits these proportional data has an intercept of -138.11741886 and a slope of 0.07026313. This sigmoid can now be overlaid on the raw proportions as follows. The first part uses the same commands to calculate the proportions; the second overlays the sigmoid using the coefficients in km:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == levels(Vowel)[2]) %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
stat_function(fun = sigmoid, args=list(k = km[1], m = km[2])) +
ylab("Proportion low")
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The fitted sigmoid doesn’t look particularly S-shaped or sigmoidal, but this is just because the fit has been made to the upper part of the ‘S’. Extending the range from e.g. 1920 to 2020 brings out the sigmoidal shape more clearly:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == "low") %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
stat_function(fun = sigmoid, args=list(k = km[1], m = km[2])) +
ylab("Proportion low") +
scale_x_continuous(limits = c(1920,2020)) +
scale_y_continuous(limits = c(0,1))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The fitted or predicted values of the sigmoid are given by entering the Year into the sigmoid() function; for example, the fitted value at 1960 can be calculated as follows.
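Using the coefficients stored in km:
sigmoid(1960, km[1], km[2])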
## Year
## 0.4009057
There is another way to get the predicted value, using the predict() function.
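A sketch, assuming the fitted glm model has been stored in an object (the name vh.glm is illustrative and does not appear in the original):
vh.glm = glm(Vowel ~ Year, family = binomial, data = vh.df)
# predicted log-odds for Year = 1960
predict(vh.glm, newdata = data.frame(Year = 1960))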
## 1
## -0.4016927
The value returned above is a log-odds. It can be converted into a proportion by converting to odds (by taking the exponential of the log-odds), then using the relationship that odds = p/(1-p) (where p is a proportion), and finally rearranging to give p = odds/(1+odds).
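For example, using the log-odds value returned by predict() above:
odds = exp(-0.4016927)          # log-odds converted to odds
odds / (1 + odds)               # odds converted to a proportion
sigmoid(1960, km[1], km[2])     # the same proportion, directly from the sigmoid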
## [1] 0.4009057
## Year
## 0.4009057
The sigmoid function can therefore be used to predict the proportions for, e.g., the years 2010 to 2020.
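Using km as before:
sigmoid(2010:2020, km[1], km[2])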
## [1] 0.9573631 0.9601408 0.9627445 0.9651844 0.9674698 0.9696100 0.9716134
## [8] 0.9734884 0.9752428 0.9768837 0.9784184
These values are proportions of low. Thus, the prediction made by the fitted sigmoid is that in 2010, 95.73631% of lost words will be produced with a low vowel.
The fitted sigmoid can also be used to calculate the year at which the choice between high and low is ambiguous, i.e. when the proportions are both equal to 0.5. This is the steepest point of the sigmoid and it can be shown that this cross-over point ("Umkipppunkt") is given by -k/m, where k and m are the fitted intercept and slope.
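In this case, using the coefficients in km:
-km[1]/km[2]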
## (Intercept)
## 1965.717
The reason why -km[1]/km[2] (or, more generally, -k/m) is the cross-over boundary is that this is the value in any sigmoid at which the proportions are equal at 0.5. This can be informally shown as follows. The sigmoid function (which can be seen in R by entering sigmoid, without parentheses, on its own) is given by:
\(f(x) = \frac{e^{mx+k}}{1 + e^{mx+k}}\)
Now substitute x=-k/m
in the above function:
\(f(-k/m) = \frac{e^{m(-k/m)+k}}{1 + e^{m(-k/m)+k}}\)
which is equal to:
\(f(-k/m) = \frac{e^{0}}{1 + e^{0}}\)
Since anything raised to the power of zero is 1, then the above is the same as
\(f(-k/m) = \frac{1}{2}\)
which shows that an x-axis value of -k/m corresponds to a proportion of 0.5.
The above gives a cross-over boundary around 1965: this is the year at which a majority of /lost/ with a high vowel (before 1965) begins to give way to a majority of /lɔst/ with a low vowel (after 1965). The cross-over boundary can be superimposed on the display as follows:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == "low") %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
stat_function(fun = sigmoid, args=list(k = km[1], m = km[2])) +
ylab("Proportion low") +
geom_vline(xintercept = -km[1]/km[2], lty=2) +
geom_hline(yintercept = .5, lty=2)
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The final step is to test whether the relationship between the proportion of low and Year follows a sigmoid function. This is done using a so-called \(χ^2\) test (chi-squared; the first word is pronounced /kai/).
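One way to carry out this test, assuming the fitted model has been stored as vh.glm (as in the earlier sketch):
anova(vh.glm, test = "Chisq")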
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Vowel
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 219 290.57
## Year 1 61.121 218 229.45 5.367e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The relevant parameter is deviance which is tested with the \(χ^2\) distribution: the greater the deviance, the greater the likelihood that proportion and year are related by a sigmoid function. Thus for this case, the conclusion could be written up as: The choice between high and low was significantly (\(χ^2\)[1] = 61.121, p < 0.001) affected by year.
Deviance is a measure of how well the sigmoid (the predicted values) fits the actual data: if the deviance is zero, then the fit is perfect. The statistical test is of whether this deviance is significantly different from the deviance of the same model but with an intercept only. That is, the analysis is of whether the deviance (given under Residual Deviance below) obtained from:
##
## Call: glm(formula = Vowel ~ Year, family = binomial, data = .)
##
## Coefficients:
## (Intercept) Year
## -138.11742 0.07026
##
## Degrees of Freedom: 219 Total (i.e. Null); 218 Residual
## Null Deviance: 290.6
## Residual Deviance: 229.5 AIC: 233.5
is significantly different from the deviance obtained from:
##
## Call: glm(formula = Vowel ~ 1, family = binomial, data = .)
##
## Coefficients:
## (Intercept)
## 0.5205
##
## Degrees of Freedom: 219 Total (i.e. Null); 219 Residual
## Null Deviance: 290.6
## Residual Deviance: 290.6 AIC: 292.6
Since this second model fits a horizontal line to the data (i.e. an intercept in which the slope is zero), the test is essentially of whether or not the fitted sigmoid has a slope (i.e. is S-shaped or sigmoidal). A non-significant result means that the data points are just as likely to be modelled by a straight line with no sigmoidal shape. The size of the deviance, i.e. 61.121 in (\(χ^2\)[1] = 61.121, p < 0.001), is the difference between the Null Deviance (the deviance with an intercept only) and the Residual Deviance (the deviance with an intercept and slope).
Logistic regression can also be used if x in y ~ x is a factor (i.e. categorical). (There is no need to draw a sigmoid nor to calculate a cross-over boundary in this case). An example of such an analysis is in the data frame sz:
## Frikativ Dialekt Vpn
## s:10 BY: 9 S1 : 1
## z:10 SH:11 S10 : 1
## S11 : 1
## S12 : 1
## S13 : 1
## S14 : 1
## (Other):14
This has a factor Frikativ showing whether the initial consonant of Sonne was produced with an /s/ or /z/, and a factor Dialekt showing whether the fricative was produced by a speaker from BY (Bayern) or from Schleswig-Holstein (SH). The question is whether the choice between /s/ or /z/ is affected by region, i.e. in terms of y ~ x, whether Frikativ is influenced by Dialekt. Since both are factors (i.e. categorical variables), an appropriate figure is with geom_bar().
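One possible version of such a figure (the exact form of the original plot may differ):
sz %>%
  ggplot() +
  aes(x = Dialekt, fill = Frikativ) +
  geom_bar()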
The resulting figure shows that /z/ is much more frequent in Schleswig-Holstein. The test is similar to the one before.
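A sketch of this test (the object name sz.glm is illustrative, not from the original):
sz.glm = glm(Frikativ ~ Dialekt, family = binomial, data = sz)
anova(sz.glm, test = "Chisq")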
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Frikativ
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 27.726
## Dialekt 1 5.3002 18 22.426 0.02132 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Consistent with the figure, the choice between /s/ or /z/ was significantly influenced by region (\(χ^2\)[1] = 5.3, p < 0.05).
Even when y and x are both factors, as in the above example, logistic regression nevertheless calculates coefficients, i.e. an intercept and a slope.
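For example, re-using the hypothetical sz.glm from above:
coef(sz.glm)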
## (Intercept) DialektSH
## -1.252763 2.233592
For reasons that are explained in further detail here, the intercept and slope can be related to the log-odds of the two levels of x, i.e. of Dialekt. The odds and log-odds for these two levels are calculated as follows:
sz %>%
group_by(Dialekt, Frikativ) %>%
summarise(n = n()) %>%
group_by(Frikativ) %>%
pivot_wider(names_from = Frikativ,
values_from = n) %>%
mutate(Odds = z/s,
Log_odds = log(Odds))
## `summarise()` has grouped output by 'Dialekt'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 5
## Dialekt s z Odds Log_odds
## <fct> <int> <int> <dbl> <dbl>
## 1 BY 7 2 0.286 -1.25
## 2 SH 3 8 2.67 0.981
Thus the odds of /z/ occurring in BY are 2/7 and the odds of /z/ occurring in SH are 8/3 (the log-odds are just the logarithms of these).
The intercept is the log-odds of BY which, as the above table shows, is -1.25.
The slope is the log-odds of SH minus the log-odds of BY = 0.981 - (-1.25) = 2.231 (the slight discrepancy with the coefficient 2.233592 above is due to rounding): this is the same as the log of the odds ratio between SH and BY, i.e. log(2.67/0.286).