If not already done, carry out parts 1-5 of the setup, as described here.
Load these libraries and data-frames:
library(tidyverse)
library(broom)
urla = "https://www.phonetik.uni-muenchen.de/"
urlb = "studium_lehre/lehrmaterialien/R_speech_processing/Rdf/"
url = paste0(urla, urlb)
vh.df = read.table(file.path(url, "vh.df.txt"),
stringsAsFactors = T)
pvp = read.table(file.path(url, "pvp.txt"),
stringsAsFactors = T)
sz = read.table(file.path(url, "sz.txt"),
stringsAsFactors = T)
sigmoid = function(x, k = 0, m = 1) {
exp(m * x + k) / (1 + exp(m * x + k))
}
A common use of logistic regression in phonetics and speech science is in perception experiments concerned with the influence of a changing numerical variable on the choice between two categories. For example, listeners hear a synthetic vowel and have to decide whether it is /i/ or /u/. The synthetic vowels differ in a range of F2 values between low (500 Hz) and high (2500 Hz). The question is whether listeners' choice between /i/ and /u/ is influenced by F2. Suppose that vowels have been synthesised at 11 steps between F2 = 850 and F2 = 2500.
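These steps could, for example, be generated as follows (the vector name F2 and the assumption of equally spaced steps are illustrative only):
# 11 equally spaced F2 values (Hz) between 850 and 2500
F2 = seq(850, 2500, length.out = 11)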
and suppose that 5 synthetic vowels are presented at each step. These synthetic vowels are presented in randomised order. Upon hearing each vowel, a listener has to judge whether it was /i/ or /u/ (this is known as a forced choice test). At F2 = 500 Hz, a listener might label all 5 synthetic stimuli as /u/ and at F2 = 2500 Hz as /i/ (i.e., not /u/). Between these two extremes, a listener is increasingly likely to label more steps with /i/; and at some F2 value there is often a sudden jump in preferred choices from /u/ to /i/. This is especially so if the two categories are phonologically different (and they are in English, as in beet and boot). Thus, the number of times a listener labelled the five vowels at each F2 step as /u/ might look like the counts shown in the sketch below.
Since 5 vowels were presented per step, the number of /i/ responses is just 5 minus the number of /u/ responses.
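For example, with the /u/ counts stored in a vector (the vector names u.responses and i.responses are illustrative):
# number of /u/ responses at each of the 11 F2 steps
u.responses = c(5, 5, 5, 4, 4, 3, 1, 0, 0, 0, 0)
# number of /i/ responses: 5 minus the number of /u/ responses
i.responses = 5 - u.responses
i.responses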
## [1] 0 0 0 1 1 2 4 5 5 5 5
These three variables can be put into a data-frame, and the proportion of /i/-responses (in other words: not /u/-responses) can then be calculated per step: this is the number of /i/-responses divided by the total number of responses (which is always 5).
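A minimal sketch of both steps, assuming the vectors F2, u.responses and i.responses from above (the data-frame name ui.df is illustrative):
ui.df = data.frame(F2 = F2,
                   u = u.responses,
                   i = i.responses)
# proportion of /i/-responses per step: /i/ count divided by the 5 responses
ui.df$prop.i = ui.df$i / 5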
A plot of the proportion of /i/-responses as a function of F2 can now be made.
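One way to produce such a plot, using the hypothetical ui.df from the sketch above:
ui.df %>%
  ggplot() +
  aes(y = prop.i, x = F2) +
  geom_point() +
  ylab("Proportion of /i/-responses")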
The issue to be determined now is whether the proportion of /i/-responses (on the y-axis) is influenced by F2 (on the x-axis). The figure just created suggests that this must be so. This is because for low values of F2 the proportion of /i/-responses is zero (i.e. they are all /u/-responses) whereas for high F2 values, the subject always responds with /i/. Thus F2 obviously does have an influence on the subject’s choice between /i/ and /u/.
Perhaps the way to quantify this association statistically would be to fit a straight line to the data using the linear regression techniques from the last module. However this is not appropriate for two main reasons:
A sigmoid function can be plotted with the function sigmoid() inside stat_function() as follows:
As shown by args(sigmoid), any sigmoid is uniquely defined by just two parameters: the intercept and the slope. Lower slope values cause the S-shape to become flatter. Here are four sigmoids with slopes 1 (the default), 0.5, 0.25, and 0 in black, red, green, and blue:
ggplot() +
xlim(-10, 10) +
stat_function(fun = sigmoid) +
stat_function(fun = sigmoid, args=list(m = .5), col="red") +
stat_function(fun = sigmoid, args=list(m = .25), col="green") +
stat_function(fun = sigmoid, args=list(m = 0), col="blue")
so that at slope 0, the sigmoid is completely flat and becomes a horizontal straight line (at 0.5, given the default intercept of 0).
Logistic regression fits a sigmoid to two variables in which the dependent variable (y in y ~ x) is a factor with two levels, from which proportions are derived. Logistic regression also tests whether the two variables follow a sigmoid function. Essentially, this involves determining whether the sigmoidal relationship between the two variables is significantly different from a sigmoid with no slope and just an intercept, like the blue line shown above. If not, then there is no significant association between the two variables (because one cannot be predicted from the other) and they do not follow any kind of sigmoidal relationship.
The steps in carrying out logistic regression will be illustrated using the data frame vh.df:
## Year Vowel
## Min. :1950 high: 82
## 1st Qu.:1960 low :138
## Median :1971
## Mean :1976
## 3rd Qu.:1993
## Max. :2005
## Year Vowel
## 1 1950 high
## 2 1950 high
## 3 1950 high
## 4 1950 high
## 5 1950 high
## 6 1950 high
The data-frame consists of two factors, Year and Vowel. The first of these, Year, is the year in which the word lost was recorded and produced in Standard Southern British English. The second, Vowel, consists of two levels, high and low. These are choices made by a transcriber when listening to these words. If the vowel in lost was produced as /lost/ with a high vowel, then the observation is labelled high. Otherwise, if the vowel in lost was produced as /lɔst/ with a low vowel, the observation is labelled low. The issue to be considered is whether the choice between high and low was influenced by Year. As always, it is a good idea to start with a plot of the data. Here the plot should be of the proportion with which either high or low was produced. The first step is to count the number of occurrences of high and low per year.
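These counts can be obtained with group_by() and summarise(), for example:
vh.df %>%
  group_by(Year, Vowel) %>%
  summarise(n = n())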
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 12 × 3
## # Groups: Year [6]
## Year Vowel n
## <int> <fct> <int>
## 1 1950 high 30
## 2 1950 low 5
## 3 1960 high 18
## 4 1960 low 21
## 5 1971 high 15
## 6 1971 low 26
## 7 1980 high 13
## 8 1980 low 20
## 9 1993 high 4
## 10 1993 low 32
## 11 2005 high 2
## 12 2005 low 34
From this it is clear that in 1950 there were 30 occurrences of high and 5 occurrences of low. The proportion with which high occurred in 1950 is therefore 30/35 = 0.8571429 and the proportion with which low occurred is one minus this value, i.e. 1 - 0.8571429 = 0.1428571. It is only necessary to calculate one set of proportions, since the proportions of high and low are predictable from each other. The final line below chooses low:
vh.df %>%
# Warning: it's very important to order the
# factors as group_by(x, y). Therefore, for
# Vowel ~ Year, the order is (Year, Vowel)
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == levels(Vowel)[2])
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 4
## # Groups: Year [6]
## Year Vowel n prop
## <int> <fct> <int> <dbl>
## 1 1950 low 5 0.143
## 2 1960 low 21 0.538
## 3 1971 low 26 0.634
## 4 1980 low 20 0.606
## 5 1993 low 32 0.889
## 6 2005 low 34 0.944
There are two levels to the dependent variable, Vowel.
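They can be listed with, for example:
levels(vh.df$Vowel)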
## [1] "high" "low"
The command levels(Vowel)[2] chooses the second, which is low:
## [1] "low"
The reason for choosing the second is that this ensures that the proportions and the overlaid sigmoid in the resulting plot interpret the vertical axis in the same way (for both, 0 to 1 corresponds to increasing proportions of low).
The plot without the sigmoid consists of the same commands passed to ggplot:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == levels(Vowel)[2]) %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
ylab("Proportion of 'low' responses")
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The function glm(family = binomial) combined with coef() calculates the best fitting sigmoid that relates Vowel and Year. The term family = binomial tells the glm() function that the relationship is logistic, in which the dependent variable is a choice between one of two categories (high vs. low). In these commands, the fitted sigmoid's coefficients are stored as the vector km.
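One way to fit the model and extract its coefficients:
km = vh.df %>%
  glm(Vowel ~ Year, family = binomial, data = .) %>%
  coef()
km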
## (Intercept) Year
## -138.11741886 0.07026313
The function glm() actually fits a straight line to so-called log-odds data from which a sigmoid is then derived (in other words, glm() does not calculate the sigmoid directly from proportions). The odds of low are the number of occurrences of low divided by the number of occurrences of high. The log-odds are then just the logarithm of this ratio, thus:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
group_by(Vowel) %>%
pivot_wider(names_from = Vowel,
values_from = n) %>%
mutate(Odds = low/high,
Log_odds = log(Odds))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## Year high low Odds Log_odds
## <int> <int> <int> <dbl> <dbl>
## 1 1950 30 5 0.167 -1.79
## 2 1960 18 21 1.17 0.154
## 3 1971 15 26 1.73 0.550
## 4 1980 13 20 1.54 0.431
## 5 1993 4 32 8 2.08
## 6 2005 2 34 17 2.83
The intercept and slope of this best fitting straight line are given by km calculated earlier, thus:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
group_by(Vowel) %>%
pivot_wider(names_from = Vowel,
values_from = n) %>%
mutate(Odds = low/high, Log_odds = log(Odds)) %>%
ggplot() +
aes(y = Log_odds, x = Year) +
geom_point() +
geom_abline(slope = km[2], intercept = km[1])
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The intercept of -138.11741886 is the theoretical log-odds when x (in this case Year) is zero; the corresponding proportion, given by sigmoid(0, -138.11741886, 0.07026313), is an extremely small value. (The intercept is indeed at year 0, i.e. the birth of Christ, which is statistically correct but of course absurd given that the problem is to do with the choice between two vowel qualities between 1950 and 2005). The slope of 0.07026313 is the change in the log-odds of low (relative to high) from one year to the next. Further details on interpreting these coefficients are given here.
Thus the above shows that the sigmoid that best fits these proportional data has an intercept of -138.11741886 and a slope of 0.07026313. This sigmoid can now be overlaid on the raw proportions as follows. The first part uses the same commands to calculate the proportions; the second overlays the sigmoid using the coefficients in km:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == levels(Vowel)[2]) %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
stat_function(fun = sigmoid, args=list(k = km[1], m = km[2])) +
ylab("Proportion low")
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The fitted sigmoid doesn’t look particularly S-shaped or sigmoidal, but this is just because the fit has been made to the upper part of the ‘S’. Extending the range from e.g. 1920 to 2020 brings out the sigmoidal shape more clearly:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == "low") %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
stat_function(fun = sigmoid, args=list(k = km[1], m = km[2])) +
ylab("Proportion low") +
scale_x_continuous(limits = c(1920,2020)) +
scale_y_continuous(limits = c(0,1))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The fitted or predicted values of the sigmoid are given by entering the Year into the sigmoid() function; for example, the fitted value at 1960 can be calculated as follows.
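Using the coefficients stored in km:
sigmoid(1960, km[1], km[2])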
## Year
## 0.4009057
There is another way to get the predicted value, using the predict() function.
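A sketch, assuming the fitted glm model has been stored in an object (the name vh.glm is illustrative and does not appear in the original):
vh.glm = glm(Vowel ~ Year, family = binomial, data = vh.df)
# predicted log-odds for Year = 1960
predict(vh.glm, newdata = data.frame(Year = 1960))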
## 1
## -0.4016927
The value returned above is a log-odds. It can be converted into a proportion by converting to odds (by taking the exponential of the log-odds), then using the relationship that odds = p/(1-p) (where p is a proportion), and finally rearranging to give p = odds/(1+odds).
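For example, using the log-odds value returned by predict() above:
odds = exp(-0.4016927)          # log-odds converted to odds
odds / (1 + odds)               # odds converted to a proportion
sigmoid(1960, km[1], km[2])     # the same proportion, directly from the sigmoid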
## [1] 0.4009057
## Year
## 0.4009057
The sigmoid function can therefore be used to predict the proportions for, e.g., the years 2010 to 2020.
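Using km as before:
sigmoid(2010:2020, km[1], km[2])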
## [1] 0.9573631 0.9601408 0.9627445 0.9651844 0.9674698 0.9696100 0.9716134
## [8] 0.9734884 0.9752428 0.9768837 0.9784184
These values are proportions of low. Thus, the prediction made by the fitted sigmoid is that in 2010, 95.73631% of lost words will be produced with a low vowel.
The fitted sigmoid can also be used to calculate the year at which the choice between high and low is ambiguous, i.e. when the proportions are both equal to 0.5. This is the steepest point of the sigmoid and it can be shown that this cross-over point ("Umkipppunkt") is given by -k/m, where k and m are the fitted intercept and slope.
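In this case, using the coefficients in km:
-km[1]/km[2]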
## (Intercept)
## 1965.717
The reason why -km[1]/km[2] (or, more generally, -k/m) is the cross-over boundary is that this is the value in any sigmoid at which the proportions are equal at 0.5. This can be informally shown as follows. The sigmoid function (which can be seen in R by entering sigmoid, without parentheses, on its own) is given by:
\(f(x) = \frac{e^{mx+k}}{1 + e^{mx+k}}\)
Now substitute x=-k/m
in the above function:
\(f(-k/m) = \frac{e^{m(-k/m)+k}}{1 + e^{m(-k/m)+k}}\)
which is equal to:
\(f(-k/m) = \frac{e^{0}}{1 + e^{0}}\)
Since anything raised to the power of zero is 1, then the above is the same as
\(f(-k/m) = \frac{1}{2}\)
which shows that an x-axis value of -k/m corresponds to a proportion of 0.5.
The above gives a cross-over boundary around 1965: this is the year at which a majority of /lost/ with a high vowel (before 1965) begins to give way to a majority of /lɔst/ with a low vowel (after 1965). The cross-over boundary can be superimposed on the display as follows:
vh.df %>%
group_by(Year, Vowel) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
filter(Vowel == "low") %>%
ggplot() +
aes(y = prop, x = Year) +
geom_point() +
stat_function(fun = sigmoid, args=list(k = km[1], m = km[2])) +
ylab("Proportion low") +
geom_vline(xintercept = -km[1]/km[2], lty=2) +
geom_hline(yintercept = .5, lty=2)
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
The final step is to test whether the relationship between the proportion of low and Year follows a sigmoid function. This is done using a so-called \(χ^2\) test (chi-squared; the first word is pronounced /kai/).
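One way to carry out this test, assuming the fitted model has been stored as vh.glm (as in the earlier sketch):
anova(vh.glm, test = "Chisq")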
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Vowel
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 219 290.57
## Year 1 61.121 218 229.45 5.367e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The relevant parameter is deviance which is tested with the \(χ^2\) distribution: the greater the deviance, the greater the likelihood that proportion and year are related by a sigmoid function. Thus for this case, the conclusion could be written up as: The choice between high and low was significantly (\(χ^2\)[1] = 61.121, p < 0.001) affected by year.
Deviance is a measure of how well the sigmoid (the predicted values) fits the actual data: if the deviance is zero, then the fit is perfect. The statistical test is of whether this deviance is significantly different from the deviance of the same model but with an intercept only. That is, the analysis is of whether the deviance (given under Residual Deviance below) obtained from:
##
## Call: glm(formula = Vowel ~ Year, family = binomial, data = .)
##
## Coefficients:
## (Intercept) Year
## -138.11742 0.07026
##
## Degrees of Freedom: 219 Total (i.e. Null); 218 Residual
## Null Deviance: 290.6
## Residual Deviance: 229.5 AIC: 233.5
is significantly different from the deviance obtained from:
##
## Call: glm(formula = Vowel ~ 1, family = binomial, data = .)
##
## Coefficients:
## (Intercept)
## 0.5205
##
## Degrees of Freedom: 219 Total (i.e. Null); 219 Residual
## Null Deviance: 290.6
## Residual Deviance: 290.6 AIC: 292.6
Since this second model fits a horizontal line to the data (i.e. an intercept in which the slope is zero), the test is essentially of whether or not the fitted sigmoid has a slope (i.e. is S-shaped or sigmoidal). A non-significant result means that the data points are just as likely to be modelled by a straight line with no sigmoidal shape. The size of the deviance, i.e. 61.121 in (\(χ^2\)[1] = 61.121, p < 0.001), is the difference between the Null Deviance (the deviance with an intercept only) and the Residual Deviance (the deviance with an intercept and slope).
Logistic regression can also be used if x in y ~ x is a factor (i.e. categorical). (There is no need to draw a sigmoid nor to calculate a cross-over boundary in this case). An example of such an analysis is in the data frame sz:
## Frikativ Dialekt Vpn
## s:10 BY: 9 S1 : 1
## z:10 SH:11 S10 : 1
## S11 : 1
## S12 : 1
## S13 : 1
## S14 : 1
## (Other):14
This has a factor Frikativ showing whether the initial consonant of Sonne was produced with an /s/ or /z/, and a factor Dialekt showing whether the fricative was produced by a speaker from BY (Bayern) or from Schleswig-Holstein (SH). The question is whether the choice between /s/ or /z/ is affected by region, i.e. in terms of y ~ x, whether Frikativ is influenced by Dialekt. Since both are factors (i.e. categorical variables), an appropriate figure is with geom_bar().
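One possible version of such a figure (the exact form of the original plot may differ):
sz %>%
  ggplot() +
  aes(x = Dialekt, fill = Frikativ) +
  geom_bar()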
The resulting figure shows that /z/ is much more frequent in Schleswig-Holstein. The test is similar to the one before.
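A sketch of this test (the object name sz.glm is illustrative, not from the original):
sz.glm = glm(Frikativ ~ Dialekt, family = binomial, data = sz)
anova(sz.glm, test = "Chisq")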
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Frikativ
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 27.726
## Dialekt 1 5.3002 18 22.426 0.02132 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Consistent with the figure, the choice between /s/ or /z/ was significantly influenced by region (\(χ^2\)[1] = 5.3, p < 0.05).
Even when y and x are both factors, as in the above example, logistic regression nevertheless calculates coefficients, i.e. an intercept and a slope.
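For example, re-using the hypothetical sz.glm from above:
coef(sz.glm)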
## (Intercept) DialektSH
## -1.252763 2.233592
For reasons that are explained in further detail here, the intercept and slope can be related to the log-odds of the two levels of x, i.e. of Dialekt. The odds and log-odds for these two levels are calculated as follows:
sz %>%
group_by(Dialekt, Frikativ) %>%
summarise(n = n()) %>%
group_by(Frikativ) %>%
pivot_wider(names_from = Frikativ,
values_from = n) %>%
mutate(Odds = z/s,
Log_odds = log(Odds))
## `summarise()` has grouped output by 'Dialekt'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 5
## Dialekt s z Odds Log_odds
## <fct> <int> <int> <dbl> <dbl>
## 1 BY 7 2 0.286 -1.25
## 2 SH 3 8 2.67 0.981
Thus the odds of /z/ occurring in BY are 2/7 and the odds of /z/ occurring in SH are 8/3 (the log-odds are just the logarithms of these).
The intercept is the log-odds of BY which, as the above table shows, is -1.25.
The slope is the log-odds of SH minus the log-odds of BY = 0.981 - (-1.25) = 2.231 (the slight discrepancy with the coefficient 2.233592 above is due to rounding): this is the same as the log of the odds ratio between SH and BY, i.e. log(2.67/0.286).