Carry out parts 1-5 of the setup, as described here.

Load this library:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1 Typing commands into R

The console (see above) is the window for typing commands (or executing code). For example:

1 + 1
## [1] 2
10 - 5
## [1] 5
3 * 4
## [1] 12
12 / 6
## [1] 2
2^4
## [1] 16

R is an interpreted language: each command is evaluated as soon as it is entered and the result is returned immediately, without first having to compile a program. Commands in R make use of different kinds of objects. A brief introduction to the most important of these is given below.

1.1 Vectors

A vector is a type of object for storing numerical or character data. A vector, or indeed any object, can be created with either <- or =, both of which mean ‘becomes’. In the example below, the vector is named x and has the value 2 (everything entered after # is a comment and is not executed):

x = 2
# equivalently
x = 1 + 1

Entering the vector on its own returns its value:

x
## [1] 2

The vector can now be used for further calculations, thus:

x + 3
## [1] 5

Note that assigning a new value to an existing object overwrites the old value. For example:

x = 4
x
## [1] 4
x = 3
x
## [1] 3

1.2 Functions

A function is a type of object that executes code. A function has a name followed by round brackets that may contain so-called arguments. The function ls(), whose name stands for ‘list objects’, can be executed with no arguments and returns the names of the objects that have been stored in the current session:

ls()
## [1] "x"

The function rm() can be used to remove objects as in this example in which the variable x created above is removed or deleted:

rm(x)
ls()
## character(0)

1.3 Types of objects

The function class() gives information about the type of object. The ones created so far and those below are numeric (vectors):

x = 3.2   
class(x)
## [1] "numeric"
y = 4     
class(y)
## [1] "numeric"

Character vectors in R can be created with double or single quotes, thus:

z = "Munich"
# equivalently
z = 'Munich'
class(z)
## [1] "character"

A logical vector consists of TRUE and/or FALSE elements, thus:

a = TRUE   
# equivalently
a = T

b = FALSE  
# equivalently
b = F
class(a)
## [1] "logical"
class(b)
## [1] "logical"
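
The type of an object can also be tested directly with functions such as is.numeric(), is.character(), and is.logical(). The short sketch below also illustrates one reason for preferring TRUE and FALSE over the abbreviations T and F: TRUE and FALSE are reserved words, whereas T and F are ordinary objects that can be overwritten. The value assigned to T here is purely illustrative.

```r
a = TRUE
is.logical(a)            # is a a logical vector?
is.numeric(3.2)          # is 3.2 numeric?
is.character("Munich")   # is "Munich" a character vector?

# T is an ordinary object, so it can be overwritten (illustrative value)
T = 0
isTRUE(T)                # T no longer stands for TRUE
# remove our own T so that the built-in T is found again
rm(T)
isTRUE(T)
```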

1.4 Vectors and concatenation

So far, all of the objects have consisted of just one element or were vectors of length 1. This can also be verified with the length() function. Thus the following shows that x is a numeric vector of length 1.

x = 4
class(x)
## [1] "numeric"
length(x)
## [1] 1

The function c() (combine values) can be used to make a vector consisting of several elements. Thus the following shows that the object y is a numeric vector of length 5:

y = c(10, 0, -1, 8, 2.5)
class(y)
## [1] "numeric"
length(y)
## [1] 5

Character vectors of several elements can be created in the same way. Note that each element is delimited by quotes. Thus the following is a character vector of length 1 even though the words have spaces between them:

z = "Institute of Phonetics"
class(z)
## [1] "character"
length(z)
## [1] 1

whereas the following is a character vector of length 3:

z = c("Institute", "of", "Phonetics")
class(z)
## [1] "character"
length(z)
## [1] 3

If numeric and character elements are combined, then the result is a character vector:

z = c(4, "a", 5)
class(z)
## [1] "character"

If numeric and logical elements are concatenated, then the result is a numeric vector in which T and F are converted to 1 and 0 respectively:

z = c(4, T, 5,  F)
z
## [1] 4 1 5 0
class(z)
## [1] "numeric"
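
The conversions described above can be checked directly. The following is a small sketch in which the object names mixed1, mixed2, and back are illustrative only: logical elements are converted to numeric, numeric elements are converted to character, and converting a character vector back with as.numeric() gives NA for any element that is not a number.

```r
# logical and numeric combined: TRUE becomes 1, FALSE becomes 0
mixed1 = c(TRUE, 2, FALSE)
class(mixed1)
mixed1

# numeric and character combined: the numbers become character strings
mixed2 = c(4, "a", 5)
mixed2

# converting back to numeric: "a" cannot be converted and becomes NA
# (suppressWarnings() hides the warning that R would otherwise give)
back = suppressWarnings(as.numeric(mixed2))
back
is.na(back)
```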

1.5 Arithmetic operations on vectors

An arithmetic operation can be applied to two vectors in the form x + y (or x * y, x/y, x-y). There are three cases to consider. The first is when the length of one of the vectors is 1.

x = c(10, 4, 20)
y = 10
x * y
## [1] 100  40 200

Thus in the above case each element of x is multiplied by the single element of y. The second case is when the length of the two vectors is the same.

x = c(10, 4, 20)
y = c(3, -9, 12)
x + y
## [1] 13 -5 32

In the above, notice that vectors are summed element by element. The third case, which should be avoided, is when the two vectors are of unequal length and neither is of length 1. Here the elements of the shorter vector are reused from the start (so-called recycling), and R gives a warning message if the longer length is not a multiple of the shorter one:

x = c(10, 4, 20)
y = c(3, -9)
x + y
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1] 13 -5 23
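
To make recycling more concrete, the following sketch writes out the reuse of the shorter vector explicitly with rep(). The vectors p, q, a, and b are illustrative. When the longer length is a multiple of the shorter one, recycling happens without any warning, which is one reason why it is easy to overlook.

```r
p = c(1, 2, 3, 4)
q = c(10, 20)
# q is reused as 10, 20, 10, 20; no warning because 4 is a multiple of 2
p + q
# the same calculation with the reuse written out
p + rep(q, times = 2)

# the example with lengths 3 and 2: the shorter vector is reused as 3, -9, 3
a = c(10, 4, 20)
b = c(3, -9)
a + rep(b, length.out = length(a))
```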

Beyond simple numerical operations such as the above, there are also almost countless functions and operations that can be applied to numeric vectors e.g.:

a = c(5, 4, 12)
sum(a)        # sum all the elements
## [1] 21
sqrt(a)       # take the square root of each element
## [1] 2.236068 2.000000 3.464102
a^(0.5)       # as above
## [1] 2.236068 2.000000 3.464102
a^3           # the cube of each element
## [1]  125   64 1728
log(a)        # the natural log. of each element
## [1] 1.609438 1.386294 2.484907
log(a, base=10) # log. to the base 10 of each element
## [1] 0.698970 0.602060 1.079181
exp(a)        # exponential of each element
## [1]    148.41316     54.59815 162754.79142

1.6 Some other functions for manipulating vectors

Sequences of whole numbers can be created with the colon notation:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
10:1
##  [1] 10  9  8  7  6  5  4  3  2  1
-5:5
##  [1] -5 -4 -3 -2 -1  0  1  2  3  4  5

The seq() function creates equally spaced numeric intervals in various ways:

# five equally spaced values between 10 and 20
seq(10, 20, length.out=5)
## [1] 10.0 12.5 15.0 17.5 20.0
# in equal steps of 1.5
seq(10, 20, by = 1.5)
## [1] 10.0 11.5 13.0 14.5 16.0 17.5 19.0
# seq increment in steps of 3 up to 25
seq(10, by=3, to=25)
## [1] 10 13 16 19 22 25
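
The by and length.out arguments are two ways of describing the same kind of sequence: for n equally spaced values between a start and an end point, the step size is (end - start) / (n - 1). The sketch below checks this with illustrative object names (start, end, n, step).

```r
start = 10
end = 20
n = 5
# step size implied by n equally spaced values
step = (end - start) / (n - 1)
step
# these two calls describe the same sequence
seq(start, end, by = step)
seq(start, end, length.out = n)
```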

Arguments in functions

One way to determine the arguments of a function is by looking it up in help pages.

# list the help page for seq()
?seq

Sometimes the function args() can be used for the same purpose:

# the arguments of rm()
args(rm)
## function (..., list = character(), pos = -1, envir = as.environment(pos), 
##     inherits = FALSE) 
## NULL

An argument name can be omitted as long as the argument is supplied in the position in which it is defined in the function. Thus, because from and to are the first and second arguments of seq(), the following give equivalent output:

seq(10, 20, length.out=5)
## [1] 10.0 12.5 15.0 17.5 20.0
seq(from = 10, to = 20, length.out=5)
## [1] 10.0 12.5 15.0 17.5 20.0
seq(10, to = 20, length.out=5)
## [1] 10.0 12.5 15.0 17.5 20.0

If the argument names are provided, then the order in which they appear is irrelevant. Thus these are also equivalent:

seq(from = 10, to = 20, length.out=5)
## [1] 10.0 12.5 15.0 17.5 20.0
seq(length.out = 5, from = 10, to = 20)
## [1] 10.0 12.5 15.0 17.5 20.0
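
The same rules apply to other functions. As a brief illustration, log() used earlier has the two arguments x and base, in that order, and base has a default value, so it can be left out altogether:

```r
# both arguments by position
log(100, 10)
# second argument by name
log(100, base = 10)
# both by name, in reverse order
log(base = 10, x = 100)
# base omitted: the default (the natural logarithm) is used
log(100)
# show the arguments and the default value of base
args(log)
```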

The function rep() can be used to repeat the elements of a vector (whether numeric, character, or logical) in various ways:

# repeat 1 three times
rep(1,  3)
## [1] 1 1 1
# repeat the character "a" twice
rep("a", times = 2)
## [1] "a" "a"
# repeat the sequence "a", "b" four times
vec <- c("a", "b")
rep(vec, times = 4)
## [1] "a" "b" "a" "b" "a" "b" "a" "b"
# repeat each element of vec four times
rep(vec, each = 4)
## [1] "a" "a" "a" "a" "b" "b" "b" "b"

In its simplest form, the paste() function makes a single element from several characters with the possibility of including a separator between them, thus:

# make a single element "a b c"
vec = paste("a", "b", "c", sep = " ")
vec
## [1] "a b c"
length(vec)
## [1] 1
# make a single element "abc"
paste("a", "b", "c", sep = "")
## [1] "abc"
# the same
paste0("a", "b", "c")
## [1] "abc"

paste() can also be used to attach prefixes or suffixes, thus:

paste("S", 1:10, sep="_")
##  [1] "S_1"  "S_2"  "S_3"  "S_4"  "S_5"  "S_6"  "S_7"  "S_8"  "S_9"  "S_10"
paste(1:10, "S", sep="_")
##  [1] "1_S"  "2_S"  "3_S"  "4_S"  "5_S"  "6_S"  "7_S"  "8_S"  "9_S"  "10_S"
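
Note that sep only applies between separate arguments of paste(). To join the elements of a single vector into one string, the collapse argument is needed instead. A minimal sketch, with the illustrative vector parts:

```r
parts = c("a", "b", "c")
# sep has no effect on a single vector: the result still has three elements
paste(parts, sep = "")
length(paste(parts, sep = ""))
# collapse joins the elements of the vector into one string
paste(parts, collapse = "")
# sep and collapse can be combined
paste("S", 1:3, sep = "_", collapse = ", ")
```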

The unique() function returns each distinct element of a vector once, in the order of first appearance.

vec <- c(1, 5, 2, 7, 6, 3, 7, 5)
unique(vec)
## [1] 1 5 2 7 6 3
vec <- c("i", "i", "a", "a", "E", "E", "E", "E", "U")
unique(vec) 
## [1] "i" "a" "E" "U"

The function table() in its simplest form additionally counts the number of times each unique element occurs:

vec <- c("i", "i", "a", "a", "E", "E", "E", "E", "U")
table(vec)  
## vec
## a E i U 
## 2 4 2 1
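
The result of table() can be stored and used like any other object. The following sketch, in which the name counts is illustrative, shows how to get at the labels and at an individual count:

```r
vec <- c("i", "i", "a", "a", "E", "E", "E", "E", "U")
counts = table(vec)
# the labels are the unique elements of vec
names(counts)
# the count for a single element
counts[["E"]]
# the counts add up to the length of the vector
sum(counts)
# the most frequent elements first
sort(counts, decreasing = TRUE)
```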

2 Data Frames

A data-frame is a table with a certain number of rows and columns that forms an essential structure for all forms of statistical and speech analysis in R. Understanding the nature and properties of a data-frame is therefore essential for all future modules.

2.1 Observations and variables

The rows of a data-frame are sometimes called observations because this is a record of what an experimenter has observed in collecting data for analysis. For example, if I record 3 male and 4 female participants each producing one [i] and one [a] vowel and then measure the average fundamental frequency per vowel, then I have 14 observations (7 participants \(\times\) 2 vowel types) and thus 14 rows in the data-frame. The columns of the data-frame are known as variables that contain information about the type of data that was collected. In this simple example, there are four variables:

  • the fundamental frequency values
  • the subject identifier
  • the sex of the subject
  • the vowel

This means that this is a 14 \(\times\) 4 data-frame in this example: that is, a data-frame with 14 observations and 4 variables i.e. with 14 rows and 4 columns.

To make this more concrete, the above will be simulated with some artificial data that is used to build a data frame:

# the fundamental frequency values
fund = c(110, 115, 98, 102, 141, 156, 180, 179, 165, 180, 195, 200, 183, 184)
# the vowels
vowels = rep(c("a", "i"),  7)
# the subject identifier
subj = paste0(1:14, "S")
# the sex of the subject
speakersex = c(rep("M", 6), rep("F", 8))

This information can now be assembled into a data-frame as follows.

df = data.frame(fund, vowels, subj, speakersex)

The following confirms that it is a 14 \(\times\) 4 data-frame i.e. one with 14 observations and 4 variables:

# number of observations x variables (rows x columns)
dim(df)
## [1] 14  4

An alternative way of extracting information from a data-frame, discussed extensively in the next module, is to make use of the pipe operator %>%. The following is equivalent to the above:

df %>% dim()
## [1] 14  4

The above literally means: pass the data-frame df to the function dim(), which is then applied to it. Similarly, here are two equivalent ways of looking at the first few observations of the data-frame:

# look at the first few observations:
head(df)
##   fund vowels subj speakersex
## 1  110      a   1S          M
## 2  115      i   2S          M
## 3   98      a   3S          M
## 4  102      i   4S          M
## 5  141      a   5S          M
## 6  156      i   6S          M
# or
df %>% head()
##   fund vowels subj speakersex
## 1  110      a   1S          M
## 2  115      i   2S          M
## 3   98      a   3S          M
## 4  102      i   4S          M
## 5  141      a   5S          M
## 6  156      i   6S          M
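
As a rough sketch of what the pipe does in general: the object on the left of %>% is handed to the function on the right as its first argument, so x %>% f(y) is the same as f(x, y). The vector scores below is made up for illustration, and the sketch assumes that the magrittr package, which provides %>%, is installed (the operator is also available once tidyverse has been loaded).

```r
library(magrittr)

scores = c(4, 9, 16)   # illustrative values
# these two lines are equivalent
sqrt(scores)
scores %>% sqrt()
# further arguments go inside the brackets as usual:
# the same as head(scores, 2)
scores %>% head(2)
# pipes can be chained: the same as sum(sqrt(scores))
scores %>% sqrt() %>% sum()
```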

It looks like the above data-frame has 5 variables when only 4 were entered. In fact, the numbers on the far left are not a variable at all but the (row)names of the observations (which by default extend from 1 to the number of observations as a character vector). These can be seen with the rownames() function:

# look at the row names (the names of the observations)
rownames(df)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
# or
df %>% rownames()
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
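
A few other base R functions are useful for inspecting a data-frame. The sketch below uses a small made-up data-frame d rather than df so that it can be run on its own; the same functions can be applied to df.

```r
# a small data-frame with made-up values
d = data.frame(fund = c(110, 115, 180, 179),
               vowels = c("a", "i", "a", "i"),
               speakersex = c("M", "M", "F", "F"))
nrow(d)        # number of observations
ncol(d)        # number of variables
colnames(d)    # names of the variables
str(d)         # compact overview of each variable and its type
d$fund         # a single variable extracted as a vector
mean(d$fund)   # which can then be used in further calculations
```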

2.2 Numeric and categorical variables

Another absolutely fundamental issue both for statistics and just about all speech research in R is the distinction between numeric and categorical variables. The distinction between the two is easily seen when using the summary() function on the data-frame:

summary(df)
##       fund          vowels              subj            speakersex       
##  Min.   : 98.0   Length:14          Length:14          Length:14         
##  1st Qu.:121.5   Class :character   Class :character   Class :character  
##  Median :172.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :156.3                                                           
##  3rd Qu.:182.2                                                           
##  Max.   :200.0

Those variables with numerical details about the minimum, maximum, median, mean, and quartiles are numeric variables. The others are in almost all cases categorical variables. For this example, fund is clearly numeric because fundamental frequency takes on a range of continuous, numeric values; the other variables to do with the vowels, subject, and speaker sex are categorical.

An exceptionally important point for the statistical analysis of speech in R is the following: categorical variables are always formed by making a choice per observation between a finite number of categories. So for the categorical variable vowels there are two categories: for each observation, a choice must be made between either [a] or [i]. For speakersex there are also two categories: an observation is either male or female. For subj, the choice is between the different subject identifiers. Note that in the simulated data above every observation was given its own identifier, so that subj has 14 categories rather than the 7 that would be expected for 7 speakers. As already observed, the unique() function when applied to a vector is informative about what the possible category choices are:

unique(vowels)
## [1] "a" "i"
unique(speakersex)
## [1] "M" "F"
unique(subj)
##  [1] "1S"  "2S"  "3S"  "4S"  "5S"  "6S"  "7S"  "8S"  "9S"  "10S" "11S" "12S"
## [13] "13S" "14S"

Sometimes, this idea that categorical variables make choices between a finite number of categories is made explicit in R (especially in statistical analyses) by declaring the categorical variable to be a factor with a certain number of levels. This can be done with the function factor(), thus:

vowels = factor(vowels)
speakersex = factor(speakersex)
subj = factor(subj)

All these objects are now no longer of class character as before, but of class factor:

class(vowels)
## [1] "factor"

Moreover, entering any factor on its own in its entirety or otherwise always lists what the levels (i.e. possible categorical choices) are, as can be seen here:

vowels
##  [1] a i a i a i a i a i a i a i
## Levels: a i
speakersex
##  [1] M M M M M M F F F F F F F F
## Levels: F M
subj
##  [1] 1S  2S  3S  4S  5S  6S  7S  8S  9S  10S 11S 12S 13S 14S
## Levels: 10S 11S 12S 13S 14S 1S 2S 3S 4S 5S 6S 7S 8S 9S
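
Notice in the output above that the levels are sorted as character strings by default, which is why 10S comes before 1S and F before M. The levels can be listed with levels() and counted with nlevels(), and a different order can be requested with the levels argument of factor(). A minimal sketch, using illustrative object names so that the factors above are left unchanged:

```r
ids = paste0(1:14, "S")
subj_default = factor(ids)
levels(subj_default)     # default order: sorted as character strings
nlevels(subj_default)    # number of levels

# request the levels in the order 1S, 2S, ..., 14S
subj_ordered = factor(ids, levels = ids)
levels(subj_ordered)

# request M before F
sex_ordered = factor(c("M", "M", "F"), levels = c("M", "F"))
levels(sex_ordered)
```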

The factors can be bound together in the data-frame as before:

# rebuild the data-frame, this time with the factors
df = data.frame(fund, vowels, subj, speakersex)

The summary() function makes exactly the same distinction between numeric and categorical variables as before:

summary(df)
##       fund       vowels      subj   speakersex
##  Min.   : 98.0   a:7    10S    :1   F:8       
##  1st Qu.:121.5   i:7    11S    :1   M:6       
##  Median :172.0          12S    :1             
##  Mean   :156.3          13S    :1             
##  3rd Qu.:182.2          14S    :1             
##  Max.   :200.0          1S     :1             
##                         (Other):8

but now also provides, for each categorical variable, the number of observations at each level, i.e. in each of its distinct categories.

2.3 Storing a data-frame as a plain text file

A data-frame can be stored outside of R as a text file with the function write.table(). Three important arguments are:

  • the name of the data-frame to be stored or exported
  • the name of the file into which it is exported
  • quote=F: this is to suppress or prevent quote marks around the values that are exported

To store the data-frame df that has just been created as the plain text file fund.df.txt:

write.table(df, file = "fund.df.txt", quote=F)

If you are running R from a project called ipsR as recommended here then the effect of the above command is to create the file fund.df.txt as a plain text file in the ipsR directory, as shown below.

The text file just created in the ipsR directory could now be edited (in the figure below, another row has been added for 190 Hz, [i] vowel, produced by subject 14S who is female) and then read back into R with the read.table() function. The output here shows the file as originally written, with 14 rows:

read.table("fund.df.txt")
##    fund vowels subj speakersex
## 1   110      a   1S          M
## 2   115      i   2S          M
## 3    98      a   3S          M
## 4   102      i   4S          M
## 5   141      a   5S          M
## 6   156      i   6S          M
## 7   180      a   7S          F
## 8   179      i   8S          F
## 9   165      a   9S          F
## 10  180      i  10S          F
## 11  195      a  11S          F
## 12  200      i  12S          F
## 13  183      a  13S          F
## 14  184      i  14S          F
# or stored in the object df2
df2 = read.table("fund.df.txt")
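
The round trip from data-frame to text file and back can be sketched as follows. This is an illustration only: it uses a small made-up data-frame d and a temporary file created with tempfile(), so that nothing is written to the project directory. Note that a variable that was a factor before writing is not necessarily a factor after reading; the stringsAsFactors argument of read.table() can be used to request this.

```r
# a small data-frame with made-up values
d = data.frame(fund = c(110, 115, 180),
               vowels = c("a", "i", "a"),
               speakersex = c("M", "M", "F"))
# a temporary file name (illustrative; any file name could be used)
path = tempfile(fileext = ".txt")
write.table(d, file = path, quote = FALSE)
# the raw text: a header line, then one line per observation
# beginning with the row name
readLines(path)
# read the file back in
d2 = read.table(path)
dim(d2)
colnames(d2)
# read character variables in as factors
d3 = read.table(path, stringsAsFactors = TRUE)
class(d3$vowels)
# delete the temporary file
unlink(path)
```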

Thus one of the ways to build your own data-frame is to create a plain text file on your hard drive with rows and columns (observations and variables) in a manner similar to that in fund.df.txt and then read it into R.
