1 Preliminaries and starting up R
2 Converting a text collection into an Emu database
3 Forced alignment
4 Forced alignment: Albanian
- 4.1 From a text collection
- 4.2 From a canonical representation

1 Preliminaries and starting up R

This tutorial assumes that you have a project called ipsR and that it contains at least the directories testsample and emu_databases, both of which are used below.

If not, please see preliminaries here.

Start up R in the project you are using for this course.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(emuR)

## 
## Attaching package: 'emuR'
## 
## The following object is masked from 'package:base':
## 
##     norm

library(wrassp)

In R, store the path to the directory testsample as sourceDir in exactly the following way:

sourceDir = "./testsample"

And also store in R the path to emu_databases as targetDir:

targetDir = "./emu_databases"
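Before going further, it can help to confirm that both directories are visible from R's working directory. The following is a minimal sketch in base R; check_dirs() is a hypothetical helper introduced here for illustration and is not part of emuR.

```r
# Paths as defined above; adjust them if your project is laid out differently.
sourceDir = "./testsample"
targetDir = "./emu_databases"

# check_dirs() is a hypothetical helper, not part of emuR.
# It returns a named logical vector: TRUE for each directory that exists.
check_dirs = function(paths) {
  setNames(dir.exists(paths), paths)
}

check_dirs(c(sourceDir, targetDir))
```

If either entry is FALSE, check with getwd() that R was started in the project directory.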

2 Converting a text collection into an Emu database

The directory ./testsample/german on your computer contains .wav files and .txt files. Define the path to this directory in R and check that you can see these files with the list.files() function:

path.german = file.path(sourceDir, "german")
list.files(path.german)

## [1] "K01BE001.txt" "K01BE001.wav" "K01BE002.txt" "K01BE002.wav"

The above is an example of a text collection because it contains matching .wav and .txt files in the same directory such that, for each .wav file, the .txt file contains the corresponding orthography.
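The pairing requirement can be checked before conversion. The sketch below uses base R only; unmatched_files() is a hypothetical helper (not an emuR function) that lists any .wav file without a .txt partner, and vice versa. The file names are the ones shown in the listing above.

```r
# unmatched_files() is a hypothetical helper, not part of emuR.
# It returns the files whose base name has no partner with the other extension.
unmatched_files = function(files) {
  base = tools::file_path_sans_ext(files)
  ext  = tolower(tools::file_ext(files))
  is_wav = ext == "wav"
  is_txt = ext == "txt"
  no_txt = is_wav & !(base %in% base[is_txt])  # .wav without a .txt
  no_wav = is_txt & !(base %in% base[is_wav])  # .txt without a .wav
  files[no_txt | no_wav]
}

files = c("K01BE001.txt", "K01BE001.wav", "K01BE002.txt", "K01BE002.wav")
unmatched_files(files)   # character(0): every file has a partner
```

In practice the function could be applied to the result of list.files(path.german).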

This command converts the text collection into an Emu database with the name ger2 and stores the resulting Emu database as the directory ger2_emuDB in targetDir:

convert_txtCollection(dbName = "ger2",
                      sourceDir = path.german,
                      targetDir = targetDir)

## INFO: Parsing plain text collection containing 2 file pair(s)...
## INFO: Copying 2 media files to EMU database...
##   INFO: Rewriting 2 _annot.json files to file system...

Load the database in R.

ger2_DB = load_emuDB(file.path(targetDir, "ger2_emuDB"))

## INFO: Loading EMU database from ./emu_databases/ger2_emuDB... (2 bundles found)

summary(ger2_DB)

##

## ── Summary of emuDB ────────────────────────────────────────────────────────────

## Name:     ger2 
## UUID:     db066511-addb-416a-b3e5-1c21116400ca 
## Directory:    /Users/jmh/Desktop/ipsR/emu_databases/ger2_emuDB 
## Session count: 1 
## Bundle count: 2 
## Annotation item count:  2 
## Label count:  4 
## Link count:  0

##

## ── Database configuration ──────────────────────────────────────────────────────

##

## ── SSFF track definitions ──

##

## data frame with 0 columns and 0 rows

## ── Level definitions ──

##  name   type nrOfAttrDefs attrDefNames          
##  bundle ITEM 2            bundle; transcription;

## ── Link definitions ──

## data frame with 0 columns and 0 rows

Look at the database and switch to hierarchy view. The words of each utterance are stored together as a single item on the bundle level, in its attribute transcription:

serve(ger2_DB, useViewer = F)

That the words are stored in this way can be confirmed by querying the database. Note that calcTimes=F is needed because transcription is an attribute of a level of type ITEM that is not linked to a time-based (SEGMENT or EVENT) level.

# segment list
t.s = query(ger2_DB, "transcription =~ .*", calcTimes=F)
# labels
t.s$labels

## [1] "heute ist schönes Frühlingswetter" "die Sonne lacht"
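Each label therefore holds a whole utterance rather than a single word. As a rough illustration only, the sketch below splits the two labels shown above on spaces in base R; this is not how the web service tokenizes the text in the next section, and the labels are typed in by hand here rather than taken from the database.

```r
# The two labels returned by the query above, entered by hand.
labels = c("heute ist schönes Frühlingswetter", "die Sonne lacht")

# Split each utterance on single spaces (a simplification of real tokenization).
words = strsplit(labels, " ", fixed = TRUE)

# Number of words per utterance.
lengths(words)
```

With a loaded database, t.s$labels could be used in place of the hand-typed vector.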

3 Forced alignment

We are now going to run the Munich Automatic Segmentation system (MAUS) over the database. The language selected is German, deu-DE. For other languages, see the information on the available languages and search for LANGUAGE.

At the time of writing, the available languages are:

LANGUAGE: [cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, aus-AU, afr-ZA, sqi-AL, eus-ES, eus-FR, cat-ES, nld-NL-GN, nld-NL, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, eng-CA, eng-GH, eng-IN, eng-IE, eng-KE, eng-NG, eng-PH, eng-ZA, eng-TZ, ekk-EE, kat-GE, fin-FI, fra-FR, deu-DE, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, spa-AR, spa-BO, spa-CL, spa-CO, spa-CR, spa-DO, spa-EC, spa-SV, spa-GT, spa-HN, spa-MX, spa-NI, spa-PA, spa-PY, spa-PE, spa-PR, spa-US, spa-UY, spa-VE, swe-SE, tha-TH, guf-AU]
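Because the list is long, it can be convenient to hold the codes in a character vector and look them up. The sketch below uses base R and only a short excerpt of the codes listed above; the vector name codes is an assumption made for illustration.

```r
# A short excerpt of the language codes listed above (not the full list).
codes = c("deu-DE", "eng-AU", "eng-GB", "eng-US", "gsw-CH", "sqi-AL")

# Is a particular code available?
"sqi-AL" %in% codes

# All varieties of English in the excerpt.
grep("^eng-", codes, value = TRUE)
```

The list of supported languages may change over time, so a code should be checked against the current documentation before use.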

# you must have an active web connection for this to work!
runBASwebservice_all(ger2_DB,
                     transcriptionAttributeDefinitionName = "transcription",
                     language = "deu-DE",
                     runMINNI = FALSE)

## INFO: Preparing temporary database. This may take a while...
## INFO: Checking if cache needs update for 1 sessions and 2 bundles ...
## INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
## INFO: Nothing to update!
## INFO: Sending ping to webservices provider.
## INFO: Running G2P tokenizer on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running G2P on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running MAUS on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (canonical) on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (segmental) on emuDB containing 2 bundle(s)...
## INFO: Autobuilding syllable -> segment links from time information
##   INFO: Rewriting 2 _annot.json files to file system...

Now look at the information stored in the database and note what has been added: the level ORT (with the attributes ORT, KAN, and KAS) and the levels MAS and MAU, together with the links between them:

summary(ger2_DB)

## ── Summary of emuDB ────────────────────────────────────────────────────────────

## Name:     ger2 
## UUID:     db066511-addb-416a-b3e5-1c21116400ca 
## Directory:    /Users/jmh/Desktop/ipsR/emu_databases/ger2_emuDB 
## Session count: 1 
## Bundle count: 2 
## Annotation item count:  58 
## Label count:  74 
## Link count:  52

##

## ── Database configuration ──────────────────────────────────────────────────────

##

## ── SSFF track definitions ──

##

## data frame with 0 columns and 0 rows

## ── Level definitions ──

##  name   type    nrOfAttrDefs attrDefNames          
##  bundle ITEM    2            bundle; transcription;
##  ORT    ITEM    3            ORT; KAN; KAS;        
##  MAU    SEGMENT 1            MAU;                  
##  MAS    ITEM    1            MAS;

## ── Link definitions ──

##  type        superlevelName sublevelName
##  ONE_TO_MANY bundle         ORT         
##  ONE_TO_MANY ORT            MAS         
##  ONE_TO_MANY MAS            MAU

serve(ger2_DB, useViewer = F)

Look at the hierarchy view. Identify the levels, links, and attributes.

More complex queries are now possible, e.g. find the word-initial MAU segments of all polysyllabic words:

mau.s = query(ger2_DB, 
              "[[MAU =~.* & Start(ORT, MAU)=1] ^ Num(ORT, MAS) > 1]")
mau.s

## # A tibble: 4 × 16
##   labels start   end db_uuid      session bundle start_item_id end_item_id level
##   <chr>  <dbl> <dbl> <chr>        <chr>   <chr>          <int>       <int> <chr>
## 1 h       117.  167. db066511-ad… 0000    K01BE…             7           7 MAU  
## 2 S       597.  717. db066511-ad… 0000    K01BE…            13          13 MAU  
## 3 f       977. 1027. db066511-ad… 0000    K01BE…            18          18 MAU  
## 4 z       967. 1027. db066511-ad… 0000    K01BE…             8           8 MAU  
## # ℹ 7 more variables: attribute <chr>, start_item_seq_idx <int>,
## #   end_item_seq_idx <int>, type <chr>, sample_start <int>, sample_end <int>,
## #   sample_rate <int>

Check their word labels:

requery_hier(ger2_DB, mau.s, "ORT")$labels

## [1] "heute"           "schönes"         "Frühlingswetter" "Sonne"
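The query above has three parts: MAU =~.* matches any MAU label, Start(ORT, MAU)=1 keeps only MAU segments that are first in their ORT item (word), and ^ Num(ORT, MAS) > 1 requires that word to dominate more than one MAS item (syllable). As a sketch, the same query string can be built for a different syllable threshold; word_initial_query() is a hypothetical helper written in base R, not an emuR function.

```r
# word_initial_query() is a hypothetical helper, not part of emuR.
# It builds the query string used above for words with more than
# min_syll syllables (min_syll = 1 reproduces the query in the text).
word_initial_query = function(min_syll = 1) {
  sprintf("[[MAU =~.* & Start(ORT, MAU)=1] ^ Num(ORT, MAS) > %d]",
          as.integer(min_syll))
}

word_initial_query(1)
word_initial_query(2)   # word-initial segments of words with 3 or more syllables
```

The resulting string can be passed as the second argument of query(), exactly as in the call above.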

4 Forced alignment: Albanian

4.1 From a text collection

The task is to try out forced alignment for a different language and also to show how forced alignment can be done from a canonical phonemic transcription instead of from text. The text collection is in the albanian subdirectory of testsample:

path.albanian = file.path(sourceDir, "albanian")

Begin by converting the text collection into an Emu database.

convert_txtCollection(dbName = "alb",
                      sourceDir = path.albanian,
                      targetDir = targetDir)

## INFO: Parsing plain text collection containing 4 file pair(s)...
## INFO: Copying 4 media files to EMU database...
##   INFO: Rewriting 4 _annot.json files to file system...

alb_DB = load_emuDB(file.path(targetDir, "alb_emuDB"))

## INFO: Loading EMU database from ./emu_databases/alb_emuDB... (4 bundles found)

summary(alb_DB)

## ── Summary of emuDB ────────────────────────────────────────────────────────────

## Name:     alb 
## UUID:     db5795fe-8024-4297-ac6a-e25fc58522e3 
## Directory:    /Users/jmh/Desktop/ipsR/emu_databases/alb_emuDB 
## Session count: 1 
## Bundle count: 4 
## Annotation item count:  4 
## Label count:  8 
## Link count:  0

##

## ── Database configuration ──────────────────────────────────────────────────────

##

## ── SSFF track definitions ──

##

## data frame with 0 columns and 0 rows

## ── Level definitions ──

##  name   type nrOfAttrDefs attrDefNames          
##  bundle ITEM 2            bundle; transcription;

## ── Link definitions ──

## data frame with 0 columns and 0 rows

Look at the database, switch to hierarchy view, and verify that the words have been located at bundle -> transcription as for the German database above.

serve(alb_DB, useViewer = F)

Now run MAUS, just as before. The language here is sqi-AL for Albanian. (Note that this will take longer than for German, possibly a couple of minutes.)

runBASwebservice_all(alb_DB,
                     transcriptionAttributeDefinitionName = "transcription",
                     language = "sqi-AL",
                     runMINNI = FALSE)

## INFO: Preparing temporary database. This may take a while...
## INFO: Checking if cache needs update for 1 sessions and 4 bundles ...
## INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
## INFO: Nothing to update!
## INFO: Sending ping to webservices provider.
## INFO: Running G2P tokenizer on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running G2P on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running MAUS on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (canonical) on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (segmental) on emuDB containing 4 bundle(s)...
## INFO: Autobuilding syllable -> segment links from time information
##   INFO: Rewriting 4 _annot.json files to file system...

summary(alb_DB)

## ── Summary of emuDB ────────────────────────────────────────────────────────────

## Name:     alb 
## UUID:     db5795fe-8024-4297-ac6a-e25fc58522e3 
## Directory:    /Users/jmh/Desktop/ipsR/emu_databases/alb_emuDB 
## Session count: 1 
## Bundle count: 4 
## Annotation item count:  138 
## Label count:  176 
## Link count:  125

##

## ── Database configuration ──────────────────────────────────────────────────────

##

## ── SSFF track definitions ──

##

## data frame with 0 columns and 0 rows

## ── Level definitions ──

##  name   type    nrOfAttrDefs attrDefNames          
##  bundle ITEM    2            bundle; transcription;
##  ORT    ITEM    3            ORT; KAN; KAS;        
##  MAU    SEGMENT 1            MAU;                  
##  MAS    ITEM    1            MAS;

## ── Link definitions ──

##  type        superlevelName sublevelName
##  ONE_TO_MANY bundle         ORT         
##  ONE_TO_MANY ORT            MAS         
##  ONE_TO_MANY MAS            MAU

Look at the database and verify that the same kind of information has been automatically derived, as for the German database earlier.

serve(alb_DB, useViewer = F)

4.2 From a canonical representation

MAUS also allows an automatic segmentation to be derived directly from the canonical level. This can be useful when the canonical representation provided by MAUS deviates considerably from what was actually said. For one of the words in 0001BF_1syll_1, the canonical representation has J E when what was actually said was closer to n J E.

In hierarchy view, first switch the display of the ORT level from ORT to KAN, and then change the node J E of the ORT:KAN tier to n j E for file 0001BF_1syll_1, as shown in Figure 4.1. See section 9.2.2 of the Emu SDMS manual for how to edit hierarchical annotations manually.

Figure 4.1: A fragment of a hierarchy view.

MAUS can only make use of this more appropriate pronunciation once the change shown in Figure 4.1 has been made and saved, so don't forget to save the annotation after editing.

Now run MAUS directly on this canonical level. Store the MAU segmentations in a new tier MAU2 (to differentiate it from the already created MAU tier).

runBASwebservice_maus(alb_DB,
                      canoAttributeDefinitionName = "KAN",
                      mausAttributeDefinitionName = "MAU2",
                      language = "sqi-AL")

Inspect the database again. There should now be a new tier MAU2:

summary(alb_DB)

Verify that there is a visible tier with the added segment /n/:

serve(alb_DB, useViewer = F)

Forced alignment in EmuR

Jonathan Harrington