This section assumes that you have a project called ipsR and that it contains at least the directories testsample and emu_databases, which are used below.
If not, please see the preliminaries.
Start up R in the project you are using for this course.
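The startup messages that follow are the ones printed when the tidyverse and emuR packages are attached, so the session presumably begins along these lines (a sketch, assuming both packages are already installed):

```r
# Attach the packages used throughout this section.
# Assumes both have been installed beforehand, e.g. with install.packages().
library(tidyverse)
library(emuR)
```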
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'emuR'
##
## The following object is masked from 'package:base':
##
## norm
In R, store the path to the directory testsample as sourceDir in exactly the following way:
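A minimal sketch of this step, assuming that R's working directory is the ipsR project directory. The relative path is inferred from the ./testsample/german path used further down:

```r
# Path to the directory holding the sample text collections,
# relative to the ipsR project directory (assumed working directory)
sourceDir <- "./testsample"
```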
Also store the path to emu_databases in R as targetDir:
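Again as a sketch under the same working-directory assumption. The relative path matches the ./emu_databases/... location reported when the databases are loaded later in this section:

```r
# Directory in which the Emu databases will be created
targetDir <- "./emu_databases"
```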
The directory ./testsample/german on your computer contains .wav files and .txt files. Define the path to this database in R and check that you can see these files with the list.files() function:
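One way to do this is sketched below. The variable name path.ger is a hypothetical choice for this example, and sourceDir is assumed to hold the path to testsample as described above:

```r
# path.ger is a hypothetical variable name chosen for this sketch;
# sourceDir is the path to testsample stored earlier
path.ger <- file.path(sourceDir, "german")

# List the files of the text collection
list.files(path.ger)
```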
## [1] "K01BE001.txt" "K01BE001.wav" "K01BE002.txt" "K01BE002.wav"
The above is an example of a text collection because it contains matching .wav and .txt files in the same directory such that, for each .wav file, the .txt file contains the corresponding orthography.
This command converts the text collection into an Emu database with name ger2 and stores the resulting Emu database as ger2_emuDB in targetDir:
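The log that follows is consistent with a call to emuR's convert_txtCollection(). A sketch of such a call, assuming sourceDir and targetDir are defined as described above:

```r
# Convert the .wav/.txt pairs of the German text collection into an
# Emu database named ger2. emuR appends _emuDB to the database name,
# so the result is written to the directory ger2_emuDB inside targetDir.
convert_txtCollection(dbName = "ger2",
                      sourceDir = file.path(sourceDir, "german"),
                      targetDir = targetDir)
```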
## INFO: Parsing plain text collection containing 2 file pair(s)...
## INFO: Copying 2 media files to EMU database...
## INFO: Rewriting 2 _annot.json files to file system...
Load the database in R and store its handle as ger2_DB.
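A sketch of the loading step, assuming targetDir is defined as above. The handle name ger2_DB is the one used by the later commands in this section, and the summary() call is inferred from the database overview printed below:

```r
# Load the database and keep the handle in ger2_DB
ger2_DB <- load_emuDB(file.path(targetDir, "ger2_emuDB"))

# Print an overview of the database and its configuration
summary(ger2_DB)
```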
## INFO: Loading EMU database from ./emu_databases/ger2_emuDB... (2 bundles found)
##
## ── Summary of emuDB ────────────────────────────────────────────────────────────
## Name: ger2
## UUID: db066511-addb-416a-b3e5-1c21116400ca
## Directory: /Users/jmh/Desktop/ipsR/emu_databases/ger2_emuDB
## Session count: 1
## Bundle count: 2
## Annotation item count: 2
## Label count: 4
## Link count: 0
##
## ── Database configuration ──────────────────────────────────────────────────────
##
## ── SSFF track definitions ──
##
## data frame with 0 columns and 0 rows
## ── Level definitions ──
## name type nrOfAttrDefs attrDefNames
## bundle ITEM 2 bundle; transcription;
## ── Link definitions ──
## data frame with 0 columns and 0 rows
Look at the database and switch to hierarchy view. The words of each utterance are stored as a single item in the attribute tier transcription of bundle.
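To look at the database from R, emuR's serve() function can be used. This is a minimal sketch that assumes ger2_DB is the handle of the loaded database. The hierarchy view is then selected inside the EMU-webApp itself:

```r
# Serve the database to the EMU-webApp for visual inspection
serve(ger2_DB)
```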
That the words are stored in this way is evident in querying this database. Note that calcTimes=F is needed because the tier transcription is of type ITEM and unlinked to a time-based (SEGMENT or EVENT) tier.
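A query along the following lines would return the two labels shown below. It is a sketch rather than necessarily the original command: the variable name text.s is hypothetical, and the regular expression .* is simply one way to match every label:

```r
# Retrieve every label of the transcription attribute.
# calcTimes = FALSE switches off the time calculation, which is not
# possible for an ITEM tier without links to a time-bearing tier.
text.s <- query(ger2_DB, "transcription =~ .*", calcTimes = FALSE)

# Show only the labels
text.s$labels
```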
## [1] "heute ist schönes Frühlingswetter" "die Sonne lacht"
We are now going to run the Munich Automatic Segmentation (MAUS) over the database. The language selected is German (deu-DE). See the information on the available languages and search for LANGUAGE.
Available languages
At the time of writing, the available languages are:
LANGUAGE: [cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, aus-AU, afr-ZA, sqi-AL, eus-ES, eus-FR, cat-ES, nld-NL-GN, nld-NL, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, eng-CA, eng-GH, eng-IN, eng-IE, eng-KE, eng-NG, eng-PH, eng-ZA, eng-TZ, ekk-EE, kat-GE, fin-FI, fra-FR, deu-DE, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, spa-AR, spa-BO, spa-CL, spa-CO, spa-CR, spa-DO, spa-EC, spa-SV, spa-GT, spa-HN, spa-MX, spa-NI, spa-PA, spa-PY, spa-PE, spa-PR, spa-US, spa-UY, spa-VE, swe-SE, tha-TH, guf-AU]
# you must have an active web connection for this to work!
runBASwebservice_all(ger2_DB,
                     transcriptionAttributeDefinitionName = "transcription",
                     language = "deu-DE",
                     runMINNI = FALSE)
## INFO: Preparing temporary database. This may take a while...
## INFO: Checking if cache needs update for 1 sessions and 2 bundles ...
## INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
## INFO: Nothing to update!
## INFO: Sending ping to webservices provider.
## INFO: Running G2P tokenizer on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running G2P on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running MAUS on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (canonical) on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (segmental) on emuDB containing 2 bundle(s)...
## INFO: Autobuilding syllable -> segment links from time information
## INFO: Rewriting 2 _annot.json files to file system...
Now look at the information stored in the database and note the extra tiers and attributes that have been created (ORT with the attributes KAN and KAS, MAU, and MAS):
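The overview below has the same layout as the output of summary() on a database handle, so it can presumably be reproduced as follows (assuming ger2_DB is the handle used in the MAUS call above):

```r
# Print the updated overview, including the new level and link definitions
summary(ger2_DB)
```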
## ── Summary of emuDB ────────────────────────────────────────────────────────────
## Name: ger2
## UUID: db066511-addb-416a-b3e5-1c21116400ca
## Directory: /Users/jmh/Desktop/ipsR/emu_databases/ger2_emuDB
## Session count: 1
## Bundle count: 2
## Annotation item count: 58
## Label count: 74
## Link count: 52
##
## ── Database configuration ──────────────────────────────────────────────────────
##
## ── SSFF track definitions ──
##
## data frame with 0 columns and 0 rows
## ── Level definitions ──
## name type nrOfAttrDefs attrDefNames
## bundle ITEM 2 bundle; transcription;
## ORT ITEM 3 ORT; KAN; KAS;
## MAU SEGMENT 1 MAU;
## MAS ITEM 1 MAS;
## ── Link definitions ──
## type superlevelName sublevelName
## ONE_TO_MANY bundle ORT
## ONE_TO_MANY ORT MAS
## ONE_TO_MANY MAS MAU
Look at the hierarchy view. Identify the levels, links, and attributes.
More complex queries are now possible, e.g. find the word-initial MAU segments of all polysyllabic words:
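One possible formulation in the Emu query language is sketched below. It is not necessarily the query used to produce the output that follows. It treats ORT as the word tier and MAS as the syllable tier, in line with the link definitions above, and mau.s is a hypothetical variable name:

```r
# Start(ORT, MAU) == 1 : MAU segments that are first in their ORT (word) item
# Num(ORT, MAS) > 1    : ORT items dominating more than one MAS (syllable) item
# ^                    : keep the items on the left that stand in a dominance
#                        relation with the items on the right
mau.s <- query(ger2_DB, "[Start(ORT, MAU) == 1 ^ Num(ORT, MAS) > 1]")
mau.s
```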
## # A tibble: 4 × 16
## labels start end db_uuid session bundle start_item_id end_item_id level
## <chr> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <chr>
## 1 h 117. 167. db066511-ad… 0000 K01BE… 7 7 MAU
## 2 S 597. 717. db066511-ad… 0000 K01BE… 13 13 MAU
## 3 f 977. 1027. db066511-ad… 0000 K01BE… 18 18 MAU
## 4 z 967. 1027. db066511-ad… 0000 K01BE… 8 8 MAU
## # ℹ 7 more variables: attribute <chr>, start_item_seq_idx <int>,
## # end_item_seq_idx <int>, type <chr>, sample_start <int>, sample_end <int>,
## # sample_rate <int>
Check their word labels:
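The word labels can be obtained with emuR's requery_hier(), which moves from a segment list to the linked items on another tier. In this sketch the query for the word-initial segments is repeated so that the snippet runs on its own. Both the query string and the name mau.s are assumptions of this example:

```r
# Word-initial MAU segments of polysyllabic words (one possible formulation)
mau.s <- query(ger2_DB, "[Start(ORT, MAU) == 1 ^ Num(ORT, MAS) > 1]")

# Labels of the ORT items that dominate these segments
requery_hier(ger2_DB, mau.s, level = "ORT")$labels
```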
## [1] "heute" "schönes" "Frühlingswetter" "Sonne"
The task is to try out forced alignment for a different language and also to show how forced alignment can be done from a canonical phonemic transcription instead of from text. The database is here.
Begin by converting the text collection into an Emu database.
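The steps mirror those for the German database. In the sketch below, path.alb is a hypothetical location and must be changed to wherever the downloaded Albanian text collection is stored. The database name alb and the handle alb_DB are taken from the output and commands that follow:

```r
# path.alb is hypothetical: adjust it to the directory that contains
# the Albanian .wav/.txt file pairs
path.alb <- file.path(sourceDir, "albanian")

# Convert the text collection into an Emu database named alb
convert_txtCollection(dbName = "alb",
                      sourceDir = path.alb,
                      targetDir = targetDir)

# Load the database and inspect it
alb_DB <- load_emuDB(file.path(targetDir, "alb_emuDB"))
summary(alb_DB)
```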
## INFO: Parsing plain text collection containing 4 file pair(s)...
## INFO: Copying 4 media files to EMU database...
## INFO: Rewriting 4 _annot.json files to file system...
## INFO: Loading EMU database from ./emu_databases/alb_emuDB... (4 bundles found)
## ── Summary of emuDB ────────────────────────────────────────────────────────────
## Name: alb
## UUID: db5795fe-8024-4297-ac6a-e25fc58522e3
## Directory: /Users/jmh/Desktop/ipsR/emu_databases/alb_emuDB
## Session count: 1
## Bundle count: 4
## Annotation item count: 4
## Label count: 8
## Link count: 0
##
## ── Database configuration ──────────────────────────────────────────────────────
##
## ── SSFF track definitions ──
##
## data frame with 0 columns and 0 rows
## ── Level definitions ──
## name type nrOfAttrDefs attrDefNames
## bundle ITEM 2 bundle; transcription;
## ── Link definitions ──
## data frame with 0 columns and 0 rows
Look at the database, switch to hierarchy view, and verify that the words are stored at bundle -> transcription, as for the German database above.
Now run MAUS, just as before. The language here is sqi-AL for Albanian. (Note that this will take a bit longer than for German, possibly a couple of minutes or more.)
runBASwebservice_all(alb_DB,
                     transcriptionAttributeDefinitionName = "transcription",
                     language = "sqi-AL",
                     runMINNI = FALSE)
## INFO: Preparing temporary database. This may take a while...
## INFO: Checking if cache needs update for 1 sessions and 4 bundles ...
## INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
## INFO: Nothing to update!
## INFO: Sending ping to webservices provider.
## INFO: Running G2P tokenizer on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running G2P on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running MAUS on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (canonical) on emuDB containing 4 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (segmental) on emuDB containing 4 bundle(s)...
## INFO: Autobuilding syllable -> segment links from time information
## INFO: Rewriting 4 _annot.json files to file system...
## ── Summary of emuDB ────────────────────────────────────────────────────────────
## Name: alb
## UUID: db5795fe-8024-4297-ac6a-e25fc58522e3
## Directory: /Users/jmh/Desktop/ipsR/emu_databases/alb_emuDB
## Session count: 1
## Bundle count: 4
## Annotation item count: 138
## Label count: 176
## Link count: 125
##
## ── Database configuration ──────────────────────────────────────────────────────
##
## ── SSFF track definitions ──
##
## data frame with 0 columns and 0 rows
## ── Level definitions ──
## name type nrOfAttrDefs attrDefNames
## bundle ITEM 2 bundle; transcription;
## ORT ITEM 3 ORT; KAN; KAS;
## MAU SEGMENT 1 MAU;
## MAS ITEM 1 MAS;
## ── Link definitions ──
## type superlevelName sublevelName
## ONE_TO_MANY bundle ORT
## ONE_TO_MANY ORT MAS
## ONE_TO_MANY MAS MAU
Look at the database and verify that the same kind of information has been automatically derived, as for the German database earlier.
MAUS also allows an automatic segmentation to be derived directly from the canonical level. This can be useful when the canonical representation provided by MAUS deviates considerably from what was actually said. For one of the words in 0001BF_1syll_1, the canonical representation has J E when what was actually said was closer to n J E.
First switch in hierarchy view from ORT to KAN, and then change the node J E of the ORT:KAN tier to n j E for file 0001BF_1syll_1, as shown in Figure 4.1. See section 9.2.2 of the Emu SDMS manual for how to handle hierarchical annotations manually.
Figure 4.1: A fragment of a hierarchy view.
In order to run MAUS on this more appropriate pronunciation, first make the change shown in Figure 4.1 above. (Don’t forget to save the annotation after editing.)
Now run MAUS directly on this canonical level. Store the MAU segmentations in a new tier MAU2 (to differentiate it from the already created MAU tier).
runBASwebservice_maus(alb_DB,
                      canoAttributeDefinitionName = "KAN",
                      mausAttributeDefinitionName = "MAU2",
                      language = "sqi-AL")
Inspect the database again. There should now be a new tier MAU2. Verify that this tier is visible and contains the added segment /n/.
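Besides looking at the tier in the EMU-webApp, the result can also be checked with a query. The sketch below assumes that MAUS labels the new segment with the symbol n and that the bundle name matches the file name 0001BF_1syll_1:

```r
# Confirm that the MAU2 tier now exists in the database configuration
list_levelDefinitions(alb_DB)

# Look for segments labelled n on MAU2 in the edited bundle
# (the label n is an assumption about how MAUS writes this segment)
query(alb_DB, "MAU2 == n", bundlePattern = "0001BF_1syll_1")
```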