HTK -- TUTORIAL                              F. Schiel 25.06.97 / 18.07.97

Abbreviations in this document:
$DATA  = /n/weissbier/xb/schiel/NUMBERS
$SRCE  = /n/weissbier/xa/schiel/NUMBERS
$DRSP  = /u/drspeech/data/numbers95
$SULIN = ~sulin/speech/project/numbers_cs

If not explicitly given, a file is found or a command is issued in $SRCE.

=========================================================================
=========================================================================
Contents

PART 1 - INTRODUCTION TO HTK
0. General
I. Preparation of Data
   1. Speech Corpora
   2. Scripts for Further Processing
   3. Dictionary
   4. Master Label Files (MLF)
   5. List of HMMs
   6. Language Model and Lattice
II. Topology of HMMs
   1. Create prototypes
   2. Analyze train+cv set for phone durations
III. Bootstrap, Viterbi training and Baum-Welch training
   1. HInit
   2. HRest
IV. Testbed
   1. Viterbi
   2. Evaluation of results
V. Silence Modeling
VI. Embedded Training
   1. Re-Align Training Set
   2. Embedded Re-estimation
VII. Tuning
   1. Beam Width Pruning (-t)
   2. LM scaling (-s)
   3. Word End Penalty (-p)

PART 2 - HTK TOOL BOX
0. General
I. HLEd - Script Editor for Label Files
II. HDMan - Script Editor for Dictionaries
III. HLStats, HBuild - Create Language Models and Lattices
IV. HHEd - Manipulation of HMMs
   1. HHEd Commands
   2. HHEd Itemlists
   3. HHEd Macros
   4. Some Examples on the Numbers Task
V. Miscs
   0. Documentation
   1. Our Top Performance on Numbers
   2. Weighted pronunciations and Re-scoring of Lattices
   3. The Mixture Problem
   4. Efficiency
   5. Technical Stuff

=========================================================================
=========================================================================
PART 1 - INTRODUCTION TO HTK

The following coarse introduction to HTK gives examples of how to set up
a basic recognizer for the Numbers_95 task. It covers data preparation,
model design, bootstrapping, silence modeling, embedded training and
tuning. It does not give all the details of all the described HTK tools.
For a more in-depth discussion of the HTK tools please refer to Part 2 of
this tutorial. Also, this part does not discuss the usage of the HTK
libraries within user-made frameworks.

Intended audience: people who would like to know what HTK is and how it
feels to use it.

=========================================================================
0. General

All HTK commands begin with the letter 'H' and reside in the ESPS
installation of ICSI. To use them you must have a proper '.cshrc' that
defines all needed paths and parameters. An example of a '.cshrc' that
should work on all ICSI platforms can be found in .cshrc.icsi

Since we have two floating licenses at the moment, at most two users may
issue as many HTK commands on as many machines as they wish. The first
time you issue a command, a license is checked out for you. If both
licenses are in use, you get a warning and the command is not executed
(this can be a hassle in long batches; use 'HCheckout' to verify that a
license is available). To free a used license issue the command 'HFree'
after you're done.

In the following examples the issued command lines all start with a '%'.
At the end of each action I give the sources and the output produced by
that action.

I. Preparation of Data

Before any training or recognition can be done with HTK, we have to set
up the required data in a format that suits HTK. This section gives
examples of how to do that for the Numbers task.

1. Speech Corpora

HTK requires either wave files or preprocessed feature files (htk), one
file per utterance (there is nothing like a pfile in HTK). To preprocess
wave files use the command HCopy.

a) train+cv set:

% HCopy -C config -S HCopytrain.slist

where:

HCopytrain.slist is a 'script' file giving the source and destination of
each file:

/u/drspeech/data/numbers95/wavfile/cs/31/NU-3192.other1.wav /xa/schiel/NUMBERS/TRAINING/NU-3192.other1.htk
...
(Source: $SULIN/list/*list.txt)

config is the general HTK configuration file, which in this case contains
the following parameters:

SOURCEKIND = WAVEFORM
SOURCEFORMAT = NIST
TARGETKIND = MFCC_Z_E_D_A
LOPASS = 300
HIPASS = 3400
NUMCHANS = 26
NUMCEPS = 12
ENORMALIZE = T
CEPLIFTER = 22
TARGETRATE = 100000
WINDOWSIZE = 250000
ZMEANSOURCE = T
USEHAMMING = T
PREEMCOEF = 0.97
SAVECOMPRESSED = T

Some explanations:

TARGETKIND describes the type of features used. In this case it's:
  Mel Frequency Cepstral Coefficients, with cepstral mean subtraction
  (Z), log Energy (not C0!), Delta and Delta delta (Acceleration).
LOPASS, HIPASS set the boundaries for the filterbank
NUMCHANS is the number of filterbank channels
ENORMALIZE is true (T) : log energy is normalized to 1.0 for each
  utterance
CEPLIFTER : re-scale cepstral coefficients to emphasize higher orders, so
  that all dimensions have about the same magnitude
TARGETRATE, WINDOWSIZE : all times are given in 100 nsec units
ZMEANSOURCE is true (T) : bias is subtracted from the waveform if any
SAVECOMPRESSED : save htk files in a binary form to speed up reading

The contents of *.htk files can be viewed with the command HList (lots of
options).

Source: /u/drspeech/data/numbers95/wavfile/cs/... HCopytrain.slist
Target: $DATA/TRAINING

b) dev and test set:

Same procedure with the appropriate script files.

Source: /u/drspeech/data/numbers95/wavfile/cs/... HCopydev.slist
Target: $DATA/DEV

Source: /u/drspeech/data/numbers95/wavfile/cs/... HCopytest.slist
Target: $DATA/TEST

2. Scripts for Further Processing

Most of the HTK tools get their input files from the command line.
However, if there are thousands of them, it makes sense to put them into
so-called 'script' files.

% cat HCopy<set>.slist | gawk '{ print $2}' > <set>.slist

Source: HCopy<set>.slist
Target: <set>.slist

3. Dictionary

HTK dictionaries have a simple syntax as follows:

<word> [<prob>] <phone1> [<phone2> ...]

E.g.
eight  ah ey tcl t
eight  eh ey tcl t
...

The dictionary for our task has 32 words and 180 pronunciations.
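An entry in this format is easy to process outside HTK as well. The
following sketch (plain Python, illustrative only; `parse_dict` is not an
HTK function) builds a word-to-pronunciations map from such lines,
assuming the simple form without probabilities used here:

```python
# Minimal sketch: parse simple HTK-style dictionary lines
# ("word phone1 phone2 ...") into word -> list of pronunciations.
def parse_dict(lines):
    pron = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        word, phones = fields[0], fields[1:]
        pron.setdefault(word, []).append(phones)
    return pron

entries = [
    "eight ah ey tcl t",
    "eight eh ey tcl t",
]
d = parse_dict(entries)
print(d["eight"][0])   # the first pronunciation -- the one HLEd's EX command picks
```

The "first pronunciation wins" behavior matters later, when HLEd's EX
command expands word labels into phone labels.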
We manually add two silence words '<s>' and '</s>' denoting beginning and
ending silence to conform with the LM. Since all variants in the lexicon
are of equal probability, we can omit this column.

Source: $SULIN/lex/nowaylex/simpcounts-gildea90per-boot_noway.dict
Target: DICT

4. Master Label Files (MLF)

HTK can use separate label files for each speech file, but more efficient
is the usage of so-called Master Label Files (MLF) that store the label
files, independently of the location of the wave files, in one common
structure. The HTK tool HLEd is a general purpose label editor. It can be
used to arrange individual label files into an MLF. For our task we need
two different types of label files:

a) Word reference MLFs for all sets (*.ref)

For each set (train+cv, dev, test) do:
- copy the word text files into a dir
- break them into one word per line
  % cat <wordfile> | tr -s ' ' '\012' > <wordfile>.ref
- transform them into an MLF
  % HLEd -l '*' -i <set>-ref.mlf dummy.led *.ref
  (dummy.led is an empty script)

Source: $DRSP/wrdfile/cs-clean
Target: <set>-ref.mlf

b) Segmental MLF for train+cv set (*.lab)

There are several ways to get these:

A : there are manually segmented data
    Pool these together in a dir and then use HLEd to transform them
    into an MLF:
    % cd dir
    % HLEd -l '*' -i train+cv.mlf dummy.led *.lab
    (dummy.led is an empty script)
    Remark: HTK can read ESPS label files!

    Source: $DRSP/phnfile/cs-esps
    Target: train+cv.mlf

B : there is no segmental data
    Take the word reference MLF and expand it into a segmental MLF by
    using a pronunciation dictionary:
    % HLEd -l '*' -I train+cv-ref.mlf -i train+cv.mlf -d DICT ex.led
    (ex.led has only one command: EX)
    Remark: the first pronunciation in the dictionary is always used!

    Source: DICT, train+cv-ref.mlf
    Target: train+cv.mlf

In our example we follow method A.

5. List of HMMs

Most of the HTK tools require a list of the models used as input. The
loaded HMM definition files may include many more model definitions, but
only those listed will be used.
This can be very convenient if you pool together different types of
models into one set.

For our example task we want to train with the manually labelled data.
Consequently the set of HMMs should cover both the dictionary and the
set of segments found in the train+cv set.

First we analyze the dictionary for distinct HMMs:

% HDMan -l lex.stat -o DICT

We find (lex.stat) only 32 used phonemes.

Then we analyze the train+cv set (MLF) for distinct HMMs:

% HLEd -n train+cv.stat -i dummy.mlf dummy.led train+cv.mlf
% rm dummy.mlf

and find (train+cv.stat) that the following phones needed for DICT
cannot be trained for lack of data: /ae/, /uh/, /hv/, /m/
(These phones were selected by the transcribers to denote special cases
of pronunciation and are therefore quite rare.)

To get around this we 'tie' them to similar phonemes. This is done by
simply listing the physical model after the (now called) logical model.
Our resulting model list numbersphone.txt then contains 32 logical and
28 physical models and looks like:

ae ah
ah
ao
ax
ay
d
dcl
eh
er
ey
f
h#
hh
hv hh
ih
iy
k
kcl
l
m n
n
ow
r
s
t
tcl
th
uh ah
uw
v
w
z

Source: DICT, train+cv.mlf
Target: numbersphone.txt

6. Language Model and Lattice

HTK cannot use a language model directly. It requires a lattice file
(*.lat) instead. Lattices can be calculated from different sources:
bigrams, finite-state grammars, forced Viterbi ... The tool HBuild is
used to compile lattices from other formats. It can input: bigram
matrix, DARPA bigram, finite-state, etc. We use the DARPA bigram/backoff
of Su-Lin.

Source: $SULIN/lm/numbers_cs_train.abigram.gz
Target: LM

To calculate the lattice we issue:

% HBuild -b -n LM -s '<s>' '</s>' WRDLIST LAT

where:

WRDLIST is simply the first column of the DICT.
It is used by HBuild to reduce the output to the list of words in
WRDLIST.
Option '-s' defines the initial and final word symbols used for each
utterance (default is !ENTRY and !EXIT).
Option '-b' forces the output to be in binary format (default is ASCII).

Source: LM
Target: LAT

Now we have all the data we need to get started. Next we'll have to
think about the structure of the models themselves.

=========================================================================
II. Topology of HMMs

Disclaimer: for a theoretical description of the HMM technique please
refer elsewhere. For the following I expect you to have a basic
knowledge of how an HMM is used and what it looks like.

HMMs in HTK are defined in ASCII or binary files. You can have a
separate file for each different model (named by the model label) or you
can have all definitions packed into a 'Master Macro File' (MMF).

1. Create prototypes

The first thing to do is to define a set of basic prototypes that are
later cloned and used for your phone models. Edit a prototype file
protos/proto-6mix with 4 different prototypes:

  2 - 5 states (named 'two' - 'five')
  6 mixtures per state
  diagonalized covariance matrices

The numbers are arbitrary (except for mixture weights and transition
probabilities, which must sum to 1.0); only the structure is
significant. Only transition probabilities that are non-zero will be
considered.
Remark: the first and last states are virtual!

2. Analyze train+cv set for phone durations

To decide which models get which number of states we look at the average
durations of the phones in the training set and then set up a mapping of
each phoneme to one of the four prototypes:

  2-state-phones:  2.7 -  8.0 frames
  3-state-phones:  8.0 - 12.0
  4-state-phones: 12.0 - 16.0
  5-state-phones: 16.0 - 24.0

(The script struct.awk does the job for us.)

Source: train+cv-durations.txt
Target: state-phones.txt

=========================================================================
III.
Bootstrap, Viterbi training and Baum-Welch training

We are now ready to bootstrap the phone models. The recipe for that is
to clone the appropriate prototype (mapped in state-phones.txt),
bootstrap it, run Viterbi training on the bootstrap data and run a final
Baum-Welch on the bootstrap data.

1. HInit

The tool HInit does roughly the following:
- reads the prototype structure for phoneme X
- collects all segments of phoneme X from the bootstrap corpus (here
  train+cv)
- divides the frame stream of each segment linearly among the states
- clusters the data of each state into mixtures and calculates means
  and variances
- runs Viterbi training on all segments until the overall likelihood
  converges

Example for the model /ah/, which is bootstrapped here into a prototype
with three states:

% HInit -i 20 -l ah -H protos/proto-6mix -o ah -v 0.0001 -I total.mlf \
  -S train+cv.slist -M hmm1 -A -T 1 three
% mv hmm1/proto-6mix hmm1/ah

where:
Option '-i' is the maximal number of Viterbi training iterations
Option '-l' is the label to look for in the train+cv set for training
Option '-H' loads the prototype file (all prototypes are stored in one
  file)
Option '-o' is the label (macro name) of the resulting model
Option '-v' is the floored variance
Option '-I' is the MLF with the segmental information of the training
  set
Option '-S' is the 'script' file (list of files to process)
Option '-M' is the directory where the resulting model is stored
Options '-A -T' are tracing options

Since HInit has to be called for each phoneme, it makes sense to put
that into a script (INIT).

WARNING: The output HMM definition file always has the name of the
prototype file. Therefore it's necessary to rename the output before the
next HInit is called (see script INIT) or else HInit will overwrite the
last model definition file.

After INIT is run, we have a separate HMM definition file for each
phoneme in dir hmm1.
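The per-phone loop in INIT can be sketched as follows (plain Python
standing in for the actual csh script; the state-phones mapping shown is
an illustrative fragment, and the sketch only prints the command lines
instead of executing them):

```python
# Sketch of what the INIT script does: for each phone, build an HInit call
# using the prototype mapped in state-phones.txt, then rename the output,
# because HInit always writes a file named after the prototype file.
state_phones = {"ah": "three", "ay": "five", "s": "four"}  # illustrative fragment

def hinit_cmds(phone, proto, outdir="hmm1"):
    run = (f"HInit -i 20 -l {phone} -H protos/proto-6mix -o {phone} "
           f"-v 0.0001 -I total.mlf -S train+cv.slist -M {outdir} -A -T 1 {proto}")
    rename = f"mv {outdir}/proto-6mix {outdir}/{phone}"
    return run, rename

for phone, proto in sorted(state_phones.items()):
    for cmd in hinit_cmds(phone, proto):
        print(cmd)
```

Note that the rename step is what prevents the next HInit call from
overwriting the previous model.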
These can be combined into a so-called Master Macro File (MMF) by using
the HMM editor HHEd:

% HHEd -w hmm1/NUMBERS.mmf -d hmm1 co.hed numbersphone.txt

where co.hed is a command file that does essentially nothing.

Source: protos/proto-6mix, train+cv.mlf, numbersphone.txt
Target: hmm1/NUMBERS.mmf

2. HRest

The tool HRest takes the same data (segments) as HInit and runs
iterative Baum-Welch training instead of Viterbi training. It does not
do any bootstrapping; that is, HInit must be run first! Applied to the
Numbers task it did not improve the quality of the models. However, in
the Verbmobil task we found a significant improvement.

=========================================================================
IV. Testbed

We now have the first usable models to run a test.

1. Viterbi

The script TEST shows the call of a version of the recognizer HVite that
performs recognition on a test set of data and reports the word accuracy
against a corresponding reference label set. The test set can be any of
the three sets, but we'll use only the dev set here.

Example of a call of HVite:

% HVite -t 105.0 -p -2.0 -s 6.5 -i output.mlf -w LAT -H hmm1/NUMBERS.mmf \
  -T 1 -o ST -A -S dev.slist DICT numbersphone.txt

As you can see, there are a bunch of parameters that control the Viterbi
search. The most important are:

- pruning width (beam width in log prob) (-t)
- LM weighting and offset (-s)
- word end penalty (-p)

These parameters have to be tuned on the dev set (details in Section
VII). Other parameters are:

Option '-i' is the recognition output in the form of an MLF
Option '-w' is the lattice used (language model)
Option '-H' is the MMF with the model definitions (HMMs)
Option '-o' defines the format of the output MLF; here timing and
  scoring information is suppressed
Option '-S' is the script with the list of processed files (dev set)

Source: hmm1/NUMBERS.mmf, numbersphone.txt, LAT, DICT
Target: output.mlf

2.
Evaluation of results

Also in TEST you see the command HResults, which reads the output MLF of
the HVite command and matches it against the reference MLF (here
dev-ref.mlf). The output is the standard error calculation of
substitutions, deletions and insertions; word accuracy is computed as
(H - I)/N, e.g. (4117 - 256)/4673 = 82.62%.

A test of the bootstrapped and Viterbi-trained HMMs yields

82.62 [H=4117, D=233, S=323, I=256, N=4673] word accuracy

A test of the models re-estimated with HRest (see above) yields:

81.53 [H=4042, D=316, S=315, I=232, N=4673]

=========================================================================
V. Silence Modeling

Up to now only the leading and trailing silence in the utterances is
modeled during recognition, by the words '<s>' and '</s>' that have the
simple pronunciation /h#/ (silence model). Since the LM does not take
inter-word silence into account, I added a new 'silence phone' /sp/ to
the end of each word in the dictionary. I built this model by:

- cloning the model /h#/ to /sp/
- deleting the first and third state in /sp/
- tying the remaining state to the second state of the model /h#/
- adding a transition from the virtual start to the virtual end of /sp/,
  enabling it to be skipped totally (a so-called 'tee model')

Furthermore I added a transition from the third to the first state in
the /h#/ model to make it more robust.

Source: hmm2
Target: hmm3
Test: 89.21 [H=4345, D=62, S=266, I=176, N=4673]

=========================================================================
VI. Embedded Training

Up to now the HMMs were only trained within the fixed boundaries of the
manual segmentation. Now we want to run an embedded training, that is,
the boundaries are iteratively re-aligned by the Viterbi.

1. Re-Align Training Set

A problem arises: we cannot perform an embedded training on the manual
labels of train+cv, because, being hand labels, they contain more
phonemes than our phone list numbersphone.txt. These phonemes are very
sparse, so it makes no sense to train them anyway.
There are two ways out of this:
- map the sparse labels to known and trainable phonemes
- don't use the hand labels for embedded training but use a re-alignment
  of the train+cv set to the (multi-pronunciation) dictionary.

Since we use this dictionary for testing, the second alternative makes
more sense. To produce a new MLF with the best-aligned dictionary words
we can use the decoder tool HVite with the following command:

% HVite -l '*' -o STW -b '<s>' -a -H hmm3/NUMBERS.mmf \
  -i train+cv-ali.mlf -m -t 250.0 -I train+cv-ref.mlf \
  -y lab -S train+cv.slist DICT numbersphone.txt

Explanations:

-l '*'   : write the output MLF with relative paths '*/' instead of full
           paths. This makes the MLF independent of the location of the
           htk files
-o STW   : controls the format of the output to the MLF files;
           S : scores suppressed
           T : times suppressed
           W : word labels suppressed
-b '<s>' : since we have no lattice here, use the word '<s>' as the
           initial and ending silence model
-H ...   : HMM definition file
-i ...   : MLF output file containing the new alignment
-m       : keep track of model boundaries
-t 250.0 : beam search pruning factor (rather irrelevant here)
-I ...   : word reference MLF input
-y lab   : extension of the alignment label files in the output MLF
-S ...   : script file with the list of processed files

The MLF train+cv-ali.mlf now contains sequences of labels that are all
parts of the multi-pronunciation dictionary DICT.

Source: hmm3, train+cv-ref.mlf, DICT, numbersphone.txt
Target: train+cv-ali.mlf

2. Embedded Re-estimation

Now we can use the new alignments in train+cv-ali.mlf to run an embedded
re-estimation. This is done with the command HERest in the script
HEREST. HERest runs only one iteration over the data at a time. For
iterative training you have to call the command again. This enables you
to run a test on the dev set after each iteration to find the optimal
word accuracy.
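Since HERest makes one pass per call, the surrounding script simply
re-invokes it, writing each pass into a fresh directory. A sketch of that
loop (plain Python in place of the csh script HEREST; it only prints the
command lines rather than executing them):

```python
# Sketch of the iterative training loop: one HERest call per pass,
# each reading the previous pass's MMF and writing to the next directory.
def herest_cmd(src, dst):
    return (f"HERest -t 250.0 150.0 1000.0 -v 0.0001 -H {src}/NUMBERS.mmf "
            f"-I train+cv-ali.mlf -S train+cv.slist -M {dst} numbersphone.txt")

dirs = ["hmm3", "hmm4", "hmm5", "hmm6"]
for src, dst in zip(dirs, dirs[1:]):
    print(herest_cmd(src, dst))   # run this pass, then test on the dev set
```

After each printed pass one would run the TEST script on the dev set and
keep the iteration with the best word accuracy.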
Example of one iteration:

% HERest -t 250.0 150.0 1000.0 -v 0.0001 \
  -H hmm3/NUMBERS.mmf -I train+cv-ali.mlf -T 00001 \
  -S train+cv.slist -M hmm4 -A numbersphone.txt

Explanations:

-t X Y Z : beam search pruning for alignment. First the pruning factor
           is set to X. If the alignment fails, X is incremented by Y
           and the alignment is retried. This is repeated until Z is
           exceeded; then an error message is issued.
-M       : dir to store the re-estimated MMF
-T       : tracing. 1 = basic reporting (many other options)
-A       : repeat the command line for logging purposes

Source: hmm3, train+cv-ali.mlf, DICT, numbersphone.txt
Target: hmm4 (1st it.), hmm5 (2nd it.), hmm6 (3rd it.)
Test:   hmm4 : 89.75 [H=4368, D=54, S=251, I=174, N=4673]
        hmm5 : 89.81 [H=4377, D=47, S=249, I=180, N=4673]
        hmm6 : 90.20

=========================================================================
VII. Tuning

The Viterbi search needs to be tuned for the upcoming tests in Part 2.
We want a compromise between good performance and speed.

1. Beam Width Pruning (-t)

Up to now the pruning width was set arbitrarily. The script
tune-prun.csh finds the optimal pruning parameter with respect to a
certain decrease in performance compared to recognition without pruning.
In this case I want the performance to drop by no more than 0.6%
absolute (half of the significance at the 0.01 level). The optimal value
for the pruning parameter is then 57.0

2. LM scaling (-s)

The script tune-gramscale.csh searches for the maximum of the
performance with respect to the weighting of the language model. Since
this is not a monotonic function, the script uses a simple gradient
search and stops at the first local maximum. It turns out that the
optimum is right where the factor was set arbitrarily, at 6.5

3. Word End Penalty (-p)

The word end penalty adds a fixed value to the accumulated log
likelihood each time a new word is entered during the Viterbi search. In
this way the ratio of insertions to deletions can be steered somewhat.
Since the behavior of this parameter is quite unclear in terms of word
accuracy, we just run a series of experiments varying this factor. The
best performance is achieved with a value of -9.0

Using these parameters a test run on the dev set takes about 22 minutes
(approx. 2h without pruning).

=========================================================================
=========================================================================
PART 2 - HTK TOOL BOX

The following section gives a more detailed discussion of most of the
HTK tools and gives examples of how to use them within the Numbers task.
This includes the manipulation of label files, dictionaries, language
models and lattices, the manipulation of HMM definitions and some
miscellaneous stuff that I think is worth knowing. Buggy behavior is
reported with each command description. Also, in this part I report on
the performance of triphones versus monophones and the optimal
configuration for the Numbers task that I could find within 4 weeks.

Intended audience: people who would like to work with HTK

=========================================================================
0. General

Aside from the training and decoding algorithms the HTK package gives
you some very powerful general purpose tools that are always needed if
you work in speech recognition. Of course all of these are more or less
adapted to the HTK formats and needs, but on the other hand the
Cambridge people did a nice job of incorporating some of the main
standards other than HTK into their tools.

Most of the tools work as script editors. That is, they read some input
files, apply a script of commands stored in a separate script file and
write the changed output to other files. All HTK tools have simple
command line calls and none of them requires X, thus making scripting
very easy. When you work with HTK, you will use these tools very
frequently and for sometimes unexpected purposes.
Therefore the following sections will give you some examples of how to
use these tools and (hopefully) a general idea of what else they are
capable of.

===========================================================================
I. HLEd - Script Editor for Label Files

The tool HLEd is a quite powerful editor for all kinds of label files
used in HTK. Label files in HTK can have a complex structure with
several layers (word, syllable, phone), but in the vast majority of
cases they are simple lists of symbols with optional timing information.

In most cases the usage of HLEd is as follows:

% HLEd -i output.mlf -T 1 script.led input.mlf

where

input.mlf and output.mlf are self-explanatory
script.led stores the edit commands

MLFs were mentioned earlier in this tutorial. They are simply a more
convenient way to store label information in a compact and location
independent form. An MLF with word labels might look like this:

#!MLF!#
"*/NU-1008.streetaddr.lab"
seventy
two
.
"*/NU-1008.zipcode.lab"
zero
two
eight
...

An MLF with phonetic segments looks like this:

#!MLF!#
"*/NU-19.streetaddr.lab"
0 1710000 h#
1710000 2160000 w
2160000 2950000 ah
2950000 3700000 n
3700000 4830000 s
4830000 5270000 ih
5270000 5680000 kcl
5680000 5870000 k
5870000 6370000 s
6370000 6910000 tcl
6910000 7360000 t
7360000 8780000 iy
8780000 9500000 n
9500000 10540000 h#
.
"*/NU-19.zipcode.lab"
0 1780000 h#
1780000 2920000 w
2920000 3560000 ah
...

Note that these MLFs are both location independent. That is, wherever a
speech file is stored, the HTK tool will look only for the name of the
corresponding label file, because the path is '*'.
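The segmental format above is simple enough to read back outside HTK.
The following sketch (plain Python, illustrative only) parses an MLF
into per-utterance segment lists, converting the 100 nsec time units to
seconds:

```python
# Sketch: parse a segmental MLF into (start_sec, end_sec, label) tuples
# per utterance. HTK times are in 100 ns units, so divide by 1e7.
def parse_mlf(text):
    utts, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):
            current = line.strip('"')
            utts[current] = []
        elif line == ".":
            current = None
        else:
            fields = line.split()
            if len(fields) == 3:
                start, end, lab = fields
                utts[current].append((int(start) / 1e7, int(end) / 1e7, lab))
            else:
                utts[current].append(fields[0])  # label-only entry (word MLF)
    return utts

mlf = '''#!MLF!#
"*/NU-19.streetaddr.lab"
0 1710000 h#
1710000 2160000 w
.
'''
segs = parse_mlf(mlf)
print(segs["*/NU-19.streetaddr.lab"][0])   # (0.0, 0.171, 'h#')
```

The same reader handles word MLFs, since their entries are single labels
without times.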
The most frequent uses of HLEd are:

- combining individual label files into an MLF
- changing/deleting/inserting labels
- splitting/combining labels
- converting phones into biphones/triphones
- expanding word labels into phone labels
- combining phone labels into syllable labels and vice versa

Example: The following command changes a monophone MLF into a triphone
MLF:

% HLEd -n numberstriphone.txt -l '*' -i train+cv-tri.mlf \
  triphones.led train+cv-ali.mlf

where

Option '-n' gives the name of a file with all newly created model names
Option '-l' causes the new MLF to be location independent (path '*/')

and the script triphones.led contains:

WB sp
WB h#
TC

Explanation: In HTK left and right context in label names are denoted by
'-' and '+'. A model named /tcl-ah+f/ is an /ah/ model in the left
context /tcl/ and the right context /f/. The 'TC' command expands each
model in the MLF train+cv-ali.mlf into the corresponding
context-dependent triphone model name. The preceding 'WB' commands
define two models to be excepted from context expansion, namely the
'inter-word' models /h#/ and /sp/. This results in so-called within-word
context models. Applied to the Numbers task this results in 247 distinct
triphones.

TIP: You will get a somewhat misleading error if you use a speech file
together with an MLF that has a different path stored than the speech
file. In that case, take a text editor and simply replace the absolute
path in the MLF by '*'. The consistent usage of the option -l '*' with
all HTK tools that produce MLFs can avoid such behavior.

=========================================================================
II. HDMan - Script Editor for Dictionaries

In a similar way to HLEd, the contents of dictionaries may be
manipulated by the script editor HDMan. Furthermore, it can be used to
combine several different dictionaries in different formats into one HTK
compatible target dictionary.
For each input an individual script file can be defined, and a global
script file works on the combined output. We do not need HDMan for the
Numbers task, therefore I don't give an example here. HDMan can also be
used to determine the list of phone symbols used in a dictionary (see
Part 1, Section I.5 for an example).

WARNING: HDMan requires the input dictionaries to be sorted. This caused
some trouble when I used it for the Verbmobil task, because HDMan
follows a different sorting table than ASCII when it comes to special
characters like '"'. The workaround is to issue the HDMan command 'IR'
(for input raw data) in the individual editing script. It does not work
in the global editing script, and the command is not called 'IM RAW' as
stated in the HTK Book!

=========================================================================
III. HLStats, HBuild - Create Language Models and Lattices

Fortunately we already had an ARPA-format bigram model when we got
started with our example task. But we could have calculated our own
language model from the training set using the tool HLStats.

HLStats is a general statistics tool that works on label files or MLFs.
It can be used to calculate:

- number of occurrences of labels
- durations (average, min, max)
- bigram matrix
- backoff bigram (ARPA)
- list of labels covering a set of labeled data

Here is an example of how we could have calculated the ARPA backoff
bigram model from the Numbers train+cv set:

% HLStats -b LM -o WRDLIST train+cv-ref.mlf

WARNINGS:
The option -I does not work! Use the input MLF file in place of the
regular label files (last argument on the command line) as in the above
example and it works.
HLStats does not assign the label '!NULL' to out-of-vocabulary words as
stated in the reference section. This is quite annoying because you
cannot see the out-of-vocabulary rate and you cannot compute bigrams
that include knowledge about out-of-vocabulary occurrences.
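For illustration, the raw statistics underlying such a backoff bigram
are just sentence-delimited counts. This sketch (plain Python; the
discounting and backoff-weight computation that HLStats performs on top
of these counts is omitted, and the `<s>`/`</s>` delimiters are the ones
used in this tutorial) shows the counting step:

```python
# Sketch of the raw counts behind a backoff bigram: unigram and bigram
# counts over word sequences wrapped in sentence-boundary symbols.
from collections import Counter

def count_ngrams(sentences):
    uni, bi = Counter(), Counter()
    for words in sentences:
        seq = ["<s>"] + words + ["</s>"]
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    return uni, bi

uni, bi = count_ngrams([["two", "eight"], ["two", "two"]])
print(bi[("two", "eight")], bi[("two", "two")], uni["two"])   # 1 1 3
```

An ARPA backoff model then turns these counts into discounted log
probabilities plus backoff weights for unseen bigrams.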
The tool HBuild was mentioned earlier; it converts the language model
(or grammatical model) into a lattice that can then be used by HVite.
What it essentially does is expand the language model into a fully
looped-back lattice with all bigrams. This seems to be infeasible for
very large dictionaries, and in fact I never tried it with more than
3500 words. However, HBuild does take the backoffs into account by
inserting virtual nodes in the lattice where all the common words of one
backoff are 'collected' before being connected to the next loop. By
that, I guess, it's possible to use very large dictionaries as well.

WARNING: In HTK versions earlier than 2.1 HBuild comes up with a lattice
that cannot be read by HVite if the words contain special characters
like '"'. A workaround is to edit the lattice and quote these special
characters with a '\' before passing it to HVite. This bug was fixed in
HTK 2.1.

=========================================================================
IV. HHEd - Manipulation of HMMs

This is by far the most powerful and important tool in HTK. Like its
siblings it works as a script editor, but input and output are HMM
definition files (or MMFs). Since model definitions may be stored either
in separate files or together in one or more MMFs, there are different
ways to use HHEd. One possible usage for our example task would be:

% HHEd -H hmm2/NUMBERS.mmf -w hmm3/NUMBERS.mmf script.hed numbersphone.txt

where:

Option '-w' defines the output MMF
Option '-H' defines the input MMF
script.hed stores the editing commands
numbersphone.txt is the list of HMM names that are to be edited

1. HHEd Commands

As in HLEd and HDMan, the editor commands are two-capital-letter
mnemonics. For example RT = remove transition:

RT i j itemlist

will remove the transition from state i to state j in all transition
matrices found in the itemlist and re-normalize the remaining transition
probabilities.
There are 26 different, sometimes very specialized commands in the
reference section of HHEd. The most important things HHEd can do are:

- clone to bi- or triphones
- tie states or parts of states into macros
- manipulate transitions
- cluster states (classic or decision trees)
- compact HMM definitions (replace identical definitions by pointers)
- split mixtures

2. HHEd Itemlists

HHEd uses a sort of pattern matching language to address parts of the
loaded models in the commands. These patterns are referred to as
'itemlist' in the synopsis of the different commands. A complete
description of this can be found in the reference section, but the
general idea is to view an HMM as a C-like structure together with some
UNIX-like pattern matching features. Some examples will make the idea
clear:

{ah.transP}                 : transition matrix of model /ah/
{*.transP}                  : transition matrices of all models
{*-ah+f.transP}             : transition matrices of all triphone models
                              /ah/ with right context /f/
{ah.state[2]}               : the second state of model /ah/
{ah.state[2].mix[5]}        : the 5th mixture of the second state of
                              model /ah/
{(ah,ae,ax).state[2-5]}     : the 2nd to 5th states of the models /ah/,
                              /ae/ and /ax/
{*.state[3].mix[2-5].mean}  : the means of the 2nd to 5th mixtures in
                              state 3 of all loaded models
{*.state[2-5].mix[2-5].cov} : the variances or covariance matrices
                              (depending on the model) of ...

TIPS:
Note that states and mixtures are numbered beginning with 1, but the
first and last states are virtual and do not contain any mixtures.
The pattern in an itemlist may be 'over-specified' in the sense that it
may refer to items that do not exist.
For example, if you want to tie all variances in all 6 mixtures in all
states of all loaded models together, but some models have 3 states while
others have 4, you can address all of them with the following command:

TI glob_cov {*.state[2-5].mix[1-6].cov}

In this example all variances that match the itemlist will be replaced by
a so-called macro 'glob_cov' (for macros see the next point), while the
macro itself is computed from the maximum values found among all replaced
variances. HHEd will notice that some of your loaded models actually don't
have a state number 5 and give a warning, but it will correctly tie the
existing variances together.

3. HHEd Macros

HHEd macros are technically just simple replace operations like C macros.
However, each HHEd macro has a certain type that refers to a part of a HMM
definition. You can determine the type of a macro by its definition or its
call. Example:

~v "glob_cov"
4
1.0 0.8 0.7 1.0

might be a simple version of the macro "glob_cov" we saw in the last
example. The 'v' in the first line denotes this macro as being of type
'diagonal variance vector'. HHEd will check that this macro is only called
in the appropriate places, namely where a variance vector is required in a
HMM model definition.

HHEd macros are therefore an easy way to use shared structures (usually
the result of some tying operation). Macro definitions (like in C) have to
appear before they are referred to. HHEd takes care of that if you write
everything into one MMF. If you use distributed model definitions (in
several files) you have to make sure that the macros are loaded first
(usually by the option -H).

Technically, everything in a MMF is a macro, even the global option values
(~o) and the model definitions themselves (~h). But these two are never
referred to and hence should not be called macros.

4.
Some Examples on the Numbers Task

a) Combine distributed HMM definitions into one single MMF

After bootstrapping our models (see Part 1, Section III) we had a separate
file with the model definition for each phone. To combine these into a
single MMF we simply load them into HHEd, apply a dummy script and write
them out:

% HHEd -w hmm2/NUMBERS.mmf -d hmm1 co.hed numbersphone.txt

where:
  Option '-d' tells HHEd where to look for the individual model definition
  files (default is the current directory)
  co.hed is a dummy script that does nothing

Note that here the file numbersphone.txt contains merely a list of HMM
names and HHEd looks for files with the same names in the directory hmm1.
If HHEd cannot find a file corresponding to a name in numbersphone.txt, it
will give an error message and exit.

b) Silence models

This is a more detailed description of what was mentioned in Part 1,
Section V. We want to introduce a short optional silence model /sp/ at the
end of each word. We already have a 3-state model /h#/ trained on the
silence at the beginning and end of each utterance.

First we take a standard text editor, copy the model definition of /h#/
(~h "h#") to a new model definition named /sp/ (~h "sp") in our MMF
hmm1/NUMBERS.mmf. Then we delete the second and fourth state and edit the
transition matrix accordingly (the values are not important; only each row
must sum to 1.0) and store the whole thing in hmm2/NUMBERS.mmf. We now
have a MMF with an additional model definition /sp/ for a one-state
silence model. We issue the following HHEd command:

% HHEd -w hmm3/NUMBERS.mmf -H hmm2/NUMBERS.mmf sil.hed numbersphone.txt

where the script sil.hed contains:

AT 4 2 0.2 {h#.transP}
AT 1 3 0.3 {sp.transP}
TI h#st {h#.state[3],sp.state[2]}

The first AT command ('add transition') will add a 'backward transition'
to the begin/end silence model /h#/ to make it more robust.
The second AT command will add a transition from the (virtual) first state
to the (virtual) last state of the model /sp/, thus making it optional (it
can be jumped over without emitting any observation). This is called a
'tee model' (after the UNIX 'tee' command). The following tie command TI
will create a macro named "h#st" that represents the middle state of the
model /h#/ and is also tied to the single emitting state of model /sp/.

If you look at the resulting MMF hmm3/NUMBERS.mmf, you will find a new
macro '~s "h#st"' right at the top of the file, followed by a full state
definition with 6 mixtures. Way down the file, in the model definitions of
/h#/ and /sp/, this macro will be referred to where the second emitting
state of /h#/ and the first (and only) emitting state of /sp/ are defined.

c) Number of Mixtures

In our example in Part 1 we started with prototypes that had 6 mixtures
per state. However, this was only a (pretty good) guess of mine. To be
sure that you have chosen the optimal topology for your models there is no
way to avoid the heuristic trial-and-error method. I ran a series of
trainings with different numbers of mixtures. It is recommended to start
with a single Gaussian model, train it until it converges on the dev set,
then increase the number of mixtures by one, train again, and so on. The
splitting of mixtures can be done as follows:

% HHEd -H hmm3-1mix/NUMBERS.mmf -M hmm3-2mix mu.hed numbersphone.txt

where mu.hed contains the single command:

MU 2 {*.state[2-6].mix}

The MU command ('mix up') will take each mixture distribution found in the
itemlist and split it into the number of mixtures given in the command
(here: 2). The algorithm is quite complex (please refer to the reference
section for details); it is not merely a split of the means in the
direction of the max variance. MU also takes care that the resulting
mixtures have a reasonable weight (deleting mixtures that fall under a
weight threshold), that the split is distributed evenly over mixtures, etc.
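The core of the split can be sketched as follows. This is a heavily
simplified sketch of my own (the function name mix_up_once is
hypothetical), ignoring MU's weight-floor and even-distribution logic: the
heaviest component is cloned, its weight halved, and the two means
perturbed by plus/minus 0.2 standard deviations.

```python
import math

def mix_up_once(mixtures):
    """One simplified MU-style split. 'mixtures' is a list of
    (weight, mean, var) tuples with diagonal variances; the heaviest
    component is replaced by two half-weight copies whose means are
    shifted by +/- 0.2 standard deviations per dimension."""
    w, mean, var = max(mixtures, key=lambda m: m[0])
    mixtures.remove((w, mean, var))
    sd = [math.sqrt(v) for v in var]
    up   = [m + 0.2 * s for m, s in zip(mean, sd)]
    down = [m - 0.2 * s for m, s in zip(mean, sd)]
    mixtures.extend([(w / 2, up, var), (w / 2, down, var)])
    return mixtures

# start with a 2-mixture state and 'mix up' to 3 mixtures
state = [(0.7, [0.0, 1.0], [1.0, 4.0]),
         (0.3, [2.0, 2.0], [1.0, 1.0])]
mix_up_once(state)
```

Note that the mixture weights still sum to 1.0 after the split, which is
the invariant the real MU command also maintains.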
I did this iterative procedure of splitting and training and found that
the best performance was achieved with 7 mixtures:

Word accuracy: 90.02 [H=4374, D=53, S=234, I=178, N=4661]

(Note that this is done with pruning! It is therefore actually better than
the results reported in Part 1, Section VI.)

d) Triphones

We already produced a MLF with triphones of the training set in Section I
using HLEd. The same operation gave us a list numberstriphone.txt of the
247 triphones occurring in the training set. Now we want to produce the
corresponding triphone model definitions from 1-mixture monophones:

% HHEd -H hmm3-1mix/NUMBERS.mmf -M hmm3-1mix-tri \
       triphones.hed numbersphone.txt

where triphones.hed contains the clone command:

CL numberstriphone.txt

The CL command will read the list of triphones from numberstriphone.txt,
look for corresponding monophones in the loaded model definitions and
clone them into a MMF with triphone definitions. Note that the monophones
themselves are NOT copied to the output MMF, unless they were also listed
in the file numberstriphone.txt. (The -M option is another way to redirect
the output: the output MMF will have the same name as the input MMF, but
is written to the directory hmm3-1mix-tri.)

Now we have two problems that give a wonderful example of what things you
run into working with HTK:

1. The DICT still contains the phones /ae/, /uh/, /hv/ and /m/ (you
remember that these were in the manually labeled training data, but could
not be trained). Since the list of triphones was derived from the
converted MLF train+cv-ali.mlf that was produced by a forced alignment,
these phones do not occur in the triphone list, because the forced
alignment always outputs the physical model name. So the dictionary and
the model set are now incompatible. We solve this by replacing these phone
symbols in DICT by the ones already used in Part 1.

2.
We find some triphones in DICT that are not found in the training set. The
reason is simply that the forced alignment did not choose some of the
pronunciation variants in DICT. Two possible ways out of this:

a) synthesize the required 4 triphones (a lot of work)
b) delete the 16 variants where these triphones occur

I did the latter and ran a test on the monophone baseline system to make
sure that this didn't hurt performance. (It turned out that even without
any multi-pronunciations we get the same performance!)

Now we can re-estimate the triphones using the script HEREST. After 4
iterations we achieve:

Word accuracy: 81.39 [H=3979, D=82, S=414, I=337, N=4475]

This is not very surprising, because these models are not tied and
probably do not have enough data to come up with robust parameters.

e) Clustering and Tying States

I took the 247 triphones and tried several clustering techniques on them.
I'll spare you the details here and just mention that the biggest problem
is the huge variety of parameters you can use in HTK. The best
configuration with triphones I could find was:

- cluster states class-independently, with a minimum of 10 'visits' per
  state, to a maximum within-cluster distance of 0.05
- re-estimate (3 it.)
- 'mix up' to 6 mixtures per state
- re-estimate (8 it.)

Word accuracy: 87.11 [H=4303, D=61, S=261, I=274, N=4625]

I don't claim that this is the optimum you can achieve with triphones on
this task; I merely ran out of time (and furthermore my scratch disk
crashed and I lost all the data from the triphone experiments).

TIPS for HHEd: Note that the input MMF may store more model definitions
than are actually edited. Only the models listed in the model list (last
argument) will be considered for manipulation. Sometimes the input models
will not be copied to the output MMF.
For example, if you give a set of monophone models as input and let HHEd
clone them into triphone models, the output MMF will only contain those
triphones and no copies of the original monophones.

The TI command behaves very differently depending on the type of item you
apply it to. Be sure that you understand the different ways TI calculates
the resulting macro from a tying operation.

=========================================================================

V. Miscs

The following is a collection of miscellaneous stuff.

0. Documentation

The documentation of HTK consists of the HTK Book by Steve Young et al.
(including the reference section) and an HTML version of the book
(file:/u/drspeech/opt/htk/HTKBook/HTKBook.html). Nothing more! There are
no info or man pages for the HTK tools. If you are planning to use the
libraries in your own software, there is no way around reading the source
code itself (Have Fun!).

1. Our Top Performance on Numbers

I took the 7-mixture monophone models and tied the variances in each state
of the models with less than 500 observations in the training set. The
idea was to make these badly estimated states more robust. Then I ran 4
iterations of HEREST.

Word accuracy: 90.38 [H=4378, D=51, S=226, I=171, N=4655]

I tuned the word end penalty to -9.0 and achieved

Word accuracy: 90.53 [H=4363, D=56, S=239, I=146, N=4658]

Then I switched off the beam search pruning and achieved

Word accuracy: 91.53 [H=4408, D=50, S=215, I=131, N=4673]

This best result was achieved with a model of roughly 61152 acoustic
parameters.

2. Weighted pronunciations and Re-scoring of Lattices

As mentioned earlier, HTK (since version 2.1) can use a posteriori
probabilities in the dictionary to weight different pronunciations of the
same word. These do not need to sum up to 1.0, and there is an additional
option '-r' of HVite that scales up the weighting in the search (this
option is documented nowhere!).
Before I found out about this new feature, I tried to re-score lattices
produced by HVite during recognition and then let HVite match them to the
signal again. However, this always resulted in considerably worse results,
even if I did nothing to the lattice at all. I do not know the exact
reason for this, but it might be that the lattice is pruned in some way
and loses the best search path.

3. The Mixture Problem

Better to start with a smaller number of mixtures and work your way up.
You cannot go in the reverse direction, that is, there is no way to merge
mixtures in HTK. If you intend to do some state clustering with decision
trees, you must use single Gaussian models first.

4. Efficiency

Some CPU times on an Ultra 1 with 128MB (averages):

HInit 28 models 1 mix    : 3:43 h
HVite without pruning    : 2:23 h
HVite with pruning 57.0  : 0:22 h
HERest one iteration     : 0:35 h

If you use HInit with more than a few hours of data, it becomes very slow.
I don't know the reason for this, because the slow part is not the
iterative training but the data collection part. Maybe it's simply badly
implemented. For example, HInit of one model on 11 h of speech takes 9 CPU
hours on average.

5. Technical Stuff

Some configurations cause HVite to hog memory like a gulo gulo, especially
if you use the option '-n' with values higher than 10 (lattice output).

HInit 'collects' all segments into RAM first before starting any
processing. This can be a problem if you have more than 100,000 segments
of one single class.

If you have LaTeX style words in your dictionary or other quote signs
(like in "d'accord"), HTK tools will interpret a leading quote sign as an
opening quotation and miss the closing quote sign. This will result in
error messages like:

+??13 read string to long

Workaround: Quote all " and ' in your dictionary with \

6. What I really missed

No way to combine existing HMM definitions into larger models (e.g.
phones to syllables). You have to do it yourself.
No way to gradually re-estimate HMMs in embedded training. Consider for
instance that you have trained up very robust monophones and then switch
to triphones, where there is much less data for each model. A good thing
would be some mechanism that uses deleted interpolation between the
different model sets, weighted by the occupation statistics.
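The idea behind such a mechanism could look like the following. This is
entirely my own sketch and not an HTK feature: interpolate_models is a
hypothetical function, k is an arbitrary smoothing constant, and the
estimates are toy mean vectors. Triphones with high occupation keep their
own estimate; rarely seen ones back off towards the monophone.

```python
def interpolate_models(tri_mean, mono_mean, occ, k=50.0):
    """Hypothetical occupancy-based smoothing: lam = occ / (occ + k).
    With much training data (large occ) lam approaches 1 and the
    triphone estimate dominates; with little data the result falls
    back towards the monophone estimate."""
    lam = occ / (occ + k)
    return [lam * t + (1.0 - lam) * m for t, m in zip(tri_mean, mono_mean)]

# a rare triphone (10 frames) stays close to the monophone mean,
# a frequent one (5000 frames) keeps its own estimate
rare = interpolate_models([2.0, 2.0], [0.0, 0.0], occ=10.0)
freq = interpolate_models([2.0, 2.0], [0.0, 0.0], occ=5000.0)
```

The weighting per model would come straight from the occupation counts
that embedded re-estimation accumulates anyway.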