Class map = 5k.wlist  [Class mappings only]
 saving 75400 ngrams to file lm_5k/data.0
92918 out of 489516 ngrams stored in 1 files
\end{verbatim} % $

Because the {\tt -o} option was passed, all $n$-grams containing OOVs
will be extracted from the input files and the OOV words mapped to the
unknown symbol, with the results stored in the files
\texttt{lm\_5k/data.*}. A new word map containing the new class
symbols (\texttt{!!UNK} in this case) and only words in the vocabulary
will be saved to \texttt{lm\_5k/5k.wmap}. Note how the newly produced
OOV $n$-gram files can no longer be decoded by the original word map
\texttt{holmes.0/wmap}:
\begin{verbatim}
$ LGList holmes.0/wmap lm_5k/data.0 | more
 ERROR [+15330] OpenNGramFile: Gram file map Holmes%%5k.wlist inconsistent with Holmes
 FATAL ERROR - Terminating program LGList
\end{verbatim} % $
The error is due to the mismatch between the original map's name
(``Holmes'') and the name of the map stored in the header of the
$n$-gram file we attempted to list (``Holmes\%\%5k.wlist''). The latter
name indicates that the word map was derived from the original map
\texttt{Holmes} by resolving class membership using the class map
\texttt{5k.wlist}. As a further consistency indicator, the original
map has a sequence count of 1 whilst the class-resolved map has a
sequence count of 2.

The correct command for listing the contents of the OOV $n$-gram
file is:
\begin{verbatim}
$ LGList lm_5k/5k.wmap lm_5k/data.0 | more
4-Gram File lm_5k/data.0[75400 entries]: Text Source: LGCopy
!!UNK !!UNK !!UNK !!UNK : 50
!!UNK !!UNK !!UNK </s> : 20
!!UNK !!UNK !!UNK A : 2
!!UNK !!UNK !!UNK ACCOUNTS : 1
!!UNK !!UNK !!UNK ACROSS : 1
!!UNK !!UNK !!UNK AND : 17
...
\end{verbatim} % $
At the same time the class-resolved map \texttt{lm\_5k/5k.wmap} can
be used to list the contents of the $n$-gram database files -- the
newer map can view the older grams, but not vice-versa.
\begin{verbatim}
$ LGList lm_5k/5k.wmap holmes.1/data.2 | more
4-Gram File holmes.1/data.2[89516 entries]: Text Source: LGCopy
THE SUSSEX MANOR HOUSE : 1
THE SWARTHY GIANT GLARED : 1
THE SWEEP OF HIS : 1
THE SWEET FACE OF : 1
THE SWEET PROMISE OF : 1
THE SWINGING DOOR OF : 1
...
\end{verbatim} % $
However, any $n$-grams containing OOV words will be discarded since
these are no longer in the word map.

Note that the required word map \texttt{lm\_5k/5k.wmap} can also be
produced using the \htool{LSubset} tool:
\begin{verbatim}
$ LSubset -T 1 holmes.0/wmap 5k.wlist lm_5k/5k.wmap
\end{verbatim} % $
Note also that had the {\tt -o} option not been passed to
\htool{LGCopy} then the $n$-gram files built in {\tt lm\_5k} would have
contained not only the $n$-grams with OOV entries but also all the
remaining purely in-vocabulary $n$-grams -- the union of those shown
by the two preceding {\tt LGList} commands, in fact. The method that
you choose to use depends on what experiments you are performing --
the \HTK\ tools allow you a degree of flexibility.

\mysect{Language model generation}{HLMlanmodgen}

Language models are built using the \htool{LBuild} command. If you're
constructing a class-based model you'll also need the \htool{Cluster}
tool, but for now we'll construct a standard word $n$-gram model.

You'll probably want to accept the default of using Turing-Good
discounting for your $n$-gram model, so the first step in generating a
language model is to produce a frequency of frequency (FoF) table for
the chosen vocabulary list. This is performed automatically by
\htool{LBuild}, but optionally you can generate this yourself using
the \htool{LFoF} tool and pass the result into \htool{LBuild}. This
has only a negligible effect on computation time, but the result is
interesting in itself because it provides useful information for
setting cut-offs. Cut-offs are where you choose to discard low
frequency events from the training text -- you might wish to do this
to decrease model size, or because you judge these infrequent events
to be unimportant.
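To see why the FoF counts matter, recall the standard Turing-Good
estimate (quoted here as the textbook definition rather than as a
description of \htool{LBuild}'s exact implementation, which may differ
in details such as the range of counts that are discounted): an event
observed $r$ times is assigned the discounted count
\[
  r^* = (r+1)\,\frac{n_{r+1}}{n_r} ,
\]
where $n_r$ is the number of distinct $n$-grams occurring exactly $r$
times in the training data. The FoF table tabulates precisely these
$n_r$ values, which is why they must be known before the discounting
can be applied.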
In this example, you can generate a suitable table from the language
model databases and the newly generated OOV $n$-gram files:
\begin{verbatim}
$ LFoF -T 1 -n 4 -f 32 lm_5k/5k.wmap lm_5k/5k.fof holmes.1/data.* lm_5k/data.*
Input file holmes.1/data.0 added, weight=1.0000
Input file holmes.1/data.1 added, weight=1.0000
Input file holmes.1/data.2 added, weight=1.0000
Input file lm_5k/data.0 added, weight=1.0000
Calculating FoF table
\end{verbatim} % $
After executing the command, the FoF table will be stored in
\texttt{lm\_5k/5k.fof}. It records how many distinct $n$-grams of each
order occur with a given frequency in the training data -- if you
recall the definition of Turing-Good discounting given above, you will
see why these counts need to be known. See chapter \ref{c:hlmfiles}
for further details of the FoF file format.

You can also pass a configuration parameter to \htool{LFoF} to make it
output a related table showing the number of $n$-grams that will be
left if different cut-off rates are applied. Rerun \htool{LFoF} and
also pass it the existing configuration file {\tt config}:
\begin{verbatim}
$ LFoF -C config -T 1 -n 4 -f 32 lm_5k/5k.wmap lm_5k/5k.fof holmes.1/data.* lm_5k/data.*
Input file holmes.1/data.0 added, weight=1.0000
Input file holmes.1/data.1 added, weight=1.0000
Input file holmes.1/data.2 added, weight=1.0000
Input file lm_5k/data.0 added, weight=1.0000
Calculating FoF table

cutoff     1-g      2-g      3-g      4-g
   0      5001   128252   330433   471998
   1      5001    49014    60314    40602
   2      5001    30082    28646    15492
   3      5001    21614    17945     8801
  ...
\end{verbatim} % $
The information can be interpreted as follows. A bigram cut-off value
of 1 will leave 49014 bigrams in the model, whilst a trigram cut-off
of 3 will result in 17945 trigrams in the model. The configuration
file \texttt{config} forces the tool to print out this extra
information by setting \texttt{LPCALC: TRACE=3}. This is the trace
level for one of the library modules, and is separate from the trace
level for the tool itself (in this case we are passing {\tt -T 1} to
set the tool's trace level to 1). The trace field consists of a series
of bits, so setting trace 3 actually turns on two of those trace
flags.

We'll now proceed to build our actual language model. In this example
the model will be generated in stages by running \htool{LBuild}
separately for each of the unigram, bigram and trigram sections of the
model (we won't build a 4-gram model in this example, although the
$n$-gram files we've built allow us to do so at a later date if we so
wish), but you can build the final trigram in one go if you like. The
following command will generate the unigram model:
\begin{verbatim}
$ LBuild -T 1 -n 1 lm_5k/5k.wmap lm_5k/ug holmes.1/data.* lm_5k/data.*
\end{verbatim} % $
Look in the {\tt lm\_5k} directory and you'll discover the model {\tt
ug}, which can now be used on its own as a complete ARPA format
unigram language model.

We'll now build a bigram model with a cut-off of 1, and to save
regenerating the unigram component we'll include our existing unigram
model:
\begin{verbatim}
$ LBuild -C config -T 1 -t lm_5k/5k.fof -c 2 1 -n 2 -l lm_5k/ug lm_5k/5k.wmap lm_5k/bg1 holmes.1/data.* lm_5k/data.*
\end{verbatim} % $
Passing the {\tt config} file again means that the tool also prints
out some discount coefficient information. Try rerunning the tool
without the {\tt -C config} to see the difference. We've also passed
in the existing {\tt lm\_5k/5k.fof} file, although this is not
necessary -- try omitting it and you'll find that the resulting model
file is identical. What will be different, however, is that the tool
will print out the cut-off table seen when running \htool{LFoF} with
the {\tt LPCALC: TRACE = 3} parameter set; if you don't want to see
this then don't set {\tt LPCALC: TRACE = 3} in the configuration file
(try running the above command without {\tt -t} and {\tt -C}).

Note that this bigram model is created in \HTK's own binary version of
the ARPA format language model, with just the unigram component in
text format by default. This makes the model more compact and faster
to load. If you want to override this then simply add the {\tt -f
TEXT} parameter to the command.

Finally, the trigram model can be generated using the command:
\begin{verbatim}
$ LBuild -T 1 -c 3 1 -n 3 -l lm_5k/bg1 lm_5k/5k.wmap lm_5k/tg1_1 holmes.1/data.* lm_5k/data.*
\end{verbatim} % $
Alternatively, instead of the three stages above, you can also build
the final trigram in one step:
\begin{verbatim}
$ LBuild -T 1 -c 2 1 -c 3 1 -n 3 lm_5k/5k.wmap lm_5k/tg2-1_1 holmes.1/data.* lm_5k/data.*
\end{verbatim} % $
If you compare the two trigram models you'll see that they're the same
size -- there will probably be a few insignificant differences in
probability due to the additional cumulative rounding errors
introduced by the three-stage procedure.
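As noted earlier, the 4-gram data files we have prepared would also
support a 4-gram model. Purely as an illustrative sketch -- this
command is not run as part of the tutorial, the output name
{\tt lm\_5k/fg2-1\_1} is just an example, and you should check the
\htool{LBuild} reference section before relying on this exact
invocation -- a one-step 4-gram build would follow the same pattern as
the one-step trigram above:
\begin{verbatim}
$ LBuild -T 1 -c 2 1 -c 3 1 -c 4 1 -n 4 lm_5k/5k.wmap lm_5k/fg2-1_1 holmes.1/data.* lm_5k/data.*
\end{verbatim} % $
Here {\tt -n 4} requests a 4-gram model and the additional {\tt -c 4 1}
applies a cut-off of 1 to the 4-gram component, mirroring the bigram
and trigram cut-offs.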
\mysect{Testing the LM perplexity}{HLMtestingpp}
\index{Perplexity}

Once the language models have been generated, their ``goodness'' can
be evaluated by computing the perplexity of previously unseen text
data. This won't necessarily tell you how well the language model
will perform in a speech recognition task, because it takes no account
of acoustic similarities or the vagaries of any particular system, but
it will reveal how well a given piece of test text is modelled by your
language model. The directory \texttt{test} contains a single story
which was withheld from the training text for testing purposes -- if
it had been included in the training text then it wouldn't be fair to
test the perplexity on it, since the model would have already `seen'
it.

Perplexity evaluation is carried out using \htool{LPlex}. The tool
accepts input text in one of two forms -- either as an HTK style MLF
(this is the default mode) or as a simple text stream. The text
stream mode, specified with the \texttt{-t} option, will be used to
evaluate the test material in this example.
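As a reminder (this is the standard definition of perplexity rather
than anything specific to \htool{LPlex}), if the model predicts $N$
words of the test text then the perplexity is the reciprocal of the
geometric mean of the probabilities it assigns to them,
\[
  \mbox{PP} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}
              \ln P(w_i \mid w_{i-n+1},\ldots,w_{i-1})\right) ,
\]
so a lower value indicates that the test text is better predicted.
Exactly which tokens count as predicted (for example, the treatment of
OOV words and sentence-end symbols) can be read off the \htool{LPlex}
output below.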
\begin{verbatim}
$ LPlex -n 2 -n 3 -t lm_5k/tg1_1 test/red-headed_league.txt
LPlex test #0: 2-gram
perplexity 131.8723, var 7.8744, utterances 556, words predicted 8588
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)
Access statistics for lm_5k/tg1_1:
Lang model  requested  exact  backed    n/a   mean  stdev
   bigram        8588  78.9%   20.6%   0.4%  -4.88   2.81
  trigram           0   0.0%    0.0%   0.0%   0.00   0.00
LPlex test #1: 3-gram
perplexity 113.2480, var 8.9254, utterances 556, words predicted 8127
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)
Access statistics for lm_5k/tg1_1:
Lang model  requested  exact  backed    n/a   mean  stdev
   bigram        5357  68.2%   31.1%   0.6%  -5.66   2.93
  trigram        8127  34.1%   30.2%  35.7%  -4.73   2.99
\end{verbatim} % $
The multiple \texttt{-n} options instruct \htool{LPlex} to perform two
separate tests on the data. The first test (\texttt{-n 2}) will use
only the bigram part of the model (and the unigram when backing off),
whilst the second test (\texttt{-n 3}) will use the full trigram
model. For each test, the first part of the result gives general
statistics such as the perplexity, the number of utterances and words
predicted, and the OOV rate, whilst the access statistics show how
often each $n$-gram component was requested, matched exactly, backed
off, or found not applicable.
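As an optional further experiment (this step is not part of the
original walkthrough), you could evaluate the stand-alone bigram model
built earlier in exactly the same way and compare its perplexity with
the figures above, which gives a feel for how much the trigram
component contributes:
\begin{verbatim}
$ LPlex -n 2 -t lm_5k/bg1 test/red-headed_league.txt
\end{verbatim} % $
Only options already used above ({\tt -n} and {\tt -t}) are involved,
so the output takes the same form as the tests shown.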