information such as the number of utterances and tokens encountered, words
predicted and OOV statistics.  The second part of the results gives explicit
access statistics for the back-off model.  For the trigram model test, the
total number of words predicted is 8127.  From this number, 34.1\% were found
as explicit trigrams in the model, 30.2\% were computed by backing off to the
respective bigrams and 35.7\% were simply computed as bigrams by shortening
the word context.

These perplexity tests do not include the prediction of words from context
which includes OOVs.  To include such $n$-grams in the calculation the
\texttt{-u} option should be used.
\begin{verbatim}
$ LPlex -u -n 3 -t lm_5k/tg1_1 test/red-headed_league.txt
LPlex test #0: 3-gram
perplexity 117.4177, var 8.9075, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)
Access statistics for lm_5k/tg1_1:
Lang model  requested  exact backed    n/a     mean    stdev
   bigram       5911  68.5%  30.9%   0.6%    -5.75     2.94
  trigram       9187  35.7%  31.2%  33.2%    -4.77     2.98
\end{verbatim} % $
The number of tokens predicted has now risen to 9187.  For analysing OOV
rates the tool provides the \texttt{-o} option which will print a list of
unique OOVs encountered together with their occurrence counts.  Further trace
output is available with the \texttt{-T} option.
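For reference, the quantities reported above can be related to their usual
definitions (the notation below is introduced purely for exposition and is
not taken from the HLM source).  The test-set perplexity over the $N$
predicted words is
\[
  \mbox{PP} = \exp \left( - \frac{1}{N} \sum_{i=1}^{N}
      \ln P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \right) ,
\]
and the ``exact'', ``backed'' and ``n/a'' columns of the access statistics
correspond to the three ways in which a back-off model can supply a
probability.  Writing $\hat{P}$ for a probability stored explicitly in the
model and $\alpha$ for a back-off weight, a trigram probability is obtained
as
\[
  P(w_3 \mid w_1, w_2) = \left\{
  \begin{array}{ll}
    \hat{P}(w_3 \mid w_1, w_2)          & \mbox{trigram stored (``exact''),} \\
    \alpha(w_1, w_2) \, P(w_3 \mid w_2) & \mbox{context stored, trigram not (``backed''),} \\
    P(w_3 \mid w_2)                     & \mbox{context not stored (``n/a'').}
  \end{array} \right.
\]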
\mysect{Generating and using count-based models}{HLMgeneratingcount}
\index{Count-based language models}

The language models generated in the previous section are static in terms of
their size and vocabulary.  For example, in order to evaluate a trigram model
with cut-offs 2 (bigram) and 2 (trigram) the user would be required to
rebuild the bigram and trigram stages of the model.  When large amounts of
text data are used this can be a very time consuming operation.

The HLM toolkit provides the capability to generate and manipulate a more
generic type of model, called a count-based model, which can be dynamically
adjusted in terms of its size and vocabulary.  Count-based models are
produced by specifying the \texttt{-x} option to \htool{LBuild}.  The user
may set cut-off parameters which control the initial size of the model, but
if so then once the model is generated only higher cut-off values may be
specified in subsequent operations.  The following command demonstrates how
to generate a count-based model:
\begin{verbatim}
$ LBuild -C config -T 1 -t lm_5k/5k.fof -c 2 1 -c 3 1
         -x -n 3 lm_5k/5k.wmap lm_5k/tg1_1c
         holmes.1/data.* lm_5k/data.0
\end{verbatim} % $
Note that in the above example the full trigram model is generated by a
single invocation of the tool and no intermediate files are kept (i.e.\ the
unigram and bigram model files).

The generated model can now be used in perplexity tests and different model
sizes can be obtained by specifying new cut-off values via the \texttt{-c}
option of \htool{LPlex}.  Thus, using a trigram model with cut-offs (2,2)
gives
\begin{verbatim}
$ LPlex -c 2 2 -c 3 2 -T 1 -u -n 3 -t lm_5k/tg1_1c
        test/red-headed_league.txt
...
LPlex test #0: 3-gram
Processing text stream: test/red-headed_league.txt
perplexity 126.2665, var 9.0519, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)
...
\end{verbatim} % $
and a model with cut-offs (3,3) gives
\begin{verbatim}
$ LPlex -c 2 3 -c 3 3 -T 1 -u -n 3 -t lm_5k/tg1_1c
        test/red-headed_league.txt
...
Processing text stream: test/red-headed_league.txt
perplexity 133.4451, var 9.0880, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)
...
\end{verbatim} % $
However, the count model \texttt{tg1\_1c} cannot be used directly in
recognition tools such as \htool{HVite} or \htool{HLvx}.  An ARPA style model
of the required size suitable for recognition can be derived using the
\htool{HLMCopy} tool:
\begin{verbatim}
$ HLMCopy -T 1 lm_5k/tg1_1c lm_5k/rtg1_1
\end{verbatim} % $
This will be the same as the original trigram model built above, with the
exception of some insignificant rounding differences.

\mysect{Model interpolation}{HLMmodelinterp}
\index{Interpolating language models}

The \HTK\ language modelling tools also provide the capability to produce and
evaluate interpolated language models.  Interpolated models are generated by
combining a number of existing models in a specified ratio to produce a new
model using the tool \htool{LMerge}.  Furthermore, \htool{LPlex} can also
compute perplexities using linearly interpolated $n$-gram probabilities from
a number of source models.  The use of model interpolation will be
demonstrated by combining the previously generated Sherlock Holmes model with
an existing 60,000 word business news domain trigram model
(\texttt{60kbn\_tg.lm}).  The perplexity measure of the unseen Sherlock
Holmes text using the business news model is 297 with an OOV rate of 1.5\%
({\tt LPlex -t -u 60kbn\_tg.lm test/*}).
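Linear interpolation simply computes each word probability as a weighted sum
of the probabilities assigned by the component models.  Using $\lambda_m$ for
the interpolation weights (notation introduced here for exposition; the
weights correspond to those given with the \texttt{-i} option), the
probability of a word $w$ following a history $h$ is
\[
  P(w \mid h) = \sum_{m} \lambda_m \, P_m(w \mid h) ,
  \qquad \sum_{m} \lambda_m = 1 ,
\]
so with two models every word probability is just a weighted average of the
two component probabilities.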
In the following example, the perplexity of the test data will be calculated
by combining the two models in the ratio of 0.6 \texttt{60kbn\_tg.lm} and 0.4
\texttt{tg1\_1c}:
\begin{verbatim}
$ LPlex -T 1 -u -n 3 -t -i 0.6 ./60kbn_tg.lm
        lm_5k/tg1_1c test/red-headed_league.txt
Loading language model from lm_5k/tg1_1c
Loading language model from ./60kbn_tg.lm
Using language model(s):
  3-gram lm_5k/tg1_1c, weight 0.40
  3-gram ./60kbn_tg.lm, weight 0.60
Found 60275 unique words in 2 model(s)
LPlex test #0: 3-gram
Processing text stream: test/red-headed_league.txt
perplexity 188.0937, var 11.2408, utterances 556, words predicted 9721
num tokens 10408, OOV 131, OOV rate 1.33% (excl. </s>)
Access statistics for lm_5k/tg1_1c:
Lang model  requested  exact backed    n/a     mean    stdev
   bigram       5479  68.0%  31.3%   0.6%    -5.69     2.93
  trigram       8329  34.2%  30.6%  35.1%    -4.75     2.99
Access statistics for ./60kbn_tg.lm:
Lang model  requested  exact backed    n/a     mean    stdev
   bigram       5034  83.0%  17.0%   0.1%    -7.14     3.57
  trigram       9683  48.0%  26.9%  25.1%    -5.69     3.53
\end{verbatim} % $
A single combined model can be generated using \htool{LMerge}:
\begin{verbatim}
$ LMerge -T 1 -i 0.6 ./60kbn_tg.lm 5k_unk.wlist
         lm_5k/rtg1_1 5k_merged.lm
\end{verbatim} % $
Note that \htool{LMerge} cannot merge count-based models, hence the use of
\texttt{lm\_5k/rtg1\_1} instead of its count-based equivalent
\texttt{lm\_5k/tg1\_1c}.  Furthermore, the word list supplied to the tool
also includes the OOV symbol (\texttt{!!UNK}) in order to preserve OOV
$n$-grams in the output model, which in turn allows the use of the
\texttt{-u} option in \htool{LPlex}.

Note that the perplexity you will obtain with this combined model is much
lower than that obtained when interpolating the two models together, because
the word list has been reduced from the union of the 60K and 5K lists down to
a single 5K list.  You can build a 5K version of the 60K model using
\htool{HLMCopy} and the {\tt -w} option, but first you need to construct a
suitable word list -- if you pass it the {\tt 5k\_unk.wlist} one it will
complain about the words in it that weren't found in the language model.  In
the {\tt extras} subdirectory you'll find a Perl script to rip the word list
from the {\tt 60kbn\_tg.lm} model, {\tt getwordlist.pl}, and the result of
running it in {\tt 60k.wlist} (the script will work with any ARPA type
language model).  The intersection of the 60K and 5K word lists is what is
required, so if you then run the {\tt extras/intersection.pl} Perl script,
amended to use suitable filenames, you'll get the result in
{\tt 60k-5k-int.wlist}.  Then \htool{HLMCopy} can be used to produce a 5K
vocabulary version of the 60K model:
\begin{verbatim}
$ HLMCopy -T 1 -w 60k-5k-int.wlist 60kbn_tg.lm 5kbn_tg.lm
\end{verbatim} % $
This can then be linearly interpolated with the previous 5K model to compare
the perplexity result with that obtained from the \htool{LMerge}-generated
model.  If you try this you will find that the perplexities are similar, but
not exactly the same (a perplexity of 112 with the merged model and 114 with
the two models linearly interpolated, in fact) -- this is because using
\htool{LMerge} to combine two models and then using the result is not
precisely the same as linearly interpolating two separate models; it is
similar, however.

It is also possible to add to an existing language model using the
\htool{LAdapt} tool, which will construct a new model using supplied text and
then merge it with the existing one in exactly the same way as
\htool{LMerge}.  Effectively this tool allows you to short-cut the process by
performing many operations with a single command -- see the documentation in
section \ref{s:LAdapt} for full details.

\mysect{Class-based models}{HLMclassModels}
\index{Class language models}

A class-based $n$-gram model is similar to a word-based $n$-gram model in
that both store probabilities of $n$-tuples of tokens -- except that in the
class model case these tokens consist of word {\it classes} instead of words
(although word models typically include at least one class for the unknown
word).  Thus building a class model involves constructing class $n$-grams.
A second component of the model calculates the probability of a word given
each class.  The HTK tools only support deterministic class maps, so each
word can only be in one class.
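Putting these two components together, and writing $c(w)$ for the single,
deterministic class of a word $w$ (notation used here for exposition only),
a class-based trigram model assigns
\[
  P(w_i \mid w_{i-2}, w_{i-1}) =
    P( w_i \mid c(w_i) ) \,
    P( c(w_i) \mid c(w_{i-2}), c(w_{i-1}) ) ,
\]
where the second factor comes from the class $n$-gram component and the
first from the word-given-class component.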
Class language models use a separate file to store each of the two components
-- the word-given-class probabilities and the class $n$-grams -- as well as a
third file which points to the two component files.  Alternatively, the two
components can be combined into a single standalone file.  In this section
we'll see how to build these files using the supplied tools.

Before a class model can be built it is necessary to construct a class map
which defines which words are in each class.  The supplied \htool{Cluster}
tool can derive a class map based on the bigram word statistics found in some
text, although if you are constructing a large number of classes it can be
rather slow (execution time measured in hours, typically).  In many systems
class models are combined with word models to give further gains, so we'll
build a class model based on the Holmes training text and then interpolate it
with our existing word model to see if we can get a better overall model.

Constructing a class map requires a decision to be made as to how many
separate classes are required.  A sensible number depends on what you are
building the model for, and whether you intend it purely to interpolate with
a word model.  In the latter case, for example, a sensible number of classes
is often around the 1000 mark when using a 64K word vocabulary.  We only have
5000 words in our vocabulary so we'll choose to construct 150 classes in this
case.

Create a directory called {\tt holmes.2} and run \htool{Cluster} with
\begin{verbatim}
$ Cluster -T 1 -c 150 -i 1 -k -o holmes.2/class lm_5k/5k.wmap
          holmes.1/data.* lm_5k/data.0
Preparing input gram set
Input gram file holmes.1/data.0 added (weight=1.000000)
Input gram file lm_5k/data.0 added (weight=1.000000)
Beginning iteration 1
Iteration complete
Cluster completed successfully
\end{verbatim} % $
The word map and gram files are passed as before -- any OOV mapping should be
made before building the class map.  Passing the {\tt -k} option told
\htool{Cluster} to keep the unknown word token {\tt !!UNK} in its own
singleton class, whilst the {\tt -c 150} option specifies that we wish to
create 150 classes.  The {\tt -i 1} performs only one iteration of the
clusterer -- performing further iterations is likely to give further small
improvements in performance, but we won't wait for this here.  Whilst
\htool{Cluster} is running you can look at the end of
{\tt holmes.2/class.1.log} to see how far it has got.  On a Unix-like system
you could use a command like {\tt tail holmes.2/class.1.log}, or if you
wanted to monitor progress then {\tt tail -f holmes.2/class.1.log} would do
the trick.  The {\tt 1} refers to the iteration, whilst the results are
written to this filename because of the {\tt -o holmes.2/class} option which
sets the prefix for all output files.

In the {\tt holmes.2} directory you will also see the files {\tt
