class.recovery} and {\tt class.recovery.cm} -- these are a recovery
status file and its associated class map which are exported at
regular intervals because the \htool{Cluster} tool can take so long
to run.  In this way you can kill the tool before it has finished and
resume execution at a later date by using the {\tt -x} option; in
this case you would use {\tt -x holmes.2/class.recovery} for example
(making sure you pass the same word map and gram files -- the tool
does {\it not} currently check that you pass it the same files when
restarting).

Once the tool finishes running you should see the file {\tt
holmes.2/class.1.cm} which is the resulting class map.  It is in
plain text format so feel free to examine it.  Note, for example, how
{\tt CLASS23} consists almost totally of verb forms ending in
{\tt -ED}, whilst {\tt CLASS41} lists various general words for a
person or object.  Had you created more classes then you would be
likely to see more distinctive classes.  We can now use this file to
build the class $n$-gram component of our language model.
\begin{verbatim}
$ LGCopy -T 1 -d holmes.2 -m holmes.2/cmap
         -w holmes.2/class.1.cm lm_5k/5k.wmap
         holmes.1/data.* lm_5k/data.0
Input file holmes.1/data.0 added, weight=1.0000
Input file lm_5k/data.0 added, weight=1.0000
Copying 2 input files to output files with 2000000 entries
Class map = holmes.2/class.1.cm
 saving 162397 ngrams to file holmes.2/data.0
 330433 out of 330433 ngrams stored in 1 files
\end{verbatim} % $
The {\tt -w} option specifies an input class map which is applied
when copying the gram files, so we now have a class gram file in
{\tt holmes.2/data.0}.  It has an associated word map file
{\tt holmes.2/cmap} -- although this only contains class names it is
technically a word map since it is taken as input wherever a word map
is required by the \HTK\ language modelling tools; recall that word
maps can contain classes as witnessed by {\tt !!UNK} previously.

You can examine the class $n$-grams in a similar way to previously by
using \htool{LGList}
\begin{verbatim}
$ LGList holmes.2/cmap holmes.2/data.0 | more
 3-Gram File holmes.2/data.0[162397 entries]:
  Text Source: LGCopy
CLASS1 CLASS10 CLASS103 : 1
CLASS1 CLASS10 CLASS11 : 2
CLASS1 CLASS10 CLASS118 : 1
CLASS1 CLASS10 CLASS12 : 1
CLASS1 CLASS10 CLASS126 : 2
CLASS1 CLASS10 CLASS140 : 2
CLASS1 CLASS10 CLASS147 : 1
...
\end{verbatim} % $
And similarly the class $n$-gram component of the overall language
model is built using \htool{LBuild} as previously with
\begin{verbatim}
$ LBuild -T 1 -c 2 1 -c 3 1 -n 3 holmes.2/cmap
         lm_5k/cl150-tg_1_1.cc holmes.2/data.*
Input file holmes.2/data.0 added, weight=1.0000
\end{verbatim} % $
To build the word-given-class component of the model we must run
\htool{Cluster} again.
\begin{verbatim}
$ Cluster -l holmes.2/class.1.cm -i 0 -q lm_5k/cl150-counts.wc
          lm_5k/5k.wmap holmes.1/data.* lm_5k/data.0
\end{verbatim} % $
This is very similar to how we ran \htool{Cluster} earlier, except
that we now want to perform 0 iterations ({\tt -i 0}) and we start by
loading in the existing class map with {\tt -l holmes.2/class.1.cm}.
We don't need to pass {\tt -k} because we aren't doing any further
clustering, and we don't need to specify the number of classes since
this is read from the class map along with the class contents.  The
{\tt -q lm\_5k/cl150-counts.wc} option tells the tool to write
word-given-class counts to the specified file.  Alternatively we
could have specified {\tt -p} instead of {\tt -q} and written
probabilities as opposed to counts.  The file is in a plain text
format, and either the {\tt -p} or {\tt -q} version is sufficient for
forming the word-given-class component of a class language model.
Note that in fact we could have simply added either {\tt -p} or
{\tt -q} the first time we ran \htool{Cluster} and generated both the
class map and language model component file in one go.
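For example, adding {\tt -q} to the earlier clustering command would
have produced both files in one pass.  The sketch below is
illustrative only -- the switches requesting the clustering itself
(the number of classes, iterations, {\tt -k} and so on) are those of
the original run and are elided here:
\begin{verbatim}
$ Cluster -T 1 [clustering options as before] -q lm_5k/cl150-counts.wc
          lm_5k/5k.wmap holmes.1/data.* lm_5k/data.0
\end{verbatim} % $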
Given the two language model components we can now link them together
to make our overall class $n$-gram language model.
\begin{verbatim}
$ LLink lm_5k/cl150-counts.wc lm_5k/cl150-tg_1_1.cc lm_5k/cl150-tg_1_1
\end{verbatim} % $
The \htool{LLink} tool creates a simple text file pointing to the two
necessary components, auto-detecting whether a count or probabilities
file has been supplied.  The resulting file,
{\tt lm\_5k/cl150-tg\_1\_1}, is the finished overall class $n$-gram
model, whose performance we can now assess with \htool{LPlex}.
\begin{verbatim}
$ LPlex -n 3 -t lm_5k/cl150-tg_1_1 test/red-headed_league.txt
LPlex test #0: 3-gram
perplexity 125.9065, var 7.4139, utterances 556, words predicted 8127
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

Access statistics for lm_5k/cl150-tg_1_1:
Lang model  requested  exact  backed    n/a   mean  stdev
    bigram       2867  95.4%    4.6%   0.0%  -4.61   1.64
   trigram       8127  64.7%   24.1%  11.2%  -4.84   2.72
\end{verbatim} % $
The class trigram model performs worse than the word trigram (which
had a perplexity of 117.4), but this is no surprise since the same is
true on almost every reasonably-sized test set -- the class model is
less specific.  Interpolating the two often leads to further
improvements, however.  We can find out whether that happens in this
case by interpolating the models with \htool{LPlex}.
\begin{verbatim}
$ LPlex -u -n 3 -t -i 0.4 lm_5k/cl150-tg_1_1 lm_5k/tg1_1
        test/red-headed_league.txt
LPlex test #0: 3-gram
perplexity 102.6389, var 7.3924, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

Access statistics for lm_5k/tg2-1_1:
Lang model  requested  exact  backed    n/a   mean  stdev
    bigram       5911  68.5%   30.9%   0.6%  -5.75   2.94
   trigram       9187  35.7%   31.2%  33.2%  -4.77   2.98

Access statistics for lm_5k/cl150-tg_1_1:
Lang model  requested  exact  backed    n/a   mean  stdev
    bigram       3104  95.5%    4.5%   0.0%  -4.67   1.62
   trigram       9187  66.2%   23.9%   9.9%  -4.87   2.75
\end{verbatim} % $
So a further gain is obtained -- the interpolated model performs
significantly better.  Further improvement might be possible by
attempting to optimise the interpolation weight.
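One straightforward approach is to repeat the \htool{LPlex} run with
a range of values for the {\tt -i} interpolation weight and keep
whichever gives the lowest perplexity -- the two weights below are
arbitrary illustrations, and ideally the weight would be chosen on a
held-out text rather than the final test set:
\begin{verbatim}
$ LPlex -u -n 3 -t -i 0.3 lm_5k/cl150-tg_1_1 lm_5k/tg1_1
        test/red-headed_league.txt
$ LPlex -u -n 3 -t -i 0.5 lm_5k/cl150-tg_1_1 lm_5k/tg1_1
        test/red-headed_league.txt
\end{verbatim} % $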
Note that we could also have used \htool{LLink} to build a single
class language model file instead of producing a third file which
points to the two components.  We can do this by using the {\tt -s}
single file option.
\begin{verbatim}
$ LLink -s lm_5k/cl150-counts.wc lm_5k/cl150-tg_1_1.cc
        lm_5k/cl150-tg_1_1.all
\end{verbatim} % $
The file {\tt lm\_5k/cl150-tg\_1\_1.all} is now a standalone language
model, identical in performance to {\tt lm\_5k/cl150-tg\_1\_1}
created earlier.

\mysect{Problem solving}{HLMproblemSolving}
\index{Problem solving}

Sometimes a tool returns an error message which doesn't seem to make
sense when you check the files you've passed and the switches you've
given.  This section provides a few problem-solving hints.

\mysubsect{File format problems}{HLMfileproblems}

If a file which seems to be in the correct format is giving errors
such as `Bad header' then make sure that you are using the correct
input filter.  If the file is gzipped then ensure you are using a
suitable configuration parameter to decompress it on input; similarly
if it isn't compressed then check you're not trying to decompress it.
Also check to see if you have two files, one with and one without a
{\tt .gz} extension -- maybe you're picking up the wrong one and
checking the other file.
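For instance, if your gram files are stored gzipped then a
configuration entry along the following lines would filter them
through {\tt gunzip} on input.  This is only a sketch -- the variable
name shown is an assumption, so check the input filter parameters
documented in chapter~\ref{c:toolref} for the exact names your tools
expect:
\begin{verbatim}
# Hypothetical config entry: decompress gram files when reading,
# with $ standing for the name of the file being opened
LGRAMFILTER = 'gunzip -c $'
\end{verbatim} % $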
You might be missing a switch or configuration file to tell the tool
which format the file is in.  In general none of the \HTK\ language
modelling tools can auto-detect file formats -- unless you tell them
otherwise they will expect the file type they are configured to
default to, and will give an error relevant to that type if it does
not match.  For example, if you omit to pass {\tt -t} to
\htool{LPlex} then it will treat an input text file as a \HTK\ label
file, and you will get a `Too many columns' error if a line has more
than 100 words on it or a ridiculously high perplexity otherwise.
Check the command documentation in chapter~\ref{c:toolref}.

\mysubsect{Command syntax problems}{HLMsyntaxproblems}

If a tool is giving unexpected syntax errors then check that you have
placed all the option switches {\it before} the compulsory parameters
-- the tools will not work if this rule is not followed.  You must
also place whitespace between switches and any options they expect.
The ordering of switches is not important, but the order of
compulsory parameters cannot be changed.  Check the switch syntax --
passing a redundant parameter to a switch will cause problems since
it will be interpreted as the first compulsory parameter.

All \HTK\ tools assume that a parameter which starts with a digit is
a number of some kind, so you cannot pass filenames which start with
a digit.  This is a limitation of the routines in \htool{HShell}.

\mysubsect{Word maps}{HLMwordmapproblems}

If your word map and gram file combination is being rejected then
make sure they match in terms of their sequence number.  Although
gram files are mainly stored in a binary format the header is in
plain text, so if you look at the top of the file you can compare it
manually with the word map.  Note that it is not a good idea to
fiddle the values to match since they are bound to be different for a
good reason!  Word maps must have the same or a higher sequence id
than a gram file in order to open that gram file -- the names must
match too.

The tools might not behave as you expect.  For example,
\htool{LGPrep} will write its word map to the file {\tt wmap} unless
you tell it otherwise, irrespective of the input filename.  It will
also place it in the same directory as the gram files unless you
changed its name from {\tt wmap}(!) -- check you are picking up the
correct word map when building subsequent gram files.

Word ids start at 65536 in order to allow space for classes below
them -- anything lower is assumed to be a class -- and in turn the
number of classes is limited to 65535.

\mysubsect{Memory problems}{HLMmemoryproblems}

Should you encounter memory problems then try altering the amount of
space reserved by the tools using the relevant tool switches, such as
{\tt -a} and {\tt -b} for \htool{LGPrep} and \htool{LGCopy}.  You
could also try turning on memory tracing to see how much memory is
used and for what (use the configuration {\tt TRACE} parameters and
the {\tt -T} option as appropriate).  Language models can become very
large, however -- hundreds of megabytes in size, for example -- so it
is important to apply cut-offs and/or discounting as appropriate to
keep them to a suitable size for your system.

\mysubsect{Unexpected perplexities}{HLMperpproblems}

If perplexities are not what you expected then many things could have
gone wrong -- you may simply not have constructed a suitable model --
but there are also some common mistakes to check for.  Check that you
passed all the switches you intended, and check that you have been
consistent in your use of {\tt *RAW*} configuration parameters --
using escaped characters in the language model without them in your
test text will lead to unexpected results.  If you have not escaped
words in your word map then check they're not escaped in any class
map.  When using a class model make sure you're passing the correct
input file of the three separate components.

Check the switches to \htool{LPlex} -- did you set {\tt -u} as you
intended?  If you passed a text file did you pass {\tt -t}?  Not
doing so will lead either to a format error or to extremely bizarre
perplexities!

Did you build the length of $n$-gram you meant to?  Check the final
language model by looking at its header, which is always stored in
plain text format.  You can easily see how many $n$-grams there are
for each size of $n$.
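For example, since the header is plain text you can simply list the
top of the model file with a standard shell command.  Assuming the
model is in the usual ARPA-style text format, the $n$-gram counts
appear in the data section of the header -- the output below is a
sketch with the actual numbers elided:
\begin{verbatim}
$ head lm_5k/tg1_1
\data\
ngram 1=...
ngram 2=...
ngram 3=...
\end{verbatim} % $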