📄 cluster.tex
字号:
%% Cluster - Gareth Moore 23/01/02 (updated 27/03/02)%\newpage\mysect{Cluster}{Cluster}\mysubsect{Function}{Cluster-Function}\index{cluster@\htool{Cluster}|(}This program is used to statistically cluster words into deterministicclasses. The main purpose of \htool{Cluster} is to optimise a classmap on the basis of the training text likelihood, although it can alsoimport an existing class map and generate one of the files necessaryfor creating a class-based language model from the \HTK\ languagemodelling tools.Class-based language models use a reduced number of classes relativeto the number of words, with each class containing one or more words,to allow a language model to be able to generalise to unseen trainingcontexts. Class-based models also typically require less trainingtext to produce a well-trained model than a similar complexity wordmodel, and are often more compact due to the much reduced number ofpossible distinct history contexts that can be encountered in thetraining data.\htool{Cluster} takes as input a set of one or more training text gramfiles, which may optionally be weighted on input, and their associatedword map. It then clusters the words in the word map into classesusing a bigram likelihood measure. Due to the computationalcomplexity of this task a sub-optimal greedy algorithm is used, butmultiple iterations of this algorithm may be performed in order tofurther refine the class map, although at some point a local maximumwill be reached where the class map will not changefurther.\footnote{On a 65,000 word vocabulary test set with 170 millionwords of training text this was found to occur after around 45iterations} In practice as few as two iterations may be perfectlyadequate, even with large training data sets.The algorithm works by considering each word in the vocabulary in turnand calculating the change in bigram training text likelihood if theword was moved from its default class (see below) to each other classin turn. The word is then moved to the class which increases thelikelihood the most, or it is left in its current class if no suchincrease is found. Each iteration of the algorithm considers eachword exactly once. Because this can be a slow process, with typicalexecution times measured in terms of a few hours, not a few minutes,the \htool{Cluster} tool also allows \textit{recovery} files to be writtenat regular intervals, which contain the current class map part-waythrough an iteration along with associated files detailing at whatpoint in the iteration the class map was exported. These files arenot essential for operation, but might be desirable if there is a riskof a long-running process being killed via some external influence.During the execution of an iteration the tool claims no newmemory,\footnote{other than a few small local variables taken from thestack as functions are called} so it cannot crashin the middle of an iteration due to a lack of memory (it can,however, fail to start an iteration in the first place).Before beginning an iteration, \htool{Cluster} places each word eitherinto a default class or one specified via the \texttt{-l}, importclassmap, or \texttt{-x}, use recovery, options. The defaultdistribution, given $m$ classes, is to place the most frequent $(m-1)$words into singleton classes and then the remainder into the remainingclass. \htool{Cluster} allows words to be considered in eitherdecreasing frequency of occurrence order, or the order they areencountered in the word map. The popular choice is to use the formermethod, although in experiments it was found that the more randomsecond approach typically gave better class maps after feweriterations in practice.\footnote{Note that these schemes areapproximately similar, since the most frequent words are most likelyto be encountered sooner in the training text and thus occur higher upin the word map} The \texttt{-w} option specifies this choice.During execution \htool{Cluster} will always write a logfiledescribing the changes it makes to the classmap, unless you explicitlydisable this using the \texttt{-n} option. If the \texttt{-v} switchis used then this logfile is written in explicit English, allowing youto easily trace the execution of the clusterer; without \texttt{-v}then similar information is exported in a more compact format.Two or three special classes are also defined. The sentence start andsentence end word tokens are always kept in singleton classes, andoptionally the unknown word token can be kept in a singleton class too-- pass the \texttt{-k} option.\footnote{The author always uses thisoption but has not empirically tested its efficaciousness} Thesetokens are placed in these classes on initialisation and no moves toor from these classes are ever considered.Language model files are built using either the \texttt{-p} or\texttt{-q} options, which are effectively equivalent if usingthe \HTK\ language modelling tools as black boxes. The former createsa word-given-class probabilities file, whilst the latter stores wordcounts and lets the language model code itself calculate the sameprobabilities.\mysubsect{Use}{Cluster-Use}\htool{Cluster} is invoked by the command line\begin{verbatim} Cluster [options] mapfile [mult] gramfile [[mult] gramfile ...]\end{verbatim}The given word map is loaded and then each of the specified gram filesis imported. The list of input gram files can be interspersed withmultipliers. These are floating-point format numbers which must beginwith a plus or minus character (e.g. \texttt{+1.0}, \texttt{-0.5},etc.). The effect of a multiplier \texttt{mult} is to scale the $n$-gramcounts in the following gram files by the factor \texttt{mult}. Theresulting scaled counts are rounded to the nearest integer whenactually used in the clustering algorithm. A multiplier stays ineffect until it is redefined.The allowable options to \htool{Cluster} are as follows\begin{optlist} \ttitem{-c n} Use {\tt n} classes. This specifies the number of classes that should be in the resultant class map. \ttitem{-i n} Perform {\tt n} iterations. This is the number of iterations of the clustering algorithm that should be performed. (If you are using the {\tt -x} option then completing the current iteration does not count towards the total number, so use {\tt -i 0} to complete it and then finish) \ttitem{-k} Keep the special unknown word token in its own singleton class. If not passed it can be moved to or from any class. \ttitem{-l fn} Load the classmap {\tt fn} at start up and when performing any further iterations do so from this starting point. \ttitem{-m} Record the running value of the maximum likelihood function used by the clusterer to optimised the training text likelihood in the log file. This option is principally provided for debugging purposes. \ttitem{-n} Do not write any log file during execution of an iteration. \ttitem{-o fn} Specify the prefix of all output files. All output class map, logfile and recovery files share the same filename prefix, and this is specified via the {\tt -o} switch. The default is {\tt cluster}. \ttitem{-p fn} Write a word-given-class probabilities file. Either this or the {\tt -q} switch are required to actually build a class-based language model. The \HTK\ language model library, \htool{LModel}, supports both probability and count-based class files. There is no difference in use, although each allows different types of manual manipulation of the file. Note that if you do not pass {\tt -p} or {\tt -q} you may run \htool{Cluster} at a later date using the {\tt -l} and {\tt -i 0} options to just produce a language model file. \ttitem{-q fn} Write a word-given-class counts file. See the documentation for {\tt -p}. \ttitem{-r n} Write recovery files after moving {\tt n} words since the previous recovery file was written or an iteration began. Pass {\tt -r n} to disable writing of recovery files. \ttitem{-s tkn} Specify the sentence start token. \ttitem{-t tkn} Specify the sentence end token. \ttitem{-u tkn} Specify the unknown word token. \ttitem{-v} Use verbose log file format. \ttitem{-w [WMAP/FREQ]} Specify the order in which word moves are considered. Default is {\tt WMAP} in which words are considered in the order they are encountered in the word map. Specifying {\tt FREQ} will consider the most frequent word first and then the remainder in decreasing order of frequency. \ttitem{-x fn} Continue execution from recovery file {\tt fn}.\end{optlist}\stdopts{Cluster}\mysubsect{Tracing}{Cluster-Tracing}\htool{Cluster} supports the following trace options, where each trace flag is given using an octal base:\begin{optlist} \ttitem{00001} basic progress reporting. \ttitem{00002} report major file operations - good for following start-up. \ttitem{00004} more detailed progress reporting. \ttitem{00010} trace memory usage during execution and at end.\end{optlist}Trace flags are set using the \texttt{-T} option or the \texttt{TRACE}configuration variable.\index{cluster@\htool{Cluster}|)}
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -