%% Cluster - Gareth Moore        23/01/02 (updated 27/03/02)
%
\newpage
\mysect{Cluster}{Cluster}

\mysubsect{Function}{Cluster-Function}

\index{cluster@\htool{Cluster}|(}
This program is used to statistically cluster words into deterministic
classes.  The main purpose of \htool{Cluster} is to optimise a class map
on the basis of the training text likelihood, although it can also
import an existing class map and generate one of the files necessary
for creating a class-based language model with the \HTK\ language
modelling tools.

Class-based language models use a reduced number of classes relative to
the number of words, with each class containing one or more words, to
allow a language model to generalise to unseen training contexts.
Class-based models also typically require less training text to produce
a well-trained model than a word model of similar complexity, and are
often more compact due to the much reduced number of possible distinct
history contexts that can be encountered in the training data.

\htool{Cluster} takes as input a set of one or more training text gram
files, which may optionally be weighted on input, and their associated
word map.  It then clusters the words in the word map into classes
using a bigram likelihood measure.  Due to the computational complexity
of this task a sub-optimal greedy algorithm is used, but multiple
iterations of this algorithm may be performed in order to further
refine the class map, although at some point a local maximum will be
reached where the class map will not change further.\footnote{On a
65,000 word vocabulary test set with 170 million words of training text
this was found to occur after around 45 iterations} In practice as few
as two iterations may be perfectly adequate, even with large training
data sets.

The algorithm works by considering each word in the vocabulary in turn
and calculating the change in bigram training text likelihood if the
word were moved from its default class (see below) to each other class
in turn.
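The bigram likelihood in question is that of a standard class bigram model, in which each word probability factorises as (notation introduced here for illustration only):

```latex
\[
  P(w_i \mid w_{i-1}) \;=\;
    P\bigl(w_i \mid c(w_i)\bigr)\,
    P\bigl(c(w_i) \mid c(w_{i-1})\bigr)
\]
```

where $c(w)$ denotes the class assigned to word $w$; each candidate move is scored by the change it induces in the total training text log likelihood $\sum_i \log P(w_i \mid w_{i-1})$.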
The word is then moved to the class which increases the likelihood the
most, or it is left in its current class if no such increase is found.
Each iteration of the algorithm considers each word exactly once.
Because this can be a slow process, with typical execution times
measured in terms of a few hours, not a few minutes, the
\htool{Cluster} tool also allows \textit{recovery} files to be written
at regular intervals.  These contain the current class map part-way
through an iteration, along with associated files detailing at what
point in the iteration the class map was exported.  These files are not
essential for operation, but might be desirable if there is a risk of a
long-running process being killed by some external influence.  During
the execution of an iteration the tool claims no new
memory,\footnote{other than a few small local variables taken from the
stack as functions are called} so it cannot crash in the middle of an
iteration due to a lack of memory (it can, however, fail to start an
iteration in the first place).

Before beginning an iteration, \htool{Cluster} places each word either
into a default class or into one specified via the \texttt{-l} (import
classmap) or \texttt{-x} (use recovery) options.  The default
distribution, given $m$ classes, is to place the most frequent $(m-1)$
words into singleton classes and then the remainder into the remaining
class.  \htool{Cluster} allows words to be considered either in
decreasing order of frequency of occurrence, or in the order they are
encountered in the word map.
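The default initialisation and one exchange pass can be sketched as follows.  This is an illustrative Python reimplementation, not HTK code: it naively recomputes the full likelihood for every candidate move and omits the special sentence start, end and unknown tokens, whereas a practical implementation would update the likelihood incrementally.

```python
import math
from collections import Counter

def class_bigram_ll(bigrams, cmap):
    """Class bigram log likelihood, up to terms constant under word
    moves: sum_{g,h} N(g,h) log N(g,h), minus the analogous sums over
    the left- and right-position class unigram counts."""
    cc, cl, cr = Counter(), Counter(), Counter()
    for (v, w), n in bigrams.items():
        cc[(cmap[v], cmap[w])] += n
        cl[cmap[v]] += n
        cr[cmap[w]] += n
    f = lambda c: sum(n * math.log(n) for n in c.values())
    return f(cc) - f(cl) - f(cr)

def default_classmap(unigrams, m):
    """Most frequent m-1 words get singleton classes; the rest share
    the one remaining class (the default distribution)."""
    ranked = [w for w, _ in Counter(unigrams).most_common()]
    return {w: min(i, m - 1) for i, w in enumerate(ranked)}

def exchange_iteration(bigrams, cmap, m):
    """One pass: move each word to the class giving the largest
    likelihood increase, or leave it where it is."""
    for word in list(cmap):
        best_cls, best_ll = cmap[word], class_bigram_ll(bigrams, cmap)
        for cls in range(m):
            cmap[word] = cls
            ll = class_bigram_ll(bigrams, cmap)
            if ll > best_ll:
                best_cls, best_ll = cls, ll
        cmap[word] = best_cls
    return cmap
```

Because moves are only accepted when the likelihood strictly increases, repeated passes converge to the local maximum described above.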
The popular choice is to use the former method, although in experiments
it was found that the more random second approach typically gave better
class maps after fewer iterations in practice.\footnote{Note that these
schemes are approximately similar, since the most frequent words are
most likely to be encountered sooner in the training text and thus
occur higher up in the word map} The \texttt{-w} option specifies this
choice.

During execution \htool{Cluster} will always write a logfile describing
the changes it makes to the class map, unless you explicitly disable
this using the \texttt{-n} option.  If the \texttt{-v} switch is used
then this logfile is written in explicit English, allowing you to
easily trace the execution of the clusterer; without \texttt{-v}
similar information is exported in a more compact format.

Two or three special classes are also defined.  The sentence start and
sentence end word tokens are always kept in singleton classes, and
optionally the unknown word token can be kept in a singleton class too
-- pass the \texttt{-k} option.\footnote{The author always uses this
option but has not empirically tested its efficacy} These tokens are
placed in these classes on initialisation and no moves to or from these
classes are ever considered.

Language model files are built using either the \texttt{-p} or
\texttt{-q} options, which are effectively equivalent if using the
\HTK\ language modelling tools as black boxes.  The former creates a
word-given-class probabilities file, whilst the latter stores word
counts and lets the language model code itself calculate the same
probabilities.

\mysubsect{Use}{Cluster-Use}

\htool{Cluster} is invoked by the command line
\begin{verbatim}
   Cluster [options] mapfile [mult] gramfile [[mult] gramfile ...]
\end{verbatim}
The given word map is loaded and then each of the specified gram files
is imported.  The list of input gram files can be interspersed with
multipliers.
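Concretely, the interleaving of multipliers and gram files amounts to the following sketch (hypothetical helper names, not HTK code; the exact tie-breaking used when rounding may differ):

```python
def parse_gram_args(args):
    """Pair each gram file with the multiplier currently in effect.
    A multiplier begins with '+' or '-' and stays in effect until
    redefined; the default weight is 1.0."""
    mult, plan = 1.0, []
    for arg in args:
        if arg[:1] in ("+", "-"):
            mult = float(arg)
        else:
            plan.append((arg, mult))
    return plan

def scaled_count(n, mult):
    """Gram counts are scaled by the active multiplier and rounded
    to the nearest integer before use in clustering."""
    return int(round(n * mult))
```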
These are floating-point numbers which must begin with a plus or minus
character (e.g.\ \texttt{+1.0}, \texttt{-0.5}, etc.).  The effect of a
multiplier \texttt{mult} is to scale the $n$-gram counts in the
following gram files by the factor \texttt{mult}.  The resulting scaled
counts are rounded to the nearest integer when actually used in the
clustering algorithm.  A multiplier stays in effect until it is
redefined.

The allowable options to \htool{Cluster} are as follows:
\begin{optlist}
  \ttitem{-c n} Use {\tt n} classes.  This specifies the number of
        classes that should be in the resultant class map.
  \ttitem{-i n} Perform {\tt n} iterations.  This is the number of
        iterations of the clustering algorithm that should be
        performed.  (If you are using the {\tt -x} option then
        completing the current iteration does not count towards the
        total number, so use {\tt -i 0} to complete it and then finish)
  \ttitem{-k} Keep the special unknown word token in its own singleton
        class.  If this option is not passed the token can be moved to
        or from any class.
  \ttitem{-l fn} Load the class map {\tt fn} at start-up and perform
        any further iterations from this starting point.
  \ttitem{-m} Record in the log file the running value of the maximum
        likelihood function used by the clusterer to optimise the
        training text likelihood.  This option is principally provided
        for debugging purposes.
  \ttitem{-n} Do not write any log file during execution of an
        iteration.
  \ttitem{-o fn} Specify the prefix of all output files.  All output
        class map, logfile and recovery files share the same filename
        prefix, and this is specified via the {\tt -o} switch.  The
        default is {\tt cluster}.
  \ttitem{-p fn} Write a word-given-class probabilities file.  Either
        this or the {\tt -q} switch is required to actually build a
        class-based language model.
        The \HTK\ language model library, \htool{LModel}, supports both
        probability and count-based class files.  There is no
        difference in use, although each allows different types of
        manual manipulation of the file.  Note that if you do not pass
        {\tt -p} or {\tt -q} you may run \htool{Cluster} at a later
        date using the {\tt -l} and {\tt -i 0} options to just produce
        a language model file.
  \ttitem{-q fn} Write a word-given-class counts file.  See the
        documentation for {\tt -p}.
  \ttitem{-r n} Write recovery files after {\tt n} words have been
        moved since the previous recovery file was written or an
        iteration began.  Pass {\tt -r 0} to disable writing of
        recovery files.
  \ttitem{-s tkn} Specify the sentence start token.
  \ttitem{-t tkn} Specify the sentence end token.
  \ttitem{-u tkn} Specify the unknown word token.
  \ttitem{-v} Use verbose log file format.
  \ttitem{-w [WMAP/FREQ]} Specify the order in which word moves are
        considered.  The default is {\tt WMAP}, in which words are
        considered in the order they are encountered in the word map.
        Specifying {\tt FREQ} will consider the most frequent word
        first and then the remainder in decreasing order of frequency.
  \ttitem{-x fn} Continue execution from recovery file {\tt fn}.
\end{optlist}
\stdopts{Cluster}

\mysubsect{Tracing}{Cluster-Tracing}

\htool{Cluster} supports the following trace options, where each trace
flag is given using an octal base:
\begin{optlist}
  \ttitem{00001} basic progress reporting.
  \ttitem{00002} report major file operations -- good for following
        start-up.
  \ttitem{00004} more detailed progress reporting.
  \ttitem{00010} trace memory usage during execution and at end.
\end{optlist}
Trace flags are set using the \texttt{-T} option or the \texttt{TRACE}
configuration variable.

\index{cluster@\htool{Cluster}|)}
