\mysubsect{The class probabilities format}{HLMclassprobslmformat}
\index{LM file formats!class probabilities}

The format of a word-given-class probabilities file, as generated using the
\texttt{-p} option from \htool{Cluster}, is very similar to that of
the counts file described in the previous sub-section, and is as follows:\\
\texttt{Word|Class probabilities}\\
\textit{[blank line]}\\
\texttt{Derived from: <file>}\\
\texttt{Number of classes: <int>}\\
\texttt{Number of words: <int>}\\
\texttt{Iterations: <int>}\\
\textit{[blank line]}\\
\texttt{Word    Class name   Probability (log)}\\
followed by one line for each word in the model of the form:\\
\texttt{<word> CLASS<int> <float>}\\
As in the previous section, the fields are mostly self-explanatory.
The {\tt Iterations:} header is for information only and records how
many iterations were performed to produce the classmap contained
within the file, and the {\tt Derived from:} header is likewise
for display purposes only.  Any number of headers may be present; the
header section is terminated by finding a line beginning with the four
characters making up {\tt Word}.  The colon-terminated headers may be
in any order.

{\tt CLASS<int>} must be the name of a
class in the classmap (technically, the wordmap) used to build
the class-given-class history $n$-gram component of the language model
-- the file built by \htool{LBuild}.  In the current implementation
these class names are restricted to being of the form {\tt
CLASS<int>}, although a modification to the code in \htool{LModel.c}
would allow this restriction to be removed.
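
As an illustration of the layout described above, a very small file
might look like the following.  The file name, the words, the class
numbers, the values and the exact column spacing are all hypothetical
and are shown only to make the structure concrete.
\begin{verbatim}
  Word|Class probabilities

  Derived from: example.cm
  Number of classes: 2
  Number of words: 3
  Iterations: 2

  Word    Class name   Probability (log)
  the     CLASS1       -0.6931
  a       CLASS1       -0.6931
  cat     CLASS2       0.0000
\end{verbatim}
In this sketch the two words in {\tt CLASS1} each have probability
$0.5$ given the class ($\ln 0.5 \approx -0.6931$), and the single word
in {\tt CLASS2} has probability $1$ ($\ln 1 = 0$).
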
Each {\tt <float>} specifies the natural logarithm of the probability of the word given
the class, or -99.9900 if the probability of the word is less than
$1.0\times10^{-20}$.

\mysubsect{The class LM three file format}{HLMclasslmformat}
\index{LM file formats!class}

A special class language model file, generated by \htool{LLink},
links together either the word-given-class probability or count files
described above (either can be used to give the same results) with a
class-given-class history $n$-gram file constructed using
\htool{LBuild}.  It is a simple text file which specifies the filenames
of the two relevant components:

\noindent
\texttt{Class-based LM}\\
\texttt{Word|Class counts: <file>} $\;$or$\;$ \texttt{Word|Class probabilities: <file>}\\
\texttt{Class|Class grams: <file>}\\
The second line must state {\tt counts} or {\tt probabilities} as
appropriate for the relevant file.

\mysubsect{The class LM single file format}{HLMclassSinglmformat}
\index{LM file formats!class}

An alternative to the class language model file described in section
\ref{s:HLMclasslmformat} is the composite single-file class language
model file, produced by {\htool{LLink} \tt -s} -- this does {\it not}
require the two component files to be present since it integrates them
into a single file.  The format of this resulting file is as follows:
\begin{verbatim}
  CLASS MODEL
  Word|Class <string: counts/probs>
  Derived from: <file>
  Number of classes: <int>
  Number of words: <int>
  Iterations: <int>
  Class n-gram counts follow; word|class component is at end of file.
\end{verbatim}
The second line must state either {\tt counts} or {\tt probabilities} as
appropriate for the relevant component file used when constructing
this composite file.  The fields are mostly self-explanatory.  The
{\tt Iterations:} header is for information only and records how many
iterations had been performed to produce the classmap contained within
the file, and the {\tt Derived from:} header is similarly also for
display purposes only.
Any number of headers may be present; the
header section is terminated by finding a line beginning with the five
characters making up {\tt Class}.  The colon-terminated headers may be
in any order.

The class-given-classes $n$-gram component of the model then follows
immediately in any of the formats supported by word $n$-gram language
models -- i.e.\ those described in section \ref{s:HLMlmfileformats}.  No
blank lines are expected between the header shown above and the
included model, although they may be supported by the embedded model.

Immediately following the class-given-classes $n$-gram component
follows the body of the word-given-class probabilities or counts file
as described in sections \ref{s:HLMclasscountslmformat} and
\ref{s:HLMclassprobslmformat} above.  That is, the remainder of the
file consists of lines of the form:
\begin{verbatim}
  <word> CLASS<int> <float/int>
\end{verbatim}
One line is expected for each word as specified in the header at the
top of the file.  Integer word counts should be provided in the final
field for each word in the case of a counts file, or word-given-class
probabilities in the case of a probabilities file -- as specified by the second
line of the overall file.  In the latter case each {\tt <float>}
specifies the natural logarithm of the probability of the word given
the class, or -99.9900 if the probability of the word is less than
$1.0\times10^{-20}$.

{\tt CLASS<int>} must be the name of a
class in the classmap (technically actually the wordmap) used to build
the class-given-class history $n$-gram component of the language model
-- the file built by \htool{LBuild}.  In the current implementation
these class names are restricted to being of the form {\tt
CLASS<int>}, although a modification to the code in \htool{LModel.c}
would allow this restriction to be removed.

\mysect{Language modelling tracing}{HLMlibtracing}
\index{tracing}

Each of the \HTK\ language modelling tools provides its own trace
facilities, as documented with the relevant tool in chapter
\ref{c:toolref}.
The standard libraries also provide their own trace
settings, which can be set in a passed configuration file.  Each of
the supported trace levels is documented below with the octal value
necessary to enable it.

\mysubsect{LCMap}{LCMapTrace}
\begin{itemize}
\item 0001       Top level tracing
\item 0002       Class map loading
\end{itemize}

\mysubsect{LGBase}{LGBaseTrace}
\begin{itemize}
\item 0001       Top level tracing
\item 0002       Trace $n$-gram squashing
\item 0004       Trace $n$-gram buffer sorting
\item 0010       Display $n$-gram input set tree
\item 0020       Display maximum parallel input streams
\item 0040       Trace parallel input streaming
\item 0100       Display information on FoF input/output
\end{itemize}

\mysubsect{LModel}{LModelTrace}
\begin{itemize}
\item 0001       Top level tracing
\item 0002       Trace loading of language models
\item 0004       Trace saving of language models
\item 0010       Trace word mappings
\item 0020       Trace $n$-gram lookup
\end{itemize}

\mysubsect{LPCalc}{LPCalcTrace}
\begin{itemize}
\item 0001       Top level tracing
\item 0002       FoF table tracing
\end{itemize}

\mysubsect{LPMerge}{LPMergeTrace}
\begin{itemize}
\item 0001       Top level tracing
\end{itemize}

\mysubsect{LUtil}{LUtilTrace}
\begin{itemize}
\item 0001       Top level tracing
\item 0002       Show header processing
\item 0004       Hash table tracing
\end{itemize}

\mysubsect{LWMap}{LWMapTrace}
\begin{itemize}
\item 0001       Top level tracing
\item 0002       Trace word map loading
\item 0004       Trace word map sorting
\end{itemize}

\mysect{Run-time configuration parameters}{HLMconfigParms}
\index{configuration parameters!operating environment}

Section \ref{s:openvsum} lists the major standard \HTK\ configuration
parameter options, the rest of chapter \ref{c:openviron}
describes the general \HTK\ environment and how to set those
configuration parameters, and chapter \ref{c:confvars} provides a
comprehensive list.
For ease of reference those parameters
specifically relevant to the language modelling tools are reproduced
in table \ref{t:openvcparmsLM}.

\begin{center}
\begin{tabular}{|p{1.4cm}|p{3.0cm}|p{6.4cm}|} \hline
Module & Name  & Description  \\ \hline
\htool{HShell} & \texttt{ABORTONERR}      & Core dump on error (for debugging) \\
\htool{HShell} & \texttt{HLANGMODFILTER}  & Filter for language model file input\\
\htool{HShell} & \texttt{HLABELFILTER}    & Filter for Label file input\\
\htool{HShell} & \texttt{HDICTFILTER}     & Filter for Dictionary file input \\
\htool{HShell} & \texttt{LGRAMFILTER}     & Filter for gram file input\\
\htool{HShell} & \texttt{LWMAPFILTER}     & Filter for word map file input\\
\htool{HShell} & \texttt{LCMAPFILTER}     & Filter for class map file input\\
\htool{HShell} & \texttt{HLANGMODOFILTER} & Filter for language model file output\\
\htool{HShell} & \texttt{HLABELOFILTER}   & Filter for Label file output\\
\htool{HShell} & \texttt{HDICTOFILTER}    & Filter for Dictionary file output \\
\htool{HShell} & \texttt{LGRAMOFILTER}    & Filter for gram file output\\
\htool{HShell} & \texttt{LWMAPOFILTER}    & Filter for word map file output\\
\htool{HShell} & \texttt{LCMAPOFILTER}    & Filter for class map file output\\
\htool{HShell} & \texttt{MAXTRYOPEN}      & Number of file open retries \\
\htool{HShell} & \texttt{NONUMESCAPES}    & Prevent string output using \verb+\012+ format \\
\htool{HShell} & \texttt{NATURALREADORDER}  & Enable natural read order for HTK binary files \\
\htool{HShell} & \texttt{NATURALWRITEORDER} & Enable natural write order for HTK binary files \\
\htool{HMem} & \texttt{PROTECTSTAKS}      & Warn if stack is cut-back (debugging) \\
 & \texttt{TRACE}             & Trace control (default=0) \\
 & \texttt{STARTWORD}         & Set sentence start symbol ({\tt <s>}) \\
 & \texttt{ENDWORD}           & Set sentence end symbol   ({\tt </s>}) \\
 & \texttt{UNKNOWNNAME}       & Set OOV class symbol      ({\tt !!UNK}) \\
 & \texttt{RAWMITFORMAT}      & Disable \HTK\ escaping for LM tools\\
\htool{LWMap}  & \texttt{INWMAPRAW}  & Disable \HTK\ escaping for input word lists and maps \\
\htool{LWMap}  & \texttt{OUTWMAPRAW} & Disable \HTK\ escaping for output word lists and maps \\
\htool{LCMap}  & \texttt{INCMAPRAW}  & Disable \HTK\ escaping for input class lists and maps \\
\htool{LCMap}  & \texttt{OUTCMAPRAW} & Disable \HTK\ escaping for output class lists and maps \\
\htool{LCMap}  & \texttt{UNKNOWNID}  & Set unknown symbol class ID (1)\\
\htool{LCMap}  & \texttt{USEINTID}   & Use 4 byte ID fields to save binary models (see section \ref{s:HLMuseintid})\\
\htool{LPCalc} & \texttt{UNIFLOOR}   & Unigram floor count (1)\\
\htool{LPCalc} & \texttt{KRANGE}     & Good-Turing discounting range (7)\\
\htool{LPCalc} & \texttt{\textit{n}G\_CUTOFF} & \textit{n}-gram cutoff
     (e.g.\ \texttt{2G\_CUTOFF}) (1)\\
\htool{LPCalc} & \texttt{DCTYPE}     & Discounting type
     (\texttt{TG} for Good-Turing or \texttt{ABS} for Absolute)
%     \texttt{LIN} for Linear) - not fully implemented (!)
     (\texttt{TG})\\
\htool{LGBase} & \texttt{CHECKORDER} & Check N-gram ordering in files \\
\hline
\end{tabular}
\tabcap{openvcparmsLM}{Configuration Parameters used in Operating Environment}
\end{center}

\mysubsect{USEINTID}{HLMuseintid}
\index{configuration parameters!\texttt{USEINTID}}

Setting this to {\tt T} as opposed to its default of {\tt F} forces the
\htool{LModel} library to save language models using an unsigned int for each
word ID as opposed to the default of an unsigned short.  In most
systems these lengths correspond to 4-byte and 2-byte fields
respectively.
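
As a sketch, a configuration file that forces the larger ID fields
could contain the single line below.  The file itself is hypothetical;
it is assumed to be passed to the relevant tool in the usual way for
\HTK\ configuration files, as described in chapter \ref{c:openviron}.
\begin{verbatim}
  USEINTID = T
\end{verbatim}
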
Note that if you do not set this, \htool{LModel}
will automatically choose an int field size if the short field is too
small -- the exception to this is if you have compiled with {\tt
LM\_ID\_SHORT}, which limits the field size to an unsigned short, in
which case the tool will be forced to abort; see section
\ref{s:HLMlmidshort} below.

\mysect{Compile-time configuration parameters}{HLMctconfigParms}
\index{configuration parameters!compile-time}

There are some compile-time switches which may be set when building
the language modelling library and tools.

\mysubsect{LM\_ID\_SHORT}{HLMlmidshort}
\index{compile-time parameters!\texttt{LM\_ID\_SHORT}}

When compiling the \HTK\ language modelling library, setting {\tt
LM\_ID\_SHORT} (for example by passing {\tt -D LM\_ID\_SHORT} to the C
compiler) forces the compiler to use an unsigned short for each
language model ID it stores, as opposed to the default of an unsigned
int -- in most systems this will result in either a 2-byte integer or
a 4-byte integer respectively.  If you set this then you {\it must}
ensure you also set {\tt LM\_ID\_SHORT} when compiling the \HTK\
language modelling tools too, otherwise you will encounter a mismatch
leading to strange results!  (Your compiler may warn of this error,
however).  For this reason it is safest to set {\tt LM\_ID\_SHORT} via a
{\tt \#define} in {\tt LModel.h}.  You might want to set this if you
know how many distinct word ids you require and you do not want to
waste memory, although on some systems using shorts can actually be
slower than using a full-size int.

Note that the run-time {\tt USEINTID} parameter described in section
\ref{s:HLMuseintid} above only affects the size of ID fields when
saving a binary model from {\tt LModel}, so is independent of {\tt
LM\_ID\_SHORT}.
The only restriction is that you cannot load or save a
model with more ids than can fit into an unsigned short when {\tt
LM\_ID\_SHORT} is set -- the tools will abort with an error should you
try this.

\mysubsect{LM\_COMPACT}{HLMcompact}
\index{compile-time parameters!\texttt{LM\_COMPACT}}

If {\tt LM\_COMPACT} is defined at compile time then, when a language model
is loaded, its probabilities are compressed into an unsigned short as opposed
to being loaded into a float.  The exact size of these types
depends on your processor architecture, but in general an unsigned
short is about half the size of a float.  Using the compact
storage type therefore significantly reduces the accuracy with which
probabilities are stored.

The side effect of setting this is therefore reduced accuracy when
running a language model, such as when using \htool{LPlex}; or a loss
of accuracy when rebuilding from an existing language model using
\htool{LMerge}, \htool{LAdapt}, \htool{LBuild} or \htool{HLMCopy}.

\mysubsect{LMPROB\_SHORT}{HLMprobshort}
\index{compile-time parameters!\texttt{LMPROB\_SHORT}}

Setting {\tt LMPROB\_SHORT} causes language model probabilities to be
stored and loaded using a short type.  Unlike {\tt LM\_COMPACT}, this
option certainly {\it does} affect the writing of language model
files.  If you save a file using this format then you must
reload it in the same way to obtain sensible results.

\mysubsect{INTERPOLATE\_MAX}{HLMinterpmax}
\index{compile-time parameters!\texttt{INTERPOLATE\_MAX}}

If the library and tools are compiled with {\tt INTERPOLATE\_MAX} then
language model interpolation in \htool{LPlex} and the \htool{LPMerge}
library (which is used by \htool{LAdapt} and \htool{LMerge}) will
ignore the individual model weights and always pick the highest
probability from each of the models at any given point.
Note that
this option will \textit{not} normalise the models.

\mysubsect{SANITY}{HLMsanity}
\index{compile-time parameters!\texttt{SANITY}}

Turning on {\tt SANITY} when compiling the library will add a word map
check to \htool{LGBase} and some sanity checks to \htool{LPCalc}.

\mysubsect{INTEGRITY\_CHECK}{HLMintegcheck}
\index{compile-time parameters!\texttt{INTEGRITY\_CHECK}}

Compiling with {\tt INTEGRITY\_CHECK} will add run-time integrity
checking to the \htool{Cluster} tool.  Specifically it will check that
the class counts have not become corrupted and that all maximum
likelihood move updates have been correctly calculated.  You should
not need to enable this unless you suspect a major tool problem, and
doing so will slow down the tool execution.  It could prove useful if
you wanted to adapt the way the clustering works, however.
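
Each of the switches in this section is an ordinary C preprocessor
symbol, so it can be enabled in the same way as described for {\tt
LM\_ID\_SHORT} above.  As a minimal sketch, assuming the compiler is
invoked as {\tt cc} and that \htool{Cluster} is built from a source
file named {\tt Cluster.c} (both are assumptions, and the include
paths and other options a real build needs are omitted):
\begin{verbatim}
  cc -D INTEGRITY_CHECK -c Cluster.c
\end{verbatim}
In practice the flag would more likely be added to the compiler
options used by the existing build files, so that the library and all
tools are compiled consistently.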