% hlmfiles.tex
(\texttt{SeqNo=nnn})
\item Last gram.  The last $n$-gram in the file
  (\texttt{gramN = w1 w2 w3 ...})
\item Number of distinct $n$-grams in file.  (\texttt{Entries = N})
\item Word map check.  This is an optional field containing a word and
  its id.  It can be included as a double check that the correct word
  map is being used to interpret this gram file.  The given word is
  looked up in the word map and if the corresponding id does not match,
  an error is reported.  (\texttt{WMCheck = word id})
\item Text source.  This is an optional text string describing the
  text source which was used to generate the gram file
  (\texttt{Source=...}).
\end{enumerate}
For example, a typical gram file header might be
\begin{verbatim}
   Ngram   = 3
   WMap    = US_Business_News
   Entries = 50345980
   WMCheck = XEROX 340987
   Gram1   = AN ABLE ART
   GramN   = ZEALOUS ZOO OWNERS
   Source  = WSJ Aug 94 to Dec 94
\end{verbatim}
The $n$-grams themselves begin immediately following the line
containing the keyword \verb+\Grams\+\footnote{That is, the first byte
of the binary data immediately follows the newline character}.  They
are listed in lexicographic sort order such that for the $n$-gram
$\{w_1 w_2 \ldots w_N\}$, $w_1$ varies the least rapidly and $w_N$
varies the most rapidly.  Each $n$-gram consists of a sequence of $N$
3-byte word ids followed by a single 1-byte count.  If the $n$-gram
occurred more than 255 times, then it is repeated with the counts
being interpreted to the base 256.  For example, if a gram file
contains the sequence\index{gram files!count encoding}
\begin{verbatim}
   w1 w2 ... wN c1
   w1 w2 ... wN c2
   w1 w2 ... wN c3
\end{verbatim}
corresponding to the $n$-gram $\{w_1 w_2 \ldots w_N\}$, the
corresponding count is\index{ngram!count encoding}\index{count encoding}
\[
   c_1 + c_2 \times 256 + c_3 \times 256^2
\]
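To make the record layout concrete, the following short Python sketch
decodes a run of such fixed-width records and accumulates the base-256
counts according to the formula above.  It is purely illustrative and
not part of HLM: the function name \texttt{decode\_grams} is
hypothetical and big-endian byte order for the 3-byte word ids is an
assumption made only for the example.
\begin{verbatim}
def decode_grams(data, n):
    # Each record: n 3-byte word ids followed by a 1-byte count.
    # Repeated records for the same n-gram carry successive base-256
    # digits of the total count, least significant digit first.
    rec_size = 3 * n + 1
    counts = {}   # tuple of word ids -> accumulated count
    digit = {}    # next base-256 digit position for each n-gram
    for off in range(0, len(data) - rec_size + 1, rec_size):
        rec = data[off:off + rec_size]
        ids = tuple(int.from_bytes(rec[3*i:3*i+3], "big")
                    for i in range(n))
        d = digit.get(ids, 0)
        counts[ids] = counts.get(ids, 0) + rec[-1] * (256 ** d)
        digit[ids] = d + 1
    return counts
\end{verbatim}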
When a group of gram files is used as input to a tool, they must be
organised so that the tool receives $n$-grams as a single stream in
sort order i.e.\ as far as the tool is concerned, the net effect must
be as if there is just a single gram file.  Of course,\index{gram
file!input} a sufficient approach would be to open all input gram
files in parallel and then scan them as needed to extract the required
sorted $n$-gram sequence.  However, if two $n$-gram files were
organised such that the last $n$-gram in one file was ordered before
the first $n$-gram of the second file, it would be much more efficient
to open and read the files in sequence.  Files such as these are said
to be \texttt{sequenced} and in general, \HTK\ tools are supplied with
a mix of sequenced and non-sequenced files.  To optimise input in this
general case, all \HTK\ tools which input gram files start by scanning
the header fields\index{sequenced gram files} \texttt{gram1} and
\texttt{gramN}.  This information allows a sequence table to be
constructed which determines the order in which the constituent gram
files must be opened and closed.  This sequence table is designed to
minimise the number of individual gram files which must be kept open
in parallel. \index{gram file!sequencing}This gram file sequencing is
invisible to the \HTK\ user, but it is important to be aware of it.
When a large number of gram files are accumulated to form a frequently
used database, it may be worth copying the gram files using
\htool{LGCopy}.  This will have the effect of transforming the gram
files into a fully sequenced set thus ensuring that subsequent reading
of the data is maximally efficient.

\mysect{Frequency-of-frequency (FoF) Files}{FoFs}

A FoF file contains a list of the number of times that an $n$-gram
occurred just once, twice, three times, \ldots, n times.  Its format
is similar to a word map file.  The header contains the following
information\index{frequency-of-frequency}\index{FoF files}
\begin{enumerate}
\item $n$-gram size i.e.\ 2 for bigrams, 3 for trigrams, etc.
  (\texttt{Ngram=N})
\item the number of frequencies counted (i.e.\ the number of rows in
  the FoF table) (\texttt{Entries=nnn})
\item Text source.  This is an optional text string describing the
  text source which was used to generate the gram files used to
  compute this FoF file.  (\texttt{Source=...}).
\end{enumerate}
More header fields may be defined later and the user is free to insert
others.\index{FoF file!header}

The data part starts with the keyword \verb+\FoFs\+.  Each row
contains a list of the number of unigrams, bigrams, \ldots, $n$-grams
occurring exactly \texttt{k} times, where \texttt{k} is the number of
the row of the table -- the first row shows the number of $n$-grams
occurring exactly 1 time, for example.\index{FoF file!counts}
As an example, the following is a FoF file computed from a set of
trigram gram files.
\begin{verbatim}
   Ngram   = 3
   Entries = 100
   Source  = WSJ Aug 94 to Dec 94
   \FoFs\
   1020 23458 78654
    904 19864 56089
   ...
\end{verbatim}
FoF files are generated by the tool \htool{LFoF}.  This tool will also
output a list containing an estimate of the number of $n$-grams that
will occur in a language model for a given cut-off -- set the
configuration parameter {\tt LPCALC: TRACE = 3}.\index{lFoF@\htool{LFoF}}

\mysect{Word LM file formats}{HLMlmfileformats}
\index{ARPA-MIT LM format}\index{LM file formats!binary}
\index{LM file formats!ultra}\index{LM file formats!ARPA-MIT format}
\index{files!language models}

Language models can be stored on disk in three different file formats
-- {\em text}, {\em binary} and {\em ultra}.  The text format is the
standard ARPA-MIT format used to distribute pre-computed language
models.  The binary format is a proprietary file format which is
optimised for flexibility and memory usage.  All tools will output
models in this format unless instructed otherwise.  The {\em ultra} LM
format is a further development of the binary LM format optimised for
fast loading times and a small memory footprint.  At the same time,
models stored in this format cannot be pruned further in terms of size
and vocabulary.

\mysubsect{The ARPA-MIT LM format}{HLMarpamitlm}
\index{ARPA-MIT LM format}

This format for storing $n$-gram back-off language models is defined
as follows
\begin{verbatim}
   <LM_definition> = [ { <comment> } ]
                     \data\
                     <header>
                     <body>
                     \end\
   <comment>       = { <word> }
\end{verbatim}
An ARPA-style language model file comes in two parts -- the header and
the $n$-gram definitions.  The header contains a description of the
contents of the file.
\begin{verbatim}
   <header> = { ngram <int>=<int> }
\end{verbatim}
The first {\tt <int>} gives the $n$-gram order and the second {\tt
<int>} gives the number of $n$-gram entries stored.  For example, a
trigram language model consists of three sections -- the unigram,
bigram and trigram sections respectively.  The corresponding entry in
the header indicates the number of entries for that section.  This can
be used to aid the loading-in procedure.  The body part contains all
sections of the language model and is defined as follows:
\begin{verbatim}
   <body>      = { <lmpart1> } <lmpart2>
   <lmpart1>   = \<int>-grams:
                 { <ngramdef1> }
   <lmpart2>   = \<int>-grams:
                 { <ngramdef2> }
   <ngramdef1> = <float> { <word> } <float>
   <ngramdef2> = <float> { <word> }
\end{verbatim}
Each $n$-gram definition starts with a probability value stored as
$\log_{10}$ followed by a sequence of $n$ words describing the actual
$n$-gram.  In all sections except the last one this is followed by a
back-off weight which is also stored as $\log_{10}$.  The following
example shows an extract from a trigram language model stored in the
ARPA-text format.
\begin{verbatim}
\data\
ngram 1=19979
ngram 2=4987955
ngram 3=6136155
\1-grams:
-1.6682 A       -2.2371
-5.5975 A'S     -0.2818
-2.8755 A.      -1.1409
-4.3297 A.'S    -0.5886
-5.1432 A.S     -0.4862
...
\2-grams:
-3.4627 A BABY    -0.2884
-4.8091 A BABY'S  -0.1659
-5.4763 A BACH    -0.4722
-3.6622 A BACK    -0.8814
...
\3-grams:
-4.3813 !SENT_START A CAMBRIDGE
-4.4782 !SENT_START A CAMEL
-4.0196 !SENT_START A CAMERA
-4.9004 !SENT_START A CAMP
-3.4319 !SENT_START A CAMPAIGN
...
\end\
\end{verbatim}
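The following Python sketch illustrates how a file laid out in this
way can be read: it skips everything before \verb+\data\+, notes each
\verb+\<int>-grams:+ marker, and stores every $n$-gram definition with
its $\log_{10}$ probability and optional back-off weight.  The reader
is a hypothetical illustration -- \texttt{read\_arpa} is not an HLM
routine -- and assumes a plain ARPA-text file such as the extract
above.
\begin{verbatim}
def read_arpa(path):
    # Illustrative ARPA-text reader (not the HLM loader): returns
    # {order: {tuple_of_words: (log10_prob, log10_backoff_or_None)}}.
    models, order, seen_data = {}, None, False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not seen_data:
                seen_data = (line == "\\data\\")   # skip leading comments
                continue
            if not line or line.startswith("ngram "):
                continue                           # header entry counts
            if line == "\\end\\":
                break
            if line.endswith("-grams:"):           # e.g. "\2-grams:" opens a section
                order = int(line[1:line.index("-")])
                models[order] = {}
                continue
            parts = line.split()
            logp, words = float(parts[0]), tuple(parts[1:1 + order])
            bow = float(parts[1 + order]) if len(parts) > 1 + order else None
            models[order][words] = (logp, bow)
    return models
\end{verbatim}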
\mysubsect{The modified ARPA-MIT format}{HLMmodifiedarpamitlm}

The efficient loading of the language model file requires prior
information as to memory requirements.  Such information is partially
available from the header of the file which shows how many entries
will be found in each section of the model.  From the back-off nature
of the language model it is clear that the back-off weight associated
with an $n$-gram $(w_1, w_2, \ldots, w_{n-1})$ is only useful when
$p(w_n | w_1, w_2, \ldots, w_{n-1})$ is an explicit entry in the file
or is computed via backing-off to the corresponding $(n-1)$-grams.  In
other words, the presence of a back-off weight associated with the
$n$-gram $w_1, w_2, \ldots, w_{n-1}$ can be used to indicate the
existence of explicit $n$-grams $w_1, w_2, \ldots, w_n$.  The use of
such information can greatly reduce the storage requirements of the
language model since the back-off weight requires extra storage.  For
example, considering the statistics shown in table \ref{fg_stats},
such selective memory allocation can result in dramatic savings.
\begin{table}
\center
\begin{tabular}{|l|r|r|} \hline
 Component & \# with back-off weights &     Total \\ \hline
 unigram   &                   65,467 &    65,467 \\ \hline
 bigram    &                2,074,422 & 6,224,660 \\ \hline
 trigram   &                4,485,738 & 9,745,297 \\ \hline
 fourgram  &                        0 & 9,946,193 \\ \hline
\end{tabular}
\caption{Component statistics for a 65k word fourgram language model
  with cut-offs: bigram 1, trigram 2, fourgram 2.}
\label{fg_stats}
\end{table}
This information is accommodated by modifying the syntax and semantics
of the rule
\begin{verbatim}
   <ngramdef1> = <float> { <word> } [ <float> ]
\end{verbatim}
whereby a back-off weight associated with the $n$-gram $(w_1, w_2,
\ldots, w_{n-1})$ indicates the existence of $n$-grams $(w_1, w_2,
\ldots, w_n)$.  This version will be referred to as the modified
ARPA-text format.
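The role of the back-off weights can be made concrete with the usual
back-off recursion: an explicit entry supplies $\log_{10} p$ directly,
otherwise the back-off weight of the history (taken as $0.0$ when
absent, i.e.\ a weight of 1) is added to the probability of the
$n$-gram shortened by its first word.  The sketch below assumes the
tables produced by the hypothetical \texttt{read\_arpa} reader shown
earlier and is illustrative only.
\begin{verbatim}
def log_prob(models, words):
    # words is a tuple such as ("A", "BABY", "BOOM"); returns
    # log10 P(w_n | history) by the standard back-off recursion.
    order = len(words)
    entry = models.get(order, {}).get(words)
    if entry is not None:
        return entry[0]                   # explicit n-gram: stored log prob
    if order == 1:
        raise KeyError("unigram %r not in model" % (words,))
    hist = models.get(order - 1, {}).get(words[:-1])
    bow = hist[1] if hist and hist[1] is not None else 0.0
    return bow + log_prob(models, words[1:])   # back off to shorter n-gram
\end{verbatim}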
\mysubsect{The binary LM format}{HLMbinarylmformat}
\index{LM file formats!binary}

This format is the binary version of the modified ARPA-text format.
It was designed to be a compact, self-contained format which aids the
fast loading of large language model files.  The format is similar to
the original ARPA-text format with the following modification
\begin{verbatim}
   <header> = { (ngram <int>=<int>) | (ngram <int>~<int>) }
\end{verbatim}
The first alternative in the rule describes a section stored as text,
the second one describes a section stored in binary.  The unigram
section of a language model file is always stored as text.
\begin{verbatim}
   <ngramdef> = <txtgram> | <bingram>
   <txtgram>  = <float> { <word> } [ <float> ]
   <bingram>  = <f_type> <f_size> <f_float> { <f_word> } [ <f_float> ]
\end{verbatim}
In the above definition, {\tt <f\_type>} is a 1-byte flags field, {\tt
<f\_size>} is a 1-byte unsigned number indicating the total size in
bytes of the remaining fields, {\tt <f\_float>} is a 4-byte field for
the $n$-gram probability, {\tt <f\_word>} is a numeric word id, and
the last {\tt <f\_float>} is the back-off weight.  The numeric word
identifier is an unsigned integer assigned to each word in the order
of occurrence of the words in the unigram section.  The minimum size
of this field is 2 bytes, as used with vocabulary lists of up to
65,535 words.  If this number is exceeded the field size is
automatically extended to accommodate all words.  The size of the
fields used to store the probability and back-off weight are typically
4 bytes, however this may vary on different computer architectures.
The least significant bit of the flags field indicates the
presence/absence of a back-off weight with corresponding values 1/0.
The remaining bits of the flags field are not used at present.

\mysect{Class LM file formats}{HLMclasslmfileformats}
\index{files!language models}\index{Class language models}

Class language models replace the word language model described in
section \ref{s:HLMlmfileformats} with an identical component which
models class $n$-grams instead of word $n$-grams.  They add to this a
second component which includes the deterministic word-to-class
mapping with associated word-given-class probabilities, expressed
either as counts (which are normalised to probabilities on loading) or
as explicit natural log probabilities.  These two components are then
either combined into a single file or are pointed to with a special
link file.

\mysubsect{Class counts format}{HLMclasscountslmformat}
\index{LM file formats!class counts}

The format of a word-given-class counts file, as generated using the
\texttt{-q} option from \htool{Cluster}, is as follows:\\
\texttt{Word|Class counts}\\
\textit{[blank line]}\\
\texttt{Derived from: <file>}\\
\texttt{Number of classes: <int>}\\
\texttt{Number of words: <int>}\\
\texttt{Iterations: <int>}\\
\textit{[blank line]}\\
\texttt{Word Class name Count}\\
followed by one line for each word in the model of the form:\\
\texttt{<word> CLASS<int> <int>}\\
The fields are mostly self-explanatory.  The {\tt Iterations:} header
is for information only and records how many iterations were performed
to produce the classmap contained within the file, and the {\tt
Derived from:} header is similarly for display purposes only.  Any
number of headers may be present; the header section is terminated by
finding a line beginning with the four characters making up {\tt
Word}.  The colon-terminated headers may be in any order.  {\tt
CLASS<int>} must be the name of a class in the classmap (technically,
the wordmap) used to build the class-given-class history $n$-gram
component of the language model -- the file built by \htool{LBuild}.
In the current implementation these class names are restricted to
being of the form {\tt CLASS<int>}, although a modification to the
code in \htool{LModel.c} would allow this restriction to be removed.
Each line after the header specifies the count of each word and the
class it is in, so for example\\
\texttt{THE CLASS73 1859}\\
would specify that the word {\tt THE} was in class {\tt CLASS73} and
occurred 1859 times.
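As an illustration of how such counts can be normalised to
word-given-class probabilities on loading, the sketch below parses the
body lines and converts each count to a natural-log probability within
its class.  The function name is hypothetical and the header handling
(skipping the identifier line, then reading up to the line beginning
with \texttt{Word}) is an interpretation of the description above
rather than the HLM code.
\begin{verbatim}
import math
from collections import defaultdict

def load_word_given_class(path):
    # Illustrative parser: read a word-given-class counts file and
    # normalise the counts within each class to natural-log probabilities.
    counts = {}                                  # word -> (class name, count)
    with open(path) as f:
        f.readline()                             # "Word|Class counts" identifier
        for line in f:                           # skip remaining headers
            if line.startswith("Word"):          # column-title line ends the header
                break
        for line in f:
            parts = line.split()
            if len(parts) == 3:                  # <word> CLASS<int> <int>
                counts[parts[0]] = (parts[1], int(parts[2]))
    totals = defaultdict(int)
    for cls, cnt in counts.values():
        totals[cls] += cnt
    return {w: (cls, math.log(cnt / totals[cls]))
            for w, (cls, cnt) in counts.items()}
\end{verbatim}
With the example line shown above, {\tt THE} would map to {\tt CLASS73}
with probability 1859 divided by the total count of {\tt CLASS73}.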