%
% !HVER!hlmfiles.tex [SJY 05/04/97]
%
% Updated (and corrected and significantly extended) GLM 25/01/02 - 19/02/02
%
\mychap{Language Modelling Reference}{hlmfiles}

\sidepic{LTool}{120}{
As noted in the introduction, building a language model with the \HTK\
tools is a two stage process. In the first stage, the $n$-gram data is
accumulated and in the second stage, language models are estimated
from this data. The $n$-gram data consists of a set of gram files and
their associated word map. Each gram file contains a list of $n$-grams
and their counts. Each $n$-gram is represented by a sequence of $n$
integer word indices and the word map relates each of these indices to
the actual orthographic word form. As a special case, a word map
containing just words and no indices acts as a simple \textit{word
list}. }

In many cases, a class map is also required. Class maps give a name
to a subset of words and associate an index with that name. Once
defined, a class name can be used like a regular word and, in
particular, it can be listed in a word map. In their most general
form, class maps are used to build class-based language models.
However, they are also needed for most word-based language models
because they allow a class to be defined for \textit{unknown} words.

This chapter includes descriptions of three basic types of data file: gram
file, word map file, and class map file. As shown in the figure, each
file type is supported by a specific \HTK\ module which provides the
required input and output and the other related operations.

Also described in this chapter is a fourth type of data file called a
\textit{frequency-of-frequency} or \textit{FoF} file. A FoF file
contains a count of the number of times an $n$-gram occurred just once,
twice, three times, etc.
It is used with Turing-Good discounting
(although \htool{LBuild} can generate FoF counts on the fly) but it
can also be used to estimate the number of $n$-grams which will be
included in the final language models.

The various language model formats, for both word and class-based
models, are also described in detail.

Trace levels for each of the language modelling library
modules are also described -- see the tool documentation for details
of tool level tracing options. Finally, run-time and compile-time
settings are documented.

\mysect{Words and Classes}{wsandcs}

In the \HTK\ language modelling tools, words and classes are
represented internally by integer indices in the range 0 to $2^{24} -
1$ (16777215). This range is chosen to fit exactly into three 8-bit bytes,
thereby allowing efficient compaction of large lists of $n$-gram counts
within gram files.

These integer indices will be referred to subsequently as
\textit{ids}. Class ids are limited to the range 0 to $2^{16}-1$ and
word ids fill the remaining range of $2^{16}$ to $2^{24} - 1$. Thus,
any id with a zero most significant byte is a \textit{class id} and
all other ids are
\textit{word ids}.\index{class id}\index{word id}

In the context of word maps, the term \textit{word} may refer to
either an orthographic word or the name of a class. Thus, in its most
general form, a word map can contain the ids of both orthographic
words from a source text and class names defined in one or more class
maps.

The mapping of orthographic words to ids is relatively permanent and
normally takes place when building gram files (using \htool{LGPrep}).
Each time a new word is encountered, it is allocated a unique id.
Once allocated, a word id should never be changed. Class ids, on the
other hand, are more dynamic since their definition depends on the
language model being built. Finally, composite word maps can be
derived from a collection of word and class maps using the tool
\htool{LSubset}.
These derived word maps are typically used to define a
working subset of the name space and this subset can contain both word
and class ids.

\mysect{Data File Headers}{dfhdrs}

All the data files have headers containing information about the
file and the associated environment. The header is of variable size,
being terminated by a data symbol (e.g.\ \verb+\Words\+, \verb+\Grams\+,
\verb+\FoFs\+, etc.) followed by the start of the actual data.

Each header field is written on a separate line in the
form\index{headers}
\begin{verbatim}
    <Field> = <value>
\end{verbatim}
where \texttt{<Field>} is the name of the field and \texttt{<value>}
is its value. The field name is case insensitive and zero or more
spaces can surround the \texttt{=} sign. The \texttt{<value>} starts
with the first printing character and ends at the last printing
character on the line. \HTK\ style escaping is never used in \HLM\
headers.

Fields may be given in any order. Field names which are unrecognised
by \HTK\ are ignored. Further field names may be introduced in
future, but these are guaranteed not to start with the letter ``U''.
(NB. The above format rules do not apply to the files described in section~\ref{s:HLMclasslmfileformats} -- see that section for more details.)

\mysect{Word Map Files}{wmaps}

A word map file is a text file consisting of a header and a list of
word entries. The header\index{word map!header} contains the
following
\begin{enumerate}
\item a name consisting of any printable character string (\texttt{Name=sss}).
\item the number of word entries (\texttt{Entries=nnn})
\item a sequence number (\texttt{SeqNo=nnn})
\item whether or not word ids (\texttt{ID}) and word frequency counts (\texttt{WFC}) are included (\texttt{Fields=ID} or \texttt{Fields=ID,WFC}). When the \texttt{Fields} field is missing, the word map contains only word names and it degenerates to the special case of a word list.
\item escaping mode (\texttt{EscMode=HTK} or \texttt{EscMode=RAW}).
The default is \texttt{HTK}.
\item the language (\texttt{Language=xxx})
\item word map source, a text string used with derived word maps to describe the source from which the subset was derived. (\texttt{Source=...}).
\end{enumerate}
The first two of these fields must always be included, and for word
maps, the \texttt{Fields} field must also be included. The remaining
fields are optional. More header fields may be defined later and the
user is free to insert others.

The word entries begin with the keyword \verb+\Words\+. Each word is
on a separate line with the format\index{word map!entries}
\begin{verbatim}
    word [id [count]]
\end{verbatim}
where the id and count are optional. Proper word maps always have an
\texttt{id}. When the \texttt{count} is included, it denotes the
number of times that the word has been encountered during the
processing of text data.

For example, a typical word map file might be
\begin{verbatim}
    Name=US_Business_News
    SeqNo=13
    Entries=133986
    Fields=ID,WFC
    Language=American
    EscMode=RAW
    \Words\
    <s>     65536 34850
    CAN'T   65537 2087
    THE     65538 12004
    DOLLAR  65539 169
    IS      65540 4593
    ....
\end{verbatim}
In this example,\index{word map!example of} the word map is called
``US\_Business\_News'' and it has been updated 13 times since it was
originally created. It contains a total of 133986 entries and word
frequency counts are included. The language is ``American'' and there
is no escaping used (e.g.\ can't is written \verb+CAN'T+ rather than
the standard \HTK\ escaped form of \verb+CAN\'T+).

As noted above, when the \texttt{Fields} field is missing, the word
map contains only the words and serves the purpose of a simple word
list. For example, a typical word list might be defined as follows
\begin{verbatim}
    Name=US_Business_News
    Entries=10000
    \Words\
    A
    ABLE
    ABOUT
    ...
    ZOO
\end{verbatim}
Word lists are used to define subsets of the words in a word map.
Whenever a tool requires a word list, a simple list of words
can be input instead of the above.
For example, the previous list
could be input as
\begin{verbatim}
    A
    ABLE
    ABOUT
    ...
    ZOO
\end{verbatim}
In this case, the default is to assume that all input words are
escaped. If raw mode input is required, the configuration variable
\texttt{INWMAPRAW} should be set true (see
section~\ref{s:htkstrings}).\index{vocabulary list}

As explained in section~\ref{s:htkstrings}, by default \HTK\ tools
output word maps in HTK escaped form. However, this can be overridden
by setting the \htool{LWMap} configuration variable
\texttt{OUTWMAPRAW} to true.

\mysect{Class Map Files}{cmaps}

A class map file defines one or more word classes. It has a header
similar to that of a word map file, containing values for
\texttt{Name}, \texttt{Entries}, \texttt{EscMode} and \texttt{Language}.
In this case, the number of entries refers to the number of classes
defined.\index{class map!header}

The class definitions are introduced by the keyword \verb+\Classes\+.
Each class definition has a single line sub-header consisting of a
name, an id number, the number of class members (or non-members) and a
keyword which must be \texttt{IN} or \texttt{NOTIN}. In the latter
case, the class consists of all words \textit{except} those listed
i.e.\ the class is defined by its complement.\index{class
map!complements}

The following is a simple example of a class map file.
\begin{verbatim}
    Name=Simple_Classes
    Entries=97
    EscMode=HTK
    Language=British
    \Classes\
    ARTICLES 1 3 IN
       A
       AN
       THE
    COLOURS 2 4 IN
       RED
       BLUE
       GREEN
       YELLOW
    SHAPES 3 6 IN
       SQUARE
       CIRCLE
       ...
    etc
\end{verbatim}
This class map file defines 97 distinct classes, the first of which is
a class called \texttt{ARTICLES} (id=1) with 3 members: (a, an, the).

For simple word-based language models, the class map file is used to
define the class of unknown words. This is usually just the
complement of the vocabulary list.
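Read in set terms, the two keywords can be sketched as follows, where
$V$ (the set of all words in the word map) and $L$ (the set of words
listed under a class sub-header) are symbols introduced here purely
for illustration and are not \HTK\ notation; the sketch assumes that
``all words'' means all words in the word map:
\[
  C_{\mathtt{IN}} = L, \qquad
  C_{\mathtt{NOTIN}} = V \setminus L .
\]
On this reading, taking $L$ to be the vocabulary list makes
$C_{\mathtt{NOTIN}}$ the class of unknown words.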
For example, a typical class map
file defining the \textit{unknown} class \texttt{!!UNKID} might
be\index{unknown class}\index{class map!defining unknown}
\begin{verbatim}
    Name=Vocab_65k_V2.3
    Entries=1
    Language=American
    EscMode=NONE
    \Classes\
    !!UNKID 1 65426 NOTIN
       A
       ABATE
       ABLE
       ABORT
       ABOUND
       ...
\end{verbatim}
Since this case is so common, the tools also allow a plain
vocabulary list to be supplied in place of a proper class map file.
For example, supplying a class map file containing just
\begin{verbatim}
    A
    ABATE
    ABLE
    ABORT
    ABOUND
    ...
\end{verbatim}
would have an equivalent effect to the previous example provided that
the \htool{LCMap} configuration variables \texttt{UNKNOWNID} and
\texttt{UNKNOWNNAME} have been set in order to define the id and name
to be used for the unknown class. In the example given, including the
following two lines in the configuration file would have the desired
effect
\begin{verbatim}
    LCMAP: UNKNOWNID = 1
    LCMAP: UNKNOWNNAME = !!UNKID
\end{verbatim}
Notice the similarity with the special case of word lists
described in section~\ref{s:wmaps}. A plain word list can therefore
be used to define both a vocabulary subset and the unknown class. In
a conventional language model, these are, of course, the same thing.

In a similar fashion to word maps, the input of a headerless class map
can be set to raw mode by setting the \htool{LCMap} configuration
variable \texttt{INCMAPRAW} to true, and all class maps can be output in raw
mode\index{class map!as vocabulary list} by setting the configuration
variable \texttt{OUTCMAPRAW} to true.\index{vocabulary list}

\mysect{Gram Files}{gramfs}

Statistical language models are estimated by counting the number of
events in a sample source text. These event counts are stored in
\textit{gram} files. Provided that they share a common word map, gram
files can be grouped together in arbitrary ways to form the raw data
pool from which a language model can be constructed.
For example, a
text source containing 100 million words could be processed and stored as two
gram files. A few months later, a third gram file could be generated
from a newly acquired text source. This new gram file could then be
added to the original two files to build a new language model. The
original source text is not needed and the gram files need not be
changed.\index{gram files!format}

A gram file consists of a header\index{ngram!files} followed by a
sorted list of $n$-gram counts.\index{gram files!header} The header
contains the following items, each written on a separate line
\begin{enumerate}
\item $n$-gram size, i.e.\ 2 for bigrams, 3 for trigrams, etc. (\texttt{Ngram=N})
\item Word map. Name of word map to be used with this gram file. (\texttt{WMap=wmapname})
\item First gram. The first $n$-gram in the file (\texttt{gram1 = w1 w2 w3 ...})
\item Sequence number. If given then the actual word map must have a sequence number which is greater than or equal to this.