%
% HLMBook - Steve Young 08/01/97
%
% Updated - Gareth Moore 15/01/02
%
\newpage
\mysect{LGPrep}{LGPrep}

\mysubsect{Function}{LGPrep-Function}

\index{lgprep@\htool{LGPrep}|(}
The function of this tool is to scan a language model training text
and generate a set of gram files holding the $n$-grams seen in the text
along with their counts. By default, the output gram files are named
\texttt{gram.0}, \texttt{gram.1}, \texttt{gram.2}, etc. However, the
root name can be changed using the \texttt{-r} option and the start
index can be set using the \texttt{-i} option.

Each output gram file is sorted but the files themselves will not be
sequenced (see section~\ref{s:gramfs}). Thus, when using
\htool{LGPrep} with substantial training texts, it is good practice to
subsequently copy the complete set of output gram files using
\htool{LGCopy} to reorder them into sequence. This process will also
remove duplicate occurrences making the resultant files more compact
and faster to read by the \HLM\ processing tools.

Since \htool{LGPrep} will often encounter new words in its input, it
is necessary to update the word map. The normal operation therefore
is that \htool{LGPrep} begins by reading in a word map containing all
the word ids required to decode all previously generated gram files.
This word map is then updated to include all the new words seen in the
current input text. On completion, the updated word map is output to
a file of the same name as the input word map in the directory used to
store the new gram files. Alternatively, it can be output to a
specified file using the \texttt{-w} option. The sequence number in
the header of the newly created word map will be one greater than that
of the original.

\htool{LGPrep} can also apply a set of ``match and replace'' edit
rules to the input text stream. The purpose of this facility is not
to replace input text conditioning filters but to make simple changes
to the text after the main gram files have been generated.
The editing works by passing the text through a window one word at a
time. The edit rules consist of a pattern and a replacement text. At
each step, the pattern of each rule is matched against the window and
if a match occurs, then the matched word sequence is replaced by the
string in the replacement part of the rule. Two sets of gram files are
generated by this process. A ``negative'' set of gram files contains
$n$-grams corresponding to just the text strings which were modified
and a ``positive'' set of gram files contains $n$-grams corresponding
to the modified text. All text for which no rules matched is ignored
and generates no gram file output. Once the positive and negative gram
files have been generated, the positive grams are added (i.e.\ input
with a weight of $+1$) to the original set of gram files and the
negative grams are subtracted (i.e.\ input with a weight of $-1$). The
net result is that the tool reading the full set of gram files
receives a stream of $n$-grams which will be identical to the stream
that it would have received if the editing commands had been applied
to the text source when the original main gram file set had been
generated.

The edit rules are stored in a file and read in using the \texttt{-f}
option. They consist of set definitions and rule definitions, each
written on a separate line. Each set defines a set of words and is
identified by an integer in the range 0 to 255
\begin{verbatim}
    <set-def> = '#'<number> <word1> <word2> ... <wordN>.
\end{verbatim}
For example,
\begin{verbatim}
    #4 red green blue
\end{verbatim}
defines set number 4 as being the 3 words ``red'', ``green'' and
``blue''.
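
To make the weighting of positive and negative gram files described
above concrete, the net count seen by a tool reading all three sets of
gram files can be sketched as follows. The notation is introduced here
purely for illustration and is not part of \htool{LGPrep}:
$c_{\mathrm{main}}(g)$, $c_{+}(g)$ and $c_{-}(g)$ denote the counts of
an $n$-gram $g$ in the main, positive and negative gram files
respectively.
\[
  c(g) = c_{\mathrm{main}}(g) + c_{+}(g) - c_{-}(g)
\]
As a hypothetical example, suppose the bigram ``hundred fifty'' occurs
10 times in the original text and an edit rule rewrites 5 of those
occurrences as ``hundred and fifty''. Assuming the bigram
``hundred and'' does not otherwise occur in the text, the net counts
would be
\[
  c(\mbox{hundred fifty}) = 10 + 0 - 5 = 5, \qquad
  c(\mbox{hundred and}) = 0 + 5 - 0 = 5 .
\]
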
Rules consist of an \textit{application factor}, a \textit{pattern}
and a \textit{replacement}
\begin{verbatim}
    <rule-def>    = <app-factor> <pattern> : <replacement>
    <pattern>     = { <word> | '*' | !<set> | %<set> }
    <replacement> = { '$'<field> | string }  % $' - work around emacs
                                             %    colouring bug
\end{verbatim}
The application factor should be a real number in the range 0 to 1 and
it specifies the proportion of occurrences of the pattern which should
be replaced. The pattern consists of a sequence of words, wildcard
symbols (``\texttt{*}'') which match any word, and set references of
the form \texttt{\%n} denoting any word which is in set number
\texttt{n} and \texttt{!n} denoting any word which is not in set
number \texttt{n}. The replacement consists of a sequence of words and
field references of the form \texttt{\$i} which denotes the
\texttt{i}'th matching word in the input.

As an example, the following rules would translate 50\% of the
occurrences of numbers in the form ``one hundred fifty'' to ``one
hundred and fifty'' and 30\% of the occurrences of ``one hundred'' to
``a hundred''.
\begin{verbatim}
    #0 one two three four five six seven eight nine fifty sixty seventy
    #1 hundred
    0.5 * * hundred %0 * * : $0 $1 $2 and $3 $4 $5
    0.3 * * !0 one %1 * * : $0 $1 $2 a $4 $5 $6
\end{verbatim}

Finally, note that \htool{LGPrep} processes edited text in a parallel
stream to normal text, so it is possible to generate edited gram files
whilst generating the main gram file set. However, the main gram files
usually already exist, so it is normal to suppress main gram file
generation using the \texttt{-z} option when using edit rules.

\mysubsect{Use}{LGPrep-Use}

\htool{LGPrep} is invoked by typing the command line
\begin{verbatim}
    LGPrep [options] wordmap [textfile ...]
\end{verbatim}
Each text file is processed in turn and treated as a continuous stream
of words.
If no text files are specified, standard input is used. This is the
more usual case since it allows the input text source to be filtered
before input to \htool{LGPrep}, for example, using \htool{LCond.pl}
(in {\tt LMTutorial/extras/}).

Each $n$-gram in the input stream is stored in a buffer. When the
buffer is full it is sorted and multiple occurrences of the same
$n$-gram are merged and the count set accordingly. When this process
ceases to yield sufficient buffer space, the contents are written to
an output gram file.

The word map file defines the mapping of source words to the numeric
ids used within \HLM\ tools. Any words not in the map are allocated
new ids and added to the map. On completion, a new map with the same
name (unless specified otherwise with the \texttt{-w} option) is
output to the same directory as the output gram files. To initialise
the first invocation of this updating process, a word map file should
be created with a text editor containing the following:
\begin{verbatim}
    Name=xxxx
    SeqNo=0
    Language=yyyy
    Entries=0
    Fields=ID
    \Words\
\end{verbatim}
where \texttt{xxxx} is an arbitrarily chosen name for the word map and
\texttt{yyyy} is the language. The header can also include a field
specifying the escaping mode to use (\texttt{HTK} or \texttt{RAW}),
and \texttt{Fields} can be changed to include frequency counts in the
output (i.e.\ \texttt{Fields=ID,WFC}). Alternatively, these settings
can be added to the output using command line options.

The allowable options to \htool{LGPrep} are as follows
\begin{optlist}

  \ttitem{-a n} Allow up to \texttt{n} new words in input texts
  (default 100000).

  \ttitem{-b n} Set the internal gram buffer size to \texttt{n}
  (default 2000000). \htool{LGPrep} stores incoming $n$-grams in this
  buffer. When the buffer is full, the contents are sorted and written
  to an output gram file. Thus, the buffer size determines the amount
  of process memory that \htool{LGPrep} will use and the size of the
  individual output gram files.

  \ttitem{-c} Add word counts to the output word map.
  This overrides the setting in the input word map (default off).

  \ttitem{-d s} Store the output gram files in directory \texttt{s}
  (default current directory).

  \ttitem{-e n} Set the internal edited gram buffer size to
  \texttt{n} (default 100000).

  \ttitem{-f s} Fix (i.e.\ edit) the text source using the rules in
  \texttt{s}.

  \ttitem{-h} Do not use HTK escaping in the output word map (HTK
  escaping is on by default).

  \ttitem{-i n} Set the index of the first gram file output to be
  \texttt{n} (default 0).

  \ttitem{-n n} Set the output $n$-gram size to \texttt{n} (default 3).

  \ttitem{-q} Tag words at sentence start with underscore (\_).

  \ttitem{-r s} Set the root name of the output gram files to
  \texttt{s} (default ``gram'').

  \ttitem{-s s} Write the string \texttt{s} into the source field of
  the output gram files. This string should be a comment describing
  the text source.

  \ttitem{-w s} Write the output map file to \texttt{s} (default same
  as input map name stored in the output gram directory).

  \ttitem{-z} Suppress gram file output. This option allows
  \htool{LGPrep} to be used just to compute a word frequency map. It
  is also normally applied when applying edit rules to the input.

\stdoptQ
\end{optlist}
\stdopts{LGPrep}

\mysubsect{Tracing}{LGPrep-Tracing}

\htool{LGPrep} supports the following trace options where each
trace flag is given using an octal base
\begin{optlist}
\ttitem{00001} Basic progress reporting.
\ttitem{00002} Monitor buffer save operations.
\ttitem{00004} Trace word input stream.
\ttitem{00010} Trace shift register input.
\ttitem{00020} Rule input monitoring.
\ttitem{00040} Print rule set.
\end{optlist}
Trace flags are set using the \texttt{-T} option or the \texttt{TRACE}
configuration variable.
\index{lgprep@\htool{LGPrep}|)}
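
As a usage sketch, the options and trace flags listed above might be
combined as in the following command line. The file and directory
names (\texttt{gram.dir}, \texttt{empty.wmap}, \texttt{train.txt}) are
hypothetical. The example also assumes that \texttt{-d} takes the
directory name as its argument and that individual trace flags are
combined by adding their values, so that \texttt{3} would enable flags
\texttt{00001} and \texttt{00002}.
\begin{verbatim}
    LGPrep -T 3 -n 3 -d gram.dir -r gram empty.wmap train.txt
\end{verbatim}
Under these assumptions, trigram gram files named \texttt{gram.0},
\texttt{gram.1}, etc.\ would be written to \texttt{gram.dir}, together
with an updated copy of the word map.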