%
% HLMBook - Steve Young    08/01/97
%
% Updated - Gareth Moore   15/01/02
%





\newpage


\mysect{LGPrep}{LGPrep}





\mysubsect{Function}{LGPrep-Function}





\index{lgprep@\htool{LGPrep}|(}





The function of this tool is to scan a language model training text
and generate a set of gram files holding the $n$-grams seen in the text
along with their counts.  By default, the output gram files are named
\texttt{gram.0}, \texttt{gram.1}, \texttt{gram.2}, etc.  However, the
root name can be changed using the \texttt{-r} option and the start
index can be set using the \texttt{-i} option.





Each output gram file is sorted but the files themselves will not be
sequenced (see section~\ref{s:gramfs}).  Thus, when using
\htool{LGPrep} with substantial training texts, it is good practice to
subsequently copy the complete set of output gram files using
\htool{LGCopy} to reorder them into sequence.  This process will also
remove duplicate occurrences, making the resultant files more compact
and faster to read by the \HLM\ processing tools.





Since \htool{LGPrep} will often encounter new words in its input, it
is necessary to update the word map.  The normal operation therefore
is that \htool{LGPrep} begins by reading in a word map containing all
the word ids required to decode all previously generated gram files.
This word map is then updated to include all the new words seen in the
current input text.  On completion, the updated word map is output to
a file of the same name as the input word map in the directory used to
store the new gram files.  Alternatively, it can be output to a
specified file using the \texttt{-w} option.  The sequence number in
the header of the newly created word map will be one greater than that
of the original.





\htool{LGPrep} can also apply a set of ``match and replace'' edit
rules to the input text stream.  The purpose of this facility is not
to replace input text conditioning filters but to make simple changes
to the text after the main gram files have been generated.  The
editing works by passing the text through a window one word at a time.
The edit rules consist of a pattern and a replacement text.  At each
step, the pattern of each rule is matched against the window and if a
match occurs, then the matched word sequence is replaced by the string
in the replacement part of the rule.  Two sets of gram files are
generated by this process.  A ``negative'' set of gram files contains
$n$-grams corresponding to just the text strings which were modified and
a ``positive'' set of gram files contains $n$-grams corresponding to the
modified text.  All text for which no rules matched is ignored and
generates no gram file output.  Once the positive and negative gram
files have been generated, the positive grams are added (i.e.\ input
with a weight of +1) to the original set of gram files and the
negative grams are subtracted (i.e.\ input with a weight of -1).  The
net result is that the tool reading the full set of gram files
receives a stream of $n$-grams which will be identical to the stream that
it would have received if the editing commands had been applied to the
text source when the original main gram file set had been generated.
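
As an illustration of this bookkeeping, consider bigrams ($n=2$) and a
single hypothetical input fragment ``cost one hundred fifty dollars''
in which an edit rule inserts ``and''.  The sketch below assumes that
the negative set records the bigrams of the matched text and the
positive set those of its replacement.  The exact set of context
$n$-grams that \htool{LGPrep} writes is not specified here, but any
unchanged $n$-gram appearing in both sets cancels.
\begin{verbatim}
    bigram            main   positive   negative   net
                      (+1)     (+1)       (-1)
    cost one            1        1          1       1
    one hundred         1        1          1       1
    hundred fifty       1        0          1       0
    hundred and         0        1          0       1
    and fifty           0        1          0       1
    fifty dollars       1        1          1       1
\end{verbatim}
The net column is the bigram stream of the edited fragment ``cost one
hundred and fifty dollars''.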





The edit rules are stored in a file and read in using the \texttt{-f}
option.  They consist of set definitions and rule definitions, each
written on a separate line.  Each set defines a set of words and is
identified by an integer in the range 0 to 255:
\begin{verbatim}
    <set-def>     = '#'<number> <word1> <word2> ... <wordN>.
\end{verbatim}
For example,
\begin{verbatim}
    #4 red green blue
\end{verbatim}


defines set number 4 as being the 3 words ``red'', ``green'' and ``blue''.  Rules
consist of an \textit{application factor}, a \textit{pattern} and a
\textit{replacement}:
\begin{verbatim}
    <rule-def>    = <app-factor> <pattern> : <replacement>
    <pattern>     = { <word> | '*' | !<set> | %<set> }
    <replacement> = { '$'<field> | string } % $' - work around emacs
                                            % colouring bug
\end{verbatim}


The application factor should be a real number in the range 0 to 1 and
it specifies the proportion of occurrences of the pattern which should
be replaced.  The pattern consists of a sequence of words, wildcard
symbols (``\texttt{*}'') which match any word, and set references of the
form \texttt{\%n} denoting any word which is in set number \texttt{n}
and \texttt{!n} denoting any word which is not in set number
\texttt{n}.  The replacement consists of a sequence of words and field
references of the form \texttt{\$i} which denotes the \texttt{i}'th
matching word in the input.





As an example, the following rules would translate 50\% of the
occurrences of numbers in the form ``one hundred fifty'' to ``one
hundred and fifty'' and 30\% of the occurrences of ``one hundred'' to
``a hundred''.
\begin{verbatim}
    #0 one two three four five six seven eight nine fifty sixty seventy
    #1 hundred
    0.5 * * hundred %0 * * : $0 $1 $2 and $3 $4 $5
    0.3 * * !0 one %1  * * : $0 $1 $2 a $4 $5 $6
\end{verbatim}
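
To trace the first rule by hand, take the hypothetical six-word window
``cost one hundred fifty dollars today''.  Assuming that fields are
numbered from 0 in pattern order, as the replacement strings above
suggest, the match binds as follows.
\begin{verbatim}
    pattern:   *     *    hundred  %0     *        *
    window:    cost  one  hundred  fifty  dollars  today
    field:     $0    $1   $2       $3     $4       $5

    output:    cost one hundred and fifty dollars today
\end{verbatim}
With an application factor of 0.5, about half of such matches would be
rewritten in this way and the rest left unchanged.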


Finally, note that \htool{LGPrep} processes edited text in a parallel
stream to normal text, so it is possible to generate edited gram files
whilst generating the main gram file set.  However, the main gram
files normally already exist, so it is usual to suppress their
generation using the \texttt{-z} option when using edit rules.
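
An edit-only run might therefore look like the following sketch, in
which \texttt{fix.rules}, \texttt{lm.fix}, \texttt{lm.0/wmap} and
\texttt{train.txt} are hypothetical names.  Only the \texttt{-z},
\texttt{-f} and \texttt{-d} options described in this section are used.
\begin{verbatim}
    LGPrep -z -f fix.rules -d lm.fix lm.0/wmap train.txt
\end{verbatim}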





\mysubsect{Use}{LGPrep-Use}





\htool{LGPrep} is invoked by typing the command line
\begin{verbatim}
   LGPrep [options] wordmap [textfile ...]
\end{verbatim}


Each text file is processed in turn and treated as a continuous stream
of words.  If no text files are specified, standard input is used.
This is the more usual case since it allows the input text source to
be filtered before input to \htool{LGPrep}, for example, using
\htool{LCond.pl} (in \texttt{LMTutorial/extras/}).
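
For example, the two forms might be used as sketched below.  The file
names, the directory \texttt{lm.0} and the filter command
\texttt{myfilter} are hypothetical; the options are those listed later
in this section.
\begin{verbatim}
    # text files named on the command line
    LGPrep -n 3 -d lm.0 -s "news text" empty.wmap train1.txt train2.txt

    # text filtered first and read from standard input
    cat train1.txt train2.txt | myfilter | LGPrep -n 3 -d lm.0 empty.wmap
\end{verbatim}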





Each $n$-gram in the input stream is stored in a buffer.  When the buffer
is full, it is sorted and multiple occurrences of the same $n$-gram are
merged, with the count set accordingly.  When this process ceases to
yield sufficient buffer space, the contents are written to an output
gram file.
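
As a small illustration with a hypothetical buffer holding four
trigrams, sorting brings identical entries together so that they can be
merged into a single entry with a summed count.  The alphabetical order
shown is purely for readability and is not a statement about the order
\htool{LGPrep} itself uses.
\begin{verbatim}
    buffer as filled        after sort and merge
    the cat sat   1         on the mat    1
    on the mat    1         the cat sat   3
    the cat sat   1
    the cat sat   1
\end{verbatim}
Here four slots are reduced to two, freeing space for further input.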





The word map file defines the mapping of source words to the numeric
ids used within \HLM\ tools.  Any words not in the map are allocated
new ids and added to the map.  On completion, a new map with the same
name (unless specified otherwise with the \texttt{-w} option) is
output to the same directory as the output gram files.  To initialise
the first invocation of this updating process, a word map file should
be created with a text editor containing the following:
\begin{verbatim}
    Name=xxxx
    SeqNo=0
    Language=yyyy
    Entries=0
    Fields=ID
    \Words\
\end{verbatim}


where \texttt{xxxx} is an arbitrarily chosen name for the word map and
\texttt{yyyy} is the language.  A header field specifying the escaping
mode to use (\texttt{HTK} or \texttt{RAW}) can also be given, and
\texttt{Fields} can be changed to include frequency counts in the
output (i.e.\ \texttt{Fields=ID,WFC}).  Alternatively, these settings
can be applied to the output using command line options.
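
For instance, an initial map for English news text might be written as
follows, where the name \texttt{news}, the file name
\texttt{empty.wmap} and the directory \texttt{lm.0} are hypothetical.
\begin{verbatim}
    Name=news
    SeqNo=0
    Language=English
    Entries=0
    Fields=ID
    \Words\
\end{verbatim}
Word counts could then be requested on the command line with the
\texttt{-c} option rather than by editing the header:
\begin{verbatim}
    LGPrep -c -d lm.0 empty.wmap train.txt
\end{verbatim}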





The allowable options to \htool{LGPrep} are as follows





\begin{optlist}


  \ttitem{-a n} Allow up to \texttt{n} new words in input texts
  (default 100000).

  \ttitem{-b n} Set the internal gram buffer size to \texttt{n} (default
  2000000).  \htool{LGPrep} stores incoming $n$-grams in this buffer.
  When the buffer is full, the contents are sorted and written to an
  output gram file.  Thus, the buffer size determines the amount of
  process memory that \htool{LGPrep} will use and the size of the
  individual output gram files.

  \ttitem{-c} Add word counts to the output word map.  This overrides
       the setting in the input word map (default off).





  \ttitem{-d s} Store the output gram files in directory \texttt{s}
             (default current directory).


        


  \ttitem{-e n} Set the internal edited gram buffer size to \texttt{n}
  (default 100000).

  \ttitem{-f s} Fix (i.e.\ edit) the text source using the rules in
  \texttt{s}.

  \ttitem{-h} Do not use HTK escaping in the output word map (default
              on).

  \ttitem{-i n} Set the index of the first gram file output
             to be \texttt{n} (default 0).

  \ttitem{-n n} Set the output $n$-gram size to \texttt{n} (default 3).

  \ttitem{-q} Tag words at sentence start with underscore (\_).

  \ttitem{-r s} Set the root name of the output gram files to
       \texttt{s} (default ``gram'').

  \ttitem{-s s} Write the string \texttt{s} into the source field of
       the output gram files.  This string should be a comment
       describing the text source.

  \ttitem{-w s} Write the output map file to \texttt{s} (default same
      as input map name stored in the output gram directory).

  \ttitem{-z} Suppress gram file output.  This option allows
      \htool{LGPrep} to be used just to compute a word frequency map.
      It is also normally applied when applying edit rules to the
      input.

  \stdoptQ


\end{optlist}


\stdopts{LGPrep}





\mysubsect{Tracing}{LGPrep-Tracing}





\htool{LGPrep} supports the following trace options, where each
trace flag is given using an octal base:
\begin{optlist}
\ttitem{00001}  Basic progress reporting.
\ttitem{00002}  Monitor buffer save operations.
\ttitem{00004}  Trace word input stream.
\ttitem{00010}  Trace shift register input.
\ttitem{00020}  Rule input monitoring.
\ttitem{00040}  Print rule set.
\end{optlist}
Trace flags are set using the \texttt{-T} option or the \texttt{TRACE}
configuration variable.
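
Because each flag occupies a distinct bit, flags are combined by adding
their octal values.  For example, basic progress reporting (00001)
together with buffer save monitoring (00002) gives 00003.  In the
sketch below the word map and text file names are hypothetical.
\begin{verbatim}
    LGPrep -T 3 wmap train.txt
\end{verbatim}
Similarly, 00010 and 00020 combine to 00030 in octal.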


\index{lgprep@\htool{LGPrep}|)}

