<!-- $Id: ngram.1,v 1.58 2006/07/30 04:54:35 stolcke Exp $ -->
<HTML><HEAD><TITLE>ngram</TITLE></HEAD><BODY>
<H1>ngram</H1>
<H2> NAME </H2>
ngram - apply N-gram language models
<H2> SYNOPSIS </H2>
<B> ngram </B> [<B>-help</B>] option...
<H2> DESCRIPTION </H2>
<B> ngram </B> performs various operations with N-gram-based and related language models,
including sentence scoring, perplexity computation, sentence generation,
and various types of model interpolation.
The N-gram language models are read from files in ARPA
<A HREF="ngram-format.html">ngram-format(5)</A>;
various extended language model formats are described with the options below.
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>Print option summary.
<DT><B> -version </B>
<DD>Print version information.
<DT><B>-order</B><I> n</I>
<DD>Set the maximal N-gram order to be used, by default 3.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3 it is always necessary to specify this option.
<DT><B>-debug</B><I> level</I>
<DD>Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr, with the exception of <B> -ppl </B>
output as explained below.
<DT><B> -memuse </B>
<DD>Print memory usage statistics for the LM.
</DD></DL>
<P>
The following options determine the type of LM to be used.
<DL>
<DT><B> -null </B>
<DD>Use a `null' LM as the main model (one that gives probability 1 to all words).
This is useful in combination with mixture creation or for debugging.
<DT><B>-lm</B><I> file</I>
<DD>Read the (main) N-gram model from <I>file</I>.
This option is always required, unless <B> -null </B> was chosen.
<DT><B> -tagged </B>
<DD>Interpret the LM as containing word/tag N-grams.
<DT><B> -skip </B>
<DD>Interpret the LM as a ``skip'' N-gram model.
<DT><B>-hidden-vocab</B><I> file</I>
<DD>Interpret the LM as an N-gram model containing hidden events between words.
The list of hidden event tags is read from <I>file</I>.
<BR>
Hidden event definitions may also follow the N-gram definitions in the LM file
(the argument to <B>-lm</B>).
The format for such definitions is
<BR>&nbsp;&nbsp;&nbsp;<I>event</I> [<B>-delete</B> <I>D</I>] [<B>-repeat</B> <I>R</I>] [<B>-insert</B> <I>w</I>] [<B>-observed</B>] [<B>-omit</B>]
<BR>
The optional flags after the event name modify the default behavior of hidden events in the model.
By default events are unobserved pseudo-words of which at most one can occur
between regular words, and which are added to the context to predict
following words and events.
(A typical use would be to model hidden sentence boundaries.)
<B> -delete </B> indicates that upon encountering the event, <I> D </I> words are deleted from the next word's context.
<B> -repeat </B> indicates that after the event the next <I> R </I> words from the context are to be repeated.
<B> -insert </B> specifies that an (unobserved) word <I> w </I> is to be inserted into the history.
<B> -observed </B> specifies that the event tag is not hidden, but observed in the word stream.
<B> -omit </B> indicates that the event tag itself is not to be added to the history for
predicting the following words.
<BR>
The hidden event mechanism represents a generalization of the disfluency
LM enabled by <B>-df</B>.
A small example definition file is sketched below, after the <B>-hidden-not</B> entry.
<DT><B>-hidden-not</B>
<DD>Modifies processing of hidden event N-grams for the case that the event tags
are embedded in the word stream, as opposed to inferred through dynamic programming.
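<BR>
For illustration only (the event names and file names below are made up, not prescribed by SRILM),
a hidden-event definition file following the format above might declare an unobserved
sentence-boundary event, a sentence-restart event that inserts &lt;s&gt; into the history,
and an observed filled pause:
<PRE>
@SB
@RESTART -insert &lt;s&gt; -omit
UH -observed -omit
</PRE>
Such a file could then be used roughly as follows (file names are placeholders):
<PRE>
ngram -lm trained.lm -order 3 -hidden-vocab hidden.defs -ppl test.txt
</PRE>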
<DT><B> -df </B>
<DD>Interpret the LM as containing disfluency events.
This enables an older form of hidden-event LM used in
Stolcke &amp; Shriberg (1996).
It is roughly equivalent to a hidden-event LM with
<BR>&nbsp;&nbsp;&nbsp;UH -observed -omit (filled pause)
<BR>&nbsp;&nbsp;&nbsp;UM -observed -omit (filled pause)
<BR>&nbsp;&nbsp;&nbsp;@SDEL -insert &lt;s&gt; (sentence restart)
<BR>&nbsp;&nbsp;&nbsp;@DEL1 -delete 1 -omit (1-word deletion)
<BR>&nbsp;&nbsp;&nbsp;@DEL2 -delete 2 -omit (2-word deletion)
<BR>&nbsp;&nbsp;&nbsp;@REP1 -repeat 1 -omit (1-word repetition)
<BR>&nbsp;&nbsp;&nbsp;@REP2 -repeat 2 -omit (2-word repetition)
<DT><B>-classes</B><I> file</I>
<DD>Interpret the LM as an N-gram over word classes.
The expansions of the classes are given in <I>file</I>
in <A HREF="classes-format.html">classes-format(5)</A>.
Tokens in the LM that are not defined as classes in <I> file </I>
are assumed to be plain words, so that the LM can contain mixed N-grams over
both words and word classes.
<BR>
Class definitions may also follow the N-gram definitions in the LM file
(the argument to <B>-lm</B>).
In that case <B>-classes /dev/null</B>
should be specified to trigger interpretation of the LM as a class-based model.
Otherwise, class definitions specified with this option override any
definitions found in the LM file itself.
<DT><B>-simple-classes</B>
<DD>Assume a "simple" class model: each word is a member of at most one word class,
and class expansions are exactly one word long.
<DT><B>-expand-classes</B><I> k</I>
<DD>Replace the class-N-gram model that was read with an (approximately) equivalent
word-based N-gram.
The argument <I> k </I> limits the length of the N-grams included in the new model
(<I>k</I>=0 allows N-grams of arbitrary length).
<DT><B>-expand-exact</B><I> k</I>
<DD>Use a more exact (but also more expensive) algorithm to compute the
conditional probabilities of N-grams expanded from classes, for
N-grams of length <I> k </I> or longer
(<I>k</I>=0 is a special case and the default; it disables the exact algorithm for all N-grams).
The exact algorithm is recommended for class-N-gram models that contain
multi-word class expansions, for N-gram lengths exceeding the order of the
underlying class N-grams.
<DT><B> -decipher </B>
<DD>Use the N-gram model exactly as the Decipher(TM) recognizer would,
i.e., choosing the backoff path if it has a higher probability than
the bigram transition, and rounding log probabilities to bytelog precision.
<DT><B> -factored </B>
<DD>Use a factored N-gram model, i.e., a model that represents words as
vectors of feature-value pairs and models sequences of words by a set of
conditional dependency relations between factors.
Individual dependencies are modeled by standard N-gram LMs, allowing
however for a generalized backoff mechanism to combine multiple backoff
paths (Bilmes and Kirchhoff 2003).
The <B>-lm</B>, <B>-mix-lm</B>, etc. options name FLM specification files in the
format described in Kirchhoff et al. (2002).
<DT><B> -hmm </B>
<DD>Use an HMM of N-grams language model.
The <B> -lm </B> option specifies a file that describes a probabilistic graph,
with each line corresponding to a node or state.
A line has the format:
<BR>&nbsp;&nbsp;&nbsp;<I>statename</I> <I>ngram-file</I> <I>s1</I> <I>p1</I> <I>s2</I> <I>p2</I> ...
<BR>
where <I> statename </I> is a string identifying the state,
<I> ngram-file </I> names a file containing a backoff N-gram model,
<I>s1</I>, <I>s2</I>, ... are names of follow-states, and
<I>p1</I>, <I>p2</I>, ... are the associated transition probabilities.
A filename of ``-'' can be used to indicate that the N-gram model data
is included in the HMM file, after the current line.
(Further HMM states may be specified after the N-gram data.)
<BR>
The names <B> INITIAL </B> and <B> FINAL </B>
denote the start and end states, respectively, and have no associated
N-gram model (<I> ngram-file </I> must be specified as ``.'' for these).
The <B> -order </B> option specifies the maximal N-gram length in the component models.
<BR>
The semantics of an HMM of N-grams is as follows: as each state is visited,
words are emitted from the associated N-gram model.
The first state (corresponding to the start-of-sentence) is <B>INITIAL</B>.
A state is left with the probability of the end-of-sentence token
in the respective model, and the next state is chosen according to
the state transition probabilities.
Each state has to emit at least one word.
The actual end-of-sentence is emitted if and only if the <B> FINAL </B> state is reached.
Each word probability is conditioned on all preceding words, regardless of
whether they were emitted in the same or a previous state.
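<BR>
As an illustrative sketch only (state names, file names, and probabilities are invented here,
and the exact layout should be checked against your SRILM version), an HMM specification file
with two word-emitting states might look like:
<PRE>
INITIAL  .         BODY 1.0
BODY     body.lm   TAIL 0.4  FINAL 0.6
TAIL     tail.lm   FINAL 1.0
FINAL    .
</PRE>
and could be evaluated along the lines of
<PRE>
ngram -hmm -lm hmm.spec -order 3 -ppl test.txt
</PRE>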
<DT><B>-count-lm</B>
<DD>Use a count-based interpolated LM.
The <B> -lm </B> option specifies a file that describes a set of N-gram counts along with
interpolation weights, based on which Jelinek-Mercer smoothing in the
formulation of Chen and Goodman (1998) is performed.
The file format is
<BR>&nbsp;&nbsp;&nbsp;<B>order</B> <I>N</I>
<BR>&nbsp;&nbsp;&nbsp;<B>vocabsize</B> <I>V</I>
<BR>&nbsp;&nbsp;&nbsp;<B>totalcount</B> <I>C</I>
<BR>&nbsp;&nbsp;&nbsp;<B>mixweights</B> <I>M</I>
<BR>&nbsp;&nbsp;&nbsp;<I>w01</I> <I>w02</I> ... <I>w0N</I>
<BR>&nbsp;&nbsp;&nbsp;<I>w11</I> <I>w12</I> ... <I>w1N</I>
<BR>&nbsp;&nbsp;&nbsp;...
<BR>&nbsp;&nbsp;&nbsp;<I>wM1</I> <I>wM2</I> ... <I>wMN</I>
<BR>&nbsp;&nbsp;&nbsp;<B>countmodulus</B> <I>m</I>
<BR>&nbsp;&nbsp;&nbsp;<B>google-counts</B> <I>dir</I>
<BR>&nbsp;&nbsp;&nbsp;<B>counts</B> <I>file</I>
<BR>
Here <I> N </I> is the model order (maximal N-gram length), although, as with backoff models,
the actual value used is overridden by the <B> -order </B> command line option when the model is read in.
<I> V </I> gives the vocabulary size and <I> C </I> the sum of all unigram counts.
<I> M </I> specifies the number of mixture weight bins (minus 1).
<I> m </I> is the width of a mixture weight bin.
Thus, <I> wij </I> is the mixture weight used to interpolate a <I>j</I>-th
order maximum-likelihood estimate with lower-order estimates, given that
the (<I>j</I>-1)-gram context has been seen with a frequency between
<I>i</I>*<I>m</I> and (<I>i</I>+1)*<I>m</I> times.
(For contexts with frequency greater than <I>M</I>*<I>m</I>, the <I>i</I>=<I>M</I> weights are used.)
The N-gram counts themselves are given in an indexed directory structure rooted at <I>dir</I>,
in an external <I>file</I>,
or, if <I> file </I> is the string <B>-</B>, starting on the line following the <B> counts </B> keyword.
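<BR>
A minimal sketch of such a file, with all numbers chosen arbitrarily for illustration:
a trigram model (<I>N</I>=3) with three frequency bins (<I>M</I>=2, hence three weight rows of three
weights each), a bin width of 5, and the counts assumed to live in a separate file named
trigram.counts (only one of the <B>google-counts</B> / <B>counts</B> lines would appear):
<PRE>
order 3
vocabsize 10000
totalcount 250000
mixweights 2
 0.50 0.40 0.30
 0.60 0.50 0.40
 0.70 0.65 0.55
countmodulus 5
counts trigram.counts
</PRE>
It might then be evaluated along the lines of (file names are placeholders):
<PRE>
ngram -count-lm -lm counts.lm -order 3 -ppl test.txt
</PRE>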
<DT><B>-vocab</B><I> file</I>
<DD>Initialize the vocabulary for the LM from <I>file</I>.
This is especially useful if the LM itself does not specify a complete
vocabulary, e.g., as with <B>-null</B>.
<DT><B>-vocab-aliases</B><I> file</I>
<DD>Read vocabulary alias definitions from <I>file</I>,
consisting of lines of the form
<BR>&nbsp;&nbsp;&nbsp;<I>alias</I> <I>word</I>
<BR>
This causes all tokens <I> alias </I> to be mapped to <I>word</I>.
<DT><B>-nonevents</B><I> file</I>
<DD>Read a list of words from <I> file </I> that are to be considered non-events, i.e., that
should only occur in LM contexts, but not as predictions.
Such words are excluded from sentence generation
(<B>-gen</B>)
and probability summation
(<B>-ppl -debug 3</B>).
<DT><B> -limit-vocab </B>
<DD>Discard LM parameters on reading that do not pertain to the words specified in the vocabulary.
The default is that words used in the LM are automatically added to the vocabulary.
This option can be used to reduce the memory requirements for large LMs
that are going to be evaluated only on a small vocabulary subset.
<DT><B> -unk </B>
<DD>Indicates that the LM contains the unknown word, i.e., is an open-class LM.
<DT><B>-map-unk</B><I> word</I>
<DD>Map out-of-vocabulary words to <I>word</I>,
rather than the default <B> &lt;unk&gt; </B> tag.
<DT><B> -tolower </B>
<DD>Map all vocabulary to lowercase.
Useful if case conventions for text/counts and language model differ.
<DT><B> -multiwords </B>
<DD>Split input words consisting of multiwords joined by underscores
into their components, before evaluating LM probabilities.
<DT><B>-mix-lm</B><I> file</I>
<DD>Read a second N-gram model for interpolation purposes.
The second and any additional interpolated models can also be class N-grams
(using the same <B> -classes </B> definitions), but are otherwise constrained to be
standard N-grams, i.e., the options
<B>-df</B>, <B>-tagged</B>, <B>-skip</B>, and <B> -hidden-vocab </B>
do not apply to them.
<BR>
<B> NOTE: </B>
Unless <B> -bayes </B> (see below) is specified, <B> -mix-lm </B>
triggers a static interpolation of the models in memory.
In most cases a more efficient, dynamic interpolation is sufficient, requested
by <B>-bayes 0</B>.
Also, mixing models of different type (e.g., word-based and class-based)
will <I> only </I> work correctly with dynamic interpolation.
<DT><B>-lambda</B><I> weight</I>
<DD>Set the weight of the main model when interpolating with <B>-mix-lm</B>.
Default value is 0.5.
<DT><B>-mix-lm2</B><I> file</I>
<DD>
<DT><B>-mix-lm3</B><I> file</I>
<DD>
<DT><B>-mix-lm4</B><I> file</I>
<DD>
<DT><B>-mix-lm5</B><I> file</I>
<DD>
<DT><B>-mix-lm6</B><I> file</I>
<DD>
<DT><B>-mix-lm7</B><I> file</I>
<DD>
<DT><B>-mix-lm8</B><I> file</I>
<DD>
<DT><B>-mix-lm9</B><I> file</I>
<DD>Up to 9 more N-gram models can be specified for interpolation.
<DT><B>-mix-lambda2</B><I> weight</I>