ngram(1)                                                              ngram(1)

NAME
       ngram - apply N-gram language models

SYNOPSIS
       ngram [-help] option ...

DESCRIPTION
       ngram performs various operations with N-gram-based and related
       language models, including sentence scoring, perplexity computation,
       sentence generation, and various types of model interpolation.  The
       N-gram language models are read from files in ARPA ngram-format(5);
       various extended language model formats are described with the
       options below.

OPTIONS
       Each filename argument can be an ASCII file, a compressed file (name
       ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -order n
              Set the maximal N-gram order to be used, by default 3.
              NOTE: The order of the model is not set automatically when a
              model file is read, so the same file can be used at various
              orders.  To use models of order higher than 3 it is always
              necessary to specify this option.

       -debug level
              Set the debugging output level (0 means no debugging output).
              Debugging messages are sent to stderr, with the exception of
              -ppl output as explained below.

       -memuse
              Print memory usage statistics for the LM.

       The following options determine the type of LM to be used.

       -null  Use a `null' LM as the main model (one that gives probability
              1 to all words).  This is useful in combination with mixture
              creation or for debugging.

       -lm file
              Read the (main) N-gram model from file.  This option is
              always required, unless -null was chosen.

       -tagged
              Interpret the LM as containing word/tag N-grams.

       -skip  Interpret the LM as a ``skip'' N-gram model.

       -hidden-vocab file
              Interpret the LM as an N-gram model containing hidden events
              between words.  The list of hidden event tags is read from
              file.  Hidden event definitions may also follow the N-gram
              definitions in the LM file (the argument to -lm).  The format
              for such definitions is

                   event [-delete D] [-repeat R] [-insert w] [-observed] [-omit]

              The optional flags after the event name modify the default
              behavior of hidden events in the model.  By default events
              are unobserved pseudo-words of which at most one can occur
              between regular words, and which are added to the context to
              predict following words and events.  (A typical use would be
              to model hidden sentence boundaries.)  -delete indicates that
              upon encountering the event, D words are deleted from the
              next word's context.  -repeat indicates that after the event
              the next R words from the context are to be repeated.
              -insert specifies that an (unobserved) word w is to be
              inserted into the history.  -observed specifies that the
              event tag is not hidden, but observed in the word stream.
              -omit indicates that the event tag itself is not to be added
              to the history for predicting the following words.

              The hidden event mechanism represents a generalization of the
              disfluency LM enabled by -df.  A sample event definition file
              is sketched below.
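              For illustration, a hidden event definition file following
              the format above might contain (the event names here are
              hypothetical, not predefined by ngram):

                   SBOUND
                   FP -observed -omit
                   RESTART -insert <s>

              Here SBOUND is an ordinary hidden event (e.g., a sentence
              boundary), FP is a filled pause that appears in the word
              stream but is omitted from the prediction history, and
              RESTART models a sentence restart by inserting <s> into the
              history.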
       -hidden-not
              Modifies processing of hidden event N-grams for the case that
              the event tags are embedded in the word stream, as opposed to
              inferred through dynamic programming.

       -df    Interpret the LM as containing disfluency events.  This
              enables an older form of hidden-event LM used in Stolcke &
              Shriberg (1996).  It is roughly equivalent to a hidden-event
              LM with

                   UH    -observed -omit     (filled pause)
                   UM    -observed -omit     (filled pause)
                   @SDEL -insert <s>         (sentence restart)
                   @DEL1 -delete 1 -omit     (1-word deletion)
                   @DEL2 -delete 2 -omit     (2-word deletion)
                   @REP1 -repeat 1 -omit     (1-word repetition)
                   @REP2 -repeat 2 -omit     (2-word repetition)

       -classes file
              Interpret the LM as an N-gram over word classes.  The
              expansions of the classes are given in file in
              classes-format(5).  Tokens in the LM that are not defined as
              classes in file are assumed to be plain words, so that the LM
              can contain mixed N-grams over both words and word classes.
              Class definitions may also follow the N-gram definitions in
              the LM file (the argument to -lm).  In that case -classes
              /dev/null should be specified to trigger interpretation of
              the LM as a class-based model.  Otherwise, class definitions
              specified with this option override any definitions found in
              the LM file itself.

       -simple-classes
              Assume a "simple" class model: each word is a member of at
              most one word class, and class expansions are exactly one
              word long.

       -expand-classes k
              Replace the read class-N-gram model with an (approximately)
              equivalent word-based N-gram.  The argument k limits the
              length of the N-grams included in the new model (k=0 allows
              N-grams of arbitrary length).

       -expand-exact k
              Use a more exact (but also more expensive) algorithm to
              compute the conditional probabilities of N-grams expanded
              from classes, for N-grams of length k or longer (k=0 is a
              special case and the default; it disables the exact algorithm
              for all N-grams).  The exact algorithm is recommended for
              class-N-gram models that contain multi-word class expansions,
              for N-gram lengths exceeding the order of the underlying
              class N-grams.

       -decipher
              Use the N-gram model exactly as the Decipher(TM) recognizer
              would, i.e., choosing the backoff path if it has a higher
              probability than the bigram transition, and rounding log
              probabilities to bytelog precision.

       -factored
              Use a factored N-gram model, i.e., a model that represents
              words as vectors of feature-value pairs and models sequences
              of words by a set of conditional dependency relations between
              factors.  Individual dependencies are modeled by standard
              N-gram LMs, allowing however for a generalized backoff
              mechanism to combine multiple backoff paths (Bilmes and
              Kirchhoff 2003).  The -lm, -mix-lm, etc. options name FLM
              specification files in the format described in Kirchhoff et
              al. (2002).

       -hmm   Use an HMM of N-grams language model.  The -lm option
              specifies a file that describes a probabilistic graph, with
              each line corresponding to a node or state.  A line has the
              format:

                   statename ngram-file s1 p1 s2 p2 ...

              where statename is a string identifying the state, ngram-file
              names a file containing a backoff N-gram model, s1, s2, ...
              are names of follow-states, and p1, p2, ... are the
              associated transition probabilities.  A filename of ``-'' can
              be used to indicate that the N-gram model data is included in
              the HMM file, after the current line.  (Further HMM states
              may be specified after the N-gram data.)  The names INITIAL
              and FINAL denote the start and end states, respectively, and
              have no associated N-gram model (ngram-file must be specified
              as ``.'' for these).  The -order option specifies the maximal
              N-gram length in the component models.

              The semantics of an HMM of N-grams is as follows: as each
              state is visited, words are emitted from the associated
              N-gram model.  The first state (corresponding to the
              start-of-sentence) is INITIAL.  A state is left with the
              probability of the end-of-sentence token in the respective
              model, and the next state is chosen according to the state
              transition probabilities.  Each state has to emit at least
              one word.  The actual end-of-sentence is emitted if and only
              if the FINAL state is reached.  Each word probability is
              conditioned on all preceding words, regardless of whether
              they were emitted in the same or a previous state.  A sketch
              of such a graph file appears below.
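              For illustration, a two-state HMM file following the format
              above might look like this (the state names and the model
              files news.lm and sports.lm are hypothetical):

                   INITIAL . s1 0.7 s2 0.3
                   s1 news.lm s1 0.5 s2 0.3 FINAL 0.2
                   s2 sports.lm s1 0.4 s2 0.4 FINAL 0.2
                   FINAL .

              A sentence starts in INITIAL, emits words from news.lm and/or
              sports.lm as states are visited, and ends only when FINAL is
              reached.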
       -count-lm
              Use a count-based interpolated LM.  The -lm option specifies
              a file that describes a set of N-gram counts along with
              interpolation weights, based on which Jelinek-Mercer
              smoothing in the formulation of Chen and Goodman (1998) is
              performed.  The file format is

                   order N
                   vocabsize V
                   totalcount C
                   mixweights M
                    w01 w02 ... w0N
                    w11 w12 ... w1N
                    ...
                    wM1 wM2 ... wMN
                   countmodulus m
                   google-counts dir
                   counts file

              Here N is the model order (maximal N-gram length), although,
              as with backoff models, the actual value used is overridden
              by the -order command line option when the model is read in.
              V gives the vocabulary size and C the sum of all unigram
              counts.  M specifies the number of mixture weight bins (minus
              1).  m is the width of a mixture weight bin.  Thus, wij is
              the mixture weight used to interpolate a j-th order
              maximum-likelihood estimate with lower-order estimates, given
              that the (j-1)-gram context has been seen with a frequency
              between i*m and (i+1)*m-1 times.  (For contexts with
              frequency M*m or greater, the weights from the last bin
              apply.)
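              For illustration, a minimal count-LM description for a bigram
              model might look like this (all numbers and the filenames are
              hypothetical):

                   order 2
                   vocabsize 10000
                   totalcount 250000
                   mixweights 2
                    0.5 0.4
                    0.6 0.5
                    0.7 0.6
                   countmodulus 10
                   counts my.counts

              With countmodulus 10 and mixweights 2, contexts seen 0-9
              times use the first row of weights, contexts seen 10-19 times
              the second, and all more frequent contexts the third.  Such a
              model could then be applied with, e.g.,

                   ngram -count-lm -lm counts.desc -ppl test.txt

              where -ppl computes test-set perplexity.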