.\" $Id: ngram.1,v 1.58 2006/07/30 04:54:35 stolcke Exp $.TH ngram 1 "$Date: 2006/07/30 04:54:35 $" "SRILM Tools".SH NAMEngram \- apply N-gram language models.SH SYNOPSIS.B ngram[\c.BR \-help ]option\&....SH DESCRIPTION.B ngramperforms various operations with N-gram-based and related language models,including sentence scoring, perplexity computation, sentences generation,and various types of model interpolation.The N-gram language models are read from files in ARPA.BR ngram-format (5);various extended language model formats are described with the optionsbelow..SH OPTIONS.PPEach filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicatestdin/stdout..TP.B \-helpPrint option summary..TP.B \-versionPrint version information..TP.BI \-order " n"Set the maximal N-gram order to be used, by default 3.NOTE: The order of the model is not set automatically when a modelfile is read, so the same file can be used at various orders.To use models of order higher than 3 it is always necessary to specify thisoption..TP.BI \-debug " level"Set the debugging output level (0 means no debugging output).Debugging messages are sent to stderr, with the exception of .B \-ppl output as explained below..TP.B \-memusePrint memory usage statistics for the LM..PPThe following options determine the type of LM to be used..TP.B \-nullUse a `null' LM as the main model (one that gives probability 1 to all words).This is useful in combination with mixture creation or for debugging..TP.BI \-lm " file"Read the (main) N-gram model from.IR file .This option is always required, unless .B \-nullwas chosen..TP.B \-taggedInterpret the LM as containing word/tag N-grams..TP.B \-skipInterpret the LM as a ``skip'' N-gram model..TP.BI \-hidden-vocab " file"Interpret the LM as an N-gram model containing hidden events between words.The list of hidden event tags is read from.IR file ..brHidden event definitions may also follow the N-gram definitions in the LM file (the argument to .BR \-lm ).The format for such definitions is.br	\fIevent\fP[\fB\-delete\fP \fID\fP][\fB\-repeat\fP \fIR\fP][\fB\-insert\fP \fIw\fP][\fB\-observed\fP][\fB\-omit\fP].brThe optional flags after the event name modify the default behavior of hidden events in the model.By default events are unobserved pseudo-words of which at most one can occurbetween regular words, and which are added to the context to predictfollowing words and events.(A typical use would be to model hidden sentence boundaries.).B \-deleteindicates that upon encountering the event,.I D words are deleted from the next word's context..B \-repeatindicates that after the event the next.I Rwords from the context are to be repeated..B \-insertspecifies that an (unobserved) word .I wis to be inserted into the history..B \-observed specifies the event tag is not hidden, but observed in the word stream..B \-omitindicates that the event tag itself is not to be added to the history forpredicting the following words..brThe hidden event mechanism represents a generalization of the disfluencyLM enabled by .BR \-df ..TP.BI \-hidden-notModifies processing of hidden event N-grams for the case that the event tags are embedded in the word stream, as opposed to inferred through dynamic programming..TP.B \-dfInterpret the LM as containing disfluency events.This enables an older form of hidden-event LM used inStolcke & Shriberg (1996).It is roughly equivalent to a hidden-event LM with.br	UH -observed -omit		(filled pause).br	UM -observed -omit		(filled pause).br	@SDEL 
.SH OPTIONS
.PP
Each filename argument can be an ASCII file, or a compressed file
(name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
.TP
.B \-help
Print option summary.
.TP
.B \-version
Print version information.
.TP
.BI \-order " n"
Set the maximal N-gram order to be used, by default 3.
NOTE: The order of the model is not set automatically when a model
file is read, so the same file can be used at various orders.
To use models of order higher than 3 it is always necessary to specify this
option.
.TP
.BI \-debug " level"
Set the debugging output level (0 means no debugging output).
Debugging messages are sent to stderr, with the exception of
.B \-ppl
output as explained below.
.TP
.B \-memuse
Print memory usage statistics for the LM.
.PP
The following options determine the type of LM to be used.
.TP
.B \-null
Use a `null' LM as the main model (one that gives probability 1 to all words).
This is useful in combination with mixture creation or for debugging.
.TP
.BI \-lm " file"
Read the (main) N-gram model from
.IR file .
This option is always required, unless
.B \-null
was chosen.
.TP
.B \-tagged
Interpret the LM as containing word/tag N-grams.
.TP
.B \-skip
Interpret the LM as a ``skip'' N-gram model.
.TP
.BI \-hidden-vocab " file"
Interpret the LM as an N-gram model containing hidden events between words.
The list of hidden event tags is read from
.IR file .
.br
Hidden event definitions may also follow the N-gram definitions in the
LM file (the argument to
.BR \-lm ).
The format for such definitions is
.br
	\fIevent\fP[\fB\-delete\fP \fID\fP][\fB\-repeat\fP \fIR\fP][\fB\-insert\fP \fIw\fP][\fB\-observed\fP][\fB\-omit\fP]
.br
The optional flags after the event name modify the default behavior of
hidden events in the model.
By default events are unobserved pseudo-words of which at most one can occur
between regular words, and which are added to the context to predict
following words and events.
(A typical use would be to model hidden sentence boundaries.)
.B \-delete
indicates that upon encountering the event,
.I D
words are deleted from the next word's context.
.B \-repeat
indicates that after the event the next
.I R
words from the context are to be repeated.
.B \-insert
specifies that an (unobserved) word
.I w
is to be inserted into the history.
.B \-observed
specifies the event tag is not hidden, but observed in the word stream.
.B \-omit
indicates that the event tag itself is not to be added to the history for
predicting the following words.
.br
The hidden event mechanism represents a generalization of the disfluency
LM enabled by
.BR \-df .
.TP
.BI \-hidden-not
Modifies processing of hidden event N-grams for the case that the event
tags are embedded in the word stream, as opposed to inferred through
dynamic programming.
.TP
.B \-df
Interpret the LM as containing disfluency events.
This enables an older form of hidden-event LM used in
Stolcke & Shriberg (1996).
It is roughly equivalent to a hidden-event LM with
.br
	UH -observed -omit		(filled pause)
.br
	UM -observed -omit		(filled pause)
.br
	@SDEL -insert <s>		(sentence restart)
.br
	@DEL1 -delete 1 -omit	(1-word deletion)
.br
	@DEL2 -delete 2 -omit	(2-word deletion)
.br
	@REP1 -repeat 1 -omit	(1-word repetition)
.br
	@REP2 -repeat 2 -omit	(2-word repetition)
.TP
.BI \-classes " file"
Interpret the LM as an N-gram over word classes.
The expansions of the classes are given in
.IR file
in
.BR classes-format (5).
Tokens in the LM that are not defined as classes in
.I file
are assumed to be plain words, so that the LM can contain mixed N-grams over
both words and word classes.
.br
Class definitions may also follow the N-gram definitions in the LM file
(the argument to
.BR \-lm ).
In that case
.BR "\-classes /dev/null"
should be specified to trigger interpretation of the LM as a class-based model.
Otherwise, class definitions specified with this option override any
definitions found in the LM file itself.
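.br
For illustration only (the class name, member words, and probabilities below
are invented), a class definitions file might contain lines such as
.br
	CITY 0.5 new york
.br
	CITY 0.3 boston
.br
	CITY 0.2 san francisco
.br
with one class expansion per line and an optional expansion probability; see
.BR classes-format (5)
for the authoritative format description.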
.TP
.BR \-simple-classes
Assume a "simple" class model: each word is a member of at most one word class,
and class expansions are exactly one word long.
.TP
.BI \-expand-classes " k"
Replace the read class-N-gram model with an (approximately) equivalent
word-based N-gram.
The argument
.I k
limits the length of the N-grams included in the new model
(\c
.IR k =0
allows N-grams of arbitrary length).
.TP
.BI \-expand-exact " k"
Use a more exact (but also more expensive) algorithm to compute the
conditional probabilities of N-grams expanded from classes, for
N-grams of length
.I k
or longer
(\c
.IR k =0
is a special case and the default; it disables the exact algorithm for all
N-grams).
The exact algorithm is recommended for class-N-gram models that contain
multi-word class expansions, for N-gram lengths exceeding the order of the
underlying class N-grams.
.TP
.B \-decipher
Use the N-gram model exactly as the Decipher(TM) recognizer would,
i.e., choosing the backoff path if it has a higher probability than
the bigram transition, and rounding log probabilities to bytelog
precision.
.TP
.B \-factored
Use a factored N-gram model, i.e., a model that represents words as vectors
of feature-value pairs and models sequences of words by a set of conditional
dependency relations between factors.
Individual dependencies are modeled by standard N-gram LMs, allowing
however for a generalized backoff mechanism to combine multiple backoff
paths (Bilmes and Kirchhoff 2003).
The
.BR \-lm ,
.BR \-mix-lm ,
etc. options name FLM specification files in the format described in
Kirchhoff et al. (2002).
.TP
.B \-hmm
Use an HMM of N-grams language model.
The
.B \-lm
option specifies a file that describes a probabilistic graph, with each
line corresponding to a node or state.
A line has the format:
.br
	\fIstatename\fP \fIngram-file\fP \fIs1\fP \fIp1\fP \fIs2\fP \fIp2\fP ...
.br
where
.I statename
is a string identifying the state,
.I ngram-file
names a file containing a backoff N-gram model,
.IR s1 , s2 ,\&...
are names of follow-states, and
.IR p1 , p2 ,\&...
are the associated transition probabilities.
A filename of ``-'' can be used to indicate the N-gram model data
is included in the HMM file, after the current line.
(Further HMM states may be specified after the N-gram data.)
.br
The names
.B INITIAL
and
.B FINAL
denote the start and end states, respectively, and have no associated
N-gram model
(\c
.I ngram-file
must be specified as ``.'' for these).
The
.B \-order
option specifies the maximal N-gram length in the component models.
.br
The semantics of an HMM of N-grams is as follows: as each state is visited,
words are emitted from the associated N-gram model.
The first state (corresponding to the start-of-sentence) is
.BR INITIAL .
A state is left with the probability of the end-of-sentence token
in the respective model, and the next state is chosen according to
the state transition probabilities.
Each state has to emit at least one word.
The actual end-of-sentence is emitted if and only if the
.B FINAL
state is reached.
Each word probability is conditioned on all preceding words, regardless of
whether they were emitted in the same or a previous state.
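.br
As a rough sketch (the state names, N-gram file names, and transition
probabilities below are made up for illustration), an HMM file might look like
.br
	INITIAL . QUESTION 0.4 STATEMENT 0.6
.br
	QUESTION question.3bo.lm FINAL 1.0
.br
	STATEMENT statement.3bo.lm FINAL 1.0
.br
	FINAL .
.br
describing sentences generated either by the question model or by the
statement model before the
.B FINAL
state is reached.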
.TP
.BI \-count-lm
Use a count-based interpolated LM.
The
.B \-lm
option specifies a file that describes a set of N-gram counts along with
interpolation weights, based on which Jelinek-Mercer smoothing in the
formulation of Chen and Goodman (1998) is performed.
The file format is
.br
	\fBorder\fP \fIN\fP
.br
	\fBvocabsize\fP \fIV\fP
.br
	\fBtotalcount\fP \fIC\fP
.br
	\fBmixweights\fP \fIM\fP
.br
	 \fIw01\fP \fIw02\fP ... \fIw0N\fP
.br
	 \fIw11\fP \fIw12\fP ... \fIw1N\fP
.br
	 ...
.br
	 \fIwM1\fP \fIwM2\fP ... \fIwMN\fP
.br
	\fBcountmodulus\fP \fIm\fP
.br
	\fBgoogle-counts\fP \fIdir\fP
.br
	\fBcounts\fP \fIfile\fP
.br
Here
.I N
is the model order (maximal N-gram length), although as with backoff models,
the actual value used is overridden by the
.B \-order
command line option when the model is read in.
.I V
gives the vocabulary size and
.I C
the sum of all unigram counts.
.I M
specifies the number of mixture weight bins (minus 1).
.I m
is the width of a mixture weight bin.
Thus,
.I wij
is the mixture weight used to interpolate an
.IR j -th
order maximum-likelihood estimate with lower-order estimates given that
the (\fIj\fP-1)-gram context has been seen with a frequency
between
.IR i * m
and
.RI ( i +1)* m -1
times.
(For contexts with frequency greater than
.IR M * m ,
the
.IR i = M
weights are used.)
The N-gram counts themselves are given in an
indexed directory structure rooted at
.IR dir ,
in an external
.IR file ,
or, if
.I file
is the string
.BR - ,
starting on the line following the
.B counts
keyword.
.TP
.BI \-vocab " file"
Initialize the vocabulary for the LM from
.IR file .
This is especially useful if the LM itself does not specify a complete
vocabulary, e.g., as with
.BR \-null .
.TP
.BI \-vocab-aliases " file"
Read vocabulary alias definitions from
.IR file ,
consisting of lines of the form
.br
	\fIalias\fP \fIword\fP
.br
This causes all tokens
.I alias
to be mapped to
.IR word .
.TP
.BI \-nonevents " file"
Read a list of words from
.I file
that are to be considered non-events, i.e., that
should only occur in LM contexts, but not as predictions.
Such words are excluded from sentence generation
.RB ( \-gen )
and probability summation
.RB ( "\-ppl \-debug 3" ).
.TP
.B \-limit-vocab
Discard LM parameters on reading that do not pertain to the words
specified in the vocabulary.
The default is that words used in the LM are automatically added to the
vocabulary.
This option can be used to reduce the memory requirements for large LMs
that are going to be evaluated only on a small vocabulary subset.
.TP
.B \-unk
Indicates that the LM contains the unknown word, i.e., is an open-class LM.
.TP
.BI \-map-unk " word"
Map out-of-vocabulary words to
.IR word ,
rather than the default
.B <unk>
tag.
.TP
.B \-tolower
Map all vocabulary to lowercase.
Useful if case conventions for text/counts and language model differ.
.TP
.B \-multiwords
Split input words consisting of multiwords joined by underscores
into their components, before evaluating LM probabilities.
.TP
.BI \-mix-lm " file"
Read a second N-gram model for interpolation purposes.
The second and any additional interpolated models can also be class N-grams
(using the same
.B \-classes
definitions), but are otherwise constrained to be standard N-grams, i.e.,
the options
.BR \-df ,
.BR \-tagged ,
.BR \-skip ,
and
.B \-hidden-vocab
do not apply to them.
.br
.B NOTE:
Unless
.B \-bayes
(see below) is specified,
.B \-mix-lm
triggers a static interpolation of the models in memory.
In most cases a more efficient, dynamic interpolation is sufficient, requested
by
.BR "\-bayes 0" .
Also, mixing models of different type (e.g., word-based and class-based)
will
.I only
work correctly with dynamic interpolation.
.TP
.BI \-lambda " weight"
Set the weight of the main model when interpolating with
.BR \-mix-lm .
Default value is 0.5.
.TP
.BI \-mix-lm2 " file"
.TP
.BI \-mix-lm3 " file"
.TP
.BI \-mix-lm4 " file"
.TP
.BI \-mix-lm5 " file"
.TP
.BI \-mix-lm6 " file"
.TP
.BI \-mix-lm7 " file"
.TP
.BI \-mix-lm8 " file"
.TP
.BI \-mix-lm9 " file"
Up to 9 more N-gram models can be specified for interpolation.
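.PP
For example (the file names are placeholders), two models can be interpolated
dynamically, with weight 0.7 on the main model, by
.br
	ngram -lm main.lm -mix-lm other.lm -lambda 0.7 -bayes 0 -ppl test.txt
.br
following the note under
.B \-mix-lm
above.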
