.\" $Id: ngram-count.1,v 1.33 2006/09/04 09:13:10 stolcke Exp $.TH ngram-count 1 "$Date: 2006/09/04 09:13:10 $" "SRILM Tools".SH NAMEngram-count \- count N-grams and estimate language models.SH SYNOPSIS.B ngram-count[\c.BR \-help ].I option \&....SH DESCRIPTION.B ngram-countgenerates and manipulates N-gram counts, and estimates N-gram languagemodels from them.The program first builds an internal N-gram count set, eitherby reading counts from a file, or by scanning text input.Following that, the resulting counts can be output back to a fileor used for building an N-gram language model in ARPA.BR ngram-format (5).Each of these actions is triggered by corresponding options, asdescribed below..SH OPTIONS.PPEach filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicatestdin/stdout..TP.B \-helpPrint option summary..TP.B \-versionPrint version information..TP.BI \-order " n"Set the maximal order (length) of N-grams to count.This also determines the order of the estimated LM, if any.The default order is 3..TP.BI \-vocab " file"Read a vocabulary from file.Subsequently, out-of-vocabulary words in both counts or text arereplaced with the unknown-word token.If this option is not specified all words found are implicitly addedto the vocabulary..TP.BI \-vocab-aliases " file"Reads vocabulary alias definitions from.IR file ,consisting of lines of the form.br	\fIalias\fP \fIword\fP.brThis causes all tokens.I aliasto be mapped to.IR word ..TP.BI \-write-vocab " file"Write the vocabulary built in the counting process to.IR file ..TP.B \-taggedInterpret text and N-grams as consisting of word/tag pairs..TP.B \-tolowerMap all vocabulary to lowercase..TP.B \-memusePrint memory usage statistics..SS Counting Options.TP.BI \-text " textfile"Generate N-gram counts from text file..I textfileshould contain one sentence unit per line.Begin/end sentence tokens are added if not already present.Empty lines are ignored..TP.BI \-read " countsfile"Read N-gram counts from a file.Ascii count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace.Repeated counts for the same N-gram are added.Thus several count files can be merged by using .BR cat (1)and feeding the result to .BR "ngram-count \-read \-" (but see.BR ngram-merge (1)for merging counts that exceed available memory).Counts collected by .B \-textand .B \-readare additive as well.Binary count files (see below) are also recognized..TP.BI \-read-google " dir"Read N-grams counts from an indexed directory structure rooted in.BR dir ,in a format developed byGoogle to store very large N-gram collections.The corresponding directory structure can be created using the script.B make-google-ngramsdescribed in.BR training-scripts (1)..TP.BI \-write " file"Write total counts to.IR file ..TP.BI \-write-binary " file"Write total counts to .I file in binary format.Binary count files cannot be compressed and are typicallylarger than compressed ascii count files.However, they can be loaded faster, especially when the.B \-limit-vocab option is used..I.TP.BI \-write-order " n"Order of counts to write.The default is 0, which stands for N-grams of all lengths..TP.BI -write "n file"where.I nis 1, 2, 3, 4, 5, 6, 7, 8, or 9.Writes only counts of the indicated order to.IR file .This is convenient to generate counts of different orders separately in a single pass..TP.B \-sortOutput counts in lexicographic order, as required for.BR ngram-merge (1)..TP.B \-recomputeRegenerate lower-order counts by 
.SS LM Options
.TP
.BI \-lm " lmfile"
Estimate a backoff N-gram model from the total counts, and write it
to
.I lmfile
in
.BR ngram-format (5).
.TP
.BI \-nonevents " file"
Read a list of words from
.I file
that are to be considered non-events, i.e., that
can only occur in the context of an N-gram.
Such words are given zero probability mass in model estimation.
.TP
.B \-float-counts
Enable manipulation of fractional counts.
Only certain discounting methods support non-integer counts.
.TP
.B \-skip
Estimate a ``skip'' N-gram model, which predicts a word by
an interpolation of the immediate context and the context one word prior.
This also triggers N-gram counts to be generated that are one word longer
than the indicated order.
The following four options control the EM estimation algorithm used for
skip-N-grams.
.TP
.BI \-init-lm " lmfile"
Load an LM to initialize the parameters of the skip-N-gram.
.TP
.BI \-skip-init " value"
The initial skip probability for all words.
.TP
.BI \-em-iters " n"
The maximum number of EM iterations.
.TP
.BI \-em-delta " d"
The convergence criterion for EM: if the relative change in log likelihood
falls below the given value, iteration stops.
.TP
.B \-count-lm
Estimate a count-based interpolated LM using Jelinek-Mercer smoothing
(Chen & Goodman, 1998).
Several of the options for skip-N-gram LMs (above) apply.
An initial count-LM in the format described in
.BR ngram (1)
needs to be specified using
.BR \-init-lm .
The options
.B \-em-iters
and
.B \-em-delta
control termination of the EM algorithm.
Note that the N-gram counts used to estimate the maximum-likelihood
estimates come from the
.B \-init-lm
model.
The counts specified with
.B \-read
or
.B \-text
are used only to estimate the smoothing (interpolation weights).
.TP
.B \-unk
Build an ``open vocabulary'' LM, i.e., one that contains the unknown-word
token as a regular word.
The default is to remove the unknown word.
.TP
.BI \-map-unk " word"
Map out-of-vocabulary words to
.IR word ,
rather than the default
.B <unk>
tag.
.TP
.B \-trust-totals
Force the lower-order counts to be used as total counts in estimating
N-gram probabilities.
Usually these totals are recomputed from the higher-order counts.
.TP
.BI \-prune " threshold"
Prune N-gram probabilities if their removal causes (training set)
perplexity of the model to increase by less than
.I threshold
relative.
.TP
.BI \-minprune " n"
Only prune N-grams of length at least
.IR n .
The default (and minimum allowed value) is 2, i.e., only unigrams are excluded
from pruning.
.TP
.BI \-debug " level"
Set debugging output from estimated LM at
.IR level .
Level 0 means no debugging.
Debugging messages are written to stderr.
.TP
.BI \-gt\fIn\fPmin " count"
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Set the minimal count of N-grams of order
.I n
that will be included in the LM.
All N-grams with frequency lower than that will effectively be discounted to 0.
If
.I n
is omitted the parameter for N-grams of order > 9 is set.
.br
NOTE: This option affects not only the default Good-Turing discounting
but the alternative discounting methods described below as well.
.TP
.BI \-gt\fIn\fPmax " count"
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Set the maximal count of N-grams of order
.I n
that are discounted under Good-Turing.
All N-grams more frequent than that will receive
maximum likelihood estimates.
Discounting can be effectively disabled by setting this to 0.
If
.I n
is omitted the parameter for N-grams of order > 9 is set.
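.PP
As an illustrative sketch (the file names
.I counts.gz
and
.I lm.gz
are placeholders, and the cutoff values are examples rather than
recommendations), a trigram backoff model using the default Good-Turing
discounting could be estimated from previously written counts with
.br
	ngram-count \-order 3 \-read counts.gz \-gt2min 1 \-gt3min 2 \-lm lm.gz
.br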
.PP
In the following discounting parameter options, the order
.I n
may be omitted, in which case a default for all N-gram orders is
set.
The corresponding discounting method then becomes the default method
for all orders, unless specifically overridden by an option with
.IR n .
If no discounting method is specified, Good-Turing is used.
.TP
.BI \-gt\fIn\fP " gtfile"
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Save or retrieve Good-Turing parameters
(cutoffs and discounting factors) in/from
.IR gtfile .
This is useful as GT parameters should always be determined from
unlimited vocabulary counts, whereas the eventual LM may use a
limited vocabulary.
The parameter files may also be hand-edited.
If an
.B \-lm
option is specified the GT parameters are read from
.IR gtfile ,
otherwise they are computed from the current counts and saved in
.IR gtfile .
.TP
.BI \-cdiscount\fIn\fP " discount"
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Use Ney's absolute discounting for N-grams of order
.IR n ,
using
.I discount
as the constant to subtract.
.TP
.B \-wbdiscount\fIn\fP
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Use Witten-Bell discounting for N-grams of order
.IR n .
(This is the estimator where the first occurrence of each word is
taken to be a sample for the ``unseen'' event.)
.TP
.B \-ndiscount\fIn\fP
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Use Ristad's natural discounting law for N-grams of order
.IR n .
.TP
.B \-kndiscount\fIn\fP
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Use Chen and Goodman's modified Kneser-Ney discounting for N-grams of order
.IR n .
.TP
.B \-kn-counts-modified
Indicates that input counts have already been modified for Kneser-Ney smoothing.
If this option is not given, the KN discounting method modifies counts
(except those of highest order) in order to estimate the backoff distributions.
When using the
.B \-write
and related options the output will reflect the modified counts.
.TP
.B \-kn-modify-counts-at-end
Modify Kneser-Ney counts after estimating discounting constants, rather than
before as is the default.
.TP
.BI \-kn\fIn\fP " knfile"
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Save or retrieve Kneser-Ney parameters
(cutoff and discounting constants) in/from
.IR knfile .
This is useful as smoothing parameters should always be determined from
unlimited vocabulary counts, whereas the eventual LM may use a
limited vocabulary.
The parameter files may also be hand-edited.
If an
.B \-lm
option is specified the KN parameters are read from
.IR knfile ,
otherwise they are computed from the current counts and saved in
.IR knfile .
.TP
.B \-ukndiscount\fIn\fP
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Use the original (unmodified) Kneser-Ney discounting method for N-grams of
order
.IR n .
.PP
In the above discounting options, if the parameter
.I n
is omitted the option sets the default discounting method for all N-grams of
length greater than 9.
.TP
.B \-interpolate\fIn\fP
where
.I n
is 1, 2, 3, 4, 5, 6, 7, 8, or 9.
Causes the discounted N-gram probability estimates at the specified order
.I n
to be interpolated with lower-order estimates.
(The result of the interpolation is encoded as a standard backoff
model and can be evaluated as such -- the interpolation happens at
estimation time.)
This sometimes yields better models with some smoothing methods
(see Chen & Goodman, 1998).
Only Witten-Bell, absolute discounting, and modified Kneser-Ney smoothing
currently support interpolation.
.TP
.BI \-meta-tag " string"
Interpret words starting with
.I string
as count-of-count (meta-count) tags.
For example, an N-gram
.br
	a b \fIstring\fP3	4
.br
means that there were 4 trigrams starting with "a b"
that occurred 3 times each.
Meta-tags are only allowed in the last position of an N-gram.
.br
Note: when using
.B \-tolower
the meta-tag
.I string
must not contain any uppercase characters.
.TP
.B \-read-with-mincounts
Save memory by eliminating N-grams with counts that fall below the thresholds
set by
.BI \-gt N min
options during
.B \-read
operation (this assumes the input counts contain no duplicate N-grams).
Also, if
.B \-meta-tag
is defined,
these low-count N-grams will be converted to count-of-count N-grams,
so that smoothing methods that need this information still work correctly.
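.PP
The following sketches combine several of the options above; all file names
are placeholders.
A 4-gram model smoothed with modified Kneser-Ney discounting, interpolated
with lower-order estimates, and built as an open-vocabulary LM:
.br
	ngram-count \-order 4 \-text corpus.txt \-kndiscount \-interpolate \-unk \-lm kn4.lm
.br
Following the advice under
.B \-kn\fIn\fP
and in the BUGS section below, discounting parameters could first be computed
from full counts and saved, then read back (because
.B \-lm
is given) when estimating a vocabulary-limited model:
.br
	ngram-count \-order 3 \-read full.counts \-kndiscount \-kn1 kn1.params \-kn2 kn2.params \-kn3 kn3.params
.br
	ngram-count \-order 3 \-read full.counts \-vocab vocab.txt \-limit-vocab \-kndiscount \-kn1 kn1.params \-kn2 kn2.params \-kn3 kn3.params \-lm limited.lm
.br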
.SH "SEE ALSO"
ngram-merge(1), ngram(1), ngram-class(1), training-scripts(1), lm-scripts(1),
ngram-format(5).
.br
S. F. Chen and J. Goodman, ``An Empirical Study of Smoothing Techniques for
Language Modeling,'' TR-10-98, Computer Science Group, Harvard Univ., 1998.
.br
S. M. Katz, ``Estimation of Probabilities from Sparse Data for the
Language Model Component of a Speech Recognizer,'' \fIIEEE Trans. ASSP\fP 35(3),
400\-401, 1987.
.br
R. Kneser and H. Ney, ``Improved backing-off for M-gram language modeling,''
\fIProc. ICASSP\fP, 181\-184, 1995.
.br
H. Ney and U. Essen, ``On Smoothing Techniques for Bigram-based Natural
Language Modelling,'' \fIProc. ICASSP\fP, 825\-828, 1991.
.br
E. S. Ristad, ``A Natural Law of Succession,'' CS-TR-495-95,
Comp. Sci. Dept., Princeton Univ., 1995.
.br
I. H. Witten and T. C. Bell, ``The Zero-Frequency Problem: Estimating the
Probabilities of Novel Events in Adaptive Text Compression,''
\fIIEEE Trans. Information Theory\fP 37(4), 1085\-1094, 1991.
.SH BUGS
Several of the LM types supported by
.BR ngram (1)
don't have explicit support in
.BR ngram-count .
Instead, they are built by separately manipulating N-gram counts, followed by
standard N-gram model estimation.
.br
LM support for tagged words is incomplete.
.br
Only absolute and Witten-Bell discounting currently support fractional counts.
.br
The combination of
.B \-read-with-mincounts
and
.B \-meta-tag
preserves enough count-of-count information for
.I applying
discounting parameters to the input counts, but it does not necessarily allow
the parameters to be correctly
.IR estimated .
Therefore, discounting parameters should always be estimated from full counts
(e.g., using the helper
.BR training-scripts (1)),
and then read from files.
.SH AUTHOR
Andreas Stolcke <stolcke@speech.sri.com>.
.br
Copyright 1995\-2006 SRI International
