📄 ngram-class.1
字号:
.\" $Id: ngram-class.1,v 1.5 2004/12/03 17:59:01 stolcke Exp $.TH ngram-class 1 "$Date: 2004/12/03 17:59:01 $" "SRILM Tools".SH NAMEngram-class \- induce word classes from N-gram statistics.SH SYNOPSIS.B ngram-class[\c.BR \-help ].I option \&....SH DESCRIPTION.B ngram-classinduces word classes from distributional statistics,so as to minimize perplexity of a class-based N-gram modelgiven the provided word N-gram counts.Presently, only bigram statistics are used, i.e., the induced classesare best suited for a class-bigram language model..PPThe program generates the class N-gram counts and class expansionsneeded by.BR ngram-count (1)and.BR ngram (1),respectively to train and to apply the class N-gram model..SH OPTIONS.PPEach filename argument can be an ASCII file, or a compressed file (name ending in .Z or .gz), or ``-'' to indicatestdin/stdout..TP.B \-helpPrint option summary..TP.B \-versionPrint version information..TP.BI \-debug " level"Set debugging output at.IR level .Level 0 means no debugging.Debugging messages are written to stderr.A useful level to trace the formation of classes is 2..SS Input Options.TP.BI \-vocab " file"Read a vocabulary from file.Subsequently, out-of-vocabulary words in both counts or text arereplaced with the unknown-word token.If this option is not specified all words found are implicitly addedto the vocabulary..TP.B \-tolowerMap the vocabulary to lowercase..TP.BI \-counts " file"Read N-gram counts from a file.Each line contains an N-gram of words, followed by an integer count, all separated by whitespace.Repeated counts for the same N-gram are added.Counts collected by .B \-textand .B \-countsare additive as well..brNote that the input should contain consistent lower- and higher-ordercounts (i.e., unigrams and bigrams), as would be generated by.BR ngram-count (1)..TP.BI \-text " textfile"Generate N-gram counts from text file..I textfileshould contain one sentence unit per line.Begin/end sentence tokens are added if not already present.Empty lines are ignored..SS Class Merging.TP.BI \-numclasses " C"The target number of classes to induce.A zero argument suppresses automatic class merging altogether(e.g., for use with .B \-interact)..TP.B \-fullPerform full greedy merging over all classes starting with one class perword.This is the O(V^3) algorithm described in Brown et al. (1992)..TP.B \-incrementalPerform incremental greedy merging, starting with one class each for the .I Cmost frequent words, and then adding one word at a time.This is the O(V*C^2) algorithm described in Brown et al. (1992);it is the default..TP.B \-interactEnter a primitive interactive interface when done with automatic classinduction, allowing manual specification of additional merging steps..TP.BI \-noclass-vocab " file"Read a list of vocabulary items from.I filethat are to be excluded from classes.These words or tags do no undergo class merging, but their N-gram counts still affect the optimization of model perplexity..brThe default is to exclude the sentence begin/end tags (<s> and </s>)from class merging; this can be suppressed by specifying.BR "\-noclass-vocab /dev/null" ..SS Output Options.TP.BI \-class-counts " file"Write class N-gram counts to.I file when done.The format is the same as for word N-gram counts, and can beread by.BR ngram-count (1)to estimate a class-N-gram model..TP.BI \-classes " file"Write class definitions (member words and their probabilities) to.I filewhen done.The output format is the same as required by the.B \-classesoption of .BR ngram (1)..TP.BI \-save " S"Save the class counts and/or class definitions every.I Siterations during induction.The filenames are obtained from the.B \-class-countsand.B \-classesoptions, respectively, by appending the iteration number.This is convenient for producing sets of classes at different granularitiesduring the same run..IR S =0(the default) suppresses the saving actions..SH "SEE ALSO"ngram-count(1), ngram(1)..brP. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer,``Class-Based n-gram Models of Natural Language,''\fIComputational Linguistics\fP 18(4), 467\-479, 1992..SH BUGSClasses are optimized only for bigram models at present..SH AUTHORAndreas Stolcke <stolcke@speech.sri.com>..brCopyright 1999\-2004 SRI International
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -