<!-- $Id: ngram-class.1,v 1.5 2004/12/03 17:59:01 stolcke Exp $ -->
<HTML>
<HEAD><TITLE>ngram-class</TITLE></HEAD>
<BODY>
<H1>ngram-class</H1>
<H2> NAME </H2>
ngram-class - induce word classes from N-gram statistics
<H2> SYNOPSIS </H2>
<B>ngram-class</B> [<B>-help</B>] <I>option</I> ...
<H2> DESCRIPTION </H2>
<B>ngram-class</B> induces word classes from distributional statistics, so as to minimize the perplexity of a class-based N-gram model given the provided word N-gram counts. Presently, only bigram statistics are used, i.e., the induced classes are best suited for a class-bigram language model.
<P>
The program generates the class N-gram counts and class expansions needed by <A HREF="ngram-count.html">ngram-count(1)</A> and <A HREF="ngram.html">ngram(1)</A>, respectively, to train and to apply the class N-gram model.
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, a compressed file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>Print option summary.
<DT><B> -version </B>
<DD>Print version information.
<DT><B>-debug</B><I> level</I>
<DD>Set debugging output at <I>level</I>. Level 0 means no debugging. Debugging messages are written to stderr. A useful level for tracing the formation of classes is 2.
</DD></DL>
<H3> Input Options </H3>
<DL>
<DT><B>-vocab</B><I> file</I>
<DD>Read a vocabulary from <I>file</I>. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
<DT><B> -tolower </B>
<DD>Map the vocabulary to lowercase.
<DT><B>-counts</B><I> file</I>
<DD>Read N-gram counts from a file. Each line contains an N-gram of words, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added. Counts collected by <B>-text</B> and <B>-counts</B> are additive as well.
<BR>
Note that the input should contain consistent lower- and higher-order counts (i.e., unigrams and bigrams), as would be generated by <A HREF="ngram-count.html">ngram-count(1)</A>.
<DT><B>-text</B><I> textfile</I>
<DD>Generate N-gram counts from a text file. <I>textfile</I> should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
</DD></DL>
<H3> Class Merging </H3>
<DL>
<DT><B>-numclasses</B><I> C</I>
<DD>The target number of classes to induce. A zero argument suppresses automatic class merging altogether (e.g., for use with <B>-interact</B>).
<DT><B> -full </B>
<DD>Perform full greedy merging over all classes, starting with one class per word. This is the O(V^3) algorithm described in Brown et al. (1992).
<DT><B> -incremental </B>
<DD>Perform incremental greedy merging, starting with one class each for the <I>C</I> most frequent words, and then adding one word at a time. This is the O(V*C^2) algorithm described in Brown et al. (1992); it is the default.
<DT><B> -interact </B>
<DD>Enter a primitive interactive interface when done with automatic class induction, allowing manual specification of additional merging steps.
<DT><B>-noclass-vocab</B><I> file</I>
<DD>Read a list of vocabulary items from <I>file</I> that are to be excluded from classes. These words or tags do not undergo class merging, but their N-gram counts still affect the optimization of model perplexity.
<BR>
The default is to exclude the sentence begin/end tags (&lt;s&gt; and &lt;/s&gt;) from class merging; this can be suppressed by specifying <B>-noclass-vocab /dev/null</B>.
</DD></DL>
<H3> Output Options </H3>
<DL>
<DT><B>-class-counts</B><I> file</I>
<DD>Write class N-gram counts to <I>file</I> when done. The format is the same as for word N-gram counts, and can be read by <A HREF="ngram-count.html">ngram-count(1)</A> to estimate a class N-gram model.
<DT><B>-classes</B><I> file</I>
<DD>Write class definitions (member words and their probabilities) to <I>file</I> when done. The output format is the same as required by the <B>-classes</B> option of <A HREF="ngram.html">ngram(1)</A>.
<DT><B>-save</B><I> S</I>
<DD>Save the class counts and/or class definitions every <I>S</I> iterations during induction. The filenames are obtained from the <B>-class-counts</B> and <B>-classes</B> options, respectively, by appending the iteration number. This is convenient for producing sets of classes at different granularities during the same run. <I>S</I>=0 (the default) suppresses the saving actions.
</DD></DL>
<H2> SEE ALSO </H2>
<A HREF="ngram-count.html">ngram-count(1)</A>, <A HREF="ngram.html">ngram(1)</A>.
<BR>
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer, ``Class-Based n-gram Models of Natural Language,'' <I>Computational Linguistics</I> 18(4), 467-479, 1992.
<H2> BUGS </H2>
Classes are optimized only for bigram models at present.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@speech.sri.com&gt;.
<BR>
Copyright 1999-2004 SRI International
</BODY></HTML>
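The greedy merging idea behind this tool can be sketched in a few dozen lines of Python. This is an illustrative reimplementation, not the SRILM source: the function names are invented, sentence begin/end tags are omitted for brevity, and the objective is recomputed from scratch for every candidate merge (the real O(V^3) full and O(V*C^2) incremental algorithms of Brown et al. maintain it incrementally). It relies on a result from Brown et al. (1992): with maximum-likelihood estimates, minimizing the perplexity of a class-bigram model is equivalent to maximizing the average mutual information between adjacent classes.

```python
import math
from collections import Counter
from itertools import combinations

def bigram_counts(sentences):
    """Collect word-bigram counts from tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            counts[(w1, w2)] += 1
    return counts

def class_mi(bigrams, assign):
    """Average mutual information between adjacent classes under the
    word-to-class mapping `assign` (the quantity the merging preserves)."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = assign[w1], assign[w2]
        joint[c1, c2] += n
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )

def induce_classes(bigrams, num_classes):
    """Full greedy merging (the -full flavor): start with one class per
    word and repeatedly merge the pair of classes whose merge loses the
    least mutual information, until num_classes classes remain."""
    vocab = sorted({w for pair in bigrams for w in pair})
    assign = {w: i for i, w in enumerate(vocab)}
    while len(set(assign.values())) > num_classes:
        best_mi, best_assign = None, None
        for a, b in combinations(sorted(set(assign.values())), 2):
            # Tentatively merge class b into class a and score the result.
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            mi = class_mi(bigrams, trial)
            if best_mi is None or mi > best_mi:
                best_mi, best_assign = mi, trial
        assign = best_assign
    return assign
```

On a toy corpus such as "the cat sat / a dog sat / the dog ran / a cat ran", merging six singleton classes down to three can be done without losing any bigram mutual information, which is why greedy merging finds useful classes here despite never backtracking.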
