<! $Id: select-vocab.1,v 1.4 2003/12/14 02:43:14 stolcke Exp $>
<HTML>
<HEADER>
<TITLE>select-vocab</TITLE>
<BODY>
<H1>select-vocab</H1>
<H2> NAME </H2>
select-vocab - Select a maximum-likelihood vocabulary from a mixture of corpora.
<H2> SYNOPSIS </H2>
<B> select-vocab </B>[<I> -options </I>... ]<B> -heldout </B><I> file </I><I> f1 </I><I> f2 </I>...<I> fn </I>
<H2> DESCRIPTION </H2>
<B> select-vocab </B>
picks a vocabulary from the union of the vocabularies of files
<I> f1 </I>through<I> fn </I>
in order to maximize the likelihood of the heldout file.
When invoked as above, the program prints an unsorted list of all the words
in the input corpora together with their weights.
This list may subsequently be sorted in decreasing order of weight,
and a vocabulary may be chosen by picking a suitable threshold weight
and ignoring words whose weight falls below it.
A number of automatically detected formats are supported for the input
files<I> f1 </I>through<I> fn. </I>
They can be count files, which are characterized by each line ending
in a number, ARPA language models in
<A HREF="ngram-format.html">ngram-format(5)</A>,
or simply text files.
If they are text files whose names end in ".sentid", the first field of
each line is assumed to be a sentence identifier and is discarded.
Furthermore, all of the input files can also be compressed (if gzip is
installed and available on the system).
<H2> OPTIONS </H2>
<DL>
<DT><B> -help </B>
<DD>
Prints a short help message.
<DT><B>-heldout</B><I> file</I><B></B>
<DD>
Likelihood maximization is performed on the contents of
<I> file. </I>
This file may also be in any of the formats supported for the input
corpora, namely: text, counts, sentid, or ARPA-lm.
<DT><B> -quiet </B>
<DD>
Suppresses printing of progress and other informative messages during
execution.
By default the script writes these out to the standard error stream.
<DT><B>-scale</B><I> n</I><B></B>
<DD>
The combined final counts are scaled by <I> n </I>
before being written out.
This makes it possible to sort the output list numerically with
<A HREF="sort.html">sort(1)</A>.
The default scale is 1e6.
</DD>
</DL>
<H2> NOTES </H2>
This implementation corrects a minor error in the algorithm
specification in [1].
The paper describes corpus-level interpolation,
but the script actually does word-level interpolation.
The program is written in <A HREF="perl.html">perl(1)</A>
and requires it to be installed in order to run.
<H2> SEE ALSO </H2>
<A HREF="ngram-count.html">ngram-count(1)</A>, <A HREF="ngram-format.html">ngram-format(5)</A>, <A HREF="training-scripts.html">training-scripts(1)</A>.
<BR>
[1] A. Venkataraman and W. Wang, "Techniques for effective vocabulary
selection", in <I>Proceedings of Eurospeech</I>, Geneva, 2003.
<H2> BUGS </H2>
Probably.
Send bug reports, fixes, modifications and enhancements to
Anand Venkataraman (anand@speech.sri.com).
<H2> SOURCE </H2>
Download as part of the SRILM toolkit, or stand-alone from
http://www.speech.sri.com/people/anand/downloads/selvoc-v1.tar.gz
<H2> AUTHORS </H2>
Anand Venkataraman &lt;anand@speech.sri.com&gt;<BR>
Wen Wang &lt;wwang@speech.dsri.com&gt;<BR>
Copyright 2003 SRI International
</BODY>
</HTML>