<!-- $Id: lm-scripts.1,v 1.7 2006/11/18 22:32:45 stolcke Exp $ -->
<HTML>
<HEAD><TITLE>lm-scripts</TITLE></HEAD>
<BODY>
<H1>lm-scripts</H1>
<H2> NAME </H2>
lm-scripts, add-dummy-bows, change-lm-vocab, empty-sentence-lm, get-unigram-probs, make-hiddens-lm, make-lm-subset, make-sub-lm, remove-lowprob-ngrams, reverse-lm, sort-lm - manipulate N-gram language models
<H2> SYNOPSIS </H2>
<B>add-dummy-bows</B> [ <I>lm-file</I> ] &gt; <I>new-lm-file</I>
<BR>
<B>change-lm-vocab</B> <B>-vocab</B> <I>vocab</I> <B>-lm</B> <I>lm-file</I> <B>-write-lm</B> <I>new-lm-file</I> [ <B>-tolower</B> ] [ <B>-subset</B> ] [ <I>ngram-options</I> ... ]
<BR>
<B>empty-sentence-lm</B> <B>-prob</B> <I>p</I> <B>-lm</B> <I>lm-file</I> <B>-write-lm</B> <I>new-lm-file</I> [ <I>ngram-options</I> ... ]
<BR>
<B>get-unigram-probs</B> [ <B>linear=1</B> ]
<BR>
<B>make-hiddens-lm</B> [ <I>lm-file</I> ] &gt; <I>hiddens-lm-file</I>
<BR>
<B>make-lm-subset</B> <I>count-file</I> | <B>-</B> [ <I>lm-file</I> | <B>-</B> ]
<BR>
<B>make-sub-lm</B> [ <B>maxorder=</B><I>N</I> ] [ <I>lm-file</I> ] &gt; <I>new-lm-file</I>
<BR>
<B>remove-lowprob-ngrams</B> [ <I>lm-file</I> ] &gt; <I>new-lm-file</I>
<BR>
<B>reverse-lm</B> [ <I>lm-file</I> ] &gt; <I>new-lm-file</I>
<BR>
<B>sort-lm</B> [ <I>lm-file</I> ] &gt; <I>sorted-lm-file</I>
<H2> DESCRIPTION </H2>
These scripts perform various useful manipulations on N-gram models in their textual representation.
Most operate on backoff N-grams in ARPA <A HREF="ngram-format.html">ngram-format(5)</A>.
<P>
Since these tools are implemented as scripts, they do not automatically input or output compressed model files correctly, unlike the main SRILM tools.
However, since most scripts read from standard input or write to standard output (by leaving out the file argument, or specifying it as ``-''), it is easy to combine them with
<A HREF="gunzip.html">gunzip(1)</A> or <A HREF="gzip.html">gzip(1)</A> on the command line, as shown in the examples below.
<P>
Also note that many of the scripts take their options with the <A HREF="gawk.html">gawk(1)</A> syntax
<I>option</I><B>=</B><I>value</I>
instead of the more common
<B>-</B><I>option</I> <I>value</I>.
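<P>
For example, a gzip-compressed model can be sorted, or its unigram probabilities extracted, without ever writing an uncompressed copy to disk; note the gawk(1)-style option in the second command line (the file names here are purely illustrative):
<PRE>
    gunzip -c big.lm.gz | sort-lm | gzip &gt; sorted.lm.gz
    gunzip -c big.lm.gz | get-unigram-probs linear=1 &gt; unigrams.txt
</PRE>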
<P>
<B>add-dummy-bows</B>
adds dummy backoff weights to N-grams, even where they are not required, to satisfy some broken software that expects backoff weights on all N-grams (except those of the highest order).
<P>
<B>change-lm-vocab</B>
modifies the vocabulary of an LM to be that in <I>vocab</I>.
Any N-grams containing out-of-vocabulary words are removed, new words receive a unigram probability, and the model is renormalized.
The <B>-tolower</B> option causes case distinctions to be ignored.
<B>-subset</B> only removes words from the LM vocabulary, without adding any.
Any remaining <I>ngram-options</I> are passed to <A HREF="ngram.html">ngram(1)</A>, and can be used to set the debugging level, N-gram order, etc.
<P>
<B>empty-sentence-lm</B>
modifies an LM so that it allows the empty sentence with probability <I>p</I>.
This is useful to modify existing LMs that are trained on non-empty sentences only.
<I>ngram-options</I> are passed to <A HREF="ngram.html">ngram(1)</A>, and can be used to set the debugging level, N-gram order, etc.
<P>
<B>make-hiddens-lm</B>
constructs an N-gram model that can be used with the <B>ngram -hiddens</B> option.
The new model contains intra-utterance sentence boundary tags ``&lt;#s&gt;'' with the same probability as the final sentence tag &lt;/s&gt; in the original model.
Also, utterance-initial words are not conditioned on &lt;s&gt;, and there is no penalty associated with utterance-final &lt;/s&gt;.
Such a model might work better if the test corpus is segmented at places other than proper &lt;s&gt; boundaries.
<P>
<B>make-lm-subset</B>
forms a new LM containing only the N-grams found in the <I>count-file</I>, in <A HREF="ngram-count.html">ngram-count(1)</A> format.
The result still needs to be renormalized with <B>ngram -renorm</B> (which will also adjust the N-gram counts in the header); see the example at the end of this section.
<P>
<B>make-sub-lm</B>
removes N-grams of order exceeding <I>N</I>.
This function is now redundant, since all SRILM tools can do this implicitly (without using extra memory, and with very small time overhead) when reading N-gram models with the appropriate <B>-order</B> parameter.
<P>
<B>remove-lowprob-ngrams</B>
eliminates N-grams whose probability is lower than the probability they would receive through backoff.
This is useful when building finite-state networks for N-gram models.
However, this function is now performed much faster by <A HREF="ngram.html">ngram(1)</A> with the <B>-prune-lowprobs</B> option.
<P>
<B>reverse-lm</B>
produces a new LM that generates sentences with probabilities equal to those of the reversed sentences in the input model.
<P>
<B>sort-lm</B>
sorts the N-grams in an LM into lexicographic order (left-most words being the most significant).
This is not a requirement for SRILM, but might be necessary for some other LM software.
(The LMs output by SRILM are sorted somewhat differently, reflecting the internal data structures used; that is also the order that should give the best cache utilization when using SRILM to read models.)
<P>
<B>get-unigram-probs</B>
extracts the unigram probabilities in a simple table format from a backoff language model.
The <B>linear=1</B> option causes probabilities to be output on a linear (instead of log) scale.
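<P>
As a sketch of how these pieces combine, the following restricts a model to the N-grams listed in a count file and then renormalizes the result with <B>ngram -renorm</B>; the file names and the choice of a trigram <B>-order</B> are illustrative only:
<PRE>
    make-lm-subset subset.counts big.lm &gt; raw-subset.lm
    ngram -order 3 -lm raw-subset.lm -renorm -write-lm subset.lm
</PRE>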
<H2> SEE ALSO </H2>
<A HREF="ngram-format.html">ngram-format(5)</A>, <A HREF="ngram.html">ngram(1)</A>.
<H2> BUGS </H2>
These are quick-and-dirty scripts, what do you expect?
<BR>
<B>reverse-lm</B> supports only bigram LMs, and can produce improper probability estimates as a result of inconsistent marginals in the input model.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@speech.sri.com&gt;.
<BR>
Copyright 1995-2006 SRI International
</BODY>
</HTML>