<!-- $Id: hidden-ngram.1,v 1.28 2006/07/28 07:34:29 stolcke Exp $ -->
<HTML>
<HEAD><TITLE>hidden-ngram</TITLE></HEAD>
<BODY>
<H1>hidden-ngram</H1>
<H2> NAME </H2>
hidden-ngram - tag hidden events between words
<H2> SYNOPSIS </H2>
<B>hidden-ngram</B> [ <B>-help</B> ] <I>option</I> ...
<H2> DESCRIPTION </H2>
<B>hidden-ngram</B> tags a stream of word tokens with hidden events occurring between words.
For example, an unsegmented text could be tagged for sentence boundaries
(the hidden events in this case being `boundary' and `no-boundary').
The most likely hidden tag sequence consistent with the given word
sequence is found according to an N-gram language model over both
words and hidden tags.
<P>
<B>hidden-ngram</B> is a generalization of <A HREF="segment.html">segment(1)</A>.
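<P>
For illustration, a minimal invocation for sentence-boundary tagging might look as
follows. This is only a sketch: the file names are hypothetical, the language model
is assumed to be a joint word/tag N-gram, and the hidden tags listed in the
<B>-hidden-vocab</B> file must also appear in the language model.
<PRE>
    # Tag boundary events in unsegmented text read from input.txt;
    # hidden.vocab lists the hidden event tags.  Output goes to stdout.
    hidden-ngram -text input.txt \
        -lm word-tag.lm \
        -hidden-vocab hidden.vocab \
        -order 3 &gt; tagged.txt
</PRE>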
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, or a compressed file
(name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
<DL>
<DT><B> -help </B>
<DD>Print option summary.
<DT><B> -version </B>
<DD>Print version information.
<DT><B>-text</B><I> file</I>
<DD>Specifies the file containing the word sequences to be tagged
(one sentence per line).
Start- and end-of-sentence tags are <I>not</I> added by the program,
but should be included in the input if the language model uses them.
<DT><B>-escape</B><I> string</I>
<DD>Set an ``escape string.''
Input lines starting with <I>string</I> are not processed and are passed unchanged to stdout instead.
This allows associated information to be passed to scoring scripts etc.
<DT><B>-text-map</B><I> file</I>
<DD>Read the input words from a map file containing both the words and
additional likelihoods of events following each word.
Each line contains one input word, plus optional hidden-event/likelihood
pairs in the format
<BR>
	w e1 [p1] e2 [p2] ...
<BR>
If a p value is omitted a likelihood of 1 is assumed.
All events not explicitly listed are given likelihood 0, and are
hence excluded for that word.
In particular, the label <B>*noevent*</B> must be listed to allow absence of a hidden event.
Input word strings are assembled from multiple lines of <B>-text-map</B> input
until either an end-of-sentence token &lt;/s&gt; is found, or an escaped line
(see <B>-escape</B>) is encountered.
(An example map file is sketched after the option list.)
<DT><B> -logmap </B>
<DD>Interpret numeric values in the <B>-text-map</B> file as log probabilities,
rather than probabilities.
<DT><B>-lm</B><I> file</I>
<DD>Specifies the word/tag language model as a standard ARPA N-gram backoff model
file in <A HREF="ngram-format.html">ngram-format(5)</A>.
<DT><B>-order</B><I> n</I>
<DD>Set the effective N-gram order used by the language model to <I>n</I>.
Default is 3 (use a trigram model).
<DT><B>-classes</B><I> file</I>
<DD>Interpret the LM as an N-gram over word classes.
The expansions of the classes are given in <I>file</I>
in <A HREF="classes-format.html">classes-format(5)</A>.
Tokens in the LM that are not defined as classes in <I>file</I>
are assumed to be plain words, so that the LM can contain mixed N-grams over
both words and word classes.
<DT><B> -simple-classes </B>
<DD>Assume a "simple" class model: each word is a member of at most one word class,
and class expansions are exactly one word long.
<DT><B>-mix-lm</B><I> file</I>
<DD>Read a second N-gram model for interpolation purposes.
The second and any additional interpolated models can also be class N-grams
(using the same <B>-classes</B> definitions).
<DT><B> -factored </B>
<DD>Interpret the files specified by <B>-lm</B>, <B>-mix-lm</B>, etc.
as factored N-gram model specifications.
See <A HREF="ngram.html">ngram(1)</A> for more details.
<DT><B>-lambda</B><I> weight</I>
<DD>Set the weight of the main model when interpolating with <B>-mix-lm</B>.
Default value is 0.5.
<DT><B>-mix-lm2</B><I> file</I>
<DT><B>-mix-lm3</B><I> file</I>
<DT><B>-mix-lm4</B><I> file</I>
<DT><B>-mix-lm5</B><I> file</I>
<DT><B>-mix-lm6</B><I> file</I>
<DT><B>-mix-lm7</B><I> file</I>
<DT><B>-mix-lm8</B><I> file</I>
<DT><B>-mix-lm9</B><I> file</I>
<DD>Up to 9 more N-gram models can be specified for interpolation.
<DT><B>-mix-lambda2</B><I> weight</I>
<DT><B>-mix-lambda3</B><I> weight</I>
<DT><B>-mix-lambda4</B><I> weight</I>
<DT><B>-mix-lambda5</B><I> weight</I>
<DT><B>-mix-lambda6</B><I> weight</I>
<DT><B>-mix-lambda7</B><I> weight</I>
<DT><B>-mix-lambda8</B><I> weight</I>
<DT><B>-mix-lambda9</B><I> weight</I>
<DD>These are the weights for the additional mixture components, corresponding
to <B>-mix-lm2</B> through <B>-mix-lm9</B>.
The weight for the <B>-mix-lm</B> model is 1 minus the sum of <B>-lambda</B> and
<B>-mix-lambda2</B> through <B>-mix-lambda9</B>.
<DT><B>-bayes</B><I> length</I>
<DD>Set the context length used for Bayesian interpolation.
The default value is 0, giving the standard fixed interpolation weight
specified by <B>-lambda</B>.
<DT><B>-bayes-scale</B><I> scale</I>
<DD>Set the exponential scale factor on the context likelihood in conjunction
with the <B>-bayes</B> function.
Default value is 1.0.
<DT><B>-lmw</B><I> W</I>
<DD>Scales the language model probabilities by a factor <I>W</I>.
Default language model weight is 1.
<DT><B>-mapw</B><I> W</I>
<DD>Scales the likelihood map probability by a factor <I>W</I>.
Default map weight is 1.
<DT><B> -tolower </B>
<DD>Map vocabulary to lowercase, removing case distinctions.
<DT><B>-vocab</B><I> file</I>
<DD>Initialize the vocabulary for the LM from <I>file</I>.
This is useful in conjunction with <B>-limit-vocab</B>.
<DT><B>-vocab-aliases</B><I> file</I>
<DD>Reads vocabulary alias definitions from <I>file</I>,
consisting of lines of the form
<BR>
	<I>alias</I> <I>word</I>
<BR>
This causes all tokens <I>alias</I> to be mapped to <I>word</I>.
<DT><B>-hidden-vocab</B><I> file</I>
<DD>Read the list of hidden tags from <I>file</I>.
Note: This is a subset of the vocabulary contained in the language model.
<DT><B> -limit-vocab </B>
<DD>Discard LM parameters on reading that do not pertain to the words specified
in the vocabulary, either by <B>-vocab</B> or <B>-hidden-vocab</B>.
The default is that words used in the LM are automatically added to the vocabulary.
This option can be used to reduce the memory requirements for large LMs
that are going to be evaluated only on a small vocabulary subset.
<DT><B> -force-event </B>
<DD>Forces a non-default event after every word.
This is useful for language models that represent the default event
explicitly with a tag, rather than implicitly by the absence of a tag
between words (which is the default).
<DT><B> -keep-unk </B>
<DD>Do not map unknown input words to the &lt;unk&gt; token.
Instead, output the input word unchanged.
Also, with this option the LM is assumed to be open-vocabulary
(the default is closed-vocabulary).
<DT><B> -fb </B>
<DD>Perform forward-backward decoding of the input token sequence.
Outputs the tags that have the highest posterior probability, for each position.
The default is to use Viterbi decoding, i.e., the output is the
tag sequence with the highest joint posterior probability.
<DT><B> -fw-only </B>
<DD>Similar to <B>-fb</B>, but uses only the forward probabilities for computing posteriors.
This may be used to simulate on-line prediction of tags, without the
benefit of future context.
<DT><B> -continuous </B>
<DD>Process all words in the input as one sequence of words, irrespective of
line breaks.
Normally each line is processed separately as a sentence.
Input tokens are output one-per-line, followed by event tokens.
<DT><B> -posteriors </B>
<DD>Output the table of posterior probabilities for each tag position.
If <B>-fb</B> is also specified the posterior probabilities will be computed using
forward-backward probabilities; otherwise an approximation will be used
that is based on the probability of the most likely path containing a given tag
at a given position.
<DT><B> -totals </B>
<DD>Output the total string probability for each input sentence.
If <B>-fb</B> is also specified this probability is obtained by summing over all
hidden event sequences; otherwise it is calculated (i.e., underestimated)
using the most probable hidden event sequence.
<DT><B>-nbest</B><I> N</I>
<DD>Output the <I>N</I> best hypotheses instead of just the first best when
doing Viterbi search.
If <I>N</I>&gt;1, then each hypothesis is prefixed by a tag of the form
<B>NBEST_</B><I>n</I> <I>x</I>,
where <I>n</I> is the rank of the hypothesis in the N-best list and <I>x</I> its score,
the negative log of the combined probability of transitions
and observations of the corresponding HMM path.
<DT><B>-write-counts</B><I> file</I>
<DD>Write the posterior weighted counts of N-grams, including those
with hidden tags, summed over the entire input data, to <I>file</I>.
The posterior probabilities should normally be computed with the
forward-backward algorithm (instead of Viterbi), so the <B>-fb</B> option
is usually also specified.
Only N-grams whose contexts occur in the language model are output.
<DT><B>-unk-prob</B><I> L</I>
<DD>Specifies that unknown words and other words having zero probability in
the language model be assigned a log probability of <I>L</I>.
This is -100 by default but might be set to 0, e.g., to compute perplexities
excluding unknown words.
<DT><B> -debug </B>
<DD>Sets debugging output level.
</DD>
</DL>
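<P>
To illustrate the <B>-text-map</B> format described above, the following sketch shows
a hypothetical map file together with a matching invocation. The file names and the
tag name BOUNDARY are made up; the hidden tags used must exist in the language model
and be listed in the <B>-hidden-vocab</B> file. A likelihood after a candidate event
biases the decoding at that position, and <B>*noevent*</B> must be listed wherever the
absence of an event is to remain possible.
<PRE>
    # words.map: one word per line, followed by event/likelihood pairs
    this        *noevent*
    is          *noevent*
    text        *noevent* 0.3  BOUNDARY 0.7
    so          *noevent*
    is          *noevent*
    this        BOUNDARY
    &lt;/s&gt;        *noevent*

    # decode with forward-backward posteriors instead of the Viterbi path
    hidden-ngram -text-map words.map -lm word-tag.lm -hidden-vocab hidden.vocab \
        -fb -posteriors
</PRE>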
<H2> BUGS </H2>
The <B>-continuous</B> and <B>-text-map</B> options effectively disable <B>-keep-unk</B>,
i.e., unknown input words are always mapped to &lt;unk&gt;.
Also, <B>-continuous</B> doesn't preserve the positions of escaped input lines relative
to the input.
<BR>
The dynamic programming for event decoding is not efficiently interleaved
with that required to evaluate class N-gram models;
therefore, the state space generated in decoding with <B>-classes</B>
quickly becomes infeasibly large unless <B>-simple-classes</B>
is also specified (a usage sketch follows this section).
<P>
The file given by <B>-classes</B> is read multiple times if <B>-limit-vocab</B>
is in effect or if a mixture of LMs is specified.
This will lead to incorrect behavior if the argument of <B>-classes</B> is stdin (``-'').
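<P>
As a practical consequence of the class-expansion issue noted above, decoding with a
class-based LM is normally combined with <B>-simple-classes</B>. A sketch with
hypothetical file names:
<PRE>
    hidden-ngram -text input.txt -lm class-tag.lm -classes word.classes \
        -simple-classes -hidden-vocab hidden.vocab
</PRE>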
HREF="classes-format.html">classes-format(5)</A>.<BR>A. Stolcke et al., ``Automatic Detection of Sentence Boundaries andDisfluencies based on Recognized Words,''<I>Proc. ICSLP</I>, 2247-2250, Sydney.<H2> AUTHORS </H2>Andreas Stolcke <stolcke@speech.sri.com>,<BR>Anand Venkataraman <anand@speech.sri.com>.<BR>Copyright 1998-2006 SRI International</BODY></HTML>