<!-- $Id: disambig.1,v 1.28 2006/07/30 00:07:52 stolcke Exp $ -->
<HTML>
<HEAD><TITLE>disambig</TITLE></HEAD>
<BODY>
<H1>disambig</H1>
<H2> NAME </H2>
disambig - disambiguate text tokens using an N-gram model
<H2> SYNOPSIS </H2>
<B>disambig</B> [<B>-help</B>] <I>option</I> ...
<H2> DESCRIPTION </H2>
<B>disambig</B> translates a stream of tokens from a vocabulary V1 to a
corresponding stream of tokens from a vocabulary V2, according to a
probabilistic, 1-to-many mapping.
Ambiguities in the mapping are resolved by finding the V2 sequence with
the highest posterior probability given the V1 sequence.
This probability is computed from pairwise conditional probabilities P(V1|V2),
as well as a language model for sequences over V2.
<H2> OPTIONS </H2>
<P>
Each filename argument can be an ASCII file, or a compressed
file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.
<DL>
<DT><B>-help</B>
<DD>
Print option summary.
<DT><B>-version</B>
<DD>
Print version information.
<DT><B>-text</B> <I>file</I>
<DD>
Specifies the file containing the V1 sentences.
<DT><B>-map</B> <I>file</I>
<DD>
Specifies the file containing the V1-to-V2 mapping information.
Each line of <I>file</I> contains the mapping for a single word in V1:
<BR>
&nbsp;&nbsp;&nbsp;<I>w1</I> <I>w21</I> [<I>p21</I>] <I>w22</I> [<I>p22</I>] ...
<BR>
where <I>w1</I> is a word from V1, which has possible mappings
<I>w21</I>, <I>w22</I>, ... from V2.
Optionally, each of these can be followed by a numeric string for the
probability <I>p21</I>, which defaults to 1.
The number is used as the conditional probability P(<I>w1</I>|<I>w21</I>),
but the program does not depend on these numbers being properly normalized.
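<P>
For example, a small map file for part-of-speech tagging might look as
follows (the words, tags, and probability values are invented for
illustration; as noted above, the values need not be normalized):
<PRE>
can    MD 0.7  NN 0.2  VB 0.1
run    VB 0.6  NN 0.4
fast   RB 0.5  JJ 0.5
</PRE>
With such a map, an invocation like the following (filenames hypothetical)
<PRE>
disambig -text sentences.txt -map postags.map -lm tags.lm -order 2
</PRE>
outputs, for each input sentence, the tag sequence that best combines the
map probabilities with the tag language model.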
<DT><B>-escape</B> <I>string</I>
<DD>
Set an ``escape string.''
Input lines starting with <I>string</I> are not processed and are instead
passed unchanged to stdout.
This allows associated information to be passed to scoring scripts etc.
<DT><B>-text-map</B> <I>file</I>
<DD>
Processes a combined text/map file.
The format of <I>file</I> is the same as for <B>-map</B>,
except that the <I>w1</I> field on each line is interpreted as a word
<I>token</I> rather than a word <I>type</I>.
Hence, the V1 text input consists of all words in column 1 of <I>file</I>
in order of appearance.
This is convenient if different instances of a word have different mappings.
There is no implicit insertion of begin/end sentence tokens in this mode.
Sentence boundaries should be indicated explicitly by lines of the form
<BR>
&nbsp;&nbsp;&nbsp;&lt;/s&gt; &lt;/s&gt;
<BR>
&nbsp;&nbsp;&nbsp;&lt;s&gt; &lt;s&gt;
<BR>
An escaped line (see <B>-escape</B>) also implicitly marks a sentence
boundary.
<DT><B>-classes</B> <I>file</I>
<DD>
Specifies the V1-to-V2 mapping information in
<A HREF="classes-format.html">classes-format(5)</A>.
Class labels are interpreted as V2 words, and expansions as V1 words.
Multi-word expansions are not allowed.
<DT><B>-scale</B>
<DD>
Interpret the numbers in the mapping as P(<I>w21</I>|<I>w1</I>).
This is done by dividing the probabilities by the unigram probabilities of
<I>w21</I>, obtained from the V2 language model.
<DT><B>-logmap</B>
<DD>
Interpret numeric values in the map file as log probabilities,
not probabilities.
<DT><B>-lm</B> <I>file</I>
<DD>
Specifies the V2 language model as a standard ARPA N-gram backoff model file
in <A HREF="ngram-format.html">ngram-format(5)</A>.
The default is not to use a language model, i.e., to choose V2 tokens
based only on the probabilities in the map file.
<DT><B>-order</B> <I>n</I>
<DD>
Set the effective N-gram order used by the language model to <I>n</I>.
Default is 2 (use a bigram model).
<DT><B>-mix-lm</B> <I>file</I>
<DD>
Read a second N-gram model for interpolation purposes.
<DT><B>-factored</B>
<DD>
Interpret the files specified by <B>-lm</B> and <B>-mix-lm</B> as
factored N-gram model specifications.
See <A HREF="ngram.html">ngram(1)</A> for details.
<DT><B>-count-lm</B>
<DD>
Interpret the model specified by <B>-lm</B> (but not <B>-mix-lm</B>)
as a count-based LM.
See <A HREF="ngram.html">ngram(1)</A> for details.
<DT><B>-lambda</B> <I>weight</I>
<DD>
Set the weight of the main model when interpolating with <B>-mix-lm</B>.
Default value is 0.5.
<DT><B>-mix-lm2</B> <I>file</I>
<DD>
<DT><B>-mix-lm3</B> <I>file</I>
<DD>
<DT><B>-mix-lm4</B> <I>file</I>
<DD>
<DT><B>-mix-lm5</B> <I>file</I>
<DD>
<DT><B>-mix-lm6</B> <I>file</I>
<DD>
<DT><B>-mix-lm7</B> <I>file</I>
<DD>
<DT><B>-mix-lm8</B> <I>file</I>
<DD>
<DT><B>-mix-lm9</B> <I>file</I>
<DD>
Up to 9 more N-gram models can be specified for interpolation.
<DT><B>-mix-lambda2</B> <I>weight</I>
<DD>
<DT><B>-mix-lambda3</B> <I>weight</I>
<DD>
<DT><B>-mix-lambda4</B> <I>weight</I>
<DD>
<DT><B>-mix-lambda5</B> <I>weight</I>
<DD>
<DT><B>-mix-lambda6</B> <I>weight</I>
<DD>
<DT><B>-mix-lambda7</B> <I>weight</I>
<DD>
<DT><B>-mix-lambda8</B> <I>weight</I>
<DD>
<DT><B>-mix-lambda9</B> <I>weight</I>
<DD>
These are the weights for the additional mixture components, corresponding
to <B>-mix-lm2</B> through <B>-mix-lm9</B>.
The weight for the <B>-mix-lm</B> model is 1 minus the sum of
<B>-lambda</B> and <B>-mix-lambda2</B> through <B>-mix-lambda9</B>.
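<P>
For example, the following sketch (filenames hypothetical) interpolates a
main model with two others:
<PRE>
disambig -text input.txt -map words.map \
    -lm main.lm -mix-lm general.lm -mix-lm2 domain.lm \
    -lambda 0.5 -mix-lambda2 0.2 -order 3
</PRE>
Here <I>main.lm</I> receives weight 0.5, <I>domain.lm</I> receives 0.2, and
<I>general.lm</I> receives the remainder, 1 - 0.5 - 0.2 = 0.3.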
1/W",</I><B></B>but not for forward-backward computation.<DT><B> -tolower1 </B><DD>Map input vocabulary (V1) to lowercase, removing case distinctions.<DT><B> -tolower2 </B><DD>Map output vocabulary (V2) to lowercase, removing case distinctions.<DT><B> -keep-unk </B><DD>Do not map unknown input words to the <unk> token.Instead, output the input word unchanged.This is like having an implicit default mapping for unknown words tothemselves, except that the word will still be treated as <unk> by the languagemodel.Also, with this option the LM is assumed to be open-vocabulary(the default is close-vocabulary).<DT><B>-vocab-aliases</B><I> file</I><B></B><DD>Reads vocabulary alias definitions from<I>file</I>,<I></I>consisting of lines of the form<BR> <I>alias</I> <I>word</I><BR>This causes all V2 tokens<I> alias </I>to be mapped to<I>word</I>,<I></I>and is useful for adapting mismatched language models.<DT><B> -no-eos </B><DD>Do no assume that each input line contains a complete sentence.This prevents end-of-sentence tokens </s> from being appended automatically.<DT><B> -continuous </B><DD>Process all words in the input as one sequence of words, irrespective ofline breaks.Normally each line is processed separately as a sentence.V2 tokens are output one-per-line.This option also prevents sentence start/end tokens (<s> and </s>)from being added to the input.<DT><B> -fb </B><DD>Perform forward-backward decoding of the input (V1) token sequence.Outputs the V2 tokens that have the highest posterior probability,for each position.The default is to use Viterbi decoding, i.e., the output is theV2 sequence with the higher joint posterior probability.<DT><B> -fw-only </B><DD>Similar to <B>-fb</B>,<B></B>but uses only the forward probabilities for computing posteriors.This may be used to simulate on-line prediction of tags, without thebenefit of future context.<DT><B> -totals </B><DD>Output the total string probability for each input sentence.<DT><B> -posteriors </B><DD>Output the table of posterior probabilities for each input (V1) token and each V2 token, in the same format asrequired for the<B> -map </B>file.If<B> -fb </B>is also specified the posterior probabilities will be computed usingforward-backward probabilities; otherwise an approximation will be usedthat is based on the probability of the most likely path containing a given V2 token at given position.<DT><B>-nbest</B><I> N</I><B></B><DD>Output the<I> N </I>best hypotheses instead of just the first best whendoing Viterbi search.If<I>N</I>>1,<I></I>then each hypothesis is prefixed by the tag<B>NBEST_</B><I>n</I><B> </B>where<I> n </I>is the rank of the hypothesis in the N-best list and<I> x </I>its score, the negative log of the combined probability of transitionsand observations of the corresponding HMM path.<DT><B>-write-counts</B><I> file</I><B></B><DD>Outputs the V2-V1 bigram counts corresponding to the tagging performed onthe input data.If <B> -fb </B>was specified these are expected counts, and otherwise they reflect the 1-besttagging decisions.<DT><B>-write-vocab1</B><I> file</I><B></B><DD>Writes the input vocabulary from the map (V1) to<I>file</I>.<I></I><DT><B>-write-vocab2</B><I> file</I><B></B><DD>Writes the output vocabulary from the map (V2) to<I>file</I>.<I></I>The vocabulary will also include the words specified in the language model.<DT><B>-write-map</B><I> file</I><B></B><DD>Writes the map back to a file for validation purposes.<DT><B> -debug </B><DD>Sets debugging output level.</DD></DL><P>Each filename argument can be an ASCII 
<H2> BUGS </H2>
The <B>-continuous</B> and <B>-text-map</B> options effectively disable
<B>-keep-unk</B>, i.e., unknown input words are always mapped to &lt;unk&gt;.
Also, <B>-continuous</B> doesn't preserve the positions of escaped input
lines relative to the input.
<H2> SEE ALSO </H2>
<A HREF="ngram-count.html">ngram-count(1)</A>,
<A HREF="hidden-ngram.html">hidden-ngram(1)</A>,
<A HREF="training-scripts.html">training-scripts(1)</A>,
<A HREF="ngram-format.html">ngram-format(5)</A>,
<A HREF="classes-format.html">classes-format(5)</A>.
<H2> AUTHOR </H2>
Andreas Stolcke &lt;stolcke@speech.sri.com&gt;,
<BR>
Anand Venkataraman &lt;anand@speech.sri.com&gt;.
<BR>
Copyright 1995-2006 SRI International
</BODY>
</HTML>