disambig(1)                                                        disambig(1)

NAME
    disambig - disambiguate text tokens using an N-gram model

SYNOPSIS
    disambig [-help] option ...

DESCRIPTION
    disambig translates a stream of tokens from a vocabulary V1 to a
    corresponding stream of tokens from a vocabulary V2, according to a
    probabilistic, 1-to-many mapping.  Ambiguities in the mapping are
    resolved by finding the V2 sequence with the highest posterior
    probability given the V1 sequence.  This probability is computed from
    pairwise conditional probabilities P(V1|V2), as well as a language
    model for sequences over V2.

OPTIONS
    Each filename argument can be an ASCII file, a compressed file (name
    ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

    -help
        Print option summary.

    -version
        Print version information.

    -text file
        Specifies the file containing the V1 sentences.

    -map file
        Specifies the file containing the V1-to-V2 mapping information.
        Each line of file contains the mapping for a single word in V1:

            w1  w21 [p21]  w22 [p22] ...

        where w1 is a word from V1, which has possible mappings w21, w22,
        ... from V2.  Optionally, each of these can be followed by a
        numeric string for the probability p21, which defaults to 1.  The
        number is used as the conditional probability P(w1|w21), but the
        program does not depend on these numbers being properly
        normalized.
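        For illustration, a small map file might look like the following;
        the words and numbers are invented solely to show the format
        described above, and are not taken from any distributed data:

            france  France 1.0
            may     May 0.3  may 0.7
            new     New 0.4  new 0.6

        A possible invocation using such a map (file names are
        hypothetical) could be:

            disambig -text input.txt -map example.map -lm v2.lm -order 2 \
                > output.txt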
    -escape string
        Set an ``escape string.''  Input lines starting with string are
        not processed and are passed unchanged to stdout instead.  This
        allows associated information to be passed to scoring scripts,
        etc.

    -text-map file
        Processes a combined text/map file.  The format of file is the
        same as for -map, except that the w1 field on each line is
        interpreted as a word token rather than a word type.  Hence, the
        V1 text input consists of all words in column 1 of file, in order
        of appearance.  This is convenient if different instances of a
        word have different mappings.
        There is no implicit insertion of begin/end sentence tokens in
        this mode.  Sentence boundaries should be indicated explicitly by
        lines of the form

            </s>    </s>
            <s>     <s>

        An escaped line (see -escape) also implicitly marks a sentence
        boundary.
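        As an illustration, a text-map file covering two short sentences
        might look as follows; the tokens and probabilities are invented
        for the example, and the boundary lines follow the format shown
        above:

            <s>     <s>
            in      In 0.2  in 0.8
            may     May 0.7  may 0.3
            </s>    </s>
            <s>     <s>
            france  France
            </s>    </s>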
    -classes file
        Specifies the V1-to-V2 mapping information in classes-format(5).
        Class labels are interpreted as V2 words, and expansions as V1
        words.  Multi-word expansions are not allowed.

    -scale
        Interpret the numbers in the mapping as P(w21|w1).  This is done
        by dividing probabilities by the unigram probabilities of w21,
        obtained from the V2 language model.

    -logmap
        Interpret numeric values in the map file as log probabilities,
        not probabilities.

    -lm file
        Specifies the V2 language model as a standard ARPA N-gram backoff
        model file in ngram-format(5).  The default is not to use a
        language model, i.e., to choose V2 tokens based only on the
        probabilities in the map file.

    -order n
        Set the effective N-gram order used by the language model to n.
        Default is 2 (use a bigram model).

    -mix-lm file
        Read a second N-gram model for interpolation purposes.

    -factored
        Interpret the files specified by -lm and -mix-lm as a factored
        N-gram model specification.  See ngram(1) for details.

    -count-lm
        Interpret the model specified by -lm (but not -mix-lm) as a
        count-based LM.  See ngram(1) for details.

    -lambda weight
        Set the weight of the main model when interpolating with -mix-lm.
        Default value is 0.5.

    -mix-lm2 file
    -mix-lm3 file
    -mix-lm4 file
    -mix-lm5 file
    -mix-lm6 file
    -mix-lm7 file
    -mix-lm8 file
    -mix-lm9 file
        Up to 9 more N-gram models can be specified for interpolation.

    -mix-lambda2 weight
    -mix-lambda3 weight
    -mix-lambda4 weight
    -mix-lambda5 weight
    -mix-lambda6 weight
    -mix-lambda7 weight
    -mix-lambda8 weight
    -mix-lambda9 weight
        These are the weights for the additional mixture components,
        corresponding to -mix-lm2 through -mix-lm9.  The weight for the
        -mix-lm model is 1 minus the sum of -lambda and -mix-lambda2
        through -mix-lambda9.

    -bayes length
        Set the context length used for Bayesian interpolation.  The
        default value is 0, giving the standard fixed interpolation
        weight specified by -lambda.

    -bayes-scale scale
        Set the exponential scale factor on the context likelihood used
        in conjunction with the -bayes function.  Default value is 1.0.

    -lmw W
        Scales the language model probabilities by a factor W.  Default
        language model weight is 1.

    -mapw W
        Scales the likelihood map probability by a factor W.  Default map
        weight is 1.  Note: for Viterbi decoding (the default) it is
        equivalent to use -lmw W or -mapw 1/W, but not for
        forward-backward computation.

    -tolower1
        Map input vocabulary (V1) to lowercase, removing case
        distinctions.

    -tolower2
        Map output vocabulary (V2) to lowercase, removing case
        distinctions.

    -keep-unk
        Do not map unknown input words to the <unk> token.  Instead,
        output the input word unchanged.  This is like having an implicit
        default mapping for unknown words to themselves, except that the
        word will still be treated as <unk> by the language model.  Also,
        with this option the LM is assumed to be open-vocabulary (the
        default is closed-vocabulary).

    -vocab-aliases file
        Reads vocabulary alias definitions from file, consisting of lines
        of the form

            alias word

        This causes all V2 tokens alias to be mapped to word, and is
        useful for adapting mismatched language models.

    -no-eos
        Do not assume that each input line contains a complete sentence.
        This prevents end-of-sentence tokens </s> from being appended
        automatically.

    -continuous
        Process all words in the input as one sequence of words,
        irrespective of line breaks.  Normally each line is processed
        separately as a sentence.  V2 tokens are output one per line.
        This option also prevents sentence start/end tokens (<s> and
        </s>) from being added to the input.

    -fb
        Perform forward-backward decoding of the input (V1) token
        sequence.  Outputs the V2 tokens that have the highest posterior
        probability, for each position.  The default is to use Viterbi
        decoding, i.e., the output is the V2 sequence with the highest
        joint posterior probability.

    -fw-only
        Similar to -fb, but uses only the forward probabilities for
        computing posteriors.  This may be used to simulate on-line
        prediction of tags, without the benefit of future context.

    -totals
        Output the total string probability for each input sentence.

    -posteriors
        Output the table of posterior probabilities for each input (V1)
        token and each V2 token, in the same format as required for the
        -map file.  If -fb is also specified, the posterior probabilities
        will be computed using forward-backward probabilities; otherwise
        an approximation will be used that is based on the probability of
        the most likely path containing a given V2 token at a given
        position.

    -nbest N
        Output the N best hypotheses instead of just the first best when
        doing Viterbi search.  If N > 1, each hypothesis is prefixed by
        the tag NBEST_n x, where n is the rank of the hypothesis in the
        N-best list and x its score, the negative log of the combined
        probability of transitions and observations of the corresponding
        HMM path.

    -write-counts file
        Outputs the V2-V1 bigram counts corresponding to the tagging
        performed on the input data.  If -fb was specified these are
        expected counts; otherwise they reflect the 1-best tagging
        decisions.

    -write-vocab1 file
        Writes the input vocabulary from the map (V1) to file.

    -write-vocab2 file
        Writes the output vocabulary from the map (V2) to file.  The
        vocabulary will also include the words specified in the language
        model.

    -write-map file
        Writes the map back to a file for validation purposes.

    -debug
        Sets the debugging output level.

BUGS
    The -continuous and -text-map options effectively disable -keep-unk,
    i.e., unknown input words are always mapped to <unk>.  Also,
    -continuous doesn't preserve the positions of escaped input lines
    relative to the input.

SEE ALSO
    ngram-count(1), hidden-ngram(1), training-scripts(1),
    ngram-format(5), classes-format(5).

AUTHOR
    Andreas Stolcke <stolcke@speech.sri.com>,
    Anand Venkataraman <anand@speech.sri.com>.
    Copyright 1995-2006 SRI International

SRILM Tools              $Date: 2006/07/30 00:07:52 $             disambig(1)