hidden-ngram(1)                                          hidden-ngram(1)

NAME
       hidden-ngram - tag hidden events between words

SYNOPSIS
       hidden-ngram [-help] option ...

DESCRIPTION
       hidden-ngram tags a stream of word tokens with hidden events
       occurring between words.  For example, an unsegmented text could
       be tagged for sentence boundaries (the hidden events in this
       case being `boundary' and `no-boundary').  The most likely
       hidden tag sequence consistent with the given word sequence is
       found according to an N-gram language model over both words and
       hidden tags.  hidden-ngram is a generalization of segment(1).

OPTIONS
       Each filename argument can be an ASCII file, or a compressed
       file (name ending in .Z or .gz), or ``-'' to indicate
       stdin/stdout.

       -help
              Print option summary.

       -version
              Print version information.

       -text file
              Specifies the file containing the word sequences to be
              tagged (one sentence per line).  Start- and end-of-
              sentence tags are not added by the program, but should be
              included in the input if the language model uses them.

       -escape string
              Set an ``escape string.''  Input lines starting with
              string are not processed and are passed unchanged to
              stdout instead.  This allows associated information to be
              passed to scoring scripts etc.

       -text-map file
              Read the input words from a map file containing both the
              words and additional likelihoods of events following each
              word.  Each line contains one input word, plus optional
              hidden-event/likelihood pairs in the format

                   w    e1 [p1]  e2 [p2]  ...

              If a p value is omitted a likelihood of 1 is assumed.
              All events not explicitly listed are given likelihood 0,
              and are hence excluded for that word.  In particular, the
              label *noevent* must be listed to allow absence of a
              hidden event.

              Input word strings are assembled from multiple lines of
              -text-map input until either an end-of-sentence token
              </s> is found, or an escaped line (see -escape) is
              encountered.

       -logmap
              Interpret numeric values in the -text-map file as log
              probabilities, rather than probabilities.

       -lm file
              Specifies the word/tag language model as a standard ARPA
              N-gram backoff model file in ngram-format(5).

       -order n
              Set the effective N-gram order used by the language model
              to n.  Default is 3 (use a trigram model).

       -classes file
              Interpret the LM as an N-gram over word classes.  The
              expansions of the classes are given in file in
              classes-format(5).  Tokens in the LM that are not defined
              as classes in file are assumed to be plain words, so that
              the LM can contain mixed N-grams over both words and word
              classes.

       -simple-classes
              Assume a "simple" class model: each word is a member of
              at most one word class, and class expansions are exactly
              one word long.

       -mix-lm file
              Read a second N-gram model for interpolation purposes.
              The second and any additional interpolated models can
              also be class N-grams (using the same -classes
              definitions).

       -factored
              Interpret the files specified by -lm, -mix-lm, etc. as
              factored N-gram model specifications.  See ngram(1) for
              more details.

       -lambda weight
              Set the weight of the main model when interpolating with
              -mix-lm.  Default value is 0.5.
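              As an illustration of a basic and an interpolated run
              (all file names here are hypothetical; the LM and the
              hidden-tag list must be prepared beforehand, e.g. with
              ngram-count(1)):

                   # tag hidden events in text.txt using a trigram LM
                   # over words and tags; events.txt lists the hidden
                   # tags, one per line
                   hidden-ngram -text text.txt -lm sbound.lm \
                        -order 3 -hidden-vocab events.txt

                   # the same, interpolating a second LM with equal
                   # weight
                   hidden-ngram -text text.txt -lm sbound.lm \
                        -mix-lm other.lm -lambda 0.5 \
                        -hidden-vocab events.txt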
       -mix-lm2 file
       -mix-lm3 file
       -mix-lm4 file
       -mix-lm5 file
       -mix-lm6 file
       -mix-lm7 file
       -mix-lm8 file
       -mix-lm9 file
              Up to 9 more N-gram models can be specified for
              interpolation.

       -mix-lambda2 weight
       -mix-lambda3 weight
       -mix-lambda4 weight
       -mix-lambda5 weight
       -mix-lambda6 weight
       -mix-lambda7 weight
       -mix-lambda8 weight
       -mix-lambda9 weight
              These are the weights for the additional mixture
              components, corresponding to -mix-lm2 through -mix-lm9.
              The weight for the -mix-lm model is 1 minus the sum of
              -lambda and -mix-lambda2 through -mix-lambda9.

       -bayes length
              Set the context length used for Bayesian interpolation.
              The default value is 0, giving the standard fixed
              interpolation weight specified by -lambda.

       -bayes-scale scale
              Set the exponential scale factor on the context
              likelihood in conjunction with the -bayes function.
              Default value is 1.0.

       -lmw W
              Scales the language model probabilities by a factor W.
              Default language model weight is 1.

       -mapw W
              Scales the likelihood map probability by a factor W.
              Default map weight is 1.

       -tolower
              Map vocabulary to lowercase, removing case distinctions.

       -vocab file
              Initialize the vocabulary for the LM from file.  This is
              useful in conjunction with -limit-vocab.

       -vocab-aliases file
              Reads vocabulary alias definitions from file, consisting
              of lines of the form

                   alias word

              This causes all tokens alias to be mapped to word.

       -hidden-vocab file
              Read the list of hidden tags from file.  Note: This is a
              subset of the vocabulary contained in the language model.

       -limit-vocab
              Discard LM parameters on reading that do not pertain to
              the words specified in the vocabulary, either by -vocab
              or -hidden-vocab.  The default is that words used in the
              LM are automatically added to the vocabulary.  This
              option can be used to reduce the memory requirements for
              large LMs that are going to be evaluated only on a small
              vocabulary subset.

       -force-event
              Forces a non-default event after every word.  This is
              useful for language models that represent the default
              event explicitly with a tag, rather than implicitly by
              the absence of a tag between words (which is the
              default).

       -keep-unk
              Do not map unknown input words to the <unk> token.
              Instead, output the input word unchanged.  Also, with
              this option the LM is assumed to be open-vocabulary (the
              default is closed-vocabulary).

       -fb
              Perform forward-backward decoding of the input token
              sequence.  Outputs the tags that have the highest
              posterior probability, for each position.  The default is
              to use Viterbi decoding, i.e., the output is the tag
              sequence with the highest joint posterior probability.

       -fw-only
              Similar to -fb, but uses only the forward probabilities
              for computing posteriors.  This may be used to simulate
              on-line prediction of tags, without the benefit of future
              context.

       -continuous
              Process all words in the input as one sequence of words,
              irrespective of line breaks.  Normally each line is
              processed separately as a sentence.  Input tokens are
              output one-per-line, followed by event tokens.
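              As an illustration of the decoding modes (file names
              again hypothetical), a forward-backward run over one
              unbroken input stream might look like this:

                   # read one continuous word stream from stdin and
                   # output, at each position, the tag with the
                   # highest posterior probability
                   cat transcript.txt | \
                   hidden-ngram -text - -lm sbound.lm \
                        -hidden-vocab events.txt -continuous -fb

              Replacing -fb with -fw-only would base the posteriors on
              forward probabilities only, simulating on-line tagging.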
       -posteriors
              Output the table of posterior probabilities for each tag
              position.  If -fb is also specified the posterior
              probabilities will be computed using forward-backward
              probabilities; otherwise an approximation will be used
              that is based on the probability of the most likely path
              containing a given tag at a given position.

       -totals
              Output the total string probability for each input
              sentence.  If -fb is also specified this probability is
              obtained by summing over all hidden event sequences;
              otherwise it is calculated (i.e., underestimated) using
              the most probable hidden event sequence.

       -nbest N
              Output the N best hypotheses instead of just the first
              best when doing Viterbi search.  If N>1, then each
              hypothesis is prefixed by the tag NBEST_n x, where n is
              the rank of the hypothesis in the N-best list and x its
              score, the negative log of the combined probability of
              transitions and observations of the corresponding HMM
              path.

       -write-counts file
              Write the posterior weighted counts of N-grams, including
              those with hidden tags, summed over the entire input
              data, to file.  The posterior probabilities should
              normally be computed with the forward-backward algorithm
              (instead of Viterbi), so the -fb option is usually also
              specified.  Only N-grams whose contexts occur in the
              language model are output.

       -unk-prob L
              Specifies that unknown words and other words having zero
              probability in the language model be assigned a log
              probability of L.  This is -100 by default but might be
              set to 0, e.g., to compute perplexities excluding unknown
              words.

       -debug
              Sets debugging output level.

BUGS
       The -continuous and -text-map options effectively disable
       -keep-unk, i.e., unknown input words are always mapped to <unk>.
       Also, -continuous doesn't preserve the positions of escaped
       input lines relative to the input.

       The dynamic programming for event decoding is not efficiently
       interleaved with that required to evaluate class N-gram models;
       therefore, the state space generated in decoding with -classes
       quickly becomes infeasibly large unless -simple-classes is also
       specified.

       The file given by -classes is read multiple times if
       -limit-vocab is in effect or if a mixture of LMs is specified.
       This will lead to incorrect behavior if the argument of -classes
       is stdin (``-'').

SEE ALSO
       ngram(1), ngram-count(1), disambig(1), segment(1),
       ngram-format(5), classes-format(5).

       A. Stolcke et al., ``Automatic Detection of Sentence Boundaries
       and Disfluencies based on Recognized Words,'' Proc. ICSLP,
       2247-2250, Sydney.

AUTHORS
       Andreas Stolcke <stolcke@speech.sri.com>
       Anand Venkataraman <anand@speech.sri.com>
       Copyright 1998-2006 SRI International

SRILM Tools         $Date: 2006/07/28 07:34:29 $         hidden-ngram(1)