ngram-count(1)
-prune threshold
       Prune N-gram probabilities if their removal causes (training set)
       perplexity of the model to increase by less than threshold
       relative.

-minprune n
       Only prune N-grams of length at least n.  The default (and
       minimum allowed value) is 2, i.e., only unigrams are excluded
       from pruning.

-debug level
       Set debugging output from the estimated LM at level.  Level 0
       means no debugging.  Debugging messages are written to stderr.

-gtnmin count
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Set the minimal count
       of N-grams of order n that will be included in the LM.  All
       N-grams with frequency lower than that will effectively be
       discounted to 0.  If n is omitted the parameter for N-grams of
       order > 9 is set.
       NOTE: This option affects not only the default Good-Turing
       discounting but the alternative discounting methods described
       below as well.

-gtnmax count
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Set the maximal count
       of N-grams of order n that are discounted under Good-Turing.  All
       N-grams more frequent than that will receive maximum likelihood
       estimates.  Discounting can be effectively disabled by setting
       this to 0.  If n is omitted the parameter for N-grams of order
       > 9 is set.

In the following discounting parameter options, the order n may be
omitted, in which case a default for all N-gram orders is set.  The
corresponding discounting method then becomes the default method for
all orders, unless specifically overridden by an option with n.  If no
discounting method is specified, Good-Turing is used.

-gtn gtfile
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Save or retrieve
       Good-Turing parameters (cutoffs and discounting factors) in/from
       gtfile.  This is useful as GT parameters should always be
       determined from unlimited vocabulary counts, whereas the eventual
       LM may use a limited vocabulary.  The parameter files may also be
       hand-edited.  If an -lm option is specified the GT parameters are
       read from gtfile, otherwise they are computed from the current
       counts and saved in gtfile.
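       For example, the two-pass workflow implied by -gtn might look as
       follows.  This is a sketch only: it assumes the standard -order,
       -read, and -vocab options from the full ngram-count(1) page
       (not all documented in this excerpt), and every file name is a
       placeholder.

              # Pass 1: no -lm given, so Good-Turing parameters are
              # computed from the unlimited-vocabulary counts and
              # saved to the gtfile arguments.
              ngram-count -order 3 -read full.counts \
                      -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params

              # Pass 2: -lm is given, so the saved parameters are read
              # back and used to estimate a model restricted to a
              # limited vocabulary.
              ngram-count -order 3 -read full.counts -vocab wordlist \
                      -gt1 gt1.params -gt2 gt2.params -gt3 gt3.params \
                      -lm limited.lm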
-cdiscountn discount
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Use Ney's absolute
       discounting for N-grams of order n, using discount as the
       constant to subtract.

-wbdiscountn
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Use Witten-Bell
       discounting for N-grams of order n.  (This is the estimator where
       the first occurrence of each word is taken to be a sample for the
       "unseen" event.)

-ndiscountn
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Use Ristad's natural
       discounting law for N-grams of order n.

-kndiscountn
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Use Chen and Goodman's
       modified Kneser-Ney discounting for N-grams of order n.

-kn-counts-modified
       Indicates that input counts have already been modified for
       Kneser-Ney smoothing.  If this option is not given, the KN
       discounting method modifies counts (except those of highest
       order) in order to estimate the backoff distributions.  When
       using the -write and related options the output will reflect the
       modified counts.

-kn-modify-counts-at-end
       Modify Kneser-Ney counts after estimating discounting constants,
       rather than before as is the default.

-knn knfile
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Save or retrieve
       Kneser-Ney parameters (cutoff and discounting constants) in/from
       knfile.  This is useful as smoothing parameters should always be
       determined from unlimited vocabulary counts, whereas the eventual
       LM may use a limited vocabulary.  The parameter files may also be
       hand-edited.  If an -lm option is specified the KN parameters are
       read from knfile, otherwise they are computed from the current
       counts and saved in knfile.

-ukndiscountn
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Use the original
       (unmodified) Kneser-Ney discounting method for N-grams of
       order n.

In the above discounting options, if the parameter n is omitted the
option sets the default discounting method for all N-grams of length
greater than 9.

-interpolaten
       where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Causes the discounted
       N-gram probability estimates at the specified order n to be
       interpolated with lower-order estimates.  (The result of the
       interpolation is encoded as a standard backoff model and can be
       evaluated as such -- the interpolation happens at estimation
       time.)  This sometimes yields better models with some smoothing
       methods (see Chen & Goodman, 1998).  Only Witten-Bell, absolute
       discounting, and modified Kneser-Ney smoothing currently support
       interpolation.

-meta-tag string
       Interpret words starting with string as count-of-count
       (meta-count) tags.  For example, an N-gram "a b string3 4" means
       that there were 4 trigrams starting with "a b" that occurred 3
       times each.  Meta-tags are only allowed in the last position of
       an N-gram.
       Note: when using -tolower the meta-tag string must not contain
       any uppercase characters.

-read-with-mincounts
       Save memory by eliminating N-grams with counts that fall below
       the thresholds set by -gtNmin options during -read operation
       (this assumes the input counts contain no duplicate N-grams).
       Also, if -meta-tag is defined, these low-count N-grams will be
       converted to count-of-count N-grams, so that smoothing methods
       that need this information still work correctly.
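       To make the -meta-tag and -read-with-mincounts behavior concrete,
       here is a hypothetical count file and invocation.  The tag string
       __meta__, the file names, and the counts are invented for
       illustration; -gt3 reads previously saved discounting parameters,
       in line with the BUGS note below that such parameters should be
       estimated from full counts.

              # counts.meta (tab-separated): the last line records that
              # 4 distinct trigrams starting with "a b" occurred 3
              # times each.
              a b c           5
              a b d           2
              a b __meta__3   4

              # Reading with mincounts drops trigrams below -gt3min and
              # folds them into count-of-count entries; since -lm is
              # given, -gt3 reads GT parameters estimated earlier from
              # the full counts.
              ngram-count -order 3 -read counts.meta \
                      -read-with-mincounts -meta-tag __meta__ \
                      -gt3min 2 -gt3 gt3.params -lm model.lm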
SEE ALSO
       ngram-merge(1), ngram(1), ngram-class(1), training-scripts(1),
       lm-scripts(1), ngram-format(5).

       S. F. Chen and J. Goodman, "An Empirical Study of Smoothing
       Techniques for Language Modeling," TR-10-98, Computer Science
       Group, Harvard Univ., 1998.

       S. M. Katz, "Estimation of Probabilities from Sparse Data for the
       Language Model Component of a Speech Recognizer," IEEE Trans.
       ASSP 35(3), 400-401, 1987.

       R. Kneser and H. Ney, "Improved backing-off for M-gram language
       modeling," Proc. ICASSP, 181-184, 1995.

       H. Ney and U. Essen, "On Smoothing Techniques for Bigram-based
       Natural Language Modelling," Proc. ICASSP, 825-828, 1991.

       E. S. Ristad, "A Natural Law of Succession," CS-TR-495-95, Comp.
       Sci. Dept., Princeton Univ., 1995.

       I. H. Witten and T. C. Bell, "The Zero-Frequency Problem:
       Estimating the Probabilities of Novel Events in Adaptive Text
       Compression," IEEE Trans. Information Theory 37(4), 1085-1094,
       1991.

BUGS
       Several of the LM types supported by ngram(1) don't have explicit
       support in ngram-count.  Instead, they are built by separately
       manipulating N-gram counts, followed by standard N-gram model
       estimation.

       LM support for tagged words is incomplete.

       Only absolute and Witten-Bell discounting currently support
       fractional counts.

       The combination of -read-with-mincounts and -meta-tag preserves
       enough count-of-count information for applying discounting
       parameters to the input counts, but it does not necessarily allow
       the parameters to be correctly estimated.  Therefore, discounting
       parameters should always be estimated from full counts (e.g.,
       using the helper training-scripts(1)), and then read from files.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1995-2006 SRI International

SRILM Tools         $Date: 2006/09/04 09:13:10 $         ngram-count(1)