ngram-count(1)                                                ngram-count(1)

NAME
       ngram-count - count N-grams and estimate language models

SYNOPSIS
       ngram-count [-help] option ...

DESCRIPTION
       ngram-count generates and manipulates N-gram counts, and estimates
       N-gram language models from them.  The program first builds an
       internal N-gram count set, either by reading counts from a file, or
       by scanning text input.  Following that, the resulting counts can be
       output back to a file or used for building an N-gram language model
       in ARPA ngram-format(5).  Each of these actions is triggered by
       corresponding options, as described below.

OPTIONS
       Each filename argument can be an ASCII file, or a compressed file
       (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -order n
              Set the maximal order (length) of N-grams to count.  This
              also determines the order of the estimated LM, if any.  The
              default order is 3.

       -vocab file
              Read a vocabulary from file.  Subsequently, out-of-vocabulary
              words in both counts and text are replaced with the
              unknown-word token.  If this option is not specified, all
              words found are implicitly added to the vocabulary.

       -vocab-aliases file
              Read vocabulary alias definitions from file, consisting of
              lines of the form

                   alias word

              This causes all tokens alias to be mapped to word.

       -write-vocab file
              Write the vocabulary built in the counting process to file.

       -tagged
              Interpret text and N-grams as consisting of word/tag pairs.

       -tolower
              Map all vocabulary to lowercase.

       -memuse
              Print memory usage statistics.

   Counting Options
       -text textfile
              Generate N-gram counts from textfile, which should contain
              one sentence unit per line.  Begin/end sentence tokens are
              added if not already present.  Empty lines are ignored.
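       As an illustration (not part of the original manual), the counting
       behavior of -text can be sketched in a few lines of Python: sentence
       boundary tokens are added where missing, empty lines are skipped,
       and all N-grams up to the given order are tallied.  This is a
       simplified sketch, not SRILM's actual implementation.

```python
from collections import Counter

def count_ngrams(lines, order=3):
    """Count all N-grams up to the given order, adding begin/end
    sentence tokens the way -text does (sketch, not SRILM code)."""
    counts = Counter()
    for line in lines:
        words = line.split()
        if not words:                 # empty lines are ignored
            continue
        if words[0] != "<s>":
            words = ["<s>"] + words   # add begin-sentence token
        if words[-1] != "</s>":
            words = words + ["</s>"]  # add end-sentence token
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

counts = count_ngrams(["a b a", "a b"])
# Print in the ASCII count file format: N-gram words, then the count
for ngram, c in sorted(counts.items()):
    print(" ".join(ngram), c)
```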
       -read countsfile
              Read N-gram counts from a file.  ASCII count files contain
              one N-gram of words per line, followed by an integer count,
              all separated by whitespace.  Repeated counts for the same
              N-gram are added.  Thus several count files can be merged by
              using cat(1) and feeding the result to ngram-count -read -
              (but see ngram-merge(1) for merging counts that exceed
              available memory).  Counts collected by -text and -read are
              additive as well.  Binary count files (see below) are also
              recognized.

       -read-google dir
              Read N-gram counts from an indexed directory structure rooted
              in dir, in a format developed by Google to store very large
              N-gram collections.  The corresponding directory structure
              can be created using the script make-google-ngrams described
              in training-scripts(1).

       -write file
              Write total counts to file.

       -write-binary file
              Write total counts to file in binary format.  Binary count
              files cannot be compressed and are typically larger than
              compressed ASCII count files.  However, they can be loaded
              faster, especially when the -limit-vocab option is used.

       -write-order n
              Order of counts to write.  The default is 0, which stands for
              N-grams of all lengths.

       -writen file
              where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Writes only counts
              of the indicated order to file.  This is convenient to
              generate counts of different orders separately in a single
              pass.

       -sort  Output counts in lexicographic order, as required for
              ngram-merge(1).

       -recompute
              Regenerate lower-order counts by summing the highest-order
              counts for each N-gram prefix.

       -limit-vocab
              Discard N-gram counts on reading that do not pertain to the
              words specified in the vocabulary.  The default is that words
              used in the count files are automatically added to the
              vocabulary.
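       The additivity of repeated counts is what makes cat(1)-style merging
       work: because -read adds counts for N-grams it has already seen,
       concatenated files merge into one total.  A minimal Python sketch of
       parsing ASCII count lines (an illustration, not SRILM code):

```python
from collections import Counter

def read_counts(lines):
    """Parse ASCII count lines ("w1 ... wN count"); repeated N-grams
    are added, so concatenated count files merge naturally."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        *ngram, c = fields          # last field is the integer count
        counts[tuple(ngram)] += int(c)
    return counts

merged = read_counts([
    "the cat 3",     # from a first count file
    "the dog 1",
    "the cat 2",     # same N-gram from a second file: counts add
])
```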
   LM Options
       -lm lmfile
              Estimate a backoff N-gram model from the total counts, and
              write it to lmfile in ngram-format(5).

       -nonevents file
              Read a list of words from file that are to be considered
              non-events, i.e., that can only occur in the context of an
              N-gram.  Such words are given zero probability mass in model
              estimation.

       -float-counts
              Enable manipulation of fractional counts.  Only certain
              discounting methods support non-integer counts.

       -skip  Estimate a ``skip'' N-gram model, which predicts a word by an
              interpolation of the immediate context and the context one
              word prior.  This also triggers N-gram counts to be generated
              that are one word longer than the indicated order.  The
              following four options control the EM estimation algorithm
              used for skip-N-grams.

       -init-lm lmfile
              Load an LM to initialize the parameters of the skip-N-gram.

       -skip-init value
              The initial skip probability for all words.

       -em-iters n
              The maximum number of EM iterations.

       -em-delta d
              The convergence criterion for EM: if the relative change in
              log likelihood falls below the given value, iteration stops.

       -count-lm
              Estimate a count-based interpolated LM using Jelinek-Mercer
              smoothing (Chen & Goodman, 1998).  Several of the options for
              skip-N-gram LMs (above) apply.  An initial count-LM in the
              format described in ngram(1) needs to be specified using
              -init-lm.  The options -em-iters and -em-delta control
              termination of the EM algorithm.  Note that the N-gram counts
              used to compute the maximum-likelihood estimates come from
              the -init-lm model.  The counts specified with -read or -text
              are used only to estimate the smoothing (interpolation
              weights).

       -unk   Build an ``open vocabulary'' LM, i.e., one that contains the
              unknown-word token as a regular word.  The default is to
              remove the unknown word.
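       For intuition, Jelinek-Mercer smoothing interpolates the
       maximum-likelihood estimate of an N-gram with the lower-order
       distribution.  The sketch below (an illustration only; the fixed
       weight lam is hypothetical, whereas ngram-count estimates the
       interpolation weights by EM) shows the bigram case:

```python
from collections import Counter

def jelinek_mercer(bigram_counts, unigram_counts, total, lam=0.5):
    """Return p(w|h) = lam * p_ML(w|h) + (1 - lam) * p(w), the
    two-order Jelinek-Mercer interpolation (sketch; lam is a
    hypothetical fixed weight, not an EM-trained one)."""
    def prob(w, h):
        p_uni = unigram_counts[w] / total           # lower-order estimate
        c_h = unigram_counts[h]
        p_ml = bigram_counts[(h, w)] / c_h if c_h else 0.0
        return lam * p_ml + (1 - lam) * p_uni
    return prob

uni = Counter({"a": 3, "b": 2})
bi = Counter({("a", "b"): 2, ("a", "a"): 1})
p = jelinek_mercer(bi, uni, total=5)
```

       Because both component distributions sum to one over the vocabulary,
       the interpolated distribution does as well.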
       -map-unk word
              Map out-of-vocabulary words to word, rather than the default
              <unk> tag.

       -trust-totals
              Force the lower-order counts to be used as total counts in
              estimating N-gram probabilities.  Usually these totals are
              recomputed from the higher-order counts.

       -prune threshold
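       The distinction drawn under -trust-totals can be illustrated with a
       toy example (a sketch, not SRILM code): the denominator of an N-gram
       probability is either the sum of the higher-order counts (the
       default) or the stored lower-order count, and the two can differ.

```python
from collections import Counter

# Toy data: the stored unigram count for "the" (5) exceeds the sum of
# the bigram counts that start with "the" (4), as can happen at text
# boundaries or after counts have been filtered.
bigrams = Counter({("the", "cat"): 3, ("the", "dog"): 1})
stored_unigram_total = {"the": 5}   # the "trusted" lower-order count

# Default: recompute the total by summing higher-order counts.
recomputed = sum(c for (h, w), c in bigrams.items() if h == "the")
p_recomputed = bigrams[("the", "cat")] / recomputed              # 3/4

# -trust-totals behavior: use the stored lower-order count instead.
p_trusted = bigrams[("the", "cat")] / stored_unigram_total["the"]  # 3/5
```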