ngram-class(1)                                                ngram-class(1)

NAME
       ngram-class - induce word classes from N-gram statistics

SYNOPSIS
       ngram-class [-help] option ...

DESCRIPTION
       ngram-class induces word classes from distributional statistics, so
       as to minimize the perplexity of a class-based N-gram model given
       the provided word N-gram counts. Presently, only bigram statistics
       are used, i.e., the induced classes are best suited for a
       class-bigram language model.

       The program generates the class N-gram counts and class expansions
       needed by ngram-count(1) and ngram(1), respectively, to train and
       to apply the class N-gram model.

OPTIONS
       Each filename argument can be an ASCII file, or a compressed file
       (name ending in .Z or .gz), or "-" to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -debug level
              Set debugging output at level. Level 0 means no debugging.
              Debugging messages are written to stderr. A useful level to
              trace the formation of classes is 2.

   Input Options
       -vocab file
              Read a vocabulary from file. Subsequently, out-of-vocabulary
              words in both counts and text are replaced with the
              unknown-word token. If this option is not specified, all
              words found are implicitly added to the vocabulary.

       -tolower
              Map the vocabulary to lowercase.

       -counts file
              Read N-gram counts from file. Each line contains an N-gram
              of words, followed by an integer count, all separated by
              whitespace. Repeated counts for the same N-gram are added.
              Counts collected by -text and -counts are additive as well.
              Note that the input should contain consistent lower- and
              higher-order counts (i.e., unigrams and bigrams), as would
              be generated by ngram-count(1).

       -text textfile
              Generate N-gram counts from a text file. textfile should
              contain one sentence unit per line. Begin/end sentence
              tokens are added if not already present. Empty lines are
              ignored.

   Class Merging
       -numclasses C
              The target number of classes to induce. A zero argument
              suppresses automatic class merging altogether (e.g., for
              use with -interact).

       -full  Perform full greedy merging over all classes, starting with
              one class per word. This is the O(V^3) algorithm described
              in Brown et al. (1992).

       -incremental
              Perform incremental greedy merging, starting with one class
              each for the C most frequent words, and then adding one
              word at a time. This is the O(V*C^2) algorithm described in
              Brown et al. (1992); it is the default.

       -interact
              Enter a primitive interactive interface when done with
              automatic class induction, allowing manual specification of
              additional merging steps.

       -noclass-vocab file
              Read a list of vocabulary items from file that are to be
              excluded from classes. These words or tags do not undergo
              class merging, but their N-gram counts still affect the
              optimization of model perplexity. The default is to exclude
              the sentence begin/end tags (<s> and </s>) from class
              merging; this can be suppressed by specifying
              -noclass-vocab /dev/null.

   Output Options
       -class-counts file
              Write class N-gram counts to file when done. The format is
              the same as for word N-gram counts, and can be read by
              ngram-count(1) to estimate a class N-gram model.

       -classes file
              Write class definitions (member words and their
              probabilities) to file when done. The output format is the
              same as required by the -classes option of ngram(1).

       -save S
              Save the class counts and/or class definitions every S
              iterations during induction. The filenames are obtained
              from the -class-counts and -classes options, respectively,
              by appending the iteration number. This is convenient for
              producing sets of classes at different granularities during
              the same run. S=0 (the default) suppresses the saving
              actions.

SEE ALSO
       ngram-count(1), ngram(1).
       P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and
       R. L. Mercer, "Class-Based n-gram Models of Natural Language,"
       Computational Linguistics 18(4), 467-479, 1992.

BUGS
       Classes are optimized only for bigram models at present.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1999-2004 SRI International

SRILM Tools          $Date: 2004/12/03 17:59:01 $          ngram-class(1)
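
The input handling described under -counts and -text can be sketched in a few lines of Python. This is only an illustration of the documented behavior (whitespace-separated N-gram plus integer count, repeated counts summed, sentence tags added when missing, empty lines skipped, counts from both sources added together); it is not SRILM's implementation. The function names read_counts and count_text are hypothetical, and skipping malformed count lines is an assumption of this sketch.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence begin/end tags named in the man page


def read_counts(lines):
    """Parse 'w1 ... wN count' lines; repeated N-grams are summed."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption of this sketch: skip blank/malformed lines
        *words, count = fields
        counts[tuple(words)] += int(count)
    return counts


def count_text(lines, order=2):
    """Count all N-grams up to `order` from one-sentence-per-line text."""
    counts = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue  # empty lines are ignored
        if words[0] != BOS:
            words.insert(0, BOS)  # add begin tag only if not already present
        if words[-1] != EOS:
            words.append(EOS)  # add end tag only if not already present
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    from_counts = read_counts(["the cat 2", "the cat 1", "cat 3"])
    from_text = count_text(["the cat", "", "<s> the dog </s>"])
    total = from_counts + from_text  # counts from both sources are additive
    print(total[("the", "cat")])
```

Counting unigrams together with bigrams in count_text mirrors the note that the input should contain consistent lower- and higher-order counts.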