ngram-merge(1)                                                  ngram-merge(1)

NAME
       ngram-merge - merge N-gram counts

SYNOPSIS
       ngram-merge [-help] [-write outfile] [-float-counts] [--] infile1 infile2 ...

DESCRIPTION
       ngram-merge reads two or more lexicographically sorted N-gram count
       files (as produced by ngram-count -sort) and outputs the merged,
       sorted counts. The output is thus suitable for subsequent merging
       steps.

       The input format consists of one N-gram count per line,

              word1 word2 ... wordn count

       The lines must be sorted lexicographically on the words, leftmost
       first. The input may contain N-grams of different lengths. Each
       filename argument can be a plain ASCII count file, or a compressed
       file (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       ngram-merge is recommended in cases where the full counts would far
       exceed available real memory. Although an arbitrary number of input
       count files is accepted, it is best to use the program as follows.
       First, partition the input text into the largest chunks so that
       ngram-count can run in real memory. Then merge the resulting sorted
       counts using ngram-merge pairwise, and continue doing so in a binary
       tree pattern until a single count file containing all N-grams
       remains. This procedure is automated by the make-batch-counts and
       merge-batch-counts scripts.

OPTIONS
       Each filename argument can be an ASCII file, or a compressed file
       (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help  Print option and usage summary.

       -version
              Print version information.

       -write outfile
              Write merged counts to outfile, instead of standard output.

       -float-counts
              Process counts as floating point numbers. By default counts
              are assumed to be unsigned integers.

       --     Indicates the end of options, in case the first input
              filename begins with ``-''.

SEE ALSO
       ngram-count(1), ngram(1), training-scripts(1).

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>
       Copyright 1995-2004 SRI International
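
EXAMPLE
       The merging procedure described under DESCRIPTION can be sketched in
       Python. This is an illustration only, not the ngram-merge
       implementation. It assumes that merging means adding up the counts of
       identical N-grams, that counts are integers (the default case), and
       that the inputs are sorted the way Python compares tuples of strings,
       which may differ from the ordering used by ngram-count -sort. The
       names parse_counts, merge_counts and merge_tree are hypothetical.

```python
import heapq
from itertools import groupby


def parse_counts(lines):
    """Yield (words, count) pairs from lines of the form 'word1 ... wordn count'."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        yield tuple(fields[:-1]), int(fields[-1])


def merge_counts(*sorted_inputs):
    """Merge already-sorted count streams (assumption: equal N-grams are summed)."""
    streams = [parse_counts(lines) for lines in sorted_inputs]
    # heapq.merge only looks at the head of each stream, so memory use stays small.
    merged = heapq.merge(*streams, key=lambda item: item[0])
    # Equal N-grams arrive next to each other, so groupby can sum them.
    for words, group in groupby(merged, key=lambda item: item[0]):
        total = sum(count for _, count in group)
        yield " ".join(words) + " " + str(total)


def merge_tree(chunks):
    """Merge chunks pairwise, level by level, until a single list remains."""
    if not chunks:
        return []
    while len(chunks) > 1:
        chunks = [list(merge_counts(*chunks[i:i + 2]))
                  for i in range(0, len(chunks), 2)]
    return list(chunks[0])
```

       Each chunk passed to merge_tree is a list (or any iterable) of sorted
       count lines. For brevity the sketch keeps every intermediate level in
       memory as a list; the procedure described above would instead write
       each intermediate merge to a file so that no step has to hold the
       full counts.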