ngram-class(1)                                                  ngram-class(1)

NAME
       ngram-class - induce word classes from N-gram statistics

SYNOPSIS
       ngram-class [-help] option ...

DESCRIPTION
       ngram-class induces word classes from distributional statistics, so as
       to minimize the perplexity of a class-based N-gram model given the
       provided word N-gram counts. Presently, only bigram statistics are
       used, i.e., the induced classes are best suited for a class-bigram
       language model.

       The program generates the class N-gram counts and class expansions
       needed by ngram-count(1) and ngram(1), respectively, to train and to
       apply the class N-gram model.

OPTIONS
       Each filename argument can be an ASCII file, or a compressed file
       (name ending in .Z or .gz), or "-" to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -debug level
              Set debugging output at level. Level 0 means no debugging.
              Debugging messages are written to stderr. A useful level to
              trace the formation of classes is 2.

   Input Options
       -vocab file
              Read a vocabulary from file. Subsequently, out-of-vocabulary
              words in either counts or text are replaced with the
              unknown-word token. If this option is not specified, all words
              found are implicitly added to the vocabulary.

       -tolower
              Map the vocabulary to lowercase.

       -counts file
              Read N-gram counts from a file. Each line contains an N-gram of
              words, followed by an integer count, all separated by
              whitespace. Repeated counts for the same N-gram are added.
              Counts collected by -text and -counts are additive as well.
              Note that the input should contain consistent lower- and
              higher-order counts (i.e., unigrams and bigrams), as would be
              generated by ngram-count(1).

       -text textfile
              Generate N-gram counts from text file. textfile should contain
              one sentence unit per line. Begin/end sentence tokens are added
              if not already present. Empty lines are ignored.

   Class Merging
       -numclasses C
              The target number of classes to induce. A zero argument
              suppresses automatic class merging altogether (e.g., for use
              with -interact).

       -full  Perform full greedy merging over all classes, starting with one
              class per word. This is the O(V^3) algorithm described in Brown
              et al. (1992).

       -incremental
              Perform incremental greedy merging, starting with one class each
              for the C most frequent words, and then adding one word at a
              time. This is the O(V*C^2) algorithm described in Brown et al.
              (1992); it is the default.

       -interact
              Enter a primitive interactive interface when done with automatic
              class induction, allowing manual specification of additional
              merging steps.

       -noclass-vocab file
              Read a list of vocabulary items from file that are to be
              excluded from classes. These words or tags do not undergo class
              merging, but their N-gram counts still affect the optimization
              of model perplexity. The default is to exclude the sentence
              begin/end tags (<s> and </s>) from class merging; this can be
              suppressed by specifying -noclass-vocab /dev/null.

   Output Options
       -class-counts file
              Write class N-gram counts to file when done. The format is the
              same as for word N-gram counts, and can be read by
              ngram-count(1) to estimate a class N-gram model.

       -classes file
              Write class definitions (member words and their probabilities)
              to file when done. The output format is the same as required by
              the -classes option of ngram(1).

       -save S
              Save the class counts and/or class definitions every S
              iterations during induction. The filenames are obtained from the
              -class-counts and -classes options, respectively, by appending
              the iteration number. This is convenient for producing sets of
              classes at different granularities during the same run. S=0 (the
              default) suppresses the saving actions.

SEE ALSO
       ngram-count(1), ngram(1).

       P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L.
       Mercer, "Class-Based n-gram Models of Natural Language," Computational
       Linguistics 18(4), 467-479, 1992.

BUGS
       Classes are optimized only for bigram models at present.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1999-2004 SRI International

SRILM Tools               $Date: 2004/12/03 17:59:01 $          ngram-class(1)
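
The following Python sketch illustrates the idea behind the -counts input format and greedy class merging. It is not the ngram-class implementation. All function names are hypothetical. It assumes that only bigram lines ("w1 w2 count") are used and other orders are skipped, that the target number counts only mergeable classes, and that every candidate merge is rescored from scratch, which is far slower than the bookkeeping in Brown et al. (1992). The score is the training-set log likelihood of a maximum-likelihood class-bigram model, so maximizing it minimizes training perplexity.

```python
import math
from collections import Counter
from itertools import combinations


def read_bigram_counts(lines):
    """Parse "w1 w2 count" lines; repeated bigrams are summed, other orders skipped."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            counts[(fields[0], fields[1])] += int(fields[2])
    return counts


def log_likelihood(bigrams, cls):
    """Training log likelihood of an ML class-bigram model.

    Models p(w2 | w1) as p(c2 | c1) * p(w2 | c2); cls maps word -> class id.
    """
    pair, left, right, word = Counter(), Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        pair[(cls[w1], cls[w2])] += n   # class bigram count
        left[cls[w1]] += n              # class count in history position
        right[cls[w2]] += n             # class count in predicted position
        word[w2] += n                   # word count in predicted position
    return sum(
        n * (math.log(pair[(cls[w1], cls[w2])] / left[cls[w1]])
             + math.log(word[w2] / right[cls[w2]]))
        for (w1, w2), n in bigrams.items()
    )


def greedy_merge(bigrams, num_classes, noclass=("<s>", "</s>")):
    """Repeatedly apply the merge with the best likelihood (naive rescoring).

    Words in noclass keep their own class; num_classes == 0 disables merging.
    """
    words = sorted({w for bg in bigrams for w in bg})
    cls = {w: w for w in words}  # start with one class per word
    mergeable = [w for w in words if w not in noclass]
    while num_classes > 0 and len(mergeable) > num_classes:
        best = None
        for a, b in combinations(mergeable, 2):
            # Tentatively fold class b into class a and rescore.
            trial = {w: (a if c == b else c) for w, c in cls.items()}
            score = log_likelihood(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, b, trial)
        _, merged, cls = best
        mergeable.remove(merged)
    return cls
```

A call such as greedy_merge(read_bigram_counts(open("counts.txt")), 3), where counts.txt is a hypothetical counts file, returns a dictionary mapping each word to the label of its class. Because each step tries every pair of classes and rescans all bigrams, the sketch is practical only for very small vocabularies.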