ngram-count(1)                                                ngram-count(1)

NAME
       ngram-count - count N-grams and estimate language models

SYNOPSIS
       ngram-count [-help] option ...

DESCRIPTION
       ngram-count generates and manipulates N-gram counts, and estimates
       N-gram language models from them.  The program first builds an
       internal N-gram count set, either by reading counts from a file, or
       by scanning text input.  Following that, the resulting counts can be
       output back to a file or used for building an N-gram language model
       in ARPA ngram-format(5).  Each of these actions is triggered by
       corresponding options, as described below.

OPTIONS
       Each filename argument can be an ASCII file, or a compressed file
       (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -order n
              Set the maximal order (length) of N-grams to count.  This
              also determines the order of the estimated LM, if any.  The
              default order is 3.

       -vocab file
              Read a vocabulary from file.  Subsequently, out-of-vocabulary
              words in both counts and text are replaced with the
              unknown-word token.  If this option is not specified, all
              words found are implicitly added to the vocabulary.

       -vocab-aliases file
              Read vocabulary alias definitions from file, consisting of
              lines of the form

                   alias word

              This causes all tokens alias to be mapped to word.

       -write-vocab file
              Write the vocabulary built in the counting process to file.

       -tagged
              Interpret text and N-grams as consisting of word/tag pairs.

       -tolower
              Map all vocabulary to lowercase.

       -memuse
              Print memory usage statistics.

   Counting Options
       -text textfile
              Generate N-gram counts from textfile, which should contain
              one sentence unit per line.  Begin/end sentence tokens are
              added if not already present.  Empty lines are ignored.
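       As an illustration (not part of the original manual), the counting
       behavior of -text can be sketched in a few lines of Python: sentence
       boundary tokens are added where missing, empty lines are skipped,
       and all N-grams up to the given order are tallied.  This is a
       simplified sketch, not SRILM's actual implementation.

```python
from collections import Counter

def count_ngrams(lines, order=3):
    """Count all N-grams up to the given order, adding begin/end
    sentence tokens the way -text does (sketch, not SRILM code)."""
    counts = Counter()
    for line in lines:
        words = line.split()
        if not words:                 # empty lines are ignored
            continue
        if words[0] != "<s>":
            words = ["<s>"] + words   # add begin-sentence token
        if words[-1] != "</s>":
            words = words + ["</s>"]  # add end-sentence token
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

counts = count_ngrams(["a b a", "a b"])
# Print in the ASCII count file format: N-gram words, then the count
for ngram, c in sorted(counts.items()):
    print(" ".join(ngram), c)
```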
       -read countsfile
              Read N-gram counts from a file.  ASCII count files contain
              one N-gram of words per line, followed by an integer count,
              all separated by whitespace.  Repeated counts for the same
              N-gram are added.  Thus several count files can be merged by
              using cat(1) and feeding the result to ngram-count -read -
              (but see ngram-merge(1) for merging counts that exceed
              available memory).  Counts collected by -text and -read are
              additive as well.  Binary count files (see below) are also
              recognized.

       -read-google dir
              Read N-gram counts from an indexed directory structure rooted
              in dir, in a format developed by Google to store very large
              N-gram collections.  The corresponding directory structure
              can be created using the script make-google-ngrams described
              in training-scripts(1).

       -write file
              Write total counts to file.

       -write-binary file
              Write total counts to file in binary format.  Binary count
              files cannot be compressed and are typically larger than
              compressed ASCII count files.  However, they can be loaded
              faster, especially when the -limit-vocab option is used.

       -write-order n
              Order of counts to write.  The default is 0, which stands for
              N-grams of all lengths.

       -writen file
              where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9.  Writes only counts
              of the indicated order to file.  This is convenient to
              generate counts of different orders separately in a single
              pass.

       -sort  Output counts in lexicographic order, as required for
              ngram-merge(1).

       -recompute
              Regenerate lower-order counts by summing the highest-order
              counts for each N-gram prefix.

       -limit-vocab
              Discard N-gram counts on reading that do not pertain to the
              words specified in the vocabulary.  The default is that words
              used in the count files are automatically added to the
              vocabulary.
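       The additivity of repeated counts is what makes cat(1)-style merging
       work: because -read adds counts for N-grams it has already seen,
       concatenated files merge into one total.  A minimal Python sketch of
       parsing ASCII count lines (an illustration, not SRILM code):

```python
from collections import Counter

def read_counts(lines):
    """Parse ASCII count lines ("w1 ... wN count"); repeated N-grams
    are added, so concatenated count files merge naturally."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        *ngram, c = fields          # last field is the integer count
        counts[tuple(ngram)] += int(c)
    return counts

merged = read_counts([
    "the cat 3",     # from a first count file
    "the dog 1",
    "the cat 2",     # same N-gram from a second file: counts add
])
```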
   LM Options
       -lm lmfile
              Estimate a backoff N-gram model from the total counts, and
              write it to lmfile in ngram-format(5).

       -nonevents file
              Read a list of words from file that are to be considered
              non-events, i.e., that can only occur in the context of an
              N-gram.  Such words are given zero probability mass in model
              estimation.

       -float-counts
              Enable manipulation of fractional counts.  Only certain
              discounting methods support non-integer counts.

       -skip  Estimate a ``skip'' N-gram model, which predicts a word by an
              interpolation of the immediate context and the context one
              word prior.  This also triggers N-gram counts to be generated
              that are one word longer than the indicated order.  The
              following four options control the EM estimation algorithm
              used for skip-N-grams.

       -init-lm lmfile
              Load an LM to initialize the parameters of the skip-N-gram.

       -skip-init value
              The initial skip probability for all words.

       -em-iters n
              The maximum number of EM iterations.

       -em-delta d
              The convergence criterion for EM: if the relative change in
              log likelihood falls below the given value, iteration stops.

       -count-lm
              Estimate a count-based interpolated LM using Jelinek-Mercer
              smoothing (Chen & Goodman, 1998).  Several of the options for
              skip-N-gram LMs (above) apply.  An initial count-LM in the
              format described in ngram(1) needs to be specified using
              -init-lm.  The options -em-iters and -em-delta control
              termination of the EM algorithm.  Note that the N-gram counts
              used to compute the maximum-likelihood estimates come from
              the -init-lm model.  The counts specified with -read or -text
              are used only to estimate the smoothing (interpolation
              weights).

       -unk   Build an ``open vocabulary'' LM, i.e., one that contains the
              unknown-word token as a regular word.  The default is to
              remove the unknown word.
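       For intuition, Jelinek-Mercer smoothing interpolates the
       maximum-likelihood estimate of an N-gram with the lower-order
       distribution.  The sketch below (an illustration only; the fixed
       weight lam is hypothetical, whereas ngram-count estimates the
       interpolation weights by EM) shows the bigram case:

```python
from collections import Counter

def jelinek_mercer(bigram_counts, unigram_counts, total, lam=0.5):
    """Return p(w|h) = lam * p_ML(w|h) + (1 - lam) * p(w), the
    two-order Jelinek-Mercer interpolation (sketch; lam is a
    hypothetical fixed weight, not an EM-trained one)."""
    def prob(w, h):
        p_uni = unigram_counts[w] / total           # lower-order estimate
        c_h = unigram_counts[h]
        p_ml = bigram_counts[(h, w)] / c_h if c_h else 0.0
        return lam * p_ml + (1 - lam) * p_uni
    return prob

uni = Counter({"a": 3, "b": 2})
bi = Counter({("a", "b"): 2, ("a", "a"): 1})
p = jelinek_mercer(bi, uni, total=5)
```

       Because both component distributions sum to one over the vocabulary,
       the interpolated distribution does as well.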
       -map-unk word
              Map out-of-vocabulary words to word, rather than the default
              <unk> tag.

       -trust-totals
              Force the lower-order counts to be used as total counts in
              estimating N-gram probabilities.  Usually these totals are
              recomputed from the higher-order counts.

       -prune threshold
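       The distinction drawn under -trust-totals can be illustrated with a
       toy example (a sketch, not SRILM code): the denominator of an N-gram
       probability is either the sum of the higher-order counts (the
       default) or the stored lower-order count, and the two can differ.

```python
from collections import Counter

# Toy data: the stored unigram count for "the" (5) exceeds the sum of
# the bigram counts that start with "the" (4), as can happen at text
# boundaries or after counts have been filtered.
bigrams = Counter({("the", "cat"): 3, ("the", "dog"): 1})
stored_unigram_total = {"the": 5}   # the "trusted" lower-order count

# Default: recompute the total by summing higher-order counts.
recomputed = sum(c for (h, w), c in bigrams.items() if h == "the")
p_recomputed = bigrams[("the", "cat")] / recomputed              # 3/4

# -trust-totals behavior: use the stored lower-order count instead.
p_trusted = bigrams[("the", "cat")] / stored_unigram_total["the"]  # 3/5
```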