training-scripts(1)                                     training-scripts(1)

NAME
       training-scripts, compute-oov-rate, continuous-ngram-count,
       get-gt-counts, make-abs-discount, make-batch-counts, make-big-lm,
       make-diacritic-map, make-google-ngrams, make-gt-discounts,
       make-kn-counts, make-kn-discounts, merge-batch-counts,
       replace-words-with-classes, reverse-ngram-counts,
       split-tagged-ngrams, reverse-text, uniform-classes, vp2text -
       miscellaneous conveniences for language model training

SYNOPSIS
       get-gt-counts max=K out=name [counts ...]
       make-abs-discount gtcounts
       make-gt-discounts min=min max=max gtcounts
       make-kn-counts order=N max_per_file=M output=file [ no_max_order=1 ]
       make-kn-discounts min=min gtcounts
       make-batch-counts file-list [batch-size [filter [count-dir
              [options ...]]]]
       merge-batch-counts count-dir [file-list|start-iter]
       make-google-ngrams [ dir=DIR ] [ per_file=N ] [ gzip=0 ]
              [counts-file ...]
       continuous-ngram-count [ order=N ] [textfile ...]
       reverse-ngram-counts [counts-file ...]
       reverse-text [textfile ...]
       split-tagged-ngrams [ separator=S ] [counts-file ...]
       make-big-lm -name name -read counts [ -trust-totals
              -max-per-file M ] -lm new-model [options ...]
       replace-words-with-classes classes=classes [ outfile=counts
              normalize=0|1 addone=K have_counts=1 partial=1 ]
              [textfile ...]
       uniform-classes classes > new-classes
       make-diacritic-map vocab
       vp2text [textfile ...]
       compute-oov-rate vocab [counts ...]

DESCRIPTION
       These scripts perform convenience tasks associated with the
       training of language models.  They complement and extend the
       basic N-gram model estimator in ngram-count(1).

       Since these tools are implemented as scripts, they do not
       automatically read or write compressed data files correctly,
       unlike the main SRILM tools.  However, since most scripts work
       with data from standard input or to standard output (by leaving
       out the file argument, or specifying it as ``-''), it is easy to
       combine them with gunzip(1) or gzip(1) on the command line.

       Also note that many of the scripts take their options with the
       gawk(1) syntax option=value instead of the more common
       -option value.

       get-gt-counts computes the counts-of-counts statistics needed
       in Good-Turing smoothing.  The frequencies of counts up to K
       are computed (the default is 10).  The results are stored in a
       series of files with root name: name.gt1counts, name.gt2counts,
       ..., name.gtNcounts.  It is assumed that the input counts have
       been properly merged, i.e., that there are no duplicated
       N-grams.
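The statistic that get-gt-counts tabulates is simply how many distinct N-grams occur exactly r times.  As a toy illustration only (not the actual script, which also separates the statistics by N-gram order), the core tally can be sketched in awk:

```shell
# Tally counts-of-counts from ngram-count style lines ("ngram<TAB>count"):
# for each observed count r, print r and the number of N-grams seen r times.
printf 'a b\t3\nb c\t1\nc d\t1\nd e\t2\n' |
  awk '{ n[$NF]++ } END { for (r in n) print r, n[r] }' | sort -n
```

For this input, two N-grams occur once, one occurs twice, and one occurs three times, so the pipeline prints "1 2", "2 1", "3 1".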
       make-gt-discounts takes one of the output files of
       get-gt-counts and computes the corresponding Good-Turing
       discounting factors.  The output can then be passed to
       ngram-count(1) via the -gtn options to control the smoothing
       during model estimation.  Precomputing the GT discounting in
       this fashion has the advantage that the GT statistics are not
       affected by restricting N-grams to a limited vocabulary.  Also,
       get-gt-counts/make-gt-discounts can process arbitrarily large
       count files, since they do not need to read the counts into
       memory (unlike ngram-count).

       make-abs-discount computes the absolute discounting constant
       needed for the ngram-count -cdiscountn options.  Input is one
       of the files produced by get-gt-counts.

       make-kn-discounts computes the discounting constants used by
       the modified Kneser-Ney smoothing method.  Input is one of the
       files produced by get-gt-counts.
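For orientation, the discount factors make-gt-discounts derives are based on the classic Turing ratio d_r = (r+1) n_{r+1} / (r n_r), where n_r is the number of N-grams seen r times; the real script additionally applies the min/max cutoffs and Katz's renormalization.  A toy computation over a hypothetical counts-of-counts table:

```shell
# Toy counts-of-counts table: each line is "r n_r" (numbers are made up).
# Print the raw Turing discount ratio d_r = (r+1)*n_{r+1} / (r*n_r).
printf '1 100\n2 40\n3 20\n' |
  awk '{ n[$1] = $2 }
       END { for (r = 1; r <= 2; r++)
               printf "%d %.4f\n", r, (r+1)*n[r+1]/(r*n[r]) }'
```

Here d_1 = 2*40/100 = 0.8 and d_2 = 3*20/80 = 0.75, so counts are shaved more aggressively for rarely seen N-grams.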
       make-batch-counts performs the first stage in the construction
       of very large N-gram count files.  file-list is a list of input
       text files.  Lines starting with a `#' character are ignored.
       These files will be grouped into batches of size batch-size
       (default 10) that are then processed in one run of ngram-count
       each.  For maximum performance, batch-size should be as large
       as possible without triggering paging.  Optionally, a filter
       script or program can be given to condition the input texts.
       The N-gram count files are left in directory count-dir
       (``counts'' by default), where they can be found by a
       subsequent run of merge-batch-counts.  All following options
       are passed to ngram-count, e.g., to control N-gram order,
       vocabulary, etc. (no options triggering model estimation should
       be included).

       merge-batch-counts completes the construction of large count
       files by merging the batched counts left in count-dir until a
       single count file is produced.  Optionally, a file-list of
       count files to combine can be specified; otherwise all count
       files in count-dir from a prior run of make-batch-counts will
       be merged.  A number as second argument restarts the merging
       process at iteration start-iter.  This is convenient if merging
       fails to complete for some reason (e.g., for temporary lack of
       disk space).

       make-google-ngrams takes a sorted count file as input and
       creates an indexed directory structure, in a format developed
       by Google to store very large N-gram collections.  The
       resulting directory can then be used with the ngram-count(1)
       -read-google option.  Optional arguments specify the output
       directory dir and the size N of individual N-gram files (the
       default is 10 million N-grams per file).  The gzip=0 option
       writes plain, as opposed to compressed, files.
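Assuming the SRILM scripts are on $PATH, the two-stage batch workflow described above might be run as follows (all paths, the batch size, and the use of cat as an identity filter are illustrative):

```shell
# Stage 1: list the training texts, then count them in batches of 20,
# passing -order 3 through to ngram-count; per-batch count files are
# written under ./counts.
ls data/part-*.txt > file-list
make-batch-counts file-list 20 cat counts -order 3

# Stage 2: merge the per-batch counts into a single count file.
merge-batch-counts counts
```

If the merge is interrupted, rerunning merge-batch-counts with a numeric second argument resumes at that iteration rather than starting over.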
       continuous-ngram-count generates N-grams that span line breaks
       (which are usually taken to be sentence boundaries).  To count
       N-grams across line breaks use

            continuous-ngram-count textfile | ngram-count -read -

       The argument N controls the order of N-grams counted (default
       3), and should match the argument of ngram-count -order.

       reverse-ngram-counts reverses the word order of N-grams in a
       counts file or stream.  For example, to recompute lower-order
       counts from higher-order ones, but do the summation over
       preceding words (rather than following words, as in
       ngram-count(1)), use

            reverse-ngram-counts count-file | \
            ngram-count -read - -recompute -write - | \
            reverse-ngram-counts > new-counts

       reverse-text reverses the word order in text files,
       line-by-line.  Start- and end-sentence tags, if present, will
       be preserved.  This reversal is appropriate for preprocessing
       training data for LMs that are meant to be used with the ngram
       -reverse option.
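The line-by-line reversal performed by reverse-text can be pictured with a plain awk one-liner (an illustration only; the real script additionally leaves <s> and </s> tags in their original positions):

```shell
# Print the words of each input line in reverse order.
echo 'the quick brown fox' |
  awk '{ for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? OFS : ORS) }'
```

This prints "fox brown quick the".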
       split-tagged-ngrams expands N-gram counts of word/tag pairs
       into mixed N-grams of words and tags.  The optional separator=S
       argument allows the delimiting character, which defaults to
       "/", to be modified.

       make-big-lm constructs large N-gram models in a more
       memory-efficient way than ngram-count by itself.  It does so by
       precomputing the Good-Turing or Kneser-Ney smoothing parameters
       from the full set of counts, and then instructing ngram-count
       to store only a subset of the counts in memory, namely those of
       N-grams to be retained in the model.  The name parameter is
       used to name various auxiliary files.  counts contains the raw
       N-gram counts; it may be (and usually is) a compressed file.
       Unlike with ngram-count, the -read option can be repeated to
       concatenate multiple count files, but the arguments must be
       regular files; reading from stdin is not supported.  If
       Good-Turing smoothing is used and the file contains complete
       lower-order counts corresponding to the sums of higher-order
       counts, then the -trust-totals option may be given for
       efficiency.  All other options are passed to ngram-count (only
       options affecting model estimation should be given).  Smoothing
       methods other than Good-Turing and modified Kneser-Ney are not
       supported by make-big-lm.  Kneser-Ney smoothing also requires
       enough disk space to compute and store the modified lower-order
       counts used by the KN method.  This is done using the
       merge-batch-counts command, and the -max-per-file option
       controls how many counts are stored per batch; it should be
       chosen so that these batches fit in real memory.

       make-kn-counts computes the modified lower-order counts used by
       the KN smoothing method.  It is invoked as a helper script by
       make-big-lm.
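A complete make-big-lm invocation, combining the pieces above, might look like this (file names are illustrative; -order, -kndiscount, and -interpolate are ordinary ngram-count(1) estimation options passed through):

```shell
# Estimate a modified Kneser-Ney trigram model from a large compressed
# count file, without holding all counts in memory.  "big" names the
# auxiliary files; -max-per-file keeps each KN count batch in real memory.
make-big-lm -name big -read merged-counts.gz \
    -max-per-file 10000000 \
    -lm big.3bo.gz \
    -order 3 -kndiscount -interpolate
```

The counts file given to -read must be a regular file (stdin is not accepted), and -read may be repeated to concatenate several count files.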
       replace-words-with-classes replaces expansions of word classes
       with the corresponding class labels.  classes specifies class
       expansions in classes-format(5).  Ambiguities are resolved in
       favor of the longest matching word strings.  Ties are broken in
       favor of the expansion listed first in classes.  Optionally,
       the file counts will receive the expansion counts resulting
       from the replacements.  normalize=0 or 1 indicates whether the
       counts should be normalized to probabilities (the default is
       1).  The addone option may be used to smooth the expansion
       probabilities by adding K to each count (default 1).  The
       option have_counts=1 indicates that the input consists of
       N-gram counts and that replacement should be performed on them.
       Note that this will not merge counts that have been mapped to
       identical N-grams, since that is done automatically when
       ngram-count(1) reads count data.  The option partial=1 prevents
       multi-word class expansions from being replaced when more than
       one space character occurs in between the words.

       uniform-classes takes a file in classes-format(5) and adds
       uniform probabilities to expansions that don't have an
       explicitly stated probability.

       make-diacritic-map constructs a map file that pairs an
       ASCII-fied version of the words in vocab with all the occurring
       non-ASCII word forms.  Such a map file can then be used with
       disambig(1) and a language model to reconstruct the non-ASCII
       word forms with diacritics from an ASCII text.

       vp2text is a reimplementation of the filter used in the DARPA
       Hub-3 and Hub-4 CSR evaluations to convert ``verbalized
       punctuation'' texts to language model training data.

       compute-oov-rate determines the out-of-vocabulary rate of a
       corpus from its unigram counts and a target vocabulary list in
       vocab.

SEE ALSO
       ngram-count(1), ngram(1), classes-format(5), disambig(1),
       select-vocab(1).

BUGS
       Some of the tools could be generalized and/or made more robust
       to misuse.

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1995-2006 SRI International

SRILM Tools          $Date: 2006/08/11 22:35:11 $      training-scripts(1)
