ngram(1)                                                              ngram(1)

NAME
       ngram - apply N-gram language models

SYNOPSIS
       ngram [-help] option ...

DESCRIPTION
       ngram performs various operations with N-gram-based and related
       language models, including sentence scoring, perplexity computation,
       sentence generation, and various types of model interpolation.  The
       N-gram language models are read from files in ARPA ngram-format(5);
       various extended language model formats are described with the
       options below.

OPTIONS
       Each filename argument can be an ASCII file, a compressed file (name
       ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -order n
              Set the maximal N-gram order to be used, by default 3.
              NOTE: The order of the model is not set automatically when a
              model file is read, so the same file can be used at various
              orders.  To use models of order higher than 3 it is always
              necessary to specify this option.

       -debug level
              Set the debugging output level (0 means no debugging output).
              Debugging messages are sent to stderr, with the exception of
              -ppl output as explained below.

       -memuse
              Print memory usage statistics for the LM.

       The following options determine the type of LM to be used.

       -null  Use a `null' LM as the main model (one that gives probability
              1 to all words).  This is useful in combination with mixture
              creation or for debugging.

       -lm file
              Read the (main) N-gram model from file.  This option is
              always required, unless -null was chosen.

       -tagged
              Interpret the LM as containing word/tag N-grams.

       -skip  Interpret the LM as a ``skip'' N-gram model.

       -hidden-vocab file
              Interpret the LM as an N-gram model containing hidden events
              between words.  The list of hidden event tags is read from
              file.  Hidden event definitions may also follow the N-gram
              definitions in the LM file (the argument to -lm).  The format
              for such definitions is

                   event [-delete D] [-repeat R] [-insert w] [-observed] [-omit]

              The optional flags after the event name modify the default
              behavior of hidden events in the model.  By default events
              are unobserved pseudo-words of which at most one can occur
              between regular words, and which are added to the context to
              predict following words and events.  (A typical use would be
              to model hidden sentence boundaries.)  -delete indicates that
              upon encountering the event, D words are deleted from the
              next word's context.  -repeat indicates that after the event
              the next R words from the context are to be repeated.
              -insert specifies that an (unobserved) word w is to be
              inserted into the history.  -observed specifies that the
              event tag is not hidden, but observed in the word stream.
              -omit indicates that the event tag itself is not to be added
              to the history for predicting the following words.

              The hidden event mechanism represents a generalization of the
              disfluency LM enabled by -df.  A sample event definition file
              is sketched below.
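              For illustration, a hidden event definition file following
              the format above might contain (the event names here are
              hypothetical, not predefined by ngram):

                   SBOUND
                   FP -observed -omit
                   RESTART -insert <s>

              Here SBOUND is an ordinary hidden event (e.g., a sentence
              boundary), FP is a filled pause that appears in the word
              stream but is omitted from the prediction history, and
              RESTART models a sentence restart by inserting <s> into the
              history.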
       -hidden-not
              Modifies processing of hidden event N-grams for the case that
              the event tags are embedded in the word stream, as opposed to
              inferred through dynamic programming.

       -df    Interpret the LM as containing disfluency events.  This
              enables an older form of hidden-event LM used in Stolcke &
              Shriberg (1996).  It is roughly equivalent to a hidden-event
              LM with

                   UH    -observed -omit     (filled pause)
                   UM    -observed -omit     (filled pause)
                   @SDEL -insert <s>         (sentence restart)
                   @DEL1 -delete 1 -omit     (1-word deletion)
                   @DEL2 -delete 2 -omit     (2-word deletion)
                   @REP1 -repeat 1 -omit     (1-word repetition)
                   @REP2 -repeat 2 -omit     (2-word repetition)

       -classes file
              Interpret the LM as an N-gram over word classes.  The
              expansions of the classes are given in file in
              classes-format(5).  Tokens in the LM that are not defined as
              classes in file are assumed to be plain words, so that the LM
              can contain mixed N-grams over both words and word classes.
              Class definitions may also follow the N-gram definitions in
              the LM file (the argument to -lm).  In that case -classes
              /dev/null should be specified to trigger interpretation of
              the LM as a class-based model.  Otherwise, class definitions
              specified with this option override any definitions found in
              the LM file itself.

       -simple-classes
              Assume a "simple" class model: each word is a member of at
              most one word class, and class expansions are exactly one
              word long.

       -expand-classes k
              Replace the read class-N-gram model with an (approximately)
              equivalent word-based N-gram.  The argument k limits the
              length of the N-grams included in the new model (k=0 allows
              N-grams of arbitrary length).

       -expand-exact k
              Use a more exact (but also more expensive) algorithm to
              compute the conditional probabilities of N-grams expanded
              from classes, for N-grams of length k or longer (k=0 is a
              special case and the default; it disables the exact algorithm
              for all N-grams).  The exact algorithm is recommended for
              class-N-gram models that contain multi-word class expansions,
              for N-gram lengths exceeding the order of the underlying
              class N-grams.

       -decipher
              Use the N-gram model exactly as the Decipher(TM) recognizer
              would, i.e., choosing the backoff path if it has a higher
              probability than the bigram transition, and rounding log
              probabilities to bytelog precision.

       -factored
              Use a factored N-gram model, i.e., a model that represents
              words as vectors of feature-value pairs and models sequences
              of words by a set of conditional dependency relations between
              factors.  Individual dependencies are modeled by standard
              N-gram LMs, allowing however for a generalized backoff
              mechanism to combine multiple backoff paths (Bilmes and
              Kirchhoff 2003).  The -lm, -mix-lm, etc. options name FLM
              specification files in the format described in Kirchhoff et
              al. (2002).

       -hmm   Use an HMM of N-grams language model.  The -lm option
              specifies a file that describes a probabilistic graph, with
              each line corresponding to a node or state.  A line has the
              format:

                   statename ngram-file s1 p1 s2 p2 ...

              where statename is a string identifying the state, ngram-file
              names a file containing a backoff N-gram model, s1, s2, ...
              are names of follow-states, and p1, p2, ... are the
              associated transition probabilities.  A filename of ``-'' can
              be used to indicate that the N-gram model data is included in
              the HMM file, after the current line.  (Further HMM states
              may be specified after the N-gram data.)  The names INITIAL
              and FINAL denote the start and end states, respectively, and
              have no associated N-gram model (ngram-file must be specified
              as ``.'' for these).  The -order option specifies the maximal
              N-gram length in the component models.

              The semantics of an HMM of N-grams is as follows: as each
              state is visited, words are emitted from the associated
              N-gram model.  The first state (corresponding to the
              start-of-sentence) is INITIAL.  A state is left with the
              probability of the end-of-sentence token in the respective
              model, and the next state is chosen according to the state
              transition probabilities.  Each state has to emit at least
              one word.  The actual end-of-sentence is emitted if and only
              if the FINAL state is reached.  Each word probability is
              conditioned on all preceding words, regardless of whether
              they were emitted in the same or a previous state.  A sketch
              of such a graph file appears below.
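              For illustration, a two-state HMM file following the format
              above might look like this (the state names and the model
              files news.lm and sports.lm are hypothetical):

                   INITIAL . s1 0.7 s2 0.3
                   s1 news.lm s1 0.5 s2 0.3 FINAL 0.2
                   s2 sports.lm s1 0.4 s2 0.4 FINAL 0.2
                   FINAL .

              A sentence starts in INITIAL, emits words from news.lm and/or
              sports.lm as states are visited, and ends only when FINAL is
              reached.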
       -count-lm
              Use a count-based interpolated LM.  The -lm option specifies
              a file that describes a set of N-gram counts along with
              interpolation weights, based on which Jelinek-Mercer
              smoothing in the formulation of Chen and Goodman (1998) is
              performed.  The file format is

                   order N
                   vocabsize V
                   totalcount C
                   mixweights M
                    w01 w02 ... w0N
                    w11 w12 ... w1N
                    ...
                    wM1 wM2 ... wMN
                   countmodulus m
                   google-counts dir
                   counts file

              Here N is the model order (maximal N-gram length), although,
              as with backoff models, the actual value used is overridden
              by the -order command line option when the model is read in.
              V gives the vocabulary size and C the sum of all unigram
              counts.  M specifies the number of mixture weight bins (minus
              1).  m is the width of a mixture weight bin.  Thus, wij is
              the mixture weight used to interpolate a j-th order
              maximum-likelihood estimate with lower-order estimates, given
              that the (j-1)-gram context has been seen with a frequency
              between i*m and (i+1)*m-1 times.  (For contexts with
              frequency M*m or greater, the weights from the last bin
              apply.)
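              For illustration, a minimal count-LM description for a bigram
              model might look like this (all numbers and the filenames are
              hypothetical):

                   order 2
                   vocabsize 10000
                   totalcount 250000
                   mixweights 2
                    0.5 0.4
                    0.6 0.5
                    0.7 0.6
                   countmodulus 10
                   counts my.counts

              With countmodulus 10 and mixweights 2, contexts seen 0-9
              times use the first row of weights, contexts seen 10-19 times
              the second, and all more frequent contexts the third.  Such a
              model could then be applied with, e.g.,

                   ngram -count-lm -lm counts.desc -ppl test.txt

              where -ppl computes test-set perplexity.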