hidden-ngram(1)                                                hidden-ngram(1)

NAME
       hidden-ngram - tag hidden events between words

SYNOPSIS
       hidden-ngram [-help] option ...

DESCRIPTION
       hidden-ngram tags a stream of word tokens with hidden events
       occurring between words.  For example, an unsegmented text could be
       tagged for sentence boundaries (the hidden events in this case being
       `boundary' and `no-boundary').  The most likely hidden tag sequence
       consistent with the given word sequence is found according to an
       N-gram language model over both words and hidden tags.

       hidden-ngram is a generalization of segment(1).

OPTIONS
       Each filename argument can be an ASCII file, or a compressed file
       (name ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -text file
              Specifies the file containing the word sequences to be tagged
              (one sentence per line).  Start- and end-of-sentence tags are
              not added by the program, but should be included in the input
              if the language model uses them.

       -escape string
              Set an ``escape string.''  Input lines starting with string
              are not processed and are passed unchanged to stdout instead.
              This allows associated information to be passed to scoring
              scripts etc.

       -text-map file
              Read the input words from a map file containing both the
              words and additional likelihoods of events following each
              word.  Each line contains one input word, plus optional
              hidden-event/likelihood pairs in the format

                   w    e1 [p1] e2 [p2] ...

              If a p value is omitted a likelihood of 1 is assumed.  All
              events not explicitly listed are given likelihood 0, and are
              hence excluded for that word.  In particular, the label
              *noevent* must be listed to allow absence of a hidden event.
              Input word strings are assembled from multiple lines of
              -text-map input until either an end-of-sentence token </s> is
              found, or an escaped line (see -escape) is encountered.

       -logmap
              Interpret numeric values in the -text-map file as log
              probabilities, rather than probabilities.

       -lm file
              Specifies the word/tag language model as a standard ARPA
              N-gram backoff model file in ngram-format(5).

       -order n
              Set the effective N-gram order used by the language model to
              n.  Default is 3 (use a trigram model).

       -classes file
              Interpret the LM as an N-gram over word classes.  The
              expansions of the classes are given in file in
              classes-format(5).  Tokens in the LM that are not defined as
              classes in file are assumed to be plain words, so that the LM
              can contain mixed N-grams over both words and word classes.

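       As an illustration of the -text-map format described above, a map
       file for sentence-boundary tagging might look as follows; the words,
       the BOUNDARY event label, and the likelihood values are hypothetical
       examples, not values prescribed by the program:

            we          *noevent*
            met         *noevent*
            yesterday   *noevent* 0.4  BOUNDARY 0.6
            he          *noevent*
            left        *noevent* 0.2  BOUNDARY 0.8
            </s>        *noevent*

       Each line pairs one input word with the hidden events allowed after
       it (events not listed get likelihood 0, an omitted likelihood means
       1), and words are accumulated into one input string until the </s>
       line is reached.
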
       -simple-classes
              Assume a "simple" class model: each word is a member of at
              most one word class, and class expansions are exactly one
              word long.

       -mix-lm file
              Read a second N-gram model for interpolation purposes.  The
              second and any additional interpolated models can also be
              class N-grams (using the same -classes definitions).

       -factored
              Interpret the files specified by -lm, -mix-lm, etc. as
              factored N-gram model specifications.  See ngram(1) for more
              details.

       -lambda weight
              Set the weight of the main model when interpolating with
              -mix-lm.  Default value is 0.5.

       -mix-lm2 file
       -mix-lm3 file
       -mix-lm4 file
       -mix-lm5 file
       -mix-lm6 file
       -mix-lm7 file
       -mix-lm8 file
       -mix-lm9 file
              Up to 9 more N-gram models can be specified for
              interpolation.

       -mix-lambda2 weight
       -mix-lambda3 weight
       -mix-lambda4 weight
       -mix-lambda5 weight
       -mix-lambda6 weight
       -mix-lambda7 weight
       -mix-lambda8 weight
       -mix-lambda9 weight
              These are the weights for the additional mixture components,
              corresponding to -mix-lm2 through -mix-lm9.  The weight for
              the -mix-lm model is 1 minus the sum of -lambda and
              -mix-lambda2 through -mix-lambda9.

       -bayes length
              Set the context length used for Bayesian interpolation.  The
              default value is 0, giving the standard fixed interpolation
              weight specified by -lambda.

       -bayes-scale scale
              Set the exponential scale factor on the context likelihood in
              conjunction with the -bayes function.  Default value is 1.0.

       -lmw W Scales the language model probabilities by a factor W.
              Default language model weight is 1.

       -mapw W
              Scales the likelihood map probability by a factor W.  Default
              map weight is 1.

       -tolower
              Map vocabulary to lowercase, removing case distinctions.

       -vocab file
              Initialize the vocabulary for the LM from file.  This is
              useful in conjunction with -limit-vocab.

       -vocab-aliases file
              Reads vocabulary alias definitions from file, consisting of
              lines of the form

                   alias word

              This causes all tokens alias to be mapped to word.

       -hidden-vocab file
              Read the list of hidden tags from file.  Note: this is a
              subset of the vocabulary contained in the language model.

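       For illustration only, a tagging run that interpolates a main model
       with two additional models could be invoked as follows; the file
       names input.txt, main.lm, other.lm, third.lm, events.vocab, and
       tagged.txt are hypothetical, and events.vocab is assumed to list the
       hidden tags:

            hidden-ngram -text input.txt \
                    -lm main.lm \
                    -mix-lm other.lm -lambda 0.6 \
                    -mix-lm2 third.lm -mix-lambda2 0.3 \
                    -hidden-vocab events.vocab > tagged.txt

       With these weights the main model contributes 0.6 and the -mix-lm2
       model 0.3, so the -mix-lm model receives the remainder,
       1 - 0.6 - 0.3 = 0.1.
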
       -limit-vocab
              Discard LM parameters on reading that do not pertain to the
              words specified in the vocabulary, either by -vocab or
              -hidden-vocab.  The default is that words used in the LM are
              automatically added to the vocabulary.  This option can be
              used to reduce the memory requirements for large LMs that are
              going to be evaluated only on a small vocabulary subset.

       -force-event
              Forces a non-default event after every word.  This is useful
              for language models that represent the default event
              explicitly with a tag, rather than implicitly by the absence
              of a tag between words (which is the default).

       -keep-unk
              Do not map unknown input words to the <unk> token.  Instead,
              output the input word unchanged.  Also, with this option the
              LM is assumed to be open-vocabulary (the default is
              closed-vocabulary).

       -fb    Perform forward-backward decoding of the input token
              sequence.  Outputs the tags that have the highest posterior
              probability, for each position.  The default is to use
              Viterbi decoding, i.e., the output is the tag sequence with
              the highest joint posterior probability.

       -fw-only
              Similar to -fb, but uses only the forward probabilities for
              computing posteriors.  This may be used to simulate on-line
              prediction of tags, without the benefit of future context.

       -continuous
              Process all words in the input as one sequence of words,
              irrespective of line breaks.  Normally each line is processed
              separately as a sentence.  Input tokens are output
              one-per-line, followed by event tokens.

       -posteriors
              Output the table of posterior probabilities for each tag
              position.  If -fb is also specified the posterior
              probabilities will be computed using forward-backward
              probabilities; otherwise an approximation will be used that
              is based on the probability of the most likely path
              containing a given tag at a given position.

       -totals
              Output the total string probability for each input sentence.
              If -fb is also specified this probability is obtained by
              summing over all hidden event sequences; otherwise it is
              calculated (i.e., underestimated) using the most probable
              hidden event sequence.

       -nbest N
              Output the N best hypotheses instead of just the first best
              when doing Viterbi search.  If N > 1, then each hypothesis is
              prefixed by the tag NBEST_n x, where n is the rank of the
              hypothesis in the N-best list and x its score, the negative
              log of the combined probability of transitions and
              observations of the corresponding HMM path.

       -write-counts file
              Write the posterior-weighted counts of N-grams, including
              those with hidden tags, summed over the entire input data, to
              file.  The posterior probabilities should normally be
              computed with the forward-backward algorithm (instead of
              Viterbi), so the -fb option is usually also specified.  Only
              N-grams whose contexts occur in the language model are
              output.

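       As a sketch of posterior-based decoding (the file names input.txt,
       main.lm, events.vocab, counts.out, and tagged.txt are again
       hypothetical), forward-backward posteriors can be requested together
       with posterior-weighted counts:

            hidden-ngram -text input.txt -lm main.lm \
                    -hidden-vocab events.vocab \
                    -fb -posteriors -write-counts counts.out > tagged.txt

       Here -fb causes both the per-position posteriors and the counts in
       counts.out to be computed from forward-backward probabilities rather
       than from the Viterbi-based approximation described above.
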
       -unk-prob L
              Specifies that unknown words and other words having zero
              probability in the language model be assigned a log
              probability of L.  This is -100 by default but might be set
              to 0, e.g., to compute perplexities excluding unknown words.

       -debug Sets debugging output level.

BUGS
       The -continuous and -text-map options effectively disable -keep-unk,
       i.e., unknown input words are always mapped to <unk>.  Also,
       -continuous doesn't preserve the positions of escaped input lines
       relative to the input.

       The dynamic programming for event decoding is not efficiently
       interleaved with that required to evaluate class N-gram models;
       therefore, the state space generated in decoding with -classes
       quickly becomes infeasibly large unless -simple-classes is also
       specified.

       The file given by -classes is read multiple times if -limit-vocab is
       in effect or if a mixture of LMs is specified.  This will lead to
       incorrect behavior if the argument of -classes is stdin (``-'').

SEE ALSO
       ngram(1), ngram-count(1), disambig(1), segment(1), ngram-format(5),
       classes-format(5).

       A. Stolcke et al., ``Automatic Detection of Sentence Boundaries and
       Disfluencies based on Recognized Words,'' Proc. ICSLP, 2247-2250,
       Sydney.

AUTHORS
       Andreas Stolcke <stolcke@speech.sri.com>,
       Anand Venkataraman <anand@speech.sri.com>.
       Copyright 1998-2006 SRI International

SRILM Tools              $Date: 2006/07/28 07:34:29 $         hidden-ngram(1)
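
       As a closing illustration of the class-model advice in BUGS above
       (the file names class.lm, classes.defs, events.vocab, input.txt, and
       tagged.txt are hypothetical), a class N-gram LM would normally be
       used together with -simple-classes to keep the decoding state space
       manageable:

            hidden-ngram -text input.txt -lm class.lm \
                    -classes classes.defs -simple-classes \
                    -hidden-vocab events.vocab > tagged.txt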
