disambig(1)                                                        disambig(1)

NAME
    disambig - disambiguate text tokens using an N-gram model

SYNOPSIS
    disambig [-help] option ...

DESCRIPTION
    disambig translates a stream of tokens from a vocabulary V1 to a
    corresponding stream of tokens from a vocabulary V2, according to a
    probabilistic, 1-to-many mapping.  Ambiguities in the mapping are
    resolved by finding the V2 sequence with the highest posterior
    probability given the V1 sequence.  This probability is computed from
    pairwise conditional probabilities P(V1|V2), as well as a language
    model for sequences over V2.

OPTIONS
    Each filename argument can be an ASCII file, a compressed file (name
    ending in .Z or .gz), or ``-'' to indicate stdin/stdout.

    -help
        Print option summary.

    -version
        Print version information.

    -text file
        Specifies the file containing the V1 sentences.

    -map file
        Specifies the file containing the V1-to-V2 mapping information.
        Each line of file contains the mapping for a single word in V1:

            w1  w21 [p21]  w22 [p22] ...

        where w1 is a word from V1, which has possible mappings w21, w22,
        ... from V2.  Optionally, each of these can be followed by a
        numeric string for the probability p21, which defaults to 1.  The
        number is used as the conditional probability P(w1|w21), but the
        program does not depend on these numbers being properly
        normalized.
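        For illustration, a small map file might look like the following;
        the words and numbers are invented solely to show the format
        described above, and are not taken from any distributed data:

            france  France 1.0
            may     May 0.3  may 0.7
            new     New 0.4  new 0.6

        A possible invocation using such a map (file names are
        hypothetical) could be:

            disambig -text input.txt -map example.map -lm v2.lm -order 2 \
                > output.txt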
    -escape string
        Set an ``escape string.''  Input lines starting with string are
        not processed and are passed unchanged to stdout instead.  This
        allows associated information to be passed to scoring scripts,
        etc.

    -text-map file
        Processes a combined text/map file.  The format of file is the
        same as for -map, except that the w1 field on each line is
        interpreted as a word token rather than a word type.  Hence, the
        V1 text input consists of all words in column 1 of file, in order
        of appearance.  This is convenient if different instances of a
        word have different mappings.
        There is no implicit insertion of begin/end sentence tokens in
        this mode.  Sentence boundaries should be indicated explicitly by
        lines of the form

            </s>    </s>
            <s>     <s>

        An escaped line (see -escape) also implicitly marks a sentence
        boundary.
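        As an illustration, a text-map file covering two short sentences
        might look as follows; the tokens and probabilities are invented
        for the example, and the boundary lines follow the format shown
        above:

            <s>     <s>
            in      In 0.2  in 0.8
            may     May 0.7  may 0.3
            </s>    </s>
            <s>     <s>
            france  France
            </s>    </s>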
    -classes file
        Specifies the V1-to-V2 mapping information in classes-format(5).
        Class labels are interpreted as V2 words, and expansions as V1
        words.  Multi-word expansions are not allowed.

    -scale
        Interpret the numbers in the mapping as P(w21|w1).  This is done
        by dividing probabilities by the unigram probabilities of w21,
        obtained from the V2 language model.

    -logmap
        Interpret numeric values in the map file as log probabilities,
        not probabilities.

    -lm file
        Specifies the V2 language model as a standard ARPA N-gram backoff
        model file in ngram-format(5).  The default is not to use a
        language model, i.e., to choose V2 tokens based only on the
        probabilities in the map file.

    -order n
        Set the effective N-gram order used by the language model to n.
        Default is 2 (use a bigram model).

    -mix-lm file
        Read a second N-gram model for interpolation purposes.

    -factored
        Interpret the files specified by -lm and -mix-lm as a factored
        N-gram model specification.  See ngram(1) for details.

    -count-lm
        Interpret the model specified by -lm (but not -mix-lm) as a
        count-based LM.  See ngram(1) for details.

    -lambda weight
        Set the weight of the main model when interpolating with -mix-lm.
        Default value is 0.5.

    -mix-lm2 file
    -mix-lm3 file
    -mix-lm4 file
    -mix-lm5 file
    -mix-lm6 file
    -mix-lm7 file
    -mix-lm8 file
    -mix-lm9 file
        Up to 9 more N-gram models can be specified for interpolation.

    -mix-lambda2 weight
    -mix-lambda3 weight
    -mix-lambda4 weight
    -mix-lambda5 weight
    -mix-lambda6 weight
    -mix-lambda7 weight
    -mix-lambda8 weight
    -mix-lambda9 weight
        These are the weights for the additional mixture components,
        corresponding to -mix-lm2 through -mix-lm9.  The weight for the
        -mix-lm model is 1 minus the sum of -lambda and -mix-lambda2
        through -mix-lambda9.

    -bayes length
        Set the context length used for Bayesian interpolation.  The
        default value is 0, giving the standard fixed interpolation
        weight specified by -lambda.

    -bayes-scale scale
        Set the exponential scale factor on the context likelihood used
        in conjunction with the -bayes function.  Default value is 1.0.

    -lmw W
        Scales the language model probabilities by a factor W.  Default
        language model weight is 1.

    -mapw W
        Scales the likelihood map probability by a factor W.  Default map
        weight is 1.  Note: for Viterbi decoding (the default) it is
        equivalent to use -lmw W or -mapw 1/W, but not for
        forward-backward computation.

    -tolower1
        Map input vocabulary (V1) to lowercase, removing case
        distinctions.

    -tolower2
        Map output vocabulary (V2) to lowercase, removing case
        distinctions.

    -keep-unk
        Do not map unknown input words to the <unk> token.  Instead,
        output the input word unchanged.  This is like having an implicit
        default mapping for unknown words to themselves, except that the
        word will still be treated as <unk> by the language model.  Also,
        with this option the LM is assumed to be open-vocabulary (the
        default is closed-vocabulary).

    -vocab-aliases file
        Reads vocabulary alias definitions from file, consisting of lines
        of the form

            alias word

        This causes all V2 tokens alias to be mapped to word, and is
        useful for adapting mismatched language models.

    -no-eos
        Do not assume that each input line contains a complete sentence.
        This prevents end-of-sentence tokens </s> from being appended
        automatically.

    -continuous
        Process all words in the input as one sequence of words,
        irrespective of line breaks.  Normally each line is processed
        separately as a sentence.  V2 tokens are output one per line.
        This option also prevents sentence start/end tokens (<s> and
        </s>) from being added to the input.

    -fb
        Perform forward-backward decoding of the input (V1) token
        sequence.  Outputs the V2 tokens that have the highest posterior
        probability, for each position.  The default is to use Viterbi
        decoding, i.e., the output is the V2 sequence with the highest
        joint posterior probability.

    -fw-only
        Similar to -fb, but uses only the forward probabilities for
        computing posteriors.  This may be used to simulate on-line
        prediction of tags, without the benefit of future context.

    -totals
        Output the total string probability for each input sentence.

    -posteriors
        Output the table of posterior probabilities for each input (V1)
        token and each V2 token, in the same format as required for the
        -map file.  If -fb is also specified, the posterior probabilities
        will be computed using forward-backward probabilities; otherwise
        an approximation will be used that is based on the probability of
        the most likely path containing a given V2 token at a given
        position.

    -nbest N
        Output the N best hypotheses instead of just the first best when
        doing Viterbi search.  If N > 1, each hypothesis is prefixed by
        the tag NBEST_n x, where n is the rank of the hypothesis in the
        N-best list and x its score, the negative log of the combined
        probability of transitions and observations of the corresponding
        HMM path.

    -write-counts file
        Outputs the V2-V1 bigram counts corresponding to the tagging
        performed on the input data.  If -fb was specified these are
        expected counts; otherwise they reflect the 1-best tagging
        decisions.

    -write-vocab1 file
        Writes the input vocabulary from the map (V1) to file.

    -write-vocab2 file
        Writes the output vocabulary from the map (V2) to file.  The
        vocabulary will also include the words specified in the language
        model.

    -write-map file
        Writes the map back to a file for validation purposes.

    -debug
        Sets the debugging output level.

BUGS
    The -continuous and -text-map options effectively disable -keep-unk,
    i.e., unknown input words are always mapped to <unk>.  Also,
    -continuous doesn't preserve the positions of escaped input lines
    relative to the input.

SEE ALSO
    ngram-count(1), hidden-ngram(1), training-scripts(1),
    ngram-format(5), classes-format(5).

AUTHOR
    Andreas Stolcke <stolcke@speech.sri.com>,
    Anand Venkataraman <anand@speech.sri.com>.
    Copyright 1995-2006 SRI International

SRILM Tools              $Date: 2006/07/30 00:07:52 $             disambig(1)