segment(1)                                                    segment(1)

NAME
       segment - segment text using N-gram language model

SYNOPSIS
       segment [-help] option ...

DESCRIPTION
       segment infers a most likely segmentation (location of segment
       boundaries) from a text, based on a segment language model. The
       language model is a standard backoff N-gram model in ARPA
       ngram-format(5), modeling segmentation using the boundary tags
       <s> and </s>. The program reads in a word sequence, finds the
       most likely locations of segment boundaries according to the
       language model, and outputs the word sequence with segment
       boundaries marked by <s> tags.

OPTIONS
       Each filename argument can be an ASCII file, or a compressed
       file (name ending in .Z or .gz), or "-" to indicate
       stdin/stdout.

       -help  Print option summary.

       -version
              Print version information.

       -order n
              Set the maximal N-gram order to be used, by default 3.
              NOTE: The order of the model is not set automatically
              when a model file is read, so the same file can be used
              at various orders.

       -debug level
              Set the debugging output level (0 means no debugging
              output). Debugging messages are sent to stderr.

       -lm file
              Read the N-gram model from file.

       -text file
              Find the text to be segmented in file. Default input is
              stdin.

       -continuous
              Process all words in the input as one sequence of words,
              irrespective of line breaks. Normally each line is
              processed separately as a word sequence.

       -posteriors
              Use a forward-backward algorithm to compute the posterior
              probabilities of a segment boundary at each word
              transition, and hypothesize a boundary whenever the
              probability exceeds 0.5. By default a Viterbi algorithm
              is used that computes the globally most likely
              segmentation.
              If -continuous is specified as well, then this option
              will produce one line of output per word, containing,
              respectively, the <s> tag (if appropriate), the word
              itself, and the posterior probability for a boundary
              preceding the word.

       -unk   Output the unknown word token <unk> for each input word
              not in the language model vocabulary. The default is to
              output the input word unchanged.

       -stag string
              Use string to mark segment boundaries in the output.
              Default is the start-of-sentence symbol defined in the
              language model (<s>).

       -bias b
              Make a segment boundary a priori more likely by a factor
              of b. This allows balancing of false detection/rejection
              errors. The default is 1.

SEE ALSO
       ngram-count(1), ngram-format(5).
       A. Stolcke and E. Shriberg, "Automatic Linguistic Segmentation
       of Spontaneous Speech," Proc. ICSLP, 1005-1008, 1996.

BUGS
       Only N-gram models up to trigram order are used accurately. For
       higher-order models use the more general hidden-ngram(1).

AUTHOR
       Andreas Stolcke <stolcke@speech.sri.com>.
       Copyright 1997-2004 SRI International

SRILM Tools            $Date: 2004/12/03 17:59:01 $           segment(1)
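
EXAMPLES
       The following Python sketch is illustrative only and is not part
       of SRILM. It shows one way to assemble a segment command line
       from the options documented above, and to split a line of
       default (non -posteriors) output into segments at the boundary
       tag. The helper names build_segment_command and split_segments,
       and the file name seg.lm, are hypothetical. The parser assumes
       that output tokens are whitespace-separated and accepts lines
       with or without a leading boundary tag.

```python
def build_segment_command(lm_file, text_file="-", order=3,
                          posteriors=False, continuous=False,
                          bias=None, stag=None):
    """Return an argv list for segment(1) using only documented options.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    argv = ["segment", "-lm", lm_file, "-text", text_file,
            "-order", str(order)]
    if posteriors:
        argv.append("-posteriors")
    if continuous:
        argv.append("-continuous")
    if bias is not None:
        argv += ["-bias", str(bias)]
    if stag is not None:
        argv += ["-stag", stag]
    return argv


def split_segments(line, stag="<s>"):
    """Split one output line into a list of segments (lists of words).

    Assumption: tokens are whitespace-separated and each occurrence of
    `stag` marks the start of a new segment.
    """
    segments = []
    current = []
    for token in line.split():
        if token == stag:
            # A boundary tag closes the segment collected so far, if any.
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments
```

       The argv list returned by build_segment_command could be passed
       to subprocess.run on a system where SRILM is installed; note
       that -order is passed explicitly because the order is not taken
       from the model file.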