select-vocab(1)                                          select-vocab(1)

NAME
       select-vocab - Select a maximum-likelihood vocabulary from a
       mixture of corpora.

SYNOPSIS
       select-vocab [ -options ... ] -heldout file f1 f2 ... fn

DESCRIPTION
       select-vocab picks a vocabulary from the union of the
       vocabularies of files f1 through fn in order to maximize the
       likelihood of the heldout file. When invoked as above, the
       program prints the (unsorted) list of words in all of the input
       corpora together with their weights. This list may subsequently
       be sorted in decreasing order of weight, and a vocabulary may be
       chosen by picking a suitable threshold weight and ignoring words
       whose weight is below it.

       Several automatically detected formats are supported for the
       input files f1 through fn. They can be count files, which are
       characterized by each line ending in a number, ARPA language
       models in ngram-format(5), or plain text files. If a text file
       has a name ending in ".sentid", the first field of each line is
       assumed to be a sentence identifier and is discarded. All of the
       input files can also be compressed (if gzip is installed and
       available on the system).

OPTIONS
       -help  Prints a short help message.

       -heldout file
              Likelihood maximization is performed on the contents of
              file. This file may be in any of the formats supported
              for the input corpora, namely: text, counts, sentid, or
              ARPA-lm.

       -quiet Suppresses printing of progress and other informative
              messages during execution. By default the script writes
              these to the standard error stream.

       -scale n
              The combined final counts are scaled by n before being
              written out. This makes it possible to sort the output
              list numerically with sort(1). The default scale is 1e6.

NOTES
       This implementation corrects a minor error in the algorithm
       specification in [1]. The paper describes corpus-level
       interpolation, but the script actually does word-level
       interpolation.

       The program is written in perl(1) and requires it to be
       installed in order to run.

SEE ALSO
       ngram-count(1), ngram-format(5), training-scripts(1).

       [1] A. Venkataraman and W. Wang, "Techniques for effective
       vocabulary selection", in Proceedings of Eurospeech, Geneva,
       2003.

BUGS
       Probably. Send bug reports, fixes, modifications and
       enhancements to Anand Venkataraman (anand@speech.sri.com).

SOURCE
       Download as part of the SRILM toolkit, or stand-alone from
       http://www.speech.sri.com/people/anand/downloads/selvoc-v1.tar.gz

AUTHORS
       Anand Venkataraman <anand@speech.sri.com>
       Wen Wang <wwang@speech.dsri.com>
       Copyright 2003 SRI International

SRILM Tools          $Date: 2003/12/14 02:43:14 $          select-vocab(1)
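
EXAMPLES
       The DESCRIPTION says the output list can be sorted by decreasing
       weight and cut at a threshold to obtain a vocabulary. The sketch
       below illustrates that post-processing step in Python. It is not
       part of select-vocab. It assumes each output line holds a word
       followed by its weight, separated by whitespace; this page does
       not specify the exact column layout, so check real output before
       relying on it. The names parse_weighted_words and choose_vocab
       are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_weighted_words(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse lines assumed to look like '<word> <weight>'.

    Assumption (not confirmed by the man page): the word is the first
    whitespace-separated field and the weight is the last one.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        pairs.append((fields[0], float(fields[-1])))
    return pairs


def choose_vocab(pairs: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Return words in decreasing order of weight, dropping those below threshold."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [word for word, weight in ranked if weight >= threshold]


if __name__ == "__main__":
    # Made-up sample lines standing in for select-vocab output.
    sample = ["the 52000.0", "zyzzyva 0.4", "of 31000.5", "", "speech 120.0"]
    print(choose_vocab(parse_weighted_words(sample), threshold=100.0))
```

       The threshold is a free parameter: a lower value yields a larger
       vocabulary. Because the weights are scaled by the -scale value
       (default 1e6), a threshold chosen for one scale does not carry
       over to another.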