📄 nbest-scripts.1
filename, followed by the words.

rescore-minimize-wer is similar to rescore-reweight but picks hypotheses using the word error minimization algorithm of nbest-lattice(1).

nbest2-to-nbest1 converts an N-best list in "NBestList2.0" format to "NBestList1.0", for the benefit of programs that have not yet been updated to deal with the new format.

nbest-rover combines hypotheses from multiple N-best lists at the word level, by performing the same kind of word error minimization as nbest-lattice(1), in a generalization of the ROVER algorithm. sentid-list is a file listing sentence IDs. These must match the filenames in a set of N-best directories, which are specified in a control-file. The format for the latter is

    dir1 lmw1 wtw1 w1 [n1 [s1]]
    dir2 lmw2 wtw2 w2 [n2 [s2]]
    ...

Each line specifies an N-best directory, the language model and word transition weights to be used in score combination, and a weight to be applied to the posterior probabilities. An optional next-to-last parameter for each N-best list allows the lists to be truncated to the top n1, n2, etc., hypotheses. The final optional parameter sets the posterior distribution scaling factor, which defaults to the language model weight. Optionally, control-file can also contain lines of the form

    dir w +

These indicate that additional score files can be found in directory dir and that the scores found therein should be added to the following N-best list set with weight w. Several lines of this form may occur preceding a regular N-best directory specification; the corresponding additive combination of multiple scores is performed. If "-" is specified for sentid-list, the sentence IDs are inferred from the contents of the first directory dir1 specified in control-file.
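The control-file layout described above can be illustrated with a small reader. This is a minimal sketch in Python, not the parser used by nbest-rover itself: the names NbestSpec and parse_control are hypothetical, and the sketch assumes whitespace-separated fields with no comment lines or other special syntax. It only encodes what the text states: "dir w +" lines attach to the next regular line, and the scaling factor defaults to the language model weight.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NbestSpec:
    """One regular control-file line (hypothetical container, for illustration)."""
    directory: str
    lm_weight: float           # lmw
    wt_weight: float           # wtw
    posterior_weight: float    # w
    max_nbest: Optional[int]   # n, or None when the list is not truncated
    post_scale: float          # s, defaults to lmw as the text states
    # (dir, weight) pairs from preceding "dir w +" lines
    extra_scores: List[Tuple[str, float]] = field(default_factory=list)


def parse_control(text: str) -> List[NbestSpec]:
    """Parse control-file text into a list of N-best directory specs."""
    specs: List[NbestSpec] = []
    pending: List[Tuple[str, float]] = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) == 3 and fields[2] == "+":
            # Additional score directory, applies to the next regular line.
            pending.append((fields[0], float(fields[1])))
            continue
        if not 4 <= len(fields) <= 6:
            raise ValueError(f"unexpected control-file line: {raw!r}")
        lmw = float(fields[1])
        specs.append(NbestSpec(
            directory=fields[0],
            lm_weight=lmw,
            wt_weight=float(fields[2]),
            posterior_weight=float(fields[3]),
            max_nbest=int(fields[4]) if len(fields) > 4 else None,
            post_scale=float(fields[5]) if len(fields) > 5 else lmw,
            extra_scores=pending,
        ))
        pending = []
    return specs


if __name__ == "__main__":
    sample = "extra 0.5 +\nsysA 8 0 1.0 100\nsysB 10 0 0.5\n"
    for spec in parse_control(sample):
        print(spec)
```

The directory names and weights in the sample are made up. Keeping the "+" lines in a pending list until the next regular line mirrors the statement that several such lines may precede one N-best directory specification.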
If posterior-file is specified on the command line, posterior word probability estimates are written to that file. Any additional arguments are passed as options to the underlying nbest-lattice(1) invocation. nbest-rover can process N-best lists in any of the formats described in nbest-format(5), as long as all N-best lists for a given utterance are in the same format. When Decipher formats are used, only their acoustic scores are used.

combine-rover-controls takes one or more nbest-rover control files as arguments and outputs a new control file that specifies the combination of the input files. Each input system is given equal weight. Directory names in the input files are adjusted to reflect the relative location of the input files. The optional lambda= argument may be used to specify a space-separated list of system weights; the default weights are uniform.

nbest-posteriors rescales the scores in an N-best list to reflect (weighted) posterior probabilities. The output is the same N-best list with acoustic scores set to the log (base 10) of the posterior hypothesis probabilities and LM scores set to zero. postscale=S attenuates the posterior distribution by dividing combined log scores by S (the default is S=lmw). If weight=W is specified, the posteriors are multiplied by W. max_nbest=M limits the number of hypotheses used to the top M. This script is used mostly as a helper in nbest-rover.

merge-nbest merges hypotheses from one or more N-best lists into a single list, collapsing hypotheses that occur in more than one input list. If all input lists use the same nbest-format(5), then the output will also be in that format and contain the information from the first list in which a hypothesis was encountered.
Otherwise, the output will be in SRI Decipher(TM) NBestList1.0 format and contain acoustic scores and word strings only. The max_nbest=M option limits input to the first M hypotheses from each input list. multiwords=1 merges hypotheses that are identical after resolving multiwords. nopauses=1 merges hypotheses that are identical after removal of pause words.

nbest-vocab outputs the vocabulary used in a set of N-best lists. (The N-best files cannot be compressed, but may be concatenated and supplied via stdin.)

nbest-error computes the overall oracle word error rate of a set of N-best lists in directory score-dir or listed in file-list. The reference answers are given in refs in the format output by rescore-reweight (see above). Additional arguments are passed to the underlying invocation of nbest-lattice(1), and can be used to limit the depth of the N-best list, compute lattice error rather than N-best error, etc.

sentid-to-sclite converts 1-best hypotheses and references in the format used here to the "trn" format expected by the NIST sclite(1) scoring software.

sentid-to-ctm converts 1-best hypotheses and references in the format used here to NIST ctm(5) format. The script relies on an encoding of conversation IDs, channel, and utterance time marks in the sentence IDs and may need adjustment to local conventions.

fix-ctm converts output produced by the -output-ctm option of nbest-lattice(1) and lattice-tool(1) to a format suitable for scoring with NIST sclite(1). It, too, relies on information encoded in the sentence IDs and may need adjustments.

compute-sclite is a wrapper around the NIST sclite(1) scoring tool. refs and hyps are the reference and hypothesized transcripts, respectively.
The refs file can be either in "sentid" format or in stm(5) format. In the latter case, hyps will be converted to ctm(5) format using the sentid-to-ctm helper script. The hyps file can be either in "sentid" format or in ctm(5) format. More than one -h option can be given to combine the contents of multiple hypotheses files. Optionally, -S specifies a sorted list of sentence IDs subset to score. Multiple -S options may be given, to form the intersection of several subsets. -multiwords or -M splits "multiwords" joined by underscores into their component words prior to scoring. -noperiods deletes periods from the hypotheses prior to scoring (typically used to bridge different conventions for spelled letters). -R preserves reject words in the hypotheses for scoring (as appropriate if references also contain rejects). -g glmfile enables filtering of references and hypotheses by the NIST csrfilt.sh script, controlled by the filter file glmfile (this is only possible with an stm reference file). In that case, the -H option causes hesitations (as defined by the filter) to be deleted from the output for scoring purposes. -v displays the complete command used to invoke sclite. Any additional options are passed to sclite, e.g., to control its output actions or alignment mode.

compare-sclite scores two sets of hypotheses hyps1 and hyps2 for the same test set and computes in how many cases the first or second set had lower word error. The remaining options are as for compute-sclite. The script ignores hypotheses for sentences that do not appear in both hypothesis files, to ensure comparable scoring results.

SEE ALSO

nbest-format(5), ngram(1), nbest-lattice(1), nbest-optimize(1), sclite(1), stm(5), ctm(5).

J.G. Fiscus, "A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER)", Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Santa Barbara, CA, 347-352, 1997.

A. Stolcke et al., "The SRI March 2000 Hub-5 Conversational Speech Transcription System", Proc. NIST Speech Transcription Workshop, College Park, MD, 2000.

BUGS

sentid-to-sclite has some assumptions about the structure of sentence IDs built-in and may need to be modified for compute-sclite and compare-sclite to work.

rescore-decipher -pretty may not work correctly with the -limit-vocab option if the word mapping adds to the vocabulary subset used in the N-best lists.

AUTHOR

Andreas Stolcke <stolcke@speech.sri.com>.
Copyright 1995-2006 SRI International

SRILM Tools          $Date: 2006/07/29 18:42:28 $          nbest-scripts(1)