    .
\end{verbatim}
This shows the start and end time of each word and the total log
probability.  The fields output by \htool{HVite} can be controlled using the
\texttt{-o} option.  For example, the option \texttt{-o ST} would suppress
the scores and the times to give
\begin{verbatim}
    "testf1.rec"
    SIL
    ONE
    NINE
    FOUR
    SIL
    .
\end{verbatim}
In order to use \htool{HVite} effectively and efficiently, it is important
to set appropriate values for its pruning\index{pruning} thresholds and the
language model scaling parameters.  The main pruning beam is set by the
\texttt{-t} option.  Some experimentation will be necessary to determine
appropriate levels but around 250.0 is usually a reasonable starting point.
Word-end pruning (\texttt{-v}) and the maximum model
limit\index{maximum model limit} (\texttt{-u}) can also be set if required,
but these are not mandatory and their effectiveness will depend greatly on
the task.

The relative levels of insertion and deletion
errors\index{deletion errors}\index{insertion errors} can be controlled by
scaling the language model\index{language model scaling} likelihoods using
the \texttt{-s} option and adding a fixed \textit{penalty} using the
\texttt{-p} option.  For example, setting \texttt{-s 10.0 -p -20.0} would
mean that every language model log probability $x$ would be converted to
$10x - 20$ before being added to the tokens emitted from the corresponding
word-end node.  As an extreme example, setting \texttt{-p 100.0} caused the
digit recogniser above to output
\begin{verbatim}
   SIL OH OH ONE OH OH OH NINE FOUR OH OH OH OH SIL
\end{verbatim}
where adding 100 to each word-end transition has resulted in a large number
of insertion errors.  The word inserted is ``oh'' primarily because it is
the shortest in the vocabulary.
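Putting these pruning and scaling controls together, a complete recognition
command might look like the following sketch.  This is purely illustrative:
the names \texttt{wdnet}, \texttt{hmmset}, \texttt{dict}, \texttt{hmmlist}
and \texttt{testf1.mfc} are placeholders for the word network, model set,
dictionary, model list and test utterance, and the standard \texttt{-i}
option simply writes the output transcriptions to the named MLF.
\begin{verbatim}
    HVite -H hmmset -w wdnet -t 250.0 -s 10.0 -p -20.0 \
          -i results dict hmmlist testf1.mfc
\end{verbatim}
Starting from these values, \texttt{-t} can be tightened to trade accuracy
for speed, and \texttt{-s} and \texttt{-p} adjusted to balance insertions
against deletions as described above.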
Another problem which may occur during recognition is the inability to
arrive at the final node in the recognition network after processing the
whole utterance. \index{forceout@\texttt{FORCEOUT}} The user is made aware
of the problem by the message ``No tokens survived to final node of
network''.  The inability to match the data against the recognition network
is usually caused by poorly trained acoustic models and/or very tight
pruning beam-widths.  In such cases, partial recognition results can still
be obtained by setting the \htool{HRec} configuration variable
\texttt{FORCEOUT} true. \index{partial results} The results will be based
on the most likely partial hypothesis found in the network.

\mysect{Evaluating Recognition Results}{receval}

\index{decoder!results analysis}
Once the test data has been processed by the recogniser, the next step is
to analyse the results.  The tool
\index{hresults@\htool{HResults}}\htool{HResults} is provided for this
purpose.  \htool{HResults} compares the transcriptions output by
\htool{HVite} with the original reference transcriptions and then outputs
various statistics.  \htool{HResults} matches each of the recognised and
reference label sequences by performing an optimal string
match\index{string matching} using dynamic programming.  Except when
scoring word-spotter output as described later, it does not take any notice
of any boundary timing information stored in the files being compared.
The optimal string match works by calculating a score for the match with
respect to the reference such that identical labels match with score 0, a
label insertion carries a score of 7, a deletion carries a score of 7 and a
substitution carries a score of 10\footnote{The default behaviour of
\htool{HResults} is slightly different to the widely used US NIST scoring
software which uses weights of 3, 3 and 4 and a slightly different
alignment algorithm.  Identical behaviour to NIST can be obtained by
setting the \texttt{-n} option.}.  The optimal string match is the label
alignment which has the lowest possible score.

Once the optimal alignment has been found, the number of substitution
errors ($S$), deletion errors ($D$) and insertion errors ($I$) can be
calculated.  The percentage correct is then
\begin{equation}
    \mbox{Percent Correct} = \frac{N-D-S}{N} \times 100\%
\end{equation}
where $N$ is the total number of labels in the reference transcriptions.
Notice that this measure ignores insertion errors.  For many purposes, the
percentage accuracy defined as
\begin{equation}
    \mbox{Percent Accuracy} = \frac{N-D-S-I}{N} \times 100\%
\end{equation}
is a more representative figure of recogniser
performance\index{recogniser performance}.

\htool{HResults} outputs both of the above measures.  As with all \HTK\
tools it can process individual label files and files stored in MLFs.  Here
the examples will assume that both reference and test transcriptions are
stored in MLFs.

As an example of use, suppose that the MLF \texttt{results} contains
recogniser output transcriptions, \texttt{refs} contains the corresponding
reference transcriptions and \texttt{wlist} contains a list of all labels
appearing in these files.  Then typing the command
\begin{verbatim}
    HResults -I refs wlist results
\end{verbatim}
would generate something like the following
\begin{verbatim}
  ====================== HTK Results Analysis =======================
    Date: Sat Sep  2 14:14:22 1995
    Ref : refs
    Rec : results
  ------------------------ Overall Results --------------------------
  SENT: %Correct=98.50 [H=197, S=3, N=200]
  WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
  ===================================================================
\end{verbatim}
The first part shows the date and the names of the files being used.  The
line labelled \texttt{SENT} shows the total number of complete sentences
which were recognised correctly.  The second line labelled \texttt{WORD}
gives the recognition statistics\index{recognition!statistics} for the
individual words\footnote{All the examples here will assume that each label
corresponds to a word but in general the labels could stand for any
recognition unit such as phones, syllables, etc.  \htool{HResults} does not
care what the labels mean but for human consumption, the labels
\texttt{SENT} and \texttt{WORD} can be changed using the \texttt{-a} and
\texttt{-b} options.}.
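The \texttt{WORD} figures above can be checked directly against the two
definitions.  With $N=855$, $D=1$, $S=1$ and $I=1$,
\[
    \mbox{Percent Correct} = \frac{855-1-1}{855} \times 100\% = 99.77\%,
    \qquad
    \mbox{Percent Accuracy} = \frac{855-1-1-1}{855} \times 100\% = 99.65\%
\]
in agreement with the \texttt{\%Corr} and \texttt{Acc} fields reported by
\htool{HResults}.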
It is often useful to visually inspect the recognition
errors\index{recognition!errors}.  Setting the \texttt{-t} option causes
aligned test and reference transcriptions to be output for all sentences
containing errors.  For example, a typical output might be
\begin{verbatim}
  Aligned transcription: testf9.lab vs testf9.rec
   LAB: FOUR    SEVEN NINE THREE
   REC: FOUR OH SEVEN FIVE THREE
\end{verbatim}
Here an ``oh'' has been inserted by the recogniser and ``nine'' has been
recognised as ``five''.

If preferred, results output can be formatted in an identical manner to
NIST scoring software\index{NIST scoring software} by setting the
\texttt{-h} option.  For example, the results given above would appear as
follows in NIST format\index{NIST format}
\begin{verbatim}
  ,-------------------------------------------------------------.
  | HTK Results Analysis at Sat Sep  2 14:42:06 1995            |
  | Ref: refs                                                   |
  | Rec: results                                                |
  |=============================================================|
  |           # Snt |  Corr    Sub    Del    Ins    Err  S. Err |
  |-------------------------------------------------------------|
  | Sum/Avg |  200  |  99.77   0.12   0.12   0.12   0.35   1.50 |
  `-------------------------------------------------------------'
\end{verbatim}

When computing recognition results it is sometimes inappropriate to
distinguish certain labels.  For example, to assess a digit recogniser used
for voice dialing it might be required to treat the alternative vocabulary
items ``oh'' and ``zero'' as being equivalent.  This can be done by making
them equivalent using the \texttt{-e} option, that is
\begin{verbatim}
    HResults -e ZERO OH  .....
\end{verbatim}
If a label is equated to the special label \verb+???+, then it is ignored.
Hence, for example, if the recognition output had silence marked by
\texttt{SIL}, then setting the option \verb+-e ??? SIL+ would cause all the
\texttt{SIL} labels to be ignored.\index{word equivalence}
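Combining these two facilities, a single scoring run that both equates
``zero'' with ``oh'' and ignores silence labels might be invoked as shown
below.  This is only an illustrative sketch reusing the \texttt{refs},
\texttt{wlist} and \texttt{results} files from above; the quotes merely
protect \verb+???+ from shell filename expansion.
\begin{verbatim}
    HResults -e ZERO OH -e '???' SIL -I refs wlist results
\end{verbatim}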
\htool{HResults} contains a number of other options.  Recognition
statistics can be generated for each file individually by setting the
\texttt{-f} option and a confusion matrix\index{confusion matrix} can be
generated by setting the \texttt{-p} option.  When comparing phone
recognition results, \htool{HResults} will strip any triphone contexts if
the \texttt{-s} option is set.  \htool{HResults} can also process N-best
recognition output.  Setting the option \texttt{-d N} causes
\htool{HResults} to search the first \texttt{N} alternatives of each test
output file to find the most accurate match with the reference labels.

When analysing the performance of a speaker independent recogniser it is
often useful to obtain accuracy figures on a per speaker basis.  This can
be done using the option \texttt{-k mask} where \texttt{mask} is a pattern
used to extract the speaker identifier\index{speaker identifier} from the
test label file name.  The pattern consists of a string of characters which
can include the pattern matching metacharacters \texttt{*} and \texttt{?}
to match zero or more characters and a single character, respectively.  The
pattern should also contain a string of one or more \texttt{\%} characters
which are used as a mask to identify the speaker identifier.  For example,
suppose that the test filenames had the following structure
\begin{verbatim}
    DIGITS_spkr_nnnn.rec
\end{verbatim}
where \texttt{spkr} is a 4 character speaker id and \texttt{nnnn} is a 4
digit utterance id.  Then executing \htool{HResults} by
\begin{verbatim}
    HResults -h -k '*_%%%%_????.*' ....
\end{verbatim}
would give output of the form
\begin{verbatim}
    ,-------------------------------------------------------------.
    | HTK Results Analysis at Sat Sep  2 15:05:37 1995            |
    | Ref: refs                                                   |
    | Rec: results                                                |
    |-------------------------------------------------------------|
    |    SPKR | # Snt |  Corr    Sub    Del    Ins    Err  S. Err |
    |-------------------------------------------------------------|
    |    dgo1 |   20  | 100.00   0.00   0.00   0.00   0.00   0.00 |
    |-------------------------------------------------------------|
    |    pcw1 |   20  |  97.22   1.39   1.39   0.00   2.78  10.00 |
    |-------------------------------------------------------------|
    ......
    |=============================================================|
    | Sum/Avg |  200  |  99.77   0.12   0.12   0.12   0.35   1.50 |
    `-------------------------------------------------------------'
\end{verbatim}

In addition to string matching, \htool{HResults} can also analyse the
results of a recogniser configured for word-spotting.  In this case, there
is no DP alignment.  Instead, each recogniser label $w$ is compared with
the reference transcriptions.  If the start and end times of $w$ lie either
side of the mid-point of an identical label in the reference, then that
recogniser label represents a \textit{hit}, otherwise it is a
\textit{false-alarm} (FA).  The recogniser output must include the log
likelihood scores as well as the word boundary information.
\index{Figure of Merit} These scores are used to compute the
\textit{Figure of Merit} (FOM) defined by NIST which is an upper-bound
estimate on word spotting accuracy averaged over 1 to 10 false alarms per
hour.  The FOM\index{FOM} is calculated as follows where it is assumed that
the total duration of the test speech is $T$ hours.  For each word, all of
the spots are ranked in score order.  The percentage of true hits $p_i$
found before the $i$'th false alarm is then calculated for
$i = 1 \ldots N+1$ where $N$ is the first integer $\ge 10T - 0.5$.  The
figure of merit is then defined as
\hequation{
    \mbox{FOM} = \frac{1}{10T}(p_1 + p_2 + \ldots + p_N + a p_{N+1})
}{nistfom}
where $a = 10T - N$ is a factor that interpolates to 10 false alarms per
hour.
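As a purely illustrative calculation, suppose the test data comprised
$T = 1.23$ hours of speech.  Then $10T - 0.5 = 11.8$, so $N = 12$ and
$a = 10T - N = 0.3$, giving
\[
    \mbox{FOM} = \frac{1}{12.3}\left(p_1 + p_2 + \ldots + p_{12}
                 + 0.3\,p_{13}\right)
\]
where the partial weight on $p_{13}$ interpolates the average out to
exactly 10 false alarms per hour.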
Word spotting analysis is enabled by setting the \texttt{-w} option and the
resulting output has the form
\begin{verbatim}
  ------------------- Figures of Merit --------------------
      KeyWord:    #Hits     #FAs  #Actual      FOM
        BADGE:       92       83      102    73.56
       CAMERA:       20        2       22    89.86
       WINDOW:       84        8       92    86.98
        VIDEO:       72        6       72    99.81
      Overall:      268       99      288    87.55
  ---------------------------------------------------------
\end{verbatim}
If required the standard time unit of 1 hour as used in the above
definition of FOM can be changed using the \texttt{-u} option.

\mysect{Generating Forced Alignments}{falign}

\index{decoder!forced alignment}
\sidefig{hvalign}{55}{Forced Alignment}{-4}{
\htool{HVite} can be made to compute forced alignments by not specifying a
network with the \texttt{-w} option but by specifying the \texttt{-a}
option instead.  In this mode, \htool{HVite} computes a new network for
each input utterance using the word level transcriptions and a dictionary.
By default, the output transcription will just contain the words and their
boundaries.  One of the main uses of forced
alignment\index{forced alignment}, however, is to determine the actual
pronunciations used in the utterances used to train the HMM system.  In
this case, the \texttt{-m} option can be used to generate model level
output transcriptions.}

This type of forced alignment is usually part of a \textit{bootstrap}
process.  Initially, models are trained on the basis of one fixed
pronunciation per \index{hled@\htool{HLEd}}\index{ex@\texttt{EX} command}
word\footnote{The \htool{HLEd} \texttt{EX} command can be used to compute
phone level transcriptions when there is only one possible phone
transcription per word.}.  Then \htool{HVite} is used in forced alignment
mode to select the best matching pronunciations.  The new phone level
transcriptions can then be used to retrain the HMMs.  Since training data
may have leading and trailing silence, it is usually necessary to insert a
silence model at the start and end of the recognition network.  The
\texttt{-b} option can be used to do this.

As an illustration, executing
\begin{verbatim}
    HVite -a -b sil -m -o SWT -I words.mlf \
       -H hmmset dict hmmlist file.mfc
\end{verbatim}
would result in the following sequence of events (see
Fig.~\href{f:hvalign}).  The input file name \texttt{file.mfc} would have
its extension replaced by \texttt{lab} and then a label file of this name
would be searched for.  In this case, the MLF file \texttt{words.mlf} has
been loaded.  Assuming that this file contains a word level transcription
called \texttt{file.lab}, this transcription along with the dictionary
\texttt{dict} will be used to construct a network equivalent to
\texttt{file.lab} but with alternative pronunciations included in parallel.
Since the \texttt{-b} option has been set, the specified \texttt{sil} model
will be inserted at the start and end of the network.
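When aligning a complete training set rather than a single file, the
per-utterance output is normally collected into one master label file.  The
following sketch is illustrative only: \texttt{aligned.mlf} and
\texttt{train.scp} are placeholder names for the output MLF and for a
script file listing the training data, set via the standard \texttt{-i} and
\texttt{-S} options.
\begin{verbatim}
    HVite -a -b sil -m -o SWT -i aligned.mlf -I words.mlf \
       -S train.scp -H hmmset dict hmmlist
\end{verbatim}
The resulting \texttt{aligned.mlf} can then serve as the transcription file
when the HMMs are retrained.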
