%/* ----------------------------------------------------------- */
%/*                                                              */
%/*                          ___                                 */
%/*                       |_| | |_/   SPEECH                     */
%/*                       | | | | \   RECOGNITION                */
%/*                       =========   SOFTWARE                   */
%/*                                                              */
%/*                                                              */
%/* ----------------------------------------------------------- */
%/*         Copyright: Microsoft Corporation                     */
%/*          1995-2000 Redmond, Washington USA                   */
%/*                    http://www.microsoft.com                  */
%/*                                                              */
%/*   Use of this software is governed by a License Agreement    */
%/*    ** See the file License for the Conditions of Use  **     */
%/*    **     This banner notice must not be removed      **     */
%/*                                                              */
%/* ----------------------------------------------------------- */
%
% HTKBook - Steve Young 1/12/97
%
\mychap{Decoding}{decode}

\sidepic{Tool.decode}{80}{ }
The previous chapter has described how to construct a recognition
network specifying what is allowed to be spoken and how
each word is pronounced. Given such a network, its associated
set of HMMs, and an unknown utterance, the probability of
any path through the network can be computed. The task of
a decoder is to find those paths which are the most likely.
\index{decoder}

As mentioned previously, decoding in \HTK\ is performed by a library
module called \htool{HRec}. \htool{HRec} uses the token passing
paradigm to find the best path and, optionally, multiple alternative
paths. In the latter case, it generates a lattice containing the
multiple hypotheses which can, if required, be converted to an N-best
list. To drive \htool{HRec} from the command line, \HTK\ provides a
tool called \htool{HVite}. As well as providing basic recognition,
\htool{HVite} can perform forced alignments, lattice rescoring and
recognise direct audio input. To assist in evaluating the performance
of a recogniser using a test database and a set of reference
transcriptions, \HTK\ also provides a tool called \htool{HResults}
to compute word accuracy and various related statistics. The
principles and use of these recognition facilities are described
in this chapter.

\mysect{Decoder Operation}{decop}

\index{decoder!operation}
As described in Chapter~\ref{c:netdict} and illustrated by
Fig.~\href{f:recsys}, decoding in \HTK\ is controlled by a recognition
network compiled from a word-level network, a dictionary and a set of
HMMs. The recognition network consists of a set of nodes connected
by arcs. Each node is either a HMM instance or a word-end.
Each model node is itself a network consisting of states connected by
arcs. Thus, once fully compiled, a recognition
network\index{recognition!network} ultimately
consists of HMM states connected by transitions. However, it can be
viewed at three different levels: word, model and state.
Fig.~\href{f:recnetlev} illustrates this hierarchy.

\sidefig{recnetlev}{62}{Recognition Network Levels}{2}{For an unknown
input utterance with $T$ frames, every path from the start node to the
exit node of the network which passes through exactly $T$ emitting HMM
states is a potential recognition
hypothesis\index{recognition!hypothesis}.
Each of these paths has a log probability which is computed by summing
the log probability of each individual transition in the path and the
log probability of each emitting state generating the corresponding
observation. Within-HMM transitions are determined from the HMM
parameters, between-model transitions are constant and word-end
transitions are determined by the language model likelihoods attached
to the word level networks.

The job of the decoder is to find those paths through the network
which have the highest log probability. These paths are found using a
\textit{Token Passing} algorithm. A token represents a partial path
through the network extending from time 0 through to time $t$.
At time 0, a token is placed in every possible start node.
\index{token passing}}
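As a token is propagated through the network, the log probability it
accumulates is exactly the path score described above. In the notation
used for HMM parameters elsewhere in this book, the contribution of
the emitting states $s_1, s_2, \ldots, s_T$ visited along a path can
be written as
\[
   \sum_{t=1}^{T} \left\{ \log a_{s_{t-1}s_{t}} + \log b_{s_{t}}(o_t) \right\}
\]
where $a_{ij}$ is a transition probability, $b_{j}(o_t)$ is the output
probability of state $j$ for the observation $o_t$ at time $t$ and
$s_0$ denotes the non-emitting node from which the path is entered;
the constant between-model and word-end (language model) transition
log probabilities are added in the same way.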
At each time step, tokens are propagated along connecting transitions,
stopping whenever they reach an emitting HMM state. When there are
multiple exits from a node, the token is copied so that all possible
paths are explored in parallel. As the token passes across transitions
and through nodes, its log probability is incremented by the
corresponding transition and emission probabilities. A network node
can hold at most $N$ tokens. Hence, at the end of each time step, all
but the $N$ best tokens in any node are discarded.

As each token passes through the network it must maintain a history
recording its route. The amount of detail in this
history\index{token history} depends
on the required recognition output. Normally, only word sequences
are wanted and hence, only transitions out of word-end
nodes\index{word-end nodes} need
be recorded. However, for some purposes, it is useful to know the
actual model sequence and the time of each model-to-model transition.
Sometimes a description of each path down to the state level
is required. All of this information, whatever level of detail is
required, can conveniently be represented using a lattice structure.
Of course, the number of tokens allowed per node and the amount of
history information requested will have a significant impact on
the time and memory needed to compute the lattices. The most
efficient configuration is $N=1$ combined with just word level
history information, and this is sufficient for most purposes.

A large network will have many nodes and one way to make a significant
reduction in the computation needed is to propagate only those tokens
which have some chance of being amongst the eventual winners. This
process is called \textit{pruning}. It is implemented at each time
step by keeping a record of the best token overall and de-activating
all tokens whose log probabilities fall more than a
\textit{beam-width} below the best. For efficiency reasons, it is
best to implement primary pruning\index{pruning} at the model rather
than the state level. Thus, models are deactivated when they have no
tokens in any state within the beam and they are reactivated whenever
active tokens are propagated into them. State-level pruning is also
implemented by replacing any token by a null (zero probability) token
if it falls outside the beam.

If the pruning beam-width\index{beam width} is set too small then the
most likely path might be pruned before its token reaches the end of
the utterance. This results in a \textit{search error}. Setting the
beam-width is thus a compromise between speed and avoiding search
errors.

When using word loops with bigram probabilities, tokens emitted from
word-end nodes will have a language model probability added to them
before entering the following word. Since the range of language
model probabilities is relatively small, a narrower beam can be
applied to word-end nodes without incurring additional search
errors\index{search errors}.
This beam is calculated relative to the best word-end token and
it is called a \textit{word-end beam}. In the case of a recognition
network with an arbitrary topology, word-end pruning may still be
beneficial but this can only be justified empirically.

Finally, a third type of pruning control is provided. An upper bound
on the amount of computation can be applied by setting an upper limit
on the number of models in the network which can be active
simultaneously. When this limit is reached, the pruning beam-width is
reduced in order to prevent it being exceeded.
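To make the token passing mechanics concrete, the fragment below
sketches the core recursion in Python for a network that has already
been flattened to the state level, keeping a single best token per
state ($N=1$) and applying a simple beam. It is an illustrative sketch
only, not the \htool{HRec} implementation: word-end nodes, language
model probabilities, token histories and lattice generation are all
omitted, and the network representation is invented purely for the
example.
\begin{verbatim}
   import math

   NEG_INF = float("-inf")

   def token_passing(n_states, transitions, emit_logprob, n_frames,
                     beam=200.0):
       # transitions: dict (from_state, to_state) -> transition log prob
       # emit_logprob(j, t): emission log prob of state j for frame t
       tokens = [NEG_INF] * n_states
       tokens[0] = 0.0                  # token placed in the start state
       for t in range(n_frames):
           new_tokens = [NEG_INF] * n_states
           for (i, j), a in transitions.items():
               if tokens[i] == NEG_INF: # no surviving token to propagate
                   continue
               score = tokens[i] + a + emit_logprob(j, t)
               if score > new_tokens[j]:
                   new_tokens[j] = score  # keep best token per state (N=1)
           best = max(new_tokens)       # beam pruning relative to the best
           tokens = [s if s >= best - beam else NEG_INF for s in new_tokens]
       return tokens                    # best log prob of ending in each state

   # toy 3-state left-to-right model with flat emission scores
   trans = {(0, 0): math.log(0.5), (0, 1): math.log(0.5),
            (1, 1): math.log(0.5), (1, 2): math.log(0.5),
            (2, 2): 0.0}
   print(token_passing(3, trans, lambda j, t: math.log(0.1), n_frames=4))
\end{verbatim}
In \htool{HRec} itself, of course, primary pruning is applied at the
model rather than the state level and each surviving token also
carries the history information needed to build a lattice.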
\mysect{Decoder Organisation}{decorg}

The decoding process itself is performed by a set of core functions
provided within the library module
\htool{HRec}\index{hrec@\htool{HRec}}. The process
of recognising a sequence of utterances is illustrated in
Fig.~\href{f:decflow}.
\index{decoder!organisation}

The first stage is to create a \textit{recogniser-instance}. This is
a data structure containing the compiled recognition network and
storage for tokens. The point of encapsulating all of the
information and storage needed for recognition into a single object is
that \htool{HRec}\index{hrec@\htool{HRec}} is re-entrant and can
support multiple recognisers\index{multiple recognisers}
simultaneously. Thus, although this facility is not utilised in the
supplied recogniser \htool{HVite}\index{hvite@\htool{HVite}}, it does
provide application developers with the capability to have multiple
recognisers running with different networks.

Once a recogniser has been created, each unknown input is
processed by first executing a \textit{start recogniser} call, and then
processing each observation one-by-one. When all input observations
have been processed, recognition is completed by generating a lattice.
This can be saved to disk as a standard lattice format (SLF) file or
converted to a transcription.
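This calling sequence can be summarised schematically as in the short
Python sketch below. The names used (\texttt{Recogniser},
\texttt{start}, \texttt{process}, \texttt{complete}) are purely
hypothetical stand-ins for the corresponding \htool{HRec} calls and
the ``lattice'' returned is only a placeholder; the point being
illustrated is that all recogniser state is held in one object, which
is what allows several recognisers to be driven in parallel.
\begin{verbatim}
   class Recogniser:
       """Hypothetical recogniser-instance: network plus token storage."""
       def __init__(self, network):
           self.network = network
           self.frames = 0

       def start(self):                 # the 'start recogniser' call
           self.frames = 0

       def process(self, observation):  # called once per observation
           self.frames += 1

       def complete(self):              # finish, return placeholder lattice
           return {"network": self.network, "frames": self.frames}

   rec = Recogniser(network="compiled recognition network")
   for utterance in ([0.3, 0.1, 0.7], [0.2, 0.9]):  # dummy observations
       rec.start()
       for obs in utterance:
           rec.process(obs)
       lattice = rec.complete()  # save as SLF or convert to a transcription
       print(lattice["frames"], "frames processed")
\end{verbatim}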
The above decoder organisation is extremely flexible and this is
demonstrated by the \HTK\ tool \htool{HVite} which is a simple
shell program designed to allow \htool{HRec} to be driven from
the command line. Firstly, input control in the form of a recognition
network allows three distinct modes of operation:
\sidefig{decflow}{62}{Recognition Processing}{2}{
\begin{enumerate}
\item \textit{Recognition} \\
This is the conventional case in which the recognition network
is compiled from a task level word network.
\index{decoder!recognition mode}
\item \textit{Forced Alignment} \\
In this case, the recognition network
is constructed from a word level transcription (i.e.\ orthography)
and a dictionary. The compiled network may include optional silences
between words and pronunciation variants. Forced alignment is often
useful during training to automatically derive phone level
transcriptions. It can also be used in automatic annotation systems.
\index{decoder!alignment mode}
\item \textit{Lattice-based Rescoring} \\
In this case, the input network is compiled from a lattice generated
during an earlier recognition run. This mode of operation can be
extremely useful for recogniser development since rescoring can be
an order of magnitude faster than normal recognition. The required
lattices are usually generated by a basic recogniser running with
multiple tokens, the idea being to generate a lattice containing both
the correct transcription and a representative number of confusions.
Rescoring can then be used to quickly evaluate the performance of
more advanced recognisers and the effectiveness of new recognition
techniques.
\index{decoder!rescoring mode}
\end{enumerate}

The second source of flexibility lies in the provision of multiple
tokens and recognition output in the form of a lattice. In addition
to providing a mechanism for rescoring, lattice output can be used
as a source of multiple hypotheses either for further recognition
processing or as input to a natural language processor. Where
convenient, lattice output can easily be converted into N-best
lists.}

Finally, since \htool{HRec} is explicitly driven step-by-step at the
observation level, it allows fine control over the recognition
process and a variety of traceback and on-the-fly output
possibilities.

For application developers, \htool{HRec} and the \HTK\ library modules
on which it depends can be linked directly into applications. It will
also be available in the form of an industry standard API. However,
as mentioned earlier, the \HTK\ toolkit also supplies a tool called
\htool{HVite} which is a shell program designed to allow \htool{HRec}
to be driven from the command line. The remainder of this chapter
will therefore explain the various facilities provided for
recognition from the perspective of \htool{HVite}.

\mysect{Recognition using Test Databases}{hvrec}

When building a speech recognition system or investigating speech
recognition algorithms, performance must be monitored by testing
on databases of test utterances for which reference transcriptions
are available. To use \htool{HVite} for this purpose it is
invoked with a command line of the form
\begin{verbatim}
   HVite -w wdnet dict hmmlist testf1 testf2 ....
\end{verbatim}
where \texttt{wdnet} is an SLF file containing the word level
network, \texttt{dict} is the pronouncing dictionary and
\texttt{hmmlist} contains a list of the HMMs to use. The effect of
this command is that \htool{HVite} will use \htool{HNet} to compile
the word level network and then use \htool{HRec} to recognise each
test file. The parameter kind of these test files must exactly match
that used to train the HMMs.

For evaluation purposes, test files are normally stored in
parameterised form but only the basic static coefficients are saved
on disk. For example, delta parameters are normally computed during
loading. As explained in Chapter~\ref{c:speechio}, \HTK\ can perform
a range of parameter conversions on loading and these are controlled
by configuration variables. Thus, when using \htool{HVite}, it is
normal to include a configuration file via the \texttt{-C} option in
which the required target parameter kind is specified.
Section~\ref{s:recaudio} below on processing direct audio input
explains the use of configuration files in more detail.
\index{decoder!evaluation}
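For example, a configuration file used for this purpose might contain
little more than a target kind specification such as
\begin{verbatim}
   # compute delta and acceleration coefficients when each file is loaded
   TARGETKIND = MFCC_0_D_A
\end{verbatim}
where \texttt{MFCC\_0\_D\_A} is purely illustrative; the qualifiers
must, of course, match the parameterisation with which the HMMs were
actually trained.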
In the simple default form of invocation given above, \htool{HVite}
would expect to find each HMM definition in a separate file in the
current directory and each output transcription would be written to a
separate file in the current directory. Also, of course, there will
typically be a large number of test files. In practice, it is much
more convenient to store HMMs in master macro files (MMFs), store
transcriptions in master label files (MLFs) and list data files in a
script file. Thus, a more common form of the above invocation would be
\begin{verbatim}
   HVite -T 1 -S test.scp -H hmmset -i results -w wdnet dict hmmlist
\end{verbatim}
where the file \texttt{test.scp} contains the list of test file names,
\texttt{hmmset} is an MMF containing the HMM
definitions\footnote{Large HMM sets will often be distributed across
a number of MMF files; in this case, the \texttt{-H} option will be
repeated for each file.}, and \texttt{results} is the MLF for storing
the recognition output.
\index{decoder!progress reporting}

As shown, it is usually a good idea to enable basic progress reporting
by setting the trace option. This will cause the recognised
word string to be printed after processing each file. For example, in
a digit recognition task the trace output might look like
\begin{verbatim}
   File: testf1.mfc
   SIL ONE NINE FOUR SIL
   [178 frames] -96.1404 [Ac=-16931.8 LM=-181.2] (Act=75.0)
\end{verbatim}
where the information listed after the recognised string is the total
number of frames in the utterance, the average log
probability\index{average log probability} per frame,
the total acoustic likelihood, the total language model likelihood and
the average number of active models.
\index{decoder!trace output}

The corresponding transcription written to the output MLF will
contain an entry of the form
\index{decoder!output MLF}
\begin{verbatim}
   "testf1.rec"
   0 6200000 SIL -6067.333008
   6200000 9200000 ONE -3032.359131
   9200000 12300000 NINE -3020.820312
   12300000 17600000 FOUR -4690.033203
   17600000 17800000 SIL -302.439148
   .
\end{verbatim}