% fundaments.tex --- Hidden Markov Model Toolkit (HTK) 3.2.1 book source
let $\phi_j(t)$ represent the maximum likelihood of observing speech
vectors $\bm{o}_1$ to $\bm{o}_t$ and being in state $j$ at time $t$.
This partial likelihood can be computed efficiently using the
following recursion (cf. equation~\ref{e:16})
\begin{equation} \label{e:27}
    \phi_j(t) = \max_i \left\{ \phi_i(t-1) a_{ij} \right\} b_j(\bm{o}_t).
\end{equation}
where
\begin{equation}
    \phi_1(1) = 1
\end{equation}
\begin{equation}
    \phi_j(1) = a_{1j} b_j(\bm{o}_1)
\end{equation}
for $1<j<N$.  The maximum likelihood $\hat{P}(O|M)$ is then given by
\begin{equation}
    \phi_N(T) = \max_i \left\{ \phi_i(T) a_{iN} \right\}
\end{equation}
As for the re-estimation case, the direct computation of likelihoods
leads to underflow; hence, log likelihoods are used instead.  The
recursion of equation~\ref{e:27} then becomes
\begin{equation} \label{e:30}
   \psi_j(t) = \max_i \left\{ \psi_i(t-1) + \log(a_{ij}) \right\}
                      + \log(b_j(\bm{o}_t)).
\end{equation}
This recursion forms the basis of the so-called Viterbi algorithm.
As shown in Fig.~\href{f:vtrellis}, this algorithm can be visualised
as finding the best path through a matrix where the vertical
dimension represents the states of the HMM and the horizontal
dimension represents the frames of speech (i.e. time).  Each large
dot in the picture represents the log probability of observing that
frame at that time and each arc between dots corresponds to a log
transition probability.  The log probability of any path is computed
simply by summing the log transition probabilities and the log output
probabilities along that path.  The paths are grown from
left-to-right column-by-column.  At time $t$, each partial
path\index{path@partial} $\psi_i(t-1)$ is known for all states $i$,
hence equation~\ref{e:30} can be used to compute $\psi_j(t)$, thereby
extending the partial paths by one time frame.
\centrefig{vtrellis}{100}{The Viterbi Algorithm for Isolated Word
Recognition}
This concept of a path\index{path} is extremely important and it is
generalised below to deal with the continuous speech case.

This completes the discussion of isolated word recognition using
HMMs.  There is no \HTK\ tool which implements the above Viterbi
algorithm directly.  Instead, a tool called
\htool{HVite}\index{hvite@\htool{HVite}} is provided which, along
with its supporting libraries, \htool{HNet} and \htool{HRec}, is
designed to handle continuous speech.  Since this recogniser is
syntax directed, it can also perform isolated word recognition as a
special case.  This is discussed in more detail below.
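
To make the recursion concrete, the following C fragment sketches
equation~\ref{e:30} for a single model.  It is an illustrative sketch
rather than code taken from \HTK: the arrays \texttt{loga} and
\texttt{logb}, holding $\log a_{ij}$ and $\log b_j(\bm{o}_t)$
respectively, are assumed to have been filled in beforehand, and
states $1$ and $N$ are the non-emitting entry and exit states as
above.
\begin{verbatim}
#define LZERO (-1.0E10)            /* used here to represent log(0)   */

/* Return the maximised log likelihood of a single N-state HMM given
   T observation frames.  loga[i][j] holds log a_ij (LZERO where there
   is no transition), logb[j][t] holds log b_j(o_t) and psi is an
   (N+1) x (T+1) work array.  States 1 and N are non-emitting.        */
double ViterbiLogLik(int N, int T, double **loga, double **logb,
                     double **psi)
{
   int i, j, t;
   double best, x;

   for (j = 2; j < N; j++)                    /* initialisation        */
      psi[j][1] = loga[1][j] + logb[j][1];

   for (t = 2; t <= T; t++)                   /* recursion             */
      for (j = 2; j < N; j++) {
         best = LZERO;
         for (i = 2; i < N; i++) {
            x = psi[i][t-1] + loga[i][j];
            if (x > best) best = x;
         }
         psi[j][t] = best + logb[j][t];
      }

   best = LZERO;                              /* termination           */
   for (i = 2; i < N; i++) {
      x = psi[i][T] + loga[i][N];
      if (x > best) best = x;
   }
   return best;
}
\end{verbatim}
A practical decoder would also record which predecessor state produced
each maximum so that the best state sequence can be recovered; making
this bookkeeping explicit is precisely the purpose of the Token
Passing Model introduced in the next section.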
\mysect{Continuous Speech Recognition}{consprec}

Returning now to the conceptual model of speech production and
recognition exemplified by Fig.~\href{f:messencode}, it should be
clear that the extension to continuous speech simply involves
connecting HMMs together in sequence.  Each model in the sequence
corresponds directly to the assumed underlying symbol.  These could
be either whole words for so-called {\it connected speech
recognition} or sub-words such as phonemes for {\it continuous speech
recognition}.  The reason for including the non-emitting entry and
exit states\index{non-emitting states} should now be evident: these
states provide the {\it glue} needed to join models together.

There are, however, some practical difficulties to overcome.  The
training data for continuous speech must consist of continuous
utterances and, in general, the boundaries dividing the segments of
speech corresponding to each underlying sub-word model in the
sequence will not be known.  In practice, it is usually feasible to
mark the boundaries of a small amount of data by hand.  All of the
segments corresponding to a given model can then be extracted and the
{\it isolated word} style of training described above can be used.
However, the amount of data obtainable in this way is usually very
limited and the resultant models will be poor estimates.
Furthermore, even if there was a large amount of data, the boundaries
imposed by hand-marking may not be optimal as far as the HMMs are
concerned.  Hence, in \HTK\ the use of \htool{HInit} and
\htool{HRest} for initialising sub-word models is regarded as a
{\it bootstrap}\index{bootstrapping} operation\footnote{They can even
be avoided altogether by using a \textit{flat start} as described in
section~\ref{s:flatstart}.}.  The main training phase involves the
use of a tool called \htool{HERest}\index{herest@\htool{HERest}}
which does {\it embedded training}.

Embedded training\index{embedded training} uses the same Baum-Welch
procedure as for the isolated case but rather than training each
model individually, all models are trained in parallel.  It works in
the following steps:
\begin{enumerate}
\item Allocate and zero accumulators for all parameters of all HMMs.
\item Get the next training utterance.
\item Construct a composite HMM by joining in sequence the HMMs
      corresponding to the symbol transcription of the training
      utterance.
\item Calculate the forward and backward probabilities for the
      composite HMM.  The inclusion of intermediate non-emitting
      states in the composite model requires some changes to the
      computation of the forward and backward probabilities but these
      are only minor.  The details are given in
      chapter~\ref{c:Training}.
\item Use the forward and backward probabilities to compute the
      probabilities of state occupation at each time frame and update
      the accumulators in the usual way.
\item Repeat from 2 until all training utterances have been
      processed.
\item Use the accumulators to calculate new parameter estimates for
      all of the HMMs.
\end{enumerate}
These steps can then all be repeated as many times as is necessary to
achieve the required convergence.  Notice that although the location
of symbol boundaries in the training data is not required (or wanted)
for this procedure, the symbolic transcription of each training
utterance is needed.
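
The accumulate-then-update structure of steps 1, 5 and 7 above can be
sketched in C for the simplest possible case: re-estimating only the
mean vectors of single-Gaussian emitting states, given the forward
and backward probabilities produced by step 4.  This is an
illustrative fragment, not the \htool{HERest} implementation, which
also re-estimates variances, mixture weights and transition
probabilities, operates on the composite model and must guard against
the underflow problems noted earlier.
\begin{verbatim}
/* Accumulators for the mean of one emitting state (step 1 zeroes these). */
typedef struct {
   double  occ;       /* total state occupation:  sum over t of L_j(t)    */
   double *obsSum;    /* occupation-weighted observation sum              */
} MeanAcc;

/* Step 5: add one utterance's contribution to the accumulators.
   alpha[j][t] and beta[j][t] are the forward/backward probabilities,
   o[t] the observation vectors and pr the total utterance likelihood. */
void AccumulateMeans(int N, int T, int vecSize, double **alpha,
                     double **beta, double **o, double pr, MeanAcc *acc)
{
   int j, t, k;
   for (j = 2; j < N; j++)                    /* emitting states only  */
      for (t = 1; t <= T; t++) {
         double Ljt = alpha[j][t] * beta[j][t] / pr;  /* occupation    */
         acc[j].occ += Ljt;
         for (k = 0; k < vecSize; k++)
            acc[j].obsSum[k] += Ljt * o[t][k];
      }
}

/* Step 7: after every utterance has been accumulated, the new mean of
   each state is the occupation-weighted average of the training data. */
void UpdateMeans(int N, int vecSize, MeanAcc *acc, double **mean)
{
   int j, k;
   for (j = 2; j < N; j++)
      if (acc[j].occ > 0.0)
         for (k = 0; k < vecSize; k++)
            mean[j][k] = acc[j].obsSum[k] / acc[j].occ;
}
\end{verbatim}
The essential point is that each utterance only adds to the
accumulators; the parameters themselves are updated once per pass,
after all of the training utterances have been processed.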
Whereas the extensions needed to the Baum-Welch procedure for
training sub-word models are relatively minor\footnote{In practice, a
good deal of extra work is needed to achieve efficient operation on
large training databases.  For example, the \htool{HERest} tool
includes facilities for pruning on both the forward and backward
passes and parallel operation on a network of machines.}, the
corresponding extensions to the Viterbi algorithm are more
substantial.

In \HTK, an alternative formulation of the Viterbi algorithm is
used\index{Token Passing Model} called the
{\it Token Passing Model}\footnote{See ``Token Passing: a Conceptual
Model for Connected Speech Recognition Systems'', SJ Young, NH
Russell and JHS Thornton, CUED Technical Report F\_INFENG/TR38,
Cambridge University, 1989.  Available by anonymous ftp from
\texttt{svr-ftp.eng.cam.ac.uk}.}.  In brief, the
token\index{path!as a token} passing model makes the concept of a
state alignment path explicit.  Imagine each state $j$ of a HMM at
time $t$ holds a single moveable token which contains, amongst other
information, the partial log probability $\psi_j(t)$.  This token
then represents a partial match between the observation sequence
$\bm{o}_1$ to $\bm{o}_t$ and the model subject to the constraint that
the model is in state $j$ at time $t$.  The path extension algorithm
represented by the recursion of equation~\ref{e:30} is then replaced
by the equivalent {\it token passing algorithm} which is executed at
each time frame $t$.  The key steps in this algorithm are as follows
\begin{enumerate}
\item Pass a copy of every token in state $i$ to all connecting
      states $j$, incrementing the log probability of the copy by
      $\log[a_{ij}] + \log[b_j(\bm{o}_t)]$.
\item Examine the tokens in every state and discard all but the token
      with the highest probability.
\end{enumerate}
In practice, some modifications are needed to deal with the
non-emitting states but these are straightforward if the tokens in
entry states are assumed to represent paths extended to time
$t-\delta t$ and tokens in exit states are assumed to represent paths
extended to time $t+\delta t$.

The point of using the Token Passing Model is that it extends very
simply to the continuous speech case.  Suppose that the allowed
sequence of HMMs is defined by a finite state network.  For example,
Fig.~\href{f:netforcsr} shows a simple network in which each word is
defined as a sequence of phoneme-based HMMs and all of the words are
placed in a loop.  In this network, the oval boxes denote HMM
instances\index{HMM!instance of} and the square boxes denote
\textit{word-end} nodes\index{word-end nodes}.  This composite
network is essentially just a single large HMM and the above Token
Passing algorithm applies.  The only difference now is that more
information is needed beyond the log probability of the best token.
When the best token reaches the end of the speech, the route it took
through the network must be known in order to recover the recognised
sequence of models.
\centrefig{netforcsr}{65}{Recognition Network for Continuously Spoken
Word Recognition}

The history of a token's route through the network may be recorded
efficiently as follows.  Every token carries a pointer called a
{\it word end link}.  When a token is propagated from the exit state
of a word (indicated by passing through a word-end
node\index{word-end nodes}) to the entry state of another, that
transition represents a potential word boundary.  Hence a record
called a {\it Word Link Record} is
generated\index{WLR}\index{Word Link Record} in which is stored the
identity of the word from which the token has just emerged and the
current value of the token's link.  The token's actual link is then
replaced by a pointer to the newly created WLR.
Fig.~\href{f:wlroper} illustrates this process.

Once all of the unknown speech has been processed, the WLRs attached
to the link of the best matching token (i.e. the token with the
highest log probability) can be traced back to give the best matching
sequence of words.  At the same time the positions of the word
boundaries can also be extracted if required.
\centrefig{wlroper}{100}{Recording Word Boundary Decisions}
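
To make the mechanism concrete, the following C fragment sketches one
possible representation of tokens, word end links and Word Link
Records, together with the final traceback.  It is a schematic
illustration only; the structures actually used by \htool{HRec} are
more elaborate since they must also support the lattice and N-best
output described below.
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

/* A Word Link Record: created whenever a token leaves a word-end node. */
typedef struct WordLink {
   const char      *word;      /* identity of the word just completed   */
   int              frame;     /* time of the potential word boundary   */
   double           logProb;   /* token log probability at that time    */
   struct WordLink *prev;      /* earlier boundary decisions (the link) */
} WordLink;

/* A token: the partial log probability psi and its word end link. */
typedef struct {
   double    logProb;
   WordLink *link;
} Token;

/* Record a word boundary: store the emerging word and the token's
   current link in a new WLR, then point the token at the new WLR.  */
void RecordWordEnd(Token *tok, const char *word, int frame)
{
   WordLink *wlr = malloc(sizeof(WordLink));
   wlr->word    = word;
   wlr->frame   = frame;
   wlr->logProb = tok->logProb;
   wlr->prev    = tok->link;
   tok->link    = wlr;
}

/* Trace back from the best token once all of the speech has been
   processed, printing the recognised words (last word first) together
   with their boundary times.                                          */
void TraceBack(const Token *best)
{
   const WordLink *wlr;
   for (wlr = best->link; wlr != NULL; wlr = wlr->prev)
      printf("%s  (boundary at frame %d)\n", wlr->word, wlr->frame);
}
\end{verbatim}
Since each WLR stores only a pointer to the record that preceded it,
tokens whose paths agree on their earlier word boundary decisions can
share those records, which is what makes this way of recording token
histories efficient.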
The token passing algorithm for continuous speech has been described
in terms of recording the word sequence only.  If required, the same
principle can be used to record decisions at the model and state
level.  Also, more than just the best token at each word boundary can
be saved.  This gives the potential for generating a lattice of
hypotheses rather than just the single best hypothesis.  Algorithms
based on this idea are called
\textit{lattice N-best}\index{lattice!N-best}.  They are suboptimal
because the use of a single token per state limits the number of
different token histories that can be maintained.  This limitation
can be avoided by allowing each model state to hold multiple tokens
and regarding tokens as distinct if they come from different
preceding words.  This gives a class of algorithm called
\textit{word N-best} which has been shown empirically to be
comparable in performance to an optimal N-best
algorithm. \index{N-best}\index{word N-best}

The above outlines the main idea of Token Passing as it is
implemented within \HTK.  The algorithms are embedded in the library
modules \htool{HNet}\index{hnet@\htool{HNet}} and
\htool{HRec}\index{hrec@\htool{HRec}} and they may be invoked using
the recogniser tool called \htool{HVite}\index{hvite@\htool{HVite}}.
They provide single and multiple-token\index{multiple-tokens} passing
recognition, single-best output, lattice output, N-best lists,
support for cross-word context-dependency, lattice
rescoring\index{lattice!rescoring} and forced
alignment\index{forced alignment}.

\mysect{Speaker Adaptation}{spk_adapt}

\index{adaptation}Although the training and recognition techniques
described previously can produce high performance recognition
systems, these systems can be improved upon by customising the HMMs
to the characteristics of a particular speaker.  \HTK\ provides the
tools \htool{HERest}\index{herest@\htool{HERest}} and
\htool{HVite}\index{hvite@\htool{HVite}} to perform adaptation using
a small amount of enrollment or adaptation data.  The two tools
differ in that \htool{HERest} performs offline supervised adaptation
while \htool{HVite} recognises the adaptation data and uses the
generated transcriptions to perform the adaptation.  Generally, more
robust adaptation is performed in a supervised mode, as provided by
\htool{HERest}, but given an initial well trained model set,
\htool{HVite} can still achieve noticeable improvements in
performance.  Full details of adaptation and how it is used in
\HTK\ can be found in Chapter~\ref{c:Adapt}.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "htkbook"
%%% End: 
