% fundaments.tex
can estimate values for it. Multiple data streams are used to enable separate modelling of multiple information sources. In \HTK, the processing of streams is completely general. However, the speech input modules assume that the source data is split into at most 4 streams. Chapter~\ref{c:speechio} discusses this in more detail but for now it is sufficient to remark that the default streams are the basic parameter vector, first (delta) and second (acceleration) difference coefficients and log energy.

\mysect{Baum-Welch Re-Estimation}{bwrest}

To determine the parameters of a HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation\index{Baum-Welch re-estimation} formulae.

\sidefig{subsmixrep}{60}{Representing a Mixture}{-4}{Chapter~\ref{c:Training} gives the formulae used in \HTK\ in full detail. Here the basis of the formulae will be presented in a very informal way. Firstly, it should be noted that the inclusion of multiple data streams does not alter matters significantly since each stream is considered to be statistically independent. Furthermore, mixture components can be considered to be a special form of sub-state in which the transition probabilities are the mixture weights (see Fig.~\ref{f:subsmixrep}).}

Thus, the essential problem is to estimate the means and variances of a HMM in which each state output distribution is a single component Gaussian, that is
\begin{equation} \label{e:10}
  b_j(\bm{o}_t) = \frac{1}{\sqrt{(2 \pi)^n | \bm{\Sigma}_j |}}
    e^{- \frac{1}{2}(\bm{o}_t - \bm{\mu}_j)'\bm{\Sigma}_j^{-1}(\bm{o}_t - \bm{\mu}_j)}
\end{equation}
If there were just one state $j$ in the HMM, this parameter estimation would be easy.
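As an aside, the Gaussian output density of equation~\ref{e:10} can be evaluated directly. The following is a minimal sketch (illustrative Python, not \HTK\ code, with a hypothetical function name), working in the log domain since the raw density values quickly underflow:

```python
import numpy as np

def log_gaussian(o, mu, sigma):
    """Log of b_j(o_t) for a full-covariance Gaussian (equation 10)."""
    n = len(mu)
    d = o - mu
    _, logdet = np.linalg.slogdet(sigma)
    # log b_j(o) = -0.5 * (n log(2 pi) + log|Sigma| + d' Sigma^{-1} d)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(sigma, d))
```

Using `solve` rather than an explicit matrix inverse keeps the quadratic form numerically stable.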
The maximum likelihood estimates of $\bm{\mu}_j$ and $\bm{\Sigma}_j$ would be just the simple averages, that is
\begin{equation} \label{e:11}
  \hat{\bm{\mu}}_j = \frac{1}{T} \sum_{t=1}^{T} \bm{o}_t
\end{equation}
and
\begin{equation} \label{e:12}
  \hat{\bm{\Sigma}}_j = \frac{1}{T} \sum_{t=1}^{T} (\bm{o}_t - \bm{\mu}_j) (\bm{o}_t - \bm{\mu}_j)'
\end{equation}

In practice, of course, there are multiple states and there is no direct assignment of observation vectors to individual states because the underlying state sequence is unknown. Note, however, that if some approximate assignment of vectors to states could be made then equations~\ref{e:11} and \ref{e:12} could be used to give the required initial values for the parameters. Indeed, this is exactly what is done in the \HTK\ tool called \htool{HInit}\index{hinit@\htool{HInit}}. \htool{HInit} first divides the training observation vectors equally amongst the model states and then uses equations~\ref{e:11} and \ref{e:12} to give initial values for the mean and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi\index{Viterbi training} algorithm described below, reassigns the observation vectors to states and then uses equations~\ref{e:11} and \ref{e:12} again to get better initial values. This process is repeated until the estimates do not change.

Since the full likelihood of each observation sequence is based on the summation of all possible state sequences, each observation vector $\bm{o}_t$ contributes to the computation of the maximum likelihood parameter values for each state $j$. In other words, instead of assigning each observation vector to a specific state as in the above approximation, each observation is assigned to every state in proportion to the probability of the model being in that state when the vector was observed.
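As an informal illustration of the first step only (this is not the actual \htool{HInit} implementation, and the names are hypothetical), the uniform division of training vectors amongst the emitting states followed by the simple averages of equations~\ref{e:11} and \ref{e:12} might be sketched as:

```python
import numpy as np

def uniform_init(obs, num_states):
    """Divide the T observation vectors equally amongst the states and
    compute the simple mean/covariance averages (equations 11 and 12).
    obs: (T, n) array of observation vectors."""
    means, covs = [], []
    for seg in np.array_split(obs, num_states):   # rough equal division
        mu = seg.mean(axis=0)
        d = seg - mu
        means.append(mu)
        covs.append(d.T @ d / len(seg))           # ML (biased) covariance
    return means, covs
```

\htool{HInit} then refines these values by repeated Viterbi re-segmentation, as described above.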
Thus, if $L_j(t)$ denotes the probability of being in state $j$ at time $t$ then the equations \ref{e:11} and \ref{e:12} given above become the following weighted averages
\begin{equation} \label{e:13}
  \hat{\bm{\mu}}_j = \frac{ \sum_{t=1}^{T} L_j(t) \bm{o}_t}{\sum_{t=1}^{T} L_j(t)}
\end{equation}
and
\begin{equation} \label{e:14}
  \hat{\bm{\Sigma}}_j = \frac{ \sum_{t=1}^{T} L_j(t) (\bm{o}_t - \bm{\mu}_j) (\bm{o}_t - \bm{\mu}_j)' }{\sum_{t=1}^{T} L_j(t)}
\end{equation}
where the summations in the denominators are included to give the required normalisation.

Equations \ref{e:13} and \ref{e:14} are the Baum-Welch re-estimation\index{Baum-Welch re-estimation} formulae for the means and covariances of a HMM. A similar but slightly more complex formula can be derived for the transition probabilities (see chapter~\ref{c:Training}).

Of course, to apply equations \ref{e:13} and \ref{e:14}, the probability of state occupation $L_j(t)$ must be calculated. This is done efficiently using the so-called {\it Forward-Backward}\index{Forward-Backward algorithm} algorithm. Let the forward probability\footnote{Since the output distributions are densities, these are not really probabilities but it is a convenient fiction.}\index{forward probability} $\alpha_j(t)$ for some model $M$ with $N$ states be defined as
\begin{equation} \label{e:15}
  \alpha_j(t) = P(\bm{o}_1,\ldots,\bm{o}_t, x(t)=j | M).
\end{equation}
That is, $\alpha_j(t)$ is the joint probability of observing the first $t$ speech vectors and being in state $j$ at time $t$. This forward probability can be efficiently calculated by the following recursion
\begin{equation} \label{e:16}
  \alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1) a_{ij} \right] b_j(\bm{o}_t).
\end{equation}
This recursion depends on the fact that the probability of being in state $j$ at time $t$ and seeing observation $\bm{o}_t$ can be deduced by summing the forward probabilities for all possible predecessor states $i$ weighted by the transition probability $a_{ij}$.
The slightly odd limits are caused by the fact that states $1$ and $N$ are non-emitting\footnote{To understand equations involving a non-emitting state at time $t$, the time should be thought of as being $t-\delta t$ if it is an entry state, and $t+\delta t$ if it is an exit state. This becomes important when HMMs are connected together in sequence so that transitions across non-emitting states take place \textit{between frames}.}. The initial conditions for the above recursion are
\begin{equation}
  \alpha_1(1) = 1
\end{equation}
\begin{equation}
  \alpha_j(1) = a_{1j} b_j(\bm{o}_1)
\end{equation}
for $1<j<N$ and the final condition is given by
\begin{equation}
  \alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T) a_{iN}.
\end{equation}
Notice here that from the definition of $\alpha_j(t)$,
\begin{equation}
  P(\bm{O}|M) = \alpha_N(T).
\end{equation}
Hence, the calculation of the forward probability also yields the total likelihood\index{total likelihood} $P(\bm{O}|M)$.

The backward probability\index{backward probability} $\beta_j(t)$ is defined as
\begin{equation} \label{e:21}
  \beta_j(t) = P(\bm{o}_{t+1},\ldots,\bm{o}_T | x(t)=j , M).
\end{equation}
As in the forward case, this backward probability can be computed efficiently using the following recursion
\begin{equation}
  \beta_i(t) = \sum_{j=2}^{N-1} a_{ij} b_j(\bm{o}_{t+1}) \beta_j(t+1)
\end{equation}
with initial condition given by
\begin{equation}
  \beta_i(T) = a_{iN}
\end{equation}
for $1<i<N$ and final condition given by
\begin{equation}
  \beta_1(1) = \sum_{j=2}^{N-1} a_{1j} b_j(\bm{o}_1) \beta_j(1).
\end{equation}
Notice that in the definitions above, the forward probability is a joint probability whereas the backward probability is a conditional probability. This somewhat asymmetric definition is deliberate since it allows the probability of state occupation to be determined by taking the product of the two probabilities.
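The two recursions translate almost line for line into code. The following sketch (hypothetical names; plain probability arithmetic rather than the log arithmetic \HTK\ actually uses) computes $\alpha$, $\beta$ and $P(\bm{O}|M)$ for an $N$-state model with non-emitting entry and exit states, using 0-based indices so that state $1$ of the text is row 0:

```python
import numpy as np

def forward_backward(a, B):
    """alpha/beta recursions for a left-to-right HMM with non-emitting
    entry (0) and exit (N-1) states.
    a: (N, N) transition matrix; B: (T, N) values b_j(o_t),
    columns 0 and N-1 unused."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    em = slice(1, N - 1)                          # emitting states 2..N-1
    alpha[0, em] = a[0, em] * B[0, em]            # alpha_j(1) = a_1j b_j(o_1)
    for t in range(1, T):
        alpha[t, em] = (alpha[t - 1, em] @ a[em, em]) * B[t, em]
    likelihood = alpha[T - 1, em] @ a[em, N - 1]  # P(O|M) = alpha_N(T)
    beta[T - 1, em] = a[em, N - 1]                # beta_i(T) = a_iN
    for t in range(T - 2, -1, -1):
        beta[t, em] = a[em, em] @ (B[t + 1, em] * beta[t + 1, em])
    return alpha, beta, likelihood
```

A useful sanity check is that $\sum_j \alpha_j(t)\beta_j(t) = P(\bm{O}|M)$ at every frame $t$.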
From the definitions,
\begin{equation} \label{e:25}
  \alpha_j(t) \beta_j(t) = P(\bm{O},x(t)=j | M).
\end{equation}
Hence,
\begin{eqnarray}
  L_j(t) & = & P(x(t)=j|\bm{O},M) \\ \nonumber
  & = & \frac{P(\bm{O},x(t)=j | M)}{P(\bm{O}|M)} \\ \nonumber
  & = & \frac{1}{P} \alpha_j(t) \beta_j(t)
\end{eqnarray}
where $P=P(\bm{O}|M)$.

All of the information needed to perform HMM parameter re-estimation using the Baum-Welch algorithm\index{Baum-Welch algorithm} is now in place. The steps in this algorithm may be summarised as follows
\begin{enumerate}
\item For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator and denominator summations of the form illustrated by equations~\ref{e:13} and \ref{e:14}. These storage locations are referred to as \index{accumulators}{\it accumulators}\footnote{Note that normally the summations in the denominators of the re-estimation formulae are identical across the parameter sets of a given state and therefore only a single common storage location for the denominators is required and it need only be calculated once. However, \HTK\ supports a generalised parameter tying mechanism which can result in the denominator summations being different.
Hence, in \HTK, the denominator summations are always stored and calculated individually for each distinct parameter vector or matrix.}.
\item Calculate the forward and backward probabilities for all states $j$ and times $t$.
\item For each state $j$ and time $t$, use the probability $L_j(t)$ and the current observation vector $\bm{o}_t$ to update the accumulators for that state.
\item Use the final accumulator values to calculate new parameter values.
\item If the value of $P=P(\bm{O}|M)$ for this iteration is not higher than the value at the previous iteration then stop, otherwise repeat the above steps using the new re-estimated parameter values.
\end{enumerate}

All of the above assumes that the parameters for a HMM are re-estimated from a single observation sequence, that is a single example of the spoken word. In practice, many examples are needed to get good parameter estimates. However, the use of multiple observation sequences adds no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct training sequence.

One final point that should be mentioned is that the computation of the forward and backward probabilities involves taking the product of a large number of probabilities. In practice, this means that the actual numbers involved become very small. Hence, to avoid numerical problems, the forward-backward computation is performed in \HTK\ using log arithmetic\index{log arithmetic}.

The \HTK\ program which implements the above algorithm is called \htool{HRest}\index{hrest@\htool{HRest}}.
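Steps 3 and 4 of the loop above, specialised to the single-Gaussian, single-sequence case and using plain rather than log arithmetic, might be sketched as follows (illustrative Python with hypothetical names, not the \htool{HRest} implementation; $\alpha$, $\beta$ and $P=P(\bm{O}|M)$ are assumed given):

```python
import numpy as np

def reestimate(obs, alpha, beta, P):
    """One Baum-Welch update of the means and covariances
    (equations 13 and 14), given alpha, beta (T, N) and P = P(O|M).
    obs: (T, n) observation vectors."""
    L = alpha * beta / P                     # occupancies L_j(t), equation 25
    means, covs = [], []
    for j in range(L.shape[1]):
        denom = L[:, j].sum()                # the accumulator denominator
        if denom == 0.0:                     # non-emitting state: no update
            means.append(None)
            covs.append(None)
            continue
        mu = (L[:, j, None] * obs).sum(axis=0) / denom
        d = obs - mu
        outer = d[:, :, None] * d[:, None, :]
        covs.append((L[:, j, None, None] * outer).sum(axis=0) / denom)
        means.append(mu)
    return means, covs
```

Here the covariance update uses the newly estimated mean; accumulating numerators and denominators separately, as step 1 describes, is what allows the same code to pool statistics over multiple training sequences.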
In combination with the tool \htool{HInit}\index{hinit@\htool{HInit}} for estimating initial values mentioned earlier, \htool{HRest} allows isolated word HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.

\mysect{Recognition and Viterbi Decoding}{recandvit}

The previous section has described the basic ideas underlying HMM parameter re-estimation using the Baum-Welch algorithm. In passing, it was noted that the efficient recursive algorithm for computing the forward probability also yielded as a by-product the total likelihood\index{total likelihood} $P(\bm{O}|M)$. Thus, this algorithm could also be used to find the model which yields the maximum value of $P(\bm{O}|M_i)$, and hence, it could be used for recognition.

In practice, however, it is preferable to base recognition on the maximum likelihood state sequence since this generalises easily to the continuous speech case whereas the use of the total probability does not. This likelihood is computed using essentially the same algorithm as the forward probability calculation except that the summation is replaced by a maximum operation. For a given model $M$,