% labman2.tex
likelihood of an observation sequence $X=\{x_1,x_2,\cdots,x_T\}$ with
respect to a Hidden Markov Model with parameters $\Theta$ expands as
follows\,:
\[ p(X|\Theta) = \sum_{\mbox{\scriptsize every possible } Q} p(X,Q|\Theta) \]
i.e. it is the sum of the joint likelihoods of the sequence over all
possible state sequences allowed by the model.
%
\item[-] the {\em forward recursion}\,: in practice, the enumeration of
every possible state sequence is infeasible. Nevertheless, $p(X|\Theta)$
can be computed in a recursive way by the {\em forward recursion}. This
algorithm defines a forward variable $\alpha_t(i)$ corresponding to\,:
\[ \alpha_t(i) = p(x_1,x_2,\cdots x_t,q^t=q_i|\Theta) \]
i.e. $\alpha_t(i)$ is the probability of having observed the partial
sequence $\{x_1,x_2,\cdots,x_t\}$ {\em and} being in state $i$ at time
$t$ (event denoted $q_i^t$ in the course), given the parameters
$\Theta$. For a HMM with $N$ states (here $N=5$), where states $1$ and $N$
are the non-emitting initial and final states and states $2 \cdots N-1$ are
emitting, $\alpha_t(i)$ can be computed recursively as follows\,:

\pagebreak

\noindent \fbox{\bf The Forward Recursion}
\begin{enumerate}
\item {\bf Initialization}
%
 \[ \alpha_1(i) = a_{1i} \cdot b_i(x_1), \;\;\;\; 2 \leq i \leq N-1 \]
%
where $a_{1i}$ are the transitions from the initial non-emitting state to
the emitting states with pdfs $b_{i,\,i = 2 \cdots N-1}(x)$. Note that
$b_1(x)$ and $b_N(x)$ do not exist since they correspond to the
non-emitting initial and final states.

\item {\bf Recursion}
 \[ \alpha_{t+1}(j) = \left[ \sum_{i=2}^{N-1} \alpha_{t}(i) \cdot a_{ij}
    \right] b_j(x_{t+1}), \;\;\;\;
    \begin{array}{l} 1 \leq t \leq T-1 \\ 2 \leq j \leq N-1 \end{array} \]

\item {\bf Termination}
 \[ p(X|\Theta) = \left[ \sum_{i=2}^{N-1} \alpha_{T}(i) \cdot a_{iN} \right] \]
%
i.e. at the end of the observation sequence, sum the probabilities of the
paths converging to the final state (state number $N$).
\end{enumerate}

(For more detail about the forward procedure, refer to~\cite{RAB93},
chap.~6.4.1.)

This procedure raises a very important implementation issue. As a matter of
fact, the computation of the $\alpha_t$ vector consists in products of a
large number of values that are less than 1 (in general, {\em
significantly} less than 1). Hence, after a few observations ($t \approx$
10), the values of $\alpha_t$ tend exponentially towards 0 and the floating
point arithmetic precision is exceeded, even in the case of double
precision arithmetic, as the short example below illustrates. Two solutions
exist for this problem. One consists in scaling the values and undoing the
scaling at the end of the procedure\,: see \cite{RAB93} for more
explanations. The other solution consists in using log-likelihoods and
log-probabilities, and computing $\log p(X|\Theta)$ instead of
$p(X|\Theta)$.
\end{itemize}
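As a quick numerical illustration (the values below are arbitrary but of
the order of typical emission likelihoods), the product of a few hundred
small likelihoods underflows double precision, whereas the equivalent sum
of logs remains perfectly representable\,:
\begin{verbatim}
>> p = 1e-3 * ones(1,200);  % 200 likelihood values of the order of 1e-3
>> prod(p)                  % returns 0 : 1e-600 underflows double precision (~1e-308)
>> sum(log(p))              % returns -1.3816e+03 : no problem in the log domain
\end{verbatim}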
\subsubsection*{Questions\,:}
\begin{enumerate}
\item The following formula can be used to compute the log of a sum given
the logs of the sum's arguments\,:
\[ \log(a+b) = f(\log a,\log b) = \log a + \log \left( 1 + e^{(\log b - \log a)} \right) \]
%
Demonstrate its validity.

Naturally, one has the choice between using $\log(a+b) = \log a + \log
\left( 1 + e^{(\log b - \log a)} \right)$ or $\log(a+b) = \log b + \log
\left( 1 + e^{(\log a - \log b)} \right)$, which are equivalent in theory.
If $\log a > \log b$, which version leads to the most precise
implementation ?

\item Express the log version of the forward recursion. (Don't fully
develop the log of the sum in the recursion step, just call it
``logsum''\,: $\sum_{i=1}^{N} x_i \stackrel{\log}{\longmapsto}
\mbox{logsum}_{i=1}^{N} ( \log x_i )$.)

In addition to the arithmetic precision issues, what are the other
computational advantages of the log version ?
\end{enumerate}

\subsubsection*{Answers\,:}
\expl{%
\begin{enumerate}
\item Demonstration\,:
\vspace{-1em}
\[ a = e^{\log a} \;\;\;\;\;;\;\;\;\;\; b = e^{\log b} \]
\vspace{-2em}
\begin{eqnarray}
a+b & = & e^{\log a} + e^{\log b} \nonumber \\
    & = & e^{\log a} \left( 1 + e^{(\log b - \log a)} \right) \nonumber
\end{eqnarray}
\vspace{-1.5em}
\[ \log(a+b) = \log a + \log \left( 1 + e^{(\log b - \log a)} \right)
   \;\;\; \mbox{QED.} \]
The computation of the exponential overflows double precision arithmetic
for large arguments (above $\approx 700$), which cannot happen for small
arguments. Similarly, implementations of the exponential operation are
generally more precise for small arguments than for large ones (since an
error on the input term is exponentially amplified). Hence, if $\log a >
\log b$, the first version ($\log(a+b) = \log a + \log \left( 1 + e^{(\log
b - \log a)} \right)$) is more precise, since in this case $(\log b - \log
a)$ is small. If $\log a < \log b$, it is better to swap the terms (i.e. to
use the second version).
%
\item
\begin{enumerate}
\item {\bf Initialization}
%
 \[ \alpha_1^{(log)}(i) = \log a_{1i} + \log b_i(x_1), \;\;\;\;
    2 \leq i \leq N-1 \]
%
\item {\bf Recursion}
 \[ \alpha_{t+1}^{(log)}(j) = \left[ \mbox{logsum}_{i=2}^{N-1} \left(
    \alpha_{t}^{(log)}(i) + \log a_{ij} \right) \right] + \log b_j(x_{t+1}),
    \;\;\;\;
    \begin{array}{l} 1 \leq t \leq T-1 \\ 2 \leq j \leq N-1 \end{array} \]
\item {\bf Termination}
 \[ \log p(X|\Theta) = \left[ \mbox{logsum}_{i=2}^{N-1} \left(
    \alpha_{T}^{(log)}(i) + \log a_{iN} \right) \right] \]
%
\end{enumerate}
In addition to the precision issues, this version transforms the products
into sums, which is computationally more efficient. Furthermore, if the
emission probabilities are Gaussians, the computation of the
log-likelihoods $\log(b_j(x_t))$ eliminates the computation of the
Gaussians' exponential (see lab session 4).
\end{enumerate}
These two points just show you that once the theoretical barrier is crossed
in the study of a particular statistical model, the importance of the
implementation issues must not be neglected.
}
\vspace{-1em}

%%%%%%%%%
\subsection{Bayesian classification}\label{sub:bayesian}
%%%%%%%%%%
\vspace{-0.5em}
\subsubsection*{Question\,:}
\vspace{-0.5em}
The forward recursion allows us to compute the likelihood of an observation
sequence with respect to a HMM. Hence, given a sequence of features, we are
able to find the most likely generative model in a Maximum Likelihood
sense. What additional quantities and assumptions do we need to perform a
true Bayesian classification rather than a Maximum Likelihood
classification of the sequences ?

Which additional condition makes the result of Bayesian classification
equivalent to the result of ML classification ?
\vspace{-0.5em}
\subsubsection*{Answer\,:}
\vspace{-0.5em}
\expl{%
To perform a Bayesian classification, we need the prior probabilities
$P(\Theta_i|\Theta)$ of each model. In addition, we can assume that all the
observation sequences are equi-probable\,:
\begin{eqnarray}
P(\Theta_i|X,\Theta) & = & \frac{p(X|\Theta_i,\Theta)
    P(\Theta_i|\Theta)}{P(X|\Theta)} \nonumber \\
 & \propto & p(X|\Theta_i) P(\Theta_i) \nonumber
\end{eqnarray}
$P(\Theta_i)$ can be estimated by counting the frequency of occurrence of
each model (word or phoneme) in a database covering the vocabulary to
recognize (see lab session 4).
\tab If every model has the same prior probability, then Bayesian
classification becomes equivalent to ML classification.
}
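The experiments of section~\ref{sub:class} use a provided function
\com{logfwd} that implements the log-forward recursion derived above.
Purely as an illustration, a minimal MATLAB sketch of such a function is
given below. The Gaussian emission densities with diagonal covariance and
the structure field names (\com{trans}, \com{means}, \com{vars}) are
hypothetical choices made for the sake of the example; they do not
necessarily correspond to the HMM structures provided in \com{data.mat}.
\begin{verbatim}
function logp = logfwd_sketch(X, hmm)
% LOGFWD_SKETCH  Log-forward recursion for an HMM with non-emitting
% entry/exit states (illustrative sketch only).
%   X   : T x d matrix of observations (one frame per row)
%   hmm : structure with (hypothetical) fields
%         hmm.trans : N x N transition matrix (states 1 and N non-emitting)
%         hmm.means : (N-2) x d Gaussian means of the emitting states
%         hmm.vars  : (N-2) x d diagonal variances of the emitting states

[T, d] = size(X);
N      = size(hmm.trans, 1);
logA   = log(hmm.trans);

% Log-likelihood of each frame for each emitting state (diagonal Gaussians)
logB = zeros(T, N-2);
for j = 1:N-2
  m = hmm.means(j,:); v = hmm.vars(j,:);
  logB(:,j) = -0.5 * ( sum(log(2*pi*v)) ...
              + sum(((X - repmat(m,T,1)).^2) ./ repmat(v,T,1), 2) );
end

% Initialization : alpha_1(i) = log a_{1i} + log b_i(x_1)
logalpha = logA(1, 2:N-1) + logB(1,:);

% Recursion : logsum over the predecessor (emitting) states
for t = 1:T-1
  newalpha = zeros(1, N-2);
  for j = 1:N-2
    terms       = logalpha + logA(2:N-1, j+1)';
    newalpha(j) = logsumexp(terms) + logB(t+1, j);
  end
  logalpha = newalpha;
end

% Termination : paths converging to the final non-emitting state N
logp = logsumexp(logalpha + logA(2:N-1, N)');

function s = logsumexp(v)
% Numerically stable log(sum(exp(v))), cf. question 1 above
m = max(v);
s = m + log(sum(exp(v - m)));
\end{verbatim}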
\pagebreak
%%%%%%%%%
\subsection{Maximum Likelihood classification}\label{sub:class}
%%%%%%%%%
In practice, for speech recognition, it is very often assumed that all the
model priors are equal (i.e. that the words or phonemes to recognize have
equal probabilities of occurring in the observed speech). Hence, the speech
recognition task consists mostly in performing the Maximum Likelihood
classification of acoustic feature sequences. For that purpose, we must
have a set of HMMs that model the acoustic sequences corresponding to a
set of phonemes or a set of words. These models can be considered as
``stochastic templates''. Then, we associate a new sequence with the most
likely generative model. This part is called the {\em decoding} of the
acoustic feature sequences.

\subsubsection*{Experiment\,:}
Classify the sequences $X_1, X_2, \cdots, X_6$, given in the file
\com{data.mat}, in a maximum likelihood sense with respect to the six
Markov models given in table~\ref{tab:models}. Use the function
\com{logfwd} to compute the log-forward recursion expressed in the previous
section. Store the results in a matrix (they will be used in the next
section) and note them in the table below.

\subsubsection*{Example\,:}
\mat{plot(X1(:,1),X1(:,2));}
\mat{logProb(1,1) = logfwd(X1,hmm1)}
\mat{logProb(1,2) = logfwd(X1,hmm2)}
etc. \\
\mat{logProb(3,2) = logfwd(X3,hmm2)}
etc.

\medskip

\noindent Filling the \com{logProb} matrix can be done automatically with
the help of loops\,:
\begin{verbatim}
>> for i=1:6,
     for j=1:6,
       stri = num2str(i);
       strj = num2str(j);
       eval([ 'logProb(' , stri , ',' , strj , ')=logfwd(X' , stri , ',hmm' , strj , ');' ]);
     end;
   end;
>> logProb
\end{verbatim}

\noindent
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
 Sequence &
\small $\log p(X|\Theta_1)$ & \small $\log p(X|\Theta_2)$ &
\small $\log p(X|\Theta_3)$ & \small $\log p(X|\Theta_4)$ &
\small $\log p(X|\Theta_5)$ & \small $\log p(X|\Theta_6)$ &
\parbox[c][3em][c]{11ex}{Most likely\\ model} \\ \hline
X1 & & & & & & & \\ \hline
X2 & & & & & & & \\ \hline
X3 & & & & & & & \\ \hline
X4 & & & & & & & \\ \hline
X5 & & & & & & & \\ \hline
X6 & & & & & & & \\ \hline
\end{tabular}
\end{center}

\subsubsection*{Answer\,:}
\expl{$X_1 \rightarrow HMM1$, $X_2 \rightarrow HMM3$, $X_3 \rightarrow
HMM5$, $X_4 \rightarrow HMM4$, $X_5 \rightarrow HMM6$, $X_6 \rightarrow
HMM2$.}
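Once the \com{logProb} matrix is filled, the last column of the table (the
index of the most likely model for each sequence) can also be obtained
directly in MATLAB, since \com{max} returns the position of the row-wise
maximum as its second output\,:
\begin{verbatim}
>> % bestModel(i) is the index of the most likely HMM for sequence Xi
>> [bestLogProb, bestModel] = max(logProb, [], 2)
\end{verbatim}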
\pagebreak
%%%%%%%%%%%%%%%%%%
\section{Optimal state sequence}\label{sec:optimal}
%%%%%%%%%%%%%%%%%%
\subsubsection*{Useful formulas and definitions\,:}
In speech recognition and several other pattern recognition applications,
it is useful to associate an ``optimal'' sequence of states with a sequence
of observations, given the parameters of a model. For instance, in the case
of speech recognition, knowing which frames of features ``belong'' to which
state allows us to locate the word boundaries across time. This is called
the {\em alignment} of acoustic feature sequences.

A ``reasonable'' optimality criterion consists in choosing the state
sequence (or {\em path}) that yields the maximum likelihood with respect to
a given model. This sequence can be determined recursively via the {\em
Viterbi algorithm}. This algorithm makes use of two variables\,:
\begin{itemize}
\item the {\em highest} likelihood $\delta_t(i)$ along a {\em single} path
among all the paths ending in state $i$ at time $t$\,:
\[
\delta_t(i) = \max_{q_1,q_2,\cdots,q_{t-1}}
p(q_1,q_2,\cdots,q_{t-1},q^t=q_i,x_1,x_2,\cdots x_t|\Theta)
\]
\item a variable $\psi_t(i)$ which allows us to keep track of the ``best
path'' ending in state $i$ at time $t$\,:
\[
\psi_t(i) = \mbox{\hspace{2ex}arg}\hspace{-1ex}\max_{\hspace{-4.5ex}q_1,q_2,\cdots,q_{t-1}}
p(q_1,q_2,\cdots,q_{t-1},q^t=q_i,x_1,x_2,\cdots x_t|\Theta)
\]
\end{itemize}
Note that these variables are vectors of $(N-2)$ elements, $(N-2)$ being
the number of emitting states. With the help of these variables, the
algorithm takes the following steps\,: \\[1em]

\noindent \fbox{\bf The Viterbi Algorithm}
\begin{enumerate}
\item {\bf Initialization}
%
 \begin{eqnarray}
 \delta_1(i) & = & a_{1i} \cdot b_i(x_1), \;\;\;\; 2 \leq i \leq N-1 \nonumber \\
 \psi_1(i) & = & 0 \nonumber
 \end{eqnarray}
%
where, again, $a_{1i}$ are the transitions from the initial non-emitting
state to the emitting states with pdfs $b_{i,\,i = 2 \cdots N-1}(x)$, and
where $b_1(x)$ and $b_N(x)$ do not exist since they correspond to the
non-emitting initial and final states.
%
\item {\bf Recursion}
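In the log domain, the recursion and termination steps are analogous to
those of the forward recursion, with the logsum replaced by a maximization
and with the index of the winning predecessor recorded in $\psi$ for the
final backtracking. Purely as an illustration, and with the same
hypothetical \com{hmm} structure and Gaussian emission assumptions as in
the log-forward sketch above, a complete log-domain Viterbi pass could be
coded as follows\,:
\begin{verbatim}
function [logp, path] = logvit_sketch(X, hmm)
% LOGVIT_SKETCH  Log-domain Viterbi decoding (illustrative sketch only).
% Same hypothetical hmm structure as in the log-forward sketch.

[T, d] = size(X);
N      = size(hmm.trans, 1);
logA   = log(hmm.trans);

% Log-likelihood of each frame for each emitting state (diagonal Gaussians)
logB = zeros(T, N-2);
for j = 1:N-2
  m = hmm.means(j,:); v = hmm.vars(j,:);
  logB(:,j) = -0.5 * ( sum(log(2*pi*v)) ...
              + sum(((X - repmat(m,T,1)).^2) ./ repmat(v,T,1), 2) );
end

delta = zeros(T, N-2);   % delta(t,j) : best log-likelihood ending in state j at time t
psi   = zeros(T, N-2);   % psi(t,j)   : best predecessor of state j at time t
% (emitting state j here corresponds to state j+1 in the notation above)

% Initialization
delta(1,:) = logA(1, 2:N-1) + logB(1,:);

% Recursion : max over the predecessor states instead of logsum
for t = 2:T
  for j = 1:N-2
    [best, ipred] = max( delta(t-1,:) + logA(2:N-1, j+1)' );
    delta(t,j)    = best + logB(t,j);
    psi(t,j)      = ipred;
  end
end

% Termination : best path converging to the final non-emitting state N
[logp, q] = max( delta(T,:) + logA(2:N-1, N)' );

% Backtracking of the optimal state sequence
path    = zeros(1, T);
path(T) = q;
for t = T-1:-1:1
  path(t) = psi(t+1, path(t+1));
end
\end{verbatim}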