\prod_{i=1}^{N} p(x_i|\mu,\Sigma) = \prod_{i=1}^{N} g_{(\mu,\Sigma)}(x_i)\]
\end{itemize}

\subsubsection*{Experiment\,:}

Given the 4 Gaussian models\,:
\begin{center}
\begin{tabular}{ccc}
${\cal N}_1: \; \Theta_1 = \left(\left[\begin{array}{c}730 \\ 1090\end{array}\right],
\left[\begin{array}{cc}8000 & 0 \\ 0 & 8000\end{array}\right]\right)$ & \hspace{2cm} &
${\cal N}_2: \; \Theta_2 = \left(\left[\begin{array}{c}730 \\ 1090\end{array}\right],
\left[\begin{array}{cc}8000 & 0 \\ 0 & 18500\end{array}\right]\right)$ \\[2em]
${\cal N}_3: \; \Theta_3 = \left(\left[\begin{array}{c}730 \\ 1090\end{array}\right],
\left[\begin{array}{cc}8000 & 8400 \\ 8400 & 18500\end{array}\right]\right)$ & \hspace{2cm} &
${\cal N}_4: \; \Theta_4 = \left(\left[\begin{array}{c}270 \\ 1690\end{array}\right],
\left[\begin{array}{cc}8000 & 8400 \\ 8400 & 18500\end{array}\right]\right)$
\end{tabular}
\end{center}
\vspace{0.5em}
compute the following {log-likelihoods} for the whole sample $X_3$
(10000 points)\,:

\medskip
\centerline{$\log p(X_3|\Theta_1)$, $\log p(X_3|\Theta_2)$,
$\log p(X_3|\Theta_3)$ and $\log p(X_3|\Theta_4)$.}
\medskip

\noindent (First answer the following question and then look at the example
given on the next page.)

\vspace{-1ex}
\subsubsection*{Question\,:}

Why do we want to compute the {\em log-likelihood} rather than the simple
{\em likelihood} ?

\subsubsection*{Answer\,:}

\expl{Computing the log-likelihood turns the product into a sum\,:
\[
p(X|\Theta) = \prod_{i=1}^{N} p(x_i|\Theta)
\;\;\; \Leftrightarrow \;\;\;
\log p(X|\Theta) = \sum_{i=1}^{N} \log p(x_i|\Theta)
\]
In the Gaussian case, it also avoids the computation of the exponential\,:
\begin{eqnarray}
p(x|\Theta) & = & \frac{1}{\sqrt{2\pi}^d \sqrt{\det\left(\Sigma\right)}}\,
e^{-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)} \nonumber \\
\log p(x|\Theta) & = &
- \frac{d}{2} \log \left( 2\pi \right)
- \frac{1}{2} \log \left( \det\left(\Sigma\right) \right)
- \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \nonumber
\end{eqnarray}
Furthermore, since $\log(x)$ is a monotonically growing function, the
log-likelihoods have the same relations of order as the likelihoods\,:
\[
p(x|\Theta_1) > p(x|\Theta_2) \;\; \Leftrightarrow \;\;
\log p(x|\Theta_1) > \log p(x|\Theta_2)
\]
so they can be used directly to classify samples.
\tab In a classification framework, the computation can be simplified even
further\,: the relations of order remain valid if we drop the division by 2
and the $d \log (2\pi)$ term (we are allowed to drop these terms because
they are {\em independent of the classes}). If ${\cal N}_1(\Theta_1)$ and
${\cal N}_2(\Theta_2)$ have the same variance, we can also drop the
$\log (\det(\Sigma))$ term (since in this case the variance itself becomes
independent of the classes).
\tab In summary, log-likelihoods are simpler to compute and are readily
usable for classification tasks.}

\pagebreak

\subsubsection*{Example\,:}

\mat{N = size(X3,1)}
\mat{mu\_1 = [730 1090]; sigma\_1 = [8000 0; 0 8000];}
\mat{logLike1 = 0;}
\mat{for i = 1:N;}
\com{logLike1 = logLike1 + (X3(i,:) - mu\_1)*inv(sigma\_1)*(X3(i,:) - mu\_1)';} \\
\com{end;} \\
\mat{logLike1 = - 0.5 * ( logLike1 + N*log(det(sigma\_1)) + 2*N*log(2*pi) )}
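\noindent As an optional cross-check, the same value can be obtained without
the explicit loop. The following vectorized sketch assumes that \com{X3},
\com{N}, \com{mu\_1} and \com{sigma\_1} are defined as above (the names
\com{D}, \com{q} and \com{logLike1\_vec} are only illustrative); it should
return the same number as \com{logLike1}\,: \\
\mat{D = X3 - repmat(mu\_1,N,1); \% center each sample on the mean (N x 2 matrix)}
\mat{q = sum((D * inv(sigma\_1)) .* D, 2); \% squared Mahalanobis term of each sample}
\mat{logLike1\_vec = - 0.5 * ( sum(q) + N*log(det(sigma\_1)) + 2*N*log(2*pi) )}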
\noindent Note\,: if you don't understand why the different models generate
a different log-likelihood value for the data, use the function
\com{gausview} to compare the relative positions of the models
${\cal N}_1$, ${\cal N}_2$, ${\cal N}_3$ and ${\cal N}_4$ with respect to
the data set $X_3$, e.g.\,: \\
%\mat{mu\_1 = [730 1090]; sigma\_1 = [8000 0; 0 8000];}
\mat{gausview(X3,mu\_1,sigma\_1,'Comparison of X3 and N1');}

\vspace{-\baselineskip}
\subsubsection*{Question\,:}

Of ${\cal N}_1$, ${\cal N}_2$, ${\cal N}_3$ and ${\cal N}_4$, which model
``explains'' the data $X_3$ best ?  Which model has the highest number of
parameters ?  Which model would you choose for a good compromise between
the number of parameters and the capacity to represent the data
accurately~?

\subsubsection*{Answer\,:}

\expl{The model ${\cal N}_3$ produces the highest likelihood for the data
set $X_3$. So we can say that the data $X_3$ is more likely to have been
generated by model ${\cal N}_3$ than by the other models, or that
${\cal N}_3$ explains the data best.
\tab On the other hand, model ${\cal N}_3$ has the highest number of
parameters (2 terms for the mean, 4 non-null terms for the variance). This
may seem low in two dimensions, but the number of parameters grows
quadratically with the dimension of the data, and the amount of data needed
to estimate them reliably grows even faster (this phenomenon is related to
the {\em curse of dimensionality}). Also, the more parameters you have, the
more data you need to estimate (or {\em train}) them.
\tab In ``real world'' speech recognition applications, the dimensionality
of the speech features is typically of the order of 40 (1 energy
coefficient + 12 cepstrum coefficients + their first and second order
derivatives = a vector of 39 coefficients). Further processing is applied
to {\em orthogonalize} the data (e.g., cepstral coefficients can be
interpreted as orthogonalized spectra, and hence admit quasi-diagonal
covariance matrices). Therefore, the compromise usually considered is to
use models with diagonal covariance matrices, such as the model
${\cal N}_2$ in our example.}

%%%%%%%%%%%%%%%%%%
\section{Statistical pattern recognition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{A-priori class probabilities}\label{sub:apriori}
%%%%%%%%%

\subsubsection*{Experiment\,:}

Load data from the file ``vowels.mat''. This file contains a database of
simulated 2-dimensional speech features in the form of {\em artificial}
pairs of formant values (the first and the second spectral formants,
$[F_1,F_2]$). These artificial values represent the features that would be
extracted from several occurrences of the vowels /a/, /e/, /i/, /o/ and
/y/\footnote{/y/ is the phonetic symbol for ``u'' as in the French word
``{\it tutu}''.}. They are grouped in matrices of size $N \times 2$, where
each of the $N$ lines is a training example and $2$ is the dimension of the
features (in our case, formant frequency pairs).

Supposing that the whole database covers adequately an imaginary language
made only of /a/'s, /e/'s, /i/'s, /o/'s and /y/'s, compute the probability
$P(q_k)$ of each class $q_k$, $k \in \{/a/,/e/,/i/,/o/,/y/\}$. What are the
most common and the least common phonemes in the language~?

\subsubsection*{Example\,:}

\mat{clear all; load vowels.mat; whos}
\mat{Na = size(a,1); Ne = size(e,1); Ni = size(i,1); No = size(o,1); Ny = size(y,1);}
\mat{N = Na + Ne + Ni + No + Ny;}
\mat{Pa = Na/N}
\mat{Pi = Ni/N}
etc.
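\noindent The remaining priors follow the same pattern; a possible
completion of the ``etc.'' above (the sanity check on the last line is
ours, not part of the original instructions)\,: \\
\mat{Pe = Ne/N}
\mat{Po = No/N}
\mat{Py = Ny/N}
\mat{Pa + Pe + Pi + Po + Py \% sanity check: the five priors should sum to 1}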
\subsubsection*{Answer\,:}

\expl{The probability of using /a/ in this imaginary language is 0.25. It
is 0.3 for /e/, 0.25 for /i/, 0.15 for /o/ and 0.05 for /y/. The most
common phoneme is therefore /e/, while the least common is /y/.}

%%%%%%%%%
\subsection{Gaussian modeling of classes}\label{gaussmod}
%%%%%%%%%

\subsubsection*{Experiment\,:}

Plot each vowel's data as a cloud of points in the 2D plane. Train the
Gaussian models corresponding to each class (use directly the \com{mean}
and \com{cov} commands). Plot their contours (use directly the function
\com{plotgaus({\em mu},{\em sigma},{\em color})} where
\com{{\em color} = [R,G,B]}).

\subsubsection*{Example\,:}

\mat{plotvow; \% Plot the clouds of simulated vowel features}
(Do not close the obtained figure, it will be used later on.) \\
Then compute and plot the Gaussian models\,: \\
\mat{mu\_a = mean(a);}
\mat{sigma\_a = cov(a);}
\mat{plotgaus(mu\_a,sigma\_a,[0 1 1]);}
\mat{mu\_e = mean(e);}
\mat{sigma\_e = cov(e);}
\mat{plotgaus(mu\_e,sigma\_e,[0 1 1]);}
etc.
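\noindent If you prefer not to type the three commands for each of the five
vowels, the same training and plotting can be sketched as a loop. This is
only an illustration\,: the names \com{data}, \com{mu\_all} and
\com{sigma\_all} are ours, and the individually named variables
\com{mu\_a}, \com{sigma\_a}, \ldots\ used in the next sections still have
to be created as above\,: \\
\mat{data = \{a, e, i, o, y\}; \% gather the five vowel matrices in a cell array}
\mat{for k = 1:5;}
\mat{mu\_all\{k\} = mean(data\{k\}); \% mean vector of the k-th vowel}
\mat{sigma\_all\{k\} = cov(data\{k\}); \% covariance matrix of the k-th vowel}
\mat{plotgaus(mu\_all\{k\},sigma\_all\{k\},[0 1 1]);}
\mat{end;}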
\subsubsection*{Note your results below\,:}

\bigskip
\centerline{$\mu_{/a/} = $ \hfill $\Sigma_{/a/} =$ \hfill}
\vspace{1cm}
\centerline{$\mu_{/e/} = $ \hfill $\Sigma_{/e/} =$ \hfill}
\vspace{1cm}
\centerline{$\mu_{/i/} = $ \hfill $\Sigma_{/i/} =$ \hfill}
\vspace{1cm}
\centerline{$\mu_{/o/} = $ \hfill $\Sigma_{/o/} =$ \hfill}
\vspace{1cm}
\centerline{$\mu_{/y/} = $ \hfill $\Sigma_{/y/} =$ \hfill}
\vspace{1.5cm}

%%%%%%%%%
\subsection{Bayesian classification}
%%%%%%%%%

\subsubsection*{Useful formulas and definitions\,:}

\begin{itemize}

\item[-] {\em Bayes' decision rule}\,:
\[
X \in q_k \;\; \mbox{if} \;\; P(q_k|X,\Theta) \geq P(q_j|X,\Theta),
\; \forall j \neq k
\]
This formula means\,: given a set of classes $q_k$, characterized by a set
of known parameters $\Theta$, a set of one or more speech feature vectors
$X$ (also called {\em observations}) belongs to the class which has the
highest probability once we actually know (or ``see'', or ``measure'') the
sample $X$. $P(q_k|X,\Theta)$ is therefore called the {\em a posteriori
probability}, because it depends on having seen the observations, as
opposed to the {\em a priori} probability $P(q_k|\Theta)$ which does not
depend on any observation (but depends of course on knowing how to
characterize all the classes $q_k$, which means knowing the parameter set
$\Theta$).

\item[-] For some classification tasks (e.g. speech recognition), it is
more practical to resort to {\em Bayes' law}, which makes use of
{\em likelihoods}, rather than trying to estimate the posterior probability
directly. Bayes' law says\,:
\[
P(q_k|X,\Theta) = \frac{p(X|q_k,\Theta) P(q_k|\Theta)}{p(X|\Theta)}
\]
where $q_k$ is a class, $X$ is a sample containing one or more feature
vectors and $\Theta$ is the parameter set of all the class models.
%
\item[-] The speech features are usually considered equi-probable. Hence,
it is considered that $P(q_k|X,\Theta)$ is proportional to
$p(X|q_k,\Theta) P(q_k|\Theta)$ for all the classes\,:
\[
\forall k, \;\; P(q_k|X,\Theta) \propto p(X|q_k,\Theta) P(q_k|\Theta)
\]
%
\item[-] Once again, it is more convenient to do the computation in the
$\log$ domain\,:
\[
\log P(q_k|X,\Theta) \simeq \log p(X|q_k,\Theta) + \log P(q_k|\Theta)
\]

\end{itemize}

\subsubsection*{Question\,:}

\begin{enumerate}
\item In our case (Gaussian models for phoneme classes), what is the
meaning of the $\Theta$ given in the above formulas ?
\item What is the expression of $p(X|q_k,\Theta)$, and of
$\log p(X|q_k,\Theta)$ ?
\item What is the definition of the probability $P(q_k|\Theta)$ ?
\end{enumerate}

\subsubsection*{Answer\,:}

\expl{%
\begin{enumerate}
\item In our case, $\Theta$ represents the set of all the means $\mu_k$ and
variances $\Sigma_k$, $k \in \{/a/,/e/,/i/,/o/,/y/\}$.
\item The expressions of $p(X|q_k,\Theta)$ and of $\log p(X|q_k,\Theta)$
correspond to the computation of the Gaussian pdf and its logarithm,
already expressed in section~\ref{like}.
\item The probability $P(q_k|\Theta)$ is the a-priori class probability for
the class $q_k$ (corresponding to the parameters $\Theta_k \in \Theta$). It
defines an absolute probability of occurrence for the class $q_k$. The
a-priori class probabilities for our artificial phoneme classes have been
computed in section~\ref{sub:apriori}.
\end{enumerate}}

\subsubsection*{Question\,:}

By now, we have modeled each vowel class with a Gaussian pdf (by computing
means and variances), we know the probability $P(q_k)$ of each class in the
imaginary language, and we assume that the speech {\em features} (as
opposed to the speech {\em classes}) are equi-probable. What is the most
probable class $q_k$ for the speech feature points $x=(F_1,F_2)^T$ given in
the following table ? (Compute the posterior probabilities according to the
example given on the next page.)

\medskip
\noindent
\begin{center}
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline
 x & $F_1$ & $F_2$ &
\small $\log P(q_{/a/}|x)$ & \small $\log P(q_{/e/}|x)$ &
\small $\log P(q_{/i/}|x)$ & \small $\log P(q_{/o/}|x)$ &
\small $\log P(q_{/y/}|x)$ &
\parbox[c][3em][c]{11ex}{Most prob. \\ class} \\ \hline
1. & 400 & 1800 & & & & & & \\ \hline
2. & 400 & 1000 & & & & & & \\ \hline
3. & 530 & 1000 & & & & & & \\ \hline
4. & 600 & 1300 & & & & & & \\ \hline
5. & 670 & 1300 & & & & & & \\ \hline
6. & 420 & 2500 & & & & & & \\ \hline
\end{tabular}
\end{center}

\subsubsection*{Example\,:}

Use the function \com{gloglike({\em point},{\em mu},{\em sigma})} to
compute the likelihoods. Don't forget to add the log of the prior
probability ! E.g., for point 1. and class /a/\,: \\
\com{>> gloglike([400,1800],mu\_a,sigma\_a) + log(Pa)}
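\noindent To fill the whole table, the same computation can be repeated for
each point and each class. Below is a possible sketch (the names
\com{points}, \com{scores}, \com{dummy} and \com{best} are ours); it
assumes that the models \com{mu\_a}, \ldots, \com{sigma\_y} and the priors
\com{Pa}, \ldots, \com{Py} have been computed as in the previous
sections\,: \\
\mat{points = [400 1800; 400 1000; 530 1000; 600 1300; 670 1300; 420 2500]; \% the six test points}
\mat{scores = zeros(6,5); \% one row per point, one column per class (/a/ /e/ /i/ /o/ /y/)}
\mat{for n = 1:6;}
\mat{scores(n,1) = gloglike(points(n,:),mu\_a,sigma\_a) + log(Pa);}
\mat{scores(n,2) = gloglike(points(n,:),mu\_e,sigma\_e) + log(Pe);}
\mat{scores(n,3) = gloglike(points(n,:),mu\_i,sigma\_i) + log(Pi);}
\mat{scores(n,4) = gloglike(points(n,:),mu\_o,sigma\_o) + log(Po);}
\mat{scores(n,5) = gloglike(points(n,:),mu\_y,sigma\_y) + log(Py);}
\mat{end;}
\mat{[dummy, best] = max(scores,[],2) \% best(n) = index of the most probable class for point n}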
\subsubsection*{Answer\,:}

\expl{1. /e/ \,\, 2. /y/ \,\, 3. /o/ \,\, 4. /y/ \,\, 5. /a/ \,\, 6. /i/}

\bigskip

%%%%%%%%%
\subsection{Discriminant surfaces}\label{discr}
%%%%%%%%%

\subsubsection*{Useful formulas and definitions\,:}

\begin{itemize}

\item[-] {\em Discriminant function}\,: a set of functions $f_k(x)$ allows
us to classify a sample $x$ into $k$ classes $q_k$ if\,:
\[
x \in q_k \;\; \Leftrightarrow \;\; f_k(x,\Theta_k) \geq f_l(x,\Theta_l),
\; \forall l \neq k
\]
In this case, the $k$ functions $f_k(x)$ are called discriminant functions.

\end{itemize}

\subsubsection*{Question\,:}

What is the link between discriminant functions and Bayesian classifiers~?

\subsubsection*{Answer\,:}

\expl{The a-posteriori probability $P(q_k|x)$ that a sample $x$ belongs to
class $q_k$ is itself a discriminant function\,:
\begin{eqnarray}
x \in q_k & \Leftrightarrow & P(q_k|x) \geq P(q_l|x) \; \forall l \neq k \nonumber \\
 & \Leftrightarrow & p(x|q_k) P(q_k) \geq p(x|q_l) P(q_l) \nonumber \\
 & \Leftrightarrow & \log p(x|q_k) + \log P(q_k) \geq \log p(x|q_l) + \log P(q_l) \nonumber
\end{eqnarray}}

\subsubsection*{Experiment\,:}

%%%%%%%%%%%%
\begin{figure}
\centerline{\includegraphics[height=0.98\textheight]{iso.eps.gz}}
\caption{\label{iso}Iso-likelihood lines for the Gaussian pdfs ${\cal
