to see the resulting means.

\subsubsection*{Question\,:}
\begin{enumerate}
\item Does the final solution depend on the initialization of the
algorithm~?
\item Describe the evolution of the total squared Euclidean distance.
\item What is the nature of the discriminant surfaces corresponding to a
minimum Euclidean distance classification scheme~?
\item Is the algorithm suitable for fitting Gaussian clusters~?
\end{enumerate}

\subsubsection*{Answer\,:}
\expl{\begin{enumerate}
\item The final solution depends upon the initialization of the algorithm.
\item The total squared Euclidean distance decreases in a monotonic way.
\item The minimum-distance point attribution scheme is equivalent to linear
discriminant surfaces: the boundary between two centroids, defined by
$\|x-\mu_i\| = \|x-\mu_j\|$, is the hyperplane bisecting the segment that
joins them.
\item The final solution of the K-means algorithm is unable to lock onto
the original Gaussian phoneme classes.
\end{enumerate}}

%%%%%%%%%
\subsection{Viterbi-EM algorithm for Gaussian clustering}
%%%%%%%%%

\subsubsection*{Synopsis of the algorithm\,:}
\begin{itemize}
\item Start from $K$ initial Gaussian models ${\cal N}(\mu_{k},\Sigma_{k}),
\; k=1\cdots K$, characterized by the set of parameters $\Theta$ (i.e. the
set of all means and variances $\mu_k$ and $\Sigma_k$, $k=1\cdots K$). Set
the initial prior probabilities $P(q_k)$ to $1/K$.
\item {\bf Do}\,:
\begin{enumerate}
\item Classify each data point using Bayes' rule.
This step is equivalent to having a set $Q$ of boolean hidden variables
that give a labeling of the data by taking the value 1 (belongs) or 0 (does
not belong) for each class $q_k$ and each point $x_n$. The value of $Q$
that maximizes $p(X,Q|\Theta)$ precisely tells which is the most probable
model for each point of the whole set $X$ of training data.
Hence, each data point is assigned to its most probable cluster $q_k^{(old)}$.
\item Update the parameters\,:
\begin{itemize}
\item update the means\,:
\[ \mu_{k}^{(new)} = \mbox{mean of the points belonging to } q_k^{(old)} \]
\item update the variances\,:
\[ \Sigma_{k}^{(new)} = \mbox{variance of the points belonging to } q_k^{(old)} \]
\item update the priors\,:
\[ P(q_k^{(new)}|\Theta^{(new)}) = \frac{\mbox{number of training points belonging to } q_k^{(old)}}{\mbox{total number of training points}} \]
\end{itemize}
\item Go to 1.
\end{enumerate}
\item {\bf Until}\,: no further change occurs.
\end{itemize}

The global criterion defined in the present case is\,:
\begin{eqnarray}
{\cal L}(\Theta) & = & \log p(X,Q|\Theta) \nonumber \\
 & = & \sum_{k=1}^{K} \sum_{x_n \in q_k} \log p(x_n|\Theta_k) \nonumber
\end{eqnarray}
where $Q$ is the labeling found at the classification step. It represents
the joint likelihood of the data with respect to the models they belong to.
This criterion is locally maximized by the algorithm.
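To make the two steps of the loop concrete, here is a minimal MATLAB sketch
of a single Viterbi-EM iteration (hard classification followed by
re-estimation). It only illustrates the update rules given above, it is
{\em not} the code of the \com{VITERB} utility used below; \com{data},
\com{means}, \com{vars} and \com{priors} are placeholder variables (one
observation per row of \com{data}, one cell per cluster in \com{means} and
\com{vars}).
\begin{verbatim}
 % One Viterbi-EM iteration (illustrative sketch, not the VITERB code).
 N = size(data, 1);  K = length(means);

 % 1. Classification step: assign each point to its most probable class
 logp = zeros(N, K);
 for k = 1:K
   dx = data - repmat(means{k}, N, 1);
   logp(:,k) = log(priors(k)) - 0.5*log(det(vars{k})) ...
               - 0.5*sum((dx / vars{k}) .* dx, 2);   % up to a constant
 end
 [dummy, labels] = max(logp, [], 2);

 % 2. Update step: re-estimate each model on the points it has won
 for k = 1:K
   pts = data(labels == k, :);
   means{k}  = mean(pts, 1);
   vars{k}   = cov(pts);
   priors(k) = size(pts, 1) / N;
 end
\end{verbatim}
The hard 0/1 assignment computed in the first loop is what distinguishes
this Viterbi variant from the full EM algorithm of the next subsection,
where it is replaced by posterior probabilities.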
\subsubsection*{Experiment\,:}
Use the Viterbi-EM explorer utility\,:
\begin{verbatim}
 VITERB   Viterbi version of the EM algorithm

   Launch it with VITERB(DATA,NCLUST) where DATA is the matrix of
   observations (one observation per row) and NCLUST is the desired
   number of clusters. The clusters are initialized with a heuristic
   that spreads them randomly around mean(DATA) with standard
   deviation sqrtm(cov(DATA)). Their initial covariance is set to
   cov(DATA).

   If you want to set your own initial clusters, use
   VITERB(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
   containing respectively NCLUST initial mean vectors and NCLUST
   initial covariance matrices. In this case, the initial a-priori
   probabilities are set equal to 1/NCLUST. To set your own initial
   priors, use VITERB(DATA,MEANS,VARS,PRIORS) where PRIORS is a
   vector containing NCLUST a priori probabilities.

   Example: for two clusters
     means{1} = [1 2]; means{2} = [3 4];
     vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
     viterb(data,means,vars);
\end{verbatim}
Launch it with the dataset \com{allvow}. Do several runs with different
cases of initialization of the algorithm\,:
\begin{enumerate}
\item 5 initial clusters determined according to the default heuristic;
\item some initial \com{MEANS} values equal to some data points, and some
random \com{VARS} values (try for instance \com{cov(allvow)} for all the
classes);
\item the initial \com{MEANS}, \com{VARS} and \com{PRIORS} values found by
the K-means algorithm (a way of building them is sketched right after this
list);
\item some initial \com{MEANS} values equal to $\{\mu_{/a/}, \mu_{/e/},
\mu_{/i/}, \mu_{/o/}, \mu_{/y/}\}$, \com{VARS} values equal to \linebreak
$\{\Sigma_{/a/}, \Sigma_{/e/}, \Sigma_{/i/}, \Sigma_{/o/}, \Sigma_{/y/}\}$,
and \com{PRIORS} values equal to
$[P_{/a/},P_{/e/},P_{/i/},P_{/o/},P_{/y/}]$;
\item some initial \com{MEANS} and \com{VARS} values chosen by yourself.
\end{enumerate}
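For case 3 of the list above, the K-means centroids can be reused directly
as \com{MEANS}; since K-means only produces centroids and a hard partition
of the data, the corresponding \com{VARS} and \com{PRIORS} must be computed
from that partition, for instance as in the following sketch. The variables
\com{km\_means} (cell array of centroids) and \com{km\_labels} (vector of
point-to-cluster indices) are hypothetical names standing for the results
of your K-means run; replace them with the actual variables produced by the
K-means utility.
\begin{verbatim}
 % Illustrative sketch: km_means and km_labels are assumed (hypothetical
 % names) to hold the centroids and labels of the K-means run.
 N = size(allvow, 1);
 for k = 1:5
   pts = allvow(km_labels == k, :);
   means{k}  = km_means{k};
   vars{k}   = cov(pts);            % spread of the points won by centroid k
   priors(k) = size(pts, 1) / N;    % fraction of points won by centroid k
 end
 viterb(allvow, means, vars, priors);
\end{verbatim}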
Iterate the algorithm until it converges. Observe the evolution of the
clusters, of the data points attribution chart and of the total likelihood
curve. Observe the mean, variance and prior values found after the
convergence of the algorithm. Compare them with the values computed in
section~\ref{gaussmod} (with supervised training).

\subsubsection*{Example\,:}
\mat{viterb(allvow,5);}
- or - \\
\mat{means = \{ mu\_a, mu\_e, mu\_i, mu\_o, mu\_y \};}
\mat{vars = \{ sigma\_a, sigma\_e, sigma\_i, sigma\_o, sigma\_y \};}
\mat{viterb(allvow,means,vars);}
Enlarge the window, then push the buttons, zoom etc.
After convergence, use\,:\\
\mat{for k=1:5, disp(viterb\_result\_means\{k\}); end}
\mat{for k=1:5, disp(viterb\_result\_vars\{k\}); end}
\mat{for k=1:5, disp(viterb\_result\_priors(k)); end}
to see the resulting means, variances and priors.
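To make the comparison with the supervised estimates of
section~\ref{gaussmod} easier, each converged cluster can for instance be
matched to its closest supervised mean, as in the sketch below (it assumes
that \com{mu\_a}, \ldots, \com{mu\_y} are still present in the workspace).
\begin{verbatim}
 % Match each converged cluster to the closest supervised mean
 % (assumes mu_a ... mu_y from the supervised training are still defined).
 sup_means = { mu_a, mu_e, mu_i, mu_o, mu_y };
 for k = 1:5
   for j = 1:5
     dists(j) = norm(viterb_result_means{k} - sup_means{j});
   end
   [dmin, jbest] = min(dists);
   fprintf('cluster %d <-> supervised class %d (distance %g)\n', k, jbest, dmin);
 end
\end{verbatim}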
\subsubsection*{Question\,:}
\begin{enumerate}
\item Does the final solution depend on the initialization of the
algorithm~?
\item Describe the evolution of the total likelihood. Is it monotonic~?
\item In terms of optimization of the likelihood, what does the final
solution correspond to~?
\item What is the nature of the discriminant surfaces corresponding to the
Gaussian classification~?
\item Is the algorithm suitable for fitting Gaussian clusters~?
\end{enumerate}

\subsubsection*{Answer\,:}
\expl{\begin{enumerate}
\item The final solution strongly depends upon the initialization of the
algorithm.
\item The total likelihood increases monotonically, which means that the
models ``explain'' the training dataset better and better.
\item The final solution corresponds approximately to a local maximum of
likelihood in the sense of Bayesian classification with Gaussian models.
\item As seen in section~\ref{discr}, the discriminant surfaces
corresponding to Gaussian clusters with unequal variances have the form of
(hyper-)parabolas.
\item The final solution of the Viterbi-EM algorithm is able to lock
approximately onto the simulated Gaussian phoneme classes if the number and
the placement of the initial clusters are correctly guessed (which is {\em
not an easy task}).
\end{enumerate}}

\pagebreak

%%%%%%%%%
\subsection{EM algorithm for Gaussian clustering}
%%%%%%%%%

\subsubsection*{Synopsis of the algorithm\,:}
\begin{itemize}
\item Start from $K$ initial Gaussian models ${\cal N}(\mu_{k},\Sigma_{k}),
\; k=1\cdots K$, with equal priors set to $P(q_k) = 1/K$.
\item {\bf Do}\,:
\begin{enumerate}
\item {\bf Estimation step}\,: compute the probability
$P(q_k^{(old)}|x_n,\Theta^{(old)})$ for each data point $x_n$ to belong to
the class $q_k^{(old)}$\,:
%
\begin{eqnarray}
P(q_k^{(old)}|x_n,\Theta^{(old)})
 & = & \frac{P(q_k^{(old)}|\Theta^{(old)}) \cdot p(x_n|q_k^{(old)},\Theta^{(old)})}
            {p(x_n|\Theta^{(old)})} \nonumber \\
 & = & \frac{P(q_k^{(old)}|\Theta^{(old)}) \cdot p(x_n|\mu_k^{(old)},\Sigma_k^{(old)})}
            {\sum_j P(q_j^{(old)}|\Theta^{(old)}) \cdot p(x_n|\mu_j^{(old)},\Sigma_j^{(old)})} \nonumber
\end{eqnarray}
This step is equivalent to having a set $Q$ of continuous hidden variables,
taking values in the interval $[0,1]$, that give a labeling of the data by
telling to which extent a point $x_n$ belongs to the class $q_k$. This
represents a soft classification, since a point can belong, e.g., by 60\%
to class 1 and by 40\% to class 2 (think of Schr\"odinger's cat, which is
60\% alive and 40\% dead as long as nobody opens the box or performs
Bayesian classification).
%
\item {\bf Maximization step}\,:
\begin{itemize}
\item update the means\,:
\[ \mu_{k}^{(new)} = \frac{\sum_{n=1}^{N} x_n P(q_k^{(old)}|x_n,\Theta^{(old)})}
                          {\sum_{n=1}^{N} P(q_k^{(old)}|x_n,\Theta^{(old)})} \]
\item update the variances\,:
\[ \Sigma_{k}^{(new)} = \frac{\sum_{n=1}^{N} P(q_k^{(old)}|x_n,\Theta^{(old)})
                              (x_n - \mu_k^{(new)})(x_n - \mu_k^{(new)})^T}
                             {\sum_{n=1}^{N} P(q_k^{(old)}|x_n,\Theta^{(old)})} \]
\item update the priors\,:
\[ P(q_k^{(new)}|\Theta^{(new)}) = \frac{1}{N} \sum_{n=1}^{N} P(q_k^{(old)}|x_n,\Theta^{(old)}) \]
\end{itemize}
%
In the present case, all the data points participate in the update of all
the models, but their participation is weighted by the value of
$P(q_k^{(old)}|x_n,\Theta^{(old)})$.
\item Go to 1.
\end{enumerate}
\item {\bf Until}\,: the total likelihood increase for the training data
falls under some desired threshold.
\end{itemize}

The global criterion defined in the present case is\,:
\begin{eqnarray}
{\cal L}(\Theta) \; = \; \log p(X|\Theta) & = & \log \sum_Q p(X,Q|\Theta) \nonumber \\
 & = & \log \sum_Q P(Q|X,\Theta) p(X|\Theta) \;\;\;\; \mbox{(Bayes)} \nonumber \\
 & = & \log \sum_{k=1}^{K} P(q_k|X,\Theta) p(X|\Theta) \nonumber
\end{eqnarray}
Applying Jensen's inequality $\left( \log \sum_j \lambda_j y_j \geq \sum_j
\lambda_j \log y_j \mbox{ if } \sum_j \lambda_j = 1 \right)$, we obtain\,:
\begin{eqnarray}
{\cal L}(\Theta) & \geq & \sum_{k=1}^{K} P(q_k|X,\Theta) \log p(X|\Theta) \nonumber \\
 & = & \sum_{k=1}^{K} \sum_{n=1}^{N} P(q_k|x_n,\Theta) \log p(x_n|\Theta) \nonumber
\end{eqnarray}
Hence, the final expression represents a lower bound for the joint
likelihood of all the data with respect to all the models. This criterion
is locally maximized by the algorithm.
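As for the Viterbi variant, one iteration of the two steps above can be
written in a few lines of MATLAB. The sketch below only illustrates the
update formulas with the soft weights $P(q_k^{(old)}|x_n,\Theta^{(old)})$,
it is {\em not} the code of the \com{EMALGO} utility used below;
\com{data}, \com{means}, \com{vars} and \com{priors} are the same
placeholder variables as in the Viterbi-EM sketch.
\begin{verbatim}
 % One EM iteration (illustrative sketch, not the EMALGO code).
 [N, d] = size(data);  K = length(means);

 % Estimation step: posterior probability of each class for each point
 post = zeros(N, K);
 for k = 1:K
   dx = data - repmat(means{k}, N, 1);
   post(:,k) = priors(k) * exp(-0.5*sum((dx / vars{k}) .* dx, 2)) ...
               / sqrt((2*pi)^d * det(vars{k}));
 end
 post = post ./ repmat(sum(post, 2), 1, K);   % normalize over the classes

 % Maximization step: weighted re-estimation of the parameters
 for k = 1:K
   w = post(:,k);
   means{k}  = sum(repmat(w, 1, d) .* data, 1) / sum(w);
   dx = data - repmat(means{k}, N, 1);
   vars{k}   = (repmat(w, 1, d) .* dx)' * dx / sum(w);
   priors(k) = sum(w) / N;
 end
\end{verbatim}
Compared with the Viterbi-EM sketch, the hard 0/1 labels are simply
replaced by the posterior weights, so that every point contributes to the
update of every cluster.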
\subsubsection*{Experiment\,:}
Use the EM explorer utility\,:
\begin{verbatim}
 EMALGO   EM algorithm explorer

   Launch it with EMALGO(DATA,NCLUST) where DATA is the matrix of
   observations (one observation per row) and NCLUST is the desired
   number of clusters. The clusters are initialized with a heuristic
   that spreads them randomly around mean(DATA) with standard
   deviation sqrtm(cov(DATA)*10). Their initial covariance is set to
   cov(DATA).

   If you want to set your own initial clusters, use
   EMALGO(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
   containing respectively NCLUST initial mean vectors and NCLUST
   initial covariance matrices. In this case, the initial a-priori
   probabilities are set equal to 1/NCLUST. To set your own initial
   priors, use EMALGO(DATA,MEANS,VARS,PRIORS) where PRIORS is a
   vector containing NCLUST a priori probabilities.

   Example: for two clusters
     means{1} = [1 2]; means{2} = [3 4];
     vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
     emalgo(data,means,vars);
\end{verbatim}
Launch it again with the same dataset \com{allvow}. Do several runs with
different cases of initialization of the algorithm\,:
\begin{enumerate}
\item 5 clusters determined according to the default heuristic;
\item some initial \com{MEANS} values equal to some data points, and some
random \com{VARS} values (e.g. \com{cov(allvow)} for all the classes);
\item the initial \com{MEANS} and \com{VARS} values found by the K-means
algorithm;
\item some initial \com{MEANS} values equal to $\{\mu_{/a/}, \mu_{/e/},
\mu_{/i/}, \mu_{/o/}, \mu_{/y/}\}$, \com{VARS} values equal to \linebreak
$\{\Sigma_{/a/}, \Sigma_{/e/}, \Sigma_{/i/}, \Sigma_{/o/}, \Sigma_{/y/}\}$,
and \com{PRIORS} values equal to
$[P_{/a/},P_{/e/},P_{/i/},P_{/o/},P_{/y/}]$;
\item some initial \com{MEANS} and \com{VARS} values chosen by yourself.
\end{enumerate}
(If you have time, also increase the number of clusters and play again with
the algorithm.)

Iterate the algorithm until the total likelihood reaches asymptotic
convergence. Observe the evolution of the clusters and of the total
likelihood curve. (In the EM case, the data points attribution chart is not
given, because each data point participates in the update of each cluster.)
Observe the mean, variance and prior values found after the convergence of
the algorithm. Compare them with the values found in
section~\ref{gaussmod}.

\subsubsection*{Example\,:}
\mat{emalgo(allvow,5);}
- or - \\
\mat{means = \{ mu\_a, mu\_e, mu\_i, mu\_o, mu\_y \};}
\mat{vars = \{ sigma\_a, sigma\_e, sigma\_i, sigma\_o, sigma\_y \};}
\mat{emalgo(allvow,means,vars);}
Enlarge the window, then push the buttons, zoom etc.
After convergence, use\,:\\
\mat{for k=1:5, disp(emalgo\_result\_means\{k\}); end}
\mat{for k=1:5, disp(emalgo\_result\_vars\{k\}); end}
\mat{for k=1:5, disp(emalgo\_result\_priors(k)); end}
to see the resulting means, variances and priors.

\subsubsection*{Question\,:}
\begin{enumerate}
\item Does the final solution depend on the initialization of the
algorithm~?
\item Describe the evolution of the total likelihood. Is it monotonic~?
\item In terms of optimization of the likelihood, what does the final
solution correspond to~?
\item Is the algorithm suitable for fitting Gaussian clusters~?
\end{enumerate}

\subsubsection*{Answer\,:}
\expl{\begin{enumerate}
\item Here again, the final solution depends upon the initialization of
the algorithm.
\item Again, the total likelihood always increases monotonically.
\item The final solution corresponds to a local maximum of likelihood in
the sense of Bayesian classification with Gaussian models.
\item The final solution of the EM algorithm is able to lock perfectly
onto the Gaussian phoneme clusters, but only if the number and the
placement of the initial clusters are correctly guessed (which, once
again, is {\em not an easy task}).
\tab In practice, the initial clusters are usually set to the K-means
solution (which, as you have seen, may not be an optimal guess).
Alternatively, they can be set to some ``well chosen'' data points.
\end{enumerate}}
\vfill

%%%%%%%%%%%%%%%%%%
\section*{After the lab...}
%%%%%%%%%%%%%%%%%%

This lab manual can be kept as additional course material. If you want to
browse the experiments again, you can use the script\,:

\noindent \com{>> lab1demo}

\noindent
which will automatically redo all the computation and plots for you (except
those of section~\ref{unsup}).

\end{document}