- or - \\
\mat{means = \{ mu\_a, mu\_e, mu\_i, mu\_o, mu\_y \};}
\mat{kmeans(allvow,means);}
Enlarge the window, then push the buttons, zoom, etc.
After convergence, use:\\
\mat{for k=1:5, disp(kmeans\_result\_means\{k\}); end}
to see the resulting means.
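The total squared Euclidean distance that drives the algorithm (see the
questions below) can also be recomputed by hand after convergence. Here is a
minimal sketch, assuming \com{allvow} and the cell array
\com{kmeans\_result\_means} shown above are in the workspace:
\begin{verbatim}
% Total squared Euclidean distance of allvow to the final K-means means
% (assumes allvow and kmeans_result_means are in the workspace).
D = 0;
for n = 1:size(allvow,1)
  d2 = zeros(1,5);
  for k = 1:5
    d = allvow(n,:) - kmeans_result_means{k};   % difference to mean k
    d2(k) = d * d';                             % squared Euclidean distance
  end
  D = D + min(d2);                              % the closest cluster wins
end
disp(D);
\end{verbatim}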
\subsubsection*{Think about the following questions:}
\begin{enumerate}
\item Does the final solution depend on the initialization of the
algorithm?
\item Describe the evolution of the total squared Euclidean distance.
\item What is the nature of the discriminant surfaces corresponding to
a minimum Euclidean distance classification scheme?
\item Is the algorithm suitable for fitting Gaussian clusters?
\end{enumerate}
%%%%%%%%%
\subsection{Viterbi-EM algorithm for Gaussian clustering}
%%%%%%%%%
\subsubsection*{Synopsis of the algorithm:}
\begin{itemize}
\item Start from $K$ initial Gaussian models ${\cal N}(\muv_{k},\Sm_{k}),
\; k=1\ldots K$, characterized by the set of parameters $\Tm$ (i.e., the
set of all means and variances $\muv_k$ and $\Sm_k$, $k=1\ldots K$). Set
the initial prior probabilities $P(q_k)$ to $1/K$.
\item {\bf Do}:
\begin{enumerate}
\item Classify each data-point using Bayes' rule.
This step is equivalent to having a set $Q$ of boolean hidden variables
that give a labeling of the data by taking the value 1 (belongs) or 0 (does
not belong) for each class $q_k$ and each point $\xv_n$. The value of $Q$
that maximizes $p(X,Q|\Tm)$ precisely tells which is the most probable
model for each point of the whole set $X$ of training data.
Hence, each data point $\xv_n$ is assigned to its most probable
cluster $q_k$.
\item Update the parameters ($i$ is the iteration index):
\begin{itemize}
\item update the means:
\[
\muv_{k}^{(i+1)} = \mbox{mean of the points belonging to } q_k^{(i)}
\]
\item update the variances:
\[
\Sm_{k}^{(i+1)} = \mbox{variance of the points belonging to } q_k^{(i)}
\]
\item update the priors:
\[
P(q_k^{(i+1)}|\Tm^{(i+1)}) = \frac{\mbox{number of training points
belonging to } q_k^{(i)} }{\mbox{total number of training points}}
\]
\end{itemize}
\item Go to 1.
\end{enumerate}
\item {\bf Until}: no further change occurs.
\end{itemize}
The global criterion in the present case is
\begin{eqnarray*}
{\cal L}(\Tm) &=& \log p(X,Q|\Tm) \\
&=& \sum_{k=1}^{K} \sum_{\xv_n \in q_k} \log p(\xv_n|\Tm_k),
\end{eqnarray*}
where $Q$ is the labeling found in step 1. It represents the joint likelihood
of the data with respect to the models they belong to. This criterion is
locally maximized by the algorithm.
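As an illustration of the scheme above (and not the actual implementation of
the \com{viterb} utility described below), one iteration could be sketched as
follows, with the $K$ models stored in cell arrays \com{means} and \com{vars},
the priors in a vector \com{priors}, and the data in a matrix \com{X} with one
observation per row:
\begin{verbatim}
% One Viterbi-EM iteration (illustrative sketch, not the viterb code).
% X: N x D data matrix; means, vars: 1 x K cell arrays; priors: 1 x K vector.
[N, D] = size(X);
K = length(means);
% Step 1: hard classification of each point with Bayes' rule.
labels = zeros(N,1);
for n = 1:N
  post = zeros(1,K);
  for k = 1:K
    d = X(n,:)' - means{k}';
    g = exp(-0.5 * d' * (vars{k} \ d)) / sqrt((2*pi)^D * det(vars{k}));
    post(k) = priors(k) * g;              % proportional to P(q_k | x_n)
  end
  [~, labels(n)] = max(post);             % most probable cluster
end
% Step 2: re-estimate each model from the points assigned to it
% (empty clusters are not handled in this sketch).
for k = 1:K
  Xk = X(labels == k, :);
  means{k}  = mean(Xk, 1);
  vars{k}   = cov(Xk);
  priors(k) = size(Xk,1) / N;
end
\end{verbatim}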
\subsubsection*{Experiment:}
Use the Viterbi-EM explorer utility:
\begin{verbatim}
VITERB Viterbi version of the EM algorithm
Launch it with VITERB(DATA,NCLUST) where DATA is the matrix
of observations (one observation per row) and NCLUST is the
desired number of clusters.
The clusters are initialized with a heuristic that spreads
them randomly around mean(DATA) with standard deviation
sqrtm(cov(DATA)). Their initial covariance is set to cov(DATA).
If you want to set your own initial clusters, use
VITERB(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
containing respectively NCLUST initial mean vectors and NCLUST
initial covariance matrices. In this case, the initial a-priori
probabilities are set equal to 1/NCLUST.
To set your own initial priors, use VITERB(DATA,MEANS,VARS,PRIORS)
where PRIORS is a vector containing NCLUST a priori probabilities.
Example: for two clusters
means{1} = [1 2]; means{2} = [3 4];
vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
viterb(data,means,vars);
\end{verbatim}
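The default initialization heuristic described in the help could look roughly
as follows (a sketch of the behaviour described above, not the actual code of
\com{viterb}):
\begin{verbatim}
% Sketch of the described default initialization (not the actual viterb code).
m = mean(allvow);                     % overall mean (row vector)
R = sqrtm(cov(allvow));               % matrix square root of the covariance
for k = 1:5
  means{k} = m + (R * randn(size(allvow,2),1))';  % random spread around m
  vars{k}  = cov(allvow);                         % initial covariance
end
priors = ones(1,5) / 5;               % equal initial priors (1/NCLUST)
\end{verbatim}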
Launch \com{viterb} with the dataset \com{allvow}. Do several runs
with different initializations of the algorithm:
\begin{enumerate}
\item 5 initial clusters determined according to the default heuristic;
\item initial \com{MEANS} values equal to some data points, and some
random \com{VARS} values (try for instance \com{cov(allvow)} for all
the classes);
\item the initial \com{MEANS}, \com{VARS} and \com{PRIORS} values
found by the $K$-means algorithm;
\item initial \com{MEANS} values equal to $\{\muv_{\text{/a/}},
\muv_{\text{/e/}}, \muv_{\text{/i/}}, \muv_{\text{/o/}},
\muv_{\text{/y/}}\}$, \com{VARS} values equal to \linebreak
$\{\Sm_{\text{/a/}}, \Sm_{\text{/e/}}, \Sm_{\text{/i/}},
\Sm_{\text{/o/}}, \Sm_{\text{/y/}}\}$, and \com{PRIORS} values
equal to
$[P_{\text{/a/}},P_{\text{/e/}},P_{\text{/i/}},P_{\text{/o/}},P_{\text{/y/}}]$;
\item initial \com{MEANS} and \com{VARS} values chosen by yourself.
\end{enumerate}
Iterate the algorithm until it converges. Observe the evolution of the
clusters, of the data point attribution chart and of the total likelihood
curve. Observe the mean, variance and prior values found after convergence
of the algorithm, and compare them with the values computed in
section~\ref{gaussmod} (using supervised training).
\subsubsection*{Example:}
\mat{viterb(allvow,5);}
- or - \\
\mat{means = \{ mu\_a, mu\_e, mu\_i, mu\_o, mu\_y \};}
\mat{vars = \{ sigma\_a, sigma\_e, sigma\_i, sigma\_o, sigma\_y \};}
\mat{viterb(allvow,means,vars);}
Enlarge the window, then push the buttons, zoom, etc.
After convergence, use:\\
\mat{for k=1:5, disp(viterb\_result\_means\{k\}); end}
\mat{for k=1:5, disp(viterb\_result\_vars\{k\}); end}
\mat{for k=1:5, disp(viterb\_result\_priors(k)); end}
to see the resulting means, variances and priors.
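For the initialization of item 4 above, which also sets the priors, the
fourth argument mentioned in the help can be used. The names of the variables
holding the supervised priors below are hypothetical (assuming they were saved
in section~\ref{gaussmod}):\\
\mat{priors = [P\_a P\_e P\_i P\_o P\_y];}
\mat{viterb(allvow,means,vars,priors);}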
\subsubsection*{Question:}
\begin{enumerate}
\item Does the final solution depend on the initialization of the
algorithm?
\item Describe the evolution of the total likelihood. Is it monotonic?
\item In terms of optimization of the likelihood, what does the final
solution correspond to?
\item What is the nature of the discriminant surfaces corresponding to the
Gaussian classification?
\item Is the algorithm suitable for fitting Gaussian clusters?
\end{enumerate}
%%%%%%%%%
\subsection{EM algorithm for Gaussian clustering}
%%%%%%%%%
\subsubsection*{Synopsis of the algorithm:}
\begin{itemize}
\item Start from $K$ initial Gaussian models ${\cal N}(\muv_{k},\Sm_{k}),
\; k=1\ldots K$, with equal priors set to $P(q_k) = 1/K$.
\item {\bf Do}:
\begin{enumerate}
\item {\bf Estimation step}: compute the probability
$P(q_k^{(i)}|\xv_n,\Tm^{(i)})$ for each data point $\xv_n$ to belong
to the class $q_k^{(i)}$:
%
\begin{eqnarray*}
P(q_k^{(i)}|\xv_n,\Tm^{(i)}) & = & \frac{P(q_k^{(i)}|\Tm^{(i)})
\cdot p(\xv_n|q_k^{(i)},\Tm^{(i)})}
{p(\xv_n|\Tm^{(i)})} \\
& = & \frac{P(q_k^{(i)}|\Tm^{(i)}) \cdot p(\xv_n|\muv_k^{(i)},\Sm_k^{(i)}) }
{\sum_j P(q_j^{(i)}|\Tm^{(i)}) \cdot p(\xv_n|\muv_j^{(i)},\Sm_j^{(i)}) }
\end{eqnarray*}
This step is equivalent to having a set $Q$ of continuous hidden
variables, taking values in the interval $[0,1]$, that give a labeling
of the data by telling to which extent a point $\xv_n$ belongs to the
class $q_k$. This represents a soft classification, since a point can
belong, e.g., by 60\,\% to class 1 and by 40\,\% to class 2 (think of
Schr\"odinger's cat which is 60\,\% alive and 40\,\% dead as long as
nobody opens the box or performs Bayesian classification).
%
\item {\bf Maximization step}:
\begin{itemize}
\item update the means:
\[\muv_{k}^{(i+1)} = \frac{\sum_{n=1}^{N} \xv_n
P(q_k^{(i)}|\xv_n,\Tm^{(i)})}
{\sum_{n=1}^{N} P(q_k^{(i)}|\xv_n,\Tm^{(i)})} \]
\item update the variances:
\[\Sm_{k}^{(i+1)} = \frac{\sum_{n=1}^{N} P(q_k^{(i)}|\xv_n,\Tm^{(i)})\;
(\xv_n - \muv_k^{(i+1)})(\xv_n - \muv_k^{(i+1)})\trn }
{\sum_{n=1}^{N} P(q_k^{(i)}|\xv_n,\Tm^{(i)})} \]
\item update the priors:
\[ P(q_k^{(i+1)}|\Tm^{(i+1)}) = \frac{1}{N} \sum_{n=1}^{N}
P(q_k^{(i)}|\xv_n,\Tm^{(i)}) \]
\end{itemize}
%
In the present case, all the data points participate in the update of all
the models, but their contribution is weighted by the value of
$P(q_k^{(i)}|\xv_n,\Tm^{(i)})$.
\item Go to 1.
\end{enumerate}
\item {\bf Until}: the total likelihood increase for the training data
falls under some desired threshold.
\end{itemize}
The global criterion in the present case is the joint likelihood of
all data with respect to all the models:
\begin{align*}
{\cal L}(\Tm) = \log p(X|\Tm)
& = \log \sum_Q p(X,Q|\Tm)\\
& = \log \sum_Q P(Q|X,\Tm)p(X|\Tm) \quad\text{(Bayes)} \\
& = \log \sum_{k=1}^{K} P(q_k|X,\Tm) p(X|\Tm)
\end{align*}
Applying Jensen's inequality $\left( \log \sum_j \lambda_j y_j \geq
\sum_j \lambda_j \log y_j \mbox{ if } \sum_j \lambda_j = 1 \right)$,
we obtain:
\begin{align*}
{\cal L}(\Tm) \ge J(\Tm)
&= \sum_{k=1}^{K} P(q_k|X,\Tm) \log p(X|\Tm)\\
&= \sum_{k=1}^{K} \sum_{n=1}^{N} P(q_k|\xv_n,\Tm) \log p(\xv_n|\Tm)
\end{align*}
Hence, the criterion $J(\Tm)$ represents a lower bound on ${\cal L}(\Tm)$.
This criterion is locally maximized by the algorithm.
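To make the two steps concrete, here is a minimal sketch of one EM iteration
(an illustration, not the implementation of the \com{emalgo} utility described
below), with the same conventions as in the previous sketch: the E-step fills
an $N \times K$ matrix of posteriors $P(q_k|\xv_n,\Tm)$, and the M-step uses
them as weights:
\begin{verbatim}
% One EM iteration (illustrative sketch, not the emalgo code).
% X: N x D data matrix; means, vars: 1 x K cell arrays; priors: 1 x K vector.
[N, D] = size(X);
K = length(means);
% E-step: posterior probability of each class for each point.
post = zeros(N, K);
for n = 1:N
  for k = 1:K
    d = X(n,:)' - means{k}';
    post(n,k) = priors(k) * exp(-0.5 * d' * (vars{k} \ d)) ...
                / sqrt((2*pi)^D * det(vars{k}));
  end
  post(n,:) = post(n,:) / sum(post(n,:));   % normalize over the classes
end
% M-step: weighted re-estimation of means, covariances and priors.
for k = 1:K
  w = post(:,k);                            % weights of all points for class k
  means{k} = (w' * X) / sum(w);
  Xc = X - repmat(means{k}, N, 1);          % data centered on the new mean
  vars{k} = (Xc' * (Xc .* repmat(w, 1, D))) / sum(w);
  priors(k) = sum(w) / N;
end
\end{verbatim}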
\subsubsection*{Experiment:}
Use the EM explorer utility:
\begin{verbatim}
EMALGO EM algorithm explorer
Launch it with EMALGO(DATA,NCLUST) where DATA is the matrix
of observations (one observation per row) and NCLUST is the
desired number of clusters.
The clusters are initialized with a heuristic that spreads
them randomly around mean(DATA) with standard deviation
sqrtm(cov(DATA)*10). Their initial covariance is set to cov(DATA).
If you want to set your own initial clusters, use
EMALGO(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
containing respectively NCLUST initial mean vectors and NCLUST
initial covariance matrices. In this case, the initial a-priori
probabilities are set equal to 1/NCLUST.
To set your own initial priors, use EMALGO(DATA,MEANS,VARS,PRIORS)
where PRIORS is a vector containing NCLUST a priori probabilities.
Example: for two clusters
means{1} = [1 2]; means{2} = [3 4];
vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
emalgo(data,means,vars);
\end{verbatim}
Launch \com{emalgo} again with the same dataset \com{allvow}. Do
several runs with different initializations of the algorithm:
\begin{enumerate}
\item 5 clusters determined according to the default heuristic;
\item initial \com{MEANS} values equal to some data points, and some
random \com{VARS} values (e.g., \com{cov(allvow)} for all the
classes);
\item the initial \com{MEANS} and \com{VARS} values found by the
$K$-means algorithm;
\item initial \com{MEANS} values equal to $\{\muv_{\text{/a/}},
\muv_{\text{/e/}}, \muv_{\text{/i/}}, \muv_{\text{/o/}},
\muv_{\text{/y/}}\}$, \com{VARS} values equal to \linebreak
$\{\Sm_{\text{/a/}}, \Sm_{\text{/e/}}, \Sm_{\text{/i/}},
\Sm_{\text{/o/}}, \Sm_{\text{/y/}}\}$, and \com{PRIORS} values
equal to
$[P_{\text{/a/}},P_{\text{/e/}},P_{\text{/i/}},P_{\text{/o/}},P_{\text{/y/}}]$;
\item initial \com{MEANS} and \com{VARS} values chosen by yourself.
\end{enumerate}
(If you have time, also increase the number of clusters and experiment with
the algorithm again.)
Iterate the algorithm until the total likelihood reaches asymptotic
convergence. Observe the evolution of the clusters and of the total
likelihood curve. (In the EM case, the data point attribution chart is not
given, because each data point participates in the update of every cluster.)
Observe the mean, variance and prior values found after the convergence of
the algorithm. Compare them with the values found in
section~\ref{gaussmod}.
\subsubsection*{Example:}
\mat{emalgo(allvow,5);}
- or - \\
\mat{means = \{ mu\_a, mu\_e, mu\_i, mu\_o, mu\_y \};}
\mat{vars = \{ sigma\_a, sigma\_e, sigma\_i, sigma\_o, sigma\_y \};}
\mat{emalgo(allvow,means,vars);}
Enlarge the window, then push the buttons, zoom, etc.
After convergence, use:\\
\mat{for k=1:5, disp(emalgo\_result\_means\{k\}); end}
\mat{for k=1:5, disp(emalgo\_result\_vars\{k\}); end}
\mat{for k=1:5, disp(emalgo\_result\_priors(k)); end}
to see the resulting means, variances and priors.
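For question 2 below, the total log-likelihood $\log p(X|\Tm)$ can also be
recomputed by hand from the final parameters (the curve displayed by the
utility may be defined or scaled slightly differently). A minimal sketch,
assuming the result variables shown above are in the workspace:
\begin{verbatim}
% Total log-likelihood of allvow under the final mixture
% (assumes emalgo_result_means, _vars and _priors are in the workspace).
[N, D] = size(allvow);
L = 0;
for n = 1:N
  pxn = 0;
  for k = 1:5
    d = allvow(n,:)' - emalgo_result_means{k}';
    pxn = pxn + emalgo_result_priors(k) ...
          * exp(-0.5 * d' * (emalgo_result_vars{k} \ d)) ...
          / sqrt((2*pi)^D * det(emalgo_result_vars{k}));
  end
  L = L + log(pxn);
end
disp(L);
\end{verbatim}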
\subsubsection*{Question:}
\begin{enumerate}
\item Does the final solution depend on the initialization of the
algorithm?
\item Describe the evolution of the total likelihood. Is it monotonic?
\item In terms of optimization of the likelihood, what does the final
solution correspond to?
\item Is the algorithm suitable for fitting Gaussian clusters?
\end{enumerate}
\subsection*{Questions on the $K$-means, Viterbi-EM and EM algorithms:}
\subsubsection*{Find the right statements:}
\begin{itemize}
\item[$\Box$] With the $K$-means algorithm the solution is independent
of the initialization.
\item[$\Box$] With the $K$-means algorithm the discriminant surfaces are
linear.
\item[$\Box$] The $K$-means algorithm is well suited for fitting
Gaussian clusters.
\item[$\Box$] In the $K$-means algorithm, the global criterion that is
minimized is the maximum likelihood.
\item[$\Box$] In all three algorithms, the measure used as the global
criterion decreases monotonically.
\item[$\Box$] In the Viterbi-EM and EM algorithms, the solution is
highly dependent on the initialization.
\item[$\Box$] The EM algorithm is best suited for fitting Gaussian
clusters.
\item[$\Box$] It is an easy task to guess the parameters for
initialization.
\item[$\Box$] With the Viterbi-EM algorithm the discriminant surfaces
are linear.
\item[$\Box$] With the EM algorithm the discriminant surfaces have the
form of (hyper)parabolas.
\item[$\Box$] The EM algorithm needs less computational effort than
the Viterbi-EM algorithm.
\item[$\Box$] In the EM algorithm and the Viterbi-EM algorithm, the
same global criterion is used.
\item[$\Box$] The EM algorithm finds a global optimum.
\end{itemize}
\end{document}