% gaussian.tex
\xv \in q_k \quad \Leftrightarrow \quad f_k(\xv,\Tm_k) \geq f_l(\xv,\Tm_l),
\quad \forall l \neq k
\]
In this case, the functions $f_k(\xv,\Tm_k)$ are called discriminant
functions.
\end{itemize}
The a-posteriori probability $P(q_k|\xv_i)$ that a sample $\xv_i$
belongs to class $q_k$ is itself a discriminant function:
\begin{eqnarray*}
\xv \in q_k
& \Leftrightarrow & P(q_k|\xv_i) \geq P(q_l|\xv_i),\quad \forall l \neq k \\
& \Leftrightarrow & p(\xv_i|q_k)\; P(q_k) \geq p(\xv_i|q_l)\; P(q_l),\quad
\forall l \neq k \\
& \Leftrightarrow & \log p(\xv_i|q_k)+\log P(q_k) \geq \log
p(\xv_i|q_l)+\log P(q_l),\quad \forall l \neq k
\end{eqnarray*}
As in our case the samples $\xv$ are two-dimensional vectors, the
discriminant surfaces are one-dimensional, i.e., curves along which the
discriminant functions of two distinct classes take equal values.
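As an illustration, here is a minimal {\sc Matlab} sketch of such a
Bayesian decision for one test sample (the variable names are
hypothetical: \com{x} is the test sample, while \com{mu}, \com{sigma}
and \com{prior} are assumed to hold the estimated means, covariance
matrices and priors of the $K$ classes):
\begin{verbatim}
% Hypothetical sketch: assign sample x to the class that maximizes
% the discriminant log p(x|q_k) + log P(q_k), assuming mu{k},
% sigma{k} and prior(k) hold the parameters of the K class models.
K = length(prior);
g = zeros(1, K);
for k = 1:K
  dk = x(:) - mu{k}(:);
  % Gaussian log-likelihood plus log-prior (length(x) = dimension d)
  g(k) = -0.5 * dk' * (sigma{k} \ dk) - 0.5 * log(det(sigma{k})) ...
         - 0.5 * length(x) * log(2*pi) + log(prior(k));
end
[gmax, k_star] = max(g);   % x is assigned to class number k_star
\end{verbatim}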
\subsubsection{Experiment:}
%%%%%%%%%%%%
\begin{figure}
\centerline{\includegraphics[height=0.95\textheight]{iso}}
\caption{\label{iso}Iso-likelihood lines for the Gaussian pdfs
${\cal N}(\muv_{\text{/i/}},\Sm_{\text{/i/}})$ and ${\cal
N}(\muv_{\text{/e/}},\Sm_{\text{/e/}})$ (top), and ${\cal
N}(\muv_{\text{/i/}},\Sm_{\text{/e/}})$ and ${\cal
N}(\muv_{\text{/e/}},\Sm_{\text{/e/}})$ (bottom).}
\end{figure}
%%%%%%%%%%%%
The iso-likelihood lines for the Gaussian pdfs ${\cal
N}(\muv_{\text{/i/}},\Sm_{\text{/i/}})$ and ${\cal
N}(\muv_{\text{/e/}},\Sm_{\text{/e/}})$, which we used before to
model the class /i/ and the class /e/, are plotted in
figure~\ref{iso}, first graph. On the second graph in
figure~\ref{iso}, the iso-likelihood lines for ${\cal
N}(\muv_{\text{/i/}},\Sm_{\text{/e/}})$ and ${\cal
N}(\muv_{\text{/e/}},\Sm_{\text{/e/}})$ (two pdfs with the
\emph{same} covariance matrix $\Sm_{\text{/e/}}$) are represented.
On these figures, use a colored pen to join the intersections of the
level lines that correspond to equal likelihoods. Assume that the
highest iso-likelihood lines (smallest ellipses) are of the same
height. (You can also use \com{isosurf} in {\sc Matlab} to create a
color plot.)
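If the \com{isosurf} utility is not available, a similar picture can be
obtained with the standard {\sc Matlab} \com{contour} command. The
sketch below is only illustrative: \com{mu\_i}, \com{sigma\_i},
\com{mu\_e} and \com{sigma\_e} are hypothetical names for the model
parameters, and the grid limits must be adapted to the range of the
data.
\begin{verbatim}
% Hypothetical sketch: iso-likelihood lines of the two Gaussian
% models and the curve where their likelihoods are equal.
[X, Y] = meshgrid(linspace(200, 1200, 200), linspace(500, 3500, 200));
L1 = zeros(size(X));  L2 = zeros(size(X));
for n = 1:numel(X)
  x  = [X(n); Y(n)];
  d1 = x - mu_i(:);   d2 = x - mu_e(:);
  L1(n) = -0.5*d1'*(sigma_i\d1) - 0.5*log(det(sigma_i)) - log(2*pi);
  L2(n) = -0.5*d2'*(sigma_e\d2) - 0.5*log(det(sigma_e)) - log(2*pi);
end
contour(X, Y, L1, 10); hold on; contour(X, Y, L2, 10);
contour(X, Y, L1 - L2, [0 0], 'k');   % points of equal likelihood
hold off;
\end{verbatim}
The last \com{contour} call draws directly the curve along which the
two likelihoods are equal, i.e., the discriminant curve.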
\subsubsection{Question:}
What is the nature of the surface that separates class /i/ from class
/e/ when the two models have {\em different} variances? Can you
explain the origin of this form?
What is the nature of the surface that separates class /i/ from class
/e/ when the two models have the {\em same} variances? Why is it
different from the previous discriminant surface?
\\
Show that in the case of two Gaussian pdfs with \emph{equal covariance
matrices}, the separation between class 1 and class 2 does not
depend upon the covariance $\Sm$ any more.
\\
\tab As a summary, we have seen that Bayesian classifiers with
Gaussian data models separate the classes with combinations of
quadratic (second-order) surfaces. If the covariance matrices of the
models are equal, these quadratic separation surfaces become simple
hyper-planes.
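For reference, here is a sketch of why the equal-covariance case
yields a hyper-plane. Dropping the terms that are common to all
classes, the log-discriminant of class $k$ reads
\[
g_k(\xv) = -\frac{1}{2}(\xv-\muv_k)\trn\Sm^{-1}(\xv-\muv_k)
           + \log P(q_k).
\]
Setting $g_1(\xv) = g_2(\xv)$ makes the quadratic term
$\xv\trn\Sm^{-1}\xv$ cancel, so that the boundary satisfies the linear
equation
\[
(\muv_1-\muv_2)\trn\,\Sm^{-1}\left(\xv - \frac{\muv_1+\muv_2}{2}\right)
= \log\frac{P(q_2)}{P(q_1)},
\]
i.e., it is a hyper-plane (a straight line in two dimensions).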
\bigskip
%%%%%%%%%
%%%%%%%%%
\section{Unsupervised training}
\label{unsup}
%%%%%%%%%
%%%%%%%%%
In the previous section, we have computed the models for classes /a/,
/e/, /i/, /o/, and /y/ knowing a-priori which training samples
belong to which class (we had a {\em labeling} of the
training data at our disposal). Hence, we have performed a {\em
supervised training} of the Gaussian models. Now, suppose that we only
have unlabeled training data that we want to separate into several
classes (e.g., 5 classes) without knowing a-priori which point belongs
to which class. This is called {\em unsupervised training}. Several
algorithms are available for that purpose, among them the $K$-means,
the EM (Expectation-Maximization), and the Viterbi-EM algorithms.
%
All these algorithms are characterized by the following components:
\begin{itemize}
\item a set of models for the classes $q_k$ (not necessarily
Gaussian), defined by a parameter set $\Tm$ (means, variances,
priors,...);
\item a measure of membership, indicating to what extent a data point
``belongs'' to a model;
\item a ``recipe'' to update the model parameters as a function of the
membership information.
\end{itemize}
The measure of membership usually takes the form of a distance
measure or of a probability measure. It replaces the missing labeling
information, so that standard parameter estimation techniques can
still be applied. It also implicitly defines a global criterion of
``goodness of fit'' of the models to the data, e.g.:
\begin{itemize}
\item in the case of a distance measure, the models that are closer to
the data characterize it better;
\item in the case of a probability measure, the models with a larger
likelihood for the data explain it better.
\end{itemize}
\begin{latexonly}
Table~\ref{algos} summarizes the components of the algorithms that
will be studied in the following experiments. More detail will be
given in the corresponding subsections.
%\pagebreak
%\newpage
%\pagestyle{empty}
\newcommand{\PBS}[1]{\let\temp=\\#1\let\\=\temp}
\newcommand{\RR}{\PBS\raggedright\hspace{0pt}}
\newcommand{\RL}{\PBS\raggedleft\hspace{0pt}}
\newcommand{\CC}{\PBS\centering\hspace{0pt}}
\begin{sidewaystable}[p]
%\setlength{\arrayrulewidth}{1.2pt}
\setlength{\tabcolsep}{0.05in}
\renewcommand{\arraystretch}{2.2}
%\hspace{-1.5em}
\begin{tabular}{|>{\RR}m{5.2em}|>{\RR}m{7.3em}|>{\CC}m{21em}|>{\RR}m{22.5em}|>{\CC}m{8em}|}
\hline
\centering\bf Algorithm &
\centering\bf Parameters &
\bf Membership measure &
\centering\bf Update method &
\bf Global criterion \\ \hline
%
$K$-means &
$\bullet$ mean $\muv_k$ &
Euclidean distance \linebreak
\[d_k(\xv_n)= \sqrt{(\xv_n-\muv_k)\trn(\xv_n-\muv_k)}\] \linebreak
(or the square of it)
&
Find the points closest to $q_k^{(i)}$, then:
\medskip
$\bullet$ $\muv_{k}^{(i+1)}$ = mean of the points closest to $q_k^{(i)}$ &
Least squares \\
\hline
%
% Kmeans w/ \linebreak Mahalanobis distance &
% $\bullet$ mean $\muv_k$ \linebreak
% $\bullet$ variance $\Sm_k$ &
% Mahalanobis distance, \linebreak (Weighted Euclidean distance)
% \[d_k(x_n)= \sqrt{(x_n-\muv_k)\trn\Sm_k^{-1}(x_n-\muv_k)}\]
% &
% Find the points closest to $q_k^{(i)}$, then:
% \medskip
% $\bullet$ $\muv_{k}^{(i+1)}$ = mean of the points closest to $q_k^{(i)}$
% \medskip
% $\bullet$ $\Sm_{k}^{(i+1)}$ = variance of the points closest to $q_k^{(i)}$ &
% Minimal \linebreak spread of the data \linebreak (in a Mahalanobis distance sense) \\ \hline
%
Viterbi-EM &
$\bullet$ mean $\muv_k$ \linebreak
$\bullet$ variance $\Sm_k$ \linebreak
$\bullet$ priors $P(q_k|\Tm)$
&
Posterior probability
\[
d_k(\xv_n) = P(q_k|\xv_n,\Tm)
\]
\footnotesize
\[
\propto \frac{1}{\sqrt{2\pi}^d \sqrt{\det\left(\Sm_k\right)}}
\, e^{-\frac{1}{2} (\xv_n-\muv_k)\trn \Sm_k^{-1} (\xv_n-\muv_k)}
\cdot P(q_k|\Tm)
\]
\normalsize
&
Do Bayesian classification of each data point, then:
\medskip
$\bullet$ $\muv_{k}^{(i+1)}$ = mean of the points belonging to $q_k^{(i)}$
\medskip
$\bullet$ $\Sm_{k}^{(i+1)}$ = variance of the points belonging to $q_k^{(i)}$
\medskip
$\bullet$ $P(q_k^{(i+1)}|\Tm^{(i+1)})$ = number of training points
belonging to $q_k^{(i)}$ / total number of training points
&
Maximum likelihood \\ \hline
%
EM &
$\bullet$ mean $\muv_k$ \linebreak
$\bullet$ variance $\Sm_k$ \linebreak
$\bullet$ priors $P(q_k|\Tm)$ &
Posterior probability
\[
d_k(\xv_n) = P(q_k|\xv_n,\Tm)
\]
\footnotesize
\[
\propto \frac{1}{\sqrt{2\pi}^d \sqrt{\det\left(\Sm_k\right)}}
\, e^{-\frac{1}{2} (\xv_n-\muv_k)\trn \Sm_k^{-1} (\xv_n-\muv_k)}
\cdot P(q_k|\Tm)
\]
\normalsize
&
Compute $P(q_k^{(i)}|\xv_n,\Tm^{(i)})$ (soft classification), then:
\medskip
$\bullet$ $\muv_{k}^{(i+1)} = \frac{\sum_{n=1}^{N} \xv_n P(q_k^{(i)}|\xv_n,\Tm^{(i)})}
{\sum_{n=1}^{N} P(q_k^{(i)}|\xv_n,\Tm^{(i)})} $
\medskip
$\bullet$ $\Sm_{k}^{(i+1)} = \frac{\sum_{n=1}^{N} P(q_k^{(i)}|\xv_n,\Tm^{(i)})
(\xv_n - \muv_k^{(i+1)})(\xv_n - \muv_k^{(i+1)})\trn }
{\sum_{n=1}^{N} P(q_k^{(i)}|\xv_n,\Tm^{(i)})} $
\medskip
$\bullet$ $P(q_k^{(i+1)}|\Tm^{(i+1)}) = \frac{1}{N} \sum_{n=1}^{N} P(q_k^{(i)}|\xv_n,\Tm^{(i)}) $ &
Maximum likelihood \\ \hline
\end{tabular}
\caption{\label{algos}Characteristics of some usual unsupervised
clustering algorithms.}
\end{sidewaystable}
%
\end{latexonly}
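As an illustration of the re-estimation formulas gathered in
Table~\ref{algos}, here is a minimal {\sc Matlab} sketch of one EM
iteration (the variable names are hypothetical: \com{data} is the
$N \times d$ matrix of observations, while \com{mu}, \com{sigma} and
\com{prior} hold the current model parameters):
\begin{verbatim}
% Hypothetical sketch of one EM iteration for K Gaussian models.
% data: N x d matrix; mu{k}: 1 x d; sigma{k}: d x d; prior: 1 x K.
[N, d] = size(data);
K = length(prior);
P = zeros(N, K);                     % posteriors P(q_k | x_n, Theta)
for k = 1:K                          % E-step: soft classification
  D = data - repmat(mu{k}, N, 1);
  q = sum((D / sigma{k}) .* D, 2);   % squared Mahalanobis distances
  P(:, k) = prior(k) * exp(-0.5*q) / sqrt((2*pi)^d * det(sigma{k}));
end
P = P ./ repmat(sum(P, 2), 1, K);    % normalize over the classes
for k = 1:K                          % M-step: re-estimate parameters
  w = P(:, k);
  mu{k} = (w' * data) / sum(w);
  D = data - repmat(mu{k}, N, 1);
  sigma{k} = (D' * (D .* repmat(w, 1, d))) / sum(w);
  prior(k) = sum(w) / N;
end
\end{verbatim}
The Viterbi-EM update can be obtained from the same sketch by replacing
the soft posteriors with hard 0/1 memberships, i.e., by counting each
point only for its most probable class.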
%%%%%%%%%
%\newpage
%\pagestyle{fancyplain}
\subsection{$K$-means algorithm}
%%%%%%%%%
\subsubsection*{Synopsis of the algorithm:}
\begin{itemize}
\item Start with $K$ initial prototypes $\muv_k$, $k=1,\ldots,K$.
\item {\bf Do}:
\begin{enumerate}
\item For each data-point $\xv_n$, $n=1,\ldots,N$, compute the squared
Euclidean distance from the $k^{\text{th}}$ prototype:
\begin{eqnarray}
\label{eq:dist}
d_k(\xv_n) & = & \, \left\| \xv_n - \muv_k \right\|^2 \nonumber \\
& = & (\xv_n-\muv_k)\trn(\xv_n-\muv_k) \nonumber
\end{eqnarray}
\item Assign each data-point $\xv_n$ to its \emph{closest prototype}
$\muv_k$, i.e., assign $\xv_n$ to the class $q_k$ if
\[
d_k(\xv_n) \, < \, d_l(\xv_n), \qquad \forall l \neq k
\]
{\em Note}: using the squared Euclidean distance for the
classification gives the same result as using the true Euclidean
distance, since the square root is a monotonically increasing
function. The computational load, however, is obviously lower when the
square root is dropped.
\item Replace each prototype with the mean of the data-points assigned to
the corresponding class;
\item Go to 1.
\end{enumerate}
\item {\bf Until}: no further change occurs.
\end{itemize}
The global criterion for the present case is
\[
J = \sum_{k=1}^{K} \sum_{\xv_n \in q_k} d_k(\xv_n)
\]
and represents the total squared distance between the data and the models
they belong to. This criterion is locally minimized by the algorithm.
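As a minimal sketch of the loop described above (the variable names
are hypothetical: \com{data} is the $N \times d$ matrix of
observations and \com{mu} a cell array of $K$ initial prototypes), the
algorithm could be written in {\sc Matlab} as:
\begin{verbatim}
% Hypothetical sketch of the K-means loop (no guard for empty clusters).
[N, d] = size(data);
K = length(mu);
labels = zeros(N, 1);
while 1
  D = zeros(N, K);
  for k = 1:K                         % squared Euclidean distances
    D(:, k) = sum((data - repmat(mu{k}, N, 1)).^2, 2);
  end
  [dmin, new_labels] = min(D, [], 2); % assign to closest prototype
  if all(new_labels == labels), break; end  % stop when nothing changes
  labels = new_labels;
  for k = 1:K                         % replace prototype by class mean
    mu{k} = mean(data(labels == k, :), 1);
  end
end
J = sum(dmin);                        % total squared distance (criterion)
\end{verbatim}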
% Alternately, the Euclidean distance used in step 1. can be replaced by the
% Mahalanobis distance, belonging to the class of weighted Euclidean
% distances:
% \[
% d_k(x_n)= \sqrt{(x_n-\muv_k)\trn\Sm_k^{-1}(x_n-\muv_k)}
% \]
% where $\Sm_k$ is the covariance of the points associated with the class
% $q_k$.
% After the convergence of the algorithm, the final clusters of points may be
% used to make Gaussian models since we dispose of means and variances.
\subsubsection*{Experiment:}
Use the $K$-means explorer utility:
\begin{verbatim}
KMEANS K-means algorithm exploration tool
Launch it with KMEANS(DATA,NCLUST) where DATA is the matrix
of observations (one observation per row) and NCLUST is the
desired number of clusters.
The clusters are initialized with a heuristic that spreads
them randomly around mean(DATA) with standard deviation
sqrtm(cov(DATA)).
If you want to set your own initial clusters, use
KMEANS(DATA,MEANS) where MEANS is a cell array containing
NCLUST initial mean vectors.
Example: for two clusters
means{1} = [1 2]; means{2} = [3 4];
kmeans(data,means);
\end{verbatim}
%\pagebreak
Launch \com{kmeans} with the data sample \com{allvow}, which is part
of the file \com{vowels.mat} and gathers all the simulated vowel data. Do
several runs with different initializations of the algorithm:
\begin{enumerate}
\item 5 initial clusters determined according to the default heuristic;
\item initial \com{MEANS} values equal to some data points;
\item initial \com{MEANS} values equal to $\{\muv_{\text{/a/}},
\muv_{\text{/e/}}, \muv_{\text{/i/}}, \muv_{\text{/o/}},
\muv_{\text{/y/}}\}$.
\end{enumerate}
Iterate the algorithm until convergence. Observe the evolution of the
cluster centers, of the data-point attribution chart, and of the total
squared Euclidean distance. (It is possible to zoom these plots: left
click inside the axes to zoom $2\times$ centered on the point under the
mouse; right click to zoom out; click and drag to zoom into an area; double
click to reset the figure to the original). %Observe the mean and variance
% values found after the convergence of the algorithm.
Observe the mean values found after the convergence of the algorithm.
\subsubsection*{Example:}
\mat{kmeans(allvow,5);}
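For instance, initialization case 2 above could be set up as follows
(a minimal sketch, assuming we simply pick 5 rows of \com{allvow} at
random as the initial means):
\begin{verbatim}
% Hypothetical sketch for initialization case 2: use 5 randomly
% chosen data points of allvow as the initial cluster means.
idx = randperm(size(allvow, 1));
means = cell(1, 5);
for k = 1:5
  means{k} = allvow(idx(k), :);
end
kmeans(allvow, means);
\end{verbatim}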