and the joint likelihood of a series of samples $X$ for a given model
$\Tm$ with one Gaussian. The likelihood will be used in the formula
for classification later on (sec.~\ref{sec:classification}).
\subsubsection*{Useful formulas and definitions:}
\begin{itemize}
\item {\em Likelihood}: the likelihood of a sample point $\xv_i$ given
a data generation model (i.e., given a set of parameters $\Tm$ for
the model pdf) is the value of the pdf $p(\xv_i|\Tm)$ for that
point. In the case of Gaussian models $\Tm = (\muv,\Sm)$, this
amounts to the evaluation of equation~\ref{eq:gauss}.
\item {\em Joint likelihood}: for a set of independent identically
distributed (i.i.d.) samples, say $X = \{\xv_1, \xv_2, \ldots, \xv_N
\}$, the joint (or total) likelihood is the product of the
likelihoods for each point. For instance, in the Gaussian case (a small Matlab illustration is given right after this list):
\begin{equation}
\label{eq:joint-likelihood}
p(X|\Tm) =
\prod_{i=1}^{N} p(\xv_i|\Tm) =
\prod_{i=1}^{N} p(\xv_i|\muv,\Sm) =
\prod_{i=1}^{N} g_{(\muv,\Sm)}(\xv_i)
\end{equation}
\end{itemize}
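\medskip
\noindent
For illustration, a minimal Matlab sketch of eq.~\ref{eq:joint-likelihood}
(not part of the lab code; it assumes a data matrix \com{X} with one sample
per row and some Gaussian parameters \com{mu} and \com{sigma}) could be: \\
\mat{N = size(X,1);}
\mat{jointLike = 1;}
\mat{for i = 1:N;}
\com{xc = X(i,:) - mu; \% centered sample} \\
\com{jointLike = jointLike * exp(-0.5*xc*inv(sigma)*xc') / sqrt(det(2*pi*sigma));} \\
\com{end;} \\
\mat{jointLike}
\noindent
Note that $\det(2\pi\Sm) = (2\pi)^d \det(\Sm)$, so the denominator matches
the normalization constant of eq.~\ref{eq:gauss}. For a large $N$ this
product becomes extremely small, which is one practical reason for working
in the log domain, as discussed in the next question.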
\subsubsection{Question:}
Why might we want to compute the {\em log-likelihood} rather than
the simple {\em likelihood}?
\\
Computing the log-likelihood turns the product into a sum:
\[
p(X|\Tm) = \prod_{i=1}^{N} p(\xv_i|\Tm) \quad \Leftrightarrow \quad
\log p(X|\Tm) = \log \prod_{i=1}^{N} p(\xv_i|\Tm) = \sum_{i=1}^{N}
\log p(\xv_i|\Tm)
\]
In the Gaussian case, it also avoids the computation of the exponential:
\begin{eqnarray}
\label{eq:loglikely}
p(\xv|\Tm) & = & \frac{1}{\sqrt{2\pi}^d \sqrt{\det\left(\Sm\right)}}
\, e^{-\frac{1}{2} (\xv-\muv)\trn \Sm^{-1} (\xv-\muv)} \nonumber \\
\log p(\xv|\Tm) & = &
\frac{1}{2} \left[-d \log \left( 2\pi \right)
- \log \left( \det\left(\Sm\right) \right)
- (\xv-\muv)\trn \Sm^{-1} (\xv-\muv)\right]
\end{eqnarray}
Furthermore, since $\log(x)$ is a monotonically increasing function, the
log-likelihoods preserve the same order relations as the likelihoods:
\[
p(x|\Tm_1) > p(x|\Tm_2) \quad \Leftrightarrow \quad
\log p(x|\Tm_1) > \log p(x|\Tm_2),
\]
so they can be used directly for classification.
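\medskip
\noindent
As a quick numerical sanity check of eq.~\ref{eq:loglikely} (a sketch only;
the point \com{x} and the parameters \com{mu}, \com{sigma} below are
arbitrary example values), both of the following lines should print the same
number: \\
\mat{x = [500 1500]; mu = [730 1090]; sigma = [8000 0; 0 8000]; d = 2;}
\mat{xc = x - mu;}
\mat{log( exp(-0.5*xc*inv(sigma)*xc') / sqrt(det(2*pi*sigma)) )}
\mat{0.5*( -d*log(2*pi) - log(det(sigma)) - xc*inv(sigma)*xc' )}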
\subsubsection*{Find the right statements:}
We can further simplify the computation of the log-likelihood in
eq.~\ref{eq:loglikely} for classification by
\begin{itemize}
\item[$\Box$] dropping the division by two: $\frac{1}{2}
\left[\ldots\right]$,
\item[$\Box$] dropping term $d\log \left( 2\pi \right)$,
\item[$\Box$] dropping term $\log \left( \det\left(\Sm\right)
\right)$,
\item[$\Box$] dropping term $(\xv-\muv)\trn \Sm^{-1} (\xv-\muv)$,
\item[$\Box$] calculating the term $\log \left( \det\left(\Sm\right)
\right)$ in advance.
\end{itemize}
\medskip
\noindent
We can drop term(s) because:
\begin{itemize}
\item[$\Box$] The term(s) are independent of $\muv$.
\item[$\Box$] The term(s) are negligibly small.
\item[$\Box$] The term(s) are independent of the classes.
\end{itemize}
\noindent In summary, log-likelihoods are simpler to compute and are readily
usable for classification tasks.
\subsubsection{Experiment:}
Given the following 4 Gaussian models $\Tm_i = (\muv_i,\Sm_i)$
\begin{center}
\begin{tabular}{ccc}
${\cal N}_1: \; \Tm_1 = \left(
\left[\begin{array}{c}730 \\ 1090\end{array}\right],
\left[\begin{array}{cc}8000 & 0 \\ 0 & 8000\end{array}\right]
\right)$ & \hspace{2cm} &
${\cal N}_2: \; \Tm_2 = \left(
\left[\begin{array}{c}730 \\ 1090\end{array}\right],
\left[\begin{array}{cc}8000 & 0 \\ 0 & 18500\end{array}\right]
\right)$ \\[2em]
${\cal N}_3: \; \Tm_3 = \left(
\left[\begin{array}{c}730 \\ 1090\end{array}\right],
\left[\begin{array}{cc}8000 & 8400 \\ 8400 & 18500\end{array}\right]
\right)$ & \hspace{2cm} &
${\cal N}_4: \; \Tm_4 = \left(
\left[\begin{array}{c}270 \\ 1690\end{array}\right],
\left[\begin{array}{cc}8000 & 8400 \\ 8400 & 18500\end{array}\right]
\right)$
\end{tabular}
\end{center}
\vspace{0.5em} compute the following {\em log-likelihoods} for the whole sample
$X_3$ (10000 points):
\[
\log p(X_3|\Tm_1),\; \log p(X_3|\Tm_2),\; \log p(X_3|\Tm_3),\;
\text{and}\; \log p(X_3|\Tm_4).
\]
\subsubsection{Example:}
\mat{N = size(X3,1)}
\mat{mu\_1 = [730 1090]; sigma\_1 = [8000 0; 0 8000];}
\mat{logLike1 = 0;}
\mat{for i = 1:N;}
\com{logLike1 = logLike1 + (X3(i,:) - mu\_1) * inv(sigma\_1) * (X3(i,:) - mu\_1)';} \\
\com{end;} \\
\mat{logLike1 = - 0.5 * (logLike1 + N*log(det(sigma\_1)) + 2*N*log(2*pi))}
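\noindent
The loop above follows eq.~\ref{eq:joint-likelihood} term by term. If you
prefer to avoid the loop, the same sum of quadratic forms can be computed in
one vectorized expression (a sketch, equivalent to the code above): \\
\mat{xc = X3 - repmat(mu\_1,N,1); \% center all samples at once}
\mat{logLike1 = -0.5 * ( sum(sum((xc*inv(sigma\_1)).*xc)) + N*log(det(sigma\_1)) + 2*N*log(2*pi) )}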
\noindent Note: Use the function \com{gausview} to compare the
relative positions of the models ${\cal N}_1$, ${\cal N}_2$, ${\cal
N}_3$ and ${\cal N}_4$ with respect to the data set $X_3$, e.g.:
\\
%
\mat{mu\_1 = [730 1090]; sigma\_1 = [8000 0; 0 8000];}
\mat{gausview(X3,mu\_1,sigma\_1,'Comparison of X3 and N1');}
%
\vspace{-\baselineskip}
\subsubsection*{Question:}
Of ${\cal N}_1$, ${\cal N}_2$, ${\cal N}_3$ and ${\cal N}_4$, which
model best ``explains'' the data $X_3$? Which model has the largest
number of parameters (with non-zero values)? Which model would you
choose as a good compromise between the number of parameters and the
capacity to accurately represent the data?
%%%%%%%%%
%%%%%%%%%
\section{Statistical pattern recognition}
%%%%%%%%%
%%%%%%%%%
%%%%%%%%%
\subsection{A-priori class probabilities}
\label{sec:apriori}
%%%%%%%%%
\subsubsection{Experiment:}
Load data from file ``vowels.mat''. This file contains a database of
2-dimensional samples of speech features in the form of formant
frequencies (the first and the second spectral formants, $[F_1,F_2]$).
The formant frequency samples represent features that would be
extracted from the speech signal for several occurrences of the vowels
/a/, /e/, /i/, /o/, and /y/\footnote{/y/ is the phonetic symbol for
``\"u''}. They are grouped in matrices of size $N\times2$, where
each of the $N$ lines contains the two formant frequencies for one
occurrence of a vowel.
Supposing that the whole database adequately covers an imaginary
language made only of /a/'s, /e/'s, /i/'s, /o/'s, and /y/'s, compute
the probability $P(q_k)$ of each class $q_k$, $k \in
\{\text{/a/},\text{/e/},\text{/i/},\text{/o/},\text{/y/}\}$. Which is
the most common and which the least common phoneme in our imaginary
language?
\subsubsection{Example:}
\mat{clear all; load vowels.mat; whos}
\mat{Na = size(a,1); Ne = size(e,1); Ni = size(i,1); No = size(o,1); Ny = size(y,1);}
\mat{N = Na + Ne + Ni + No + Ny;}
\mat{Pa = Na/N}
\mat{Pi = Ni/N}
etc.
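\noindent
Equivalently (just a more compact variant of the same computation), the
class counts can be gathered into a single vector: \\
\mat{counts = [Na Ne Ni No Ny];}
\mat{priors = counts / N  \% = [Pa Pe Pi Po Py]}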
%%%%%%%%
\subsection{Gaussian modeling of classes}
\label{gaussmod}
%%%%%%%%%
\subsubsection*{Experiment:}
Plot each vowel's data as a cloud of points in the 2D plane. Train the
Gaussian model corresponding to each class (use the
\com{mean} and \com{cov} commands directly). Plot their contours (use the
function \com{plotgaus(mu,sigma,color)} directly, where \com{color =
[R,G,B]}).
\subsubsection{Example:}
\mat{plotvow; \% Plot the clouds of simulated vowel features}
(Do not close the figure obtained, it will be used later on.) \\
Then compute and plot the Gaussian models: \\
\mat{mu\_a = mean(a);}
\mat{sigma\_a = cov(a);}
\mat{plotgaus(mu\_a,sigma\_a,[0 1 1]);}
\mat{mu\_e = mean(e);}
\mat{sigma\_e = cov(e);}
\mat{plotgaus(mu\_e,sigma\_e,[0 1 1]);}
etc.
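\noindent
If preferred, the same training and plotting can also be written as a loop
over the classes (a sketch; the cell array \com{data} is introduced here only
for convenience, and the named variables \com{mu\_a}, \com{sigma\_a}, \ldots\
from the example above are still needed later on): \\
\mat{data = \{a, e, i, o, y\};}
\mat{for k = 1:5;}
\com{plotgaus(mean(data\{k\}), cov(data\{k\}), [0 1 1]);} \\
\com{end;}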
%%%%%%%%%%%%%%%%%%
%%%%%%%%%
\subsection{Bayesian classification}
\label{sec:classification}
%%%%%%%%%
We will now see how to classify a feature vector $\xv_i$ from a data
sample (or a set of several feature vectors $X$) as belonging to a certain
class $q_k$.
\subsubsection*{Useful formulas and definitions:}
\begin{itemize}
\item {\em Bayes' decision rule}:
\[
X \in q_k \quad \mbox{if} \quad P(q_k|X,\Tm) \geq P(q_j|X,\Tm),
\quad\forall j \neq k
\]
This formula means: given a set of classes $q_k$, characterized by a
set of known parameters in model $\Tm$, a set of one or more speech
feature vectors $X$ (also called {\em observations}) belongs to the
class which has the highest probability once we actually know (or
``see'', or ``measure'') the sample $X$. $P(q_k|X,\Tm)$ is therefore
called the {\em a posteriori probability}, because it depends on
having seen the observations, as opposed to the {\em a priori}
probability $P(q_k|\Tm)$ which does not depend on any observation
(but depends of course on knowing how to characterize all the
classes $q_k$, which means knowing the parameter set $\Tm$).
\item For some classification tasks (e.g. speech recognition), it is
practical to resort to {\em Bayes' law}, which makes use of {\em
likelihoods} (see sec.~\ref{sec:likelihood}), rather than trying
to directly estimate the posterior probability $P(q_k|X,\Tm)$.
Bayes' law says:
\begin{equation}
\label{eq:decision-rule}
P(q_k|X,\Tm) = \frac{p(X|q_k,\Tm)\; P(q_k|\Tm)}{p(X|\Tm)}
\end{equation}
where $q_k$ is a class, $X$ is a sample containing one or more
feature vectors and $\Tm$ is the parameter set of all the class
models.
\item The speech features are usually considered equi-probable:
$p(X|\Tm)=\text{const.}$ (uniform prior distribution for $X$).
Hence, $P(q_k|X,\Tm)$ is proportional to $p(X|q_k,\Tm) P(q_k|\Tm)$
for all classes:
\[
P(q_k|X,\Tm) \propto p(X|q_k,\Tm)\; P(q_k|\Tm), \quad \forall k
\]
\item Once again, it is more convenient to do the computation in the
$\log$ domain:
\begin{equation}
\label{eq:log-decision-rule}
\log P(q_k|X,\Tm) \propto \log p(X|q_k,\Tm) + \log P(q_k|\Tm)
\end{equation}
\end{itemize}
In our case, $\Tm$ represents the set of \emph{all} the means $\muv_k$
and variances $\Sm_k$, $k \in
\{\text{/a/},\text{/e/},\text{/i/},\text{/o/},\text{/y/}\}$ of our data
generation model. $p(X|q_k,\Tm)$ and $\log p(X|q_k,\Tm)$ are the
joint likelihood and joint log-likelihood
(eq.~\ref{eq:joint-likelihood} in section~\ref{sec:likelihood}) of the
sample $X$ with respect to the model $\Tm$ for class $q_k$ (i.e., the
model with parameter set $(\muv_k,\Sm_k)$).
The probability $P(q_k|\Tm)$ is the a-priori class probability for the
class $q_k$. It defines an absolute probability of occurrence for the
class $q_k$. The a-priori class probabilities for our phoneme classes
have been computed in section~\ref{sec:apriori}.
\subsubsection{Experiment:}
We have now modeled each vowel class with a Gaussian pdf (by
computing means and variances), and we know the probability $P(q_k)$ of
each class in the imaginary language (sec.~\ref{sec:apriori}), which
we assume to be the correct a priori probability $P(q_k|\Tm)$ for
each class given our model $\Tm$. Furthermore, we assume that the speech
\emph{features} $\xv_i$ (as opposed to the speech {\em classes} $q_k$) are
equi-probable\footnote{Note that, even for features that are not
equi-probable, finding the most probable class $q_k$ according to
eq.~\ref{eq:decision-rule} does not depend on the denominator
$p(X|\Tm)$, since $p(X|\Tm)$ is independent of $q_k$.}.
What is the most probable class $q_k$ for each of the formant pairs
(features) $\xv_i=[F_1,F_2]\trn$ given in the table below? Compute
the values of the functions $f_k(\xv_i)$ for our model $\Tm$ as the
right-hand side of eq.~\ref{eq:log-decision-rule}: $f_k(\xv_i) = \log
p(\xv_i|q_k,\Tm) + \log P(q_k|\Tm)$, proportional to the log of the
posterior probability of $\xv_i$ belonging to class $q_k$.
\medskip
\noindent
\begin{center}
\renewcommand{\arraystretch}{1.5}
\setlength{\tabcolsep}{0.12in}
\begin{tabular}{|c|c|c|c|c|c|c|c|} \hline
i & \small$\xv_i=[F_1,F_2]\trn$ &
\small $f_{\text{/a/}}(\xv_i)$ &
\small $f_{\text{/e/}}(\xv_i)$ &
\small $f_{\text{/i/}}(\xv_i)$ &
\small $f_{\text{/o/}}(\xv_i)$ &
\small $f_{\text{/y/}}(\xv_i)$ &
Most prob.\ class $q_k$ \\
\hline
1 & $[400,1800]\trn$ & & & & & & \\ \hline
2 & $[400,1000]\trn$ & & & & & & \\ \hline
3 & $[530,1000]\trn$ & & & & & & \\ \hline
4 & $[600,1300]\trn$ & & & & & & \\ \hline
5 & $[670,1300]\trn$ & & & & & & \\ \hline
6 & $[420,2500]\trn$ & & & & & & \\ \hline
\end{tabular}
\end{center}
\subsubsection{Example:}
Use function \com{gloglike(point,mu,sigma)} to compute the
log-likelihoods $\log p(\xv_i|q_k,\Tm)$. Don't forget to add the log
of the prior probability $P(q_k|\Tm)$!
E.g., for the feature vector $\xv_1$ and class /a/ use\\
\com{>> gloglike([400,1800],mu\_a,sigma\_a) + log(Pa)}
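\medskip
\noindent
A possible way to fill the whole table at once is sketched below; it assumes
that the models (\com{mu\_a}, \com{sigma\_a}, \ldots, \com{mu\_y},
\com{sigma\_y}) from section~\ref{gaussmod} and the priors (\com{Pa}, \ldots,
\com{Py}) from section~\ref{sec:apriori} are still in the workspace: \\
\mat{points = [400 1800; 400 1000; 530 1000; 600 1300; 670 1300; 420 2500];}
\mat{for n = 1:6;}
\com{fa = gloglike(points(n,:),mu\_a,sigma\_a) + log(Pa);} \\
\com{fe = gloglike(points(n,:),mu\_e,sigma\_e) + log(Pe);} \\
\com{fi = gloglike(points(n,:),mu\_i,sigma\_i) + log(Pi);} \\
\com{fo = gloglike(points(n,:),mu\_o,sigma\_o) + log(Po);} \\
\com{fy = gloglike(points(n,:),mu\_y,sigma\_y) + log(Py);} \\
\com{[fmax, k] = max([fa fe fi fo fy]) \% k = index of the most probable class} \\
\com{end;}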
\bigskip
%%%%%%%%%
\subsection{Discriminant surfaces}
\label{sec:discr}
%%%%%%%%%
For the Bayesian classification in the last section we made use of the
\emph{discriminant functions} $f_k(\xv_i) = \log p(\xv_i|q_k,\Tm) +
\log P(q_k|\Tm)$ to classify data points $\xv_i$. This corresponds to
establishing \emph{discriminant surfaces} of dimension $d-1$ in the
vector space for $\xv$ (dimension $d$) to separate regions for the
different classes.
\subsubsection*{Useful formulas and definitions:}
\begin{itemize}
\item {\em Discriminant function}: a set of functions $f_k(\xv)$ allows
  us to classify a sample $\xv$ into one of the classes $q_k$ if:
\[
\xv \in q_k \quad \mbox{if} \quad f_k(\xv) \geq f_j(\xv),
\quad \forall j \neq k
\]
Ctrl + -