likelihood given the unobservable data $p(\mathbf{y} \mid \mathbf{x}, \theta)$, as well as the probability of the unobservable data $p(\mathbf{x} \mid \theta)$.
Maximize expected log-likelihood for the complete dataset
An EM algorithm will then iteratively improve an initial estimate $\theta_0$ and construct new estimates $\theta_1, \dots, \theta_n, \dots$. An individual re-estimation step that derives $\theta_{n+1}$ from $\theta_n$ takes the following form:
$$\theta_{n+1} = \arg\max_{\theta} E_{\mathbf{x}}\!\left[\log p(\mathbf{y}, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}\right],$$
where $E_{\mathbf{x}}[\cdot]$ denotes the conditional expectation of $\log p(\mathbf{y}, \mathbf{x} \mid \theta)$, taken with the parameter $\theta$ in the conditional distribution of $\mathbf{x}$ fixed at $\theta_n$. The log-likelihood $\log p(\mathbf{y}, \mathbf{x} \mid \theta)$ is often used instead of the true likelihood $p(\mathbf{y}, \mathbf{x} \mid \theta)$ because it leads to easier formulas while still attaining its maximum at the same point as the likelihood.
In other words, $\theta_{n+1}$ is the value that maximizes (M) the conditional expectation (E) of the complete-data log-likelihood given the observed variables under the previous parameter value. This expectation is usually denoted $Q(\theta)$. In the continuous case, it would be given by
$$Q(\theta) = E_{\mathbf{x}}\!\left[\log p(\mathbf{y}, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}\right] = \int_{-\infty}^{\infty} p(\mathbf{x} \mid \mathbf{y}, \theta_n) \log p(\mathbf{y}, \mathbf{x} \mid \theta)\, d\mathbf{x}$$
Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function $Q$. Once the parameters of $Q$ are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
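
To make the two-step structure concrete, here is a minimal sketch of a generic EM loop; `e_step` and `m_step` are assumed problem-specific callables (illustrative names, not part of any standard library), and $\theta$ is represented as a flat array so the convergence check is a simple comparison:

```python
import numpy as np

def em(y, theta0, e_step, m_step, n_iter=100, tol=1e-8):
    # Generic EM loop (sketch). `e_step` computes the fixed,
    # data-dependent parameters of Q given theta_n; `m_step` returns
    # the theta maximizing the now fully determined Q.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        q_params = e_step(y, theta)      # E step: fix Q's parameters at theta_n
        new_theta = m_step(y, q_params)  # M step: argmax of Q over theta
        if np.allclose(new_theta, theta, atol=tol):
            break                        # estimates have stopped changing
        theta = new_theta
    return theta
```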
Properties
It can be shown that an EM iteration does not decrease the observed-data likelihood function. However, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm will converge to a local maximum (or saddle point) of the observed-data likelihood function, depending on the starting values. There are a variety of heuristic approaches for escaping a local maximum, such as using several different random initial estimates $\theta_0$, or applying simulated annealing.
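
The random-restart heuristic is easy to sketch; here `run_em`, `sample_theta0`, and `loglik` are assumed problem-specific callables (illustrative names only):

```python
import numpy as np

def em_with_restarts(y, sample_theta0, run_em, loglik, n_restarts=10):
    # Run EM from several random initial estimates theta_0 and keep
    # the fit with the highest observed-data log-likelihood.
    best_theta, best_ll = None, -np.inf
    for _ in range(n_restarts):
        theta = run_em(y, sample_theta0())  # one full EM run
        ll = loglik(y, theta)               # score the local optimum found
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta
```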
EM is particularly useful when maximum likelihood estimation of a complete data model is easy. If closed-form estimators exist, the M step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.
<P>"Expectation-maximization" is a description of a class of related algorithms,
not a specific algorithm; EM is a recipe or meta-algorithm which is used to
devise particular algorithms. The <A title="Baum-Welch algorithm"
href="http://en.wikipedia.org/wiki/Baum-Welch_algorithm">Baum-Welch
algorithm</A> is an example of an EM algorithm applied to <A
title="Hidden Markov model"
href="http://en.wikipedia.org/wiki/Hidden_Markov_model">hidden Markov
models</A>. Another example is the EM algorithm for fitting a <A
title="Mixture density"
href="http://en.wikipedia.org/wiki/Mixture_density">mixture density</A>
model.</P>
An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step rather than maximum likelihood.
There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient, or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
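
For contrast with the EM loop above, a derivative-based alternative might look like the following sketch, where `grad_loglik` is an assumed callable (illustrative) returning the gradient of the observed-data log-likelihood:

```python
import numpy as np

def gradient_ascent_mle(grad_loglik, theta0, step=1e-3, n_iter=1000):
    # Direct gradient ascent on the observed-data log-likelihood;
    # unlike EM, this requires first derivatives at every iteration.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta + step * grad_loglik(theta)
    return theta
```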
Incremental versions
The classic EM procedure is to replace both $Q$ and $\theta$ with their optimal possible (argmax) values at each iteration. However, it can be shown (see Neal & Hinton, 1999) that simply finding $Q$ and $\theta$ that give *some* improvement over their current values will also ensure successful convergence.
For example, to improve $Q$, we could restrict the space of possible functions to a computationally simple distribution such as a factorial distribution,
$$Q = \prod_i Q_i.$$
Thus at each E step we compute the variational approximation of $Q$.
To improve $\theta$, we could use any hill-climbing method, and not worry about finding the optimal $\theta$, just some improvement. This method is also known as *Generalized EM* (GEM).
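
As a sketch of the GEM idea, the exact M step could be replaced by a single hill-climbing step on $Q$; `grad_Q` is an assumed callable (illustrative name) returning $\partial Q / \partial \theta$ at the current parameters:

```python
import numpy as np

def gem_m_step(theta, grad_Q, step=1e-2):
    # Generalized EM (sketch): instead of maximizing Q exactly, take
    # one ascent step that merely improves Q; in the Neal & Hinton
    # sense, some improvement per step suffices for convergence.
    return np.asarray(theta, dtype=float) + step * grad_Q(theta)
```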
Relation to variational Bayes methods
EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for $\theta$ (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over $\theta$ as well as the latent variables. In fact the Bayesian approach to inference is simply to treat $\theta$ as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized $Q$ approximation as described above (variational Bayes), we may iterate over each latent variable (now including $\theta$) and optimize them one at a time. There are now $k$ steps per iteration, where $k$ is the number of latent variables. For graphical models this is easy to do, as each variable's new $Q$ depends only on its Markov blanket, so local message passing can be used for efficient inference.
Example: Gaussian mixture
Assume that the samples $\mathbf{y}_1, \dots, \mathbf{y}_m$, where $\mathbf{y}_j \in \mathbb{R}^l$, are drawn from a mixture of the Gaussian components $x_1, \dots, x_n$, such that
$$p(\mathbf{y} \mid x_i, \theta) = \mathcal{N}(\mu_i, \Sigma_i) = (2\pi)^{-l/2} \left|\Sigma_i\right|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{y} - \mu_i)^T \Sigma_i^{-1} (\mathbf{y} - \mu_i)\right)$$
The model you are trying to estimate is
$$\theta = \left\{\mu_1, \dots, \mu_n, \Sigma_1, \dots, \Sigma_n, P(x_1), \dots, P(x_n)\right\}$$
E-step
Estimate the unobserved event (which Gaussian generated each sample), conditioned on the observation, using the values from the last maximisation step:
$$P(x_i \mid \mathbf{y}_j, \theta_t) = \frac{p(\mathbf{y}_j \mid x_i, \theta_t)\, P(x_i \mid \theta_t)}{p(\mathbf{y}_j \mid \theta_t)} = \frac{p(\mathbf{y}_j \mid x_i, \theta_t)\, P(x_i \mid \theta_t)}{\sum_{k=1}^n p(x_k, \mathbf{y}_j \mid \theta_t)}$$
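A sketch of this E-step in code, computing the matrix of responsibilities $P(x_i \mid \mathbf{y}_j, \theta_t)$ (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, mus, Sigmas, weights):
    # Y: (m, l) samples; mus, Sigmas: per-component means/covariances;
    # weights: prior probabilities P(x_i | theta_t), shape (n,).
    m, n = Y.shape[0], len(mus)
    joint = np.empty((m, n))
    for i in range(n):
        # p(y_j | x_i, theta_t) * P(x_i | theta_t), the numerator above
        joint[:, i] = weights[i] * multivariate_normal(mus[i], Sigmas[i]).pdf(Y)
    # Normalizing over components gives P(x_i | y_j, theta_t)
    return joint / joint.sum(axis=1, keepdims=True)
```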
M-step
You want to maximise the expected log-likelihood of the joint event:
$$\begin{aligned}
Q(\theta) &= E_{\mathbf{x}}\!\left[\ln \prod_{j=1}^m p(\mathbf{y}_j, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}_j\right] \\
&= E_{\mathbf{x}}\!\left[\sum_{j=1}^m \ln p(\mathbf{y}_j, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}_j\right] \\
&= \sum_{j=1}^m E_{\mathbf{x}}\!\left[\ln p(\mathbf{y}_j, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}_j\right] \\
&= \sum_{j=1}^m \sum_{i=1}^n P(x_i \mid \mathbf{y}_j, \theta_t)\, \ln p(x_i, \mathbf{y}_j \mid \theta)
\end{aligned}$$
If we expand the probability of the joint event, we get
$$Q(\theta) = \sum_{j=1}^m \sum_{i=1}^n P(x_i \mid \mathbf{y}_j, \theta_t)\, \ln\left(p(\mathbf{y}_j \mid x_i, \theta)\, P(x_i \mid \theta)\right)$$
You have a constraint
$$\sum_{i=1}^{n} P(x_i \mid \theta) = 1$$
If we form the Lagrangian with this constraint, and expand the pdf, we get
$$\begin{aligned}
\mathcal{L}(\theta) = {} & \sum_{j=1}^m \sum_{i=1}^n P(x_i \mid \mathbf{y}_j, \theta_t) \left(-\frac{l}{2}\ln(2\pi) - \frac{1}{2}\ln\left|\Sigma_i\right| - \frac{1}{2}(\mathbf{y}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{y}_j - \mu_i) + \ln P(x_i \mid \theta)\right) \\
& - \lambda\left(\sum_{i=1}^{n} P(x_i \mid \theta) - 1\right)
\end{aligned}$$
To find the new estimate $\theta_{t+1}$, you find a maximum where $\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = 0$.
New estimate for the mean (using some differentiation rules from matrix calculus):

$$\mu_i^{t+1} = \frac{\sum_{j=1}^m P(x_i \mid \mathbf{y}_j, \theta_t)\, \mathbf{y}_j}{\sum_{j=1}^m P(x_i \mid \mathbf{y}_j, \theta_t)}$$