likelihood given the unobservable data $p(\mathbf{y} \mid \mathbf{x}, \theta)$, as well as the probability of the unobservable data $p(\mathbf{x} \mid \theta)$.
Maximize expected log-likelihood for the complete dataset
An EM algorithm will then iteratively improve an initial estimate $\theta_0$ and construct new estimates $\theta_1, \dots, \theta_n, \dots$. An individual re-estimation step that derives $\theta_{n+1}$ from $\theta_n$ takes the following form:
$$\theta_{n+1} = \arg\max_{\theta} E_{\mathbf{x}}\!\left[\log p(\mathbf{y}, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}\right],$$
where $E_{\mathbf{x}}[\cdot]$ denotes the conditional expectation of $\log p(\mathbf{y}, \mathbf{x} \mid \theta)$, taken with the parameter $\theta$ in the conditional distribution of $\mathbf{x}$ fixed at $\theta_n$. The log-likelihood $\log p(\mathbf{y}, \mathbf{x} \mid \theta)$ is often used instead of the true likelihood $p(\mathbf{y}, \mathbf{x} \mid \theta)$ because it leads to easier formulas while still attaining its maximum at the same point as the likelihood.
In other words, $\theta_{n+1}$ is the value that maximizes (M) the conditional expectation (E) of the complete-data log-likelihood given the observed variables under the previous parameter value. This expectation is usually denoted $Q(\theta)$. In the continuous case, it would be given by
$$Q(\theta) = E_{\mathbf{x}}\!\left[\log p(\mathbf{y}, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}\right] = \int_{-\infty}^{\infty} p(\mathbf{x} \mid \mathbf{y}, \theta_n) \log p(\mathbf{y}, \mathbf{x} \mid \theta)\, d\mathbf{x}$$
Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function $Q$. Once the parameters of $Q$ are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.
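
To make the two-step structure concrete, here is a minimal sketch of a generic EM loop; `e_step` and `m_step` are assumed problem-specific callables (illustrative names, not part of any standard library), and $\theta$ is represented as a flat array so the convergence check is a simple comparison:

```python
import numpy as np

def em(y, theta0, e_step, m_step, n_iter=100, tol=1e-8):
    # Generic EM loop (sketch). `e_step` computes the fixed,
    # data-dependent parameters of Q given theta_n; `m_step` returns
    # the theta maximizing the now fully determined Q.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        q_params = e_step(y, theta)      # E step: fix Q's parameters at theta_n
        new_theta = m_step(y, q_params)  # M step: argmax of Q over theta
        if np.allclose(new_theta, theta, atol=tol):
            break                        # estimates have stopped changing
        theta = new_theta
    return theta
```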
Properties
It can be shown that an EM iteration does not decrease the observed-data likelihood function. However, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm will converge to a local maximum (or saddle point) of the observed-data likelihood function, depending on the starting values. There are a variety of heuristic approaches for escaping a local maximum, such as using several different random initial estimates $\theta_0$, or applying simulated annealing.
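
The random-restart heuristic is easy to sketch; here `run_em`, `sample_theta0`, and `loglik` are assumed problem-specific callables (illustrative names only):

```python
import numpy as np

def em_with_restarts(y, sample_theta0, run_em, loglik, n_restarts=10):
    # Run EM from several random initial estimates theta_0 and keep
    # the fit with the highest observed-data log-likelihood.
    best_theta, best_ll = None, -np.inf
    for _ in range(n_restarts):
        theta = run_em(y, sample_theta0())  # one full EM run
        ll = loglik(y, theta)               # score the local optimum found
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta
```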
EM is particularly useful when maximum likelihood estimation of a complete data model is easy. If closed-form estimators exist, the M step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.
<P>"Expectation-maximization" is a description of a class of related algorithms,
not a specific algorithm; EM is a recipe or meta-algorithm which is used to
devise particular algorithms. The <A title="Baum-Welch algorithm"
href="http://en.wikipedia.org/wiki/Baum-Welch_algorithm">Baum-Welch
algorithm</A> is an example of an EM algorithm applied to <A
title="Hidden Markov model"
href="http://en.wikipedia.org/wiki/Hidden_Markov_model">hidden Markov
models</A>. Another example is the EM algorithm for fitting a <A
title="Mixture density"
href="http://en.wikipedia.org/wiki/Mixture_density">mixture density</A>
model.</P>
An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step rather than maximum likelihood.
There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient, or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.
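
For contrast with the EM loop above, a derivative-based alternative might look like the following sketch, where `grad_loglik` is an assumed callable (illustrative) returning the gradient of the observed-data log-likelihood:

```python
import numpy as np

def gradient_ascent_mle(grad_loglik, theta0, step=1e-3, n_iter=1000):
    # Direct gradient ascent on the observed-data log-likelihood;
    # unlike EM, this requires first derivatives at every iteration.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta + step * grad_loglik(theta)
    return theta
```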
Incremental versions
The classic EM procedure is to replace both $Q$ and $\theta$ with their optimal possible (argmax) values at each iteration. However, it can be shown (see Neal & Hinton, 1999) that simply finding $Q$ and $\theta$ that give *some* improvement over their current values will also ensure successful convergence.
For example, to improve $Q$, we could restrict the space of possible functions to a computationally simple distribution such as a factorial distribution,
$$Q = \prod_i Q_i.$$
Thus at each E step we compute the variational approximation of $Q$.
To improve $\theta$, we could use any hill-climbing method, and not worry about finding the optimal $\theta$, just some improvement. This method is also known as *Generalized EM* (GEM).
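
As a sketch of the GEM idea, the exact M step could be replaced by a single hill-climbing step on $Q$; `grad_Q` is an assumed callable (illustrative name) returning $\partial Q / \partial \theta$ at the current parameters:

```python
import numpy as np

def gem_m_step(theta, grad_Q, step=1e-2):
    # Generalized EM (sketch): instead of maximizing Q exactly, take
    # one ascent step that merely improves Q; in the Neal & Hinton
    # sense, some improvement per step suffices for convergence.
    return np.asarray(theta, dtype=float) + step * grad_Q(theta)
```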
Relation to variational Bayes methods
EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for $\theta$ (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over $\theta$ as well as the latent variables. In fact the Bayesian approach to inference is simply to treat $\theta$ as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized $Q$ approximation as described above (variational Bayes), we may iterate over each latent variable (now including $\theta$) and optimize them one at a time. There are now $k$ steps per iteration, where $k$ is the number of latent variables. For graphical models this is easy to do, as each variable's new $Q$ depends only on its Markov blanket, so local message passing can be used for efficient inference.
Example: Gaussian mixture
Assume that the samples $\mathbf{y}_1, \dots, \mathbf{y}_m$, where $\mathbf{y}_j \in \mathbb{R}^l$, are drawn from a mixture of the Gaussian components $x_1, \dots, x_n$, such that
$$p(\mathbf{y} \mid x_i, \theta) = \mathcal{N}(\mu_i, \Sigma_i) = (2\pi)^{-l/2} \left|\Sigma_i\right|^{-1/2} \exp\!\left(-\frac{1}{2}(\mathbf{y} - \mu_i)^T \Sigma_i^{-1} (\mathbf{y} - \mu_i)\right)$$
The model you are trying to estimate is
$$\theta = \left\{\mu_1, \dots, \mu_n, \Sigma_1, \dots, \Sigma_n, P(x_1), \dots, P(x_n)\right\}$$
E-step
Estimate the unobserved event (which Gaussian generated each sample), conditioned on the observation, using the values from the last maximisation step:
$$P(x_i \mid \mathbf{y}_j, \theta_t) = \frac{p(\mathbf{y}_j \mid x_i, \theta_t)\, P(x_i \mid \theta_t)}{p(\mathbf{y}_j \mid \theta_t)} = \frac{p(\mathbf{y}_j \mid x_i, \theta_t)\, P(x_i \mid \theta_t)}{\sum_{k=1}^n p(x_k, \mathbf{y}_j \mid \theta_t)}$$
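A sketch of this E-step in code, computing the matrix of responsibilities $P(x_i \mid \mathbf{y}_j, \theta_t)$ (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, mus, Sigmas, weights):
    # Y: (m, l) samples; mus, Sigmas: per-component means/covariances;
    # weights: prior probabilities P(x_i | theta_t), shape (n,).
    m, n = Y.shape[0], len(mus)
    joint = np.empty((m, n))
    for i in range(n):
        # p(y_j | x_i, theta_t) * P(x_i | theta_t), the numerator above
        joint[:, i] = weights[i] * multivariate_normal(mus[i], Sigmas[i]).pdf(Y)
    # Normalizing over components gives P(x_i | y_j, theta_t)
    return joint / joint.sum(axis=1, keepdims=True)
```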
M-step
You want to maximise the expected log-likelihood of the joint event:
$$\begin{aligned}
Q(\theta) &= E_{\mathbf{x}}\!\left[\ln \prod_{j=1}^m p(\mathbf{y}_j, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}_j\right] \\
&= E_{\mathbf{x}}\!\left[\sum_{j=1}^m \ln p(\mathbf{y}_j, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}_j\right] \\
&= \sum_{j=1}^m E_{\mathbf{x}}\!\left[\ln p(\mathbf{y}_j, \mathbf{x} \mid \theta) \,\Big|\, \mathbf{y}_j\right] \\
&= \sum_{j=1}^m \sum_{i=1}^n P(x_i \mid \mathbf{y}_j, \theta_t)\, \ln p(x_i, \mathbf{y}_j \mid \theta)
\end{aligned}$$
If we expand the probability of the joint event, we get
$$Q(\theta) = \sum_{j=1}^m \sum_{i=1}^n P(x_i \mid \mathbf{y}_j, \theta_t)\, \ln\left(p(\mathbf{y}_j \mid x_i, \theta)\, P(x_i \mid \theta)\right)$$
You have a constraint
$$\sum_{i=1}^{n} P(x_i \mid \theta) = 1$$
If we form the Lagrangian with this constraint, and expand the pdf, we get
$$\begin{aligned}
\mathcal{L}(\theta) = {} & \sum_{j=1}^m \sum_{i=1}^n P(x_i \mid \mathbf{y}_j, \theta_t) \left(-\frac{l}{2}\ln(2\pi) - \frac{1}{2}\ln\left|\Sigma_i\right| - \frac{1}{2}(\mathbf{y}_j - \mu_i)^T \Sigma_i^{-1} (\mathbf{y}_j - \mu_i) + \ln P(x_i \mid \theta)\right) \\
& - \lambda\left(\sum_{i=1}^{n} P(x_i \mid \theta) - 1\right)
\end{aligned}$$
To find the new estimate $\theta_{t+1}$, you find a maximum where $\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = 0$.
New estimate for the mean (using some differentiation rules from matrix calculus):

$$\mu_i^{t+1} = \frac{\sum_{j=1}^m P(x_i \mid \mathbf{y}_j, \theta_t)\, \mathbf{y}_j}{\sum_{j=1}^m P(x_i \mid \mathbf{y}_j, \theta_t)}$$