\noindent{2.4. PRIOR PROBABILITIES}
\smallskip

Before we even start to use the data, we need to assign a prior probability distribution $\pi(\theta)=\Pr(\theta\mid I)$ to everything relevant that we don't know.
I sometimes remark that I don't know what I am talking about --- until I have defined a prior.  
We have complete freedom to set $\pi$, subject only to it being non-negative and summing to unity (so that it can actually {\bf be} a probability distribution).  
In practice, we will assign $\pi$ by using such desiderata as symmetry, simplicity, reasonable behaviour, perhaps maximum entropy, according to our experience and judgment. 
We can also use the posterior distribution from previous observations as our current prior.
(That's what ``doing inference consistently'' means.  
It doesn't matter whether we use dataset $D_1$ first then $D_2$, or the other way round, or use them both together.  
Each procedure yields the same posterior.)
Yes, the assignment of a prior is subjective.  That's the way it is!
But, after acquiring data, we can compare evidence values, so the subjective assumptions can be objectively compared.
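
To spell out the consistency claim (a one-line check, assuming the two datasets are conditionally independent given $\theta$): using the posterior from $D_1$ as the prior for $D_2$ gives
$$
  \Pr(\theta \mid D_1, D_2) \;\propto\; \big[\pi(\theta)\Pr(D_1\mid\theta)\big]\,\Pr(D_2\mid\theta)
  \;=\; \big[\pi(\theta)\Pr(D_2\mid\theta)\big]\,\Pr(D_1\mid\theta),
$$
which is symmetric in $D_1$ and $D_2$, so the order of processing (or pooling both datasets at once) cannot affect the result.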

In my illustrative population example, we need to assign a prior over the four non-negative populations.  
In fact, we may as well split the British Isles into arbitrary fractions $f_i$ instead of restricting to four quarters.  
One choice (Ferguson 1973), which can be used at any resolution and is quite often suggested (though seldom by me nowadays), is a gamma distribution
$$
  \pi(\theta) \propto \prod_i \theta_i^{-1 + c f_i} e^{-\alpha \theta_i}
$$
where $\alpha$ and $c$ are constants, or a Dirichlet distribution if $\sum\theta$ is fixed.  
Priors nearly always involve constants like these, grandiosely called ``hyper-parameters'' to distinguish them from ordinary parameters $\theta$. 
Interpreting them and setting them appropriately is part of the art.
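
For reference (these are standard gamma-distribution facts rather than anything specific to this manual), each factor in that product normalises to
$$
  \pi(\theta_i) = {\alpha^{c f_i}\over\Gamma(c f_i)}\,\theta_i^{\,c f_i - 1} e^{-\alpha\theta_i},
  \qquad \langle\theta_i\rangle = {c f_i\over\alpha}, \qquad
  {\rm var}(\theta_i) = {c f_i\over\alpha^2},
$$
so $c/\alpha$ fixes the expected total population while $c$ on its own controls how tightly the prior clusters around that total.
That is one way of reading these particular hyper-parameters.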

Given the success of maximum entropy in regularisation, it is tempting to try
$$
  \pi(\theta) \propto
         \exp( - \theta_1 \log\theta_1 - \theta_2 \log\theta_2
               - \theta_3 \log\theta_3 - \theta_4 \log\theta_4 )
$$
or some close variant.  However, integration shows that the implied prior on the total population $\Theta = \theta_1 + \theta_2 + \theta_3 + \theta_4$ 
is nothing like $\exp( - \Theta \log\Theta)$, whereas the gamma distribution would have remained intact, albeit with quadrupled $c$. 
Worse, the entropy prior cannot be subdivided.
There is no distribution $p(\cdot)$ which could be applied independently to northern and southern England and then integrated to give a prior 
like $\exp( - \theta \log\theta )$ for the population of England as a whole.
Entropy is a good regulariser but, like most functions, it does not translate into a good prior.
(Assigning $\pi$ by maximising entropy $-\int\pi\log\pi\,d\theta$ under linear constraints is different, and legitimate.)
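
The contrast rests on a standard convolution identity, quoted here for completeness: independent gamma variables sharing a rate $\alpha$ add to give another gamma variable with the shape parameters summed,
$$
  \theta_i \sim {\rm Gamma}(c f_i, \alpha)\ \hbox{independently}
  \quad\Longrightarrow\quad
  \Theta = \sum_i \theta_i \sim {\rm Gamma}\Big(c \sum_i f_i,\ \alpha\Big),
$$
so merging or splitting cells merely re-weights the shape parameters $c f_i$.
The entropy form $\exp(-\theta\log\theta)$ obeys no such identity, which is why it can be neither aggregated nor subdivided consistently.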

\bigskip
\noindent{2.4.1. \it Atomic priors}
\smallskip

One useful way of breaking down the complexity of a problem is to construct the object as some number $n$ of {\it a-priori-}equivalent ``atoms'', 
which are scattered randomly with positions $x$ over the domain of interest.  
These positions determine whatever extra attributes are needed to model the object. 
Statisticians call the atoms in an atomic prior the ``components of a mixture distribution'' (Titterington, Smith and Makov 1985), but I prefer my own terminology. 
In the population example, an atom might represent a person, in which case those particular data would require 10000 atoms. 
Or an atom might represent a census unit of 500 people, in which case only 20 would be needed.  
Or an atom might represent a tribe whose size was drawn from some distribution such as an exponential: 
this would often be better because of the flexibility involved, and might require even fewer atoms.  
Whatever an atom represents, and however many there are, letting them fall randomly over the domain ensures that we can compute at arbitrary resolution.
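
In symbols (my notation, offered only as one convenient way of writing it: $r$ is position in the domain and $z_k$ is whatever flux-like attribute atom $k$ carries), an atomic object is just
$$
  \theta(r) = \sum_{k=1}^{n} z_k\,\delta(r - x_k),
$$
so the population of any region is found by adding up the atoms whose positions $x_k$ fall inside it, at whatever resolution those positions happen to be stored.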

Atomic priors give a structure-based description.  
The number of atoms is usually allowed to vary, and the computer only needs to deal with the amount of structure that is actually required.
Spatial resolution, in the form of accuracy of location of the atoms, is ``free'', 
because each atom is automatically held to the arithmetical precision of the hardware.  
Algebraically-continuous priors like the gamma distribution, by contrast, give a cell-based description.
To use them, we have to divide the spatial domain into as many cells as we are ever likely to need, and be prepared to compute them all.
Resolution is directly limited by the computer memory and processor time.
This comparison between atomic and continuous priors is somewhat analogous to the comparison between a Monte Carlo representation by samples 
and a full covering of the entire space. 
In each case, the former is practical, while the latter is likely not.  
There is no loss of generality.
If required, we could let the whole object be a single atom having attributes for every cell, which would reproduce a cell-based prior.

To define an atomic prior, we assign distributions for the number $n$ of atoms, and for their attributes.  
Typical priors for $n$ are uniform
$$
  \pi(n) = \hbox{constant}
$$
between a minimum and maximum number (whose equality would fix $n$ at that value), or Poisson
$$
  \pi(n) = e^{-\alpha} \alpha^n / n!
$$
(binomial if a maximum is imposed), or geometric
$$
  \pi(n) = (1-c)c^n\quad \hbox{with $c < 1$}
$$
(which is wider than the Poisson).
As a technical note, only the Poisson assignment is ``infinitely divisible'', meaning that it can be accumulated from arbitrarily small but 
fully independent subdivisions of the domain --- remember that Poisson distributions combine into another Poisson distribution with summed mean.  
If computed at small scale, the other forms need small correlations to make the total number correctly distributed.  
I don't think that that matters at all.  
Indeed, I usually prefer the less-committal geometric assignment.
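
Two small checks behind those remarks, both standard: Poisson counts over disjoint sub-domains add to a Poisson count with summed mean,
$$
  n_1 \sim {\rm Poisson}(\alpha_1),\ \ n_2 \sim {\rm Poisson}(\alpha_2)\ \hbox{independently}
  \quad\Longrightarrow\quad
  n_1 + n_2 \sim {\rm Poisson}(\alpha_1 + \alpha_2),
$$
which is the infinite divisibility; and, at equal mean, the geometric assignment is indeed wider, since
$$
  \hbox{Poisson:}\ \ {\rm var}(n) = \langle n\rangle, \qquad
  \hbox{geometric:}\ \ {\rm var}(n) = \langle n\rangle\,(1 + \langle n\rangle) > \langle n\rangle.
$$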

\bigskip
\noindent{2.4.2. \it Coordinates}
\smallskip

As for the attribute coordinates, their priors will depend on what the attributes are.
Importantly, there are many applications where an atom has very few attributes, 
such as image reconstruction where an atom has only position $(x,y)$, brightness, and possibly shape.
It is then much easier to find acceptable and useful new attributes for an atom than it would be for the object as a whole, 
simply because of the huge reduction in dimensionality.
({\it Divide and conquer}, as the slogan has it.)

A common case is where an attribute measures an additive quantity $z$ such as population number, power of signal, brightness of radiation, or flux of material.
There seems to be no science-wide term for additive quantities: I use the crisp term ``flux'', reflecting my training in astronomy.
For such fluxes, an exponential distribution is often convenient;
$$
  \pi(z) = c\,e^{-cz}, \quad \hbox{$c$ = constant.}
$$
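
The single hyper-parameter has a direct reading (a standard property of the exponential, noted here only to connect with the tribal prior used below, where an atom is a tribe of mean size $q$):
$$
  \langle z\rangle = \int_0^\infty z\,c\,e^{-cz}\,dz = {1\over c},
$$
so $c$ is the reciprocal of the expected flux per atom, $c = 1/q$ in the tribal reading.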
\bigskip
We can now give Bayesian solutions to the population example.
Whatever prior we take, the likelihood function (as expressed in terms of the Irish population $\theta_4 \equiv x$) is
$$
 \Pr(D \mid \theta) = \delta(5500 + x - \theta_1) \,\delta(2500 - x - \theta_2) \,\delta(2000 - x - \theta_3) \quad [{\rm people}]^{-3}\,.
$$
Note the dimensions: each of the three measurements is in units of people, so each delta function carries dimension $[{\rm people}]^{-1}$.
The simplest Bayesian prior is just constant, over whatever range might be deemed adequate, for example
$$
 \Pr(\theta) = 10^{-16}\ [{\rm people}]^{-4} \quad  \hbox{over $0 < \theta_i < 10000$ for each $i=1,2,3,4$.}
$$
The usual Bayesian machinery
$$
 {\rm Prior} \times {\rm Likelihood}\ =\ {\rm Joint}\ \Rightarrow\ {\rm Evidence}\ \Rightarrow\ {\rm Posterior}
$$
yields an evidence $\Pr(D) = 2\times 10^{-13}\ [{\rm people}]^{-3}$,
with flat posterior $\Pr(x \mid D) = 1/2000\ [{\rm people}]^{-1}$ in $0 < x < 2000$ for the Irish population.
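
For the record, here is the short piece of arithmetic behind those numbers (my working, though it follows directly from the flat prior and the delta functions).
The three deltas collapse the four-dimensional integral onto the single free coordinate $x$, whose range is pinned to $0 < x < 2000$ by the positivity of $\theta_4 = x$ and $\theta_3 = 2000 - x$:
$$
  \Pr(D) = \int \Pr(\theta)\,\Pr(D\mid\theta)\,d^4\theta
         = 10^{-16}\int_0^{2000} dx
         = 2\times10^{-13}\ [{\rm people}]^{-3},
$$
and dividing the joint by this evidence leaves the flat posterior quoted above.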

A more sophisticated prior supposes that people are distributed in tribes with mean $\mu$ (say 2) per country,
each tribe having an exponential distribution of people with mean $q$ (say 1000).
Anticipating the algebraic result of section 6.1, this translates to
$$
 \Pr(\theta) = \prod_{i=1}^4 e^{-\mu} \big( \delta(\theta_i) + e^{-\theta_i/q} \sqrt{\mu / q \theta_i} \,I_1(2\sqrt{\mu \theta_i / q})\big)
$$
in terms of the four populations, $I_1$ being the modified Bessel function.
The evidence is now
$$
 \Pr(D) = 6.9\times 10^{-13}\ [{\rm people}]^{-3}, 
$$
which is over three times larger than before, whilst the posterior is composite with
\halign
{\quad  #                                                                                                    \hfill \cr
\hbox{(a) a 19\% chance that the Irish population is $x=0$ (because no tribes happen to inhabit Ireland);}   \cr
\hbox{(b) an 11\% chance that the Welsh population is 0 instead (so that $x = 2000$, the maximum allowed);}  \cr
\hbox{(c) a 70\% chance that $0 < x < 2000$, with a flattish distribution having most probable value near 400.}\cr
}
\noindent Whether or not you favour the more sophisticated prior because of its greater evidence 
depends also on the relative plausibilities you presumably had in mind in the first place for the two models.
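
Quantitatively, that comparison amounts to a Bayes factor multiplying whatever prior odds you had in mind:
$$
  {\Pr({\rm tribal}\mid D)\over\Pr({\rm flat}\mid D)}
  = {\Pr({\rm tribal})\over\Pr({\rm flat})}\times{6.9\times10^{-13}\over 2\times10^{-13}}
  \approx 3.5\,{\Pr({\rm tribal})\over\Pr({\rm flat})}\,,
$$
so the data favour the tribal model by a factor of about $3{1\over2}$, no more and no less.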

\hfill\eject
The population example was solvable algebraically, but usually numerical calculations are needed.
For numerical work, I suggest forcing all coordinates to lie in [0,1], with uniform prior.
In one dimension, this is easy --- just replace the original coordinate $z$ by its cumulative integral $x = \int_0^z \pi(z')\,dz'$.
Whatever the number ($d$) of attributes of an atom, though,
it is always possible to squash the original prior into uniformity $\pi(x) = 1$ over the unit hypercube $[0,1]^d$.
I recommend this discipline: there is no loss of generality and numerical exploration is likely to be easier.
Also, there is no possibility of using an ``improper'' (non-normalised) prior, whether through an infinite range or some other pathology.
Uniformity over the unit hypercube enforces proper behaviour, and is required by BayeSys.
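
As a concrete instance, using the exponential flux prior of section 2.4.2,
$$
  x = \int_0^z c\,e^{-cz'}\,dz' = 1 - e^{-cz},
  \qquad\hbox{so}\qquad
  z = -{1\over c}\log(1 - x),
$$
and a coordinate $x$ drawn uniformly from $[0,1]$ maps back to a flux $z$ with exactly the intended prior.
The same recipe, applied coordinate by coordinate (or through any convenient bijection in higher dimensions), is what squashes a general prior onto the unit hypercube.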

\bigskip
\noindent{2.5. SAMPLING}
\smallskip

Only when an object $\theta$ has very few degrees of freedom can we hope to explore ``all'' $\theta$ to find the posterior ``everywhere''. 
Instead, the modern approach is to characterise the posterior in a Monte Carlo sense, by taking a dozen or more random-sample objects $\tilde\theta$ from it.

It is fortunate but true that these dozen samples will very likely answer any particular question about $\theta$ with adequate precision.
The point is that each sample object yields an independent value $\tilde Q = Q(\tilde\theta)$ for a scalar quantity $Q$ being sought.
Only occasionally will these be badly biassed overall.  
For example, there is only about a 1 in 2000 chance that their average $\langle \tilde Q \rangle$ will be more than one standard deviation from the true mean 
of $Q$ (for Gaussian statistics), and this chance drops sharply as more samples are taken.
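
That 1-in-2000 figure is just a Gaussian tail area.
With $N = 12$ independent samples, the average $\langle\tilde Q\rangle$ has standard error $\sigma/\sqrt{N}$, so being a full standard deviation $\sigma$ out means being $\sqrt{12}\approx3.5$ standard errors out, and
$$
  \Pr\big(\,|\langle\tilde Q\rangle - \langle Q\rangle| > \sigma\,\big)
  = 2\,\Phi(-\sqrt{N}\,)
  \approx 5\times10^{-4}
  \approx {1\over2000},
$$
$\Phi$ being the standard normal cumulative distribution.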

Note that other ways of selecting a single $\theta$ to represent the posterior have serious defects.  
Suppose for instance that the posterior requires $\theta$ to lie on a circle. 
Then the mean (a popular candidate for presentation) will lie inside the circle, which is supposed to be prohibited!
It also moves if $\theta$ is re-parameterised.
The median (another plausible choice) is predicated upon being able to order the values of $\theta$, which really only makes sense if $\theta$ is restricted to one dimension.
The mode, obtained by maximising the posterior (in a method sometimes glorified with the acronym MAPP for maximum {\it a-posteriori} probability), 
also moves if $\theta$ is re-parameterised, because squeezing $\theta$ somewhere increases the posterior there to compensate.  
Hence the mode lacks an invariance we may want.  
Moreover, with a flat prior, MAPP reduces to maximum likelihood (ML), which is non-unique whenever a problem is under-constrained.
Another dastardly counter-example to MAPP estimation has already been found in the population example.
Under the ``tribal'' prior, the MAPP estimate of the Irish population is exactly 0 because of the extreme height of the delta function there, 
even though $0 < x < 2000$ is more than 3 times more likely.
Indeed, it's only because the mythical population data were assigned perfect reliability that the MAPP estimate did not collapse to zero in all four countries.
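
The circle remark can be made concrete in one line (my own parameterisation, chosen purely for illustration): if the posterior is uniform over a circle of radius $R$, $\theta = (R\cos\phi, R\sin\phi)$ with $\phi$ uniform on $[0,2\pi)$, then
$$
  \langle\theta\rangle = {1\over2\pi}\int_0^{2\pi}(R\cos\phi,\ R\sin\phi)\,d\phi = (0,0),
$$
the centre of the circle, which has zero posterior probability of being the true $\theta$.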

In fact, the dozen or more sample objects $\tilde\theta$ seem to be the only faithful representation that is generally accessible.
As a practical matter, representing the posterior by these samples almost forces one to use all one's data in the computation.
Using part of the dataset first, and then using the rest later, 
will be next to impossible if the posterior is compressed to a limited list of samples at the intermediate stage.
\vfill\eject

\centerline{\bigger PART 2. THEORY}
\bigskip
\noindent{$\underline{\hbox{\bf{Section 3. Markov chain Monte Carlo (MCMC)}}}$}
