\bigskip
Today, Markov chain Monte Carlo algorithms are the preferred method for practical inference.
The only generally faithful way of representing the posterior distribution is through sampling, which suggests a Monte Carlo method.
And the posterior can only be reached in practice through a Markov chain of intermediate states, each learning from the previous one(s).
Hence MCMC.
The method is attributed to Metropolis {\it et al.} (1953), as reworked with more generality by Hastings (1970).
Let our object $\theta$ have available states $i = 1,2,\cdots\,$, our prior knowledge being probabilistic: $\Pr(i) = \pi_i$.
An MCMC algorithm for exploring the states is identified by the transitions {\bf T} that it can make, with their probabilities
$$
\hbox{$i \rightarrow j$ with probability $T_{ji} = \Pr( j \mid i )$}.
$$
Technically, a Markov transition is not supposed to involve any memory of earlier states.
However, a single transition may (and sometimes does) already involve a sophisticated history of intermediate trial or actual states,
so the restriction is more apparent than real.
Repeatedly applying the algorithm to an arbitrary initial probability assignment {\bf p}
will yield ${\bf p} \rightarrow {\bf T}{\bf p} \rightarrow {\bf T}^2{\bf p} \rightarrow {\bf T}^3{\bf p} \rightarrow \cdots\,$.
Eventually, {\bf p} converges to the principal eigenvector of {\bf T}, the one with greatest eigenvalue
(which must be 1, because the components of {\bf p} always sum to the same unit total).
Provided the algorithm is aperiodic (so that it doesn't just bounce) and ``irreducible'' (so that every state is eventually accessible from any other),
this principal eigenvector is unique (Smith \& Roberts 1993; Roberts \& Smith 1994).
Elements of randomness soon ensure that an algorithm is aperiodic,
and if the successive {\bf p} would otherwise separate into disjoint un-mixed domains,
we include extra transition routes to join all such domains.
Hence these conditions are usually straightforward to satisfy.
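\smallskip
To see this convergence concretely, here is a minimal C sketch (an illustration only, not part of BayeSys; the 3-state matrix is hypothetical) which applies a column-stochastic {\bf T} repeatedly to an arbitrary starting {\bf p}:
\begintt
/* Power iteration: p -> Tp -> T^2 p -> ... converges to the principal
   eigenvector of T.  T[j][i] = Pr(j|i); each column sums to 1.
   (Hypothetical 3-state example, for illustration only.) */
#include <stdio.h>

#define NS 3

int main(void) {
    double T[NS][NS] = { {0.5, 0.2, 0.3},
                         {0.3, 0.6, 0.3},
                         {0.2, 0.2, 0.4} };
    double p[NS] = {1.0, 0.0, 0.0};     /* arbitrary initial assignment */
    for (int step = 0; step < 50; step++) {
        double q[NS] = {0.0};
        for (int j = 0; j < NS; j++)
            for (int i = 0; i < NS; i++)
                q[j] += T[j][i] * p[i]; /* (Tp)_j = sum_i T_ji p_i */
        for (int j = 0; j < NS; j++) p[j] = q[j];
    }
    printf("stationary p = (%.4f, %.4f, %.4f)\n", p[0], p[1], p[2]);
    return 0;
}
\endtt
Whatever the starting assignment, {\tt p} settles to the same stationary vector, as the eigenvector argument predicts.
\smallskip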
We will start by designing algorithms which target the prior $\pi_i$, for which we need a transition matrix whose principal eigenvector is $\bfpi$.
Eigenvectors are somewhat remote from the individual components of a matrix, and are expensive to compute directly.
However, any matrix with the property of ``{\bf detailed balance}''
$$
T_{ji} / T_{ij} = \pi_j / \pi_i \quad \hbox{for all $i$, $j$}
$$
will suffice.
Thinking physically, we see that if we start correctly with $\pi_i$ objects in state $i$ and $\pi_j$ in state~$j$,
then on applying {\bf T} the same number ($T_{ji}\pi_i$) will pass from $i$ to $j$ as pass from $j$ to $i$ ($T_{ij}\pi_j$), leaving the correct assignment intact.
$$
\hbox{
\vbox{\offinterlineskip
\hrule
\halign{&\vrule#&\strut\ \ \hfil#\hfil\ &\vrule#\cr
height4pt & \omit && \omit \cr
& $i$ & \cr
height4pt & \omit && \omit \cr
}
\hrule
}
\vbox{ \hbox{${\buildrel T_{ji}\pi_i \over {\hbox to 50pt{\rightarrowfill}}} $}
\hbox{${\buildrel {\displaystyle{\hbox to 50pt{\leftarrowfill}}} \over {\scriptstyle{T_{ij}\pi_j}}}$} }
\vbox{\offinterlineskip
\hrule
\halign{&\vrule#&\strut\ \ \hfil#\hfil\ &\vrule#\cr
height4pt & \omit && \omit \cr
& $j$ & \cr
height4pt & \omit && \omit \cr
}
\hrule
}
}
$$
The corresponding algebraic proof is $ ({\bf T}\bfpi)_j = \sum_i T_{ji} \pi_i = \sum_i T_{ij} \pi_j = (\sum_i T_{ij})\, \pi_j = \pi_j$, the last step using the column normalisation $\sum_i T_{ij} = 1$.
Although not strictly necessary, detailed balance is the key to constructing useful transition matrices.
We start with algorithms that explore the prior faithfully, leaving the extra complication of likelihood factors until later.
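\smallskip
As a concrete check, the following C sketch (hypothetical numbers, not the BayeSys code) builds a transition matrix by the Metropolis recipe of proposing a neighbouring state and accepting with probability $\min(1,\pi_j/\pi_i)$, which satisfies detailed balance by construction, and then verifies numerically that ${\bf T}\bfpi = \bfpi$:
\begintt
/* Detailed balance by the Metropolis recipe (hypothetical example):
   propose i -> i+1 or i -> i-1 with probability 1/2 each (nothing beyond
   the ends), accept with min(1, pi_j/pi_i); rejections stay put.
   Then T_ji/T_ij = pi_j/pi_i for neighbours, so T pi = pi. */
#include <stdio.h>

#define NS 4

int main(void) {
    double pi[NS] = {0.1, 0.2, 0.3, 0.4};   /* hypothetical target */
    double T[NS][NS] = {{0.0}};
    for (int i = 0; i < NS; i++) {
        double out = 0.0;
        for (int d = -1; d <= 1; d += 2) {
            int j = i + d;
            if (j < 0 || j >= NS) continue;
            double a = pi[j] < pi[i] ? pi[j] / pi[i] : 1.0;
            T[j][i] = 0.5 * a;
            out += T[j][i];
        }
        T[i][i] = 1.0 - out;     /* "no change" soaks up the rejections */
    }
    for (int j = 0; j < NS; j++) {
        double s = 0.0;
        for (int i = 0; i < NS; i++) s += T[j][i] * pi[i];
        printf("pi_%d = %.4f   (T pi)_%d = %.4f\n", j, pi[j], j, s);
    }
    return 0;
}
\endtt
\smallskip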
\bigskip
\noindent{3.1. THE NUMBER OF ATOMS}
\smallskip
Typical priors $\pi(n)$ for the number $n$ of atoms are
$$
\pi(n) = \left\{\ \matrix{
1/N\,, \hfill & \hbox{for $0 \le n < N$} \hfill & \hbox{(uniform);} \hfill \cr
e^{-\alpha} \alpha^n / n!\,, \hfill & \hbox{for $n \ge 0$} \hfill & \hbox{(Poisson);} \hfill \cr
(1-c)c^n\,, \hfill & \hbox{for $n \ge 0$} \hfill & \hbox{(geometric).} \hfill \cr
}\right.
$$
and we seek an algorithm that faithfully targets any such prior.
Actually, the BayeSys program imposes a minimum value for $n$, which we ignore here for clarity.
The natural unit of change is just one atom at a time, so the only non-zero transitions $T_{ji}$ will be $j=i+1$ (``birth'' of an atom),
$j=i-1$ (``death'' of an atom), and $j=i$ (no change).
Multiple births or deaths, as implied by the general treatment of jump-diffusion by Grenander \& Miller (1994) or the reversible-jump dynamics of Green (1995),
are likely to be accepted less often, so I restrict attention to a single birth or death.
Detailed balance fixes the ratios between birth and death rates but leaves their overall magnitudes free.
I choose to let each atom decay with unit mean lifetime, so that the death rate is set as
$$
T_{n-1,n} = n\,dt
$$
in infinitesimal interval $dt$ of artificial time.
The rationale for this is that most of the atoms will have been changed after unit time.
Regular ${\cal O}(1)$ timesteps thus give a natural sampling period $\tau$ for our multi-atom object.
Sampling more frequently would make successive objects tediously similar, whereas sampling less often might waste useful intermediate samples.
Detailed balance implies a corresponding birth rate
$$
T_{n+1,n} = \beta_n\,dt\,, \qquad \beta_n = (n+1)\, \pi_{n+1} / \pi_n.
$$
For the three typical priors, these birth rates are
$$
\beta_n = \left\{\ \matrix{
n+1\,, \hfill & \hbox{for $0 \le n < N-1$, else 0} \hfill & \hbox{(uniform);} \hfill \cr
\alpha\,, \hfill & \hbox{for $n \ge 0$} \hfill & \hbox{(Poisson);} \hfill \cr
(n+1)\,c\,, \hfill & \hbox{for $n \ge 0$} \hfill & \hbox{(geometric).} \hfill \cr
}\right.
$$
Starting with (say) $n$ atoms, the time to the next event is exponentially distributed
$$
\Pr(\Delta t) = (\beta_n + n)\,e^{-(\beta_n + n)\,\Delta t}
$$
and when that event occurs, it is either birth or death in ratio $\beta_n\!:\!n$.
Thus, in the following example, samples are taken at uniform times $\ldots,5\tau, 6\tau, 7\tau, 8\tau, \ldots$ when $n$ happens to be $\ldots,3,3,2,2,\ldots\,$.
\centerline{
\vbox{
\vskip 6pt
\hbox{ \ \ \vbox{ \hbox{ } \hbox{ $n=3$ } }\ \ \ \ \
\vbox{ \hbox{Birth} \hbox{\ \ \ $\big\uparrow$} }
\vbox{ \hbox{ } \hbox{ $n=4$ } }
\vbox{ \hbox{Death} \hbox{\ \ \ $\big\downarrow$} }
\ \ \ \ \vbox{ \hbox{ } \hbox{ $n=3$ } }\ \ \ \
\vbox{ \hbox{Death} \hbox{\ \ \ $\big\downarrow$} }
\ \ \ \ \ \ \ \ \ \vbox{ \hbox{ } \hbox{ $n=2$ } }\ \ \ \ \ \ \ \ \
\vbox{ \hbox{Birth} \hbox{\ \ \ $\big\uparrow$} }
\ \ \vbox{ \hbox{ } \hbox{ $n=3$ } }
}
\hrule
\hbox{ \hskip 50pt
\vbox{ \hbox{$\,|$}\hbox{$5\tau$} }
\hskip 80pt
\vbox{ \hbox{$\,|$}\hbox{$6\tau$} }
\hskip 80pt
\vbox{ \hbox{$\,|$}\hbox{$7\tau$} }
\hskip 80pt
\vbox{ \hbox{$\,|$}\hbox{$8\tau\qquad\rightarrow$\ time} }
}
\vskip 6pt
}
}
\noindent Phillips \& Smith (1996) review dimension-changing methods like this at a more abstract level.
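\smallskip
The birth/death dynamics are easy to simulate. The C sketch below (an illustration, not the BayeSys code) runs the process for the Poisson prior, for which $\beta_n = \alpha$: it draws exponential waiting times at total rate $\beta_n + n$, chooses birth or death in ratio $\beta_n\!:\!n$, and samples $n$ at unit intervals $\tau = 1$. The histogram of sampled $n$ should reproduce the Poisson prior.
\begintt
/* Birth/death chain for the Poisson prior pi(n) = exp(-alpha) alpha^n / n!
   (illustration only): birth rate beta_n = alpha, death rate n. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define ALPHA  3.0
#define NSAMP  100000
#define NMAX   20

static double uniform(void)    /* uniform in (0,1) */
{ return (rand() + 0.5) / (RAND_MAX + 1.0); }

int main(void) {
    int n = 0, count[NMAX + 1] = {0};
    double t = 0.0, tnext = 1.0;            /* next sampling time, tau = 1 */
    for (int taken = 0; taken < NSAMP; ) {
        double rate = ALPHA + n;            /* total event rate beta_n + n */
        double dt = -log(uniform()) / rate; /* exponential waiting time */
        while (taken < NSAMP && t + dt > tnext) {
            if (n <= NMAX) count[n]++;      /* sample n at times 1,2,3,... */
            taken++;
            tnext += 1.0;
        }
        t += dt;
        if (uniform() < ALPHA / rate) n++;  /* birth, in ratio beta_n : n  */
        else n--;                           /* death (never fires at n=0) */
    }
    for (int k = 0; k <= 8; k++)
        printf("n=%d  observed %.4f  Poisson %.4f\n", k,
               (double)count[k] / NSAMP,
               exp(-ALPHA + k * log(ALPHA) - lgamma(k + 1.0)));
    return 0;
}
\endtt
\smallskip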
\bigskip
\noindent{3.2. COORDINATES}
\smallskip
As recommended above, we require the prior to be uniform $\pi(x)=1$ over the unit hypercube $[0,1]^d$ for coordinates $x$.
Within the computer, of course, coordinates will come from a finite set determined by the finite precision of the hardware.
This observation suggests a ``modern digital'' style of treatment.
My convention, taking the computer word length to be $W$ bits (usually 32), is to let the available values be odd multiples of $2^{-(W+1)}$,
labelled by {\tt unsigned} ({\it i.e.} non-negative) integers $k$ from 0 to $2^W - 1$.
$$
x = 2^{-W}(k + \hbox{$1\over2$})
$$
The rules of integer arithmetic (modulo $2^W$) ensure that $x$ is wraparound continuous, which is never harmful and sometimes appropriate.
This makes all states {\it a priori} equivalent geometrically as well as by prior measure, independently of any concern over the vagaries of floating-point representations.
Helpfully, the states avoid the boundaries $x=0$ and $x=1$ and the centre $x={1\over2}$, which might be special in an application.
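\smallskip
In C, this convention is a one-liner, with the wraparound provided automatically by {\tt unsigned} (modulo $2^W$) arithmetic, as in this minimal sketch:
\begintt
/* Digital coordinate convention x = 2^{-W} (k + 1/2), here with W = 32.
   Unsigned arithmetic is defined modulo 2^W, giving wraparound for free. */
#include <stdio.h>
#include <inttypes.h>
#include <math.h>

#define W 32

int main(void) {
    uint32_t k = 0;               /* first label, just inside x = 0 */
    printf("k = %10" PRIu32 "  ->  x = %.12f\n", k, ldexp(k + 0.5, -W));
    k -= 1;                       /* stepping below 0 wraps to 2^W - 1 */
    printf("k = %10" PRIu32 "  ->  x = %.12f\n", k, ldexp(k + 0.5, -W));
    return 0;
}
\endtt
\smallskip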
When $x$ has $d$ dimensions (more than one), we can fill the unit hypercube with a space-filling Hilbert curve (Hilbert 1891, Sagan 1994),
thus reducing the topology to a one-dimensional coordinate along the curve.
The hypercube contains $(2^W)^d$ digital points, so each point along the curve is labelled by an integer with $Wd$ bits, or $d$ words.
The diagrams below show Hilbert curves in two dimensions.
A Hilbert curve can be constructed recursively from its generator (Butz 1969, 1971).
At the top level the generator is a path around the $2^d$ corners of a hypercube, using segments directed parallel to the axes.
In computer parlance, this path is a ``Gray code'' for $d$-bit integers.
At the next (second) level, smaller copies of each generator are placed at each first-level point, oriented to keep the ends of successive generators adjacent.
There are now $2^{2d}$ points along the path.
At the third level, yet smaller copies of each generator are placed at each second-level point, again oriented to keep the ends of successive generators adjacent.
By now, the path has $2^{3d}$ points.
$$
\vbox{\lineskip = -0.4pt \baselineskip = 0pt
\hbox{\Za\Oa}
\hbox{\Ia\Ia}
}
\hskip -56pt
\vbox{\lineskip = -0.4pt \baselineskip = 0pt
\hbox{\Zb\Ob\Zb\Ob}
\hbox{\Ib\Lb\Ib\Ib}
\hbox{\Lb\Ob\Zb\Ib}
\hbox{\Zb\Ib\Lb\Ob}
}
$$
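\smallskip
For $d = 2$ the recursion can be coded compactly. The following C sketch (a textbook version of the construction, not the general-$d$ Butz code that BayeSys uses) converts a label $t$ along the curve into grid coordinates $(x,y)$, rotating and reflecting quadrants just as the generators above are oriented; at the top level the four corners are visited in Gray-code order.
\begintt
#include <stdio.h>

/* t in [0, 4^levels) -> (x, y), each in [0, 2^levels), along the
   2-D Hilbert curve (classic rotate-and-reflect formulation). */
static void hilbert_d2xy(unsigned levels, unsigned long t,
                         unsigned long *x, unsigned long *y) {
    *x = *y = 0;
    for (unsigned long s = 1; s < (1ul << levels); s <<= 1) {
        unsigned long rx = 1 & (t >> 1);
        unsigned long ry = 1 & (t ^ rx);
        if (ry == 0) {                 /* rotate/reflect this quadrant */
            if (rx == 1) { *x = s - 1 - *x; *y = s - 1 - *y; }
            unsigned long tmp = *x; *x = *y; *y = tmp;
        }
        *x += s * rx;                  /* step into the chosen quadrant */
        *y += s * ry;
        t >>= 2;
    }
}

int main(void) {
    unsigned long x, y;
    for (unsigned long t = 0; t < 16; t++) {   /* second-level curve, 4x4 */
        hilbert_d2xy(2, t, &x, &y);
        printf("t=%2lu -> (%lu,%lu)\n", t, x, y);
    }
    return 0;
}
\endtt
Successive outputs are always adjacent grid points, so a small step along the one-dimensional label is a small step in the hypercube.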