I offer it as an alternative to currently popular strategies, 
hoping that it will have at least as much power in a wide variety of probabilistic and optimisation applications.
Much else in BayeSys is new too.
There is ``selective annealing'',
a form of simulated annealing designed to cope properly with difficult problems that may have multiple maxima.
There is a binary variant of slice sampling used for exploration along the Hilbert curve.
There are linked lists for locating neighbouring objects, which can catalyse each other's evolution.
The ``Massive Inference'' extension (``MassInf'' for short) copes semi-analytically with coordinates representing additive quantities such as intensity or flux, 
which eases the exploration task.
These and other coding tricks are hidden within the program.
You, the user, see only a simple interface.

Section 2 of this manual gives an overview of inference as appropriate to BayeSys and MassInf, and Sections 3 to 7 give the theory underlying the program.
I have tried to present this material straightforwardly.
After all, it involves nothing more than some easy algebra with occasional use of basic calculus.
I have not aimed to present maximal generality, in notation and terminology that are then necessarily opaque: 
that just makes the subject look difficult.
Moreover, generality is often the enemy of efficiency as well as of clarity.
Instead, I present the ideas in simple terms and conversational style, referring to the literature where connections exist.
Generalisation is usually obvious anyway, once an idea is understood.
All this theoretical material is for the reader who wants to know how it all works.  Otherwise, it's optional.
The later Sections 8 to 15 describe the requirements for writing an application, and Section 16 lists the program files that should be provided with BayeSys.
These include ``toy'' example programs, which can serve as templates for your own application's code.
The program language is ANSI C.

\vfill\eject
\noindent{$\underline{\hbox{\bf{Section 2. Overview of Inference}}}$}
\bigskip

For us, inference is the art of recovering an object $\theta$ from data $D$.
Usually, the object will be something quite complicated, perhaps a spectrum or an image, 
having at least several and maybe thousands or millions of degrees of freedom.  The data, acquired through some experiment $R$ so that
$$
D \approx R(\theta)
$$
will nearly always be incomplete (fewer components than $\theta$ has) or noisy (inexact), or somehow inadequate to fix $\theta$ unambiguously.
Methods of inference fall into three increasingly sophisticated classes, ``inversion'', ``regularisation'' and ``probabilistic''.

I can illustrate these with the populations of the four countries (England, Scotland, Wales and Ireland) that comprise the British Isles.
I have received data (mythical, of course) that, of the entire population of 10000,
8000 live in England and Scotland, and 7500 live in England and Wales.
From these three numbers, I have to estimate the populations $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$ of the four countries.

\centerline{\vbox{\vskip 4pt
\offinterlineskip
      \halign{  & \vrule# &
                   \strut\quad\hfil#\quad &
                                             \vrule# &
                                               \strut\quad\hfil#\quad &
                                                                         \vrule# &
                                                                       \strut\quad\hfil# \cr
                        \multispan5\hrulefill                                            \cr
                height2pt &    \omit           &     &   \omit             &     & \omit \cr
                          &Ireland = $\theta_4$&     &Scotland = $\theta_2$&     &  2500 \cr
                height2pt &    \omit           &     &   \omit             &     & \omit \cr
                        \multispan5\hrulefill                                            \cr
                height2pt &    \omit           &     &   \omit             &     & \omit \cr
                          &Wales   = $\theta_3$&     &England  = $\theta_1$&     &  7500 \cr
                height2pt &    \omit           &     &   \omit             &     & \omit \cr
                        \multispan5\hrulefill                                            \cr
                   \omit  &       2000         &\omit&     8000            &\omit& 10000 \cr
          }
       } }
In algebraic terms,
$$
  D = \left[ \matrix{ 2500 \cr
                      7500 \cr
                      2000 \cr
                      8000 \cr } \right]
    = \left[ \matrix{ 0 & 1 & 0 & 1 \cr
                      1 & 0 & 1 & 0 \cr
                      0 & 0 & 1 & 1 \cr
                      1 & 1 & 0 & 0 \cr } \right] 
                                   \left[ \matrix{ \theta_1 \cr
                                                   \theta_2 \cr
                                                   \theta_3 \cr
                                                   \theta_4 \cr } \right]
    = R \theta.
$$

\bigskip
\noindent{2.1. INVERSION}
\smallskip

Inversion in its pure form inverts $R$ to obtain the estimate $ \hat\theta = R^{-1} D $, 
which might work except that $R$ is here (and in general) singular so that there is no inverse.  
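In the example the singularity is easy to exhibit: the first two rows of $R$ sum to the same vector as the last two, so $R$ has rank 3 and cannot be inverted,
$$
(0\ 1\ 0\ 1) + (1\ 0\ 1\ 0) \;=\; (0\ 0\ 1\ 1) + (1\ 1\ 0\ 0) \;=\; (1\ 1\ 1\ 1),
\qquad\hbox{so}\qquad \det R = 0.
$$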
A common fixup is to use a pseudo-inverse instead, so that $ \hat\theta = X D $ where $X$ approximates the inverse of $R$.  
But the data we would have obtained if $\hat\theta$ were the truth are $R\hat\theta = RXD$ which, insofar as $X$ is {\bf not} the inverse of $R$, differ from the actual data,
showing that $\hat\theta$ cannot be correct.
So much for inversion.  
This method should only be used when the results are essentially unambiguous, or don't matter much.

\bigskip
\noindent{2.2. REGULARISATION}
\smallskip

In my illustrative problem, simple logic shows that the data leave only one degree of freedom. 
Let this be the Irish population $x$.  
Then
$$
  \theta_4 = x,\quad \theta_3 = 2000-x,
  \quad \theta_2 = 2500-x,\quad \theta_1 = 5500+x,
$$
and this satisfies the data for any $x$.  
Regularisation consists of finding the ``best'' result from among all those that agree with the data, 
as defined by maximising some regularisation function $\Phi(\theta)$ representing quality. 
The general method is usually attributed to Tikhonov \& Arsenin (1977). 
The commonest such method is least squares, which here fixes $x$ by minimising 
$$
  \theta_1^2 + \theta_2^2 + \theta_3^2 + \theta_4^2 = -\Phi(\theta)
$$
subject to agreement with the data.  
Unfortunately the selected value of $x$ is $-250$, indicating that the population of Ireland was negative.
Oops!  Least squares is simple, but not necessarily best.
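For the record, the offending value comes straight from the single remaining degree of freedom: write the sum of squares in terms of $x$ and set its derivative to zero,
$$
{d\over dx}\Bigl[(5500+x)^2 + (2500-x)^2 + (2000-x)^2 + x^2\Bigr] = 8x + 2000 = 0,
\qquad\hbox{so}\qquad x = -250.
$$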

For the sort of positive distribution being considered, maximum entropy (maximise $ \Phi = -\sum \theta_i \log\theta_i $) is useful.  
Negative populations are prohibited by the logarithm, and the result

\centerline{\vbox{\vskip 4pt
\offinterlineskip
      \halign{ & \vrule#  &
             \strut\quad\hfil#\hfil\quad &
                                       \vrule# &
                                   \strut\quad\hfil#\hfil\quad &
                                                             \vrule#  \cr
                        \noalign{\hrule}                              \cr
                height2pt &    \omit     &     &   \omit       &      \cr
                          &Ireland =  500&     &Scotland = 2000&      \cr
                height2pt &    \omit     &     &   \omit       &      \cr
                        \noalign{\hrule}                              \cr
                height2pt &    \omit     &     &   \omit       &      \cr
                          &Wales   = 1500&     &England  = 6000&      \cr
                height2pt &    \omit     &     &   \omit       &      \cr
                        \noalign{\hrule}                              \cr
          }
       } }
\noindent
has the 1:3 north:south ratio nicely independent of longitude, and the 1:4 west:east ratio nicely independent of latitude.  
In fact, entropy is the only function yielding this general symmetry, as was proved by Shore \& Johnson (1980,1983), also with less formality by Gull \& Skilling (1984).
This may well account for the high quality of many maximum-entropy reconstructions.
If there is a general regularisation method at all, it has to apply in special cases, and maximum entropy is the only sensible method in this particular special case.
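For the record, the numbers themselves follow from the same one-parameter family as before: stationarity of the entropy requires the cross-ratio of the four cells to be unity, which fixes $x$,
$$
{d\Phi\over dx} = \log{\theta_2\,\theta_3 \over \theta_1\,\theta_4} = 0
\quad\Longrightarrow\quad
(5500+x)\,x = (2500-x)(2000-x)
\quad\Longrightarrow\quad
x = 500.
$$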
But questions remain.  How should we account for noise in the data?
What about nuisance parameters?  How uncertain is the ``best'' result?

\bigskip
\noindent{2.3. PROBABILITIES}
\smallskip

In 1946, R.T. Cox proved that the only way of doing inference consistently is through probability calculus (Cox, 1946) and seeking the posterior probability
distribution $\Pr(\theta\mid D)$ of result $\theta$, given data $D$.
We start by assigning a prior distribution representing what we think $\theta$ might be, and modulate this with how likely the data measurements would then be.
This gives the joint distribution of $\theta$ and $D$, which can be alternatively expanded as the posterior we want, 
scaled by a coefficient called the ``prior predictive'' --- or ``evidence'' for short.
$$
\matrix {
      \Pr(\theta,D) & = &\Pr(\theta) \times \Pr(D\mid\theta)   &    =      &\Pr(\theta\mid D) \times \Pr(D)         &\qquad\qquad\parallel I \cr
       \hbox{Joint} &   &\hbox{Prior} \times \hbox{Likelihood} &           &\hbox{Posterior} \times \hbox{Evidence} &                        \cr
                    &   &\hbox{assumptions \& measurements}    &\Rightarrow&                 \hbox{inference}       &                        \cr
        }
$$
That's the famous, and stunningly elementary, {\bf Bayes' Theorem}.
In writing it, I have explicitly included the dependence on whatever contextual information $I$ is to hand, through the ``$\parallel I$'' qualification.
All probabilities in the equation are to be understood as conditioned on $I$, though in practice one often omits the symbol when the context is unambiguous.
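Rearranged, with the conditioning on $I$ left implicit, the theorem delivers the posterior as prior times likelihood, normalised by the evidence:
$$
\Pr(\theta\mid D) \;=\; {\Pr(\theta)\,\Pr(D\mid\theta) \over \Pr(D)}.
$$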

Be careful, though.  
Wonderfully confusing paradoxes can follow if you inadvertently alter the context in the middle of a probability derivation, and don't notice.
It's just the same with other general terms such as ``average''.
In the ordinary arithmetical case, the average of 10 and 40 is $(10+40)/2 = 25$,
but if the situation is logarithmic, the average becomes $\sqrt{10 \times 40} = 20$ instead.
Change the context in the middle of your calculation and you may ``prove'' \hbox{$25=20$}.
However, we would not consider that such a discrepancy implied a contradiction within the theory of averages: it's merely a mistake.
Yet exactly such a claim of contradiction was actually made about probability theory (Dawid, Stone and Zidek, 1973).
Jaynes (2003) details the shameful saga.
The danger with a probabilistic calculation is that the level of abstraction is likely to be greater,
so that the mistake can be less obvious and the conclusion may be seductively misleading.
To avoid such error, just take care to track the context.

Observe that the evidence amounts to the likelihood $\Pr(D\mid I)$ for the contextual information.
We would use exactly the same Bayes' Theorem to compare different contexts $I$ (under more general background information~$J$) 
as we do here to compare different parameter values $\theta$: it is only the identifications of the symbols that would change.
If we don't have any alternatives for $I$, the evidence is useless to us, 
though in principle we should still calculate it and publish it in case somebody else can think of an alternative that might explain the data better
(or worse, if their alternative's evidence turns out to be smaller).
Always calculate the evidence (being a density over data space, its units are inverse data) as an integral part of Bayesian inference.  
It's a basic part of honest presentation, but few people do it, which is a pity.
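Written out with the conditioning explicit, the evidence is just the prior-weighted average of the likelihood over all possible $\theta$ (a sum, if $\theta$ is discrete), and comparing rival contexts under wider background information $J$ uses it as their likelihood one level up:
$$
\Pr(D\mid I) = \int \Pr(\theta\mid I)\,\Pr(D\mid\theta,I)\,d\theta,
\qquad
\Pr(I\mid D,J) \propto \Pr(I\mid J)\,\Pr(D\mid I,J).
$$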

\bigskip
