<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Gaussian Process Classification</title>
<link type="text/css" rel="stylesheet" href="style.css">
</head>
<body>

<h2>Gaussian Process Classification</h2>

Exact inference in Gaussian process models with likelihood functions tailored
to classification isn't tractable. Here we consider two procedures for
approximate inference for binary classification:

<ol>
<li> Laplace's approximation is based on an expansion around the mode of the
posterior:
<ol>
<li type="a"> a <a href="#laplace">description</a> of the implementation of the
algorithm</li>
<li type="a"> a <a href="#laplace-ex-toy">simple example</a> applying the
algorithm to a 2-dimensional classification problem</li>
<li type="a"> a somewhat <a href="#laplace-ex-usps">more involved</a> example,
classifying images of hand-written digits.</li>
</ol></li>
<li> The Expectation Propagation (EP) algorithm is based on matching moments
approximations to the marginals of the posterior:
<ol>
<li type="a"> a <a href="#ep">description</a> of the implementation of the
algorithm</li>
<li type="a"> a <a href="#ep-ex-toy">simple example</a> applying the algorithm
to a 2-dimensional classification problem</li>
<li type="a"> a somewhat <a href="#ep-ex-usps">more involved</a> example,
classifying images of hand-written digits.</li>
</ol></li>
</ol>

The code, demonstration scripts and documentation are all contained in the <a
href="http://www.gaussianprocess.org/gpml/code/gpml-matlab.tar.gz">tar</a> or
<a href="http://www.gaussianprocess.org/gpml/code/gpml-matlab.zip">zip</a>
archive file.

<h3 id="laplace">Laplace's Approximation</h3>

<p>It is straightforward to implement Laplace's method for binary Gaussian
process classification in matlab. Here we discuss an implementation which
follows closely Algorithm 3.1 (p. 46) for computing Laplace's approximation
to the posterior, Algorithm 3.2 (p. 47) for making probabilistic predictions
for test cases and Algorithm 5.1 (p. 126) for computing partial derivatives of
the log marginal likelihood w.r.t. the hyperparameters (the parameters of the
covariance function). Be aware that the <em>negative</em> of the log marginal
likelihood is used.</p>

<p>The implementation given in <a
href="../gpml/binaryLaplaceGP.m">binaryLaplaceGP.m</a> can conveniently
be used together with <a href="../gpml/minimize.m">minimize.m</a>. The
program can do one of two things:</p>
<ol>
<li> compute the negative log marginal likelihood and its partial derivatives
wrt. the hyperparameters, usage
<pre>[nlml dnlml] = binaryLaplaceGP(logtheta, covfunc, lik, x, y)</pre>
which is used when "training" the hyperparameters, or</li>
<li> compute the (marginal) predictive distribution of test inputs, usage
<pre>[p mu s2 nlml] = binaryLaplaceGP(logtheta, covfunc, lik, x, y, xstar)</pre>
</li>
</ol>
Selection between the two modes is indicated by the presence (or absence) of
test cases, <tt>xstar</tt>, as in the short usage sketch below.
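<p>For concreteness, a minimal usage sketch (assuming training inputs
<tt>x</tt>, binary targets <tt>y</tt> and test inputs <tt>xstar</tt> as
described in the table that follows, together with the <tt>covSEiso</tt>
covariance function and the <tt>cumGauss</tt> likelihood discussed below):</p>

<pre>
  logtheta = [0; 0];        % covSEiso has two (log) hyperparameters, see below

  % training mode: no test inputs; returns the negative log marginal
  % likelihood and its partial derivatives wrt. the log hyperparameters
  [nlml dnlml] = binaryLaplaceGP(logtheta, 'covSEiso', 'cumGauss', x, y);

  % prediction mode: test inputs are supplied; returns predictive
  % probabilities and latent means and variances for the test cases
  [p mu s2] = binaryLaplaceGP(logtheta, 'covSEiso', 'cumGauss', x, y, xstar);
</pre>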
<p>The arguments to the <a
href="../gpml/binaryLaplaceGP.m">binaryLaplaceGP.m</a> function are
given in the table below, where <tt>n</tt> is the number of training cases,
<tt>D</tt> is the dimension of the input space and <tt>nn</tt> is the number of
test cases:</p>

<table border=0 cols=2 width="100%">
<tr><td width="15%"><b>inputs</b></td><td></td></tr>
<tr><td><tt>logtheta</tt></td><td>a (column) vector containing the logarithm of
the hyperparameters</td></tr>
<tr><td><tt>covfunc</tt></td><td>the covariance function, see <a
href="../gpml/covFunctions.m">covFunctions.m</a></td></tr>
<tr><td><tt>lik</tt></td><td>the likelihood function; the built-in functions
are <tt>logistic</tt> and <tt>cumGauss</tt></td></tr>
<tr><td><tt>x</tt></td><td>a <tt>n</tt> by <tt>D</tt> matrix of training
inputs</td></tr>
<tr><td><tt>y</tt></td><td>a (column) vector (of length <tt>n</tt>) of training
set <tt>+1/-1</tt> binary targets</td></tr>
<tr><td><tt>xstar</tt></td><td>(optional) a <tt>nn</tt> by <tt>D</tt> matrix of
test inputs</td></tr>
<tr><td> </td><td></td></tr>
<tr><td><b>outputs</b></td><td></td></tr>
<tr><td><tt>nlml</tt></td><td>the negative log marginal likelihood</td></tr>
<tr><td><tt>dnlml</tt></td><td>(column) vector with the partial derivatives of
the negative log marginal likelihood wrt. the logarithm of the
hyperparameters</td></tr>
<tr><td><tt>mu</tt></td><td>(column) vector (of length <tt>nn</tt>) of
predictive latent means</td></tr>
<tr><td><tt>s2</tt></td><td>(column) vector (of length <tt>nn</tt>) of
predictive latent variances</td></tr>
<tr><td><tt>p</tt></td><td>(column) vector (of length <tt>nn</tt>) of
predictive probabilities</td></tr>
</table><br>

<p>The number of hyperparameters (and thus the length of the <tt>logtheta</tt>
vector) depends on the choice of covariance function. Below we will use the
squared exponential covariance function with isotropic distance measure,
implemented in <a href="../gpml/covSEiso.m">covSEiso.m</a>; this covariance
function has <tt>2</tt> parameters. Properties of various covariance functions
are discussed in section 4.2. For the details of the implementation of the
above covariance functions, see <a
href="../gpml/covFunctions.m">covFunctions.m</a>; the two likelihood functions
<tt>logistic</tt> and <tt>cumGauss</tt> are implemented at the end of the <a
href="../gpml/binaryLaplaceGP.m">binaryLaplaceGP.m</a> file.</p>

<p>In either mode (training or prediction) the program first uses Newton's
algorithm to find the maximum of the posterior over latent variables. In the
first ever call of the function, the initial guess for the latent variables is
the zero vector. If the function is called multiple times, it stores the
optimum from the previous call in a persistent variable and attempts to use
this value as a starting guess for Newton's algorithm. This is useful when
training, e.g. using <a href="../gpml/minimize.m">minimize.m</a>, since the
hyperparameters for consecutive calls will generally be similar, and one would
expect the maximum over latent variables found for the previous setting of the
hyperparameters to be a reasonable starting guess. (If it turns out that this
strategy leads to a very bad marginal likelihood value, the function reverts to
starting at zero.)</p>

<p>The Newton iterations follow Algorithm 3.1, section 3.4, page 46:</p>

<center><img src="alg31.gif"></center><br>

<p>The iterations are terminated when the improvement in the log marginal
likelihood drops below a small tolerance. During the iterations it is checked
that the log marginal likelihood never decreases; if it does decrease,
bisection is repeatedly applied (up to a maximum of 10 times) until it
increases again.</p>
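<p>As a rough illustration of these steps, here is a minimal matlab sketch of
the Newton iterations of Algorithm 3.1, assuming the cumulative Gaussian
(probit) likelihood and a precomputed training covariance matrix <tt>K</tt>.
The function name is made up for this sketch, and the warm start and bisection
safeguards described above are omitted; it is not the actual
<tt>binaryLaplaceGP.m</tt> code:</p>

<pre>
  function [fhat, alml] = laplace_mode_sketch(K, y)
  % sketch of Algorithm 3.1: Newton iterations for the posterior mode,
  % assuming the cumulative Gaussian likelihood; K is the n by n covariance
  % matrix of the training inputs, y the vector of +1/-1 targets
  n   = size(K,1);
  Phi = @(z) 0.5*erfc(-z/sqrt(2));          % standard normal cdf
  phi = @(z) exp(-z.^2/2)/sqrt(2*pi);       % standard normal pdf
  f   = zeros(n,1);  old = -inf;  tol = 1e-6;   % start Newton at f = 0
  for it = 1:100
    np  = phi(f)./Phi(y.*f);
    dlp = y.*np;                            % d log p(y|f) / df
    W   = np.^2 + y.*f.*np;                 % W = -d^2 log p(y|f) / df^2
    sW  = sqrt(W);
    L   = chol(eye(n)+sW*sW'.*K);           % upper triangular, L'*L = I+sW*K*sW
    b   = W.*f + dlp;
    a   = b - sW.*(L\(L'\(sW.*(K*b))));
    f   = K*a;                              % Newton update of the latent values
    alml = -0.5*a'*f + sum(log(Phi(y.*f))) - sum(log(diag(L)));
    if alml-old < tol, break, end           % improvement below a small tolerance
    old = alml;
  end
  fhat = f;                                 % posterior mode; note that
  end                                       % binaryLaplaceGP reports -alml, the
                                            % NEGATIVE log marginal likelihood
</pre>

<p>The actual implementation additionally reuses the mode from the previous
call as the starting point and applies the bisection safeguard described
above.</p>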
<p>What happens next depends on whether we are in the training or prediction
mode, as indicated by the absence or presence of test inputs <tt>xstar</tt>.
If test cases are present, then predictions are computed following Algorithm
3.2, section 3.4, page 47 (a small sketch of this step is given below):</p>

<center><img src="alg32.gif"></center><br>

<p>Alternatively, if we are in the training mode, we proceed to compute the
partial derivatives of the log marginal likelihood wrt. the hyperparameters,
using Algorithm 5.1, section 5.5.1, page 126:</p>

<center><img src="alg51.gif"></center><br>
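<p>Before turning to the examples, here is a corresponding minimal matlab
sketch of the prediction step of Algorithm 3.2, again assuming the cumulative
Gaussian likelihood. Here <tt>fhat</tt> is the posterior mode from the sketch
above, <tt>Kstar</tt> an assumed <tt>n</tt> by <tt>nn</tt> matrix of
covariances between training and test inputs, and <tt>kss</tt> an assumed
<tt>nn</tt> vector of prior test variances; these names are illustrative only,
not the library's API:</p>

<pre>
  function [p, mu, s2] = laplace_pred_sketch(K, y, fhat, Kstar, kss)
  % sketch of Algorithm 3.2: predictions from the Laplace approximation,
  % assuming the cumulative Gaussian likelihood
  n   = size(K,1);  nn = size(Kstar,2);
  Phi = @(z) 0.5*erfc(-z/sqrt(2));          % standard normal cdf
  phi = @(z) exp(-z.^2/2)/sqrt(2*pi);       % standard normal pdf
  np  = phi(fhat)./Phi(y.*fhat);
  dlp = y.*np;                              % d log p(y|f) / df at the mode
  W   = np.^2 + y.*fhat.*np;  sW = sqrt(W); % W = -d^2 log p(y|f) / df^2
  L   = chol(eye(n)+sW*sW'.*K);             % upper triangular, L'*L = I+sW*K*sW
  mu  = Kstar'*dlp;                         % predictive latent means
  v   = L'\(repmat(sW,1,nn).*Kstar);
  s2  = kss - sum(v.^2,1)';                 % predictive latent variances
  p   = Phi(mu./sqrt(1+s2));                % predictive probabilities: the
  end                                       % cumulative Gaussian averaged
                                            % analytically over the latent Gaussian
</pre>

<p>For the logistic likelihood the final averaging step has no closed form and
needs to be approximated, for example numerically.</p>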
<h3 id="laplace-ex-toy">Example of Laplace's Approximation applied to a
2-dimensional classification problem</h3>

You can either follow the example below or run the short <a
href="../gpml-demo/demo_laplace_2d.m">demo_laplace_2d.m</a> script.

First we generate a simple artificial classification dataset, by sampling data
points for each of two classes from separate Gaussian distributions, as
follows:

<pre>
  n1=80; n2=40;                                 % number of data points from each class
  S1 = eye(2); S2 = [1 0.95; 0.95 1];           % the two covariance matrices
  m1 = [0.75; 0]; m2 = [-0.75; 0];              % the two means

  randn('seed',17)
  x1 = chol(S1)'*randn(2,n1)+repmat(m1,1,n1);   % samples from one class
  x2 = chol(S2)'*randn(2,n2)+repmat(m2,1,n2);   % and from the other

  x = [x1 x2]';                                 % these are the inputs and
  y = [repmat(-1,1,n1) repmat(1,1,n2)]';        % outputs used as training data
</pre>

Below the samples are shown together with the "Bayes Decision Probabilities",
obtained from complete knowledge of the data generating process:</p>

<center><img src="fig2d.gif"></center><br>

Note that the ideal predictive probabilities depend only on the relative
density of the two classes, and not on the absolute density. We would, for
example, expect that the structure in the upper right hand corner of the plot
may be very difficult to recover from the samples, because the data density
there is very low. The contour plot is obtained by:

<pre>
  [t1 t2] = meshgrid(-4:0.1:4,-4:0.1:4);
  t = [t1(:) t2(:)];                            % these are the test inputs
  tt = sum((t-repmat(m1',length(t),1))*inv(S1).*(t-repmat(m1',length(t),1)),2);
  z1 = n1*exp(-tt/2)/sqrt(det(S1));
  tt = sum((t-repmat(m2',length(t),1))*inv(S2).*(t-repmat(m2',length(t),1)),2);
  z2 = n2*exp(-tt/2)/sqrt(det(S2));
  contour(t1,t2,reshape(z2./(z1+z2),size(t1)),[0.1:0.1:0.9]); hold on
  plot(x1(1,:),x1(2,:),'b+')
  plot(x2(1,:),x2(2,:),'r+')
</pre>

Now, we will fit a probabilistic Gaussian process classifier to this data,
using an implementation of Laplace's method. We must specify a covariance
function and a likelihood function. First, we will try the squared exponential
covariance function <tt>covSEiso</tt>. We must specify the parameters of the
covariance function (hyperparameters). For the isotropic squared exponential
covariance function there are two hyperparameters, the lengthscale (kernel
width) and the magnitude. We need to specify values for these hyperparameters
(see below for how to learn them). Initially, we will simply set the log of
these hyperparameters to zero, and see what happens. For the likelihood
function, we use the cumulative Gaussian:

<pre>
  loghyper = [0; 0];
  p2 = binaryLaplaceGP(loghyper, 'covSEiso', 'cumGauss', x, y, t);
  clf
  contour(t1,t2,reshape(p2,size(t1)),[0.1:0.1:0.9]); hold on
  plot(x1(1,:),x1(2,:),'b+')
  plot(x2(1,:),x2(2,:),'r+')
</pre>

to produce predictive probabilities on the grid of test points:</p>

<center><img src="fig2dl1.gif"></center><br>

Although the predictive contours in this plot look quite different from the
"Bayes Decision Probabilities" plotted above, note that the predictive
probabilities in regions of high data density are not terribly different from
those of the generating process. Recall that this plot was made using
hyperparameters which we essentially pulled out of thin air. Now, we find the
values of the hyperparameters which maximize the marginal likelihood (or
strictly, the Laplace approximation of the marginal likelihood):

<pre>
  newloghyper = minimize(loghyper, 'binaryLaplaceGP', -20, 'covSEiso', 'cumGauss', x, y)
  p3 = binaryLaplaceGP(newloghyper, 'covSEiso', 'cumGauss', x, y, t);
</pre>

where the argument <tt>-20</tt> tells minimize to evaluate the function at most
<tt>20</tt> times. The new hyperparameters have a fairly similar length scale,
but a much larger magnitude for the latent function. This leads to more extreme
predictive probabilities. Re-plotting the predictions

<pre>
  clf
  contour(t1,t2,reshape(p3,size(t1)),[0.1:0.1:0.9]); hold on
  plot(x1(1,:),x1(2,:),'b+')
  plot(x2(1,:),x2(2,:),'r+')
</pre>

produces:</p>

<center><img src="fig2dl2.gif"></center><br>

Note that this plot still shows the predictive probabilities reverting to one
half when we move away from the data (in stark contrast to the "Bayes Decision
Probabilities" in this example). This may or may not be seen as an appropriate
behaviour, depending on our prior expectations about the data. It is a direct
consequence of the behaviour of the squared exponential covariance function,
whose covariance decays towards zero far from the training data, so that the
latent function reverts to its zero-mean prior and the predictive probability
to one half.</p>
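<p>As a small optional follow-up (a sketch only, using the variables from the
example above), one can compare the approximate marginal likelihood before and
after optimising the hyperparameters, and inspect the fitted values on their
natural scale:</p>

<pre>
  nlml_before = binaryLaplaceGP(loghyper,    'covSEiso', 'cumGauss', x, y)  % before training
  nlml_after  = binaryLaplaceGP(newloghyper, 'covSEiso', 'cumGauss', x, y)  % after training
  exp(newloghyper)    % fitted hyperparameters (lengthscale and magnitude) on their natural scale
</pre>

<p>A smaller negative log marginal likelihood after optimisation indicates
that the new hyperparameters explain the training data better; the names
<tt>nlml_before</tt> and <tt>nlml_after</tt> are just for illustration.</p>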