function. We can instead try a neural network covariance function <a href="../gpml/covNNiso.m">covNNiso.m</a>, which has the ability to saturate at specific latent values as we move away from zero:<pre>
  newloghyper = minimize(loghyper, 'binaryLaplaceGP', -20, 'covNNiso', 'cumGauss', x, y);
  p4 = binaryLaplaceGP(newloghyper, 'covNNiso', 'cumGauss', x, y, t);
  clf
  contour(t1, t2, reshape(p4,size(t1)), [0.1:0.1:0.9]);
  hold on
  plot(x1(1,:), x1(2,:), 'b+')
  plot(x2(1,:), x2(2,:), 'r+')
</pre>
to produce:</p>
<center><img src="fig2dl3.gif"></center><br>
which shows a somewhat less pronounced tendency for the predictive probabilities to tend to one half as we move towards the boundaries of the plot.

<h3 id="laplace-ex-usps">Example of Laplace's Approximation applied to hand-written digit classification</h3>

You can either follow the example below or run the short <a href="../gpml/gpml-demo/demo_laplace_usps.m">demo_laplace_usps.m</a> script. You will need the code from the <a href="http://www.gaussianprocess.org/gpml/code/gpml-matlab.tar.gz">tar</a> or <a href="http://www.gaussianprocess.org/gpml/code/gpml-matlab.zip">zip</a> archive and additionally the data, see below. Consider compiling the mex files (see the <a href="../gpml/README">README</a> file), although the programs should work even without them.</p>

As an example application we will consider the USPS resampled data, which can be found <a href="http://www.gaussianprocess.org/gpml/data">here</a>. Specifically, we will need to download <tt>usps_resampled.mat</tt> and unpack the data from the <a href="http://www.gaussianprocess.org/gpml/data/usps_resampled.mat.bz2">bz2</a> (7.0 Mb) or <a href="http://www.gaussianprocess.org/gpml/data/usps_resampled.mat.zip">zip</a> (8.3 Mb) files. As an example, we will consider the binary classification task of separating images of the digit "3" from images of the digit "5":<pre>
  [x y xx yy] = loadBinaryUSPS(3,5);
</pre>
which will create the training inputs <tt>x</tt> (of size 767 by 256), binary (+1/-1) training targets <tt>y</tt> (of length 767), and <tt>xx</tt> and <tt>yy</tt> containing the test set (of size 773). Use e.g. <tt>imagesc(reshape(x(3,:),16,16)');</tt> to view an image (here the 3rd image of the training set).</p>

Now, let us make predictions using the squared exponential covariance function <tt>covSEiso</tt> and the cumulative Gaussian likelihood function <tt>cumGauss</tt>. We must specify the parameters of the covariance function (hyperparameters). For the isotropic squared exponential covariance function there are two hyperparameters: the characteristic length-scale (kernel width) and the signal magnitude. We need to specify values for these hyperparameters (see below for how to learn them). One strategy, which is perhaps not too unreasonable, is to set the characteristic length-scale equal to the average pairwise distance between points, which for this data is around 22; the signal variance can be set to unity, since this is the range at which the sigmoid non-linearity comes into play. Other strategies for setting initial guesses for the hyperparameters may be reasonable as well. We need to specify the logarithm of the hyperparameters, roughly log(22)=3 and log(1)=0:<pre>
  loghyper = [3.0; 0.0];   % set the log hyperparameters
  p = binaryLaplaceGP(loghyper, 'covSEiso', 'cumGauss', x, y, xx);
</pre>
which may take a few seconds to compute.
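<p>As an aside, the average pairwise distance heuristic used above to pick the length-scale can be checked with a few lines of Matlab. This is only a sketch, not part of the original demo script, and it assumes the <tt>sq_dist.m</tt> utility shipped with the gpml code is on the path:</p><pre>
  D2 = sq_dist(x');                 % n by n matrix of squared distances between training inputs
  mask = triu(true(size(D2)), 1);   % select each distinct pair of points once
  avgdist = mean(sqrt(D2(mask)))    % average pairwise distance, around 22 for this data
  log(avgdist)                      % roughly 3, i.e. the first entry of loghyper above
</pre>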
We can visualize the predictive test set probabilities using:<pre>
  plot(p,'.')
  hold on
  plot([1 length(p)],[0.5 0.5],'r')
</pre>
to get:</p>
<br><center><img src="figlapp.gif"></center><br>
where we should keep in mind that the test cases are ordered according to their target class. Notice that there are misclassifications, but there are no very confident misclassifications. The number of test set errors (out of 773 test cases) when thresholding the predictive probability at 0.5, and the average amount of information about the test set labels in excess of a 50/50 model (in bits), are given by:<pre>
  sum((p>0.5)~=(yy>0))
  mean((yy==1).*log2(p)+(yy==-1).*log2(1-p))+1
</pre>
which shows 35 test set prediction errors (out of 773 test cases) and an average amount of information of about 0.71 bits; compare to Figure 3.7(b) and 3.7(d) on page 64.</p>

More interestingly, we can also use the program to find the values of the hyperparameters which optimize the marginal likelihood, as follows:<pre>
  [loghyper' binaryLaplaceGP(loghyper, 'covSEiso', 'cumGauss', x, y)]
  [newloghyper logmarglik] = minimize(loghyper, 'binaryLaplaceGP', -20, 'covSEiso', 'cumGauss', x, y);
  [newloghyper' logmarglik(end)]
</pre>
where the third argument <tt>-20</tt> to the minimize function tells minimize to evaluate the function at most 20 times (this is adequate when the parameter space is only 2-dimensional). Above, we first evaluate the negative log marginal likelihood at our initial guess for <tt>loghyper</tt> (which is about <tt>222</tt>), then minimize the negative log marginal likelihood w.r.t. the hyperparameters, to obtain new values of <tt>2.75</tt> for the log length-scale and <tt>2.27</tt> for the log magnitude. The negative log marginal likelihood decreases to about <tt>99</tt> at the minimum, see also Figure 3.7(a) page 64. The marginal likelihood has thus increased by a factor of exp(222-99), or roughly 3e+53.</p>

Finally, we can make test set predictions with the new hyperparameters:<pre>
  pp = binaryLaplaceGP(newloghyper, 'covSEiso', 'cumGauss', x, y, xx);
  plot(pp,'g.')
</pre>
to obtain:</p>
<br><center><img src="figlapp2.gif"></center><br>
We note that the new predictions (in green) generally take more extreme values than the old ones (in blue). Evaluating the new test predictions:<pre>
  sum((pp>0.5)~=(yy>0))
  mean((yy==1).*log2(pp)+(yy==-1).*log2(1-pp))+1
</pre>
reveals that there are now 22 test set misclassifications (as opposed to 35) and that the information about the test set labels has increased from 0.71 bits to 0.83 bits.

<h3 id="ep">2. The Expectation Propagation Algorithm</h3>

<p>It is straightforward to implement the Expectation Propagation (EP) algorithm for binary Gaussian process classification using matlab. Here we discuss an implementation which relies closely on Algorithm 3.5 (p. 58) for computing the EP approximation to the posterior, Algorithm 3.6 (p. 59) for making probabilistic predictions for test cases and Algorithm 5.2 (p. 127) for computing partial derivatives of the log marginal likelihood w.r.t. the hyperparameters (the parameters of the covariance function). Be aware that the <em>negative</em> of the log marginal likelihood is used.</p>

<p>The implementation given in <a href="../gpml/binaryEPGP.m">binaryEPGP.m</a> can conveniently be used together with <a href="../gpml/minimize.m">minimize.m</a>. The program can do one of two things:
<ol>
<li> compute the negative log marginal likelihood and its partial derivatives wrt.
the hyperparameters, usage:</p><pre>[nlml dnlml] = binaryEPGP(logtheta, covfunc, x, y)</pre>which is used when "training" the hyperparameters, or</li>
<li> compute the (marginal) predictive distribution for test inputs, usage:</p><pre>[p mu s2 nlml] = binaryEPGP(logtheta, covfunc, x, y, xstar)</pre></li>
</ol>
Selection between the two modes is indicated by the presence (or absence) of test cases, <tt>xstar</tt>. The arguments to the <a href="../gpml/binaryEPGP.m">binaryEPGP.m</a> function are given in the table below, where <tt>n</tt> is the number of training cases, <tt>D</tt> is the dimension of the input space and <tt>nn</tt> is the number of test cases:<br><br>
<table border=0 cols=2 width="100%">
<tr><td width="15%"><b>inputs</b></td><td></td></tr>
<tr><td><tt>logtheta</tt></td><td>a (column) vector containing the logarithm of the hyperparameters</td></tr>
<tr><td><tt>covfunc</tt></td><td>the covariance function, see <a href="../gpml/covFunctions.m">covFunctions.m</a></td></tr>
<tr><td><tt>x</tt></td><td>an <tt>n</tt> by <tt>D</tt> matrix of training inputs</td></tr>
<tr><td><tt>y</tt></td><td>a (column) vector (of length <tt>n</tt>) of training set <tt>+1/-1</tt> binary targets</td></tr>
<tr><td><tt>xstar</tt></td><td>(optional) an <tt>nn</tt> by <tt>D</tt> matrix of test inputs</td></tr>
<tr><td>&nbsp;</td><td></td></tr>
<tr><td><b>outputs</b></td><td></td></tr>
<tr><td><tt>nlml</tt></td><td>the negative log marginal likelihood</td></tr>
<tr><td><tt>dnlml</tt></td><td>(column) vector with the partial derivatives of the negative log marginal likelihood wrt. the logarithm of the hyperparameters</td></tr>
<tr><td><tt>mu</tt></td><td>(column) vector (of length <tt>nn</tt>) of predictive latent means</td></tr>
<tr><td><tt>s2</tt></td><td>(column) vector (of length <tt>nn</tt>) of predictive latent variances</td></tr>
<tr><td><tt>p</tt></td><td>(column) vector (of length <tt>nn</tt>) of predictive probabilities</td></tr>
</table><br>
The number of hyperparameters (and thus the length of the <tt>logtheta</tt> vector) depends on the choice of covariance function. Below we will use the squared exponential covariance function with isotropic distance measure, implemented in <a href="../gpml/covSEiso.m">covSEiso.m</a>; this covariance function has <tt>2</tt> parameters.</p>

Properties of various covariance functions are discussed in section 4.2. For the details of the implementation of the above covariance function, see <a href="../gpml/covFunctions.m">covFunctions.m</a>. In either mode (training or testing) the program first applies sweeps of EP updates to each latent variable in turn. Every update recomputes new values of the tilde parameters <tt>ttau</tt> and <tt>tnu</tt> (and updates the parameters <tt>mu</tt> and <tt>Sigma</tt> of the posterior). In the first ever call of the function, the tilde parameters <tt>ttau</tt> and <tt>tnu</tt> are both initialized to the zero vector. If the function is called multiple times, it stores the optimum from the previous call in a persistent variable and attempts to use these values as a starting guess for subsequent EP updates. This is useful when training, e.g. using <a href="../gpml/minimize.m">minimize.m</a>, since the hyperparameters for consecutive calls will generally be similar, and one would expect the best values of the tilde parameters from the previous setting of the hyperparameters to be a reasonable starting guess.
(If it turns out that this strategy leads to a very bad marginal likelihood value, the function reverts to starting at zero.)</p>

The Expectation Propagation implementation follows Algorithm 3.5, section 3.6, page 58<br></p>
<center><img src="alg35.gif"></center><br>
The iterations are terminated when the improvement in log marginal likelihood drops below a small tolerance, or a prespecified maximum number (10) of sweeps is reached (causing a warning message to be issued). About five sweeps is often sufficient when starting from zero, fewer if a reasonable starting point is available.</p>

What happens next depends on whether we are in the training or prediction mode, as indicated by the absence or presence of test inputs <tt>xstar</tt>. If test cases are present, then predictions are computed following Algorithm 3.6, section 3.6, page 59<br></p>
<center><img src="alg36.gif"></center><br>
Alternatively, if we are in the training mode, we proceed to compute the partial derivatives of the log marginal likelihood wrt the hyperparameters, using Algorithm 5.2, section 5.5, page 127<br></p>
<center><img src="alg52.gif"></center><br>

<h3 id="ep-ex-toy">Example of the Expectation Propagation algorithm applied to a 2-dimensional classification problem</h3>

You can either follow the example below or run the short <a href="../gpml-demo/demo_ep_2d.m">demo_ep_2d.m</a> script. First we generate a simple artificial classification dataset, by sampling data points for each of two classes from separate Gaussian distributions, as follows:<pre>
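  % NOTE: the original listing is truncated at this point in the source; the
  % lines below are a sketch of how such a dataset could be generated (the
  % class means, covariances and sample sizes here are assumptions, chosen to
  % mirror the Laplace 2-d example earlier in this document)
  n1 = 80; n2 = 40;                                % number of points per class
  S1 = eye(2); S2 = [1 0.95; 0.95 1];              % the two class covariance matrices
  m1 = [0.75; 0]; m2 = [-0.75; 0];                 % the two class means
  x1 = chol(S1)'*randn(2,n1) + repmat(m1,1,n1);    % class 1 inputs, 2 by n1
  x2 = chol(S2)'*randn(2,n2) + repmat(m2,1,n2);    % class 2 inputs, 2 by n2
  x = [x1 x2]';                                    % n by 2 matrix of training inputs
  y = [-ones(n1,1); ones(n2,1)];                   % corresponding +1/-1 targets
</pre>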