<DIV align=center><!-- MATH \begin{displaymath}x^{(k)}_{ij} = \sum^k_{t=1}u_{it}s_t v_{jt}\end{displaymath} --><IMG 
height=58 
alt="\begin{displaymath}&#10;x^{(k)}_{ij} = \sum^k_{t=1}u_{it}s_t v_{jt}&#10;\end{displaymath}" 
src="Bootstrapping a Principal Component Analysis.files/img250.png" width=135 
border=0> </DIV><BR clear=all>
<P></P>is the best approximation (in the least-squares sense) of rank <IMG height=17 alt=$k$ 
src="Bootstrapping a Principal Component Analysis.files/img251.png" width=15 
align=bottom border=0> for <IMG height=16 alt=$X$ 
src="Bootstrapping a Principal Component Analysis.files/img162.png" width=22 
align=bottom border=0>. 
<P>The columns of <IMG height=16 alt=$V$ 
src="Bootstrapping a Principal Component Analysis.files/img252.png" width=20 
align=bottom border=0> are the directions along which the variances are maximal. 
Definition: <BR>Principal components are the coordinates of the observations on 
the basis of the new variables (namely the columns of <IMG height=16 alt=$V$ 
src="Bootstrapping a Principal Component Analysis.files/img252.png" width=20 
align=bottom border=0>) and they are the rows of <IMG height=16 alt=$G=XV=US$ 
src="Bootstrapping a Principal Component Analysis.files/img253.png" width=129 
align=bottom border=0>. The components are orthogonal and their lengths are the 
singular values: writing C for the component matrix G, <!-- MATH $C^\prime C=SU^\prime US=S^2$ --><IMG height=19 
alt="$C^\prime C=SU^\prime US=S^2$" 
src="Bootstrapping a Principal Component Analysis.files/img254.png" width=168 
align=bottom border=0>. 
<P></P>In the same way the principal axes are defined as the rows of the matrix <!-- MATH $Z=U^\prime X=SV^\prime$ --><IMG height=17 
alt="$Z=U^\prime X=SV^\prime$" 
src="Bootstrapping a Principal Component Analysis.files/img243.png" width=137 
align=bottom border=0>. These are the coordinates of the variables on the basis of the columns of 
<IMG height=16 alt=$U$ 
src="Bootstrapping a Principal Component Analysis.files/img255.png" width=19 
align=bottom border=0>. <BR>
<P></P>
<DIV align=center><!-- MATH \begin{displaymath}Z=S^{-1}C^\prime X \quad\mbox{and}\quad G=XZ^\prime S^{-1}.\end{displaymath} --><IMG 
height=28 
alt="\begin{displaymath}&#10;Z=S^{-1}C^\prime X \quad\mbox{and}\quad G=XZ^\prime S^{-1}.&#10;\end{displaymath}" 
src="Bootstrapping a Principal Component Analysis.files/img256.png" width=283 
border=0> </DIV><BR clear=all>
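The matrix identities above are easy to check numerically. A minimal NumPy sketch (Python rather than the Matlab used in the sessions below, and a random matrix as a stand-in for real data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((88, 5))   # random stand-in for a data matrix
X = X - X.mean(axis=0)             # column-centered, as in the notes

# Thin SVD: X = U S V'
U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Rank-k truncation x_ij^(k) = sum_{t<=k} u_it s_t v_jt
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Principal components G = X V = U S; the Gram matrix G'G = S^2 is
# diagonal, so the components are orthogonal with lengths equal to
# the singular values
G = X @ V
assert np.allclose(G, U * s)
assert np.allclose(G.T @ G, np.diag(s**2))

# Principal axes Z = S^{-1} G' X, which reduces to S V'
Z = np.diag(1.0 / s) @ G.T @ X
assert np.allclose(Z, np.diag(s) @ Vt)
```

The truncated matrix `Xk` has rank k by construction, and the assertions confirm the G and Z identities for any full-column-rank X.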
<P></P>However, this decomposition is highly dependent upon the units of 
measurement (scales) in which the variables are recorded. It is only used, in 
fact, when the <IMG height=35 alt="$X_{\cdot k}$" 
src="Bootstrapping a Principal Component Analysis.files/img257.png" width=32 
align=middle border=0> are all of the same ``order''. The usual remedy is to 
assign each variable a weight that accounts for its overall scale; very often 
this weight is taken inversely proportional to the variable's variance. 
<P>So we define a different metric between observations, replacing the usual Euclidean distance: 
<P></P>
<DIV align=center><!-- MATH \begin{eqnarray*}d\,\left( x_{i\cdot}, x_{j\cdot}\right) &=& \left( x_{i\cdot}- x_{j\cdot}\right)^\prime, \left( x_{i\cdot}- x_{j\cdot}\right)\\d\,\left( x_{i\cdot}, x_{j\cdot}\right) &=& \left( x_{i\cdot}- x_{j\cdot}\right), Q\, \left( x_{i\cdot}- x_{j\cdot}\right)\\Q &=& \left(\begin{array}{ccc}\frac{1}{\sigma^2_1}\\& \ddots\\& & \frac{1}{\sigma^2_p}\end{array}\right).\end{eqnarray*} --><IMG 
height=154 
alt="\begin{eqnarray*}&#10;d \left( x_{i\cdot}, x_{j\cdot}\right) &amp;=&amp; \left( x_{i\cdot}-...&#10;...2_1}\\&#10;&amp; \ddots\\&#10;&amp; &amp; \frac{1}{\sigma^2_p}&#10;\end{array}\right).&#10;\end{eqnarray*}" 
src="Bootstrapping a Principal Component Analysis.files/img258.png" width=314 
border=0></DIV><BR clear=all>
<P></P>The same can be said of the observations: some may be more ``important'' 
than others (for instance, a row that represents the mean of a larger group). 
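Weighting each variable inversely to its variance is the same as standardizing the columns, i.e. performing the PCA on the correlation matrix rather than the covariance matrix. A NumPy sketch of this equivalence (random data with deliberately unequal column scales, not the exam-score data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Random stand-in data whose columns have very different scales
X = rng.standard_normal((88, 5)) * np.array([1.0, 2.0, 5.0, 10.0, 20.0])
Xc = X - X.mean(axis=0)

# The metric Q = diag(1/sigma_1^2, ..., 1/sigma_p^2) amounts to rescaling
# each column by 1/sigma_j, i.e. standardizing the variables
Xs = Xc / Xc.std(axis=0, ddof=1)

# PCA of the standardized data diagonalizes the correlation matrix:
# the squared singular values over (n-1) are the eigenvalues of R
s = np.linalg.svd(Xs, compute_uv=False)
eigvals = s**2 / (len(X) - 1)
R = np.corrcoef(X, rowvar=False)
```

Here `eigvals` (sorted) matches the eigenvalues of `R`, and they sum to p = 5, the trace of the correlation matrix.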
<H3><A name=SECTION002121100000000000000>Matlab for the Scores Example (in 
handout 4/27/99)</A> </H3><B>Score data</B> <PRE>&gt;&gt; round(cov(scor88))
ans =
   306   127   102   106   117
   127   173    85    95    99
   102    85   113   112   122
   106    95   112   220   156
   117    99   122   156   298
&gt;&gt; corrcoef(scor88) 
ans =
    1.0000    0.5534    0.5468    0.4094    0.3891
    0.5534    1.0000    0.6096    0.4851    0.4364
    0.5468    0.6096    1.0000    0.7108    0.6647
    0.4094    0.4851    0.7108    1.0000    0.6072
    0.3891    0.4364    0.6647    0.6072    1.0000
&gt;&gt; mean(scor88)
ans = 38.95   50.59   50.60   46.68   42.31
&gt;&gt; scorc88=scor88-ones(88,1)*mean(scor88);
&gt;&gt; mean(scorc88)
ans =1.0e-13 *
    0.1421   -0.1970   -0.0565   -0.0436   -0.0363
&gt;&gt; [U,S,V] = svd(scorc88,0);
&gt;&gt; S
S =
  244.4752         0         0         0         0
         0  132.6034         0         0         0
         0         0   95.0053         0         0
         0         0         0   85.8070         0
         0         0         0         0   52.8898
&gt;&gt; V
V =
    0.5054   -0.7487   -0.2998    0.2962   -0.0794
    0.3683   -0.2074    0.4156   -0.7829   -0.1889
    0.3457    0.0759    0.1453   -0.0032    0.9239
    0.4511    0.3009    0.5966    0.5181   -0.2855
    0.5347    0.5478   -0.6003   -0.1757   -0.1512
&gt;&gt; [VE, LE]=eig(cov(scorc88)) 
VE =
    0.2962   -0.2998    0.0794   -0.7487    0.5054
   -0.7829    0.4156    0.1889   -0.2074    0.3683
   -0.0032    0.1453   -0.9239    0.0759    0.3457
    0.5181    0.5966    0.2855    0.3009    0.4511
   -0.1757   -0.6003    0.1512    0.5478    0.5347
LE =
   84.6304         0         0         0         0
         0  103.7473         0         0         0
         0         0   32.1533         0         0
         0         0         0  202.1111         0
         0         0         0         0  686.9898
</PRE>Nonparametric bootstrap of the proportion of variance explained by the 
first k axes. <PRE>function out=bpca(B,orig,k)
%Bootstrap of the PCA of a dataset
%Output the proportion
%of variance for the first k axes
out=zeros(1,B);    %preallocate the results
[n,p]=size(orig);
for b=1:B
    newsample=bsample(orig);
    out(b)=pca(newsample,n,p,k);
end %---------------------------


function out=bsample(orig)
%Function to create one resample from
%the original sample orig, where
%orig is the original data, and is a matrix
      [n,p]=size(orig);
      indices=randint(1,n,n)+1;  %n draws from {1,...,n}; randi(n,1,n) in newer Matlab
      out=orig(indices,:);

%-------------------------------

function out=pca(data,n,p,k)
%Do a principal component analysis
%Return the ratio of the variance
%explained by the first k components
data=data-ones(n,1)*mean(data);
[U,S,V] = svd(data,0);
S2=diag(S).^2;
out=sum(S2(1:k))/sum(S2(1:p));
%---------------------------------

&gt;&gt; resbpca10K1=bpca(10000,scor88,1);
&gt;&gt; resbpca10K=bpca(10000,scor88,2);
&gt;&gt; hist(resbpca10K,100)
</PRE>one component <BR><IMG 
src="Bootstrapping a Principal Component Analysis.files/btspca1.gif" width=400 
border=0> two components <BR><IMG 
src="Bootstrapping a Principal Component Analysis.files/btspca2.gif" width=400 
border=0> 
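The nonparametric bootstrap above can be mirrored in Python/NumPy for readers without Matlab (the function names echo the Matlab ones but this is a sketch, not the course code, and the data here is a random stand-in for scor88):

```python
import numpy as np

def pca_ratio(data, k):
    """Proportion of variance carried by the first k principal axes."""
    data = data - data.mean(axis=0)
    s2 = np.linalg.svd(data, compute_uv=False) ** 2
    return s2[:k].sum() / s2.sum()

def bpca(B, orig, k, rng=None):
    """Nonparametric bootstrap: resample the rows of orig with replacement."""
    rng = np.random.default_rng(rng)
    n = orig.shape[0]
    out = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # like randint(1,n,n)+1 above
        out[b] = pca_ratio(orig[idx], k)
    return out

rng = np.random.default_rng(2)
scores = rng.standard_normal((88, 5))      # random stand-in for scor88
res = bpca(500, scores, k=2, rng=3)
```

A histogram of `res` plays the role of the `hist(resbpca10K,100)` call above.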
<P>
<H3><A name=SECTION002121200000000000000>How to generate a multivariate 
normal</A> </H3>The Cholesky decomposition of a symmetric positive definite 
matrix <IMG height=16 alt=$A$ 
src="Bootstrapping a Principal Component Analysis.files/img259.png" width=19 
align=bottom border=0> is a factorization of the form <BR>
<P></P>
<DIV align=center><!-- MATH \begin{displaymath}A=L ^\prime L\end{displaymath} --><IMG 
height=28 alt="\begin{displaymath}A=L ^\prime L\end{displaymath}" 
src="Bootstrapping a Principal Component Analysis.files/img260.png" width=68 
border=0> </DIV><BR clear=all>
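Matlab's <B>chol</B> returns the upper-triangular factor, so <B>chl'*chl</B> recovers A and <B>ys = xs*chl</B> has covariance A when xs has identity covariance. A NumPy sketch of the same trick (NumPy's cholesky returns the lower-triangular factor instead, so the roles are transposed; the covariance matrix here is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(4)

# Target covariance: any symmetric positive definite matrix will do
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.8],
              [0.5, 0.8, 2.0]])

# NumPy's cholesky returns the lower-triangular L with L L' = A;
# Matlab's chol returns the upper-triangular factor chl = L'
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)

# ys = xs*chl in Matlab becomes xs @ L.T here; for xs with identity
# covariance, cov(ys) = L cov(xs) L' = A
xs = rng.standard_normal((100_000, 3))
ys = xs @ L.T
cov_hat = np.cov(ys, rowvar=False)
```

With 100,000 draws the sample covariance `cov_hat` reproduces A to within sampling error, just as `cov(ys)` approximates `cov(scor88)` in the Matlab session below.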
<P></P><PRE>&gt;&gt; chl=chol(cov(scor88));
&gt;&gt; round(chl'*chl)
ans =
   306   127   102   106   117
   127   173    85    95    99
   102    85   113   112   122
   106    95   112   220   156
   117    99   122   156   298

&gt;&gt; xs=randn(88,5);
&gt;&gt; mean(xs)
ans =
    0.0480   -0.2056    0.1209    0.1097    0.1051
&gt;&gt; corrcoef(xs)
ans =
    1.0000   -0.1623   -0.0664   -0.1947   -0.1042
   -0.1623    1.0000    0.1412    0.0361    0.0600
   -0.0664    0.1412    1.0000   -0.2334    0.0056
   -0.1947    0.0361   -0.2334    1.0000   -0.0734
   -0.1042    0.0600    0.0056   -0.0734    1.0000
&gt;&gt; ys=xs*chl; 
&gt;&gt; cov(ys)
ans =
  225.8649   65.7374   57.5737   29.1913   40.6952
   65.7374  149.5844   74.5579   74.1853   83.3744
   57.5737   74.5579   93.5467   66.5157   89.7249
   29.1913   74.1853   66.5157  152.4465   88.0262
   40.6952   83.3744   89.7249   88.0262  210.1678
</PRE><B>Parametric Bootstrap Simulations</B> <PRE>&gt;&gt; xs=randn(88,5);
&gt;&gt; mean(xs)
ans = 0.013  0.035 -0.151 -0.059 0.033
&gt;&gt; ys=xs*chl; 
&gt;&gt; round(cov(ys))
ans =
   309   169   147   130   150
   169   237   137   160   146
   147   137   161   158   149
   130   160   158   252   172
   150   146   149   172   272
function out=spca(B,orig,k)
%Parametric Bootstrap of the PCA of a dataset
%Output the proportion
%of variance for the first k axes
out=zeros(1,B);
[n,p]=size(orig);
chl=chol(cov(orig));
for b=1:B
    rands=randn(n,p);
    newsample=rands*chl;
    out(b)=pca(newsample,n,p,k);
end
r100k=spca(100000,scor88,2);
hist(r100k,200);
</PRE>
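The parametric bootstrap combines the two previous pieces: estimate the covariance once, then repeatedly draw Gaussian samples with that covariance and recompute the proportion of variance. A Python/NumPy sketch of the same scheme (a sketch with a random stand-in for scor88, not the course code):

```python
import numpy as np

def pca_ratio(data, k):
    """Proportion of variance carried by the first k principal axes."""
    data = data - data.mean(axis=0)
    s2 = np.linalg.svd(data, compute_uv=False) ** 2
    return s2[:k].sum() / s2.sum()

def spca(B, orig, k, rng=None):
    """Parametric bootstrap: Gaussian samples with the covariance of orig."""
    rng = np.random.default_rng(rng)
    n, p = orig.shape
    L = np.linalg.cholesky(np.cov(orig, rowvar=False))  # L L' = cov(orig)
    out = np.empty(B)
    for b in range(B):
        newsample = rng.standard_normal((n, p)) @ L.T   # like rands*chl
        out[b] = pca_ratio(newsample, k)
    return out

rng = np.random.default_rng(5)
scores = rng.standard_normal((88, 5))   # random stand-in for scor88
res = spca(500, scores, k=2, rng=6)
```

A histogram of `res` corresponds to the `hist(r100k,200)` call above; the difference from the nonparametric version is only in how each resample is generated.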
<P>
<HR>
<ADDRESS>Susan Holmes 2004-05-19 </ADDRESS></BODY></HTML>
