⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 fastmcd.m

📁 data description toolbox 1.6 单类分类器工具包
💻 M
📖 第 1 页 / 共 5 页
字号:
function [res,raw]=fastmcd(data,options);% version 22/12/2000, revised 19/01/2001, new reweighted correction factors and old cutoff 9/07/2001%% FASTMCD computes the MCD estimator of a multivariate data set.  This % estimator is given by the subset of h observations with smallest covariance % determinant.  The MCD location estimate is then the mean of those h points,% and the MCD scatter estimate is their covariance matrix.  The default value % of h is roughly 0.75n (where n is the total number of observations), but the % user may choose each value between n/2 and n.%% The MCD method is intended for continuous variables, and assumes that % the number of observations n is at least 5 times the number of variables p. % If p is too large relative to n, it would be better to first reduce% p by variable selection or principal components.%% The MCD method was introduced in:%%   Rousseeuw, P.J. (1984), "Least Median of Squares Regression," %   Journal of the American Statistical Association, Vol. 79, pp. 871-881.%% The MCD is a robust method in the sense that the estimates are not unduly % influenced by outliers in the data, even if there are many outliers. % Due to the MCD's robustness, we can detect outliers by their large% robust distances. The latter are defined like the usual Mahalanobis% distance, but based on the MCD location estimate and scatter matrix% (instead of the nonrobust sample mean and covariance matrix).%% The FASTMCD algorithm uses several time-saving techniques which % make it available as a routine tool to analyze data sets with large n,% and to detect deviating substructures in them. A full description of the % algorithm can be found in:%%   Rousseeuw, P.J. and Van Driessen, K. (1999), "A Fast Algorithm for the %   Minimum Covariance Determinant Estimator," Technometrics, 41, pp. 212-223.%% An important feature of the FASTMCD algorithm is that it allows for exact % fit situations, i.e. when more than h observations lie on a (hyper)plane. % Then the program still yields the MCD location and scatter matrix, the latter% being singular (as it should be), as well as the equation of the hyperplane.%% Usage:%   [res,raw]=fastmcd(data,options)                                       %% If only one output argument is listed, only the final result ('res') is returned.% The first input argument 'data' is a vector or matrix.  Columns represent % variables, and rows represent observations.  Missing values (NaN's) and % infinite values (Inf's) are allowed, since observations (rows) with missing % or infinite values will automatically be excluded from the computations.%% The second input argument 'options' is a structure.  It specifies certain % parameters of the algorithm:%                                                           %  options.cor: If non-zero, the robust correlation matrix will be %               returned. The default value is 0.      %  options.alpha: The percentage of observations whose covariance determinant will %                 be minimized.  Any value between 0.5 and 1 may be specified.%                 The default value is 0.75.%  options.ntrial: The number of random trial subsamples that are drawn for %                  large datasets. The default is 500.  %% The output structure 'raw' contains intermediate results, with the following % fields :%%  raw.center: The raw MCD location of the data.%  raw.cov: The raw MCD covariance matrix (multiplied by a finite sample %           correction factor and an asymptotic consistency factor).%  raw.cor: The raw MCD correlation matrix, if options.cor was non-zero.%  raw.objective: The determinant of the raw MCD covariance matrix.%  raw.robdist: The distance of each observation from the raw MCD location%               of the data, relative to the raw MCD scatter matrix 'raw.cov' %  raw.wt: Weights based on the estimated raw covariance matrix 'raw.cov' and %          the estimated raw location of the data. These weights determine %          which observations are used to compute the final MCD estimates. %% The output structure 'res' contains the final results, namely:%%  res.n_obs: The number of data observations (without missing values).%  res.quan: The number of observations that have determined the MCD estimator, %            i.e. the value of h. %  res.mahalanobis:  The distance of each observation from the classical%                    center of the data, relative to the classical shape%                    of the data. Often, outlying points fail to have a%                    large Mahalanobis distance because of the masking%                    effect.%  res.center: The robust location of the data, obtained after reweighting, if %              the raw MCD is not singular.  Otherwise the raw MCD center is %              given here.%  res.cov: The robust covariance matrix, obtained after reweighting and %           multiplying with a finite sample correction factor and an asymptotic%           consistency factor, if the raw MCD is not singular.  Otherwise the %           raw MCD covariance matrix is given here.%  res.cor: The robust correlation matrix, obtained after reweighting, if %           options.cor was non-zero.%  res.method: A character string containing information about the method and %              about singular subsamples (if any). %  res.robdist:  The distance of each observation to the final,%                reweighted MCD center of the data, relative to the%                reweighted MCD scatter of the data.  These distances allow%                us to easily identify the outliers. If the reweighted MCD%                is singular, raw.robdist is given here.%  res.flag: Flags based on the reweighted covariance matrix and the %            reweighted location of the data.  These flags determine which %            observations can be considered as outliers. If the reweighted%            MCD is singular, raw.wt is given here.%  res.plane:  In case of an exact fit, res.plane contains the coefficients%              of a (hyper)plane a_1(x_i1-m_1)+...+a_p(x_ip-m_p)=0 %              containing at least h observations, where (m_1,...,m_p)%              is the MCD location of these observations.%  res.X: The data matrix. Rows containing missing or infinite values are %         ommitted. %% FASTMCD also automatically calls the function PLOTMCD which creates plots for % visualizing outliers detected by FASTMCD. The plots that can be produced are:%% 1. Plot of Mahalanobis distances versus case number.% 2. Plot of robust distances versus case number. % 3. QQ plot: shows robust distances versus chi-squared quantiles.% 4. Robust distances versus Mahalanobis distances (i.e. the D-D plot).% 5. The 97.5% robust tolerance ellipse (if the dataset is bivariate).%% Usage:%     plotmcd(mcdres,options)%% The first input argument 'mcdres' is the output argument of the function % FASTMCD. The second input argument 'options' is a structure containing:%                                                           %  options.ask: A logical flag: if set to 0, all plots are produced sequentially;%               if set to 1, PLOTMCD displays a menu listing all the plots that %               can be produced. The default value is 1. %  options.nid: Number of points (must be less than n) to be highlighted in the %               appropriate plots. These will be the 'nid' most extreme points,%               i.e. those with largest robust distance.  %  options.xlab: Label of the X-axis in the MCD tolerance ellipse plot.%  options.ylab: Label of the Y-axis in the MCD tolerance ellipse plot.% The fastmcd algorithm works as follows:%%       The dataset contains n cases and p variables. %       When n < 2*nmini (see below), the algorithm analyzes the dataset as a whole.%       When n >= 2*nmini (see below), the algorithm uses several subdatasets.%%       When the dataset is analyzed as a whole, a trial subsample of p+1 cases %       is taken, of which the mean and covariance matrix are calculated. %       The h cases with smallest relative distances are used to calculate %       the next mean and covariance matrix, and this cycle is repeated csteps1 %       times. For small n we consider all subsets of p+1 out of n, otherwise%       the algorithm draws 500 random subsets by default.%       Afterwards, the 10 best solutions (means and corresponding covariance %       matrices) are used as starting values for the final iterations. %       These iterations stop when two subsequent determinants become equal. %       (At most csteps3 iteration steps are taken.) The solution with smallest %       determinant is retained. %%       When the dataset contains more than 2*nmini cases, the algorithm does part %       of the calculations on (at most) maxgroup nonoverlapping subdatasets, of %       (roughly) maxobs cases. %%       Stage 1: For each trial subsample in each subdataset, csteps1 (see below) iterations are %       carried out in that subdataset. For each subdataset, the 10 best solutions are %       stored.   %       %       Stage 2 considers the union of the subdatasets, called the merged set. %       (If n is large, the merged set is a proper subset of the entire dataset.) %       In this merged set, each of the 'best solutions' of stage 1 are used as starting %       values for csteps2 (sse below) iterations. Also here, the 10 best solutions are stored.%%       Stage 3 depends on n, the total number of cases in the dataset. %       If n <= 5000, all 10 preliminary solutions are iterated. %       If n > 5000, only the best preliminary solution is iterated. %       The number of iterations decreases to 1 according to n*p (If n*p <= 100,000 we %       iterate csteps3 (sse below) times, whereas for n*p > 1,000,000 we take only one iteration step). % The maximum value for n (= number of observations) is:nmax=50000;% The maximum value for p (= number of variables) is:pmax=50;% To change the number of subdatasets and their size, the values of % maxgroup and nmini can be changed. maxgroup=5;nmini=300;% The number of iteration steps in stages 1,2 and 3 can be changed% by adapting the parameters csteps1, csteps2, and csteps3.csteps1=2;csteps2=2;csteps3=100;% dtrial : number of subsamples if not all (p+1)-subsets will be considered. dtrial=500;% The 0.975 quantile of the chi-squared distribution.chi2q=[5.02389,7.37776,9.34840,11.1433,12.8325,...       14.4494,16.0128,17.5346,19.0228,20.4831,21.920,23.337,...       24.736,26.119,27.488,28.845,30.191,31.526,32.852,34.170,...       35.479,36.781,38.076,39.364,40.646,41.923,43.194,44.461,...       45.722,46.979,48.232,49.481,50.725,51.966,53.203,54.437,...       55.668,56.896,58.120,59.342,60.561,61.777,62.990,64.201,...       65.410,66.617,67.821,69.022,70.222,71.420]; % Median of the chi-squared distribution. chimed=[0.454937,1.38629,2.36597,3.35670,4.35146,...       5.34812,6.34581,7.34412,8.34283,9.34182,10.34,11.34,12.34,...       13.34,14.34,15.34,16.34,17.34,18.34,19.34,20.34,21.34,22.34,...       23.34,24.34,25.34,26.34,27.34,28.34,29.34,30.34,31.34,32.34,...       33.34,34.34,35.34,36.34,37.34,38.34,39.34,40.34,41.34,42.34,...       43.34,44.34,45.34,46.34,47.33,48.33,49.33];seed=0;quan=0;alpha=0.75;file=0;% The value of the fields of the input argument OPTIONS are now determined.% If the user hasn't given a value to one of the fields, the default value % is assigned to it.if nargin==1   cor=0;   ntrial=dtrial;   lts=[];elseif isstruct(options)    names=fieldnames(options);      	if strmatch('cor',names,'exact')      cor=options.cor;   else      cor=0;	end	if strmatch('alpha',names,'exact')      quan=1;      alpha=options.alpha;	end	if strmatch('ntrial',names,'exact')   	ntrial=options.ntrial;	else   	ntrial=dtrial;   end      if strmatch('lts',names,'exact')      lts=options.lts;   else      lts=0;   end   else   error('The second input argument is not a structure.')end   if size(data,1)==1    data=data';end   % Observations with missing or infinite values are ommitted. ok=all(isfinite(data),2);data=data(ok,:);n=size(data,1);p=size(data,2);% Some checks are now performed.if n==0   error('All observations have missing or infinite values.')endif n > nmax   error(['The program allows for at most ' int2str(nmax) ' observations.'])endif p > pmax   error(['The program allows for at most ' int2str(pmax) ' variables.'])endif n < 2*p   error('Need at least 2*(number of variables) observations.')end% hmin is the minimum number of observations whose covariance determinant % will be minimized.  hmin=quanf(0.5,n,p);if ~quan   h=quanf(0.75,n,p);else   h=quanf(alpha,n,p);   if h < hmin                                                            error(['The MCD must cover at least ' int2str(hmin) ' observations.'])   elseif h > n      error('quan is greater than the number of non-missings and non-infinites.')   endendfid=NaN;% The value of some fields of the output argument are already known.res.n_obs=n;res.quan=h;res.X=data;% Some initializations.res.flag=repmat(NaN,1,length(ok));raw.wt=repmat(NaN,1,length(ok));raw.robdist=repmat(NaN,1,length(ok));res.robdist=repmat(NaN,1,length(ok));res.mahalanobis=repmat(NaN,1,length(ok));if ~lts   res.method=sprintf('\nMinimum Covariance Determinant Estimator.');else   res.method=sprintf('\nThe function fastmcd.m is called to compute robust distances.');end   correl=NaN;%    z    : if at least h observations lie on a hyperplane, then z contains the %           coefficients of that plane.  % weights : weights of the observations that are not excluded from the computations. %           These are the observations that don't contain missing or infinite values.% bestobj : best objective value found.z(1:p)=0;weights=zeros(1,n);bestobj=inf;% The breakdown point of the MCD estimator is computed. if h==hmin   %the breakdown point is maximal.   breakdown=(h-p)*100/n;else   breakdown=(n-h+1)*100/n;end% The classical estimates are computed.    clasmean=mean(data);clascov=cov(data);   if p < 5   eps=1e-12;elseif p <= 8   eps=1e-14;else   eps=1e-16;end% The standardization of the data will now be performed.med=median(data);mad=sort(abs(data-repmat(med,n,1)));mad=mad(h,:);ii=min(find(mad < eps));if length(ii)    % The h-th order statistic is zero for the ii-th variable. The array plane contains   % all the observations which have the same value for the ii-th variable.   plane=find(abs(data(:,ii)-med(ii)) < eps)';   meanplane=mean(data(plane,:));   weights(plane)=1;   if p==1      res.flag=weights;      raw.wt=weights;      [raw.center,res.center]=deal(meanplane);      [raw.cov,res.cov,raw.objective]=deal(0);      if ~lts         res.method=sprintf('\nUnivariate location and scale estimation.');            res.method=strvcat(res.method,sprintf('%g of the %g observations are identical.',length(plane),n));         disp(res.method);      end   else      z(ii)=1;      res.plane=z;      covplane=cov(data(plane,:));      [raw.center,raw.cov,res.center,res.cov,raw.objective,raw.wt,res.flag,...      res.method]=displ(3,length(plane),weights,n,p,meanplane,covplane,res.method,z,ok,...                        raw.wt,res.flag,file,fid,0,NaN,h,ii);   end   return

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -