📄 fastmcd.m
字号:
function [res,raw]=fastmcd(data,options);% version 22/12/2000, revised 19/01/2001, new reweighted correction factors and old cutoff 9/07/2001%% FASTMCD computes the MCD estimator of a multivariate data set. This % estimator is given by the subset of h observations with smallest covariance % determinant. The MCD location estimate is then the mean of those h points,% and the MCD scatter estimate is their covariance matrix. The default value % of h is roughly 0.75n (where n is the total number of observations), but the % user may choose each value between n/2 and n.%% The MCD method is intended for continuous variables, and assumes that % the number of observations n is at least 5 times the number of variables p. % If p is too large relative to n, it would be better to first reduce% p by variable selection or principal components.%% The MCD method was introduced in:%% Rousseeuw, P.J. (1984), "Least Median of Squares Regression," % Journal of the American Statistical Association, Vol. 79, pp. 871-881.%% The MCD is a robust method in the sense that the estimates are not unduly % influenced by outliers in the data, even if there are many outliers. % Due to the MCD's robustness, we can detect outliers by their large% robust distances. The latter are defined like the usual Mahalanobis% distance, but based on the MCD location estimate and scatter matrix% (instead of the nonrobust sample mean and covariance matrix).%% The FASTMCD algorithm uses several time-saving techniques which % make it available as a routine tool to analyze data sets with large n,% and to detect deviating substructures in them. A full description of the % algorithm can be found in:%% Rousseeuw, P.J. and Van Driessen, K. (1999), "A Fast Algorithm for the % Minimum Covariance Determinant Estimator," Technometrics, 41, pp. 212-223.%% An important feature of the FASTMCD algorithm is that it allows for exact % fit situations, i.e. when more than h observations lie on a (hyper)plane. % Then the program still yields the MCD location and scatter matrix, the latter% being singular (as it should be), as well as the equation of the hyperplane.%% Usage:% [res,raw]=fastmcd(data,options) %% If only one output argument is listed, only the final result ('res') is returned.% The first input argument 'data' is a vector or matrix. Columns represent % variables, and rows represent observations. Missing values (NaN's) and % infinite values (Inf's) are allowed, since observations (rows) with missing % or infinite values will automatically be excluded from the computations.%% The second input argument 'options' is a structure. It specifies certain % parameters of the algorithm:% % options.cor: If non-zero, the robust correlation matrix will be % returned. The default value is 0. % options.alpha: The percentage of observations whose covariance determinant will % be minimized. Any value between 0.5 and 1 may be specified.% The default value is 0.75.% options.ntrial: The number of random trial subsamples that are drawn for % large datasets. The default is 500. %% The output structure 'raw' contains intermediate results, with the following % fields :%% raw.center: The raw MCD location of the data.% raw.cov: The raw MCD covariance matrix (multiplied by a finite sample % correction factor and an asymptotic consistency factor).% raw.cor: The raw MCD correlation matrix, if options.cor was non-zero.% raw.objective: The determinant of the raw MCD covariance matrix.% raw.robdist: The distance of each observation from the raw MCD location% of the data, relative to the raw MCD scatter matrix 'raw.cov' % raw.wt: Weights based on the estimated raw covariance matrix 'raw.cov' and % the estimated raw location of the data. These weights determine % which observations are used to compute the final MCD estimates. %% The output structure 'res' contains the final results, namely:%% res.n_obs: The number of data observations (without missing values).% res.quan: The number of observations that have determined the MCD estimator, % i.e. the value of h. % res.mahalanobis: The distance of each observation from the classical% center of the data, relative to the classical shape% of the data. Often, outlying points fail to have a% large Mahalanobis distance because of the masking% effect.% res.center: The robust location of the data, obtained after reweighting, if % the raw MCD is not singular. Otherwise the raw MCD center is % given here.% res.cov: The robust covariance matrix, obtained after reweighting and % multiplying with a finite sample correction factor and an asymptotic% consistency factor, if the raw MCD is not singular. Otherwise the % raw MCD covariance matrix is given here.% res.cor: The robust correlation matrix, obtained after reweighting, if % options.cor was non-zero.% res.method: A character string containing information about the method and % about singular subsamples (if any). % res.robdist: The distance of each observation to the final,% reweighted MCD center of the data, relative to the% reweighted MCD scatter of the data. These distances allow% us to easily identify the outliers. If the reweighted MCD% is singular, raw.robdist is given here.% res.flag: Flags based on the reweighted covariance matrix and the % reweighted location of the data. These flags determine which % observations can be considered as outliers. If the reweighted% MCD is singular, raw.wt is given here.% res.plane: In case of an exact fit, res.plane contains the coefficients% of a (hyper)plane a_1(x_i1-m_1)+...+a_p(x_ip-m_p)=0 % containing at least h observations, where (m_1,...,m_p)% is the MCD location of these observations.% res.X: The data matrix. Rows containing missing or infinite values are % ommitted. %% FASTMCD also automatically calls the function PLOTMCD which creates plots for % visualizing outliers detected by FASTMCD. The plots that can be produced are:%% 1. Plot of Mahalanobis distances versus case number.% 2. Plot of robust distances versus case number. % 3. QQ plot: shows robust distances versus chi-squared quantiles.% 4. Robust distances versus Mahalanobis distances (i.e. the D-D plot).% 5. The 97.5% robust tolerance ellipse (if the dataset is bivariate).%% Usage:% plotmcd(mcdres,options)%% The first input argument 'mcdres' is the output argument of the function % FASTMCD. The second input argument 'options' is a structure containing:% % options.ask: A logical flag: if set to 0, all plots are produced sequentially;% if set to 1, PLOTMCD displays a menu listing all the plots that % can be produced. The default value is 1. % options.nid: Number of points (must be less than n) to be highlighted in the % appropriate plots. These will be the 'nid' most extreme points,% i.e. those with largest robust distance. % options.xlab: Label of the X-axis in the MCD tolerance ellipse plot.% options.ylab: Label of the Y-axis in the MCD tolerance ellipse plot.% The fastmcd algorithm works as follows:%% The dataset contains n cases and p variables. % When n < 2*nmini (see below), the algorithm analyzes the dataset as a whole.% When n >= 2*nmini (see below), the algorithm uses several subdatasets.%% When the dataset is analyzed as a whole, a trial subsample of p+1 cases % is taken, of which the mean and covariance matrix are calculated. % The h cases with smallest relative distances are used to calculate % the next mean and covariance matrix, and this cycle is repeated csteps1 % times. For small n we consider all subsets of p+1 out of n, otherwise% the algorithm draws 500 random subsets by default.% Afterwards, the 10 best solutions (means and corresponding covariance % matrices) are used as starting values for the final iterations. % These iterations stop when two subsequent determinants become equal. % (At most csteps3 iteration steps are taken.) The solution with smallest % determinant is retained. %% When the dataset contains more than 2*nmini cases, the algorithm does part % of the calculations on (at most) maxgroup nonoverlapping subdatasets, of % (roughly) maxobs cases. %% Stage 1: For each trial subsample in each subdataset, csteps1 (see below) iterations are % carried out in that subdataset. For each subdataset, the 10 best solutions are % stored. % % Stage 2 considers the union of the subdatasets, called the merged set. % (If n is large, the merged set is a proper subset of the entire dataset.) % In this merged set, each of the 'best solutions' of stage 1 are used as starting % values for csteps2 (sse below) iterations. Also here, the 10 best solutions are stored.%% Stage 3 depends on n, the total number of cases in the dataset. % If n <= 5000, all 10 preliminary solutions are iterated. % If n > 5000, only the best preliminary solution is iterated. % The number of iterations decreases to 1 according to n*p (If n*p <= 100,000 we % iterate csteps3 (sse below) times, whereas for n*p > 1,000,000 we take only one iteration step). % The maximum value for n (= number of observations) is:nmax=50000;% The maximum value for p (= number of variables) is:pmax=50;% To change the number of subdatasets and their size, the values of % maxgroup and nmini can be changed. maxgroup=5;nmini=300;% The number of iteration steps in stages 1,2 and 3 can be changed% by adapting the parameters csteps1, csteps2, and csteps3.csteps1=2;csteps2=2;csteps3=100;% dtrial : number of subsamples if not all (p+1)-subsets will be considered. dtrial=500;% The 0.975 quantile of the chi-squared distribution.chi2q=[5.02389,7.37776,9.34840,11.1433,12.8325,... 14.4494,16.0128,17.5346,19.0228,20.4831,21.920,23.337,... 24.736,26.119,27.488,28.845,30.191,31.526,32.852,34.170,... 35.479,36.781,38.076,39.364,40.646,41.923,43.194,44.461,... 45.722,46.979,48.232,49.481,50.725,51.966,53.203,54.437,... 55.668,56.896,58.120,59.342,60.561,61.777,62.990,64.201,... 65.410,66.617,67.821,69.022,70.222,71.420]; % Median of the chi-squared distribution. chimed=[0.454937,1.38629,2.36597,3.35670,4.35146,... 5.34812,6.34581,7.34412,8.34283,9.34182,10.34,11.34,12.34,... 13.34,14.34,15.34,16.34,17.34,18.34,19.34,20.34,21.34,22.34,... 23.34,24.34,25.34,26.34,27.34,28.34,29.34,30.34,31.34,32.34,... 33.34,34.34,35.34,36.34,37.34,38.34,39.34,40.34,41.34,42.34,... 43.34,44.34,45.34,46.34,47.33,48.33,49.33];seed=0;quan=0;alpha=0.75;file=0;% The value of the fields of the input argument OPTIONS are now determined.% If the user hasn't given a value to one of the fields, the default value % is assigned to it.if nargin==1 cor=0; ntrial=dtrial; lts=[];elseif isstruct(options) names=fieldnames(options); if strmatch('cor',names,'exact') cor=options.cor; else cor=0; end if strmatch('alpha',names,'exact') quan=1; alpha=options.alpha; end if strmatch('ntrial',names,'exact') ntrial=options.ntrial; else ntrial=dtrial; end if strmatch('lts',names,'exact') lts=options.lts; else lts=0; end else error('The second input argument is not a structure.')end if size(data,1)==1 data=data';end % Observations with missing or infinite values are ommitted. ok=all(isfinite(data),2);data=data(ok,:);n=size(data,1);p=size(data,2);% Some checks are now performed.if n==0 error('All observations have missing or infinite values.')endif n > nmax error(['The program allows for at most ' int2str(nmax) ' observations.'])endif p > pmax error(['The program allows for at most ' int2str(pmax) ' variables.'])endif n < 2*p error('Need at least 2*(number of variables) observations.')end% hmin is the minimum number of observations whose covariance determinant % will be minimized. hmin=quanf(0.5,n,p);if ~quan h=quanf(0.75,n,p);else h=quanf(alpha,n,p); if h < hmin error(['The MCD must cover at least ' int2str(hmin) ' observations.']) elseif h > n error('quan is greater than the number of non-missings and non-infinites.') endendfid=NaN;% The value of some fields of the output argument are already known.res.n_obs=n;res.quan=h;res.X=data;% Some initializations.res.flag=repmat(NaN,1,length(ok));raw.wt=repmat(NaN,1,length(ok));raw.robdist=repmat(NaN,1,length(ok));res.robdist=repmat(NaN,1,length(ok));res.mahalanobis=repmat(NaN,1,length(ok));if ~lts res.method=sprintf('\nMinimum Covariance Determinant Estimator.');else res.method=sprintf('\nThe function fastmcd.m is called to compute robust distances.');end correl=NaN;% z : if at least h observations lie on a hyperplane, then z contains the % coefficients of that plane. % weights : weights of the observations that are not excluded from the computations. % These are the observations that don't contain missing or infinite values.% bestobj : best objective value found.z(1:p)=0;weights=zeros(1,n);bestobj=inf;% The breakdown point of the MCD estimator is computed. if h==hmin %the breakdown point is maximal. breakdown=(h-p)*100/n;else breakdown=(n-h+1)*100/n;end% The classical estimates are computed. clasmean=mean(data);clascov=cov(data); if p < 5 eps=1e-12;elseif p <= 8 eps=1e-14;else eps=1e-16;end% The standardization of the data will now be performed.med=median(data);mad=sort(abs(data-repmat(med,n,1)));mad=mad(h,:);ii=min(find(mad < eps));if length(ii) % The h-th order statistic is zero for the ii-th variable. The array plane contains % all the observations which have the same value for the ii-th variable. plane=find(abs(data(:,ii)-med(ii)) < eps)'; meanplane=mean(data(plane,:)); weights(plane)=1; if p==1 res.flag=weights; raw.wt=weights; [raw.center,res.center]=deal(meanplane); [raw.cov,res.cov,raw.objective]=deal(0); if ~lts res.method=sprintf('\nUnivariate location and scale estimation.'); res.method=strvcat(res.method,sprintf('%g of the %g observations are identical.',length(plane),n)); disp(res.method); end else z(ii)=1; res.plane=z; covplane=cov(data(plane,:)); [raw.center,raw.cov,res.center,res.cov,raw.objective,raw.wt,res.flag,... res.method]=displ(3,length(plane),weights,n,p,meanplane,covplane,res.method,z,ok,... raw.wt,res.flag,file,fid,0,NaN,h,ii); end return
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -