📄 regem.m
字号:
function [X, M, C, Xerr] = regem(X, options)%REGEM Imputation of missing values with regularized EM algorithm.%% [X, M, C, Xerr] = REGEM(X, OPTIONS) replaces missing values% (NaNs) in the data matrix X with imputed values. REGEM% returns% % X, the data matrix with imputed values substituted for NaNs, % M, the estimated mean of X, % C, the estimated covariance matrix of X,% Xerr, an estimated standard error of the imputed values.% % Missing values are imputed with a regularized expectation% maximization (EM) algorithm. In an iteration of the EM algorithm,% given estimates of the mean and of the covariance matrix are% revised in three steps. First, for each record X(i,:) with% missing values, the regression parameters of the variables with% missing values on the variables with available values are% computed from the estimates of the mean and of the covariance% matrix. Second, the missing values in a record X(i,:) are filled% in with their conditional expectation values given the available% values and the estimates of the mean and of the covariance% matrix, the conditional expectation values being the product of% the available values and the estimated regression% coefficients. Third, the mean and the covariance matrix are% re-estimated, the mean as the sample mean of the completed% dataset and the covariance matrix as the sum of the sample% covariance matrix of the completed dataset and an estimate of the% conditional covariance matrix of the imputation error. %% In the regularized EM algorithm, the parameters of the regression% models are estimated by a regularized regression method. By% default, the parameters of the regression models are estimated by% an individual ridge regression for each missing value in a% record, with one regularization parameter (ridge parameter) per% missing value. Optionally, the parameters of the regression% models can be estimated by a multiple ridge regression for each% record with missing values, with one regularization parameter per% record with missing values. The regularization parameters for the% ridge regressions are selected as the minimizers of the% generalized cross-validation (GCV) function. As another option,% the parameters of the regression models can be estimated by% truncated total least squares. The truncation parameter, a% discrete regularization parameter, is fixed and must be given as% an input argument. The regularized EM algorithm with truncated% total least squares is faster than the regularized EM algorithm% with with ridge regression, requiring only one eigendecomposition% per iteration instead of one eigendecomposition per record and% iteration. But an adaptive choice of truncation parameter has not% been implemented for truncated total least squares. So the% truncated total least squares regressions can be used to compute% initial values for EM iterations with ridge regressions, in which% the regularization parameter is chosen adaptively.% % As default initial condition for the imputation algorithm, the% mean of the data is computed from the available values, mean% values are filled in for missing values, and a covariance matrix% is estimated as the sample covariance matrix of the completed% dataset with mean values substituted for missing% values. Optionally, initial estimates for the missing values and% for the covariance matrix estimate can be given as input% arguments.% % The OPTIONS structure specifies parameters in the algorithm:%% Field name Parameter Default%% OPTIONS.regress Regression procedure to be used: 'mridge'% 'mridge': multiple ridge regression% 'iridge': individual ridge regressions% 'ttls': truncated total least squares % regression % % OPTIONS.stagtol Stagnation tolerance: quit when 5e-3 % consecutive iterates of the missing% values are so close that% norm( Xmis(it)-Xmis(it-1) ) % <= stagtol * norm( Xmis(it-1) )% % OPTIONS.maxit Maximum number of EM iterations. 30% % OPTIONS.inflation Inflation factor for the residual 1 % covariance matrix. Because of the % regularization, the residual covariance % matrix underestimates the conditional % covariance matrix of the imputation % error. The inflation factor is to correct % this underestimation. The update of the % covariance matrix estimate is computed % with residual covariance matrices % inflated by the factor OPTIONS.inflation,% and the estimates of the imputation error% are inflated by the same factor. %% OPTIONS.disp Diagnostic output of algorithm. Set to 1% zero for no diagnostic output.%% OPTIONS.regpar Regularization parameter. not set % For ridge regression, set regpar to % sqrt(eps) for mild regularization; leave % regpar unset for GCV selection of% regularization parameters.% For TTLS regression, regpar must be set% and is a fixed truncation parameter. %% OPTIONS.relvar_res Minimum relative variance of residuals. 5e-2% From the parameter OPTIONS.relvar_res, a% lower bound for the regularization % parameter is constructed, in order to % prevent GCV from erroneously choosing % too small a regularization parameter.% % OPTIONS.minvarfrac Minimum fraction of total variation in 0% standardized variables that must be % retained in the regularization.% From the parameter OPTIONS.minvarfrac, % an approximate upper bound for the % regularization parameter is constructed. % The default value OPTIONS.minvarfrac = 0 % essentially corresponds to no upper bound % for the regularization parameter. % % OPTIONS.Xmis0 Initial imputed values. Xmis0 is a not set% (possibly sparse) matrix of the same % size as X with initial guesses in place% of the NaNs in X. % % OPTIONS.C0 Initial estimate of covariance matrix. not set% If no initial covariance matrix C0 is % given but initial estimates Xmis0 of the % missing values are given, the sample % covariance matrix of the dataset % completed with initial imputed values is % taken as an initial estimate of the % covariance matrix. % % OPTIONS.Xcmp Display the weighted rms difference not set% between the imputed values and the % values given in Xcmp, a matrix of the % same size as X but without missing % values. By default, REGEM displays % the rms difference between the imputed % values at consecutive iterations. The % option of displaying the difference % between the imputed values and reference % values exists for testing purposes.%% OPTIONS.neigs Number of eigenvalue-eigenvector pairs not set% to be computed for TTLS regression. % By default, all nonzero eigenvalues and % corresponding eigenvectors are computed. % By computing fewer (neigs) eigenvectors, % the computations can be accelerated, but % the residual covariance matrices become % inaccurate. Consequently, the residual % covariance matrices underestimate the % imputation error conditional covariance % matrices more and more as neigs is % decreased. % References: % [1] T. Schneider, 2001: Analysis of incomplete climate data:% Estimation of mean values and covariance matrices and% imputation of missing values. Journal of Climate, 14,% 853--871. % [2] R. J. A. Little and D. B. Rubin, 1987: Statistical% Analysis with Missing Data. Wiley Series in Probability% and Mathematical Statistics. (For EM algorithm.) % [3] P. C. Hansen, 1997: Rank-Deficient and Discrete Ill-Posed% Problems: Numerical Aspects of Linear Inversion. SIAM% Monographs on Mathematical Modeling and Computation.% (For regularization techniques, including the selection of % regularization parameters.) error(nargchk(1, 2, nargin)) % check number of input arguments if ndims(X) > 2, error('X must be vector or 2-D array.'); end % if X is a vector, make sure it is a column vector (a single variable) if length(X)==prod(size(X)) X = X(:); end [n, p] = size(X); % number of degrees of freedom for estimation of covariance matrix dofC = n - 1; % use degrees of freedom correction % ============== process options ======================== if nargin ==1 | isempty(options) fopts = []; else fopts = fieldnames(options); end % initialize options structure for regression modules optreg = []; if strmatch('regress', fopts) regress = lower(options.regress); switch regress case {'mridge', 'iridge'} % OK case {'ttls'} if ~strmatch('regpar', fopts) error('Truncation parameter for TTLS regression must be given.') else trunc = min([options.regpar, n-1, p]); end if strmatch('neigs', fopts) neigs = options.neigs; else neigs = min(n-1, p); end otherwise error(['Unknown regression method ', regress]) end else regress = 'mridge'; end if strmatch('stagtol', fopts) stagtol = options.stagtol; else stagtol = 5e-3; end if strmatch('maxit', fopts) maxit = options.maxit; else maxit = 30; end if strmatch('inflation', fopts) inflation = options.inflation; else inflation = 1; end if strmatch('relvar_res', fopts) optreg.relvar_res = options.relvar_res; else optreg.relvar_res = 5e-2; end if strmatch('minvarfrac', fopts) optreg.minvarfrac = options.minvarfrac; else optreg.minvarfrac = 0; end h_given = 0; if strmatch('regpar', fopts) h_given = 1; optreg.regpar = options.regpar; if strmatch(regress, 'iridge') regress = 'mridge';
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -