treec.m
%TREEC Build a decision tree classifier
% 
%   W = TREEC(A,CRIT,PRUNE,T)
% 
% Computation of a decision tree classifier out of a dataset A using 
% a binary splitting criterion CRIT:
%   INFCRIT  - information gain
%   MAXCRIT  - purity (default)
%   FISHCRIT - Fisher criterion
% 
% Pruning is defined by PRUNE:
%   PRUNE = -1 pessimistic pruning as defined by Quinlan.
%   PRUNE = -2 testset pruning using the dataset T, or, if not
%              supplied, an artificially generated testset of 5 x size of
%              the training set based on parzen density estimates.
%              See PARZENML and GENDATP.
%   PRUNE =  0 no pruning (default).
%   PRUNE >  0 early pruning, e.g. prune = 3
%   PRUNE = 10 causes heavy pruning.
% 
% If CRIT or PRUNE are set to NaN they are optimised by REGOPTC.
%
% See also DATASETS, MAPPINGS, TREE_MAP, REGOPTC

% Copyright: R.P.W. Duin, r.p.w.duin@prtools.org
% Faculty EWI, Delft University of Technology
% P.O. Box 5031, 2600 GA Delft, The Netherlands

% $Id: treec.m,v 1.8 2008/01/25 10:13:29 duin Exp $

function w = treec(a,crit,prune,t)

  prtrace(mfilename);

  % When no input data is given, an empty tree is defined:
  if nargin == 0 | isempty(a)
    if nargin < 2, w = mapping('treec');
    elseif nargin < 3, w = mapping('treec',{crit});
    elseif nargin < 4, w = mapping('treec',{crit,prune});
    else, w = mapping('treec',{crit,prune,t});
    end
    w = setname(w,'Decision Tree');
    return
  end

  if nargin < 3, prune = []; end
  if nargin < 2, crit = []; end

  parmin_max = [1,3;-1,10];
  optcrit = inf;

  if isnan(crit) & isnan(prune)   % optimize criterion and pruning, grid search
    global REGOPT_OPTCRIT REGOPT_PARS
    for n = 1:3
      defs = {n,0};
      v = regoptc(a,mfilename,{crit,prune},defs,[2],parmin_max,testc([],'soft'),[0,0]);
      if REGOPT_OPTCRIT < optcrit
        w = v;
        optcrit = REGOPT_OPTCRIT;
        regoptpars = REGOPT_PARS;
      end
    end
    REGOPT_PARS = regoptpars;
  elseif isnan(crit)              % optimize criterion
    defs = {1,0};
    w = regoptc(a,mfilename,{crit,prune},defs,[1],parmin_max,testc([],'soft'),[0,0]);
  elseif isnan(prune)             % optimize pruning
    defs = {1,0};
    w = regoptc(a,mfilename,{crit,prune},defs,[2],parmin_max,testc([],'soft'),[0,0]);
  else                            % training for given parameters
    islabtype(a,'crisp');
    isvaldfile(a,1,2);            % at least 1 object per class, 2 classes
    a = testdatasize(a);

    % First get some useful parameters:
    [m,k,c] = getsize(a);
    nlab = getnlab(a);

    % Define the splitting criterion:
    if nargin == 1 | isempty(crit), crit = 2; end
    if ~isstr(crit)
      if crit == 0 | crit == 1, crit = 'infcrit';
      elseif crit == 2, crit = 'maxcrit';
      elseif crit == 3, crit = 'fishcrit';
      else, error('Unknown criterion value');
      end
    end

    % Now the training can really start:
    if (nargin == 1) | (nargin == 2)
      tree = maketree(+a,nlab,c,crit);
    elseif nargin > 2
      % We have to apply a pruning strategy:
      if prune == -1, prune = 'prunep'; end
      if prune == -2, prune = 'prunet'; end
      % The strategy can be prunep/prunet:
      if isstr(prune)
        tree = maketree(+a,nlab,c,crit);
        if prune == 'prunep'
          tree = prunep(tree,a,nlab);
        elseif prune == 'prunet'
          if nargin < 4
            t = gendatp(a,5*sum(nlab==1));
          end
          tree = prunet(tree,t);
        else
          error('unknown pruning option defined');
        end
      else
        % otherwise the tree is just cut after level 'prune'
        tree = maketree(+a,nlab,c,crit,prune);
      end
    else
      error('Wrong number of parameters')
    end

    % Store the results:
    w = mapping('tree_map','trained',{tree,1},getlablist(a),k,c);
    w = setname(w,'Decision Tree');
    w = setcost(w,a);
  end

return
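% Example (an illustrative sketch, not part of the original help text; it
% assumes PRTools is on the path and that the dataset generator GENDATB and
% the error estimator TESTC are available):
%
%   a = gendatb([50 50]);       % banana-shaped two-class training set
%   w = treec(a,'infcrit',3);   % tree with information gain and early pruning
%   d = a*w;                    % apply the trained tree to (here) the training set
%   e = testc(d);               % estimate the classification error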
%MAKETREE General tree building algorithm
% 
%   tree = maketree(A,nlab,c,crit,stop)
% 
% Constructs a binary decision tree using the criterion function
% specified in the string crit ('maxcrit', 'fishcrit' or 'infcrit' 
% (default)) for a set of objects A. stop is an optional argument 
% defining early stopping according to the Chi-squared test as 
% defined by Quinlan [1]:
%   stop =  0 (default) gives a perfect tree (no pruning),
%   stop =  3 gives a pruned version,
%   stop = 10 a heavily pruned version.
% 
% Definition of the resulting tree:
% 
%   tree(n,1)     - feature number to be used in node n
%   tree(n,2)     - threshold t to be used
%   tree(n,3)     - node to be processed if value <= t
%   tree(n,4)     - node to be processed if value > t
%   tree(n,5:4+c) - aposteriori probabilities for all classes in
%                   node n
% 
% If tree(n,3) == 0, stop, class in tree(n,1)
% 
% This is a low-level routine called by treec.
% 
% See also infstop, infcrit, maxcrit, fishcrit and mapt.

% Authors: Guido te Brake, TWI/SSOR, Delft University of Technology
%          R.P.W. Duin, TN/PH, Delft University of Technology
% Copyright: R.P.W. Duin, duin@ph.tn.tudelft.nl
% Faculty of Applied Physics, Delft University of Technology
% P.O. Box 5046, 2600 GA Delft, The Netherlands

function tree = maketree(a,nlab,c,crit,stop)

  prtrace(mfilename);

  [m,k] = size(a);
  if nargin < 5, stop = 0; end;
  if nargin < 4, crit = []; end;
  if isempty(crit), crit = 'infcrit'; end;

  % Construct the tree:
  % When all objects have the same label, create an end-node:
  if all([nlab == nlab(1)])
    % Avoid giving 0-1 probabilities, but 'regularize' them a bit using
    % a 'uniform' Bayesian prior:
    p = ones(1,c)/(m+c);
    p(nlab(1)) = (m+1)/(m+c);
    tree = [nlab(1),0,0,0,p];
  else
    % now the tree is recursively constructed further:
    [f,j,t] = feval(crit,+a,nlab);   % use desired split criterion
    if isempty(t)
      crt = 0;
    else
      crt = infstop(+a,nlab,j,t);    % use desired early stopping criterion
    end
    p = sum(expandd(nlab),1);
    if length(p) < c, p = [p,zeros(1,c-length(p))]; end

    % When the stop criterion is not reached yet, we recursively split
    % further:
    if crt > stop
      % Make the left branch:
      J = find(a(:,j) <= t);
      tl = maketree(+a(J,:),nlab(J),c,crit,stop);
      % Make the right branch:
      K = find(a(:,j) > t);
      tr = maketree(+a(K,:),nlab(K),c,crit,stop);
      % Fix the node labelings before the branches can be 'glued'
      % together to a big tree:
      [t1,t2] = size(tl);
      tl = tl + [zeros(t1,2) tl(:,[3 4])>0 zeros(t1,c)];
      [t3,t4] = size(tr);
      tr = tr + (t1+1)*[zeros(t3,2) tr(:,[3 4])>0 zeros(t3,c)];
      % Make the complete tree: the split-node and the branches:
      tree = [[j,t,2,t1+2,(p+1)/(m+c)]; tl; tr];
    else
      % We reached the stop criterion, so make an end-node:
      [mt,cmax] = max(p);
      tree = [cmax,0,0,0,(p+1)/(m+c)];
    end
  end

return
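% The following is an illustrative sketch (an assumption added for clarity,
% not part of the original file) of how the tree matrix defined above could
% be traversed by hand for a single feature vector x; in PRTools this is
% done by TREE_MAP:
%
%   n = 1;                          % start at the root node
%   while tree(n,3) ~= 0            % tree(n,3) == 0 marks an end-node
%     if x(tree(n,1)) <= tree(n,2)  % compare the selected feature with the threshold
%       n = tree(n,3);              % descend into the left branch
%     else
%       n = tree(n,4);              % descend into the right branch
%     end
%   end
%   class = tree(n,1);              % class label stored in the end-node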
%MAXCRIT Maximum entropy criterion for best feature split.
% 
%   [f,j,t] = maxcrit(A,nlabels)
% 
% Computes the value of the maximum purity f for all features over 
% the data set A given its numeric labels. j is the optimum feature,
% t its threshold. This is a low level routine called for constructing
% decision trees.
% 
% [1] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 
%     Classification and regression trees, Wadsworth, California, 1984.

% Copyright: R.P.W. Duin, duin@ph.tn.tudelft.nl
% Faculty of Applied Physics, Delft University of Technology
% P.O. Box 5046, 2600 GA Delft, The Netherlands

function [f,j,t] = maxcrit(a,nlab)

  prtrace(mfilename);

  [m,k] = size(a);
  c = max(nlab);

  % -variable T is an (2c) x k matrix containing:
  %    minimum feature values class 1
  %    maximum feature values class 1
  %    minimum feature values class 2
  %    maximum feature values class 2
  %    etc.
  % -variable R (same size) contains:
  %    fraction of objects which is < min. class 1.
  %    fraction of objects which is > max. class 1.
  %    fraction of objects which is < min. class 2.
  %    fraction of objects which is > max. class 2.
  %    etc.
  % These values are collected and computed in the next loop:
  T = zeros(2*c,k);
  R = zeros(2*c,k);
  for j = 1:c
    L = (nlab == j);
    if sum(L) == 0
      T([2*j-1:2*j],:) = zeros(2,k);
      R([2*j-1:2*j],:) = zeros(2,k);
    else
      T(2*j-1,:) = min(a(L,:),[],1);
      R(2*j-1,:) = sum(a < ones(m,1)*T(2*j-1,:),1);
      T(2*j,:)   = max(a(L,:),[],1);
      R(2*j,:)   = sum(a > ones(m,1)*T(2*j,:),1);
    end
  end
  % From R the purity index for all features is computed:
  G = R .* (m-R);
  % and the best feature is found:
  [gmax,tmax] = max(G,[],1);
  [f,j] = max(gmax);
  Tmax = tmax(j);
  if Tmax ~= 2*floor(Tmax/2)
    t = (T(Tmax,j) + max(a(find(a(:,j) < T(Tmax,j)),j)))/2;
  else
    t = (T(Tmax,j) + min(a(find(a(:,j) > T(Tmax,j)),j)))/2;
  end
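% Worked example (an illustrative sketch added for clarity; maxcrit is a
% subfunction of treec.m and is not callable from outside this file, so the
% call below is hypothetical): for a one-dimensional two-class sample
%
%   a    = [1 2 3 6 7 8]';        % single feature, six objects
%   nlab = [1 1 1 2 2 2]';        % numeric class labels
%   [f,j,t] = maxcrit(a,nlab);    % gives f = 9, j = 1, t = 4.5
%
% both class extremes (maximum of class 1 at 3, minimum of class 2 at 6)
% separate all three objects of the other class, so G = R.*(m-R) peaks at
% 3*3 = 9 and the threshold is placed halfway between 3 and 6.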