kmeans.m

MATLAB implementation of the k-means algorithm

function [idx, C, sumD, D] = kmeans(X, k, varargin)
%KMEANS K-means clustering.
%   IDX = KMEANS(X, K) partitions the points in the N-by-P data matrix
%   X into K clusters.  This partition minimizes the sum, over all
%   clusters, of the within-cluster sums of point-to-cluster-centroid
%   distances.  Rows of X correspond to points, columns correspond to
%   variables.  KMEANS returns an N-by-1 vector IDX containing the
%   cluster indices of each point.  By default, KMEANS uses squared
%   Euclidean distances.
%
%   KMEANS treats NaNs as missing data, and ignores any rows of X that
%   contain NaNs. 
%
%   [IDX, C] = KMEANS(X, K) returns the K cluster centroid locations in
%   the K-by-P matrix C.
%
%   [IDX, C, SUMD] = KMEANS(X, K) returns the within-cluster sums of
%   point-to-centroid distances in the K-by-1 vector sumD.
%
%   [IDX, C, SUMD, D] = KMEANS(X, K) returns distances from each point
%   to every centroid in the N-by-K matrix D.
%
%   [ ... ] = KMEANS(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies
%   optional parameter name/value pairs to control the iterative algorithm
%   used by KMEANS.  Parameters are:
%
%   'Distance' - Distance measure, in P-dimensional space, that KMEANS
%      should minimize with respect to.  Choices are:
%          'sqEuclidean'  - Squared Euclidean distance (the default)
%          'cityblock'    - Sum of absolute differences, a.k.a. L1 distance
%          'cosine'       - One minus the cosine of the included angle
%                           between points (treated as vectors)
%          'correlation'  - One minus the sample correlation between points
%                           (treated as sequences of values)
%          'Hamming'      - Percentage of bits that differ (only suitable
%                           for binary data)
%
%   'Start' - Method used to choose initial cluster centroid positions,
%      sometimes known as "seeds".  Choices are:
%          'sample'  - Select K observations from X at random (the default)
%          'uniform' - Select K points uniformly at random from the range
%                      of X.  Not valid for Hamming distance.
%          'cluster' - Perform preliminary clustering phase on random 10%
%                      subsample of X.  This preliminary phase is itself
%                      initialized using 'sample'.
%           matrix   - A K-by-P matrix of starting locations.  In this case,
%                      you can pass in [] for K, and KMEANS infers K from
%                      the first dimension of the matrix.  You can also
%                      supply a 3D array, implying a value for 'Replicates'
%                      from the array's third dimension.
%
%   'Replicates' - Number of times to repeat the clustering, each with a
%      new set of initial centroids.  A positive integer, default is 1.
%
%   'EmptyAction' - Action to take if a cluster loses all of its member
%      observations.  Choices are:
%          'error'     - Treat an empty cluster as an error (the default)
%          'drop'      - Remove any clusters that become empty, and set
%                        the corresponding values in C and D to NaN.
%          'singleton' - Create a new cluster consisting of the one
%                        observation furthest from its centroid.
%
%   'Options' - Options for the iterative algorithm used to minimize the
%       fitting criterion, as created by STATSET.  Choices of STATSET
%       parameters are:
%
%          'Display'  - Level of display output.  Choices are 'off' (the
%                       default), 'iter', and 'final'.
%          'MaxIter'  - Maximum number of iterations allowed.  Default is 100.
%
%   'OnlinePhase' - Flag indicating whether KMEANS should perform an "on-line
%      update" phase in addition to a "batch update" phase.  The on-line phase
%      can be time consuming for large data sets, but guarantees a solution
%      that is a local minimum of the distance criterion, i.e., a partition of
%      the data where moving any single point to a different cluster increases
%      the total sum of distances.  'on' (the default) or 'off'.
%
%   Example:
%
%       X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
%       opts = statset('Display','final');
%       [cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
%                             'Replicates',5, 'Options',opts);
%       plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
%            X(cidx==2,1),X(cidx==2,2),'b.', ctrs(:,1),ctrs(:,2),'kx');
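%
%   A further illustrative sketch (an editorial addition, not part of the
%   original help text; C0, cidx2, ctrs2 and sumd2 are placeholder names):
%   seeding KMEANS with an explicit K-by-P matrix of starting centroids and
%   passing [] for K, as described under 'Start' above.
%
%       C0 = [1 1; -1 -1];                % one starting centroid per row
%       [cidx2, ctrs2, sumd2] = kmeans(X, [], 'Start',C0, ...
%                                      'EmptyAction','singleton');
%       sumd2                             % within-cluster sums of distances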
%
%   See also LINKAGE, CLUSTERDATA, SILHOUETTE.

%   KMEANS uses a two-phase iterative algorithm to minimize the sum of
%   point-to-centroid distances, summed over all K clusters.  The first phase
%   uses what the literature often describes as "batch" updates, where each
%   iteration consists of reassigning points to their nearest cluster
%   centroid, all at once, followed by recalculation of cluster centroids.
%   This phase occasionally (especially for small data sets) does not converge
%   to a solution that is a local minimum, i.e., a partition of the data where
%   moving any single point to a different cluster increases the total sum of
%   distances.  Thus, the batch phase can be thought of as providing a fast but
%   potentially only approximate solution as a starting point for the second
%   phase.  The second phase uses what the literature often describes as
%   "on-line" updates, where points are individually reassigned if doing so
%   will reduce the sum of distances, and cluster centroids are recomputed
%   after each reassignment.  Each iteration during this second phase consists
%   of one pass through all the points.  The on-line phase will converge to a
%   local minimum, although there may be other local minima with lower total
%   sum of distances.  The problem of finding the global minimum can only be
%   solved in general by an exhaustive (or clever, or lucky) choice of
%   starting points, but using several replicates with random starting points
%   typically results in a solution that is a global minimum.
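%
%   As an editorial illustration (not part of the original comments), one
%   batch iteration for the default squared Euclidean distance amounts to
%   the following; the real implementation is in the batchUpdate and
%   onlineUpdate functions invoked below, with distfun computing the
%   point-to-centroid distances:
%
%       D = distfun(X, C, 'sqeuclidean', iter);  % n-by-k distance matrix
%       [d, idx] = min(D, [], 2);                % reassign all points at once
%       for i = 1:k                              % then recompute each centroid
%           C(i,:) = mean(X(idx==i,:), 1);       % (empty clusters are handled
%       end                                      %  separately in the real code)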
%
% References:
%
%   [1] Seber, G.A.F., Multivariate Observations, Wiley, New York, 1984.
%   [2] Spath, H. (1985) Cluster Dissection and Analysis: Theory, FORTRAN
%       Programs, Examples, translated by J. Goldschmidt, Halsted Press,
%       New York, 226 pp.

%   Copyright 1993-2007 The MathWorks, Inc.
%   $Revision: 1.4.4.8 $  $Date: 2007/06/14 05:25:34 $

if nargin < 2
    error('stats:kmeans:TooFewInputs','At least two input arguments required.');
end

[ignore,wasnan,X] = statremovenan(X);
hadNaNs = any(wasnan);
if hadNaNs
    warning('stats:kmeans:MissingDataRemoved','Ignoring rows of X with missing data.');
end

% n points in p dimensional space
[n, p] = size(X);

pnames = {   'distance'  'start' 'replicates' 'emptyaction' 'onlinephase' 'options' 'maxiter' 'display'};
dflts =  {'sqeuclidean' 'sample'          []         'error'         'on'        []        []        []};
[eid,errmsg,distance,start,reps,emptyact,online,options,maxit,display] ...
                       = statgetargs(pnames, dflts, varargin{:});
if ~isempty(eid)
    error(sprintf('stats:kmeans:%s',eid),errmsg);
end

if ischar(distance)
    distNames = {'sqeuclidean','cityblock','cosine','correlation','hamming'};
    j = strmatch(lower(distance), distNames);
    if length(j) > 1
        error('stats:kmeans:AmbiguousDistance', ...
              'Ambiguous ''Distance'' parameter value:  %s.', distance);
    elseif isempty(j)
        error('stats:kmeans:UnknownDistance', ...
              'Unknown ''Distance'' parameter value:  %s.', distance);
    end
    distance = distNames{j};
    switch distance 
    case 'cosine'
        Xnorm = sqrt(sum(X.^2, 2));
        if any(min(Xnorm) <= eps(max(Xnorm)))
            error('stats:kmeans:ZeroDataForCos', ...
                  ['Some points have small relative magnitudes, making them ', ...
                   'effectively zero.\nEither remove those points, or choose a ', ...
                   'distance other than ''cosine''.']);
        end
        X = X ./ Xnorm(:,ones(1,p));
    case 'correlation'
        X = X - repmat(mean(X,2),1,p);
        Xnorm = sqrt(sum(X.^2, 2));
        if any(min(Xnorm) <= eps(max(Xnorm)))
            error('stats:kmeans:ConstantDataForCorr', ...
                  ['Some points have small relative standard deviations, making them ', ...
                   'effectively constant.\nEither remove those points, or choose a ', ...
                   'distance other than ''correlation''.']);
        end
        X = X ./ Xnorm(:,ones(1,p));
    case 'hamming'
        if ~all(ismember(X(:),[0 1]))
            error('stats:kmeans:NonbinaryDataForHamm', ...
                  'Non-binary data cannot be clustered using Hamming distance.');
        end
    end
else
    error('stats:kmeans:InvalidDistance', ...
          'The ''Distance'' parameter value must be a string.');
end

if ischar(start)
    startNames = {'uniform','sample','cluster'};
    j = strmatch(lower(start), startNames);
    if length(j) > 1
        error('stats:kmeans:AmbiguousStart', ...
              'Ambiguous ''Start'' parameter value:  %s.', start);
    elseif isempty(j)
        error('stats:kmeans:UnknownStart', ...
              'Unknown ''Start'' parameter value:  %s.', start);
    elseif isempty(k)
        error('stats:kmeans:MissingK', ...
              'You must specify the number of clusters, K.');
    end
    start = startNames{j};
    if strcmp(start, 'uniform')
        if strcmp(distance, 'hamming')
            error('stats:kmeans:UniformStartForHamm', ...
                  'Hamming distance cannot be initialized with uniform random values.');
        end
        Xmins = min(X,[],1);
        Xmaxs = max(X,[],1);
    end
elseif isnumeric(start)
    CC = start;
    start = 'numeric';
    if isempty(k)
        k = size(CC,1);
    elseif k ~= size(CC,1)
        error('stats:kmeans:MisshapedStart', ...
              'The ''Start'' matrix must have K rows.');
    elseif size(CC,2) ~= p
        error('stats:kmeans:MisshapedStart', ...
              'The ''Start'' matrix must have the same number of columns as X.');
    end
    if isempty(reps)
        reps = size(CC,3);
    elseif reps ~= size(CC,3)
        error('stats:kmeans:MisshapedStart', ...
              'The third dimension of the ''Start'' array must match the ''replicates'' parameter value.');
    end
    
    % Need to center explicit starting points for 'correlation'. (Re)normalization
    % for 'cosine'/'correlation' is done at each iteration.
    if isequal(distance, 'correlation')
        CC = CC - repmat(mean(CC,2),[1,p,1]);
    end
else
    error('stats:kmeans:InvalidStart', ...
          'The ''Start'' parameter value must be a string or a numeric matrix or array.');
end

if ischar(emptyact)
    emptyactNames = {'error','drop','singleton'};
    j = strmatch(lower(emptyact), emptyactNames);
    if length(j) > 1
        error('stats:kmeans:AmbiguousEmptyAction', ...
              'Ambiguous ''EmptyAction'' parameter value:  %s.', emptyact);
    elseif isempty(j)
        error('stats:kmeans:UnknownEmptyAction', ...
              'Unknown ''EmptyAction'' parameter value:  %s.', emptyact);
    end
    emptyact = emptyactNames{j};
else
    error('stats:kmeans:InvalidEmptyAction', ...
          'The ''EmptyAction'' parameter value must be a string.');
end

if ischar(online)
    j = strmatch(lower(online), {'on','off'});
    if length(j) > 1
        error('stats:kmeans:AmbiguousOnlinePhase', ...
              'Ambiguous ''OnlinePhase'' parameter value:  %s.', online);
    elseif isempty(j)
        error('stats:kmeans:UnknownOnlinePhase', ...
              'Unknown ''OnlinePhase'' parameter value:  %s.', online);
    end
    online = (j==1);
else
    error('stats:kmeans:InvalidOnlinePhase', ...
          'The ''OnlinePhase'' parameter value must be ''on'' or ''off''.');
end

% 'maxiter' and 'display' are grandfathered as separate param name/value pairs
if ~isempty(display)
    options = statset(options,'Display',display);
end
if ~isempty(maxit)
    options = statset(options,'MaxIter',maxit);
end

options = statset(statset('kmeans'), options);
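% Map the Display string onto a numeric level: 0 = 'off', 1 = 'notify',
% 2 = 'final', 3 = 'iter'.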
display = strmatch(lower(options.Display), {'off','notify','final','iter'}) - 1;
maxit = options.MaxIter;

if ~(isscalar(k) && isnumeric(k) && isreal(k) && k > 0 && (round(k)==k))
    error('stats:kmeans:InvalidK', ...
          'K must be a positive integer value.');
% elseif k == 1
    % this special case works automatically
elseif n < k
    error('stats:kmeans:TooManyClusters', ...
          'X must have more rows than the number of clusters.');
end

% Assume one replicate
if isempty(reps)
    reps = 1;
end

%
% Done with input argument processing, begin clustering
%

dispfmt = '%6d\t%6d\t%8d\t%12g';
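% format string for the per-iteration progress line printed when 'Display' is 'iter'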
if online, Del = NaN(n,k); end % reassignment criterion

totsumDBest = Inf;
emptyErrCnt = 0;
for rep = 1:reps
    switch start
    case 'uniform'
        C = unifrnd(Xmins(ones(k,1),:), Xmaxs(ones(k,1),:));
        % For 'cosine' and 'correlation', these are uniform inside a subset
        % of the unit hypersphere.  Still need to center them for
        % 'correlation'.  (Re)normalization for 'cosine'/'correlation' is
        % done at each iteration.
        if isequal(distance, 'correlation')
            C = C - repmat(mean(C,2),1,p);
        end
        if isa(X,'single')
            C = single(C);
        end
    case 'sample'
        C = X(randsample(n,k),:);
        if ~isfloat(C)      % X may be logical
            C = double(C);
        end
    case 'cluster'
        Xsubset = X(randsample(n,floor(.1*n)),:);
        [dum, C] = kmeans(Xsubset, k, varargin{:}, 'start','sample', 'replicates',1);
    case 'numeric'
        C = CC(:,:,rep);
    end
    
    % Compute the distance from every point to each cluster centroid and the
    % initial assignment of points to clusters
    D = distfun(X, C, distance, 0);
    [d, idx] = min(D, [], 2);
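    % m(i) is the number of points currently assigned to cluster i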
    m = accumarray(idx,1,[k,1]);

    try % catch empty cluster errors and move on to next rep
        
        % Begin phase one:  batch reassignments
        converged = batchUpdate();
        
        % Begin phase two:  single reassignments
        if online
            converged = onlineUpdate();
        end
        
        if ~converged
            warning('stats:kmeans:FailedToConverge', ...
                    'Failed to converge in %d iterations%s.',maxit,repsMsg(rep,reps));
        end

        % Calculate cluster-wise sums of distances
        nonempties = find(m>0);
        D(:,nonempties) = distfun(X, C(nonempties,:), distance, iter);
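        % pick out each point's distance to its assigned cluster via linear indexing into D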
        d = D((idx-1)*n + (1:n)');
        sumD = accumarray(idx,d,[k,1]);
        totsumD = sum(sumD);
        
        if display > 1 % 'final' or 'iter'
            disp(sprintf('%d iterations, total sum of distances = %g',iter,totsumD));
        end

        % Save the best solution so far
        if totsumD < totsumDBest
            totsumDBest = totsumD;
            idxBest = idx;
            Cbest = C;
            sumDBest = sumD;
            if nargout > 3
                Dbest = D;
            end
        end

    % If an empty cluster error occurs in one of multiple replicates, catch
    % it, warn, and move on to the next replicate.  Error only when all
    % replicates fail.  Rethrow any other kind of error.
    catch
        err = lasterror;
        if reps == 1 || ~isequal(err.identifier,'stats:kmeans:EmptyCluster')
            rethrow(err);
        else
            emptyErrCnt = emptyErrCnt + 1;
            warning('stats:kmeans:EmptyCluster', ...
                    'Replicate %d terminated: empty cluster created at iteration %d.',rep,iter);
            if emptyErrCnt == reps
                error('stats:kmeans:EmptyClusterAllReps', ...
                      'An empty cluster error occurred in every replicate.');
            end
        end
    end % catch
    
end % replicates

% Return the best solution
idx = idxBest;
C = Cbest;
sumD = sumDBest;
if nargout > 3
    D = Dbest;
end

if hadNaNs
    idx = statinsertnan(wasnan, idx);
end


%------------------------------------------------------------------
