/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.cluster;

import com.aliasi.util.Arrays;
import com.aliasi.util.Distance;
import com.aliasi.util.FeatureExtractor;
import com.aliasi.util.ObjectToDoubleMap;
import com.aliasi.util.SmallSet;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * A <code>KMeansClusterer</code> provides an implementation of
 * standard k-means clustering based on vectors constructed by feature
 * extractors.  An instance fixes a specific value of <code>k</code>,
 * the number of clusters returned.
 *
 * <h3>K-Means Clustering Algorithm</h3>
 *
 * <p>If fewer than <code>k</code> elements are clustered, a
 * clustering consisting of that many singleton clusters is returned.
 *
 * <p>The elements being clustered are first converted into feature
 * vectors using a feature extractor.  These feature vectors are then
 * evenly distributed among the clusters using random assignment.
 *
 * <p>Each cluster is represented by the centroid of the feature
 * vectors assigned to it.  Centroids are just dimension-wise averages:
 *
 * <pre>
 * centroid(v[0],...,v[n-1]) = (v[0] + ... + v[n-1]) / n</pre>
 *
 * and are thus vectors in the same space as the feature vectors
 * being clustered.
 *
 * <p>Feature vectors are always compared to cluster centroids
 * using Euclidean distance:
 *
 * <pre>
 * distance(x,y) = sqrt( (x - y) * (x - y) )
 *               = sqrt(<big><big>Σ</big></big><sub><sub>i</sub></sub> (x[i] - y[i])<sup>2</sup>)</pre>
 *
 * Thus any normalization should be done as part of the feature
 * extraction phase.  A common normalization is to divide each feature
 * vector by its length to produce unit vectors, which weights each
 * dimension equally and results in distances monotonically related to
 * cosines.
 *
 * <p>The algorithm then iteratively improves cluster assignments.
 * Each iteration consists of walking through each feature vector and
 * assigning it to the cluster with the closest centroid; ties are
 * broken non-deterministically.  The centroids are then recomputed at
 * the end of each iteration.  When no elements change clusters during
 * an iteration, the result is returned.  The algorithm will also
 * return if the maximum number of iterations is reached without
 * arriving at a fixed point.
 *
 * <p>These iterations have some fancy names in the literature.  The
 * whole approach is known as "Lloyd's Algorithm",
 * "Voronoi iteration", or simply "relaxation".
 * The final centroids form a so-called "Voronoi tessellation"
 * of the feature vector space of the objects being clustered.  This
 * tessellation is similar to k-nearest-neighbor classification (see
 * {@link com.aliasi.classify.KnnClassifier}).
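 *
 * <p>As an illustration only (the method below is not part of this
 * class's API; its name and the use of plain <code>double[]</code>
 * vectors are invented for the example), a single Lloyd iteration may
 * be sketched as follows:
 *
 * <pre>
 * // Assign each vector to its nearest centroid by squared Euclidean
 * // distance (sqrt is monotonic, so it may be skipped), then recompute
 * // centroids as dimension-wise averages.  In this sketch, a cluster
 * // left empty winds up with an all-zero centroid.
 * static int[] lloydIteration(double[][] vectors, double[][] centroids) {
 *     int[] assignment = new int[vectors.length];
 *     for (int i = 0; i &lt; vectors.length; ++i) {
 *         double bestDistance = Double.POSITIVE_INFINITY;
 *         for (int k = 0; k &lt; centroids.length; ++k) {
 *             double dist = 0.0;
 *             for (int d = 0; d &lt; vectors[i].length; ++d) {
 *                 double diff = vectors[i][d] - centroids[k][d];
 *                 dist += diff * diff;
 *             }
 *             if (dist &lt; bestDistance) {
 *                 bestDistance = dist;
 *                 assignment[i] = k;
 *             }
 *         }
 *     }
 *     int[] sizes = new int[centroids.length];
 *     for (double[] centroid : centroids)
 *         java.util.Arrays.fill(centroid,0.0);
 *     for (int i = 0; i &lt; vectors.length; ++i) {
 *         ++sizes[assignment[i]];
 *         for (int d = 0; d &lt; vectors[i].length; ++d)
 *             centroids[assignment[i]][d] += vectors[i][d];
 *     }
 *     for (int k = 0; k &lt; centroids.length; ++k)
 *         if (sizes[k] &gt; 0)
 *             for (int d = 0; d &lt; centroids[k].length; ++d)
 *                 centroids[k][d] /= sizes[k];
 *     return assignment;
 * }</pre>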
 *
 * <h3>K-means as Minimization</h3>
 *
 * <p>K-means clustering may be viewed as a gradient descent
 * solution to the minimization of distances to cluster centroids,
 * which we define here as error:
 *
 * <pre>
 * error(cs) = <big><big><big>Σ</big></big></big><sub>c in cs</sub> <big><big><big>Σ</big></big></big><sub>x in c</sub> distance(x,centroid(c))</pre>
 *
 * where <code>cs</code> is the set of clusters, <code>c</code> is a
 * cluster in <code>cs</code>, <code>centroid(c)</code>
 * is the vector representing the centroid of the cluster <code>c</code>, and
 * <code>distance(x,centroid(c))</code> is the distance between an
 * element <code>x</code> of cluster <code>c</code> and the cluster's
 * centroid.
 *
 * <h3>Convergence Guarantees</h3>
 *
 * <p>K-means clustering is guaranteed to converge to a solution
 * because each iteration reduces the error as defined in the previous
 * section.  There are only finitely many clusterings, so eventually
 * k-means must converge.  The only problem is that the size of the
 * finite set of clusterings is exponential in the number of input
 * elements.
 *
 * <h3>Relation to Gaussian Mixtures and Expectation/Maximization</h3>
 *
 * <p>The k-means clustering algorithm is implicitly based on a
 * multi-dimensional Gaussian with independent dimensions with their
 * own means and variances.  Estimates are carried out by maximum
 * likelihood; that is, no prior, or equivalently the fully
 * uninformative prior, is used.  Where k-means differs from standard
 * expectation/maximization (EM) is that k-means reweights the
 * expectations so that the closest centroid gets an expectation of 1.0
 * and all other centroids get an expectation of 0.0.  This approach
 * has been called "winner-take-all" EM in the EM literature.
 *
 * <h3>Local Minima and Multiple Runs</h3>
 *
 * <p>Like the EM algorithm, k-means clustering is highly sensitive
 * to the initial assignment of elements.  In practice, it is often
 * helpful to apply k-means clustering repeatedly to the same input
 * data, returning the clustering which optimizes some evaluation
 * metric.  This metric is typically within-cluster scatter, because
 * the number of dimensions is fixed.  Within-cluster scatter may
 * be computed with the static method {@link ClusterScore#withinClusterScatter(Set,Distance)}.
 *
 * <h3>Degenerate Solutions</h3>
 *
 * <p>In some cases, the iterative approach taken by k-means leads to
 * a solution in which not every cluster is populated.  This happens
 * during a step in which no feature vector is closest to a given
 * centroid.  In many of these cases, rerunning the clusterer will
 * find a solution with k clusters.
 *
 * <h3>Picking a Good <code>k</code></h3>
 *
 * <p>The number of clusters, <code>k</code>, may also be varied.  In
 * this case, new k-means clustering instances must be created, as
 * each uses a fixed number of clusters.
 *
 * <p>By varying <code>k</code>, the maximum number of clusters,
 * the within-cluster scatter may be compared across different choices
 * of <code>k</code>.  Typically, a value of <code>k</code> is chosen
 * at a knee of the within-cluster scatter scores, as sketched below.
 * There are automatic ways of performing this selection, though they
 * are heuristically rather than theoretically motivated.
 *
 * <p>In practice, it is technically possible, though unlikely, for
 * clusters to wind up with no elements in them.  This implementation
 * will simply return fewer clusters than the maximum specified in
 * this case.
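 *
 * <p>A rough sketch of this comparison follows.  It assumes a feature
 * extractor <code>extractor</code>, a set <code>elements</code> of
 * <code>String</code> inputs, a <code>Distance</code> instance
 * <code>distance</code> over those inputs, and a bound
 * <code>maxK</code>, all in scope; the exact generics should be
 * checked against {@link ClusterScore#withinClusterScatter(Set,Distance)}:
 *
 * <pre>
 * // cluster with k = 1..maxK and record within-cluster scatter;
 * // a knee in the resulting scores suggests a good choice of k
 * double[] scatters = new double[maxK + 1];
 * for (int k = 1; k &lt;= maxK; ++k) {
 *     KMeansClusterer&lt;String&gt; clusterer
 *         = new KMeansClusterer&lt;String&gt;(extractor, k, 100);
 *     Set&lt;Set&lt;String&gt;&gt; clustering = clusterer.cluster(elements);
 *     scatters[k] = ClusterScore.withinClusterScatter(clustering, distance);
 * }</pre>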
 *
 * <h3>Multiple Runs and Bootstrap Relatedness Estimates</h3>
 *
 * <p>Multiple runs may be used to provide bootstrap estimates of the
 * relatedness of any two elements.  Bootstrap estimates work by
 * subsampling the elements to cluster with replacement and then
 * running k-means on them.  This is repeated multiple times, and the
 * percentage of runs in which two elements fall in the same cluster
 * forms the bootstrap estimate of the likelihood that they are in the
 * same cluster.
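 *
 * <p>A rough sketch of such an estimate for two elements
 * <code>a</code> and <code>b</code> is shown below.  It is a
 * simplification: forcing the pair into every sample departs from
 * drawing each sample purely at random, and the clusterer, the list
 * <code>elementList</code>, and <code>numRuns</code> are assumed to
 * be in scope.
 *
 * <pre>
 * // estimate the chance that a and b land in the same cluster
 * // by clustering bootstrap samples drawn with replacement
 * java.util.Random random = new java.util.Random();
 * int sameCluster = 0;
 * for (int run = 0; run &lt; numRuns; ++run) {
 *     Set&lt;String&gt; sample = new HashSet&lt;String&gt;();
 *     sample.add(a);
 *     sample.add(b);
 *     for (int i = 0; i &lt; elementList.size(); ++i)
 *         sample.add(elementList.get(random.nextInt(elementList.size())));
 *     for (Set&lt;String&gt; cluster : clusterer.cluster(sample))
 *         if (cluster.contains(a) &amp;&amp; cluster.contains(b))
 *             ++sameCluster;
 * }
 * double estimate = sameCluster / (double) numRuns;</pre>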
 *
 * <h3>Efficiency</h3>
 *
 * <p>The main advantage k-means has over hierarchical clustering is
 * that it requires only linear time in the number of elements being
 * clustered.  There is a constant factor for the maximum number of
 * iterations allowed and another constant factor for the length of
 * the feature vectors.
 *
 * <h3>Implementation Notes</h3>
 *
 * <p>The current implementation is an inefficient brute-force
 * approach that computes the distance between every element and every
 * centroid on every iteration.  These distances are computed between
 * maps, which is itself not very efficient.  In the future, we may
 * substitute a more efficient version of k-means implemented to this
 * same interface.
 *
 * <p>More efficient implementations could use kd-trees to index the
 * points, could cache scores of input feature vectors against
 * clusters, and would only compute differential updates when elements
 * change cluster.  Threading could also be used to compute the
 * distances.
 *
 * <h3>References</h3>
 *
 * <ul>
 * <li>
 * MacQueen, J. B. 1967. Some methods for classification and analysis
 * of multivariate observations.  <i>Proceedings of the Fifth Berkeley
 * Symposium on Mathematical Statistics and Probability</i>.
 * University of California Press.
 * </li>
 * <li>
 * Andrew Moore's <a href="http://www.autonlab.org/tutorials/kmeans11.pdf">K-Means Tutorial</a>, including most of the mathematics.
 * </li>
 * <li>
 * Matteo Matteucci's <a href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html">K-Means Tutorial</a>, including a very nice <a href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html">interactive applet demo</a>.
 * </li>
 * <li>
 * Hastie, T., R. Tibshirani and J. H. Friedman. 2001.  <i>The
 * Elements of Statistical Learning</i>.  Springer-Verlag.
 * </li>
 * <li>
 * Wikipedia: <a href="http://en.wikipedia.org/wiki/K-means_algorithm">K-means Algorithm</a>.
 * </li>
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 3.1.1
 * @since   LingPipe2.0
 */
public class KMeansClusterer<E> implements Clusterer<E> {

    final FeatureExtractor<E> mFeatureExtractor;
    final int mNumClusters;
    final int mMaxIterations;

    /**
     * Construct a k-means clusterer with the specified feature
     * extractor, number of clusters, and limit on the number of
     * iterations to run optimization.  Clusters are initialized
     * by randomly shuffling all elements into the clusters.
     *
     * <p>If the number of iterations is set to zero, the result
     * will be a random balanced clustering of the specified size.
     *
     * @param featureExtractor Feature extractor for this clusterer.
     * @param numClusters Number of clusters to return.
     * @param maxIterations Maximum number of iterations during
     * optimization.
     * @throws IllegalArgumentException If the number of clusters is
     * less than 1, or if the maximum number of iterations is less
     * than 0.
     */
    public KMeansClusterer(FeatureExtractor<E> featureExtractor,
                           int numClusters,
                           int maxIterations) {
        if (numClusters < 1) {
            String msg = "Number of clusters must be positive."
                + " Found numClusters=" + numClusters;
            throw new IllegalArgumentException(msg);
        }
        if (maxIterations < 0) {
            // completed by analogy with the numClusters check above;
            // the source is truncated at this point
            String msg = "Max iterations must be non-negative."
                + " Found maxIterations=" + maxIterations;
            throw new IllegalArgumentException(msg);
        }
        mFeatureExtractor = featureExtractor;
        mNumClusters = numClusters;
        mMaxIterations = maxIterations;
    }

    // ... remainder of class elided in source ...

}
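// A minimal usage sketch (hypothetical; the element type, extractor,
// and input set below are invented, and clustering goes through the
// Clusterer interface's cluster method):
//
//     FeatureExtractor<CharSequence> extractor = ...;
//     KMeansClusterer<CharSequence> clusterer
//         = new KMeansClusterer<CharSequence>(extractor, 2, 50);
//     Set<CharSequence> texts = ...;
//     Set<Set<CharSequence>> clustering = clusterer.cluster(texts);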