<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0055)http://www.cse.unsw.edu.au/~akgu380/Kmeans/K-means.html -->
<HTML><HEAD><TITLE>The k-Means Procedure</TITLE>
<META content="text/html; charset=iso-8859-1" http-equiv=Content-Type>
</HEAD>
<BODY link=#0000ff vLink=#800080><B><FONT face=Arial><FONT size=+4>K-means 
algorithm</FONT></FONT></B> <BR><A 
href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/K-means.html">Description</A> | 
<A href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/KmeansDemo.html">Demo</A> | 
<A href="http://www.cse.unsw.edu.au/~akgu380/Kmeans/KmeansCode.html">Code</A> | 
<P>This is a simple non-hierarchical clustering algorithm that requires the 
number of clusters as an input. It starts by creating a random centroid for 
each cluster. The data are then classified according to their minimum distance 
to the centroids. Each centroid's position is recalculated every time a 
component is added to its cluster, and this continues until all the components 
are grouped into the required number of clusters and the centroids no longer 
change between successive calculations. 
<P><B><I><FONT face=Arial>The k-Means Procedure</FONT></I></B> 
<P>Suppose that we have n example feature vectors x<SUB>1</SUB>, x<SUB>2</SUB>, 
..., x<SUB>n</SUB> all from the same class, and we know that they fall into k 
compact clusters, k &lt; n. Let m<SUB>i</SUB> be the mean of the vectors in 
Cluster i. If the clusters are well separated, we can use a minimum-distance 
classifier to separate them. That is, we can say that x is in Cluster i if 
|| x - m<SUB>i</SUB> || is the minimum of all the k distances. This suggests the following 
procedure for finding the k means: 
<P>Make initial guesses for the means m<SUB>1</SUB>, m<SUB>2</SUB>, ..., 
m<SUB>k</SUB> 
<P>Until there are no changes in any mean 
<DIR>
<DIR>Use the estimated means to classify the examples into clusters using a 
distance measure 
<P>For i from 1 to k 
<DIR>
<DIR>Replace m<SUB>i</SUB> with the mean of all of the examples for Cluster 
i</DIR></DIR>end_for</DIR></DIR>end_until 
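<P>The procedure above can be sketched in Python (a minimal illustration; the 
function and helper names are my own, not part of the original page): 

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans(points, k, max_iter=100):
    """Basic k-means on a list of numeric tuples; returns (means, clusters)."""
    # Initial guesses for the means: k randomly chosen examples.
    means = random.sample(points, k)
    for _ in range(max_iter):
        # Use the estimated means to classify the examples into clusters.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, means[i]))
            clusters[nearest].append(p)
        # For i from 1 to k: replace m_i with the mean of Cluster i
        # (keep the old mean if the cluster is empty; see "dead clusters").
        new_means = [centroid(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:  # no changes in any mean: done
            return means, clusters
        means = new_means
    return means, clusters
```

<P>On two well-separated groups of points this recovers the groups and their 
centroids regardless of which examples the random initialization picks. 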
<P>This is a simple version of the k-means procedure. It can be viewed as a 
greedy algorithm for partitioning the n examples into k clusters so as to 
minimize the sum of the squared distances to the cluster centers. It does have 
some weaknesses. 
<UL>
  <LI>The way to initialize the means was not specified. One popular way to 
  start is to randomly choose k of the examples. 
  <LI>The results produced depend on the initial values for the means, and it 
  frequently happens that sub-optimal partitions are found. The standard 
  solution is to try a number of different starting points. 
  <LI>It can happen that the set of examples closest to m<SUB>i</SUB> is empty, 
  so that m<SUB>i</SUB> cannot be updated. This is an annoyance that must be 
  handled in an implementation, but that we shall ignore. 
  <LI>The results depend on the metric used to measure || x - m<SUB>i</SUB> ||. A popular 
  solution is to normalize each variable by its standard deviation, though this 
  is not always desirable. 
  <LI>The results depend on the value of k. </LI></UL><B><I><FONT 
face=Arial>Implementation issues</FONT></I></B> 
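<P>The standard remedy for initialization sensitivity noted above, trying a 
number of different starting points, can be sketched as follows (the 
`run_kmeans` callable and all names here are illustrative, not from the 
original page): 

```python
def sse(points, means):
    """Sum of squared distances from each example to its nearest mean,
    the quantity that k-means greedily minimizes."""
    return sum(min(sum((x - y) ** 2 for x, y in zip(p, m)) for m in means)
               for p in points)

def best_of_restarts(points, k, run_kmeans, restarts=10):
    """Run k-means several times from random starts and keep the run
    with the lowest sum of squared distances."""
    best_means, best_clusters = None, None
    for _ in range(restarts):
        means, clusters = run_kmeans(points, k)
        if best_means is None or sse(points, means) < sse(points, best_means):
            best_means, best_clusters = means, clusters
    return best_means, best_clusters
```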
<P>An implementation of this algorithm needs to pay attention to the following 
issues. 
<P><B>Initialization of k-means</B>: Since the basic k-means algorithm does not 
guarantee finding a global optimum and only finds a local minimum, the solution 
obtained often depends heavily on how it is initialized. A number of different 
heuristics are used to initialize the algorithm. Two popular methods are - 
<UL>
  <LI>assign the i<SUP>th</SUP> sample to class (i modulo k). 
  <LI>randomly assign the data to k different classes. </LI></UL><B>Distance 
measure</B>: The algorithm spends most of its time computing distances, so the 
measure chosen has a significant impact on the total running time. 
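<P>Since the assignment step only needs the <I>nearest</I> center, one common 
time-saving choice (a sketch, not prescribed by the original page) is squared 
Euclidean distance, which yields the same nearest center while skipping the 
square root: 

```python
def nearest_center(p, centers):
    """Index of the center nearest to p. Squared Euclidean distance is
    monotone in the true distance, so the argmin is identical and the
    square root can be skipped."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda i: d2(p, centers[i]))
```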
<P><B>Number of iterations</B>: Theoretically, k-means should terminate when no 
more samples are changing classes. There are proofs of termination for k-means, 
which rely on the fact that both steps of k-means (assign sample to nearest 
centers, move centers to cluster centroids) reduce variance. Typically, two 
criteria are used - 
<UL>
  <LI>terminate after a predefined fixed number of iterations, or 
  <LI>terminate when fewer than n samples change classes. </LI></UL><B>Dead 
clusters</B>: Occasionally a centroid ends up with no members after an 
iteration. Such a cluster needs to be handled with some heuristic. 
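<P>One simple heuristic (illustrative, not prescribed by the original text) is 
to reseed a dead cluster's centroid to a randomly chosen data point: 

```python
import random

def reseed_dead_clusters(points, means, clusters):
    """If a cluster ended the iteration empty, move its mean to a random
    data point so it can attract members in the next iteration."""
    for i, cluster in enumerate(clusters):
        if not cluster:
            means[i] = random.choice(points)
    return means
```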
<P><B>Number of clusters</B>: Usually, the number of clusters required is an 
input variable. However, a heuristic cost function based on inter- and 
intra-cluster variance may be used to choose k in a model-selection scenario. 
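<P>The intra-cluster part of such a cost function can be as simple as the total 
within-cluster variance, computed for a range of k values and inspected for a 
"knee" (a sketch with illustrative names): 

```python
def within_cluster_sse(clusters, means):
    """Total within-cluster sum of squared distances to the centroids.
    This always decreases as k grows, so one looks for the k where the
    decrease levels off rather than for the minimum itself."""
    return sum(sum((x - y) ** 2 for x, y in zip(p, m))
               for m, cluster in zip(means, clusters) for p in cluster)
```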
<P>
<HR width="100%">
<P>Comments to <A 
href="mailto:akgu380@cse.unsw.edu.au">akgu380@cse.unsw.edu.au</A> </P></BODY></HTML>
