<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0084)http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html -->
<HTML><HEAD><TITLE>Clustering - Hierarchical</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1106" name=GENERATOR></HEAD>
<BODY>
<P align=center><STRONG><FONT face="Times New Roman, Times, serif" size=+3><EM>A 
Tutorial on Clustering Algorithms</EM></FONT></STRONG></P>
<HR>
<P align=center><FONT face="Times New Roman, Times, serif"><A 
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html">Introduction</A> 
| <A 
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html">K-means</A> 
| <A 
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/cmeans.html">Fuzzy 
C-means</A> | Hierarchical | <A 
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/mixture.html">Mixture 
of Gaussians</A> | <A 
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/links.html">Links</A></FONT></P>
<HR>

<P align=center><FONT face="Arial, Helvetica, sans-serif" 
size=+2><STRONG>Hierarchical Clustering Algorithms</STRONG></FONT><BR></P>
<P align=justify><FONT face="Arial, Helvetica, sans-serif" size=+1><EM>How They 
Work</EM></FONT><FONT face="Times New Roman, Times, serif"><BR>Given a set of N 
items to be clustered, and an N*N distance (or similarity) matrix, the basic 
process of hierarchical clustering (defined by <A 
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html#johnson">S.C. 
Johnson in 1967</A>) is this: </FONT></P>
<OL>
  <LI><FONT face="Times New Roman, Times, serif" align="justify">Start by 
  assigning each item to a cluster, so that if you have N items, you now have N 
  clusters, each containing just one item. Let the distances (similarities) 
  between the clusters the same as the distances (similarities) between the 
  items they contain.</FONT> 
  <LI><FONT face="Times New Roman, Times, serif">Find the closest (most similar) 
  pair of clusters and merge them into a single cluster, so that now you have 
  one cluster less.</FONT> 
  <LI><FONT face="Times New Roman, Times, serif">Compute distances 
  (similarities) between the new cluster and each of the old clusters.</FONT> 
  <LI><FONT face="Times New Roman, Times, serif">Repeat steps 2 and 3 until all 
  items are clustered into a single cluster of size N. (*)</FONT> </LI></OL>
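<P align=justify><FONT face="Times New Roman, Times, serif">The loop above is 
compact enough to sketch directly in code. The following Python fragment is an 
illustrative sketch, not part of the original tutorial: 
<CODE>cluster_distance</CODE> is a hypothetical placeholder for whichever 
linkage rule step 3 uses (the choices are described next).</FONT></P>
<PRE>
from itertools import combinations

def agglomerative(items, distance, cluster_distance):
    """Basic hierarchical clustering loop (steps 1-4 above).

    items            -- the N objects to cluster
    distance         -- d(a, b) between two items
    cluster_distance -- linkage rule used in step 3 (a placeholder here)
    """
    # Step 1: one singleton cluster per item.
    clusters = [frozenset([x]) for x in items]
    merges = []                       # history of (cluster, cluster, level)
    while len(clusters) != 1:         # step 4: repeat until one cluster is left
        # Step 2: find the closest (most similar) pair of clusters.
        r, s = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], distance))
        level = cluster_distance(r, s, distance)
        clusters.remove(r)
        clusters.remove(s)
        clusters.append(r | s)        # merge them: one cluster less
        merges.append((r, s, level))
        # Step 3 happens implicitly: cluster_distance is re-evaluated
        # against the new cluster on the next pass.
    return merges
</PRE>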
<P align=justify><FONT face="Times New Roman, Times, serif">Step 3 can be done 
in different ways, which is what distinguishes <EM>single-linkage</EM> from 
<EM>complete-linkage</EM> and <EM>average-linkage</EM> clustering.<BR>In 
<EM>single-linkage</EM> clustering (also called the <EM>connectedness</EM> or 
<EM>minimum</EM> method), we consider the distance between one cluster and 
another cluster to be equal to the <U>shortest</U> distance from any member of 
one cluster to any member of the other cluster. If the data consist of 
similarities, we consider the similarity between one cluster and another cluster 
to be equal to the <U>greatest</U> similarity from any member of one cluster to 
any member of the other cluster.<BR>In <EM>complete-linkage</EM> clustering 
(also called the <EM>diameter</EM> or <EM>maximum</EM> method), we consider the 
distance between one cluster and another cluster to be equal to the 
<U>greatest</U> distance from any member of one cluster to any member of the 
other cluster.<BR>In <EM>average-linkage</EM> clustering, we consider the 
distance between one cluster and another cluster to be equal to the 
<U>average</U> distance from any member of one cluster to any member of the 
other cluster.<BR>A variation on average-link clustering is the UCLUS method of 
<A 
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html#dandrade">R. 
D'Andrade (1978)</A> which uses the <U>median</U> distance, which is much more 
outlier-proof than the average distance.</FONT></P>
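<P align=justify><FONT face="Times New Roman, Times, serif">These linkage rules 
differ only in how they aggregate the pairwise item distances, which a few 
lines of Python make explicit. This is an illustrative sketch that plugs into 
the <CODE>cluster_distance</CODE> parameter of the loop above; 
<CODE>statistics.median</CODE> merely stands in for the UCLUS median 
rule.</FONT></P>
<PRE>
import statistics

def pairwise(c1, c2, d):
    """All item-to-item distances between two clusters."""
    return [d(a, b) for a in c1 for b in c2]

def single_link(c1, c2, d):    # connectedness / minimum method
    return min(pairwise(c1, c2, d))

def complete_link(c1, c2, d):  # diameter / maximum method
    return max(pairwise(c1, c2, d))

def average_link(c1, c2, d):   # mean of all pairwise distances
    return statistics.mean(pairwise(c1, c2, d))

def median_link(c1, c2, d):    # UCLUS-style, more outlier-proof
    return statistics.median(pairwise(c1, c2, d))
</PRE>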
<P align=justify><FONT face="Times New Roman, Times, serif">This kind of 
hierarchical clustering is called <EM>agglomerative</EM> because it merges 
clusters iteratively. There is also a <EM>divisive</EM> hierarchical clustering 
which does the reverse by starting with all objects in one cluster and 
subdividing them into smaller pieces. Divisive methods are not generally 
available and have rarely been applied.</FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif">(*) Of course there 
is no point in having all the N items grouped in a single cluster but, once you 
have got the complete hierarchical tree, if you want k clusters you just have to 
cut the k-1 longest links.</FONT></P>
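<P align=justify><FONT face="Times New Roman, Times, serif">In practice this cut 
is usually delegated to a library. A hedged example with SciPy (assuming SciPy 
is installed; <CODE>X</CODE> below is made-up illustrative data, and 
<CODE>criterion='maxclust'</CODE> asks for at most k flat clusters, which for 
single linkage amounts to cutting the k-1 longest links):</FONT></P>
<PRE>
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: five points in the plane (made-up values).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])

Z = linkage(X, method='single')                 # the complete hierarchical tree
labels = fcluster(Z, t=3, criterion='maxclust') # keep k = 3 clusters
print(labels)                                   # flat cluster id per point
</PRE>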
<P></P>
<P align=justify><FONT face="Arial, Helvetica, sans-serif" 
size=+1><EM>Single-Linkage Clustering: The Algorithm</EM></FONT><FONT 
face="Times New Roman, Times, serif"><BR>Let抯 now take a deeper look at how 
Johnson抯 algorithm works in the case of single-linkage clustering.<BR>The 
algorithm is an agglomerative scheme that erases rows and columns in the 
proximity matrix as old clusters are merged into new ones.</FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif">The N*N proximity 
matrix is D = [d(i,j)]. The clusterings are assigned sequence numbers 
0, 1, ..., (N-1) and L(k) is the level of the kth clustering. A cluster with 
sequence number m is denoted (m) and the proximity between clusters (r) and (s) 
is denoted d [(r),(s)]. </FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif">The algorithm is 
composed of the following steps:</FONT></P>
<TABLE width="75%" align=center border=1>
  <TBODY>
  <TR>
    <TD>
      <OL>
        <LI><EM><FONT face="Times New Roman, Times, serif" align="justify">Begin 
        with the disjoint clustering having level L(0) = 0 and sequence number m 
        = 0.<BR></FONT></EM>
        <LI><EM><FONT face="Times New Roman, Times, serif">Find the least 
        dissimilar pair of clusters in the current clustering, say pair (r), 
        (s), according to</FONT><BR><FONT 
        face="Times New Roman, Times, serif"><BR>d[(r),(s)] = min 
        d[(i),(j)]</FONT><BR><FONT 
        face="Times New Roman, Times, serif"><BR>where the minimum is over all 
        pairs of clusters in the current clustering.<BR></FONT></EM>
        <LI><EM><FONT face="Times New Roman, Times, serif">Increment the 
        sequence number: m = m + 1. Merge clusters (r) and (s) into a single 
        cluster to form the next clustering m. Set the level of this clustering 
        to</FONT><BR><FONT face="Times New Roman, Times, serif"><BR>L(m) = 
        d[(r),(s)]<BR></FONT></EM>
        <LI><EM><FONT face="Times New Roman, Times, serif">Update the proximity 
        matrix, D, by deleting the rows and columns corresponding to clusters 
        (r) and (s) and adding a row and column corresponding to the newly 
        formed cluster. The proximity between the new cluster, denoted (r,s) and 
        old cluster (k) is defined in this way:</FONT><BR><FONT 
        face="Times New Roman, Times, serif"><BR>d[(k), (r,s)] = min d[(k),(r)], 
        d[(k),(s)]<BR></FONT></EM>
        <LI><EM><FONT face="Times New Roman, Times, serif">If all objects are in 
        one cluster, stop. Else, go to step 2.</FONT></EM> 
</LI></OL></TD></TR></TBODY></TABLE>
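<P align=justify><FONT face="Times New Roman, Times, serif">A direct Python 
transcription of these five steps might look as follows. This is an 
illustrative sketch, not the tutorial author's code: instead of physically 
deleting rows and columns, it overwrites row and column (r) with the new 
cluster and simply retires (s).</FONT></P>
<PRE>
def single_linkage(D, labels):
    """Johnson's agglomerative single-link scheme on a proximity matrix.

    D      -- symmetric N*N list of lists holding d(i, j)
    labels -- names of the N initial singleton clusters
    Returns the merge history as (merged cluster name, level) pairs.
    """
    D = [row[:] for row in D]             # work on a copy
    clusters = list(labels)               # step 1: L(0) = 0, m = 0
    active = set(range(len(labels)))
    history = []
    while len(active) != 1:               # step 5: stop at one cluster
        # Step 2: least dissimilar pair (r), (s) among active clusters.
        r, s = min(((i, j) for i in sorted(active)
                           for j in sorted(active) if i != j),
                   key=lambda p: D[p[0]][p[1]])
        level = D[r][s]                   # step 3: L(m) = d[(r),(s)]
        clusters[r] = "%s/%s" % (clusters[r], clusters[s])
        # Step 4: d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }.
        for k in active:
            if k != r and k != s:
                D[r][k] = D[k][r] = min(D[k][r], D[k][s])
        active.remove(s)                  # retire (s)'s row and column
        history.append((clusters[r], level))
    return history
</PRE>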
<P align=justify></P>
<P align=justify><FONT face="Arial, Helvetica, sans-serif" size=+1><EM>An 
Example</EM></FONT><FONT face="Times New Roman, Times, serif"><BR>Let's now see 
a simple example: a hierarchical clustering of distances in kilometers between 
some Italian cities. The method used is single-linkage.</FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif"><STRONG>Input 
distance matrix</STRONG> (L = 0 for all the clusters):</FONT></P>
<TABLE width="35%" align=center border=1>
  <TBODY>
  <TR>
    <TD>&nbsp;</TD>
    <TD>
      <DIV align=center><STRONG>BA</STRONG></DIV></TD>
    <TD>
      <DIV align=center><STRONG>FI</STRONG></DIV></TD>
    <TD>
      <DIV align=center><STRONG>MI</STRONG></DIV></TD>
    <TD>
      <DIV align=center><STRONG>NA</STRONG></DIV></TD>
    <TD>
      <DIV align=center><STRONG>RM</STRONG></DIV></TD>
    <TD>
      <DIV align=center><STRONG>TO</STRONG></DIV></TD></TR>
  <TR>
    <TD>
      <DIV align=center><STRONG>BA</STRONG></DIV></TD>
    <TD>
      <DIV align=center>0</DIV></TD>
    <TD>
      <DIV align=center>662</DIV></TD>
    <TD>
      <DIV align=center>877</DIV></TD>
    <TD>
      <DIV align=center>255</DIV></TD>
    <TD>
      <DIV align=center>412</DIV></TD>
    <TD>
      <DIV align=center>996</DIV></TD></TR>
  <TR>
    <TD>
      <DIV align=center><STRONG>FI</STRONG></DIV></TD>
    <TD>
      <DIV align=center>662</DIV></TD>
    <TD>
      <DIV align=center>0</DIV></TD>
    <TD>
      <DIV align=center>295</DIV></TD>
    <TD>
      <DIV align=center>468</DIV></TD>
    <TD>
      <DIV align=center>268</DIV></TD>
    <TD>
      <DIV align=center>400</DIV></TD></TR>
  <TR>
    <TD>
      <DIV align=center><STRONG>MI</STRONG></DIV></TD>
    <TD>
      <DIV align=center>877</DIV></TD>
    <TD>
      <DIV align=center>295</DIV></TD>
    <TD>
      <DIV align=center>0</DIV></TD>
    <TD>
      <DIV align=center>754</DIV></TD>
    <TD>
      <DIV align=center>564</DIV></TD>
    <TD>
      <DIV align=center>138</DIV></TD></TR>
  <TR>
    <TD>
      <DIV align=center><STRONG>NA</STRONG></DIV></TD>
    <TD>
      <DIV align=center>255</DIV></TD>
    <TD>
      <DIV align=center>468</DIV></TD>
    <TD>
      <DIV align=center>754</DIV></TD>
    <TD>
      <DIV align=center>0</DIV></TD>
    <TD>
      <DIV align=center>219</DIV></TD>
    <TD>
      <DIV align=center>869</DIV></TD></TR>
  <TR>
    <TD>
      <DIV align=center><STRONG>RM</STRONG></DIV></TD>
    <TD>
      <DIV align=center>412</DIV></TD>
    <TD>
      <DIV align=center>268</DIV></TD>
    <TD>
      <DIV align=center>564</DIV></TD>
    <TD>
      <DIV align=center>219</DIV></TD>
    <TD>
      <DIV align=center>0</DIV></TD>
    <TD>
      <DIV align=center>669</DIV></TD></TR>
  <TR>
    <TD>
      <DIV align=center><STRONG>TO</STRONG></DIV></TD>
    <TD>
      <DIV align=center>996</DIV></TD>
    <TD>
      <DIV align=center>400</DIV></TD>
    <TD>
      <DIV align=center>138</DIV></TD>
    <TD>
      <DIV align=center>869</DIV></TD>
    <TD>
      <DIV align=center>669</DIV></TD>
    <TD>
      <DIV align=center>0</DIV></TD></TR></TBODY></TABLE>
<P align=center><IMG src="Clustering - Hierarchical.files/italia01.gif"></P>
<P align=justify><FONT face="Times New Roman, Times, serif">The nearest pair of 
cities is MI and TO, at distance 138. These are merged into a single cluster 
called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new 
sequence number is m = 1.<BR></FONT><FONT 
face="Times New Roman, Times, serif">Then we compute the distance from this new 
compound object to all other objects. In single link clustering the rule is that 
the distance from the compound object to another object is equal to the shortest 
distance from any member of the cluster to the outside object. So the distance 
from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and 
so on.</FONT></P>
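<P align=justify><FONT face="Times New Roman, Times, serif">Continuing the 
example in code: feeding the input distance matrix to the 
<CODE>single_linkage</CODE> sketch from the algorithm section reproduces this 
first merge (an illustrative check, not part of the original 
tutorial).</FONT></P>
<PRE>
labels = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = [[  0, 662, 877, 255, 412, 996],
     [662,   0, 295, 468, 268, 400],
     [877, 295,   0, 754, 564, 138],
     [255, 468, 754,   0, 219, 869],
     [412, 268, 564, 219,   0, 669],
     [996, 400, 138, 869, 669,   0]]

for name, level in single_linkage(D, labels):
    print(name, level)
# The first line printed is "MI/TO 138": MI and TO merge at level 138,
# and the new distance d[MI/TO, RM] becomes min(564, 669) = 564,
# exactly as described above.
</PRE>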
