<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0084)http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html -->
<HTML><HEAD><TITLE>Clustering - Hierarchical</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1106" name=GENERATOR></HEAD>
<BODY>
<P align=center><STRONG><FONT face="Times New Roman, Times, serif" size=+3><EM>A
Tutorial on Clustering Algorithms</EM></FONT></STRONG></P>
<HR>
<STRONG><FONT size=+2></FONT></STRONG>
<P align=center><FONT face="Times New Roman, Times, serif"><A
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html">Introduction</A>
| <A
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html">K-means</A>
| <A
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/cmeans.html">Fuzzy
C-means</A> | Hierarchical | <A
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/mixture.html">Mixture
of Gaussians</A> | <A
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/links.html">Links</A></FONT></P>
<HR>
<P align=center><FONT face="Arial, Helvetica, sans-serif"
size=+2><STRONG>Hierarchical Clustering Algorithms</STRONG></FONT><BR></P>
<P align=justify><FONT face="Arial, Helvetica, sans-serif" size=+1><EM>How They
Work</EM></FONT><FONT face="Times New Roman, Times, serif"><BR>Given a set of N
items to be clustered, and an N*N distance (or similarity) matrix, the basic
process of hierarchical clustering (defined by <A
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html#johnson">S.C.
Johnson in 1967</A>) is this: </FONT></P>
<OL>
  <LI><FONT face="Times New Roman, Times, serif" align="justify">Start by
  assigning each item to a cluster, so that if you have N items, you now have N
  clusters, each containing just one item. Let the distances (similarities)
  between the clusters be the same as the distances (similarities) between the
  items they contain.</FONT>
  <LI><FONT face="Times New Roman, Times, serif">Find the closest (most similar)
  pair of clusters and merge them into a single cluster, so that you now have
  one cluster fewer.</FONT>
<LI><FONT face="Times New Roman, Times, serif">Compute distances
(similarities) between the new cluster and each of the old clusters.</FONT>
  <LI><FONT face="Times New Roman, Times, serif">Repeat steps 2 and 3 until all
  items are clustered into a single cluster of size N. (*) A short code sketch of
  this loop is given just after the list.</FONT> </LI></OL>
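<P align=justify><FONT face="Times New Roman, Times, serif">As an illustration
of the loop just described, here is a minimal Python sketch (an addition, not
part of the original tutorial): the function name, the plain nested-list
distance matrix, and the pluggable <EM>linkage_dist</EM> function are
assumptions made for clarity, since step 3 admits several choices discussed
next.</FONT></P>
<PRE>
# A minimal, illustrative sketch of the generic agglomerative loop above.
# `dist` is the N*N distance matrix (list of lists); `linkage_dist` is any
# rule for the distance between two clusters (the choice made in step 3).

def agglomerative(dist, linkage_dist):
    n = len(dist)
    clusters = [[i] for i in range(n)]      # step 1: one cluster per item
    merges = []                             # record of (cluster_a, cluster_b, level)
    while len(clusters) > 1:
        # step 2: find the closest (most similar) pair of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_dist(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        # steps 3 and 4: merge the pair; distances to the new cluster are
        # recomputed from linkage_dist on the next pass
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
</PRE>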
<P align=justify><FONT face="Times New Roman, Times, serif">Step 3 can be done
in different ways, which is what distinguishes <EM>single-linkage</EM> from
<EM>complete-linkage</EM> and <EM>average-linkage</EM> clustering.<BR>In
<EM>single-linkage</EM> clustering (also called the <EM>connectedness</EM> or
<EM>minimum</EM> method), we consider the distance between one cluster and
another cluster to be equal to the <U>shortest</U> distance from any member of
one cluster to any member of the other cluster. If the data consist of
similarities, we consider the similarity between one cluster and another cluster
to be equal to the <U>greatest</U> similarity from any member of one cluster to
any member of the other cluster.<BR>In <EM>complete-linkage</EM> clustering
(also called the <EM>diameter</EM> or <EM>maximum</EM> method), we consider the
distance between one cluster and another cluster to be equal to the
<U>greatest</U> distance from any member of one cluster to any member of the
other cluster.<BR>In <EM>average-linkage</EM> clustering, we consider the
distance between one cluster and another cluster to be equal to the
<U>average</U> distance from any member of one cluster to any member of the
other cluster.<BR>A variation on average-link clustering is the UCLUS method of
<A
href="http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html#dandrade">R.
D'Andrade (1978)</A> which uses the <U>median</U> distance, which is much more
outlier-proof than the average distance.</FONT></P>
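<P align=justify><FONT face="Times New Roman, Times, serif">As a concrete
sketch (an addition, not from the original text), these linkage rules can be
written as small functions that plug into the loop shown earlier; only the way
the pairwise distances are combined changes.</FONT></P>
<PRE>
# Illustrative cluster-to-cluster distances for the linkage rules described above.
# Each function takes two clusters (lists of item indices) and the item-level
# distance matrix `dist`, and returns the distance between the two clusters.
from statistics import median

def single_linkage(c1, c2, dist):
    # shortest distance from any member of c1 to any member of c2
    return min(dist[i][j] for i in c1 for j in c2)

def complete_linkage(c1, c2, dist):
    # greatest distance from any member of c1 to any member of c2
    return max(dist[i][j] for i in c1 for j in c2)

def average_linkage(c1, c2, dist):
    # average distance over all pairs of members
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

def median_linkage(c1, c2, dist):
    # median distance over all pairs, in the spirit of D'Andrade's UCLUS variant
    return median(dist[i][j] for i in c1 for j in c2)
</PRE>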
<P align=justify><FONT face="Times New Roman, Times, serif">This kind of
hierarchical clustering is called <EM>agglomerative</EM> because it merges
clusters iteratively. There is also a <EM>divisive</EM> hierarchical clustering
which does the reverse by starting with all objects in one cluster and
subdividing them into smaller pieces. Divisive methods are not generally
available and have rarely been applied.</FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif">(*) Of course there
is no point in having all N items grouped in a single cluster; but once you
have the complete hierarchical tree, if you want k clusters you just have to
cut the k-1 longest links.</FONT></P>
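<P align=justify><FONT face="Times New Roman, Times, serif">A hedged example of
this cut, assuming SciPy is available (the library call and the small made-up
matrix below are not part of the original tutorial): <EM>fcluster</EM> can
extract k flat clusters from the complete tree, which amounts to cutting the
k-1 longest links.</FONT></P>
<PRE>
# Illustrative: building the full tree and cutting it into k flat clusters with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# A small symmetric distance matrix with a zero diagonal (made-up numbers).
D = np.array([[0.0, 2.0, 6.0],
              [2.0, 0.0, 5.0],
              [6.0, 5.0, 0.0]])

Z = linkage(squareform(D), method='single')       # complete hierarchical tree
labels = fcluster(Z, t=2, criterion='maxclust')   # cut into k = 2 clusters
print(labels)                                     # e.g. [1 1 2]
</PRE>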
<P></P>
<P align=justify><FONT face="Arial, Helvetica, sans-serif"
size=+1><EM>Single-Linkage Clustering: The Algorithm</EM></FONT><FONT
face="Times New Roman, Times, serif"><BR>Let抯 now take a deeper look at how
Johnson抯 algorithm works in the case of single-linkage clustering.<BR>The
algorithm is an agglomerative scheme that erases rows and columns in the
proximity matrix as old clusters are merged into new ones.</FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif">The N*N proximity
matrix is D = [d(i,j)]. The clusterings are assigned sequence numbers
0, 1, ..., (N-1), and L(k) is the level of the kth clustering. A cluster with
sequence number m is denoted (m), and the proximity between clusters (r) and (s)
is denoted d[(r),(s)].</FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif">The algorithm is
composed of the following steps (a Python sketch follows the table):</FONT></P>
<TABLE width="75%" align=center border=1>
<TBODY>
<TR>
<TD>
<OL>
<LI><EM><FONT face="Times New Roman, Times, serif" align="justify">Begin
with the disjoint clustering having level L(0) = 0 and sequence number m
= 0.<BR></FONT></EM>
<LI><EM><FONT face="Times New Roman, Times, serif">Find the least
dissimilar pair of clusters in the current clustering, say pair (r),
(s), according to</FONT><BR><FONT
face="Times New Roman, Times, serif"><BR>d[(r),(s)] = min
d[(i),(j)]</FONT><BR><FONT
face="Times New Roman, Times, serif"><BR>where the minimum is over all
pairs of clusters in the current clustering.<BR></FONT></EM>
        <LI><EM><FONT face="Times New Roman, Times, serif">Increment the
          sequence number: m = m + 1. Merge clusters (r) and (s) into a single
          cluster to form the next clustering m. Set the level of this clustering
          to</FONT><BR><FONT face="Times New Roman, Times, serif"><BR>L(m) =
          d[(r),(s)]<BR></FONT></EM>
        <LI><EM><FONT face="Times New Roman, Times, serif">Update the proximity
          matrix, D, by deleting the rows and columns corresponding to clusters
          (r) and (s) and adding a row and column corresponding to the newly
          formed cluster. The proximity between the new cluster, denoted (r,s),
          and an old cluster (k) is defined in this way:</FONT><BR><FONT
          face="Times New Roman, Times, serif"><BR>d[(k),(r,s)] = min { d[(k),(r)],
          d[(k),(s)] }<BR></FONT></EM>
<LI><EM><FONT face="Times New Roman, Times, serif">If all objects are in
one cluster, stop. Else, go to step 2.</FONT></EM>
</LI></OL></TD></TR></TBODY></TABLE>
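<P align=justify><FONT face="Times New Roman, Times, serif">The sketch below
follows these steps more literally for single linkage, shrinking the proximity
structure as clusters (r) and (s) merge; the dictionary-based representation
and the function name are assumptions made for illustration, not part of the
original tutorial.</FONT></P>
<PRE>
# Illustrative implementation of the steps above: entries for clusters (r) and (s)
# are deleted when they merge, and a new entry for (r,s) is added with
# d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).

def single_linkage_johnson(labels, D):
    # `labels` names the N items; `D` is the N*N proximity matrix (list of lists).
    prox = {(labels[i], labels[j]): D[i][j]
            for i in range(len(labels))
            for j in range(len(labels)) if i != j}
    clusters = list(labels)
    levels = []                                   # L(m) for m = 1, 2, ...
    while len(clusters) > 1:
        # step 2: least dissimilar pair (r), (s) in the current clustering
        (r, s), d = min(prox.items(), key=lambda kv: kv[1])
        # step 3: the level of the new clustering is d[(r),(s)]
        levels.append(((r, s), d))
        merged = (r, s)
        clusters = [c for c in clusters if c not in (r, s)]
        # step 4: add entries for the new cluster, then drop those for (r) and (s)
        for k in clusters:
            prox[(k, merged)] = prox[(merged, k)] = min(prox[(k, r)], prox[(k, s)])
        prox = {pair: v for pair, v in prox.items()
                if r not in pair and s not in pair}
        clusters.append(merged)
    return levels
</PRE>
<P align=justify><FONT face="Times New Roman, Times, serif">Each proximity is
stored under both key orders so the lookup in step 4 stays simple; a production
implementation would typically rely on a library such as SciPy instead.</FONT></P>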
<P align=justify></P>
<P align=justify><FONT face="Arial, Helvetica, sans-serif" size=+1><EM>An
Example</EM></FONT><FONT face="Times New Roman, Times, serif"><BR>Let's now see
a simple example: a hierarchical clustering of distances in kilometers between
some Italian cities. The method used is single-linkage.</FONT></P>
<P align=justify><FONT face="Times New Roman, Times, serif"><STRONG>Input
distance matrix</STRONG> (L = 0 for all the clusters):</FONT></P>
<TABLE width="35%" align=center border=1>
<TBODY>
<TR>
<TD> </TD>
<TD>
<DIV align=center><STRONG>BA</STRONG></DIV></TD>
<TD>
<DIV align=center><STRONG>FI</STRONG></DIV></TD>
<TD>
<DIV align=center><STRONG>MI</STRONG></DIV></TD>
<TD>
<DIV align=center><STRONG>NA</STRONG></DIV></TD>
<TD>
<DIV align=center><STRONG>RM</STRONG></DIV></TD>
<TD>
<DIV align=center><STRONG>TO</STRONG></DIV></TD></TR>
<TR>
<TD>
<DIV align=center><STRONG>BA</STRONG></DIV></TD>
<TD>
<DIV align=center>0</DIV></TD>
<TD>
<DIV align=center>662</DIV></TD>
<TD>
<DIV align=center>877</DIV></TD>
<TD>
<DIV align=center>255</DIV></TD>
<TD>
<DIV align=center>412</DIV></TD>
<TD>
<DIV align=center>996</DIV></TD></TR>
<TR>
<TD>
<DIV align=center><STRONG>FI</STRONG></DIV></TD>
<TD>
<DIV align=center>662</DIV></TD>
<TD>
<DIV align=center>0</DIV></TD>
<TD>
<DIV align=center>295</DIV></TD>
<TD>
<DIV align=center>468</DIV></TD>
<TD>
<DIV align=center>268</DIV></TD>
<TD>
<DIV align=center>400</DIV></TD></TR>
<TR>
<TD>
<DIV align=center><STRONG>MI</STRONG></DIV></TD>
<TD>
<DIV align=center>877</DIV></TD>
<TD>
<DIV align=center>295</DIV></TD>
<TD>
<DIV align=center>0</DIV></TD>
<TD>
<DIV align=center>754</DIV></TD>
<TD>
<DIV align=center>564</DIV></TD>
<TD>
<DIV align=center>138</DIV></TD></TR>
<TR>
<TD>
<DIV align=center><STRONG>NA</STRONG></DIV></TD>
<TD>
<DIV align=center>255</DIV></TD>
<TD>
<DIV align=center>468</DIV></TD>
<TD>
<DIV align=center>754</DIV></TD>
<TD>
<DIV align=center>0</DIV></TD>
<TD>
<DIV align=center>219</DIV></TD>
<TD>
<DIV align=center>869</DIV></TD></TR>
<TR>
<TD>
<DIV align=center><STRONG>RM</STRONG></DIV></TD>
<TD>
<DIV align=center>412</DIV></TD>
<TD>
<DIV align=center>268</DIV></TD>
<TD>
<DIV align=center>564</DIV></TD>
<TD>
<DIV align=center>219</DIV></TD>
<TD>
<DIV align=center>0</DIV></TD>
<TD>
<DIV align=center>669</DIV></TD></TR>
<TR>
<TD>
<DIV align=center><STRONG>TO</STRONG></DIV></TD>
<TD>
<DIV align=center>996</DIV></TD>
<TD>
<DIV align=center>400</DIV></TD>
<TD>
<DIV align=center>138</DIV></TD>
<TD>
<DIV align=center>869</DIV></TD>
<TD>
<DIV align=center>669</DIV></TD>
<TD>
<DIV align=center>0</DIV></TD></TR></TBODY></TABLE>
<P align=center><IMG src="Clustering - Hierarchical.files/italia01.gif"></P>
<P align=justify><FONT face="Times New Roman, Times, serif">The nearest pair of
cities is MI and TO, at distance 138. These are merged into a single cluster
called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new
sequence number is m = 1.<BR></FONT><FONT
face="Times New Roman, Times, serif">Then we compute the distance from this new
compound object to all other objects. In single-linkage clustering the rule is that
the distance from the compound object to another object is equal to the shortest
distance from any member of the cluster to the outside object. So the distance
from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and
so on.</FONT></P>
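<P align=justify><FONT face="Times New Roman, Times, serif">As a hedged check of
this first step (the SciPy call is an assumption; the tutorial itself does the
computation by hand), feeding the distance matrix above to SciPy reproduces the
merge of MI and TO at level 138, and the single-link rule gives 564 as the new
distance from "MI/TO" to RM.</FONT></P>
<PRE>
# Illustrative: reproducing the first step of the Italian-cities example with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],   # BA
    [662,   0, 295, 468, 268, 400],   # FI
    [877, 295,   0, 754, 564, 138],   # MI
    [255, 468, 754,   0, 219, 869],   # NA
    [412, 268, 564, 219,   0, 669],   # RM
    [996, 400, 138, 869, 669,   0],   # TO
], dtype=float)

Z = linkage(squareform(D), method='single')
print(Z[0])                   # first merge: items 2 (MI) and 5 (TO) at distance 138

# Single-link distance from the new cluster "MI/TO" to RM, computed directly:
print(min(D[2][4], D[5][4]))  # min(564, 669) = 564
</PRE>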