algorithm

来自「Andreas Stolcke开发的聚类源码, 版本2.9。聚类方法是层次合并聚」· 代码 · 共 117 行

TXT

117 行

Here is some information regarding the clustering method used in cluster,since many people have asked for some.THE METHODThe algorithm works by iteratively merging smaller clusters into biggerones.It starts with one data point per cluster.  Then look for the smallestdistance between any two clusters (i.e., data points at this stage).Distance is Euclidean by default, but can be any p-norm (see the -noption). Those two cluster are merged, i.e., they form a new clusterwith the two earlier ones as subclusters, which gives rise to a branchin the cluster tree.  For purposes of distance computation the new clusteris represented by the average of all its data points.The merging is repeated until only one cluster is left.LITERATUREAs pointed out by Fionn Murtagh (fmurtagh@eso.org), the algorithm implementedby cluster is known as the Centroid or "Unweighted pair group" method.(see Sneath:73 below).An account of the origin of this algorithm can be found inthe following mail from Yoshiro Miyata:    Date: Sat, 6 Feb 93 10:50:41 JST    From: miyata@sccs.chukyo-u.ac.jp (Yoshiro MIYATA)    To: stolcke@ICSI.Berkeley.EDU    Subject: cluster algorithm    When I wrote that code several years ago, being a psychology grad    student I knew next to nothing about this branch of statistics.  It    was just a fun project on a rainy weekend sort of thing.  I don't    remember reading any book .. I probably heard about clustering    analysis in someone's talk, imagined what it should be like and coded    it up.  I must have been lucky if that was close enough to something    in the literature!!The included MAIL from fmurtagh@eso.org mentions several alternative methodsand points out references.REFERENCES@book{Sneath:73,author =	{Peter H. A. Sneath and Robert R. Sokal},title =		{Numerical taxonomy; the principles and practice of numerical		classification},publisher =	{W. H. Freeman},address =	{San Francisco},yyear =		{1973}}@book{Murtagh:85,author =	{Fionn Murtagh},title =		{Multidimensional clustering algorithms},publisher =	{Physica-Verlag},address =	{Vienna},year =		{1985}}@article{Gower:69,author =        {J. C. Gower and G. J. S. Ross},title =         {Minimum Spanning Trees and Single Linkage Cluster Analysis},journal =       {Applied Statistics},volume =        {18},number =        {1},pages =         {54-64},year =          {1969},annote =        {Describes algorithms for MST and SLCA, in particular how to		obtain the latter from the former.}}@article{Zahn:71,author =        {Charles T. Zahn},title =         {Graph-Theoretical Methods for Detecting and Describing                Gestalt Clusters},pages =         {68-86},journal =       {IEEE Transactions on Computers},volume =        {C-20},number =        {1},month =         jan,year =          {1971},annote =        {Explores the minimum spanning tree as a tool for                clustering.  Applications include cluster segmentation,                density gradient detection, particle tracking, description                of Gaussian clusters, hierarchical clustering,                and relative compactness computation.                Also contains many pointers into the clustering literature.}}THE ALGORITHMIn the latest version of cluster, minimum distance pairs are found bykeeping track of the nearest neighbor of each data point (as well asthe distance).  The nnbs are initially computed in O(n^2) time.  When anew cluster is formed, one of the child nodes is discarded, while theother is replaced by the new node representing the new cluster.  Foreach of the other points, it is checked whether the nearest neighborhas changed due to the new cluster, and updated if so.  This can bedone in constant time for each point (except for the rare case wherethe previous nearest neighbor happened to be one of the two points nowreplaced by the cluster).  On each iteration, searching for theshortest distance pair takes O(n) time using nearest neighbor list.There are n iterations overall, so the total time is O(n^2).For historical notes on the less efficient algorithm in the old cluster,see the BENCHMARKS file.  Also note that an even more efficientimplementation (O(n log n)) should be possible.  Unfortunately, I'm just a dedicated user of cluster, and not expert in these matters.Andreas Stolcke1/21/93

algorithm - 源码说明

本页面展示了「Andreas Stolcke开发的聚类源码, 版本2.9。聚类方法是层次合并聚类」中的 algorithm 源码文件，采用编程语言编写，共 117 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。

虫虫下载站收录了大量与Andreas相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。

⌨️ 快捷键说明

复制代码Ctrl + C

搜索代码Ctrl + F

全屏模式F11

增大字号Ctrl + =

减小字号Ctrl + -

显示快捷键?