⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 algorithm

📁 Andreas Stolcke开发的聚类源码, 版本2.9。聚类方法是层次合并聚类
💻
字号:
Here is some information regarding the clustering method used in cluster,since many people have asked for some.THE METHODThe algorithm works by iteratively merging smaller clusters into biggerones.It starts with one data point per cluster.  Then look for the smallestdistance between any two clusters (i.e., data points at this stage).Distance is Euclidean by default, but can be any p-norm (see the -noption). Those two cluster are merged, i.e., they form a new clusterwith the two earlier ones as subclusters, which gives rise to a branchin the cluster tree.  For purposes of distance computation the new clusteris represented by the average of all its data points.The merging is repeated until only one cluster is left.LITERATUREAs pointed out by Fionn Murtagh (fmurtagh@eso.org), the algorithm implementedby cluster is known as the Centroid or "Unweighted pair group" method.(see Sneath:73 below).An account of the origin of this algorithm can be found inthe following mail from Yoshiro Miyata:    Date: Sat, 6 Feb 93 10:50:41 JST    From: miyata@sccs.chukyo-u.ac.jp (Yoshiro MIYATA)    To: stolcke@ICSI.Berkeley.EDU    Subject: cluster algorithm    When I wrote that code several years ago, being a psychology grad    student I knew next to nothing about this branch of statistics.  It    was just a fun project on a rainy weekend sort of thing.  I don't    remember reading any book .. I probably heard about clustering    analysis in someone's talk, imagined what it should be like and coded    it up.  I must have been lucky if that was close enough to something    in the literature!!The included MAIL from fmurtagh@eso.org mentions several alternative methodsand points out references.REFERENCES@book{Sneath:73,author =	{Peter H. A. Sneath and Robert R. Sokal},title =		{Numerical taxonomy; the principles and practice of numerical		classification},publisher =	{W. H. Freeman},address =	{San Francisco},yyear =		{1973}}@book{Murtagh:85,author =	{Fionn Murtagh},title =		{Multidimensional clustering algorithms},publisher =	{Physica-Verlag},address =	{Vienna},year =		{1985}}@article{Gower:69,author =        {J. C. Gower and G. J. S. Ross},title =         {Minimum Spanning Trees and Single Linkage Cluster Analysis},journal =       {Applied Statistics},volume =        {18},number =        {1},pages =         {54-64},year =          {1969},annote =        {Describes algorithms for MST and SLCA, in particular how to		obtain the latter from the former.}}@article{Zahn:71,author =        {Charles T. Zahn},title =         {Graph-Theoretical Methods for Detecting and Describing                Gestalt Clusters},pages =         {68-86},journal =       {IEEE Transactions on Computers},volume =        {C-20},number =        {1},month =         jan,year =          {1971},annote =        {Explores the minimum spanning tree as a tool for                clustering.  Applications include cluster segmentation,                density gradient detection, particle tracking, description                of Gaussian clusters, hierarchical clustering,                and relative compactness computation.                Also contains many pointers into the clustering literature.}}THE ALGORITHMIn the latest version of cluster, minimum distance pairs are found bykeeping track of the nearest neighbor of each data point (as well asthe distance).  The nnbs are initially computed in O(n^2) time.  When anew cluster is formed, one of the child nodes is discarded, while theother is replaced by the new node representing the new cluster.  Foreach of the other points, it is checked whether the nearest neighborhas changed due to the new cluster, and updated if so.  This can bedone in constant time for each point (except for the rare case wherethe previous nearest neighbor happened to be one of the two points nowreplaced by the cluster).  On each iteration, searching for theshortest distance pair takes O(n) time using nearest neighbor list.There are n iterations overall, so the total time is O(n^2).For historical notes on the less efficient algorithm in the old cluster,see the BENCHMARKS file.  Also note that an even more efficientimplementation (O(n log n)) should be possible.  Unfortunately, I'm just a dedicated user of cluster, and not expert in these matters.Andreas Stolcke1/21/93

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -