readmeenglish.txt

来自「cluster validation tools matlab toolbox」· 文本代码 · 共 88 行

TXT

88 行

Help file of Cluster Validation Toolbox for estimating the number of clusters (CVT-NC)
 (Version 2.0)

Your comments are welcome at:
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=13916
E-mail: sunice9@yahoo.com

(1) Contents of CVT-NC
    The CVT-NC includes 4 External validity indices and 8 internal validity indices, and the sub-routine "validity_Index.m" is designed to use them. 
    This tool is suitable for the research work such as the performance comparison of different indices on estimation of the number of clusters, algorithm design for applications by using or improving part codes of this tool, and etc. A much better visual tool (more validity indices and clustering algorithms) will come soon, but it is inconvenience to adjust codes. (finding its arrival at http://www.mathworks.com/matlabcentral/fileexchange/loadAuthor.do?objectType=author&objectId=1095267)
i)  External validity indices when true class labels are known:
    Rand index
    Adjusted Rand index
    Mirkin index
    Hubert index
ii) Internal validity indices when true class labels are unknown:
    Silhouette
    Davies-Bouldin
    Calinski-Harabasz
    Krzanowski-Lai
    Hartigan
    weighted inter- to intra-cluster ratio
    Homogeneity
    Separation
iii) Others
    Error rate (compared with true labels)
    System Evolution: it is used to estimate the number of clusters and give separable degrees between clusters.

Note 1: The codes of Rand, Adjusted Rand, Mirkin, Hubert indices are from David Corney (D.Corney@cs.ucl.ac.uk), who holds the copyright.

Note 2: Error rate: The error rate might be inaccurate if  the clustering solution under true NC has error rate >20%, since "valid_errorate" designed here can not deal with complex cases.

(2) Contents of main file "mainClusterValidationNC.m" 
    It is designed to use validity indices to estimate the number of clusters (NC) for PAM and K-means clustering algorithms. 
Part 1: Selection of a data set, initialization and computation of distance/dissimilarity matrix. 
Part 2: A clustering algorithm Runs (N-1 times) to yield k clusters (k=2,3,...,N). 
Part 3: Cluster validation for Estimating the number of clusters (NC). "validity_Index"
Part 4: System evolution estimates NC.  "SystemEvolution_findk"

Note 1: The programs of Part 4 and the demo data sets are available from:
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=11889

Note2: The programs are tested under Matlab 6.5 and 7.2.

(3) PAM & K-means clustering algorithms included in this program
  The K-means codes are from Mathworks. The initialization of K-means is to select K centroids from data at random, for other choices refer to the kmeans.m of Matlab (inner function of Matlab).

  The PAM (partitioning around medoids) is a robust clustering algorithm to minimize a sum of dissimilarities of data points to their closest medoids, and tends to be more robust than K-means, or a robust “version” of K-means. PAM needs pre-assigned NC as input parameter, similar to K-means. It seems not suitable to large data sets, and might run slow for a data set with number of data points over such as 2000.
  The programs of PAM have been included in the Matlab library LIBRA (http://wis.kuleuven.be/stat/robust/LIBRA.html), the statistic analysis software S-plus (http://www.splus.com/) and the cluster package of R (http://cran.r-project.org/). The PAM codes in this program are from LIBRA.

(4) Pearson similarity/distance
  Pearson similarity/distance is the linear correlation coefficient between two vectors and has its value range from -1 to 1, and it is commonly used to measure the similarity/distances between genes. 
  For the correct computation of indices, in this program the correlation coefficient is normalized to [0,1] by R(i,j)=(1-R(i,j))/2 as distances, where 0 is the closest distance and 1 the farthest one, and it is easy to convert it back by 1-2R(i,j). 
  For example, assume that there be two genes g1 and g2, then R(g1,g2)=1 means that their distance is the farthest, and R(g1,g1)=0 means that g1 itself has the closest distance. 

(5) Input: a data file like "yourdata.txt"
  The input data file is the tab delimited text file with numeric tabular data or similar Matlab file format (e.g. rows denote data points/elements and columns denote dimensions),  and all the data should be numeric values and without missing values.

  If you use Euclidean distance, please put the data file before "case 21". If true class labels are known and in 1st column, put the data file before "case 11", otherwise in "case 11". 
  If you use Pearson distance, please put the data file after "case 20". If true class labels are known and in 1st column, put the data file between "case 21" and "case 40", otherwise after "case 40". 
 
(6) Output
   The PAM/K-means is first used to divide a data set into k clusters (k=1,2,3,…,N), resulting in N clustering solutions; and then the validity indices/methods estimate the optimal NC ko based on these solutions with seeking limit N=ko+6. The found ko indicated by a square symbol is shown in the figures.
   When a cluster has few elements (e.g.<4), the PAM/K-means will not go on (see rows in the clustering part)

(7) Demo data sets (all have true class labels in 1st column of the data file)
Dataset             #class   #elements    dimension    features
4k2bigsmall_far      4 	400	          2	    far small-large clusters
4k2bigsmall_lap      4	400	          2	    overlapping small-large clusters
8k2close                  8	800	          2	    close well-separated clusters
8k2lap                      8	800	          2	    overlapping clusters
6k20close                6	400	          20	close well-separated
6k40far	                6	400	          40	far well-separated
4k20lap	                4	400	          20	overlapping
4k40lap	                4	400	          40	overlapping
leuk72_3k	                3	72	          39	close well-separated
lym96_4k	                4	96	          46	close well-separated
g205	                4	205	          80	close large-small
y208	                4	208	          79	far well-separated

Note: These demo data sets are available from:
http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=11889

---------------------------------------------------------------------------------------------------
This software is distributed under the LGPL license.  (see Copyright.txt)
Copyright (C) 2006-2007.
Last modified: April 5, 2007

readmeenglish.txt - 源码说明

本页面展示了「cluster validation tools matlab toolbox」中的 readmeenglish.txt 源码文件，采用文本编程语言编写，共 88 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。

虫虫开发者社区收录了大量与MATLAB相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。

⌨️ 快捷键说明

复制代码Ctrl + C

搜索代码Ctrl + F

全屏模式F11

增大字号Ctrl + =

减小字号Ctrl + -

显示快捷键?