cli/clx/cle/clc - probabilistic and fuzzy clustering

This file provides some explanations on how to use the programs cli,
clx, cle, clc to induce, execute and evaluate a set of clusters.
However, it does not explain all options of these programs. For a
list of options, call cli, clx, cle, clc without any arguments.

Enjoy,
Christian Borgelt

e-mail: borgelt@iws.cs.uni-magdeburg.de
WWW:    http://fuzzy.cs.uni-magdeburg.de/~borgelt

------------------------------------------------------------------------

In this directory (cluster/ex) you can find the well-known iris data
(measurements of the sepal length / width and the petal length / width
of three types of iris flowers) in formats suitable for the clustering
programs. There are two versions: a matrix version iris.pat, which
contains only a matrix of numbers, and a table version iris.tab, which
contains column names and an additional column with the iris type
information.

The matrix version can be processed with the option -M. To induce a
set of three clusters with the fuzzy c-means algorithm, type

  cli -M -c3 iris.pat iris.cls

The option -c3 instructs the program to find three clusters. iris.pat
is the input file containing the data, iris.cls the output file to
which a description of the clusters will be written. The result of
this program call should look like this (contents of iris.cls):

  scales   = [5.84333, 1.21168], [3.05733, 2.30197],
             [3.758, 0.568374], [1.19933, 1.31632];
  function = cauchy(2,0);
  normmode = sum1;
  params   = {{ [-1.00478,   0.846484,  -1.28465,  -1.23865 ] },
              { [-0.0383646, -0.818721,  0.32297,   0.232151] },
              { [ 1.06925,    0.0374249, 0.970174,  1.02979 ] }, 1 };

The first two lines specify the scaling parameters (offset and scaling
factor), which describe how the input data are scaled in order to
achieve a distribution with mean 0 and variance 1 in each dimension.
The reason for this scaling is to avoid a distortion of the clustering
result due to considerably different ranges of values in the input
dimensions.

The third line states the membership function used (it is the same for
all clusters). In this case it is the (generalized) Cauchy function

  f(d) = 1 / (d^a + b),

where d is the distance from the cluster center, with parameters a = 2
and b = 0. That is, the (unnormalized) degree of membership is computed
as the inverse squared distance from the cluster center. An alternative
is the (generalized) Gaussian function

  f(d) = exp(-0.5 * d^a),

which can be selected with the option -G.

The fourth line states the normalization mode for the membership
degrees. Here it is "sum1", which means that the membership degrees
are scaled in such a way that they sum up to 1.

The line starting with "params" and the two lines following it specify
the cluster parameters, which in this case (fuzzy c-means algorithm)
are the coordinates of the cluster centers. Each section enclosed in
curly braces specifies the center of one cluster. At the end of the
list of cluster descriptions the value 1 specifies that all clusters
have a uniform size (isotropic variance) of 1.
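To make these numbers a bit more concrete, the following small sketch
(my own illustration in Python, not code from the package) computes
sum1-normalized Cauchy membership degrees for a raw data point from
the values in iris.cls above. It assumes that the scaling is applied
as (value - offset) * factor and that d is the Euclidean distance in
the scaled space:

  # Sketch only: assumed interpretation of the entries of iris.cls.
  import math

  # scaling parameters from iris.cls: (offset, factor) per dimension
  scales = [(5.84333, 1.21168), (3.05733, 2.30197),
            (3.758,   0.568374), (1.19933, 1.31632)]

  # cluster centers from iris.cls (coordinates in the scaled space)
  centers = [(-1.00478,    0.846484, -1.28465,  -1.23865),
             (-0.0383646, -0.818721,  0.32297,   0.232151),
             ( 1.06925,    0.0374249, 0.970174,  1.02979)]

  def cauchy(d, a=2.0, b=0.0):
      # generalized Cauchy radial function f(d) = 1 / (d^a + b)
      return 1.0 / (d ** a + b)

  def memberships(point):
      # scale the raw values to mean 0 / variance 1 (assumed formula)
      z = [(x - off) * fac for x, (off, fac) in zip(point, scales)]
      # unnormalized membership degrees from distances to the centers
      u = [cauchy(math.dist(z, c)) for c in centers]
      # normalization mode "sum1": rescale so the degrees sum up to 1
      s = sum(u)
      return [ui / s for ui in u]

  print(memberships([5.1, 3.5, 1.4, 0.2]))  # a typical Iris setosa pattern

For this pattern practically all of the membership should fall to the
first cluster, as one would expect for an Iris setosa measurement.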
If no normalization of the input ranges is desired, it can be switched
off with the option -q. For example,

  cli -M -qc3 iris.pat iris.cls

(note how several options can be combined) yields

  function = cauchy(2,0);
  normmode = sum1;
  params   = {{ [ 5.00397, 3.41409, 1.48282, 0.253546] },
              { [ 5.88893, 2.76107, 4.36395, 1.39732 ] },
              { [ 6.77501, 3.05238, 5.64678, 2.05355 ] }, 1 };

Here the scaling parameters all specify the identity function (and
hence they are not explicitly listed), so that the clustering
algorithm is executed directly in the input space.

The induced set of clusters can then be executed on the data in order
to compute the membership degrees for the different data points. This
is done with the program clx. For example,

  clx iris.cls iris.pat iris.out

creates a table iris.out, which contains three additional columns -
one for each cluster. These columns hold the degrees of membership,
rounded to two decimal places. (If a higher (or lower) accuracy is
desired, the output format of the membership degrees can be changed
with the option -o.)

If only the cluster with the highest degree of membership is desired,
one may use the option -c, which produces only one additional column
containing the index of the cluster with the highest degree of
membership. To this another column, containing the membership degree
for this cluster, may be added with the option -m.

Called without the option -M, the programs cli and clx perform exactly
the same tasks, only on a different input format, namely the format of
the file iris.tab. This format is processed in connection with a domain
description file (here: iris.dom) that specifies which columns are to
be used and the data types of these columns. In this way it is possible
to execute the clustering algorithm on a subset of the attributes
without changing the data file. It is also possible to handle symbolic
attributes, which are coded by a simple 1-in-n code before they are
presented to the clustering algorithm. As a consequence, the output of
the program cli for the table format contains an additional section
stating the domain information for the attributes (compared to the
output for the matrix format).

In both modes the program cli is highly parameterizable, so that a
large variety of clustering algorithms can be carried out. Here is a
list of some options that lead to well-known algorithms:

  options   algorithm
  -jhard    hard c-means algorithm
  (none)    fuzzy c-means algorithm
  -v        axes-parallel Gustafson-Kessel algorithm
  -V        general Gustafson-Kessel algorithm
  -wvG      axes-parallel Gath-Geva (FMLE) algorithm
  -wVG      general Gath-Geva (FMLE) algorithm
  -wvGNx1   axes-parallel mixture of Gaussians (EM algorithm)
  -wVGNx1   general mixture of Gaussians (EM algorithm)

Explanation of the individual options:

  -j#   membership normalization mode
  -v    adaptable variances
  -V    adaptable covariances (covariance matrix)
  -Z    adaptable cluster sizes
  -w    adaptable weights/prior probabilities
  -G    Gaussian radial function (default: Cauchy function)
  -N    normalize to unit integral (probability density)
  -x    exponent for weight of pattern/data point
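To give an impression of what these options mean, here is another
small sketch (again my own illustration with made-up numbers, not the
package's code) of the kind of model the combination -wVGNx1 selects:
every cluster is a Gaussian with its own covariance matrix (-V) and a
radial function of Gaussian shape (-G) that is normalized to unit
integral (-N) and weighted with a prior probability (-w), so that the
normalized membership degrees are the posterior probabilities of a
mixture of Gaussians:

  # Sketch only: responsibilities in a mixture of Gaussians
  # (the cluster parameters below are made up for illustration).
  import numpy as np

  def gauss_density(x, center, cov):
      # multivariate normal density of x for one cluster
      d = x - center
      k = len(x)
      norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))
      return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm

  def responsibilities(x, centers, covs, weights):
      # posterior probabilities of the clusters given x (sum up to 1)
      p = np.array([w * gauss_density(x, c, s)
                    for c, s, w in zip(centers, covs, weights)])
      return p / p.sum()

  # two hypothetical clusters in two dimensions
  centers = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
  covs    = [np.array([[1.0,  0.3], [ 0.3, 0.5]]),
             np.array([[0.8, -0.2], [-0.2, 1.2]])]
  weights = [0.6, 0.4]

  print(responsibilities(np.array([1.0, 0.5]), centers, covs, weights))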
It is usually advisable to initialize the higher algorithms (like
Gustafson-Kessel and Gath-Geva) with a few epochs of the fuzzy c-means
algorithm. This can be achieved by exploiting that the program cli can
read in a clustering result. That is, by a call like

  cli -MOV iris.pat iris.gk iris.cls

the fuzzy c-means result obtained with the program call stated above
(stored in iris.cls) is further processed with the Gustafson-Kessel
algorithm. The result is written to the file iris.gk. The option -O is
necessary to overwrite the cluster type and radial function parameters
read from the input file with the command line values.

The shape and size regularization options (-H and -R) are described
briefly in the file ../doc/regular.tex.