% kmlocal-doc.tex
\subsubsection{Options used in Lloyd's Algorithm and Hybrid Algorithms}
\begin{description*}
\item[\SF{damp\_factor \BR{float}}] ~
A dampening factor in the interval $(0,1]$. The value 1 yields the
standard Lloyd's algorithm. Otherwise, each center point is moved only
this fraction of the way from its current location to the centroid.
Default: 1.
\item[\SF{min\_accum\_rdl \BR{float}}] ~
This is used in variants of Lloyd's algorithm that perform multiple
swaps. When performing $p$ swaps, we may actually perform fewer than
$p$: we stop performing swaps as soon as the total distortion (from the
start of the run) has decreased by at least this amount.
Default: 0.10.
\item[\SF{max\_run\_stage \BR{int}}] ~
This is used in Lloyd's algorithm. A run terminates after this many
stages. Default: 100.
\end{description*}

\subsubsection{Options specific to the Swap algorithm}
\begin{description*}
\item[\SF{max\_swaps \BR{int}}] ~
The maximum number of swaps performed at any given stage. Default: 1.
\end{description*}

\subsubsection{Options specific to the hybrid algorithms}
\begin{description*}
\item[\SF{min\_consec\_rdl \BR{float}}] ~
This is used in the hybrid algorithms. If the RDL of two consecutive
runs is less than this value, Lloyd's algorithm is deemed to have
converged, and the run ends.
\end{description*}

\subsubsection{Options specific to the (complex) hybrid algorithm}
\begin{description*}
\item[\SF{init\_prob\_accept \BR{float}}] ~
The initial probability of accepting a solution that does not improve
the distortion.
\item[\SF{temp\_run\_length \BR{int}}] ~
The number of stages before the temperature is changed.
\item[\SF{temp\_reduc\_factor \BR{float}}] ~
The factor by which the temperature is reduced at the end of a
temperature run.
\end{description*}

\subsection{Example}

Option directives (such as ``dim,'' ``data\_size,'' and ``seed'') that
merely set option values are indented.
Operation directives (``gen\_data\_pts,'' ``run\_kmeans'') are not
indented.
{\small
\begin{verbatim}
title Experiment_1A            # experiment title
  stats summary                # print summary information
  print_assignments yes        # print final cluster assignments
  dim 2                        # dimension 2
  data_size 5000               # 5000 data points
  colors 30                    # ...broken into 30 clusters
  std_dev 0.025                # ...each with std deviation 0.025
  distribution clus_gauss      # clustered gaussian distribution
  seed 1                       # random number seed
gen_data_pts                   # generate the data points
  kcenters 20                  # place k=20 center points
  distribution uniform         # ...uniformly distributed
  seed 2
gen_centers                    # generate initial center points
  lloyd_max_tot_stage 20 0 0 0 # terminate Lloyd's after 20 stages
print Running-lloyd's
run_kmeans lloyd               # run using Lloyd's algorithm
  seed 2                       # use same seed
gen_centers                    # regenerate same center points
  max_swaps 3                  # at most 3 swaps
  swap_max_tot_stage 0 3 0 2   # at most 3*k^2 = 1200 stages
  swap_max_run_stage 50 0 0 0  # at most 50 stages per run
  swap_min_run_gain 0.02       # stop run if distortion drops 2%
print Running-swap
run_kmeans swap                # run using swap heuristic
\end{verbatim}
}

\section{Sample Program}

Although KMlocal is not a library, it is possible to invoke the various
algorithms directly from a program. The software is designed around a
collection of C++ objects, which include the following:
\begin{description}
\item[KMdata: (KMdata.h)]
This object stores the data points. The constructor is given the
dimension and the number of points to allocate. If $P$ is an object of
this type, then $P[i][d]$ refers to the $d$th coordinate of the $i$th
point. One of the key elements in the efficiency of the algorithms
presented here is the use of the filtering algorithm for computing the
nearest cluster center for each data point. This requires the
construction of a data structure called a kc-tree.
This tree is constructed by the method \texttt{P.buildKcTree()}, which
should be called prior to running any of the clustering algorithms.
\item[KMfilterCenters: (KMfilterCenters.h)]
This object stores the center points (in a manner that makes the use of
the filtering algorithm possible). The constructor is passed two
arguments: the desired number of center points $k$ and the associated
data points. On completion of the clustering algorithm, the final
centers are stored in this structure. It supports a method
\texttt{print()}, which prints the centers to the standard output, and
a method \texttt{getDist()}, which returns the total distortion.
\item[KMlocal: (KMlocal.h)]
There are currently four different clustering algorithms supported.
They are all designed around a common local search template, called
KMlocal. This is a generic template, and so cannot be invoked directly.
The following derived objects can be invoked:
\begin{description*}
\item[KMlocalLloyds:] Repeated Lloyd's algorithm.
\item[KMlocalSwap:] The swap heuristic.
\item[KMlocalEZ\_Hybrid:] A simple hybrid, which alternates between
Lloyd's algorithm and the swap algorithm.
\item[KMlocalHybrid:] A more complex hybrid algorithm, which involves
simulated annealing.
\end{description*}
These algorithms are described in Section~\ref{algo.sec}, above. The
constructor of each algorithm is passed two things: the KMfilterCenters
structure for the center points and the KMterm structure (described
below), which specifies the termination conditions. Each algorithm
supports the method \texttt{execute()}, which initializes the center
points to random positions, executes the clustering algorithm, and
returns with the center structure modified to hold the final centers.
\item[KMterm: (KMterm.h)]
Each of the local search algorithms consists of some number of
incremental movements of the center points, called \emph{stages}.
Stages are grouped into longer organizational units called \emph{runs},
and runs are grouped into longer units called \emph{phases}. The
meaning of the transition from runs to phases depends on the individual
algorithm. Intuitively, a run involves a search for a better solution
through some local search operations. If this search results in a
better solution, the run is said to be successful. A phase consists of
a series of consecutive successful runs; when a run is unsuccessful,
the phase ends. The definition of when a run or a phase is complete
depends on a number of parameters, which are stored in the KMterm
object. See the files KMterm.h and KMlocal.h for a more detailed
explanation of their exact meanings. One important parameter in the
termination condition is the maximum total number of stages. All the
current algorithms simply execute until reaching this number of stages.
It is defined by a set of four parameters $(a,b,c,d)$, as described
earlier in Section~\ref{term.sec}; these are the first four parameters
in the constructor for KMterm.
\end{description}

A minimal sample program is given below. It generates a set of random
data points (\emph{dataPoints}) using the function
\emph{kmUniformPts}, which generates a set of uniformly distributed
points in the cube $[-1,1]^d$, where $d$ is the dimension. It then
constructs a kc-tree for these points. Next, it generates a center
point structure (\emph{ctrs}). It declares a local search algorithm
(\emph{kmAlg}) with which to perform the clustering. In this case it is
the repeated Lloyd's algorithm, but the commented code indicates the
possible choices for the other clustering algorithms. It creates a
KMterm object, which (in addition to a number of cryptic options)
requests that the algorithm be run for 100 stages (given by the first
four parameters being $(100, 0, 0, 0)$). It executes this algorithm,
and prints the resulting distortion and center points. (This file can
be found in \texttt{src/kmlminimal.cpp}.
A more complete sample program can be found in
\texttt{src/kmlsample.cpp}.)
{\small
\begin{verbatim}
#include <cstdlib>     // C standard includes
#include <iostream>    // C++ I/O
#include <string>      // C++ strings
#include "KMlocal.h"   // k-means algorithms

using namespace std;   // make std:: available

// execution parameters (see KMterm.h and KMlocal.h)
KMterm term(100, 0, 0, 0,   // run for 100 stages
        0.10, 0.10, 3,      // other typical parameter values
        0.50, 10, 0.95);

int main()
{
    int k    = 4;       // number of centers
    int dim  = 2;       // dimension
    int nPts = 20;      // number of data points

    KMdata dataPts(dim, nPts);                 // allocate data storage
    kmUniformPts(dataPts.getPts(), nPts, dim); // generate random points
    dataPts.buildKcTree();                     // build filtering structure
    KMfilterCenters ctrs(k, dataPts);          // allocate centers

    // run the algorithm
    KMlocalLloyds kmAlg(ctrs, term);           // repeated Lloyd's
    // KMlocalSwap kmAlg(ctrs, term);          // Swap heuristic
    // KMlocalEZ_Hybrid kmAlg(ctrs, term);     // EZ-Hybrid heuristic
    // KMlocalHybrid kmAlg(ctrs, term);        // Hybrid heuristic
    ctrs = kmAlg.execute();                    // execute

    // print number of stages
    cout << "Number of stages: " << kmAlg.getTotalStages() << "\n";
    // print average distortion
    cout << "Average distortion: " << ctrs.getDist(false)/nPts << "\n";
    ctrs.print();                              // print final centers
    return EXIT_SUCCESS;
}
\end{verbatim}
}
The output of the minimal sample program when run on a Sun 5 running
Solaris 5.8 is shown below.
{\small
\begin{verbatim}
Number of stages: 100
Average distortion: 0.0806688
0    [ 0.538642 -0.747656 ]   dist = 0.143903
1    [ 0.058456 -0.350953 ]   dist = 0.0607307
2    [ 0.333405 0.368641 ]    dist = 0.958262
3    [ -0.677253 -0.534964 ]  dist = 0.450481
\end{verbatim}
}
The data used as input to this minimal program is generated randomly,
and so the final results depend on the system's random number
generator. Different systems will likely produce different results.
For example, the same program when compiled on a PC under Microsoft
Visual Studio.NET produced the following very different output.
{\small
\begin{verbatim}
Number of stages: 100
Average distortion: 0.137674
0    [ 0.767327 0.301355 ]    dist = 0.999582
1    [ -0.790716 0.392132 ]   dist = 0.534576
2    [ -0.215044 -0.942462 ]  dist = 0.369202
3    [ -0.200438 0.495125 ]   dist = 0.850113
\end{verbatim}
}

\section{How Do I Get Started Quickly?}

If you have some data that you wish to cluster, I would suggest
starting with the program \texttt{src/kmlsample.cpp} and modifying it
for your purposes. The parameters that you will have to change are the
desired number of clusters (\texttt{k}), the dimension (\texttt{dim}),
and the maximum number of points (\texttt{maxPts}). You should set the
number of iteration stages (\texttt{stages}) to a value that you feel
is large enough to guarantee good convergence. A value on the order of
100--500 should be sufficient for most instances; larger values may be
needed for higher dimensional or hard-to-cluster data sets. You will
also need to modify the section of the program that inputs the points.

This program tries all four of the clustering algorithms. You should
see which produces the smallest distortion for your data. Based on our
experience, the algorithm with the best overall performance is
KMlocalHybrid. Finally, remove all the calls to the clustering
algorithms except for the final one you wish to use.

If you are using a PC running Microsoft Windows, precompiled versions
of \texttt{kmlminimal}, \texttt{kmlsample}, and \texttt{kmltest} can be
found in the directory \texttt{KMLwin32/bin}. The solution and project
files can be found in \texttt{KMLwin32} as well, for compilation using
Microsoft Visual Studio.NET (under Visual C++).

\end{document}