kmlocal-doc.tex

来自「高效的k-means算法实现」· TEX 代码 · 共 866 行 · 第 1/3 页

TEX
866
字号
\subsubsection{Options used in Lloyd's Algorithm and Hybrid Algorithms}\begin{description*}\item[\SF{damp\_factor \BR{float}}] ~	A dampening factor in the interval (0,1].  The value 1 is the	standard Lloyd's algorithm.  Otherwise, each point is only moved	by this fraction of the way from its current location to the	centroid.  Default: 1\item[\SF{min\_accum\_rdl \BR{float}}] ~	This is used in Lloyd's algorithm algorithm which perform	multiple swaps.  When performing p swaps, we actually may	perform fewer than p.  We stop performing swaps, whenever the	total distortion (from the start of the run) has decreased by at	least this amount.  Default: 0.10\item[\SF{max\_run\_stage \BR{int}}] ~	This is used in Lloyd's algorithm.  A run terminates after this	many stages.  Default: 100\end{description*}\subsubsection{Options specific to the Swap algorithm}\begin{description*}\item[\SF{max\_swaps \BR{int}}] ~	Maximum swaps at any given stage.  Default: 1\end{description*}\subsubsection{Options specific to the hybrid algorithms}\begin{description*}\item[\SF{min\_consec\_rdl \BR{float}}] ~	This is used in the hybrid algorithms.  If the RDL of two	consecutive runs is less than this value Lloyd's algorithm is	deemed to have converged, and the run ends.\end{description*}\subsubsection{Options specific to the (complex) hybrid algorithm}\begin{description*}\item[\SF{init\_prob\_accept \BR{float}}] ~	The initial probability of accepting a solution that does not	improve the distortion.\item[\SF{temp\_run\_length \BR{int}}] ~	The number of stages before changing the temperature.\item[\SF{temp\_reduc\_factor \BR{float}}] ~	The factor by which temperature is reduced at the end of a	temperature run.\end{description*}    \subsection{Example}Option directives (such as ``dim,'' ``data\_size,'' ``seed'') thatmerely set option values are indented.  Operation directives(``gen\_data\_pts,'' ``run\_kmeans'') are not indented.{\small\begin{verbatim}title Experiment_1A                     # experiment title   stats summary                        # print summary information   print_assignments yes                # print final cluster assignments   dim 2                                # dimension 2     data_size 5000                       # 5000 data points   colors 30                            # ...broken into 30 clusters   std_dev 0.025                        # ...each with std deviation 0.025   distribution clus_gauss              # clustered gaussian distribution   seed 1                               # random number seedgen_data_pts                            # generate the data points    kcenters 20                          # place k=20 center points   distribution uniform                 # ...uniformly distributed   seed 2gen_centers                             # generate initial center points     lloyd_max_tot_stage 20 0 0 0         # terminate Lloyd's after 20 stagesprint Running-lloyd'srun_kmeans lloyd                        # run using Lloyd's algorithm    seed 2                               # use same seedgen_centers                             # regenerate same center points   max_swaps 3                          # at most 3 swaps   swap_max_tot_stage 0 3 0 2           # at most 3*k^2 = 1200 stages   swap_max_run_stage 50 0 0 0          # at most 50 stages per run   swap_min_run_gain 0.02               # stop run if distortion drops 2%print Running-swaprun_kmeans swap                         # run using swap heuristic\end{verbatim}}\section{Sample Program}Although KMlocal is not a library, it is possible to invoke thevarious algorithm directly from program.  The algorithm is designedaround a collection of C++ objects.  The include the following:\begin{description}\item[KMdata: (KMdata.h)] This object stores the data points.  The	constructor is given the dimension and the number of points to	allocate.  If $P$ is an object of this type, then $P[i][d]$	refers to the $d$th coordinate of the $i$th point.	One the key elements to the efficiency of the algorithms	presented is the use of the filtering algorithm for computing	the nearest cluster center for each data point.  This requires	the construction of a data structure called a kc-tree.  This tree	is constructed by the method \texttt{P.buildKcTree()}, which	should be done prior to running any of the clustering	algorithms.\item[KMfilterCenters: (KMfilterCenters.h)] This object stores the	center points (in a manner that makes the use of the filtering	algorithm possible).  The constructor is passed two arguments,	the desired number of center points $k$, and the associated	data points.		On completion of the execution of the clustering algorithm,	the centers are stored in this structure.  It supports a	method \texttt{print()}, which prints the centers to	the standard output, and method \texttt{getDist()}, which	returns the total distortion.\item[KMlocal: (KMlocal.h)] There are currently four different	clustering algorithms supported.  They are all designed around a	common local search template, called KMlocal.  This is a generic	template, and so cannot be invoked directly.  The following	derived objects can be invoked:	\begin{description*}	\item[KMlocalLloyds:] Repeated Lloyd's algorithm.	\item[KMlocalSwap:] The swap heuristic.	\item[KMlocalEZ\_Hybrid:] A simple hybrid, which simply		alternates Lloyd's algorithm and the swap algorithm.	\item[KMlocalHybrid:] A more complex hybrid algorithm, which		involves simulated annealing.	\end{description*}	These algorithms are described in Section~\ref{algo.sec}, above.	The constructor to each algorithm is passed two things, the	KMfilterCenter structure for the center points and the KMterm	structure (described below), which specifies the termination	conditions.  It supports the method \texttt{execute()}, which	initializes the center points to random positions, executes the	clustering algorithm, and returns with the center structure	modified to hold the final centers.\item[KMterm: (KMterm.h)] Each of the local search algorithms consists	of some number of incremental movements of the center points,	called \emph{stages}.  Stages are grouped into longer	organizational units called \emph{runs}, and runs are grouped	into longer units called \emph{phases}.  The meaning of the	transition from runs to phases depends on the individual	algorithm.  Intuitively, a run involves a search for a better	solution, through some local search operations.  If this search	results in a better solution, the run is said to be successful.	A phase consists of a series of consecutive successful runs.	When a run is unsuccessful, the phase ends.	The definition of when a run and a phase is complete depends on	a number of parameters.  These parameters are stored in the	KMterm object.  See the files KMterm.h and KMlocal.h for more	detailed explanation of their exact meaning.	One important parameter in the termination condition is the	maximum total stages.  All the current algorithms simply execute	until reaching this number of stages.  It is defined by a set of	four parameters $(a,b,c,d)$.  This was described earlier in 	Section~\ref{term.sec}.  These are the first four parameters	in the constructor for KMterm.\end{description}A minimal sample program is given below.  It generates a set of randomdata points (\emph{dataPoints}).  This is done using a function\emph{kmUniformPts}, which generates a set of uniformly distributedpoints in the cube $[-1,1]^d$, where $d$ is the dimension.  It thenconstructs a kc-tree for these points.  Next, it generates a centerpoint structure (\emph{ctrs}).  It declares a local search algorithm(\emph{kmAlg}) with which to perform the clustering.  In this case it isthe repeated Lloyd's algorithm, but the commented code indicates thepossible choices for the other clustering algorithms.  It creates KMtermobject, which (in addition to a number of cryptic options) requests thatthe algorithm be run for 100 stages (given by the first four parametersbeing $(100, 0, 0, 0)$).  It executes this algorithm, and prints theresulting distortion and center points. (This file can be found in\texttt{src/kmlminimal.cpp}. A more complete sample program can be foundin \texttt{src/kmlsample.cpp}.){\small\begin{verbatim}#include <cstdlib>                      // C standard includes#include <iostream>                     // C++ I/O#include <string>                       // C++ strings#include "KMlocal.h"                    // k-means algorithmsusing namespace std;                    // make std:: available// execution parameters (see KMterm.h and KMlocal.h)KMterm  term(100, 0, 0, 0,              // run for 100 stages             0.10, 0.10, 3,             // other typical parameter values              0.50, 10, 0.95);int main(){    int         k       = 4;                    // number of centers    int         dim     = 2;                    // dimension    int         nPts    = 20;                   // number of data points    KMdata dataPts(dim, nPts);                  // allocate data storage    kmUniformPts(dataPts.getPts(), nPts, dim);  // generate random points    dataPts.buildKcTree();                      // build filtering structure    KMfilterCenters ctrs(k, dataPts);           // allocate centers                                                // run the algorithm    KMlocalLloyds       kmAlg(ctrs, term);      // repeated Lloyd's    // KMlocalSwap      kmAlg(ctrs, term);      // Swap heuristic    // KMlocalEZ_Hybrid kmAlg(ctrs, term);      // EZ-Hybrid heuristic    // KMlocalHybrid    kmAlg(ctrs, term);      // Hybrid heuristic    ctrs = kmAlg.execute();                     // execute                                                // print number of stages    cout << "Number of stages: " << kmAlg.getTotalStages() << "\n";                                                // print average distortion    cout << "Average distortion: " << ctrs.getDist(false)/nPts << "\n";    ctrs.print();                               // print final centers    return EXIT_SUCCESS;}\end{verbatim}}The output of the minimal sample program when run on a Sun 5 runningSolaris 5.8 is shown below.{\small\begin{verbatim}Number of stages: 100Average distortion: 0.0806688       0        [ 0.538642 -0.747656 ] dist = 0.143903       1        [ 0.058456 -0.350953 ] dist = 0.0607307       2        [ 0.333405 0.368641 ] dist = 0.958262       3        [ -0.677253 -0.534964 ] dist = 0.450481\end{verbatim}}The data used as input to this minimal program is generated randomly,and so the final results depend on the system's random number generator.Different systems will likely produce different results.  For example,the same program when compiled on a PC under Microsoft VisualStudio.NET, produced the following very different output.{\small\begin{verbatim}Number of stages: 100Average distortion: 0.137674       0	[ 0.767327 0.301355 ] dist = 0.999582       1	[ -0.790716 0.392132 ] dist = 0.534576       2	[ -0.215044 -0.942462 ] dist = 0.369202       3	[ -0.200438 0.495125 ] dist = 0.850113\end{verbatim}}\section{How Do I Get Started Quickly?}If you have some data that you wish to cluster, I would suggest startingwith the program \texttt{src/kmlsample.cpp} and modifying it for yourpurposes.  The parameters that you will have to change are the desirednumber of clusters (\texttt{k}), the dimension (\texttt{dim}), themaximum number of points (\texttt{maxPts}).  You should set the numberof iteration stages  (\texttt{stages}) to a value that you feel is largeenough to guarantee good convergence.  A value on the order of 100--500should be sufficient for most instances.  Larger values may be neededfor higher dimensional or hard to cluster data sets.  You will also needto modify the section of the program that inputs the points.This program tries all four of the various clustering algorithm.  Youshould see which produces the smallest distortion for your data.  Basedon our experience, the algorithm that has the best overall performanceis KMlocalHybrid.  Finally, remove all the calls to the clusteringalgorithms, except for the final one you wish to use.If you are using a PC running Microsoft Windows, precompiled versions of\texttt{kmlminimal}, \texttt{kmlsample}, and \texttt{kmltest} can befound in the directory \texttt{KMLwin32/bin}.  The solution and projectfiles can be found in \texttt{KMLwin32} as well, for compilation usingMicrosoft Visual Studio.NET (under Visual C++).\end{document}

⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?