%----------------------------------------------------------------------
% File:            kmlocal-doc.tex
% By:              David Mount
% Last modified:   08/10/05
% Description:     User manual for KMlocal
%----------------------------------------------------------------------
% Copyright (c) 2005 University of Maryland and David Mount.  All
% Rights Reserved.
%
% Permission to use, copy, and distribute this software and its
% documentation is hereby granted free of charge, provided that
% (1) it is not a component of a commercial product, and
% (2) this notice appears in all copies of the software and
%     related documentation.
%
% The University of Maryland (U.M.) and the authors make no represen-
% tations about the suitability or fitness of this software for any
% purpose.  It is provided "as is" without express or implied warranty.
%----------------------------------------------------------------------
% History
% -------
% Version: 1.0  04/29/2002
%       Initial release
% Version: 1.1  04/08/2003
%       Added EZ_Hybrid and dampening.
% Version: 1.2  09/13/2003
%       Created documentation directory.  Added sample programs,
%       kmlsample.cpp and kmlminimal.cpp.
% Version: 1.7  08/10/2005
%       Fixed errors in kms_run documentation.  Added capability for
%       reporting final assignment to clusters and validation.
%----------------------------------------------------------------------
%  Typical page setup for a latex2e document.
%  Uncomment the desired style from the following options.
%-----------------------------------------------------------------------
\documentclass[11pt]{article}           % plain manuscript (11pt)
%-----------------------------------------------------------------------
%  Margin sizes (for plain manuscript style and letter-sized paper)
%
%  Note that the top and left margins include a 1 inch automatic offset.
%  For A4 paper, change the \textwidth and \textheight below.
%-----------------------------------------------------------------------
\setlength{\evensidemargin}{0in}        % 1 inch margin on left
\setlength{\oddsidemargin}{0in}
\setlength{\headsep}{0pt}               % no header line
\setlength{\topmargin}{0in}             % 1 inch margin on top
\setlength{\headheight}{0pt}
\setlength{\textwidth}{6.5in}           % text size for letter paper
\setlength{\textheight}{8.5in}
%   \setlength{\textwidth}{6.25in}      % text size for A4 paper
%   \setlength{\textheight}{9in}
\setlength{\parskip}{3pt plus 1.5pt minus 1.5pt}        % paragraph skip
%-----------------------------------------------------------------------
%  Special math symbols
%       floor, ceiling, angled brackets
%-----------------------------------------------------------------------
\newcommand{\floor}[1]{\left\lfloor #1\right\rfloor}
\newcommand{\ceil}[1]{\left\lceil #1\right\rceil}
\newcommand{\ang}[1]{\langle #1\rangle}
%-----------------------------------------------------------------------
%  Tighter lists
%-----------------------------------------------------------------------
\newenvironment{enumerate*}%            % Tighter enumerated list
  {\begin{enumerate}%
    \setlength{\itemsep}{-0.5ex}%
    \setlength{\parsep}{0pt}}%
  {\end{enumerate}}
\newenvironment{itemize*}%              % Tighter itemized list
  {\begin{itemize}%
    \setlength{\itemsep}{-0.5ex}%
    \setlength{\parsep}{0pt}}%
  {\end{itemize}}
\newenvironment{description*}%          % Tighter description list
  {\begin{description}%
    \setlength{\itemsep}{-0.5ex}%
    \setlength{\parsep}{0pt}}%
  {\end{description}}
%-----------------------------------------------------------------------
%  Title
%-----------------------------------------------------------------------
\begin{document}
\title{KMlocal: A Testbed for $k$-means Clustering Algorithms}
\author{David M. Mount\thanks{Copyright (c) 2002--2005 University of
        Maryland and David Mount. All Rights Reserved.  Partially
        supported by the National Science Foundation under
        grant CCR-0098151.} \\
        Department of Computer Science and \\
        Institute for Advanced Computer Studies \\
        University of Maryland \\
        College Park, Maryland 20742 \\
        Email: mount@cs.umd.edu.}
\date{Version: 1.7 \\
        August 10, 2005}
\maketitle

\section{Introduction}

This is a collection of programs for performing $k$-means clustering
based on local search.  In $k$-means clustering we are given a set of
$n$ data points in $d$-dimensional space and an integer $k$, and the
problem is to determine a set of $k$ points, called centers, so as to
minimize the mean squared distance from each data point to its nearest
center, called the \emph{average distortion}.

A popular algorithm for doing $k$-means clustering is called the
\emph{$k$-means algorithm}, or \emph{Lloyd's algorithm}.  Lloyd's
algorithm is based on the simple observation that the optimal placement
of a center is at the centroid of the associated cluster.  Given any set
of $k$ centers $Z$, for each center $z \in Z$, let $V(z)$ denote its
\emph{neighborhood}, that is, the set of data points for which $z$ is
the nearest neighbor.  Each stage of Lloyd's algorithm moves every
center point $z$ to the centroid of $V(z)$ and then updates $V(z)$ by
recomputing the distance from each point to its nearest center.  These
steps are repeated until some convergence condition is met.
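To make a single stage concrete, the following minimal sketch performs
one Lloyd's stage over points stored as plain coordinate vectors.  It is
illustrative only: the names and the brute-force nearest-center search
are chosen for exposition and are not the KMlocal implementation.

\begin{verbatim}
// One stage of Lloyd's algorithm: assign each point to its nearest
// center, then move each center to the centroid of its neighborhood
// V(z).  Illustrative sketch only; not the KMlocal code.
#include <cstddef>
#include <vector>

typedef std::vector<double> Point;          // a d-dimensional point

// Index of the center nearest to p (squared Euclidean distance).
std::size_t nearestCenter(const Point& p,
                          const std::vector<Point>& ctrs)
{
    std::size_t best = 0;
    double bestDist = -1;
    for (std::size_t j = 0; j < ctrs.size(); j++) {
        double dist = 0;
        for (std::size_t t = 0; t < p.size(); t++) {
            double diff = p[t] - ctrs[j][t];
            dist += diff * diff;
        }
        if (bestDist < 0 || dist < bestDist) {
            bestDist = dist; best = j;
        }
    }
    return best;
}

void lloydStage(const std::vector<Point>& data,
                std::vector<Point>& ctrs)
{
    const std::size_t k = ctrs.size(), d = ctrs[0].size();
    std::vector<Point> sum(k, Point(d, 0.0));   // coordinate sums
    std::vector<std::size_t> cnt(k, 0);         // neighborhood sizes
    for (std::size_t i = 0; i < data.size(); i++) {
        std::size_t j = nearestCenter(data[i], ctrs);
        cnt[j]++;
        for (std::size_t t = 0; t < d; t++) sum[j][t] += data[i][t];
    }
    for (std::size_t j = 0; j < k; j++) {
        if (cnt[j] == 0) continue;      // empty neighborhood: unchanged
        for (std::size_t t = 0; t < d; t++)
            ctrs[j][t] = sum[j][t] / cnt[j];
    }
}
\end{verbatim}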
However, Lloyd's algorithm can get stuck in locally minimal solutions
that are far from the optimal.  For this reason it is common to consider
heuristics based on \emph{local search}, in which centers are swapped in
and out of an existing solution (typically at random).  Such a swap is
accepted if it decreases the average distortion, and otherwise it is
ignored.  It is also possible to combine these two approaches (Lloyd's
algorithm and local search), producing a type of hybrid solution.  There
are many variants on these themes.

This program provides a number of different algorithms for doing
$k$-means clustering based on these ideas and combinations.  Further
information can be found in the following paper.
\begin{quotation}
T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and
A. Y. Wu, ``A Local Search Approximation Algorithm for k-Means
Clustering,'' \emph{Proc. of the 18th ACM Symp. on Computational
Geometry}, 2002, 10--18.
\end{quotation}
It is also available from
\centerline{\textsf{http://www.cs.umd.edu/\~{}mount/pubs.html}}

\section{Compilation}

Let us assume that you are in the kmlocal root directory, from which the
subdirectories \textsf{src}, \textsf{bin}, and \textsf{test} branch off.
To start, you can compile the kmltest program by entering (from the root
directory) ``\textsf{make}''.  This is set up for the g++ compiler,
version 2.7.2 or higher on Solaris.  It will probably generate a number
of error messages if you try it from another compiler or platform.  The
executable binary will be left in the file \textsf{bin/kmltest}.

\section{Overview of the Algorithms} \label{algo.sec}

There are four different procedures for performing $k$-means, which have
been implemented here.  The main issue is how the neighbors are computed
for each center.
\begin{description}
\item[Lloyd's:] Repeatedly applies Lloyd's algorithm with randomly
        sampled starting points.
\item[Swap:] A local search heuristic, which works by performing swaps
        between existing centers and a set of candidate centers.
\item[Hybrid:] A more complex hybrid of Lloyd's and Swap, which performs
        some number of swaps followed by some number of iterations of
        Lloyd's algorithm.  To avoid getting trapped in local minima,
        an approach similar to simulated annealing is included as well.
\item[EZ\_Hybrid:] A simple hybrid algorithm, which does one swap
        followed by some number of iterations of Lloyd's.
\end{description}

All the algorithms are based around a generic local search structure.
The generic algorithm begins by generating an initial solution curr and
saving it in best.  These objects are local to the KMlocal structure.
The value of curr reflects the current solution and best reflects the
best solution seen so far.  The generic algorithm consists of some
number of basic iterations, called stages.  Each stage involves the
execution of one step of either the swap heuristic or Lloyd's algorithm.
Each of the algorithms differs in how it applies these stages.

Stages are grouped into runs.  Intuitively, a run involves a (small)
number of stages in search of a better solution.  A run might end, say,
because a better solution was found or a fixed number of stages have
been performed without any improvement.  After a run is finished, we
check to see whether we want to accept the solution.  Presumably this
happens if the cost is lower, but it may happen even if the cost is
inferior in other circumstances (e.g., as part of a simulated annealing
approach).  Accepting a solution means copying the current solution to
the saved solution.  In some cases, the acceptance may involve resetting
the current solution to a random solution.

Two concepts are important to run/phase transitions.  One is the
maximum number of stages.  Most algorithms provide some sort of
parameter that limits the number of stages that the algorithm can spend
in a particular run (before giving up).  The second is the relative
distortion loss, or \emph{RDL}.  (See also KMterm.h.)  The RDL is
defined as
\[
    \mbox{RDL} =  \frac{\mbox{initDistortion} - \mbox{currDistortion}}%
                   {\mbox{initDistortion}}.
\]
Note that a positive value indicates that the distortion has decreased.
The definition of ``initDistortion'' depends on the algorithm.  It may
be the distortion of the previous stage (the \emph{consecutive} RDL), or
it may be the distortion at the beginning of the run (the
\emph{accumulated} RDL).

\subsection{Lloyds}

This is Lloyd's algorithm with random restarts.  The algorithm is broken
into phases, and each phase is broken into runs.  Each phase starts by
sampling center points at random.  Each run is provided two parameters,
a maximum number of stages per run (max\_run\_stage) and a minimum
accumulated relative distortion loss (min\_accum\_rdl).  If the
accumulated RDL for the run exceeds this value, then the run ends in
success.  If the number of stages is exceeded before this happens, the
run ends in failure.  The phase ends with the first failed run.
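The run/phase control described above can be summarized by the
following schematic sketch.  It is illustrative only: the
\textsf{std::function} members stand in for the real operations
(sampling centers, performing one Lloyd's stage, measuring the current
average distortion), none of the names below are part of the KMlocal
API, and the two parameters correspond to max\_run\_stage and
min\_accum\_rdl.

\begin{verbatim}
// Schematic run/phase control for the Lloyds procedure described
// above.  The std::function members are stand-ins for the real
// operations; this is not the KMlocal code.
#include <functional>

struct Clusterer {
    std::function<void()>   sampleCenters;   // choose k centers at random
    std::function<void()>   lloydStage;      // one stage of Lloyd's algorithm
    std::function<double()> distortion;      // current average distortion
};

// One run: at most maxRunStage stages; succeeds as soon as the RDL
// accumulated since the start of the run reaches minAccumRDL.
bool oneRun(Clusterer& c, int maxRunStage, double minAccumRDL)
{
    double initDist = c.distortion();          // "initDistortion" for this run
    for (int stage = 0; stage < maxRunStage; stage++) {
        c.lloydStage();
        double accumRDL = (initDist - c.distortion()) / initDist;
        if (accumRDL >= minAccumRDL) return true;   // run ends in success
    }
    return false;                                   // run ends in failure
}

// One phase: sample fresh centers, then repeat runs until one fails.
void onePhase(Clusterer& c, int maxRunStage, double minAccumRDL)
{
    c.sampleCenters();
    while (oneRun(c, maxRunStage, minAccumRDL)) {
        // keep starting new runs as long as they succeed
    }
}
\end{verbatim}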
\subsection{Swap}

This algorithm iteratively changes centers by performing swaps.  Each
run consists of a given number (max\_swaps) of executions of the swap
heuristic.

\subsection{EZ\_Hybrid}

This implements a simple hybrid algorithm (compared to the full hybrid).
The algorithm performs only one swap, followed by some number of
iterations of Lloyd's algorithm.  Lloyd's algorithm is repeated until
the consecutive RDL falls below a given threshold.

A stage constitutes one invocation of the Swap or Lloyd's algorithm.  A
run consists of a single swap followed by a consecutive sequence of
Lloyd's steps.  A graphical representation of one run is presented
below.  The decision to make another pass through Lloyd's is based on
whether the relative improvement since the last stage (consecutive
relative distortion loss) is above a certain fixed threshold
(min\_consec\_rdl).

\subsection{Hybrid}

This implements a more complex hybrid algorithm, which combines both
swapping and Lloyd's algorithm with a variant of simulated annealing.
The algorithm's execution is broken into the following different
processes: one swap, a consecutive sequence of Lloyd's steps, and an
acceptance test.  If we pass the acceptance test, we take the resulting
solution, and otherwise we restore the old solution.

The decision to perform another Lloyd's step or go on to acceptance is
based on whether the relative improvement since the last stage
(consecutive relative distortion loss) is above a certain fixed
threshold (min\_consec\_rdl).  If the resulting solution is better than
the saved solution, then we accept it.  Otherwise, we use the simulated
annealing decision choice (described below) to decide whether to accept
it.  The choice to accept a poorer solution occurs with probability
\[
        \exp \left( \frac{\mbox{RDL}}{T} \right),
\]
where RDL is the relative distortion loss (relative to the saved
solution), and $T$ is the current temperature.  Note that if
$\mbox{RDL} > 0$ (an improvement) then this quantity is $> 1$, and so we
always accept.

The temperature value $T$ is a decreasing function of the number of
stages.  It starts at some initial value $T_0$ and decreases slowly over
time.  Rather than using the standard (slow) Boltzmann annealing
schedule, we use the following fast annealing schedule: every $L$ stages
we set $T = T_F \cdot T$, where:
\begin{description}
\item[$L$ (temp\_run\_length):] is an integer parameter set by the
        user.  (Presumably it depends on the number of centers and
        the dimension of the space.)
\item[$T_F$ (temp\_reduc\_factor):] is a real number of the form
        $1-x$, for some small $x$.
\end{description}

The initial temperature $T_0$ is a tricky parameter to set.  The user
supplies a parameter $p_0$ (init\_prob\_accept), the initial probability
of accepting a random swap.  However, the probability of accepting a
swap depends on the given RDL value.  To estimate this, for the first
$L$ runs we use $p_0$ as the probability.  Over these runs we compute
the average RDL value.  Once the first $L$ runs have completed, we set
$T_0$ so that
\[
        \exp \left( -\frac{\mbox{avgRDL}}{T_0} \right) = p_0,
\]
or equivalently
\[
        T_0 = -\frac{\mbox{avgRDL}}{\ln p_0}.
\]
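The acceptance rule and annealing schedule can be summarized by the
following self-contained sketch.  It is illustrative only: the class and
member names are invented for exposition, the first $L$ acceptance
decisions are treated as the estimation period, and avgRDL is taken to
be the average magnitude of the observed RDL values; none of this is the
KMlocal implementation.

\begin{verbatim}
// Illustrative sketch of the annealing-style acceptance test and the
// fast annealing schedule described above (not the KMlocal code).
#include <cmath>
#include <random>

class AnnealingAcceptor {
public:
    AnnealingAcceptor(double initProbAccept, int tempRunLength,
                      double tempReducFactor)
        : p0(initProbAccept), L(tempRunLength), TF(tempReducFactor),
          T(0.0), stages(0), rdlSum(0.0), rng(std::random_device{}()) {}

    // Decide whether to accept a solution whose RDL (relative to the
    // saved solution) is rdl.  Improvements (rdl > 0) are always
    // accepted, since exp(rdl/T) > 1.
    bool accept(double rdl) {
        double prob;
        if (stages < L) {                 // estimation period: use p0 and
            rdlSum += std::fabs(rdl);     //   gather |RDL| statistics
            prob = (rdl > 0) ? 1.0 : p0;
        } else {
            if (stages == L)              // set T0: exp(-avgRDL/T0) = p0
                T = -(rdlSum / L) / std::log(p0);
            else if (stages % L == 0)     // every L stages: T <- TF * T
                T *= TF;
            prob = std::exp(rdl / T);     // > 1 for improvements
        }
        stages++;
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        return coin(rng) < prob;
    }

private:
    double p0;        // init_prob_accept
    int    L;         // temp_run_length
    double TF;        // temp_reduc_factor
    double T;         // current temperature
    int    stages;    // acceptance decisions made so far
    double rdlSum;    // sum of |RDL| over the estimation period
    std::mt19937 rng;
};
\end{verbatim}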
