⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 kmlocal-doc.tex

📁 高效的k-means算法实现
💻 TEX
📖 第 1 页 / 共 3 页
字号:
\section{The Kmltest Driver Program}Kmltest is a driver for testing and evaluating various algorithms forthe $k$-means problem, for point clustering in multi-dimensional spaces.It allows the user to generate or input data sets, to specify the numberof centers and generate or input their initial positions, and then torun one of a number of $k$-means procedures.  The test program is run asfollows:\begin{center}	\textsf{kmltest $<$ test\_input $>$ test\_output}\end{center}  where the \textsf{test\_input} file contains a list of directives asdescribed below.  Directives consist of a directive name, followed bylist of arguments (depending on the directive).  Arguments anddirectives are separated by white space (blank, tab, and newline).String arguments are not quoted, and consist of a string of nonwhitecharacters.  A character ``\textsf{\#}'' denotes a comment.  Thefollowing characters up to the end of line are ignored.  Comments mayonly be inserted between directives (not within the argument list of adirective).  \subsection{Basic operations}The test program can perform the following operations.  How theseoperations are performed depends on the options which are describedlater. \newcommand{\BR}[1]{$\ang{\mbox{#1}}$}\newcommand{\SF}[1]{\textsf{#1}}  \subsubsection{Data Generation}\begin{description*}\item[\SF{read\_data\_pts \BR{file}}] ~	Create a set of data points whose coordinates are input from	file \BR{file}.  Prior to this, data\_size must be set.  At most	data\_size points will be read from the file.  The actual number	of points in the file may be less.\item[\SF{gen\_data\_pts}] ~	Create a set of data points whose coordinates are generated from	the current point distribution.\end{description*}  % \subsubsection{Initial Center Placement}% % \textbf{WARNING:} Currently center points are passed to the procedures,% but they always ignored.  The current implementations generate center% points by sampling from the data points.  These commands are here for% future enhancements, which will allow the user to specify the initial% center points.% % \begin{description*}% \item[\SF{read\_centers \BR{file}}] ~%   % 	(Unimplemented!) Create a set of center points whose coordinates% 	are input from file \BR{file}.% % \item[\SF{gen\_centers}] ~% % 	(Unimplemented!) Create a set of query points whose coordinates% 	are generated from the current point distribution.% % \item[\SF{sample\_centers}] ~% %   	(Unimplemented!) Sample centers from data points.% \end{description*}  \subsection{Running k-means}\begin{description*}\item[\SF{run\_kmeans \BR{method}}] ~  	Apply $k$-means clustering to the current point set and clusters.	The string specifies the desired version of $k$-means.  These	include:	\begin{description*}	\item[\SF{lloyd}] -- runs Lloyd's algorithm using the filtering		algorithm.	\item[\SF{swap}] -- runs the swap heuristic.	\item[\SF{hybrid}] -- runs Lloyd's algorithm and the swap		heuristic in alternating steps.	\item[\SF{EZ-hybrid}] -- a simpler version of hybrid. One swap		followed by some number of Lloyd's.	\end{description*}	See Section~\ref{algo.sec} for further details.\end{description*}\subsection{Miscellaneous}Note that in these commands, the string arguments may have no embeddedblanks.\begin{description*}\item[\SF{title \BR{string}}] ~	A title printed to the output file.\item[\SF{print \BR{string}}] ~  	Outputs a string to console (cerr).\item[\SF{get\_distortion}] ~	Computes and prints distortion (the sum of squared distances for	the current centers).  Note that the $k$-means algorithms	compute and print the distortion for all stages but stage 0, if	the stats level is set to ``\SF{stage}.''\end{description*}  \subsection{Options}How the above operations are performed depends on a set of options.  Ifan option is not specified, a default value is used. An option retainsits value until it is set again.  String inputs are not enclosed inquotes, and must contain no embedded white space.  \subsubsection{Options affecting nearest neighbor search}\begin{description*}\item[\SF{split\_rule \BR{type}}] ~  	Type of splitting rule to use in building the search tree.	Choices are:	\begin{description*}	\item[\SF{kd}] -- optimized kd-tree	\item[\SF{fair}] -- fair split	\item[\SF{midpt}] -- midpoint split	\item[\SF{sl\_midpt}] -- sliding midpt split	\item[\SF{sl\_fair}] -- sliding fair split	\item[\SF{suggest}] -- authors' choice for best	\end{description*}	The default is ``\SF{suggest}.''  See the file \SF{kd\_split.cc}	for more detailed information.  (Currently this is ignored!)  \item[\SF{bucket\_size \BR{int}}] ~  	Bucket size, that is, the maximum number of points stored in	each leaf node.\end{description*}  \subsubsection{Options affecting data and query point generation}\begin{description*}\item[\SF{kcenters \BR{int}}] ~  	Number of centers.  Default = 5.\item[\SF{dim \BR{int}}] ~  	Dimension of the space.\item[\SF{seed \BR{int}}] ~  	Seed for random number generation.\item[\SF{data\_size \BR{int}}] ~	Number of data points to generate for gen\_data\_pts points and	the maximum number of data points to be read for read\_data\_pts.	If this exceeds max\_data\_size, then max\_data\_size is incremented	to match this value.  Default = 100.\item[\SF{std\_dev \BR{float}}] ~	Standard deviation (used in gauss, and clustered distributions).	This is the ``small'' distribution for clus\_ellipsoids.  Default	= 1.\item[\SF{std\_dev\_lo \BR{float}}, \SF{std\_dev\_hi \BR{float}}] ~  	Low and high standard deviations (used in clus\_ellipsoids).	Default = 1.\item[\SF{corr\_coef \BR{float}}] ~  	Correlation coefficient (used in co-gauss and co\_lapace	distributions). Default = 0.05.\item[\SF{colors \BR{int}}] ~	Number of color classes (clusters) (used in the clustered	distributions).  Def. = 5.\item[\SF{new\_clust}] ~	Once generated, cluster centers are not normally regenerated.	This is so that both centers and data points can be generated	using the same set of clusters.  This option forces new cluster	centers to be generated with the next generation of either data	or center points.\item[\SF{max\_clus\_dim \BR{int}}] ~	Maximum dimension of clusters (used in clus\_orth\_flats and	clus\_ellipsoids).  Default = 1.\item[\SF{distribution \BR{string}}] ~  	Type of input distribution	\begin{description*}  	\item[\SF{uniform}]	-- uniform over cube $[-1,1]^d$.  	\item[\SF{gauss}]	-- Gaussian with mean 0  	\item[\SF{laplace}]	-- Laplacian, mean 0 and var 1  	\item[\SF{co\_gauss}]	-- correlated Gaussian  	\item[\SF{co\_laplace}]	-- correlated Laplacian  	\item[\SF{clus\_gauss}]	-- clustered Gaussian  	\item[\SF{clus\_orth\_flats}] -- clusters of orth flats  	\item[\SF{clus\_ellipsoids}] -- clusters of ellipsoids  	\item[\SF{multi\_clus}]  -- multi-sized clusters	\end{description*}  	See the file rand.cc for further information.\item[\SF{replacement \BR{string}}] ~  	Sampling option for sample\_centers.	\begin{description*}  	\item[\SF{on}]	-- sample with replacement  	\item[\SF{off}]	-- sample without replacement	\end{description*}\end{description*}  \subsubsection{Options affection general program behavior}\begin{description*}\item[\SF{stats \BR{string}}] ~  	Level of statistics output	\begin{description*}  	\item[\SF{silent}]	 = no output,  	\item[\SF{exec\_time}]	+= execution time only  	\item[\SF{summary}]	+= summary of complete $k$-means  	\item[\SF{stage}]	+= summary of each stage  	\item[\SF{trace}]	+= show everything as it happens.	\end{description*}\item[\SF{print\_points \BR{string}}] ~	Print the points after reading or generating them.  The	argument is either ``yes'' or ``no''.  Default = ``no''.\item[\SF{show\_assignments \BR{string}}] ~	After running the clustering algorithm, print the indices of the	center point to which each data point has been assigned along	with its distance to this center.  The argument is either	``yes'' or ``no''.  Default = ``no''.\item[\SF{dump \BR{file}}] ~  	Dump summary to <file> (for analysis by some other program).\item[\SF{validate \BR{string}}] ~	Validate experiment and compute average error.  Since validation	causes exact nearest neighbors to be computed by the brute force	method, this can take a long time.  Valid arguments are:	\begin{description*}	\item[\SF{yes}] (Not implemented!) turn validation on	\item[\SF{no}] turn validation off	\end{description*}\end{description*}\subsubsection{Options affecting termination} \label{term.sec}The way of controlling the program's termination is to specify themaximum number of stages.  (In theory, a better way would be todetermine when the algorithm has converged, but this seems to be a verycomplex task to me.)  Each time the algorithm moves the center pointsand recomputes the distortion constitutes a stage.  The maximum numberof stages is based on the number of data points $n$ (data\_size) and thenumber of centers $k$ (kcenters) and four coefficients, $(a,b,c,d)$,using the following formula:\[  	\textrm{MaxTotalStages} = a + (b \cdot k + c \cdot n)^d\]If the final result is 0, then the algorithm runs without terminating.\begin{description*}\item[\SF{max\_tot\_stage \BR{4 $\times$ float}}] ~	Maximum total stages for given as parameters.	Default: $\ang{0,0,0,0}$.\end{description*}

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -