kmltest.cpp

高效的k-means算法实现
CPP
第 1 页 / 共 4 页
字号:
12 3 4 下一页
//----------------------------------------------------------------------//	File:		kmltest.cc//	Programmer:	David Mount//	Last modified:	08/07/05//	Description:	test/evaluation driver for kMeans//----------------------------------------------------------------------// Copyright (C) 2004-2005 David M. Mount and University of Maryland// All Rights Reserved.// // This program is free software; you can redistribute it and/or modify// it under the terms of the GNU General Public License as published by// the Free Software Foundation; either version 2 of the License, or (at// your option) any later version.  See the file Copyright.txt in the// main directory.// // The University of Maryland and the authors make no representations// about the suitability or fitness of this software for any purpose.// It is provided "as is" without express or implied warranty.//----------------------------------------------------------------------#include <iostream>			// file I/O#include <string>			// string ops#include <ctime>			// clock#include <cmath>			// math routines#include "KMeans.h"			// k-means includes#include "KMterm.h"			// k-means termination#include "KMdata.h"			// data point set#include "KMfilterCenters.h"		// center point set (for filtering)#include "KMlocal.h"			// k-means algorithms#include "KMrand.h"			// random point generation//----------------------------------------------------------------------// kmltest//// This program is a driver for testing and evaluating various algorithms// for the k-means problem, for point clustering in multi-dimensional// spaces.  It allows the user to generate or input data sets, to specify// the number of centers and generate or input their initial positions,// and then to run one of a number of k-means procedures.//// Overview:// ---------// The test program is run as follows:// // 	kmltest < test_input > test_output//// where the test_input file contains a list of directives as described// below.  Directives consist of a directive name, followed by list of// arguments (depending on the directive).  Arguments and directives are// separated by white space (blank, tab, and newline).  String arguments// are not quoted, and consist of a string of nonwhite chacters.  A// character "#" denotes a comment.  The following characters up to// the end of line are ignored.  Comments may only be inserted between// directives (not within the argument list of a directive).//// Algorithm Overview:// -------------------// The generic algorithm begins by generating an initial solution curr// and saving it in best.  These objects are local to the KMlocal// structure.   The value of curr reflects the current solution and// best reflects the best solution seen so far.//// The generic algorithm consists of some number of basic iterations,// called stages.  Each stage involves the execution of one step of// either the swap heuristic or Lloyd's algorithm.  Each of the// algorithms differ in how they apply these stages.//// Stages are grouped into runs.  Intuitively, a run involves a// (small) number of stages in search of a better solution.  A run might// end, say, because a better solution was found or a fixed number of// stages have been performed without any improvement. //// After a run is finished, we check to see whether we want to accept// the solution.  Presumably this happens if the cost is lower, but it// may happen even if the cost is inferior in other circumstances (e.g.,// as part of a simulated annealing approach).  Accepting a solution// means copying the current solution to the saved solution.  In some// cases, the acceptance may involve reseting the current solution to a// random solution.//// Algorithm Details:// ------------------// There are some concepts that are important to run/phase transitions.// One is the maximum number of stages.  Most algorithms provide some// sort of parameter that limits the number of stages that the algorithm// can spend in a particular run (before giving up).  The second is the// relative distortion loss, or RDL. (See also KMterm.h.) The RDL is// defined:////		initDistortion - currDistortion//       RDL =  -------------------------------//                    initDistortion//// Note that a positive value indicates that the distortion has// decreased.  The definition of "initDistortion" depends on the// algorithm.  It may be the distortion of the previous stage (RDL =// consecutive RDL), or it may be the distortion at the beginning of the// run (RDL = accumulated RDL).//// Here is a more detailed explanation of the algorithms.//// KMlocalLloyds// -------------//	This is Lloyd's algorithm with random restarts The algorithm is//	broken into phases, and each phase is broken into runs.  Each//	phase starts by sampling center points at random.  Each run is//	provided two parameters, a maximum number of runs per stage//	(max_run_stage) and a minimum accumulated relative distortion//	loss (min_accum_rdl).  If the accumulated RDL for the run//	exceeds this value, then the run ends in success.  If the number//	of stages is exceeded before this happens, the run ends in//	failure.  The phase ends with the first failed run.// KMlocalSwap// -----------//	This algorithm iteratively changes centers by performing swaps.//	Each run consists of a given number (max_swaps) executions of//	the swap heuristic.// KMlocalEZ_Hybrid// ----------------//	This implements a simple hybrid algorithm (compared to//	KMlocalHybrid).  The algorithm performs only one swap, followed//	by some number of iterations of Lloyd's algorithm.  Lloyd's//	algorithm is repeated until the consecutive RDL falls below a//	given threshold.////	A stage constitutes one invocation of the Swap or Lloyd's//	algorithm.  A run consists of a single swap followed by a//	consecutive sequence of Lloyd's steps.  A graphical//	representation of one run is presented below.  The decision to//	make another pass through Lloyd's is based on whether the//	relative improvement since the last stage (consecutive relative//	distortion loss) is above a certain fixed threshold//	(min_consec_rdl).// KMlocalHybrid// -------------//	This implements a more complex hybrid algorithm, which combines//	both of swapping and Lloyd's algorithm with a variant of//	simulated annealing.  The algorithm's execution is broken into//	the following different processes: one swap, a consecutive//	sequence of Lloyd's steps, and an acceptance test.  If we pass//	the acceptance test, we take the resulting solution, and//	otherwise we restore the old solution.////	The decision to perform another Lloyd's step or go on to//	acceptance is based on whether the relative improvement since//	the last stage (consecutive relative distortion loss) is above a//	certain fixed threshold (min_consec_rdl).////	If the resulting solution is better than the saved solution,//	then we accept it.  Otherwise, we use the simulated annealing//	decision choice (described below) to decide whether to accept//	it.////	Simulated Annealing Choice//	--------------------------//	The choice to accept a poorer solution occurs with probability////		exp(RDL/T),////	where RDL is the relative distortion loss (relative to the//	saved solution), and T is the current temperature.  Note that//	if RDL > 0 (improvement) then this quantity is > 1, and so we//	always accept.////	The temperature value T is a decreasing function of the number//	of the number of stages.  It starts at some initial value T0 and//	decreases slowly over time.  Rather than using the standard (slow)//	Boltzman annealing schedule, we use the following fast annealing//	schedule, every L stages we set////		T = TF * T,////	where://		L (= temp_run_length) is an integer parameter set by the//		    user.  (Presumably it depends on the number of//		    centers and the dimension of the space.)//		TF (= temp_reduc_factor) is a real number of the form 1-x,//		    for some small x.////	The initial temperature T0 is a tricky thing to set.  The user//	supplies a parameter p0 = init_prob_accept, the initial//	probability of accepting a random swap.  However, the//	probability of acceting a swap depends on the given RDL value.//	To estimate this, for the first L runs we use p0 as the//	probability.  Over these runs we compute the average rdl value.//	Once the first L runs have completed, we set T0 so that:////		exp(-averageRDL/T0) = initProbAccept,////	or equivalently////		T0 = -averageRDL/(ln initProbAccept).//// Basic operations:// -----------------// The test program can perform the following operations.  How these// operations are performed depends on the options which are described// later. ////	Data Generation://	----------------//	read_data_pts <file>	Create a set of data points whose//				coordinates are input from file <file>.//				Prior to this, data_size must be set.//				At most data_size points will be read//				from the file.  The actual number of//				points in the file may be less.//	gen_data_pts		Create a set of data points whose//				coordinates are generated from the//				current point distribution.////	Running k-means://	----------------//	run_kmeans <string>	Apply k-means clustering to the current//				point set and clusters.  The string//				specifies the desired version of k-means.//				These include://				    lloyd = runs Lloyd's algorithm using//					    the filtering algorithm.//				    swap  = runs the swap heuristic.//				    hybrid = runs Lloyd's algorithm//					    and the swap heuristic in//					    alternating steps.//				    EZ-hybrid = a simpler version of//					    hybrid. One swap followed//					    by some number of Lloyd's.////	Miscellaneous: (Strings may have no embedded blanks.)//	-----------------------------------------------------//	title <string>		Experiment title.//	print <string>		Output string to console (cerr).//	get_distortion		Computes and prints distortion (the sum//				of squared distances for the current//				centers).  Note that the kmeans algorithms//				compute and print the distortion for all//				stages but stage 0, if the stats level//				is set to stage.//// Options:// --------// How these operations are performed depends on a set of options.// If an option is not specified, a default value is used. An option// retains its value until it is set again.  String inputs are not// enclosed in quotes, and must contain no embedded white space (sorry,// this is C++'s convention).//// Options affecting data and query point generation:// --------------------------------------------------//	kcenters <int>		Number of centers.  Default = 5.//	dim <int>		Dimension of space.//	seed <int>		Seed for random number generation.//	data_size <int>		Number of data points to generate for//				gen_data_pts points and the maximum//				number of data points to be read for//				read_data_pts.  If this exceeds//				max_data_size, then max_data_size is//				incremented to match this value.//				Default = 100.//	std_dev <float>		Standard deviation (used in gauss, and//				clustered distributions).  This is the//				"small" distribution for clus_ellipsoids.//				Default = 1.//	std_dev_lo <float>	Low and high standard deviations (used in//	std_dev_hi <float>	clus_ellipsoids).  Default = 1.//	corr_coef <float>	Correlation coefficient (used in co-gauss//				and co_lapace distributions). Default = 0.05.//	colors <int>		Number of color classes (clusters) (used//				in the clustered distributions).  Def. = 5.//	new_clust		Once generated, cluster centers are not//				normally regenerated.  This is so that both//				centers and data points can be generated//				using the same set of clusters.  This option//				forces new cluster centers to be generated//				with the next generation of either data or//				center points.//	max_clus_dim <int>	Maximum dimension of clusters (used in//				clus_orth_flats and clus_ellipsoids).//				Default = 1.//	distribution <string>	Type of input distribution//				    uniform	= uniform over cube [-1,1]^d.//				    gauss	= Gaussian with mean 0//				    laplace	= Laplacian, mean 0 and var 1//				    co_gauss	= correlated Gaussian//				    co_laplace	= correlated Laplacian//				    clus_gauss	= clustered Gaussian//				    clus_orth_flats = clusters of orth flats//				    clus_ellipsoids = clusters of ellipsoids//				    multi_clus  = multi-sized clusters//				See the file rand.cc for further information.//// Options affection general program behavior:// -------------------------------------------//	stats <string>		Level of statistics output//				    silent	 = no output,//				    exec_time	+= execution time only//				    summary	+= summary of complete kmeans//				    stage	+= summary of each stage//				    step	+= show each step//				    centers	+= print centers each time//				    tree	+= show the tree.//				Default = "summary".//	print_points <string>	Print points after reading/generating?//				Argument is either "yes" or "no".//				Default = "no".//	show_assignments <string>//				Print the indices of the center to//				which each point has been assigned along//				with its distance to this center, after//				the algorithm terminates.  Argument is//				either "yes" or "no".//				Default = "no".//	validate <string>	Test (through a slow brute-force search)//				that the center assignments requested by//				show_assignments is correct.  This is//				used for debugging, and will slow//				overall execution.  Argument is either//				"yes" or "no".//				Default = "no".//				(Note: There are other elements of the//				computation that could be validated.//				Perhaps with time these will be included//				under this option.)//// Options affecting termination:// ------------------------------// The way of controlling the program's termination is to specify the// maximum number of stages.  (In theory, a better way would be to// determine when the algorithm has converged, but this seems to be a// very complex task to me.)  Each time the algorithm moves the center// points and recomputes the distortion constitutes a stage.  The// maximum number of stages is based on the number of data points n// (data_size) and the number of centers k (kcenters) and four// coefficients, a,...,d, using the following formula:////	MAX_STAGE = a + (b*k + c*n)^d//// In addition there are the following parameters.  They are based// on the "relative distortion loss" (RDL), which is defined to be// the relative decrease in the distortion.////	max_tot_stage 4*<float>
12 3 4 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -