📄 kmltest.cpp
字号:
//----------------------------------------------------------------------// File: kmltest.cc// Programmer: David Mount// Last modified: 08/07/05// Description: test/evaluation driver for kMeans//----------------------------------------------------------------------// Copyright (C) 2004-2005 David M. Mount and University of Maryland// All Rights Reserved.// // This program is free software; you can redistribute it and/or modify// it under the terms of the GNU General Public License as published by// the Free Software Foundation; either version 2 of the License, or (at// your option) any later version. See the file Copyright.txt in the// main directory.// // The University of Maryland and the authors make no representations// about the suitability or fitness of this software for any purpose.// It is provided "as is" without express or implied warranty.//----------------------------------------------------------------------#include <iostream> // file I/O#include <string> // string ops#include <ctime> // clock#include <cmath> // math routines#include "KMeans.h" // k-means includes#include "KMterm.h" // k-means termination#include "KMdata.h" // data point set#include "KMfilterCenters.h" // center point set (for filtering)#include "KMlocal.h" // k-means algorithms#include "KMrand.h" // random point generation//----------------------------------------------------------------------// kmltest//// This program is a driver for testing and evaluating various algorithms// for the k-means problem, for point clustering in multi-dimensional// spaces. It allows the user to generate or input data sets, to specify// the number of centers and generate or input their initial positions,// and then to run one of a number of k-means procedures.//// Overview:// ---------// The test program is run as follows:// // kmltest < test_input > test_output//// where the test_input file contains a list of directives as described// below. Directives consist of a directive name, followed by list of// arguments (depending on the directive). Arguments and directives are// separated by white space (blank, tab, and newline). String arguments// are not quoted, and consist of a string of nonwhite chacters. A// character "#" denotes a comment. The following characters up to// the end of line are ignored. Comments may only be inserted between// directives (not within the argument list of a directive).//// Algorithm Overview:// -------------------// The generic algorithm begins by generating an initial solution curr// and saving it in best. These objects are local to the KMlocal// structure. The value of curr reflects the current solution and// best reflects the best solution seen so far.//// The generic algorithm consists of some number of basic iterations,// called stages. Each stage involves the execution of one step of// either the swap heuristic or Lloyd's algorithm. Each of the// algorithms differ in how they apply these stages.//// Stages are grouped into runs. Intuitively, a run involves a// (small) number of stages in search of a better solution. A run might// end, say, because a better solution was found or a fixed number of// stages have been performed without any improvement. //// After a run is finished, we check to see whether we want to accept// the solution. Presumably this happens if the cost is lower, but it// may happen even if the cost is inferior in other circumstances (e.g.,// as part of a simulated annealing approach). Accepting a solution// means copying the current solution to the saved solution. In some// cases, the acceptance may involve reseting the current solution to a// random solution.//// Algorithm Details:// ------------------// There are some concepts that are important to run/phase transitions.// One is the maximum number of stages. Most algorithms provide some// sort of parameter that limits the number of stages that the algorithm// can spend in a particular run (before giving up). The second is the// relative distortion loss, or RDL. (See also KMterm.h.) The RDL is// defined://// initDistortion - currDistortion// RDL = -------------------------------// initDistortion//// Note that a positive value indicates that the distortion has// decreased. The definition of "initDistortion" depends on the// algorithm. It may be the distortion of the previous stage (RDL =// consecutive RDL), or it may be the distortion at the beginning of the// run (RDL = accumulated RDL).//// Here is a more detailed explanation of the algorithms.//// KMlocalLloyds// -------------// This is Lloyd's algorithm with random restarts The algorithm is// broken into phases, and each phase is broken into runs. Each// phase starts by sampling center points at random. Each run is// provided two parameters, a maximum number of runs per stage// (max_run_stage) and a minimum accumulated relative distortion// loss (min_accum_rdl). If the accumulated RDL for the run// exceeds this value, then the run ends in success. If the number// of stages is exceeded before this happens, the run ends in// failure. The phase ends with the first failed run.// KMlocalSwap// -----------// This algorithm iteratively changes centers by performing swaps.// Each run consists of a given number (max_swaps) executions of// the swap heuristic.// KMlocalEZ_Hybrid// ----------------// This implements a simple hybrid algorithm (compared to// KMlocalHybrid). The algorithm performs only one swap, followed// by some number of iterations of Lloyd's algorithm. Lloyd's// algorithm is repeated until the consecutive RDL falls below a// given threshold.//// A stage constitutes one invocation of the Swap or Lloyd's// algorithm. A run consists of a single swap followed by a// consecutive sequence of Lloyd's steps. A graphical// representation of one run is presented below. The decision to// make another pass through Lloyd's is based on whether the// relative improvement since the last stage (consecutive relative// distortion loss) is above a certain fixed threshold// (min_consec_rdl).// KMlocalHybrid// -------------// This implements a more complex hybrid algorithm, which combines// both of swapping and Lloyd's algorithm with a variant of// simulated annealing. The algorithm's execution is broken into// the following different processes: one swap, a consecutive// sequence of Lloyd's steps, and an acceptance test. If we pass// the acceptance test, we take the resulting solution, and// otherwise we restore the old solution.//// The decision to perform another Lloyd's step or go on to// acceptance is based on whether the relative improvement since// the last stage (consecutive relative distortion loss) is above a// certain fixed threshold (min_consec_rdl).//// If the resulting solution is better than the saved solution,// then we accept it. Otherwise, we use the simulated annealing// decision choice (described below) to decide whether to accept// it.//// Simulated Annealing Choice// --------------------------// The choice to accept a poorer solution occurs with probability//// exp(RDL/T),//// where RDL is the relative distortion loss (relative to the// saved solution), and T is the current temperature. Note that// if RDL > 0 (improvement) then this quantity is > 1, and so we// always accept.//// The temperature value T is a decreasing function of the number// of the number of stages. It starts at some initial value T0 and// decreases slowly over time. Rather than using the standard (slow)// Boltzman annealing schedule, we use the following fast annealing// schedule, every L stages we set//// T = TF * T,//// where:// L (= temp_run_length) is an integer parameter set by the// user. (Presumably it depends on the number of// centers and the dimension of the space.)// TF (= temp_reduc_factor) is a real number of the form 1-x,// for some small x.//// The initial temperature T0 is a tricky thing to set. The user// supplies a parameter p0 = init_prob_accept, the initial// probability of accepting a random swap. However, the// probability of acceting a swap depends on the given RDL value.// To estimate this, for the first L runs we use p0 as the// probability. Over these runs we compute the average rdl value.// Once the first L runs have completed, we set T0 so that://// exp(-averageRDL/T0) = initProbAccept,//// or equivalently//// T0 = -averageRDL/(ln initProbAccept).//// Basic operations:// -----------------// The test program can perform the following operations. How these// operations are performed depends on the options which are described// later. //// Data Generation:// ----------------// read_data_pts <file> Create a set of data points whose// coordinates are input from file <file>.// Prior to this, data_size must be set.// At most data_size points will be read// from the file. The actual number of// points in the file may be less.// gen_data_pts Create a set of data points whose// coordinates are generated from the// current point distribution.//// Running k-means:// ----------------// run_kmeans <string> Apply k-means clustering to the current// point set and clusters. The string// specifies the desired version of k-means.// These include:// lloyd = runs Lloyd's algorithm using// the filtering algorithm.// swap = runs the swap heuristic.// hybrid = runs Lloyd's algorithm// and the swap heuristic in// alternating steps.// EZ-hybrid = a simpler version of// hybrid. One swap followed// by some number of Lloyd's.//// Miscellaneous: (Strings may have no embedded blanks.)// -----------------------------------------------------// title <string> Experiment title.// print <string> Output string to console (cerr).// get_distortion Computes and prints distortion (the sum// of squared distances for the current// centers). Note that the kmeans algorithms// compute and print the distortion for all// stages but stage 0, if the stats level// is set to stage.//// Options:// --------// How these operations are performed depends on a set of options.// If an option is not specified, a default value is used. An option// retains its value until it is set again. String inputs are not// enclosed in quotes, and must contain no embedded white space (sorry,// this is C++'s convention).//// Options affecting data and query point generation:// --------------------------------------------------// kcenters <int> Number of centers. Default = 5.// dim <int> Dimension of space.// seed <int> Seed for random number generation.// data_size <int> Number of data points to generate for// gen_data_pts points and the maximum// number of data points to be read for// read_data_pts. If this exceeds// max_data_size, then max_data_size is// incremented to match this value.// Default = 100.// std_dev <float> Standard deviation (used in gauss, and// clustered distributions). This is the// "small" distribution for clus_ellipsoids.// Default = 1.// std_dev_lo <float> Low and high standard deviations (used in// std_dev_hi <float> clus_ellipsoids). Default = 1.// corr_coef <float> Correlation coefficient (used in co-gauss// and co_lapace distributions). Default = 0.05.// colors <int> Number of color classes (clusters) (used// in the clustered distributions). Def. = 5.// new_clust Once generated, cluster centers are not// normally regenerated. This is so that both// centers and data points can be generated// using the same set of clusters. This option// forces new cluster centers to be generated// with the next generation of either data or// center points.// max_clus_dim <int> Maximum dimension of clusters (used in// clus_orth_flats and clus_ellipsoids).// Default = 1.// distribution <string> Type of input distribution// uniform = uniform over cube [-1,1]^d.// gauss = Gaussian with mean 0// laplace = Laplacian, mean 0 and var 1// co_gauss = correlated Gaussian// co_laplace = correlated Laplacian// clus_gauss = clustered Gaussian// clus_orth_flats = clusters of orth flats// clus_ellipsoids = clusters of ellipsoids// multi_clus = multi-sized clusters// See the file rand.cc for further information.//// Options affection general program behavior:// -------------------------------------------// stats <string> Level of statistics output// silent = no output,// exec_time += execution time only// summary += summary of complete kmeans// stage += summary of each stage// step += show each step// centers += print centers each time// tree += show the tree.// Default = "summary".// print_points <string> Print points after reading/generating?// Argument is either "yes" or "no".// Default = "no".// show_assignments <string>// Print the indices of the center to// which each point has been assigned along// with its distance to this center, after// the algorithm terminates. Argument is// either "yes" or "no".// Default = "no".// validate <string> Test (through a slow brute-force search)// that the center assignments requested by// show_assignments is correct. This is// used for debugging, and will slow// overall execution. Argument is either// "yes" or "no".// Default = "no".// (Note: There are other elements of the// computation that could be validated.// Perhaps with time these will be included// under this option.)//// Options affecting termination:// ------------------------------// The way of controlling the program's termination is to specify the// maximum number of stages. (In theory, a better way would be to// determine when the algorithm has converged, but this seems to be a// very complex task to me.) Each time the algorithm moves the center// points and recomputes the distortion constitutes a stage. The// maximum number of stages is based on the number of data points n// (data_size) and the number of centers k (kcenters) and four// coefficients, a,...,d, using the following formula://// MAX_STAGE = a + (b*k + c*n)^d//// In addition there are the following parameters. They are based// on the "relative distortion loss" (RDL), which is defined to be// the relative decrease in the distortion.//// max_tot_stage 4*<float>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -