📄 readme.txt
字号:
INTRODUCTIONThis package contains the source code for the software that was used torun the experiments for the articleF. Farnstrom, J. Lewis and C. Elkan, "Scalability for Clustering AlgorithmsRevisited", In ACM SIGKDD Explorations, vol. 2, no. 1, pp. 51-57, August 2000. The source code has been released to the public so that others canverify our results and try different algorithms for clusteringlarge databases. Since this software from the beginning was writtenonly for our own use, its usability may be lacking, and the source codeis currently in general not adequately commented. If you find thesoftware useful and would like to see a more userfriendly versionof it, let us know and we will consider putting more work into it.If you publish scientific work based upon this software, pleasecite the above article.Questions about compiling and using the software can be directed toFredrik Farnstrom through E-mail to fredrikf@mail.com or by writing toFredrik FarnstromSkolvagen 44373 00 JamjoSWEDENPlease don't bother to report minor bugs, unless they're likely tosignificantly have influenced the results in our article.More general questions or comments about scalable clustering algorithmsshould be directed to Dr. Charles Elkan at elkan@cs.ucsd.eduLICENSEThis software has been released under the GNU General Public License.For more information about this, please read the file 'LICENSE'.COMPILINGOn a UNIX-like platform with the Gnu C++ Compiler (g++), everything thatneeds to be done is to type 'make'. To use another commandline-based compileror to change compiler options, edit the file 'Makefile'. If there's a needto compile the program under some other platform or compiler environment,see 'Makefile' for information about how the object files should belinked together into executables.Typing 'make clean' will remove executables and object files createdduring compiling. As this software is of an experimental nature, there'sno 'make install' or 'configure' available.Standard C libraries and a math library are needed for compilation.In addition, the software uses the functions srand48() and drand48()for random number generation. If these are not available, they canbe replaced by other functions, satisfying:// Return a non-negative double-precision floating-point value uniformly// distributed between [0.0, 1.0).double drand48(void);// Set seed value for random number generation.void srand48(long int seedval);For measuring the running time of the programs, the clock() functionis used. This conforms to ANSI C and should thus not cause any problems,if it does anyway, it's only used by the main functions, so just replaceit with some other function to measure running time.The software has been successfully compiled on the Intel/Linux andSPARC/Solaris platforms.CLASSESThe software was mainly designed using an object-oriented approach, andwas implemented in C/C++.Below is a table summarizing the classes used by the main software.Class name Declared in Implemented in Description of purpose---------- ----------- -------------- ----------------------Singleton singleton.h singleton.cpp Stores a single point and some extra information about it.Subcluster subcluster.h subcluster.cpp Stores sufficient statistics about a cluster. Has functions to compute distances and to perturb cluster means.MainCluster subcluster.h subcluster.cpp Inherits SubclusterDatabase database.h - Abstract base class for interfaces to databases.SyntheticData syntheticdata.h syntheticdata.cpp Inherits Database. Creates synthetic data from Gaussian distributions.KddCupData kddcup98data.h kddcup98data.cpp A subclass to Database, reading data from the KDD Cup '98 dataset.RAMBuffer rambuffer.h rambuffer.cpp A buffer storing points and clusters, making sure that the memory they occupy is limited.KMeans kmeans.h kmeans.cpp An implementation of the regular K-means algorithm.BufferedKMeans kmeans.h kmeans.cpp Regular K-means operating on a fixed size buffer.ScalableKMeans scalablekmeans.h sc..ns.cpp The implementation of the scalable algorithms, incl. Bradley et al and our "simple algorithm".RUNNING EXPERIMENTS ON KDD CUP '98 DATATo run experiments on data from KDD Cup '98, do the following:1. Download the dataset and run 'kdd', see UTILITIES below.2. Run 'kdd2' to prepare for normalization.3. The program can use either ASCII or binary format for the floating point dataset. To use ASCII, set READBINARY to 0 in 'kddcup98data.cpp'. The dataset will be read from 'cup98b.dat', as created by 'scramble'. To use binary, set READBINARY to 1. The data will be read from 'cup98b.dat' as created by 'scramble2'.4. Update the constants in ScalableKMeans to their desired values for the clusterings.5. Recompile, make sure that scalablekmeans.o (delete it, then type 'make') gets recompiled so that the new constant values take effect.6. Run 'scramble' if using ASCII or 'scramble2' if using binary data, to create a new random permutation of the dataset.7. Run 'cluster1'. This performs five clusterings using each method. Statistics about the clusterings are appended in text format to 'scale10.txt', 'scale1.txt', 'naive10.txt', 'naive1.txt', 'random10.txt', 'random1.txt' and 'kmeans.txt'.8. Repeat from 6 if desired. Good idea to use a shell or perl script if you're doing this many times.9. Analyze the statistics in the 'scale10.txt', 'scale1.txt' etc files. The awk script 'process.awk' may be useful for this.RUNNING EXPERIMENTS ON SYNTHETIC DATA1. Update constants in ScalableKMeans to their desired values for the clusterings.2. Recompile, make sure that scalablekmeans.o (delete it, then type 'make') gets recompiled so that the new constant values take effect.3. Run 'cluster2'. This performs five clusterings using each method, on 50 different synthetic datasets. Statistics about the clusterings are saved to 'cluster#.dat', where # is 1-6.4. Analyze the statistics in the 'cluster#.dat' files.RUNNING EXPERIMENTS ON OTHER DATASETS1. Create a new Database subclass. Look at kddcup98data.h/.cpp to see how this is done. Basically, there are only two functions, reset() that is called to start reading from the beginning of the dataset, and getPoint() that reads the next record, returning a pointer to it or 0 if the whole dataset has been read.2. Use the ScalableKMeans/BufferedKMeans/KMeans classes with your new Database subclass to do the clusterings. See 'main1.cpp' for an example of how to do this.UTILITIESkddReads the data file from KDD Cup '98. To use, download the dataset'cup98lrn.txt', as of October 2000 available in compressed form fromhttp://kdd.ics.uci.edu/databases/kddcup98/kddcup98.htmlLoad it in a text editor, remove the first line that contains the labels, andsave it as 'cup98lrnB.txt'. Now, run 'kdd'. This extracts some of the featuresfrom the dataset and writes them to a new file 'cup98.dat'. 'cup98lrn.txt' and'cup98lrnB.txt' can now be deleted if desired.kdd2Reads the file 'cup98.dat' generated by 'kdd' as above. Computes statisticsfor the features and performs normalization, saving the new dataset as'cup98data.txt'. I don't think this is used for anything, but the programalso saves the statistics to the file 'cup98msd.txt', and this is neededby 'kddcup98data.cpp'.Observe the constants at the beginning of 'kdd2.c', these may need to bechanged if other features are used, if running on a subset of the cup98lrndataset, or if running on the validation set (cup98val).scramble / scramble2'scramble' reads the file 'cup98.dat' created by 'kdd'. A random (seed fromclock) permutation is applied to the records of the dataset, and the newdataset is written to 'cup98b.dat', with floating point numbers storedin ASCII format. 'scramble2' is identical, except that it writes to a filecalled 'cup98data.dat' and saves the floating point numbers in a binaryformat instead.----------------------Fredrik FarnstromOctober 8, 2000
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -