readme.txt

来自「Scalable k-means software and test datas」· 文本代码 · 共 201 行
TXT
201 行
INTRODUCTIONThis package contains the source code for the software that was used torun the experiments for the articleF. Farnstrom, J. Lewis and C. Elkan, "Scalability for Clustering AlgorithmsRevisited", In ACM SIGKDD Explorations, vol. 2, no. 1, pp. 51-57, August 2000. The source code has been released to the public so that others canverify our results and try different algorithms for clusteringlarge databases. Since this software from the beginning was writtenonly for our own use, its usability may be lacking, and the source codeis currently in general not adequately commented. If you find thesoftware useful and would like to see a more userfriendly versionof it, let us know and we will consider putting more work into it.If you publish scientific work based upon this software, pleasecite the above article.Questions about compiling and using the software can be directed toFredrik Farnstrom through E-mail to fredrikf@mail.com or by writing toFredrik FarnstromSkolvagen 44373 00 JamjoSWEDENPlease don't bother to report minor bugs, unless they're likely tosignificantly have influenced the results in our article.More general questions or comments about scalable clustering algorithmsshould be directed to Dr. Charles Elkan at elkan@cs.ucsd.eduLICENSEThis software has been released under the GNU General Public License.For more information about this, please read the file 'LICENSE'.COMPILINGOn a UNIX-like platform with the Gnu C++ Compiler (g++), everything thatneeds to be done is to type 'make'. To use another commandline-based compileror to change compiler options, edit the file 'Makefile'. If there's a needto compile the program under some other platform or compiler environment,see 'Makefile' for information about how the object files should belinked together into executables.Typing 'make clean' will remove executables and object files createdduring compiling. As this software is of an experimental nature, there'sno 'make install' or 'configure' available.Standard C libraries and a math library are needed for compilation.In addition, the software uses the functions srand48() and drand48()for random number generation. If these are not available, they canbe replaced by other functions, satisfying:// Return a non-negative double-precision floating-point value uniformly// distributed between [0.0, 1.0).double drand48(void);// Set seed value for random number generation.void srand48(long int seedval);For measuring the running time of the programs, the clock() functionis used. This conforms to ANSI C and should thus not cause any problems,if it does anyway, it's only used by the main functions, so just replaceit with some other function to measure running time.The software has been successfully compiled on the Intel/Linux andSPARC/Solaris platforms.CLASSESThe software was mainly designed using an object-oriented approach, andwas implemented in C/C++.Below is a table summarizing the classes used by the main software.Class name  Declared in     Implemented in  Description of purpose----------  -----------     --------------  ----------------------Singleton   singleton.h     singleton.cpp   Stores a single point and some                                            extra information about it.Subcluster  subcluster.h    subcluster.cpp  Stores sufficient statistics                                            about a cluster. Has functions                                            to compute distances and to                                            perturb cluster means.MainCluster subcluster.h    subcluster.cpp  Inherits SubclusterDatabase    database.h      -               Abstract base class for                                            interfaces to databases.SyntheticData syntheticdata.h syntheticdata.cpp Inherits Database. Creates                                            synthetic data from Gaussian                                            distributions.KddCupData  kddcup98data.h  kddcup98data.cpp A subclass to Database,                                            reading data from the KDD                                            Cup '98 dataset.RAMBuffer   rambuffer.h     rambuffer.cpp   A buffer storing points and                                            clusters, making sure that the                                            memory they occupy is limited.KMeans      kmeans.h        kmeans.cpp      An implementation of the                                            regular K-means algorithm.BufferedKMeans kmeans.h     kmeans.cpp      Regular K-means operating on                                            a fixed size buffer.ScalableKMeans scalablekmeans.h sc..ns.cpp  The implementation of the                                            scalable algorithms, incl.                                            Bradley et al and our                                            "simple algorithm".RUNNING EXPERIMENTS ON KDD CUP '98 DATATo run experiments on data from KDD Cup '98, do the following:1. Download the dataset and run 'kdd', see UTILITIES below.2. Run 'kdd2' to prepare for normalization.3. The program can use either ASCII or binary format for the floating    point dataset. To use ASCII, set READBINARY to 0 in 'kddcup98data.cpp'.    The dataset will be read from 'cup98b.dat', as created by 'scramble'.    To use binary, set READBINARY to 1. The data will be read from    'cup98b.dat' as created by 'scramble2'.4. Update the constants in ScalableKMeans to their desired values for    the clusterings.5. Recompile, make sure that scalablekmeans.o (delete it, then type 'make')    gets recompiled so that the new constant values take effect.6. Run 'scramble' if using ASCII or 'scramble2' if using binary data, to    create a new random permutation of the dataset.7. Run 'cluster1'. This performs five clusterings using each method.    Statistics about the clusterings are appended in text format to    'scale10.txt', 'scale1.txt', 'naive10.txt', 'naive1.txt',    'random10.txt', 'random1.txt' and 'kmeans.txt'.8. Repeat from 6 if desired. Good idea to use a shell or perl script if    you're doing this many times.9. Analyze the statistics in the 'scale10.txt', 'scale1.txt' etc files. The    awk script 'process.awk' may be useful for this.RUNNING EXPERIMENTS ON SYNTHETIC DATA1. Update constants in ScalableKMeans to their desired values for the    clusterings.2. Recompile, make sure that scalablekmeans.o (delete it, then type 'make')    gets recompiled so that the new constant values take effect.3. Run 'cluster2'. This performs five clusterings using each method,    on 50 different synthetic datasets.    Statistics about the clusterings are saved to 'cluster#.dat',    where # is 1-6.4. Analyze the statistics in the 'cluster#.dat' files.RUNNING EXPERIMENTS ON OTHER DATASETS1. Create a new Database subclass. Look at kddcup98data.h/.cpp to    see how this is done. Basically, there are only two functions,    reset() that is called to start reading from the beginning of the    dataset, and getPoint() that reads the next record, returning    a pointer to it or 0 if the whole dataset has been read.2. Use the ScalableKMeans/BufferedKMeans/KMeans classes with your new    Database subclass to do the clusterings. See 'main1.cpp' for an    example of how to do this.UTILITIESkddReads the data file from KDD Cup '98. To use, download the dataset'cup98lrn.txt', as of October 2000 available in compressed form fromhttp://kdd.ics.uci.edu/databases/kddcup98/kddcup98.htmlLoad it in a text editor, remove the first line that contains the labels, andsave it as 'cup98lrnB.txt'. Now, run 'kdd'. This extracts some of the featuresfrom the dataset and writes them to a new file 'cup98.dat'. 'cup98lrn.txt' and'cup98lrnB.txt' can now be deleted if desired.kdd2Reads the file 'cup98.dat' generated by 'kdd' as above. Computes statisticsfor the features and performs normalization, saving the new dataset as'cup98data.txt'. I don't think this is used for anything, but the programalso saves the statistics to the file 'cup98msd.txt', and this is neededby 'kddcup98data.cpp'.Observe the constants at the beginning of 'kdd2.c', these may need to bechanged if other features are used, if running on a subset of the cup98lrndataset, or if running on the validation set (cup98val).scramble / scramble2'scramble' reads the file 'cup98.dat' created by 'kdd'. A random (seed fromclock) permutation is applied to the records of the dataset, and the newdataset is written to 'cup98b.dat', with floating point numbers storedin ASCII format. 'scramble2' is identical, except that it writes to a filecalled 'cup98data.dat' and saves the floating point numbers in a binaryformat instead.----------------------Fredrik FarnstromOctober 8, 2000
readme.txt - 源码说明

本页面展示了「Scalable k-means software and test datasets This package (a Unix tar file, gzipped) contains the sou」中的 readme.txt 源码文件，采用文本编程语言编写，共 201 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与Scalable相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?