📄 benchmarks

📁 Andreas Stolcke开发的聚类源码, 版本2.9。聚类方法是层次合并聚类

💻

字号:

The algorithm used in cluster version up to 2.3 was not very sophisticated.It constructs a triangular distance matrix which is repeatedly searchedlinearly for the shortest distance pair of data points.Thus the algorithm is cubic in the number n of data points.There is also a component that is linear in the length l of the patterns,but that is typically dominated by the n^3 part.In version 2.4 I made some trivial changes that not only cut the amountof memory, but also brought about a 20% improvement in run time.In version 2.5 another simple but effective change was made in the waydistances are recorded.  Instead of a full n-by-n matrix, only thenearest neighbor for each data point is recorded, along with its distance.From this we can find the overall smallest distance pair in time O(n).For each point, the nearest neighbor info can be updated in time O(1) inthe best case, and O(n) in the worst case; amortized time is O(1) per datapoint.  Thus the overall time becomes O(n^2).  As a bonus we can get rid ofthe triangular distance matrix and recompute distance the few times we need to,which brings the space complexity from O(n^2) to O(n).  This typically savesseveral megabytes.Using fancy methods for nearest neighbor finding (e.g., kd-trees) we should beable to get to O(n log n) average time complexity, but this will be left for alater version.Below are some benchmark results that confirm these analyses.------User CPU time measured on a SPARCstation 2 with	time cluster -g test.file >/dev/nullno. vectors x length	version 2.3	version 2.4	version 2.5100 x 10 		1.1 sec		0.8 sec		0.78 sec200 x 10		4.0 sec		3.4 sec		2.2 sec500 x 10		40.4 sec	32.2 sec	10.3 sec1000 x 1		272 sec		213 sec		10.6 sec1000 x 10		292 sec		230 sec		39.2 sec100 x 100		4.4 sec		4.1 sec		4.5 sec

⌨️ 快捷键说明

复制代码 Ctrl + C

搜索代码 Ctrl + F

全屏模式 F11

切换主题 Ctrl + Shift + D

显示快捷键 ?

增大字号 Ctrl + =

减小字号 Ctrl + -