📄 benchmarks
字号:
The algorithm used in cluster version up to 2.3 was not very sophisticated.It constructs a triangular distance matrix which is repeatedly searchedlinearly for the shortest distance pair of data points.Thus the algorithm is cubic in the number n of data points.There is also a component that is linear in the length l of the patterns,but that is typically dominated by the n^3 part.In version 2.4 I made some trivial changes that not only cut the amountof memory, but also brought about a 20% improvement in run time.In version 2.5 another simple but effective change was made in the waydistances are recorded. Instead of a full n-by-n matrix, only thenearest neighbor for each data point is recorded, along with its distance.From this we can find the overall smallest distance pair in time O(n).For each point, the nearest neighbor info can be updated in time O(1) inthe best case, and O(n) in the worst case; amortized time is O(1) per datapoint. Thus the overall time becomes O(n^2). As a bonus we can get rid ofthe triangular distance matrix and recompute distance the few times we need to,which brings the space complexity from O(n^2) to O(n). This typically savesseveral megabytes.Using fancy methods for nearest neighbor finding (e.g., kd-trees) we should beable to get to O(n log n) average time complexity, but this will be left for alater version.Below are some benchmark results that confirm these analyses.------User CPU time measured on a SPARCstation 2 with time cluster -g test.file >/dev/nullno. vectors x length version 2.3 version 2.4 version 2.5100 x 10 1.1 sec 0.8 sec 0.78 sec200 x 10 4.0 sec 3.4 sec 2.2 sec500 x 10 40.4 sec 32.2 sec 10.3 sec1000 x 1 272 sec 213 sec 10.6 sec1000 x 10 292 sec 230 sec 39.2 sec100 x 100 4.4 sec 4.1 sec 4.5 sec
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -