capture the major clusters; splitcodeword will make up the 256.
So throwing away noise is good: it keeps the threshold low.
Date 07/22/96
landsend.daily.dat pattern mining:
in ~zhang/public/LANDSEND with devise/cluster/run -session landsend4.tk:
State Distribution: LatLon.cluster
Christmas Season Orders: DateOrders.cluster
Christmas Season TotalAmount: DateTotalAmount.cluster
Christmas Season Locations: DateLatLon.cluster
Date 07/18/96
fix phase2.
scan for outliers only once.
Date 07/17/96
fix phase1 first.
Phase1A scheme works fine:
Average Nearest Neighbor Distance,
compress to about 50%
Phase1B scheme
Average Nearest Neighbor Distance,
compress to about x%
Date 07/16/96
status.* are not "sqr_ed", but lower levels are "sqr_ed" due to
distance and fitness.
Date 07/16/96
RecyQueue is changed back as variable length queue.
Date 07/16/96
AvgDNNScanLeafEntry(dtype) gets approximate average Nearest_Neighbor
distance of all leaf entries.
tested fine.
Date 07/15/96
With density (0.25*average density) + connectivity, rg1.r and rg2.b
and rg4.r are all identified.
Outliers are deleted directly from the contree:
saves space and keeps the threshold low.
Date 07/11/96
Try LANDSEND dataset for density and clustering with all combinations.
FtSurvey(0.01) can be 0 because there might be 2 similar entries in
the MaxLeaf node due to the split/merge process.
Date 07/10/96
contree.* and components.* are used together; tested fine with
test_contree.C (as phase3.C).
Nonleaf Nodes above Leaf Nodes are very meaningful wrt. clusters and outliers.
Generate rg4.b and rg4.r for testing clusters of any size and shape.
The connected function in cfentry.C is critical wrt. connectivity.
Need to consider both connectivity and density.
Date 07/09/96
components.* tested fine with
test_components.C (as main.C)
test_components.dat (as "tmp", and gnuplot to see what it looks like).
CurVertex and CurComponent will move pointers but not delete units.
Date 06/16/96
test C++ stream input and Devise-style reading in read.C:
Devise-style reading was too slow.
It is fixed and Devise-style reading is faster now.
Date 06/12/96
Concentrate on KDD Journal paper. (BIRCH6: ScanLeaf and FtScheme1)
BIRCH8: (should match BIRCH6 in quality)
components.* are used for connected components
contree.* for connected trees
Date 05/23/96
Start to implement component tree.
Find paper in KDD96 about connectivity.
CF tree and Component tree exist simultaneously. contree.*
Copy from CF tree to Component tree.
Date 05/23/96
Find the average distance between sample points and their qth nearest
neighbor and use it as h. That ensures that on average q points lie within
one standard deviation of the kernel. It has been suggested that q=10
is a reasonable value.
Adapt this heuristic for selecting the initial H0.
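The heuristic above can be sketched in one dimension; this is a hypothetical illustration (QthNNBandwidth is not a name from the source), assuming a brute-force scan over all pairs:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch of the bandwidth heuristic above: h = average
// distance from each sample point to its q-th nearest neighbor, so
// that on average q points lie within one kernel std. Use as initial H0.
double QthNNBandwidth(const std::vector<double>& x, int q) {
    const int n = (int)x.size();
    if (n <= q) return 0.0;  // not enough neighbors
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        std::vector<double> d;
        d.reserve(n - 1);
        for (int j = 0; j < n; j++)
            if (j != i) d.push_back(std::fabs(x[i] - x[j]));
        // place the q-th smallest distance at position q-1
        std::nth_element(d.begin(), d.begin() + (q - 1), d.end());
        sum += d[q - 1];
    }
    return sum / n;
}
```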
Date 05/17/96
Phase 3: hierarchy, clarans, kmeans, lloyd
Hierarchy is working fine with "cluster" generated.
Phase 4: works fine with "label" and "refcluster" generated.
Date 05/15/96
BIRCH7 (Phase1, Phase2, Density) + Devise AttrProj + ntrees works!
compile with g++-2.7.1
Date 04/22/96
BIRCH7 is for integrating with Devise scheme reading.
util.*
vector.*,
rectangle.*,
cfentry.*
cftree.*
Date 04/18/96
BIRCH bug when MemSize=1:
Stats->MemUsed=2 > Paras->MemSize=1
SplitBuffer->Full()==1
So it never takes more data even though the threshold is increased and the
tree is rebuilt but still has size of 2 ==> Dead loop.
Date 04/18/96
BIRCH bug when using spikes.dat:
Modify weight for 1 to 0.001, and works fine.
Guessed "cfentry" overflow and needed further debugging: NO!
Found "FtScheme6" is not right, the way d is computed:
It assumes K=1 (normalized to 1-hyper-cube), so when the
data is weighted 10000 times, K=1 is wrong.
Date 04/15/96
Can calculate density f(x), probability P(x<a) and MISE(f1,f2).
Date 04/14/96
Created "Density" class interface.
Date 04/11/96
Modify scheme file: new para.config: one main tree + outlier tree,
output:
clusters (phase3+phase4): Number of clusters or T;
density functions (smooth): Number of bars within the range.
Date 04/11/96
Fix bug in UNIFORM's special case when sigma==0 for f(x) and R(f'').
Date 04/08/96
Modify smooth.C to provide "difference comparisons" and "h* comparisons".
Date 03/21/96
Implement kmeans.C
Date 03/12/96
fix bug before call erf(x):
when r^2>0 (it could go slightly negative from accumulated error),
then sqrt(r^2) is valid.
erf(x) is found in math.h.
ERROR 1.0e-10 is too large, changed to 1.0e-30
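The guard above can be sketched as follows; SafeSqrt is a hypothetical name, not a function in the source. The idea: a squared distance assembled from CF sums can come out slightly negative from accumulated rounding, so clamp before taking sqrt.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper illustrating the fix above: r^2 computed from CF
// sums may be slightly negative due to accumulated floating-point error,
// so treat any non-positive value as an exact zero distance.
double SafeSqrt(double r2) {
    if (r2 <= 0.0) return 0.0;  // negative only via rounding error
    return std::sqrt(r2);
}
```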
Date 03/05/96
2 Experiments:
Use empirical H* selection,
Assume normal distribution, plot f1 and f2.
Assume uniform distribution, plot f1 and f2.
Erf(z) is implemented based on "numerical recipes" in numeric.*.
Date 02/29/96
3 Experiments:
1. Normal distr data --> h* --> f1
wi(h*,gi(x)) --> f2
2. Overlap: increase or decrease
let TreeSize=1 and #leafentry=2 by using proper T
Overlap is caused by using Min(D) at the leaf level.
If we use Min(Di), then the overlap should disappear.
So modify the code Leaf::ClosestOne to be able to handle 2 options.
Date 02/21/96
How to observe the distribution inside each CF?
#ifdef CF_DISTR
code ...
#endif /* CF_DISTR */
Set InitFt such that a small tree is finished in one pass.
Set a file with each leaf entry added.
Write to the file each time a point is inserted.
Right now, it works only for TreeSize=1 (a single leaf node).
So set the threshold directly to the target.
Date 02/21/96
Approach 2:
Dataset Construction with specific density function:
(1) A (k+1)-dimensional curve described by a function y=f(x1,...,xk).
(2) Uniform distribution in the (k+1)-dimensional space.
(3) k-dimensional projection of the (k+1)-dimensional data points underneath
the curve y=f(x1,...,xk).
(4) N: keep generating uniform data points until those underneath
the curve reach N.
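Steps (1)-(4) amount to rejection sampling; a minimal sketch for k=1 (hypothetical code, not the original generator), assuming the curve's bounding box [xlo,xhi] x [0,ymax] is known:

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical sketch of steps (1)-(4) above for k=1: draw (x, y)
// uniformly in the bounding box of y = f(x), keep the x-projection of
// every point underneath the curve, and stop once N points are kept.
// The kept x values follow a density proportional to f.
std::vector<double> GenerateUnderCurve(double (*f)(double),
                                       double xlo, double xhi,
                                       double ymax, int N, unsigned seed) {
    std::srand(seed);
    std::vector<double> data;
    while ((int)data.size() < N) {
        double x = xlo + (xhi - xlo) * (std::rand() / (double)RAND_MAX);
        double y = ymax * (std::rand() / (double)RAND_MAX);
        if (y <= f(x))          // step (3): point lies underneath the curve
            data.push_back(x);  // keep the k-dimensional projection
    }
    return data;
}
```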
Date 02/19/96
RECTANGLE calculations are correct, but output is in the format
of <center, length> on each dimension.
Date 02/13/96
Write separate code for generating the density function at point granularity.
Compare them with the same given range, bars and H.
Date 02/08/96
Approach 1:
Dataset Generation with 3 peaks of data:
1. Created "S.triple.script" for generating different histograms.
3 peaks of data:
<distribution, n, mean/location, sd/shape>
2. Re-check the concept of h in the code about Kernel Smoothing.
It is correct. If K(x) is the standard normal distribution
with mean 0 and std 1, then K((x-u)/h)/h is the generalized
normal distribution with mean u and std h.
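The relation checked above can be sketched directly; a minimal hypothetical illustration (StdNormalK and KernelSmooth are not names from the source), assuming the smoothed estimate is the usual average of rescaled kernels:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

const double PI = 3.14159265358979323846;

// Standard normal density: mean 0, std 1.
double StdNormalK(double z) {
    return std::exp(-0.5 * z * z) / std::sqrt(2.0 * PI);
}

// Hypothetical sketch of the kernel smoothing above: K((x-u)/h)/h is
// the normal density with mean u and std h, so the smoothed estimate
// is f(x) = (1/n) * sum_i K((x - x_i)/h) / h.
double KernelSmooth(const std::vector<double>& x, double h, double at) {
    double f = 0.0;
    for (size_t i = 0; i < x.size(); i++)
        f += StdNormalK((at - x[i]) / h) / h;
    return f / x.size();
}
```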
Date 01/23/96
Add Histogram and Kernel Smooth.
Date 01/18/96
Increasing Threshold:
(1) Data Oriented Approach: Based on current data to estimate.
(2) Memory Utilization Approach: Based on the memory utilization rate, use
the current "fractal dimension" to compress the tree and make room for new
data. 50%-90% depending on pass i: initially 50%, later 90%.
However based on discussions about "fractal dimensions":
> I have read your paper about "Analysis of R-trees Using the Concept
> of Fractal Dimension". My question is if any spatial dataset can
> be modeled by the fractal dimension power law? If not, how to decide
> what kind of dataset can (by trying to see if they can fit as a straight line?) ?
Yes, exactly - you try to see whether a straight line fits the plot
for some reasonably large range of scales (r1 , r2),
eg., r2 >= 10*r1
a data set might NOT be fitted by a straight line,
or it may be fitted by a straight line of slope s1 in one interval,
by another line of slope s2 in another, etc.
For more details, see the book by Schroeder, cited in the PODS-94 paper
From
Christos Faloutsos
Department of Computer Science
University of Maryland
College Park, MD 20742-3255
ph#: (301)-405-2695
FAX#: (301)-405-6707
http://www.cs.umd.edu/~christos/
To take the questions more or less in order: There are many spatial datasets
(in physics, biology, even in the growth of cities) which look roughly
fractal. (Probably none of them are _truly_ fractal, because of the atomic
nature of matter.) Usually these are the results of stochastic processes,
i.e. random fractals. A good book on this is Brian Kaye's _A Random Walk
Through Fractal Dimensions_. But the question you're asking, ``How can I
decide what kind of dataset is good to be modelled by a random fractal,'' is
rather trickier. Just from looking at the data, it's almost impossible to
tell what would be a good model; you need _some_ understanding of the
process that produced the data. (Of course, you can tell some things just
from the data: if it has an integer dimension, it's almost certainly not
from a random fractal!) Many random-walk processes lead to random
fractals, and vice versa, but there are exceptions both ways. (Again,
Kaye's book discusses this nicely.)
As to specific dimensions, I've not heard of the scaling dimension before,
nor the cluster dimension. The box, box-counting and Minkowski
dimensions are all the same. The definition of the Hausdorff dimension is a
bit different, but the box and Hausdorff dimensions of all but the most
pathological objects are the same, and physicists use them
interchangeably. (The box dimension is much easier to calculate.) I
_suspect_ they're the same as your scaling dimension.
From Cosma Shalizi
http://www.physics.wisc.edu/~shalizi/
Date 12/20/95
ScanLeaf is not proved but has better performance because it merges/splits freely.
CompactTree is proved but has worse performance because of restricted merging.
ShiftTree is proved and has almost the same performance as ScanLeaf.
If we use AbsorbEntry1, ScanLeaf is not a major problem.
If we use AbsorbEntry2, ShiftTree should be used.
Date 12/18/95
Happy Birthday!!!
Merge in the FtSurvey3 and then compact tree from right to left.
Try on Lena.m.ascb and find two extremes of entries:
a lot of noise is left due to the new AbsorbEntry.
Date 12/15/95
Done with CompactTree1 and CompactTree2, but they have a precision problem.
So get rid of all sqrt calculations inside.
Ft is kept as Ft^2.
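The fix above can be sketched as follows; a hypothetical illustration (SquaredThreshold is not a name from the source). Since sqrt is monotone, sqrt(d2) <= Ft is equivalent to d2 <= Ft^2, so the threshold is stored squared and the sqrt calls, with their rounding, disappear:

```cpp
#include <cassert>

// Hypothetical sketch: keep the threshold as Ft^2 and compare against
// squared distances directly, avoiding sqrt and its precision loss.
struct SquaredThreshold {
    double Ft2;  // threshold kept squared
    // d2 is a squared distance; equivalent to sqrt(d2) <= Ft
    bool Fits(double d2) const { return d2 <= Ft2; }
};
```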
Date 12/12/95
Find that ScanLeaf (proof+algorithm) is not right. Design a new algorithm
to be consistent with the proof.
Let me call the new algorithm CompactTree instead of ScanLeaf.
Be careful of leaf and nonleaf, empty entry and nonempty entry.
Date 12/09/95
AbsorbEntry: Absorb an entry if it fits in a leaf (new entry or merged)
with no splits. (Previously only merged)
Date 12/08/95
in FtSurvey1, FtSurvey2, and FtSurvey3
Real most-crowded leaf node: scan all leaf nodes.
Approximate most-crowded leaf node: top down.
Date 12/07/95
Fix bug in ClosestDiffTwo: so that FtSurvey1 and FtSurvey2 are
special cases of FtSurvey3 (0% or 100%); already tested fine.
Fix bug in FtSurvey3: only check the merged entries instead of all
entries in the leaf hierarchy.
Fix bug in NoiseSurvey: NoiseRate is based on current tree,
not accumulated noise
Devise is used to plot histogram, image, entry.
S.random.script has all the programs for transforming data.
Timing: with 80x1024 memory, 100000 tuples
rebuilt 1.15 seconds
scan data 25.00 seconds
Date 12/06/95
Compare FtScheme2,3,5 with noise option on/off
FtScheme6: Compare delta T effects
FtScheme5: Distortion Cutoff (10%) for deciding hierarchy cutoff
Date 12/04/95
Implemented "lloyd0" for centroid only.
If we want to use "lloyd0" for entry generally, we need to
prove it still converges with the entry constraints.
Date 12/04/95
Implemented "FtSurvey3" by forming a hierarchy in the leaf node
and then split to reach Volume Stable.
Date 12/01/95
Fix bug in ScanLeftLeaf(), Use Stats->LeftLeafK instead of a local k
Date 11/29/95
try "FtSurvey2" by merging a whole leaf node:
bad: (1) Ft grows too fast; (2) leaf size is artificial.
Find bugs wrt. "sqrt" and "pow of 1.0/Dimension".
Date 11/24/95
(1) Compile on rockyroad: g++ -MM is used for make dependency
(2) Debug segmentation fault: if/while (pointer!=NULL && ...)
free_nonleaf after FtSurvey
Date 11/22/95
(1) Fixed PRECISION_ERROR(10^(-10)) in fitness about a leaf entry.
(2) Try on "norm100000.100" , "grid1000.4.10" and "grid1000.4.10r"
and the misplaced points are minor:
They are either taken care by noise option or by refinement option.
Date 11/16/95
Phase 4 Improvements: For each P in N data points, find P's
nearest seed from K seeds.
a: naive
b: norms
Have tested that a and b generate exactly the same results.
QuickSort is not critical wrt. time, so it is a recursion.
Other operations relevant to N need to be improved with
"dynamic constraints" (done)
"avoid sqrt", (done)
"avoid recursion", (done: 3 procedures by 3 inlines)
"partial distortion", (conflicting with modulization)
c: tree minmax
AI: 1) MinMax procedure for tree pruning
2) Search order
3) R-tree, MBR, MinDist, MaxDist
d: hybrid
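Variant (b) above can be sketched via the triangle inequality; a hypothetical illustration (NearestSeedByNorms is not a name from the source), assuming the pruning rule is | ||P|| - ||S|| | <= ||P - S||, so a seed whose norm differs from ||P|| by at least the best distance so far cannot be closer:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

static double Norm(const std::vector<double>& v) {
    double s = 0.0;
    for (size_t i = 0; i < v.size(); i++) s += v[i] * v[i];
    return std::sqrt(s);
}

// Hypothetical sketch of the "norms" variant (b): skip the full distance
// computation for a seed whenever the norm lower bound already exceeds
// the best distance found so far. Results match the naive scan (a).
int NearestSeedByNorms(const std::vector<std::vector<double> >& seeds,
                       const std::vector<double>& p) {
    double pnorm = Norm(p);
    int best = -1;
    double bestd = 1e300;
    for (size_t k = 0; k < seeds.size(); k++) {
        if (std::fabs(pnorm - Norm(seeds[k])) >= bestd)
            continue;  // pruned: lower bound on distance is too big
        double d2 = 0.0;
        for (size_t i = 0; i < p.size(); i++) {
            double diff = p[i] - seeds[k][i];
            d2 += diff * diff;
        }
        double dist = std::sqrt(d2);
        if (dist < bestd) { bestd = dist; best = (int)k; }
    }
    return best;
}
```

In the real code the seed norms would be precomputed once, and sorting seeds by norm lets the scan stop early; the sketch keeps only the pruning idea.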
Date 11/15/95
Density distribution and NoiseRate estimation
OutlierTree Option: Another Stat object is created for building Outlier Tree
datafile ==> maintree<-->splitbuffer, outlierqueue, outliertree (several levels)
Date 11/9/95
Assume half disk for splits and half disk for outliers:
Result: FtScheme1 (HISTORY=5) + BirchPhase1a: work fine.
FtScheme1 (HISTORY=5) + BirchPhase1b: work fine.
FtScheme2 (HISTORY=2) + BirchPhase1a: work fine.
FtScheme2 (HISTORY=2) + BirchPhase1b: work fine
if Phase1 ends with a much smaller T1,
then it may need more aggressive noise removal.
Conservative Phase1:
BirchPhase1a: aggressive, add last pass in phase1
BirchPhase1b: conservative, add last pass in phase1
Date 11/8/95
Add FtScheme2: packed volume only,
HISTORY_TIMES=2 only, most recent trend,
(HISTORY_TIMES=5 for FtScheme1)
adjust previous overshot or undershot by the line,
adjust previous overshot conservatively
Date 11/6/95
DIMENSION constant --> Dimension variable
constructor -> construct object
init(short d) -> allocate space dynamically
Vector tmp; tmp.Init(d);
Rectangle tmp; tmp.Init(d);
Entry tmp; tmp.Init(d);
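The constructor/Init split above can be sketched as follows; a hypothetical class, not the original vector.* code, assuming Init(d) is the only place where storage for the run-time dimension is allocated:

```cpp
#include <cassert>

// Hypothetical sketch of the DIMENSION-constant-to-variable change:
// the constructor only constructs the object; Init(short d) allocates
// the value array for the run-time dimension d.
class Vector {
public:
    Vector() : dim(0), value(0) {}   // construct object, no allocation
    ~Vector() { delete[] value; }
    void Init(short d) {             // allocate space dynamically
        dim = d;
        value = new double[d];
        for (short i = 0; i < d; i++) value[i] = 0.0;
    }
    short Dim() const { return dim; }
    double& operator[](short i) { return value[i]; }
private:
    Vector(const Vector&);           // non-copyable in this sketch
    short dim;
    double* value;
};
```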
Date 11/3/95
add D4:
use timeutil.h and timeutil.C for timing
Paras is treated as global to all
Date 11/1/95
fix bug in FtSurvey:
ClosestTwo(BDtype): same points forming 2 entries in a leaf node
return a distance of 0.
ClosestDiffTwo(D3): return the smallest non-zero D3 distance