📄 group.mx
字号:
@' The contents of this file are subject to the MonetDB Public License@' Version 1.1 (the "License"); you may not use this file except in@' compliance with the License. You may obtain a copy of the License at@' http://monetdb.cwi.nl/Legal/MonetDBLicense-1.1.html@'@' Software distributed under the License is distributed on an "AS IS"@' basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the@' License for the specific language governing rights and limitations@' under the License.@'@' The Original Code is the MonetDB Database System.@'@' The Initial Developer of the Original Code is CWI.@' Portions created by CWI are Copyright (C) 1997-2007 CWI.@' All Rights Reserved.@f group@a M.L. Kersten, P. Boncz, A.P. de Vries, N.J. Nes@v 2.3@+ The group moduleThis module contains the primitives to construct, derive, andperform statistical operations on BATs representing groups.The default scheme in Monet is to assume the head to representthe group identifier and the tail an element in the group.Groups play an important role in datamining, where they are usedto construct cross-tables. Such cross tables over a singleBAT are already supported by the histogram function.This module provides extensions to support identification of groups in a(multi-)dimensional space.The module implementation has a long history. The first implementationprovided several alternatives to produce/derive the grouping.A more complete (and complex) scheme was derived during itsextensive use in the context of the Data Distilleries product.The current implementation is partly a cleanup of this code-base,but also enables provides better access to the intermediatestructures produced in the process, i.e. the histogram andthe sub-group mapping. They can be used for various optimizationschemes at the MAL level.The prime limitation of the current implementation is that anunderlying database of @{oid->any@} BATs is assumed.This enables representation of each group using an oid,and the value representation of the group can be accordingly beretrieved easily. An optimized implementation in which we use positionalinteger id's (as embodied by Monet's void type) is also available.This limitation on (v)oid-headers is marginal. The primitive GRPsplitproduces for any BAT two copies with both a (v)oid header.@- AlgorithmsThere are several approaches to build a cross table. The one chosen hereis aimed at incremental construction, such that re-use of intermediatesbecomes possible. Starting with the first dimension, a BAT is derived torepresent the various groups, called a @emph{GRP BAT} or cross-table BAT.@- Cross Table (GRP)A cross table is an <oid,oid> BAT where the first (head) denotes a tuple inthe cross table and the second (tail) marks all identical lists.The tail-oids contain group identifiers; that is, @emph{this value is equal@strong{iff} two tuples belong to the same group}. The group identifiers arechosen from the domain of the tuple-identifiers. This simplifiesgetting back to the original tuples, when talking about a group.If the tuple-oid of 'John' is chosen as a group-id, you might view thisas saying that each member of the group is 'like John' with respectto the grouping-criterion.@- Successively the subgroups can be identified by modifying the GRP BAT orto derive a new GRP BAT for the subgroups. After all groups have beenidentified this way, a BAT histogram operation can be used to obtainthe counts of each data cube. Other aggregation operations using the MILset aggregate construct @{X@}(bat) can be used as well; note for instance that histogram == @{count@}(b.reverse()).The Monet interface module specification is shown below.Ideally we should defined stronger type constraints, e.g.command group.new(attr:bat[@{void,oid@},:any_1]@{@malmodule group;command new(b:bat[:any_1,:any_2], start:int, incr:int, grpsize:int) :bat[:any_1,:int] address GRPgroup0comment "Produces a new BAT with identical head column, and in the tail column groups of equally valued integers within each group. Parameters: a start group -value, -number increment, -size.";command new(attr:bat[:any_1,:any_2] ) (histo:bat[:any_1,:int], grp:bat[:any_1,:void]) address GRPgroup;command new(attr:bat[:any_1,:any_2] ) (histo:bat[:any_1,:int], grp:bat[:any_1,:oid]) address GRPgroup;command new(attr:bat[:any_1,:any_2], N:int, rng:int) (histo:bat[:any_1,:int],grp:bat[:any_1,:oid]) address GRPgroup_customcomment "Cross tabulation group initialization like GRPgroup, but with user provided #bits in hashmask and #distinct values in range.";command derive(hist:bat[:any_1,:int], map:bat[:any_1,:oid], attr:bat[:any_1,:any_2]) (histo:bat[:any_1,:int],grp:bat[:any_1,:oid]) address GRPderivecomment "Cross tabulation group extension step. Returned head values are identical as in 'ct'. Tail values are from the same domain and indicate further refinement of the groups in 'ct', taking into account also the tail-values in 'attr'.";command derive(histo:bat[:void,:int], map:bat[:void,:oid], attr:bat[:oid,:any_2]) (hist:bat[:oid,:int],grp:bat[:oid,:oid]) address GRPderive;command refine(b:bat[:any_2,:any_3], a:bat[:any_2,:any_1]) :bat[:any_2,:oid] address GRPrefinecomment "refine the ordering of a tail-ordered BAT by sub-ordering on the values of a second bat 'a' (where the heads of a and b match 1-1). The effect of this is similar to (hash-based) GRPderive, with the distinction that the group ids respect the ordering of the group values.";command refine(b:bat[:oid,:any_3], a:bat[:void,:any_1]) :bat[:oid,:oid] address GRPrefine;command refine(b:bat[:void,:any_3], a:bat[:oid,:any_1]) :bat[:oid,:oid] address GRPrefine;command refine_reverse(b:bat[:any_2,:any_3], a:bat[:any_2,:any_1]) :bat[:any_2,:oid] address GRPrefine_revcomment "refine the ordering of a tail-ordered BAT by sub-ordering on the values of a second bat 'a' (where the heads of a and b match 1-1). The effect of this is similar to (hash-based) GRPderive, with the distinction that the group ids respect the ordering of the group values.";@-@+ Group Aggregate operationsThis module also contains some efficient aggregate functions overgroups that compute their result in one scan.For the groups we assume a bat structure where the head indicates the groupand the tail contains the group elements. This leads to the situation thatmost value-based operators work on the tail, while counting groupsis focussed on the head.@= grpSignaturecommand sum(b:bat[:any_2,:@1], e:bat[:any_2,:any_1]) :bat[:any_2,:@1] address GRPsum_@1comment "grouped tail sum";command sum(b:bat[:any_2,:@1], size:int) :bat[:any_2,:@1] address GRPwindowsum_@1comment "Tail sum of groups of a fixed size";command sum(b:bat[:any_2,:@1], size:int, shift:int) :bat[:any_2,:@1] address GRPslidingsum_@1comment "Tail sum of groups of a sliding window of fixed size";command avg(b:bat[:any_2,:@1], e:bat[:any_2,:any_1]) :bat[:any_2,:@1] address GRPavg_@1comment "grouped tail average";@-Not used yet#command min(b:bat[:any_2,:@1], size:int) :bat[:any_2,:@1] #address GRPwindowmin@1#comment "Tail minimum of groups of a fixed size";#command max(b:bat[:any_2,:@1], size:int) :bat[:any_2,:@1] #address GRPwindowmax@1#comment "Tail minimum of groups of a fixed size";@mal @:grpSignature(sht)@ @:grpSignature(int)@ @:grpSignature(lng)@ @:grpSignature(flt)@ @:grpSignature(dbl)@command min(b:bat[:any_2,:any_1], e:bat[:any_2,:any_3]) :bat[:any_2,:any_1] address GRPmincomment "Grouped tail minimum";command max(b:bat[:any_2,:any_1], e:bat[:any_2,:any_3]) :bat[:any_2,:any_1] address GRPmaxcomment "Grouped tail maximum";command count(b:bat[:any_2,:any_1], e:bat[:any_2,:any_3], nonils:bit) :bat[:any_2,:int] address GRPaggr_countcomment "Grouped count";command size(b:bat[:any_2,:bit], e:bat[:any_2,:any_1]) :bat[:any_2,:int] address GRPsizecomment "Grouped count of true values";@-We need a few routines to support MonetDB/XQ implementation.They are focussed on edge BATs [:oid,:oid] where the tail isassumed a sequence and the head denotes a group.Finding the first and last element maps into findingthe min/max oid within a group.@= xqMinMaxDefcommand min(b:bat[:oid,:@1]):bat[:oid,:@1]address GRPmin_oid_@1comment "Select the minimum element of each group";command max(b:bat[:oid,:@1]):bat[:oid,:@1]address GRPmax_oid_@1comment "Select the minimum element of each group";@mal @:xqMinMaxDef(oid)@ @:xqMinMaxDef(sht)@ @:xqMinMaxDef(int)@ @:xqMinMaxDef(lng)@ @:xqMinMaxDef(flt)@ @:xqMinMaxDef(dbl)@command prelude()address GRPprelude;group.prelude();@+ Implementation CodeInclusion of the xtables requires some preliminary definitionsand als renaming :group by something else, because Mx can;t handlemacros identical to file names.@h#ifndef _GROUP_H_#define _GROUP_H_#include "gdk.h"extern int CTderive(BAT **B, BAT **H, BAT *ct_hist, BAT *ct_map, BAT *b);extern int CTgroup(BAT **B, BAT **H, BAT *b);#endif /* _GROUP_H_ */@c#include "mal_config.h"#include "group.h"#include "algebra.h"static int TYPE_mapentry;static int TYPE_idxentry;intgrp_new(BAT *b, BAT *h){ if (h) { BATkey(h, TRUE); BATkey(BATmirror(h), FALSE); h->tsorted = 0; if ((h->hsorted = BAThordered(b)) & 1) { if (BATcount(h) == BATcount(b)) { ALIGNsetH(h, BATmirror(b)); } } else if (BATorder(h) == NULL) { BBPreclaim(h); if (b) BBPreclaim(b); return GDK_FAIL; } BBPkeepref(h->batCacheid); } else { assert(h); } BBPkeepref(b->batCacheid); return GDK_SUCCEED;}@+ Core Grouping AlgorithmsWe use hash-grouping all the way. This implementation employsa simple sequential scan through the operands, adding groupvalues to a hash-table. This hash-table gives access to the groupidentifiers, which are always OIDs.This strategy is also followed on binary groupings; herewe construct a special integer consisting of the XORed hashnumberof both columns. In such a way, we can build a hash table onmap_entries (instead of simple atomic values -- the unary case).In the unary group case, we optimized processing on 1-byteand 2-byte values by using direct mapping in an array instead ofhashing.@c#define HASH_chr(p) ((hash_t) (*(unsigned char*) (p)))#define HASH_sht(p) ((hash_t) (*(unsigned short*) (p)))#define HASH_int(p) ((hash_t) *(unsigned int*) (p))#define HASH_lng(p) ((hash_t)(((unsigned int*)(p))[0]^((unsigned int*)(p))[1]))#define HASH_any(p) ((*hashfcn)(p))#define match_sync(b,p,r) r += yy#define match_hash(b,p,r) BUNfndOID(r,b,p); if (r == NULL) continue;#define declare_atom int any = b->ttype; hash_t (*hashfcn)(ptr) = BATatoms[any].atomHash;#define declare_simple /* any and hash would otherwise give unused variable warning */#define htype_sync(b) BAThdense(b)?TYPE_void:TYPE_oid#define htype_hash(b) TYPE_oid#define ttype_simple(b,t) t#define ttype_atom(b,t) b->ttype#define STANDARD_MASK ((hash_t) 1023)/* Note: following macros take advantage of clustered property; if b is clustered, then we can stop early traversing collision lists. BTW, simply stopping possibly breaks chain construction, so the resulting map is not directly reuseable as a hash table; the current Monet cannot however handle multiple accellerators, so this ain't a real problem for now :) */#define declare_unclustered /* avoid warning */#define declare_clustered int samecluster = TRUE;#define chain_unclustered for (zz = hash[c]; zz != HASH_MAX; zz = e->link)#define chain_clustered for (zz = hash[c]; (zz != HASH_MAX) && (samecluster); zz = e->link)#define tst_grp_unclustered(eq,p,t) (eq(p, tcur, t))#define tst_grp_clustered(eq,p,t) (samecluster = eq(p, tcur, t))#define tst_derive_unclustered(eq,p,t) (e->hcur == hcur && eq(p, tcur, t))#define tst_derive_clustered(eq,p,t) ((samecluster = e->hcur == hcur) && eq(p, tcur, t))/* NOTE: the first two fields of idxentry_t and mapentry_t MUST be the same */typedef struct { oid hcur; /* old group id */ hash_t link; /* hash link */} idxentry_t;typedef struct { oid hcur; /* old group id */ hash_t link; /* hash link */ oid gid; /* new group id */ int cnt; /* histogram count */} mapentry_t;typedef struct { BAT *map; /* [mapentry,value] elements */ hash_t *hash, mask; /* hash buckets and mask */ Heap hp; /* storage for hash buckets */} map_T;@:map_init_def(STANDARD,STANDARD_MASK,4096)@@:map_init_def(CUSTOM,custom_MASK,custom_rng)@@= map_init_def#define map_init_@1(map,hash,mask,entry,mapsize) \ if (m) { \ map = m->map; hash = m->hash; mask = m->mask; \ } else { \ hash_t yy; \ map = BATnew(TYPE_mapentry, tailtype(b,TRUE), @3); \ hash = (hash_t*) GDKmalloc((int)(sizeof(hash_t)*((mask=@2)+1))); \ for (yy=0; yy<=@2; yy++) { \ hash[yy] = HASH_MAX; \ } \ } \ entry.cnt = 1; \ mapsize = BUNindex(map, BUNlast(map));@cvoidmap_free(map_T m){ BBPreclaim(m.map); HEAPfree(&m.hp);}#ifndef offsetof#define offsetof(type, member) ((size_t) &((type *) 0)->member)#endifBAT *map2histo(BAT *map){ if (map == NULL || map->htype != TYPE_mapentry || VIEWparent(map) || map->batSharecnt > 1 || BATgetaccess(map) != BAT_WRITE) { if (map) BBPreclaim(map); return NULL; } /* trickily transform a bat[mapentry,any] into bat[oid,int] */ map->htype = BATmirror(map)->ttype = TYPE_oid; map->ttype = BATmirror(map)->htype = TYPE_int; if (map->tvarsized && map->theap) { HEAPfree(map->theap); GDKfree(map->theap); map->theap = NULL; } BATmirror(map)->hvarsized = map->tvarsized = 0; /* do these in the right order: map->dims.headloc is used, so * change it at end */ BATmirror(map)->dims.headloc = map->dims.tailloc = map->dims.headloc + (int) offsetof(mapentry_t, cnt); BATmirror(map)->dims.tailloc = map->dims.headloc = map->dims.headloc + (int) offsetof(mapentry_t, gid); return map;}@}@-The group macro is split along three dimensions:@multitable @columnfractions 0.2 0.8@item [type:] @tabType specific implementation for selecting the righthash function and data size etc.;@item [clustered:] @tabThe @{clustered and unclustered@} select theappropriate algorithm, i.e., with or without taking advantage of an order of values in the parent groups;@item [physical properties:] @tabValues @{standard and custom@}, choosing between a fixed predefined and a custom hashmask. Customallows the user to determine the size of the hashmask (and indirectlythe estimated size of the result). The hashmask is $2^n - 1$ where $n$is given by the user, or 1023 otherwise, and the derived resultsize is $4 \cdot 2^n$.@end multitableFurther research should point out whether fitting a simple statisticalmodel (possibly a simple mixture model) can help choose these parametersautomatically; the current idea is that the user (which could be adomain-specific extension of the higher-level language) knows theproperties of the data, especially for IR in which the standard groupingsettings differ significantly from the original datamining application.@{@c#define group_params_STANDARD /* fixed */#define group_params_CUSTOM hash_t custom_MASK, int custom_rng,@= groupAllBAT *CTgroup_@1_@4_@5(group_params_@5 BAT *b, BAT *bn, map_T *m){ oid *dst = (oid*) BUNfirst(bn); hash_t xx=0, *hash, mask; size_t zz, mapsize; mapentry_t entry, *e; BUN p, q, r; BAT *map = NULL; declare_@3#ifdef MKspeedup size_t idx;#endif map_init_@5(map,hash,mask,entry,mapsize); if (map == NULL) return NULL; /* core hash grouping algorithm */#ifdef MKspeedup/* MK: Experimental code. Should be triggered when you attemptto consume a lot of your physical memory. First indication, a factor 2.*/ (void) q; if( BATprepareHash(BATmirror(b))){ stream_printf(GDKout,"#M5 IO group join\n"); for( idx=0; idx< b->thash->mask; idx++) for( xx=b->thash->hash[idx]; xx != HASH_MAX; xx= b->thash->link[xx]){ declare_@4 ptr tcur; p= BUNptr(b,xx); tcur = BUN@2(b,p);#else BATloopFast(b, p, q, xx) { declare_@4 ptr tcur = BUN@2(b,p);#endif /* hash-lookup of 'tcur' in map */ hash_t c = HASH_@1(tcur); c = mix_int(c) & mask; chain_@4 { r = BUNptr(map,zz); e = (mapentry_t*) BUNhloc(map,r); if (tst_grp_@4(@3_EQ, BUN@2(map,r), @1)) { if (m == NULL) e->cnt++; goto found; } } /* not found-> insert new element in map (and hash) */ if (m) { zz = mapsize; } else { entry.gid = *(oid*) BUNhead(b,p); } entry.link = hash[c]; hash[c] = mapsize++; bunfastins(map, &entry, tcur); e = &entry;found: /* ultra-fast 'insert' of [oid,gid] into ct */ if (bn->htype) *dst++ = *(oid*) BUNhead(b,p); *dst++ = m?zz:e->gid; }#ifdef MKspeedup BATsort(BATmirror(bn));
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -