📄 mkey.mx
字号:
@' The contents of this file are subject to the MonetDB Public License@' Version 1.1 (the "License"); you may not use this file except in@' compliance with the License. You may obtain a copy of the License at@' http://monetdb.cwi.nl/Legal/MonetDBLicense-1.1.html@'@' Software distributed under the License is distributed on an "AS IS"@' basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the@' License for the specific language governing rights and limitations@' under the License.@'@' The Original Code is the MonetDB Database System.@'@' The Initial Developer of the Original Code is CWI.@' Portions created by CWI are Copyright (C) 1997-2007 CWI.@' All Rights Reserved.@f mkey@a Peter Boncz, Stefan Manegold@-@v 1.1@+ Multi-Attribute Equi-Join@{@malmodule mkey;command rotate(v:int, nbits:int) :int address MKEYrotatecomment "left-rotate an int by nbits";pattern hash(v:any):int address MKEYhash comment "compute a hash int number from any value";command hash(v:chr):int address MKEYhash_chr; command hash(v:sht):int address MKEYhash_sht; command hash(v:int):int address MKEYhash_int; command hash(v:flt):int address MKEYhash_flt; command hash(v:dbl):int address MKEYhash_dbl; command hash(v:lng):int address MKEYhash_lng; command hash(v:str):int address MKEYhash_str; pattern bulk_rotate_xor_hash(h:int, nbits:int, v:any) :int address MKEYrotate_xor_hashcomment "post: [:xor=]([:rotate=](h, nbits), [hash](b))";command bulk_rotate_xor_hash(h:bat[:oid,:int], nbits:int,b:bat[:oid,:any_1]) :bat[:oid,:int] address MKEYbulk_rotate_xor_hashcomment "pre: h and b should be synced on head post: [:xor=]([:rotate=](h, nbits), [hash](b))";@-@}When creating a join, we want to make a unique key of the attributes on bothsides and then join these keys. Consider the following BATs.@verbatimorders customer link==================== ===================== =========== zipcode hnr zipcode hnr oid cido1 13 9 c1 11 10 o1 c5o2 11 10 c2 11 11 o2 c1o3 11 11 c3 12 2 o3 c2o4 12 5 c4 12 1 o4 nilo5 11 10 c5 13 9 o5 c1o6 12 2 c6 14 7 o6 c3o7 13 9 o7 c5o8 12 1 o8 c4o9 13 9 o9 c5@end verbatimThe current approach is designed to take minimal memory, as our previoussolutions to the problem did not scale well. In case of singular keys,the link is executed by a simple join. Before going into the join, wemake sure the end result size is not too large, which is done by lookingat relation sizes (if the other key is unique) or, if that is not possible,by computing the exact join size.The join algorithm was also improved to do dynamic sampling to determinewith high accuracy the join size, so that we can alloc in one go a memoryregion of sufficient size. This also reduces the ds\_link memory requirements.For compound keys, those that consist of multiple attributes, we now computea derived column that contains an integer hash value derived from allkey columns.This is done by computing a hash value for each individual key columnand combining those by bitwise XOR and left-rotation. That is, for eachcolumn,we rotate the working hash value by N bits and XOR the hash valueof the column over it. The working hash value is initialized with zero,and after all columns are processed, this working value is used as output.Computing the hash value for all columns in the key for one table is doneby the command hash(). Hence, we do hash on both sides, and jointhat together with a simple join:join(hash(keys), hash(keys.reverse);One complication of this procedure are nil values:@table@itemize@item it may happen that the final hash-value (an int formed by arandom bit pattern) accidentally has the value of int(nil).Notice that join never matches nil values.Hence these accidental nils must be replaced by a begin value (currently: 0).@item in case any of the compound key values is nil, our nil semanticsrequire us that those tuples may never match on a join. Consequently,during the hash() processing of all compound key columns for computingthe hash value, we also maintain a bit-bat that records which tuples hada nil value. The bit-bat is initialized to false, and the results of thenil-check on each column is OR-ed to it.Afterwards, the hash-value of all tuples that have this nil-bit set toTRUE are forced to int(nil), which will exclude them from matching.@end itemizeJoining on hash values produces a @emph{superset} of the join result:it may happen that two different key combinations hash on the same value,which will make them match on the join (false hits). The final partof the ds\_link therefore consists of filtering out the false hits.This is done incrementally by joining back the join result to the originalcolumns, incrementally one by one for each pair of correspondingcolumns. These values are compared with each other and we AND theresult of this comparison together for each pair of columns.The bat containing these bits is initialized to all TRUE and serves asfinal result after all column pairs have been compared.The initial join result is finally filtered with this bit-bat.Joining back from the initial join-result to the original columns onboth sides takes quite a lot of memory. For this reason, the falsehit-filtering is done in slices (not all tuples at one time).In this way the memory requirements of this phase are kept low.In fact, the most memory demanding part of the join is the int-joinon hash number, which takes N*24 bytes (where N= |L| = |R|).In comparison, the previous CTmultigroup/CTmultiderive approachtook N*48 bytes. Additionally, by making it possible to use merge-sort,it avoids severe performance degradation (memory thrashing) as producedby the old ds\_link when the inner join relation would be larger than memory.If ds\_link performance is still an issue, the sort-merge join used herecould be replaced by partitioned hash-join with radix-cluster/decluster.@{@+ Implementation@h#ifndef _MKEY_H#define _MKEY_H#include <mal.h>#include "mal_interpreter.h"#include "mal_exception.h"#ifdef WIN32#ifndef LIBMKEY#define mkey_export extern __declspec(dllimport)#else#define mkey_export extern __declspec(dllexport)#endif#else#define mkey_export extern#endif#define GDK_ROTATE(x,y,z,m) ((((x) << (y)) & ~(m)) | (((x) >> (z)) & (m)))mkey_export str MKEYrotate(int *ret, int *v, int *nbits);mkey_export str MKEYhash(MalBlkPtr mb, MalStkPtr stk, InstrPtr p);mkey_export str MKEYhash_chr(int *ret, chr *v);mkey_export str MKEYhash_sht(int *ret, sht *v);mkey_export str MKEYhash_int(int *ret, int *v);mkey_export str MKEYhash_flt(int *ret, flt *v);mkey_export str MKEYhash_dbl(int *ret, dbl *v);mkey_export str MKEYhash_lng(int *ret, lng *v);mkey_export str MKEYhash_str(int *ret, str *v);mkey_export str MKEYrotate_xor_hash(MalBlkPtr mb, MalStkPtr stk, InstrPtr p);mkey_export str MKEYbulk_rotate_xor_hash(int *ret, int *hid, int *nbits,int *bid);#endif /* _MKEY_H */@-new functionality for the low-resource-consumption ds_link. It willfirst one by one create a hash value out of the multiple attributes.This hash value is computed by xoring and rotating individual hashvalues together. We create a hash and rotate command to do this.@c#include "mal_config.h"#include "mkey.h"/* TODO: nil handling. however; we do not want to lose time in bulk_rotate_xor_hash with that */intCMDrotate(int *res, int *val, int *n){ *res = GDK_ROTATE(*val, *n, 32 - *n, (1 << *n) - 1); return GDK_SUCCEED;}intCMDhash_chr(int *res, chr *val){ *res = *(chr *) val; return GDK_SUCCEED;}intCMDhash_sht(int *res, sht *val){ *res = *(sht *) val; return GDK_SUCCEED;}intCMDhash_int(int *res, int *val){ *res = *(int *) val; return GDK_SUCCEED;}intCMDhash_flt(int *res, flt *val){ *res = *(int *) val; return GDK_SUCCEED;}intCMDhash_lng(int *res, lng *val){ *res = ((int *) val)[0] ^ ((int *) val)[1]; return GDK_SUCCEED;}intCMDhash_dbl(int *res, dbl *val){ *res = ((int *) val)[0] ^ ((int *) val)[1]; return GDK_SUCCEED;}intCMDhash_str(int *res, str val){ *res = strHash(val); return GDK_SUCCEED;}intCMDhash(int *res, ptr val, int tpe){ hash_t code; /* 64-bits on 64 systems; we truncate it here to save space */ switch (ATOMstorage(tpe)) { case TYPE_void: code = int_nil; break; case TYPE_chr: code = *(chr *) val; break; case TYPE_sht: code = *(sht *) val; break; case TYPE_int: case TYPE_flt: code = *(int *) val; break; case TYPE_lng: case TYPE_dbl: code = ((int *) val)[0] ^ ((int *) val)[1]; break; case TYPE_str: code = strHash((char*)val); break; default: code = (*BATatoms[tpe].atomHash) (val); } *res = code; return GDK_SUCCEED;}strMKEYrotate_xor_hash(MalBlkPtr mb, MalStkPtr stk, InstrPtr p){ int *dst = (int*) getArgReference(stk,p,0); int *h = (int*) getArgReference(stk,p,1); int *rotate = (int*) getArgReference(stk,p,2); int tpe = getArgType(mb,p,3); ptr *pval = (ptr) getArgReference(stk,p,3); int lbit = *rotate; int rbit = 32 - *rotate; int mask = (1 << lbit) - 1; if (tpe == TYPE_chr) { chr *cur = (chr*) pval; *dst = GDK_ROTATE(*h, lbit, rbit, mask) ^ *cur; } else if (tpe == TYPE_sht) { sht *cur = (sht*) pval; *dst = GDK_ROTATE(*h, lbit, rbit, mask) ^ *cur; } else if (tpe == TYPE_int || tpe == TYPE_flt) { int *cur = (int*) pval; *dst = GDK_ROTATE(*h, lbit, rbit, mask) ^ *cur; } else if (tpe == TYPE_lng || tpe == TYPE_dbl) { lng *cur = (lng*) pval; int val = (int)( cur[0] ^ cur[1]); *dst = GDK_ROTATE(*h, lbit, rbit, mask) ^ val; } else if (tpe == TYPE_str) { /* TYPE_str */ str cur = *(str*) pval; hash_t val = strHash(cur); *dst = GDK_ROTATE(*h, lbit, rbit, mask) ^ val; } else { hash_t (*hash) (ptr) = BATatoms[tpe].atomHash; *dst = GDK_ROTATE(*h, lbit, rbit, mask) ^ (*hash) (pval); } return MAL_SUCCEED;}intCMDbulk_rotate_xor_hash(BAT **res, BAT *bn, int *rotate, BAT *b){ int *dst = (int *) BUNtloc(bn, BUNfirst(bn)); int tpe = ATOMstorage(b->ttype); int lbit = *rotate; int rbit = 32 - *rotate; int mask = (1 << lbit) - 1; int xx = BUNsize(b); int yy = BUNsize(bn); if (!ALIGNsynced(bn, b)) { GDKerror("CMDbulk_rotate_xor_hash: (%s,%d,%s): not synced on head.\n", BATgetId(bn), *rotate, BATgetId(b)); return GDK_FAIL; } else if (VIEWparent(bn) || bn->batRestricted) { GDKerror("CMDbulk_rotate_xor_hash: (%s,%d,%s): left operand not writeable.\n", BATgetId(bn), *rotate, BATgetId(b)); return GDK_FAIL; } else if (*rotate < 0 || *rotate >= 32) { GDKerror("CMDbulk_rotate_xor_hash: (%s,%d,%s): illegal number of rotate bits.\n", BATgetId(bn), *rotate, BATgetId(b)); return GDK_FAIL; } else if (tpe == TYPE_chr) { chr *cur = (chr *) BUNtloc(b, BUNfirst(b)); chr *end = (chr *) BUNtloc(b, BUNlast(b)); while (cur < end) { *dst = GDK_ROTATE(*dst, lbit, rbit, mask) ^ *cur; cur = (chr *) (((BUN) cur) + xx); dst = (int *) (((BUN) dst) + yy); } } else if (tpe == TYPE_sht) { sht *cur = (sht *) BUNtloc(b, BUNfirst(b)); sht *end = (sht *) BUNtloc(b, BUNlast(b)); while (cur < end) { *dst = GDK_ROTATE(*dst, lbit, rbit, mask) ^ *cur; cur = (sht *) (((BUN) cur) + xx); dst = (int *) (((BUN) dst) + yy); } } else if (tpe == TYPE_int || tpe == TYPE_flt) { int *cur = (int *) BUNtloc(b, BUNfirst(b)); int *end = (int *) BUNtloc(b, BUNlast(b)); while (cur < end) { *dst = GDK_ROTATE(*dst, lbit, rbit, mask) ^ *cur; cur = (int *) (((BUN) cur) + xx); dst = (int *) (((BUN) dst) + yy); } } else if (tpe == TYPE_lng || tpe == TYPE_dbl) { int *cur = (int *) BUNtloc(b, BUNfirst(b)); int *end = (int *) BUNtloc(b, BUNlast(b)); while (cur < end) { int val = cur[0] ^ cur[1]; *dst = GDK_ROTATE(*dst, lbit, rbit, mask) ^ val; cur = (int *) (((BUN) cur) + xx); dst = (int *) (((BUN) dst) + yy); } } else if (tpe == TYPE_str) { /* TYPE_str */ int *cur = (int *) BUNtloc(b, BUNfirst(b)); int *end = (int *) BUNtloc(b, BUNlast(b)); str base = b->theap->base; while (cur < end) { hash_t val; val = strHash(base + *cur); *dst = GDK_ROTATE(*dst, lbit, rbit, mask) ^ val; cur = (int *) (((BUN) cur) + xx); dst = (int *) (((BUN) dst) + yy); } } else if (b->ttype == TYPE_void) { BUN p, q; BATloopFast(b, p, q, xx) { *dst = GDK_ROTATE(*dst, lbit, rbit, mask) ^ *(int *) BUNtail(b, p); dst = (int *) (((BUN) dst) + yy); } } else { hash_t (*hash) (ptr) = BATatoms[b->ttype].atomHash; BUN p, q; BATloopFast(b, p, q, xx) { *dst = GDK_ROTATE(*dst, lbit, rbit, mask) ^ (*hash) (BUNtail(b, p)); dst = (int *) (((BUN) dst) + yy); } } /* we return an already existing parameter BAT, so we must fix it */ *res = bn; bn->tsorted = 0; if (bn->tkey) BATkey(BATmirror(bn), FALSE); BBPfix(bn->batCacheid); return GDK_SUCCEED;}@+ MAL wrapperThe remainder contains the wrapping needed for M5@cstr MKEYhash_chr(int *ret, chr *v){ CMDhash_chr(ret,v); return MAL_SUCCEED;}str MKEYhash_sht(int *ret, sht *v){ CMDhash_sht(ret,v); return MAL_SUCCEED;}str MKEYhash_int(int *ret, int *v){ CMDhash_int(ret,v); return MAL_SUCCEED;}str MKEYhash_flt(int *ret, flt *v){ CMDhash_flt(ret,v); return MAL_SUCCEED;}str MKEYhash_dbl(int *ret, dbl *v){ CMDhash_dbl(ret,v); return MAL_SUCCEED;}str MKEYhash_lng(int *ret, lng *v){ CMDhash_lng(ret,v); return MAL_SUCCEED;}str MKEYhash_str(int *ret, str *v){ CMDhash_str(ret,*v); return MAL_SUCCEED;}strMKEYrotate(int *res, int *val, int *n){ CMDrotate(res,val,n); return MAL_SUCCEED;}strMKEYhash(MalBlkPtr mb, MalStkPtr stk, InstrPtr p){ int *ret; ptr val; ret= (int*) getArgReference(stk,p,0); val= (ptr) getArgReference(stk,p,1); CMDhash(ret, val, getArgType(mb,p,1)); return MAL_SUCCEED;}strMKEYbulk_rotate_xor_hash(int *ret, int *hid, int *nbits, int *bid){ BAT *hn, *b, *bn=0; if ((hn = BATdescriptor(*hid)) == NULL) { throw(MAL, "mkey.bulk_rotate_xor_hash", "Cannot access descriptor"); } if ((b = BATdescriptor(*bid)) == NULL) { BBPreleaseref(hn->batCacheid); throw(MAL, "mkey.bulk_rotate_xor_hash", "Cannot access descriptor"); } if( CMDbulk_rotate_xor_hash(&bn,hn,nbits,b) == GDK_FAIL){ BBPreleaseref(hn->batCacheid); BBPreleaseref(b->batCacheid); throw(MAL, "mkey.bulk_rotate_xor_hash", "command failed"); } BBPreleaseref(hn->batCacheid); BBPreleaseref(b->batCacheid); *ret= bn->batCacheid; BBPkeepref(bn->batCacheid); return MAL_SUCCEED;}@}
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -