radix.mx
@T
If the BATs storing the columns of the "larger" table do not fit the memory cache anymore,
the positional joins in the last Z statements of the MIL script will start to generate cache
misses. This is caused by the fact that the OIDs in the tail of the res\_larger BATs are
not sorted; hence the access to the larger\_bX column BATs is random.
This problem can be solved by first sorting the result of the join on the OIDs that point
to the "larger" table (the head column of res\_join):
\begin{verbatim}
..
03 res_join := join(larger_key, smaller_key.reverse).sort;
04 res_larger_sorted := res_join.mark(0@0).reverse;
05 res_smaller := res_join.reverse.mark(0@0).reverse;
..
 # positional-join projected columns from larger table into result
B1 res_b1 := join(res_larger_sorted, larger_b1);
BX ....
BZ res_bZ := join(res_larger_sorted, larger_bZ);
\end{verbatim}
As a result, the res\_larger\_sorted BAT will be ordered on tail; hence the positional joins on the larger\_bX
columns will cause a nice sequential access to both res\_larger\_sorted (as it is scanned in its role as
"outer" join operand) and larger\_bX (as the lookup values from res\_larger\_sorted are
now sorted). We must, however, take into account that res\_join
may itself be a BAT that is larger than the memory cache, in which case the sorting operation
itself could cause a great many cache misses and therefore perform badly. Let us therefore
briefly discuss the memory access properties of Monet's sorting algorithms.

Monet uses a CPU-optimized quicksort algorithm for sorting large relations. The CPU optimizations
reduce the number of function calls by doing all value comparison and data movement in line,
using C macros.
In this sense it differs from the standard Unix library call qsort(), as that
routine compares values with a user-provided function, and (often) moves values with memcpy().
The memory access pattern of the Monet quicksort consists of one sequential scan per recursion
level (walking two cursors simultaneously, one from the start of the BAT forward and another
from the end of the BAT backward, until both meet in the middle). Quicksort is binary recursive and
therefore takes log2(ntuples) recursion levels to sort a BAT; hence its total memory access consists of log2(ntuples)
sequential scans. However, since quicksort zooms into ever smaller sub-chunks of the BAT, there will
be cache re-use in the deeper recursion levels as soon as such a chunk fits the memory cache, which
happens when sizeof(chunk) = sizeof(BAT)/(2\^level) <= sizeof(cache). Hence, the total memory cost of
quicksort is log2(ntuples)-log2(sizeof(cache)/sizeof(tuple)) sequential scans.
In all, the Monet quicksort implementation behaves quite well, both in CPU efficiency and in
memory access pattern. Still, for some simple data types (in particular columns containing OIDs) one
can further improve the memory access performance by using {\em radix-sort} instead of quicksort.
Radix-sort is essentially a radix-cluster on all bits, hence we do:
\begin{verbatim}
..
03 res_join := join(larger_key, smaller_key.reverse).reverse.radix_cluster(R1,..,Rp).reverse;
..
\end{verbatim}
where p is a suitable number of passes and R=R1+..+Rp is the total number of "significant bits"
(the most significant bit in a collection of integer values is the highest bit set in any of the values).
The head column of the join(larger\_key, smaller\_key.reverse) is of type OID, and contains
the OIDs of the matching tuples in the larger table. Table OIDs are automatically generated by
the VOID columns of Monet, and therefore these integer values come from the range [0,...,N], where N
is the number of tuples in the "larger" table.
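To make this concrete, the following is a minimal Python sketch of radix-sorting a [head, tail-OID] table by clustering on the raw OID bits without any hash function (illustrative names, not Monet code). For simplicity, the passes here run from least to most significant bit group, the classic stable LSD scheme, which produces the same fully sorted result as clustering on all R significant bits:

```python
def radix_sort_oids(pairs, bits_per_pass=8):
    """Sort (head, tail_oid) pairs on tail_oid by repeated stable
    clustering on groups of raw OID bits -- no hash function, since a
    dense OID domain [0..N) has exactly R = ceil(log2(N)) significant
    bits and no value skew."""
    if not pairs:
        return pairs
    significant = max(t for _, t in pairs).bit_length()  # R bits suffice
    mask = (1 << bits_per_pass) - 1
    shift = 0
    while shift < significant:                   # one pass per bit group
        buckets = [[] for _ in range(1 << bits_per_pass)]
        for h, t in pairs:                       # stable counting pass
            buckets[(t >> shift) & mask].append((h, t))
        pairs = [p for bucket in buckets for p in bucket]
        shift += bits_per_pass
    return pairs
```

In the MIL above, `radix_cluster(R1,..,Rp)` applied to the reversed join result plays this role, with R1+..+Rp covering all significant OID bits.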
We call such an integer
sub-domain a "dense" domain. As a result, the number of significant bits is minimal (i.e. R=log2(N);
there are no "spoiled" values), and we do not expect skew in such a column. This motivates our choice
to implement radix-cluster for the OID type by taking the radix bits {\em without} hashing (for all other
types, we hash first). Hashing is not necessary due to the absence of value skew on OID columns, and the
absence of hashing allows us to use radix-cluster as a radix-sort.

@- Partitioned Hash Join
@T
We now discuss the case that even the smaller\_key BAT with its hash structure does not fit the
smallest cache. What happens then in the join phase? Since the hash-join algorithm exhibits a random
access pattern, compulsory cache misses will start to appear, up to the point that each access to
the hash table is a miss. Even with the direct hashing employed by Monet, this amounts to at least
3 misses per tuple in the "larger" table (one in the hash-mask, at least one in the BAT, and at least
one in the hash-link list to conclude that this was the last match). To these memory costs, we should
also add the cost of constructing the hash table, which amounts to (at least) one miss per tuple in
the "smaller" table. As a cache miss can be as expensive as 100 CPU cycles, and the pure CPU cost of
direct hashing in Monet can be as low as 20 cycles per tuple, such a high cache miss ratio
tremendously slows down the join (e.g. by a factor 10).
In the situation mentioned above, performance can be improved by using partitioned hash-join,
as presented earlier in this module, instead of simple hash-join. The partitioned hash-join uses
radix-cluster to quickly cluster both the smaller\_key and larger\_key BATs into clusters that fit
the memory cache, and then repeatedly performs hash-join on the corresponding clusters.
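The cluster-then-join idea can be sketched in Python (a simplified model under stated assumptions: we cluster directly on the low H bits of the key instead of hashing first, and each per-cluster hash table is simply small rather than measured against a real cache):

```python
def partitioned_hash_join(larger, smaller, h_bits):
    """Join two [oid, key] relations: cluster both on the lowest h_bits
    of the key, then hash-join only the matching cluster pairs, so each
    hash table stays small (cache resident in the real algorithm)."""
    n = 1 << h_bits
    larger_clusters = [[] for _ in range(n)]
    smaller_clusters = [[] for _ in range(n)]
    for oid, key in larger:                      # radix-cluster pass
        larger_clusters[key & (n - 1)].append((oid, key))
    for oid, key in smaller:
        smaller_clusters[key & (n - 1)].append((oid, key))
    result = []                                  # (larger_oid, smaller_oid)
    for i in range(n):
        table = {}
        for oid, key in smaller_clusters[i]:     # build on smaller cluster
            table.setdefault(key, []).append(oid)
        for oid, key in larger_clusters[i]:      # probe with larger cluster
            for s_oid in table.get(key, ()):
                result.append((oid, s_oid))
    return result
```

With `h_bits=0` this degenerates to a single plain hash-join over the whole relations; larger `h_bits` trades clustering work for smaller, cache-friendly hash tables.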
In this way,
the random access is restricted to areas that fit the memory cache, hence the expensive cache misses
(mostly) disappear.
This is realized in Monet by radix-clustering both relations on H bits, e.g. larger\_key
in l passes and smaller\_key in s passes, such that H = L1+..+Ll = S1+..+Ss:
\begin{verbatim}
00 # first radix cluster both key columns on H bits (maybe different number of passes and bit settings)
01 cluster_larger := radix_cluster(larger_key, L1,..,Ll);
02 cluster_smaller := radix_cluster(smaller_key, S1,..,Ss);
 # partitioned hash join on clusters of H radix-bits.
03 res_join := phash_join(cluster_larger, cluster_smaller.reverse, H);
..
\end{verbatim}
Line 03 above uses phash\_join, but could alternatively use radix-join:
\begin{verbatim}
..
03 res_join := radix_join(cluster_larger, cluster_smaller.reverse, H);
..
\end{verbatim}
From this point on, the same code as in the previous MIL script can be applied to fetch the column
values for columns a1..aY from "smaller" and b1..bZ from "larger". The remaining problem here is that both
the larger\_bX *and* the smaller\_aX BATs will tend to be bigger than the smallest memory cache (though this
also depends on the join hit ratio; let us suppose it is >= 1).
To improve this, we can sort on the OIDs from the "larger" table, as described in the previous section.
This will enhance the access pattern of the subsequent positional joins to the larger\_bX column BATs:
\begin{verbatim}
.. # partitioned hash join on clusters of H radix-bits, followed by radix-sort on head column.
03 res_join := phash_join(cluster_larger, cluster_smaller.reverse, H).reverse.radix_cluster(R1,..,Rp).reverse;
04 res_larger_sorted := res_join.mark(0@0).reverse;
05 res_smaller := res_join.reverse.mark(0@0).reverse;
..
\end{verbatim}
Now, as the smaller\_aX column BATs are probably also larger than the memory cache, we would like to do
the same for the "smaller" table. But we cannot sort res\_join twice.
We could sort res\_smaller on tail
after line 05:
\begin{verbatim}
..
05 res_smaller := res_join.reverse.mark(0@0).reverse.radix_cluster(R1,..,Rp);
..
\end{verbatim}
However, this approach only transfers the problem to later phases of query processing.
The positional joins would run fine, but as a tail-sorted res\_smaller would be a BAT[oid,oid] (i.e.
it would no longer have a VOID head column), the results of the positional joins with the smaller\_aX
BAT[void,T]s would be of the form BAT[oid,T]. These results would not only take more space than the
desired form BAT[void,T], but would also create a problem in further use of the query result, as these
res\_aX BATs would not be sorted on head. Join access to them would use hash-join rather than the
positional join, and due to the random access this would pose a memory caching problem, as these
res\_aX BATs tend to be larger than the memory cache.
Therefore, for these projection joins, we propose the use of a new memory-conscious join algorithm
called {\em clustered positional join}, which implements a join(BAT[void,oid] L, BAT[void,T] R) : BAT[void,T].
This algorithm consists of three phases, of which we already know the first two:
\begin{description}
\item[partial radix-cluster]
First, res\_smaller is {\em partially} radix-clustered on tail OID. That is, the relation L (= res\_smaller in
the positional joins to the column BATs smaller\_aX) is clustered on some number of highest significant
radix bits, but not on all radix bits. Because the radix-cluster on OIDs does not use a hash function,
clustering an OID column on all significant bits radix-sorts it, while clustering (as in this case) on a
subset of the highest significant bits partially orders it.
The partial ordering of OIDs into chunks is done in such a way that the
size of the corresponding chunk in R (remember that R is a BAT[void,T] and has all OIDs in a densely ascending sequence) fits
the memory cache.
\item[positional join]
The purpose of the radix-clustering of L in the previous phase is to accelerate the positional join between L and R
(i.e. res\_smaller and smaller\_aX). Because the OIDs in the tail of L are now partially sorted, each chunk in L
will only randomly access data from one chunk in R. Therefore, during the positional join, these chunks in R stay
memory resident, accelerating performance with respect to doing the same with a non-clustered L (where each access to
R would be a cache miss).
\item[radix decluster]
The result of the positional join is a BAT[oid,T]. Still, we know that the head column, when sorted, would
form a void column. What is now the fastest way to sort it and convert it back to a void BAT? One special property
of the radix-cluster algorithm is that when we cluster on the tail column, each result chunk has its head values
in order. In this case, the clustered version of L (res\_smaller\_clustered, see below) has the head OIDs in order {\em within each
chunk}. This sub-ordering is also carried over by the positional join to its result: in each
'virtual chunk' of the BAT[oid,T], the OIDs appear in order. Therefore, we can perform a merge operation to merge
all BAT[oid,T] chunks into a BAT[void,T] result. Normally, the cost of a merge is at least O(log(P)*N), where N
is the total number of tuples and P is the number of chunks. By using the special property that eventually
the merged OIDs form a densely ascending sequence (0@0, 1@0,..,N@0), we can bring this cost back to O(N)! This {\em radix decluster}
algorithm keeps a window of OIDs [windowStart, windowStart+1, ..., windowStart+windowSize-1] open during the
merge.
Each iteration of the algorithm finds the next windowSize OIDs in the chunks and inserts them
in the result BAT[void,T]. This is done by going through all (not yet empty) chunks and inserting from the top of
each chunk all elements whose OID fits into the window. Since all chunks are sorted, we are
sure to have found all OIDs after having processed all chunks. Then the window is shifted windowSize positions
and the process repeats. The windowSize is typically a multiple of the number of chunks, such that per iteration
multiple tuples in each chunk fall into the window. Because these tuples are accessed sequentially
within the chunk, the chunk cache lines will be re-used and performance will be good. The only restriction on the windowSize
is that the insertion window on the output BAT must fit the memory cache. This limit will only be exceeded for very
large table sizes (a possible remedy is then to perform the merge in multiple passes).
\end{description}
In all, in Monet this join strategy is expressed in the following MIL:
\begin{verbatim}
.. # subcluster on Rs significant radix-bits, ignoring lowest Ri bits
05 res_smaller_clustered := res_join.reverse.mark(0@0).reverse.radix_cluster(-Ri, Rs);
 # positional-join and decluster projected columns from smaller table into result
A0 borders_smaller := res_smaller_clustered.radix_count(Ri, Rs);
A1 res_a1 := join(res_smaller_clustered, smaller_a1).radix_decluster(borders_smaller);
AX ....
AY res_aY := join(res_smaller_clustered, smaller_aY).radix_decluster(borders_smaller);
..
\end{verbatim}
Possibly, one could consider using this approach for the projections on the larger table as well
(instead of the sort). However, sorting once (with radix-sort, by radix-clustering on all significant
bits) is probably faster than a one-time partial radix-cluster followed by multiple positional decluster
operations (one for each projected column).
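The window merge of the radix decluster phase can be sketched in Python as follows (illustrative, not Monet's implementation: chunks are lists of (oid, value) pairs, each sorted on oid, with the oids overall forming the dense sequence 0..N-1; the dense-OID property lets every value be placed directly at its final position):

```python
def radix_decluster(chunks, window_size):
    """Merge sorted (oid, value) chunks into one value list in oid
    order. Each round scans the chunk tops for the next window_size
    oids only, so total cost is O(N) instead of O(N log P)."""
    n = sum(len(chunk) for chunk in chunks)
    out = [None] * n
    tops = [0] * len(chunks)                     # read cursor per chunk
    window_start = 0
    while window_start < n:
        window_end = window_start + window_size
        for ci, chunk in enumerate(chunks):
            i = tops[ci]
            # consume this chunk's prefix that falls into the window
            while i < len(chunk) and chunk[i][0] < window_end:
                oid, value = chunk[i]
                out[oid] = value                 # dense oid: direct placement
                i += 1
            tops[ci] = i
        window_start = window_end                # shift the window
    return out
```

Keeping window_size a multiple of the number of chunks means each chunk contributes several sequentially read tuples per round, which is what makes the access pattern cache friendly.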
Therefore, it is best to do one of the two sets of projections
with the sort/join strategy and the other with the clustered positional join strategy.
The decision which table to sort on (the smaller or the larger?) could also be made as a function of the
number of projection attributes needed from the "smaller" and "larger" tables. That is, one could choose to do
the clustered positional join on the table with the fewest projections, and sort/join on the other.
The full MIL script, when the individual BATs of both the "smaller" and "larger" tables as well as
the (N-M) join result exceed the memory cache, becomes:
\begin{verbatim}
 # first radix cluster both key columns on H bits (maybe different number of passes and bit settings)
01 cluster_larger := radix_cluster(larger_key, L1,..,Ll);
02 cluster_smaller := radix_cluster(smaller_key, S1,..,Ss);
 # partitioned hash join on clusters of H radix-bits, followed by radix-sort on head column.
03 res_join := phash_join(cluster_larger, cluster_smaller.reverse, H).reverse.radix_cluster(R1,..,Rp).reverse;
04 res_larger_sorted := res_join.mark(0@0).reverse;
 # subcluster on Rs significant radix-bits, ignoring lowest Ri bits
05 res_smaller_clustered := res_join.reverse.mark(0@0).reverse.radix_cluster(-Ri, Rs);
 # positional-join and decluster projected columns from smaller table into result
A0 borders_smaller := res_smaller_clustered.radix_count(Ri, Rs);
A1 res_a1 := join(res_smaller_clustered, smaller_a1).radix_decluster(borders_smaller);
AX ....
AY res_aY := join(res_smaller_clustered, smaller_aY).radix_decluster(borders_smaller);
 # positional-join projected columns from larger table into result
B1 res_b1 := join(res_larger_sorted, larger_b1);
BX ....
BZ res_bZ := join(res_larger_sorted, larger_bZ);
\end{verbatim}

@+ The Relational Approach
@T
A cache-conscious join in a relational DBMS would first radix-cluster both the smaller and larger table,
projecting in the process on just the selected columns.
As the relational model does not
separate its algebraic actions by column, as Monet does in MIL, it cannot use the technique of type expansion
and have primitives that are optimized for a specific type. In other words, the join operator in a relational
DBMS must be able to handle tables with a variable number of columns and variable constellations of column types.
This is implemented in the kernel of a relational DBMS either by using ADT interfaces with functions dereferenced
from a type table, or through late-binding style C++ overloaded methods that handle data values of variable types.
That means that, e.g. during the radix-cluster, for each projected column at least one function call is executed
to move a data value from the source record to the destination record.
To model this extra function-calling overhead, we use the integerX Monet types, which were created specifically
for these experiments. An integerX value models a relational tuple that stores X simple integer column values.
Through some tricks, copying one integer8 value in Monet is actually done by copying 8 integers with memcpy(),
which closely mimics what happens in a typical relational DBMS.
To give a concrete example, we suppose "smaller" and "larger" each have 128 columns, and
the projection widths are Y=Z=8; we emulate relational storage of both tables in Monet by storing
them as BAT[integer128,integer], where the tail columns contain the "key" value and the head contains
all other columns.
The entire MIL sequence then consists of the following statements:
\begin{verbatim}
01 smaller_all := [integer]([integer](smaller_key,1).reverse, 128).reverse;
02 larger_all := [integer]([integer](larger_key,1).reverse, 128).reverse;
03 smaller_view := [integer](smaller_all.reverse, 8).reverse;
04 larger_view := [integer](larger_all.reverse, 8).reverse;
05 cluster_smaller := radix_cluster(smaller_view, S1,..,Ss);
06 cluster_larger := radix_cluster(larger_view, L1,..,Ll);
07 res_join := phash_join(cluster_larger,
                          cluster_smaller.reverse, H);
\end{verbatim}
Notice that the fact that smaller\_view and larger\_view are MIL views on the base BATs
smaller\_all and larger\_all means that the projection is never materialized. Projecting
is done on the fly during the first pass of the radix-cluster, just like what would happen in
a relational system. What is more, the copying of each integer8 (view) value from its
storage as integer128 is done with 8 memcpy() calls that fetch values at regular intervals
(i.e. at positions 0, 16, 32, 48, 64, 80, 96 and 112). In practice, this means that
each memcpy() causes a memory cache miss.
We expect the relational strategy to have a different performance characteristic than
Monet. First, there will be many more cache misses, for the reasons described above.
Even radix-cluster cannot avoid those cache misses, as they are inherent to the relational
storage format of the base data. Second, there will be much more CPU cost, due to the fact that
the function-call overhead cannot be eliminated. In Monet algorithms, CPU optimizations cause
a 5-fold performance improvement.
Therefore, it might even be that the increased cost of radix-cluster makes simple
hash-join faster than partitioned hash-join:
\begin{verbatim}
05 res_join := join(larger_view, smaller_view.reverse);
\end{verbatim}
Either way, we expect relational performance to be a factor 10 slower than Monet
for big sizes of both the "smaller" and "larger" tables, with hit rates such that the result table is big.

@* Module Definition
@mal
module radix;

pattern radix_cluster( b:bat[:any_1,:any_2], limits:str, perc:flt, radix1:int ... ):bat[:any_1,:any_2]
address M5_RDX_radix_cluster
comment "do N radix-cluster steps creating (radix1 * radix2 * ... * radixN) clusters. The first pass uses the last radix parameter, and so on backwards. A partial radix cluster (i.e. skipping lower significant bits) can be indicated by passing a negative number of bits as first parameter.
If you pass an appendable, empty limits bat, a radix_count2 result with the resulting cluster boundaries is returned in it. Note that you pass the bat name of b, as returned by b.bbpname(). If you pass a non-empty limits bat, it is used for resuming the clustering: the operation assumes that all significant highermost bits are already clustered, and that limits contains the cluster sizes. If you pass a non-empty but writable limits bat, it will be used as above for resuming, and will also be overwritten with the new cluster boundaries";

pattern radix_cluster( b:bat[:any_1,:any_2], perc:flt, radix1:int ... ):bat[:any_1,:any_2]
address M5_RDX_radix_cluster_l