@' The contents of this file are subject to the MonetDB Public License
@' Version 1.1 (the "License"); you may not use this file except in
@' compliance with the License. You may obtain a copy of the License at
@' http://monetdb.cwi.nl/Legal/MonetDBLicense-1.1.html
@'
@' Software distributed under the License is distributed on an "AS IS"
@' basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the
@' License for the specific language governing rights and limitations
@' under the License.
@'
@' The Original Code is the MonetDB Database System.
@'
@' The Initial Developer of the Original Code is CWI.
@' Portions created by CWI are Copyright (C) 1997-2007 CWI.
@' All Rights Reserved.
@f radix
@a Peter Boncz
@v 1.0
@t Radix Algorithms
@* Introduction
This module introduces algorithms that enhance the performance of generic
join processing in Monet, by optimizing the use of both memory and CPU
resources of modern hardware. We now shortly discuss the problems at hand,
and in the next section go into detail on how these algorithms are applied
in Monet's join processing strategy.
@+ The Memory Story
@T
Computer RAM carries the acronym Random Access Memory, indicating that
memory access speed is independent of memory location. While this is still
(mostly) true, the imbalance in speed improvements between CPUs and the
most common memory type, DRAM, has led to the situation that reading a byte
of memory takes around 100 CPU cycles (e.g. given a 1GHz processor and
100ns memory latency). Load instructions are frequent in most programs
(sometimes one in four instructions references memory), so in order not to
let the CPU stall for 100 cycles too often, the memory subsystem of a
computer no longer consists solely of DRAM chips, but also has various
levels of "cache memory" built from faster SRAM chips. A typical modern
computer has at least an L1 (level-one) cache (typical size 16-32KB,
typical latency 5-10ns) and an L2 (level-two) cache (typical size
256KB-2MB, typical latency 10-30ns). The L1, and sometimes even the L2, is
now located on the CPU chip itself in order to reduce latency.

Cache memories are organized in {\em cache lines} of a fixed width. A
typical width for an L1 line is 32 bytes, whereas an L2 line can be 32-128
bytes long. A cache line is the smallest unit of transfer, which means that
on a miss, the memory system fetches all bytes of the line from the lower
levels of the memory hierarchy in one go. This simultaneous transfer often
happens through a wide bus, where each line in the bus gets one bit from a
DRAM chip in parallel. In reality, the transfer is not fully simultaneous,
as the different bytes of a cache line arrive in a sequential burst, but
the total difference in arrival time tends to be no more than 4 cycles from
first to last byte.

This design of the memory caches (on all levels, but with varying size and
line-width parameters) has consequences for application performance, whose
severity depends on what kind of {\em memory access patterns} the
application exhibits. Reading (or writing) a memory location that is not in
the cache causes a miss, and the CPU is forced to wait for the latency
period. Obviously, subsequent reads of this exact same data will not cause
further cache misses and will be fast, because the data is already in the
cache. But what happens with a sequential access pattern to data that is
not in the cache? Again, the first read causes a cache miss. However,
subsequent reads to adjacent bytes access the same cache line that is
already loaded and therefore do {\em not} cause any cache misses. This is
the reason why sequential memory access is cheaper than random access, as
the latter pattern may cause a cache miss on every read (assuming the
access is to an uncached memory region).
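To make this difference concrete, consider the following stand-alone C
sketch (illustrative only, not part of the Monet code; the array size and
the stride constant are arbitrary choices). It sums the same large array
once sequentially and once with a large stride, so that the second
traversal touches a different cache line on almost every access:
\begin{verbatim}
/* Illustrative micro-benchmark, not Monet code: the same array is summed
 * once sequentially (one cache miss per 32-byte line, i.e. per 8 ints)
 * and once with a large stride (a cache miss on nearly every access). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)    /* 64MB of ints: far larger than any cache */

int main(void)
{
    int *a = malloc(N * sizeof(int));
    long sum = 0;
    size_t i, j;
    clock_t t0, t1, t2;

    for (i = 0; i < N; i++)
        a[i] = (int) i;
    t0 = clock();
    for (i = 0; i < N; i++)            /* sequential access pattern */
        sum += a[i];
    t1 = clock();
    for (i = 0, j = 0; i < N; i++) {
        j = (j + 514229) & (N - 1);    /* large odd stride: visits every
                                        * element, but in scattered order */
        sum += a[j];                   /* scattered access pattern */
    }
    t2 = clock();
    printf("sequential %.2fs, scattered %.2fs (sum=%ld)\n",
           (double) (t1 - t0) / CLOCKS_PER_SEC,
           (double) (t2 - t1) / CLOCKS_PER_SEC, sum);
    free(a);
    return 0;
}
\end{verbatim}
On a machine where the array does not fit the caches, the second loop is
typically several times slower, even though it executes exactly the same
number of additions.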
In all, the costs of memory access can be high (up to 100 cycles, and
rising), may occur frequently (up to one in four CPU instructions), and
depend strongly on the access pattern of the application (is the access
repetitive, sequential or random, and to how large a total region?). This
is not just a theoretical exercise for DBMS designers: measurements of the
memory performance of data-intensive operations like query processing in
relational DBMS products have indeed shown that CPUs are stalled for most
of their time on memory cache misses.
@- Optimizing Memory Access during Join
@T
We focus here on improving the memory access performance of the join
operator in order to gain performance. This is relevant because the most
popular main-memory join algorithm is hash-join, which exhibits a random
access pattern to a hash-table that is used to find values in an "inner"
relation, while an "outer" relation is scanned sequentially. Concerning
memory access, this algorithm works fine only as long as the inner relation
plus its hash-table can be cached at all levels of the memory hierarchy. As
soon as it no longer fits the smallest cache, (multiple) cache misses will
occur for each random access to this inner relation and hash-table (at
least one miss to the hash-table and one to the relation itself). The CPU
stalls caused by these misses can soon start to dominate overall join
costs, which in Monet -- without cache misses -- can be as low as 20 cycles
per tuple (hence two 100-cycle misses per tuple seriously damage
performance).

The idea of the radix-algorithms implemented in this module is to change
the access pattern of the join operation in such a way that the overall
number of cache misses is reduced. This is typically achieved by doing
extra CPU work and/or by making additional sequential passes over the
memory. In particular, the {\em partitioned-join} strategy first partitions
(or clusters) both inner and outer relations into small clusters, such that
each cluster fits the smallest memory cache. The partitioned-join then only
has to combine tuples from corresponding clusters. This module provides two
partitioned-join algorithms:
\begin{itemize}
\item the {\em phash join} algorithm, which performs hash-join on the
matching clusters (sketched below).
\item as an alternative, we provide {\em radix join}, which performs
nested-loop join on the matching clusters.
\end{itemize}
Phash join needs clusters such that a cluster of the inner relation plus
its hash-table fits the smallest cache. If we assume that tuples in the
inner relation (plus hash-table) occupy 16 bytes and the L1 cache is 16KB,
we need clusters of (less than) 1000 tuples. Radix join needs even smaller
clusters, of just 8 tuples, to perform well.
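As a concrete illustration of the phash idea, here is a minimal sketch of a
partitioned hash-join, assuming both relations have already been clustered
on the same radix-bits of a hash of the join key. The types, the fixed
bucket count and the emit callback are illustrative assumptions, not
Monet's actual data structures:
\begin{verbatim}
/* Illustrative sketch of phash join, not Monet's implementation: both
 * inputs are assumed to be pre-clustered on the same radix-bits, so only
 * clusters with equal index can contain matches, and each inner cluster
 * plus its hash-table is small enough to stay resident in the L1 cache. */
#include <assert.h>

typedef struct { int key; int payload; } tuple_t;
typedef struct { tuple_t *tup; int n; } cluster_t;

#define NBUCK   1024                            /* hash-table fits in L1 */
#define NIL     (-1)
#define HASH(k) ((unsigned) (k) * 2654435761u)  /* multiplicative hash */

static void
phash_join(cluster_t *outer, cluster_t *inner, int H,
           void (*emit)(tuple_t *o, tuple_t *i))
{
    int bucket[NBUCK], next[NBUCK];
    int c, i, j;

    for (c = 0; c < H; c++) {       /* combine matching clusters only */
        assert(inner[c].n <= NBUCK);
        for (i = 0; i < NBUCK; i++)
            bucket[i] = NIL;
        for (i = 0; i < inner[c].n; i++) {      /* build on inner cluster */
            unsigned h = HASH(inner[c].tup[i].key) & (NBUCK - 1);
            next[i] = bucket[h];
            bucket[h] = i;
        }
        for (j = 0; j < outer[c].n; j++) {      /* probe with outer cluster */
            unsigned h = HASH(outer[c].tup[j].key) & (NBUCK - 1);
            for (i = bucket[h]; i != NIL; i = next[i])
                if (inner[c].tup[i].key == outer[c].tup[j].key)
                    emit(&outer[c].tup[j], &inner[c].tup[i]);
        }
    }
}
\end{verbatim}
Because the random probes now hit a hash-table of at most a few kilobytes,
they no longer cause cache misses; the price paid is the extra clustering
work discussed next.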
Partitioning/clustering into such small cluster sizes can become a memory
access problem in itself, due to the high number of clusters required for a
large relation. A straightforward clustering algorithm that creates H
clusters would allocate H output buffers and scan the input relation once,
inserting each tuple into the cluster where it belongs. Hence there are H
different "output cursors" that should reside in the cache for this to
perform well. But a 16KB L1 cache with 32-byte lines holds only 512 lines,
so when H exceeds that number the output cursors can no longer be cached
and the cluster operation itself will start to generate a huge number of
cache misses. This reduces the efficiency of partitioned join.

The {\em radix-cluster} algorithm proposed here solves this problem by
making multiple passes; in each pass the relation is subclustered on a
number of radix-bits. Radix-bits are a subsequence of bits taken from a
{\em radix-number}, which usually is the integer result of hashing the
value on which we want to cluster. The first pass of the radix-cluster
algorithm puts all tuples with an equal bit-pattern in the highest H1
radix-bits together in a cluster. The second pass starts where the previous
one left off, and subdivides each cluster on the next-highest H2 bits. This
process repeats for p passes such that H1*..*Hp=H. By keeping each Hi lower
than the total number of cache lines, high cache miss ratios can be
avoided.

On platforms that implement "software TLB" management (TLB = Translation
Lookaside Buffer; the "cache" of the most recent translations of virtual
memory addresses from logical to physical form), TLB misses are also
expensive and heavily influence performance. As the number of TLB entries
is typically limited to 64, on such platforms the TLB imposes an even
stricter upper bound on Hi than the number of cache lines does. This makes
radix-cluster even more beneficial on those platforms.
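To make the multi-pass scheme concrete, the following sketch shows a single
radix-cluster pass over an array of (hash, oid) pairs; a full radix-cluster
applies it repeatedly, first to the whole input on the highest bits, then
to each resulting cluster on the next group of bits. The types and the
two-scan (histogram plus scatter) organization are illustrative choices,
not a description of Monet's actual code:
\begin{verbatim}
/* Illustrative sketch of one radix-cluster pass, not Monet's code:
 * reorder the tuples of src[0..n-1] into dst on 'bits' bits of their
 * radix-number (a hash of the clustering value), starting at bit
 * position 'shift'.  A pass produces 2^bits clusters and therefore
 * keeps only that many output cursors active, which can be kept below
 * the number of cache lines (and TLB entries). */
#include <stdlib.h>

typedef struct { unsigned hash; unsigned oid; } rtuple_t;

static void
radix_cluster_pass(rtuple_t *dst, const rtuple_t *src, int n,
                   int shift, int bits)
{
    int H = 1 << bits;               /* clusters produced in this pass */
    int *count = calloc(H, sizeof(int));
    int *pos = malloc(H * sizeof(int));
    int i, c;

    for (i = 0; i < n; i++)          /* scan 1: histogram of cluster sizes */
        count[(src[i].hash >> shift) & (H - 1)]++;
    for (c = 0, i = 0; c < H; c++) { /* prefix sum gives cluster offsets */
        pos[c] = i;
        i += count[c];
    }
    for (i = 0; i < n; i++)          /* scan 2: scatter tuples into dst */
        dst[pos[(src[i].hash >> shift) & (H - 1)]++] = src[i];
    free(count);
    free(pos);
}

/* e.g. clustering on 12 bits in two passes of 6 bits each (H1*H2 = 64*64):
 *   pass 1: radix_cluster_pass(tmp, src, n, 6, 6) over the whole input;
 *   pass 2: radix_cluster_pass() on bits 5..0 of each of the 64 clusters. */
\end{verbatim}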
@+ The CPU Story
@T
Modern CPUs are called {\em super-scalar}, by which is meant that the CPU
has two mechanisms for parallel processing:
\begin{enumerate}
\item CPU instruction execution is chopped into as many as 10-25 different
stages, which can be executed one after the other by different pieces of
hardware. These pieces of hardware form a pipeline, so each cycle a new
instruction can enter the pipeline while at the other end one leaves (or
"graduates"). The more stages the pipeline has, the fewer tasks have to be
performed per stage, hence the quicker the hardware can execute a stage,
and hence the higher the overall clock speed of the CPU can be. The search
for ever higher CPU clock speeds thus explains the trend towards ever
longer pipelines in modern CPUs.
\item multiple independent pipelines may be implemented, meaning that two
CPU instructions that are independent can each cycle be pushed into two
different hardware pipelines for execution. Modern CPUs have at least 2 and
possibly up to 9 replicated pipelines (often separated into integer and
floating-point pipelines). This second trend is driven by ever smaller
process technology, which gives CPU designers the possibility to put ever
more circuits on a single CPU. As a consequence, these circuits are used to
create replicated execution units organized in pipelines, whose parallel
activity is supposed to increase the performance of the CPU.
\end{enumerate}
All this complexity comes at a price, though, which is performance
vulnerability. Application code must at all times have three totally
independent instructions ready for execution to keep three replicated
pipelines busy. This is probably only true for specific scientific
computation code; other applications will leave the replicated pipelines
mostly unused. Additionally, pipelined execution itself poses the challenge
that before the previous instruction has finished executing, the CPU has to
guess correctly what the next instruction will be. That is, when one
instruction enters the pipeline, at the next CPU cycle it is only past
stage one of the typically 10-20 stages that have to pass for it to be
fully executed, yet we already have to push a next instruction into the
pipeline. This turns nasty on if-then-else code like:
\begin{verbatim}
if (A)
then B
else C
\end{verbatim}
The basic problem is that just after entering "if A" into the pipeline at
stage 1, the CPU does not yet know whether this instruction will evaluate
to true or false, hence it does not know whether the next instruction will
be B or C. Modern CPUs resort in this situation to {\em speculative
execution}, e.g. by putting B in the pipeline just because taking the
then-branch is the default (a poor estimator), or because in a high
percentage of the previous executions of this piece of code "if A" turned
out to evaluate to true (a better estimator). Clearly, the CPU will guess
wrong in a certain percentage of cases (called the {\em mis-prediction
rate}). Mis-prediction has performance consequences, as the real outcome of
"if A" only comes to light when the instruction is already deep into the
pipeline, and many instructions have already been inserted after it. That
work has to be thrown away. Suppose C would have been the correct next
instruction instead of B: then the whole pipeline up to the stage that
"if A" has reached by then needs to be flushed. Corrective action also
needs to be taken in order to e.g. undo all effects of executing B and all
other flushed instructions (e.g. by restoring CPU flags and registers), and
we need to start over with C in stage 1. Notice that a mis-prediction rate
as low as 5\% on a 20-stage pipeline will typically cause 50\% of the
pipeline to be thrown away, which already decreases performance below the
level where a 20-stage pipeline at that speed is actually useful (i.e. some
code would do better at a lower clock speed with a shorter pipeline).

The mis-prediction rate is obviously dependent on the type of code being
executed. Code that contains many if-statements typically suffers a high
mis-prediction rate (as explained above, correctly predicting 95\% of the
if-branches can still give awful performance), whereas scientific code that
does millions of independent scalar operations in a matrix multiplication
is highly predictable and will suffer almost none. In addition, such
scientific code contains sufficient independent instructions to keep a
whole array of independent pipelines busy (assuming, for a minute, that we
solved our first problem, memory access).
@- The CPU optimization problem
@T
Many independent studies show that CPU resource usage during most DBMS
loads is awful, plagued by low prediction rates (and a high number of cache
misses). This indicates that typical DBMS software is full of if-statements
and branches, much different from the scientific code used for matrix
processing. We think that this is not necessary. DBMS tasks typically
process millions of independent tuples that could well profit from the
parallel capabilities of modern CPUs, just like scientific matrix code
does. The question is: what needs to be done in DBMS code to make it
CPU-wise efficient? In Monet, we apply two techniques:
\begin{itemize}
\item macro-driven (explicit) loop unrolling. This is often dubbed
code-expansion. Loop unrolling is a well-known technique to improve the
pipelined performance of code that processes a bulk data structure.
Regrettably, compilers can only detect the opportunity for loop unrolling
when the bounds of the bulk structure (array) are known. The sizes of the
arrays that store database tables are not known at compile time, hence the
compiler needs to be helped a bit.
\item factoring-out function calls. Function calls are an important source
of dependence among subsequent instructions. In a language like C, a
function call may modify any reachable memory location, hence the compiler
must generate code to reload many values that are cached in registers. On
top of that, executing a function call carries substantial stack-management
overhead (e.g. 20 cycles) and decreases the prediction rate of the CPU.
Both techniques are illustrated in the sketch following this list.
\end{itemize}
We provide our radix-algorithms in versions that can be experimented with
both with and without these optimization techniques enabled, in order to
monitor their effectiveness.
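The following sketch compresses both techniques into one hypothetical
example: a selection loop in which the per-tuple comparison is supplied as
a macro argument rather than as a function pointer, and in which the loop
body is unrolled by hand. It is only meant to illustrate the style of code
expansion; the names (IMPLEMENT_SELECT, select_naive, LT) are made up and
this is not the actual macro machinery used in the Monet source:
\begin{verbatim}
/* Illustrative code-expansion sketch, not Monet's actual macros.  The
 * naive version calls an opaque comparison function for every tuple;
 * the expanded version inlines the comparison via the preprocessor and
 * unrolls the loop by four, giving the CPU independent instructions to
 * keep its replicated pipelines busy. */

/* naive: one function call per tuple */
static int
select_naive(const int *val, int *out, int n, int bound,
             int (*cmp)(int, int))
{
    int i, res = 0;
    for (i = 0; i < n; i++)
        if (cmp(val[i], bound))
            out[res++] = i;
    return res;
}

/* expanded: comparison passed as a macro, loop unrolled by hand (the
 * array bound is only known at run time, so the compiler cannot do it) */
#define IMPLEMENT_SELECT(NAME, CMP)                                \
static int                                                         \
NAME(const int *val, int *out, int n, int bound)                   \
{                                                                  \
    int i = 0, res = 0;                                            \
    for (; i + 4 <= n; i += 4) {                                   \
        if (CMP(val[i],     bound)) out[res++] = i;                \
        if (CMP(val[i + 1], bound)) out[res++] = i + 1;            \
        if (CMP(val[i + 2], bound)) out[res++] = i + 2;            \
        if (CMP(val[i + 3], bound)) out[res++] = i + 3;            \
    }                                                              \
    for (; i < n; i++)                                             \
        if (CMP(val[i], bound)) out[res++] = i;                    \
    return res;                                                    \
}

#define LT(x, y) ((x) < (y))
IMPLEMENT_SELECT(select_lt, LT)    /* expands into a specialized loop */
\end{verbatim}
Passing LT as a macro rather than as a function pointer lets the compiler
see the comparison, keep values in registers across iterations, and
schedule the four unrolled iterations independently.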
@* Join Processing Optimized for Memory/CPU cost
@T
We now address the issue of optimizing generic join processing for optimal
usage of CPU resources and memory hardware, on super-scalar CPUs featuring
long pipelines and out-of-order speculative execution, and memory
subsystems that consist of deep hierarchies with various levels of memory
cache. We specifically want to compare the effectiveness of the
'Monet approach' with a standard 'relational approach'. We consider the
generic join query:
\begin{verbatim}
SELECT smaller.a1, .., smaller.aY, larger.b1, .., larger.bZ
FROM   larger, smaller
WHERE  larger.key = smaller.key
\end{verbatim}
Without loss of generality we assume that the "larger" table has the same
number of tuples as, or more tuples than, the "smaller" table.
@+ The Monet Approach
@T
In the standard approach this query would be executed in Monet with the
following MIL statements:
\begin{verbatim}
01
02 # join is either positional-, merge- or hash-join.
03 res_join := join(larger_key, smaller_key.reverse);
04 res_larger := res_join.mark(0@0).reverse;
05 res_smaller := res_join.reverse.mark(0@0).reverse;
   # positional-join projected columns from smaller table into result
A1 res_a1 := join(res_smaller, smaller_a1);
AX ....
AY res_aY := join(res_smaller, smaller_aY);
   # positional-join projected columns from larger table into result
B1 res_b1 := join(res_larger, larger_b1);
BX ....
BZ res_bZ := join(res_larger, larger_bZ);
\end{verbatim}
A positional join is a highly efficient kind of join found in the Monet
system, which occurs when an OID-column is joined with a VOID column. A
VOID column is a column that contains a sequence of densely ascending OIDs:
1@0, 2@0, 3@0, ..., N@0. In its implementation, Monet does not materialize
such OID sequences, hence the type name "void". It is easy to look up a
value in a VOID column, as the value you look up (e.g. 3@0) already tells
its position (=3). The positional join algorithm joins an outer
BAT[any,oid] with an inner BAT[void,any] by scanning over the outer BAT and
performing positional lookups into the inner BAT.
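For concreteness, the access pattern of such a positional join can be
sketched in a few lines of C, using plain arrays instead of Monet's actual
BAT structures (the function name and the seqbase parameter are
illustrative):
\begin{verbatim}
/* Illustrative sketch of the positional-join access pattern, not Monet's
 * BAT implementation.  The outer input is the oid tail of e.g. res_smaller;
 * the inner input is the tail of a BAT[void,int] such as smaller_a1, whose
 * void head (a dense oid sequence starting at seqbase) is never stored:
 * oid o simply lives at array position o - seqbase. */
typedef unsigned int oid;

static void
positional_join(const oid *outer_tail, int n,   /* scanned sequentially   */
                const int *inner_tail, oid seqbase,
                int *result_tail)               /* e.g. the tail of res_a1 */
{
    int i;
    for (i = 0; i < n; i++)
        result_tail[i] = inner_tail[outer_tail[i] - seqbase];
}
\end{verbatim}
The scan over the outer column is sequential, but the positional lookups
may touch the inner column in arbitrary order, which is why this loop only
stays cheap as long as the looked-up column fits the memory cache -- the
problem analyzed below.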
In a typical data warehouse, the join at line 03 would be positional if the
"key" columns are foreign keys between tables with a 1-1, 1-N or N-1
relationship (in those cases, one of the key columns would be of type
VOID). However, if the columns are in an N-M relationship, or if they do
not form a foreign key at all, the join becomes a merge-join or a
hash-join. Merge-join is only chosen if both smaller\_key and larger\_key
are already sorted on key (the tail column). As this cannot generally be
assumed, normally a hash-join would be the implementation chosen by Monet.
A hash-join performs well as long as the smaller\_key BAT plus its
associated hash-table (which adds about 8 bytes per tuple) is smaller than
the memory cache.

The latter phase of the query (lines A1-AY, B1-BZ) fetches column values
from the projected columns using positional join. This performs fine up to
the point where a single larger\_bX column BAT starts to exceed the size of
the memory cache.

We now turn our attention to what happens when these sizes exceed the
memory cache. First, we discuss what happens if the larger\_bX BATs (which
for simplicity we assume all have approximately the same size) do not fit
the memory cache. Then, we discuss what happens if even the smaller\_key
BAT plus its hash-table no longer fits.
@- The Role of Sorting in improving Memory Access