📄 opt_support.mx
字号:
@' The contents of this file are subject to the MonetDB Public License@' Version 1.1 (the "License"); you may not use this file except in@' compliance with the License. You may obtain a copy of the License at@' http://monetdb.cwi.nl/Legal/MonetDBLicense-1.1.html@'@' Software distributed under the License is distributed on an "AS IS"@' basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the@' License for the specific language governing rights and limitations@' under the License.@'@' The Original Code is the MonetDB Database System.@'@' The Initial Developer of the Original Code is CWI.@' Portions created by CWI are Copyright (C) 1997-2007 CWI.@' All Rights Reserved.@f opt_support@a M. Kersten@* The MAL OptimizerOne of the prime reasons to design the MAL intermediate language is to havea high-level description for database queries, which is easy to generate bya front-end compiler and easy to decode, optimize and interpret.An optimizer needs several mechanisms to be effective. It should be able toperform a symbolic evaluation of a code fragment and collect the result inproperties for further decision making. The prototypical case is where an optimizer estimates the result size of a selection.Another major issue is to be able to generate and explore a space of alternative evaluation plans. This exploration may take place up front, but can also be ran at runtime for query fragments.@menu* Landscape::* Dependencies::* Building Blocks::* Framework::* Lifespan Analysis::* Flow Analysis::* Optimizer Toolkit::@end menu@node Landscape, Dependencies ,The MAL Optimizer, The MAL Optimizer@+ The Optimizer LandscapeA query optimizer is often a large and complex piece of code, whichenumerates alternative evaluation plans from which 'the best' planis selected for evaluation. Limited progress has been made sofar todecompose the optimizer into (orthogonal) components, because it is a common believe in research that a holistic view on the problem is a prerequisite to find the best plan. Conversely, commercial optimizers use a cost-model driven approach, whichexplores part of the space using a limited (up to 300) rewriting rules.Our hypothesis is that query optimization should be realized witha collection of query optimizer transformers (QOT),each dedicated to a specific task.Furthermore, they are assembled in scenarios to support specific applicationdomains or achieve a desired behavior. Such scenarios are selected on asession basis, a query basis, or dynamically atruntime; they are part of the query plan.The query transformer list below is under consideration for development.For each we consider its goal, approach, and expected impact.Moreover, the minimal prerequisites identify the essential optimizers that should have done their work already. For example, it doesn;t make senseto perform a static evaluation unless you have already propagated theconstants using Alias Removal.@emph{Constant expressions}Goal: to remove scalar expressions which need be evaluated once during thequery lifetime.Rationale: static expressions appear when variables used denote literal constants (e.g. 1+1),when catalog information can be merged with the plan (e.g. max(B.salary)), when session variables are used which are initialized once (e.g. user()).Early evaluation aids subsequent optimization.Approach: inspect all instructions to locate static expressions. Whether they should be removed depends on the expected re-use,which in most cases call for an explicit request upon query registration to do so. The result of a static evaluation provides a ground for alias removal.Impact: relevant for stored queries (MAL functions)Prereq: alias removal, common terms@emph{Relational Expression Optimizer}Goal: to evaluate a relational plan using properties of BATs,such as being empty or forming an aligned group.These optimizations assume that the code generator can detect propertieswhile compiling e.g. an SQL query.Impact: highPrereq:@emph{Alias Removal }Goal: to reduce the number of variables referenceing the same value,thereby reducing the analysis complexity.Rationale: query transformations often result in replacingthe right-hand side expression with a result variable. This pollutes thecode block with simple assignments e.g. V:=T. Within the descendant flow theoccurrence of V could be replaced by T, provided V is never assigned a newvalue.Approach: literal constants within a MAL block are already recognized andreplaced by a single variable. Impact: medium@emph{Common Term Optimizer }Goal: to reduce the amount of work by avoiding calculation of the sameoperation twice.Rationale: to simplify code generation for front-ends, they do not haveto remember the subexpressions already evaluated. It is much easier to detect at the MAL level.Approach: simply walk through the instruction sequence and locate identicalpatterns. (Enhance is with semantic equivalent instructions)Impact: HighPrereq: Alias Removal@emph{Dead Code Removal }Goal: to remove all instructions whose result is not usedRationale: due to sloppy coding or alternative execution pathsdead code may appear. Als XML Pathfinder is expected to producea large number of simple assignments.Approach: Every instruction should produce a value used somewhere else.Impact: low@emph{Heuristic Rule Rewrites }Goal: to reduce the volume as quick as possible.Rationale: most queries are focussed on a small part of the database.To avoid carrying too many intermediates, the selection should be performed asearly as possible in the process. This assumes that selectivity factors areknown upfront, which in turn depends on histogram of the value distribution.Approach: locate selections and push them back/forth through the flow graph.Impact: high@emph{Join Path Optimizer }Goal: to reduce the volume produced by a join sequenceRationale: join paths are potentially expensive operations. Ideally the joinpath is evaluated starting at the smallest component, so as to reduce thesize of the intermediate results.Approach: to successfully reduce the volume we need to estimate their processingcost. This calls for statistics over the value distribution, in particular,correlation histograms. If statistics are not available upfront, we have to restore to an incremental algorithm, which decides on the steps using thesize of the relations.Impact: high@emph{Operator Sort }Goal: to sort the dataflow graph in such a way as to reduce thecost, or to assure locality of access for operands.Rationale: A simple optimizer is to order the instructions forexecution by permutation of the query components Approach:Impact:@emph{Singleton Set }Goal: to replace sets that are known to produce preciselyone tuple.Rationale: Singleton sets can be represented by value pairs in theMAL program, which reduces to a scalar expression.Approach: Identify a set variable for replacement.Impact:@emph{Range Propagation }Goal: look for constant ranges in select statements andpropagate them through the code.Rationale: partitioned tables and views may give rise toexpressions that contain multiple selections over thesame BAT. If their arguments are constant, the resultof such selects can sometimes be predicted, or themultiple selections can be cascaded into a single operation.Impact: high, should be followed by alias removal and dead code removal @emph{Result Cacher }Goal: to reduce the processing cost by keeping track of expensive tocompute intermediate resultsRationale: Approach: result caching becomes active after an instruction has been evaluated. The result can be cached as long as its underlying operandsremain unchanged. Result caching can be made transparent to the user, butaffects the other quer optimizers.Impact: high@emph{Vector Execution }Goal: to rewrite a query to use a cache-optimal vector implementationRationale: processing in the cache is by far the best you can get.However, the operands may far exceed the cache size and should be brokeninto pieces followed by a staged execution of the fragments involved.Approach: replace the query plan with fragment streamersImpact:@emph{Staged Execution }Goal: to split a query plan into a number of steps, such that the firstresponse set is delivered as quickly as possible. The remainder is onlyproduced upon request.Rationale: interactive queries call for quick response and an indicationof the processing time involved to run it too completion. Approach: staged execution can be realized using a fragmentation schemeover the database, e.g. each table is replaced by a union of fragments.This fragmentation could be determined upfront by the user or is derivedfrom the query and database statistics.impact: high@emph{Code Parallizer }Goal: to exploit parallel IO and cpu processing in both SMP and MPP settings.Rationale: throwing more resources to solve a complex query helps, providedit is easy to determine that parallel processing recovers the administrativeoverheadApproach: every flow path segment can be handled by an independent process thread.Impact: high@emph{Query Evaluation Maps }Goal: to avoid touching any tuple that is not relevant for answering a query.Rationale: the majority of work in solving a query is to disgard tuples ofno interest and to find correlated tuples through join conditions. Ideally,the database learns these properties over time and re-organizes (or builtsa map) to replace disgarding by map lookup.Approach: piggyback selection and joins as database fragmentation instructionsImpact: high@emph{MAL Compiler (tactics)}Goal: to avoid interpretation of functional expressionsRationale: interpretation of arithmetic expressions with an interpreteris always expensive. Replacing a complex arithmetic expressin with asimple dynamically compiled C-functions often pays off. Especially forcached (MAL) queriesApproach:Impact: high@emph{Dynamic Query Scheduler (tactics)}Goal: to organize the work in a way so as to optimize resource usageRationale: straight interpretation of a query plan may not lead to the bestuse of the underlying resources. For example, the content of the runtimecache may provide an opportunity to safe time by accessing a cached sourceApproach: query scheduling is the last step before a relation algebra interpretertakes over control. The scheduling step involves a re-ordering of the instructions within the boundaries imposed by the flow graph.impact: medium@emph{Aggregate Groups }Goal: to reduce the cost of computing aggregate expressions over timesRationale: many of our applications call for calculation of aggregatesover dynamically defined groupings. They call for lengtly scans and it paysto piggyback all aggregate calculates, leaving their result in the cache forlater consumption (eg the optimizers)Approach:Impact: High@emph{Data Cube optimizer}Goal: to recognize data cube operationsRationale: Approach:Impact:@emph{Demand Driven Interpreter (tactics)}Goal: to use the best interpreter and libraries geared at the task at handRationale: Interpretation of a query plan can be based on differentcomputational models. A demand driven interpretation starts at the intended output and 'walks' backward through the flow graph to collect the pieces,possibly in a pipelined fashion. (Vulcano model)Approach: merely calls for a different implementation of the core operatorsImpact: high@emph{Iterator Strength Reduction }Goal: to reduce the cost of iterator execution by moving instructionsout of the loop.Rationale: although iteration at the MAL level should be avoided due tothe inherent low performance compared to built-in operators, it is notforbidden. In that case we should confine the iterator block to the minimalwork needed.Approach: inspect the flowgraph for each iterator and move instructions around.Impact: low@emph{Accumulator Evaluation }Goal: to replace operators with cheaper ones.Rationale: based on the actual state of the computation and the richness ofthe supporting libraries there may exists alternative routes to solve a query.Approach: Operator rewriting depends on properties. No general technique.The first implementation looks at calculator expressions such as theyappear frequently in the RAM compiler.Impact: highPrerequisite: should be called after common term optimizer to avoid clashes.@emph{Code Inliner }Goal: to reduce the calling depth of the interpreter and to obtaina better starting point for code squeezingRationale: substitution of code blocks (or macro expansion) leads tolonger linear code sequences. This provides opportunities for squeezing.Moreover, at runtime building and managing a stackframe is rather expensive.This should be avoided for functions called repeatedly.Approach: called explicity to inline a module (or symbol)Impact: medium@emph{Code Outliner }Goal: to reduce the program size by replacing a group with a singleinstructionRationale: inverse macro expansion leads toshorter linear code sequences. This provides opportunities for less interpreteroverhead, and to optimize complex, but repetative instruction sequenceswith a single hardwired callApproach: called explicitly to outline a module (or symbol)Impact: medium@emph{Garbage Collector }Goal: to release resources as quickly as possibleRationale: BATs referenced from a MAL program keep resources locked.Approach: In cooperation with a resource scheduler we should identify thosethat can be released quickly. It requires a forced gargabe collection callat the end of the BAT's lifespan.Impact: large@emph{Foreign Key replacements }Goal: to improve multi-attribute joins over foreign key constraintsRationale: the code produced by the SQL frontend involves foreign key constraints, which provides many opportunities for speedy code.Impact: large@node Dependencies, Building Blocks, Landscape, The MAL Optimizer@subsection Optimizer DependenciesThe optimizers are highly targeted to a particular problem.Aside from the resources available to invest in plan optimization,optimizers are partly dependent and may interfere.To aid selection of the components of interest, we have groupedthem in a preferred order of deployment.@multitable @columnfractions .12 .8@item Group A:@tab Code Inliner.@item @tab Constant Expression Evaluator. @item @tab Relational Expression Evaluator. @item @tab Strength Reduction.@item Group B:@tab Common Term Optimizer.@item @tab Query Evaluation Maps.@item Group C:@tab Join Path Optimizer.@item @tab Ranges Propagation.@item @tab Operator Cost Reduction.@item @tab Operator Sort.@item @tab Foreign Key handling.@item @tab Aggregate Groups.@item @tab Data Cube optimizer.@item @tab Heuristic Rule Rewrite.@item group D:@tab Code Parallizer.@item
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -