📄 readme

📁 PostgreSQL 8.1.4的源码适用于Linux下的开源数据库系统
💻
📖 第 1 页 / 共 2 页
字号:
上一页 12
  SeqScan       - a plain Path node with pathtype = T_SeqScan  IndexPath     - index scans  BitmapHeapPath - top of a bitmapped index scan  TidPath       - scan by CTID  AppendPath    - append multiple subpaths together  ResultPath    - a Result plan node (used for variable-free tlist or qual)  MaterialPath  - a Material plan node  UniquePath    - remove duplicate rows  NestPath      - nested-loop joins  MergePath     - merge joins  HashPath      - hash joins PathKeys       - a data structure representing the ordering of a pathThe optimizer spends a good deal of its time worrying about the orderingof the tuples returned by a path.  The reason this is useful is that byknowing the sort ordering of a path, we may be able to use that path asthe left or right input of a mergejoin and avoid an explicit sort step.Nestloops and hash joins don't really care what the order of their inputsis, but mergejoin needs suitably ordered inputs.  Therefore, all pathsgenerated during the optimization process are marked with their sort order(to the extent that it is known) for possible use by a higher-level merge.It is also possible to avoid an explicit sort step to implement a user'sORDER BY clause if the final path has the right ordering already, so thesort ordering is of interest even at the top level.  query_planner() willlook for the cheapest path with a sort order matching the desired order,and grouping_planner() will compare its cost to the cost of using thecheapest-overall path and doing an explicit sort.When we are generating paths for a particular RelOptInfo, we discard a pathif it is more expensive than another known path that has the same or bettersort order.  We will never discard a path that is the only known way toachieve a given sort order (without an explicit sort, that is).  In thisway, the next level up will have the maximum freedom to build mergejoinswithout sorting, since it can pick from any of the paths retained for itsinputs.PathKeys--------The PathKeys data structure represents what is known about the sort orderof a particular Path.Path.pathkeys is a List of Lists of PathKeyItem nodes that representthe sort order of the result generated by the Path.  The n'th sublistrepresents the n'th sort key of the result.In single/base relation RelOptInfo's, the Paths represent various waysof scanning the relation and the resulting ordering of the tuples.Sequential scan Paths have NIL pathkeys, indicating no known ordering.Index scans have Path.pathkeys that represent the chosen index's ordering,if any.  A single-key index would create a pathkey with a single sublist,e.g. ( (tab1.indexkey1/sortop1) ).  A multi-key index generates a sublistper key, e.g. ( (tab1.indexkey1/sortop1) (tab1.indexkey2/sortop2) ) whichshows major sort by indexkey1 (ordering by sortop1) and minor sort byindexkey2 with sortop2.Note that a multi-pass indexscan (OR clause scan) has NIL pathkeys sincewe can say nothing about the overall order of its result.  Also, anindexscan on an unordered type of index generates NIL pathkeys.  However,we can always create a pathkey by doing an explicit sort.  The pathkeysfor a Sort plan's output just represent the sort key fields and theordering operators used.Things get more interesting when we consider joins.  Suppose we do amergejoin between A and B using the mergeclause A.X = B.Y.  The outputof the mergejoin is sorted by X --- but it is also sorted by Y.  Werepresent this fact by listing both keys in a single pathkey sublist:( (A.X/xsortop B.Y/ysortop) ).  This pathkey asserts that the majorsort order of the Path can be taken to be *either* A.X or B.Y.They are equal, so they are both primary sort keys.  By doing this,we allow future joins to use either var as a pre-sorted key, so upperMergejoins may be able to avoid having to re-sort the Path.  This iswhy pathkeys is a List of Lists.We keep a sortop associated with each PathKeyItem because cross-data-typemergejoins are possible; for example int4 = int8 is mergejoinable.In this case we need to remember that the left var is ordered by int4ltwhile the right var is ordered by int8lt.  So the different members ofeach sublist could have different sortops.Note that while the order of the top list is meaningful (primary vs.secondary sort key), the order of each sublist is arbitrary.  Each sublistshould be regarded as a set of equivalent keys, with no significanceto the list order.With a little further thought, it becomes apparent that pathkeys forjoins need not only come from mergejoins.  For example, if we do anestloop join between outer relation A and inner relation B, then anypathkeys relevant to A are still valid for the join result: we havenot altered the order of the tuples from A.  Even more interesting,if there was a mergeclause (more formally, an "equijoin clause") A.X=B.Y,and A.X was a pathkey for the outer relation A, then we can assert thatB.Y is a pathkey for the join result; X was ordered before and still is,and the joined values of Y are equal to the joined values of X, so Ymust now be ordered too.  This is true even though we used neither anexplicit sort nor a mergejoin on Y.More generally, whenever we have an equijoin clause A.X = B.Y and apathkey A.X, we can add B.Y to that pathkey if B is part of the joinedrelation the pathkey is for, *no matter how we formed the join*.  It worksas long as the clause has been applied at some point while forming thejoin relation.  (In the current implementation, we always apply qualclauses as soon as possible, ie, as far down in the plan tree as possible.So we can treat the pathkeys as equivalent everywhere.  The exception iswhen the relations A and B are joined inside the nullable side of anOUTER JOIN and the equijoin clause comes from above the OUTER JOIN.  In thiscase we cannot apply the qual as soon as A and B are joined, so we do notconsider the pathkeys to be equivalent.  This could be improved if we wantedto go to the trouble of making pathkey equivalence be context-dependent,but that seems much more complex than it's worth.)In short, then: when producing the pathkeys for a merge or nestloop join,we can keep all of the keys of the outer path, since the ordering of theouter path will be preserved in the result.  Furthermore, we can add toeach pathkey sublist any inner vars that are equijoined to any of theouter vars in the sublist; this works regardless of whether we areimplementing the join using that equijoin clause as a mergeclause,or merely enforcing the clause after-the-fact as a qpqual filter.Although Hashjoins also work only with equijoin operators, it is *not*safe to consider the output of a Hashjoin to be sorted in any particularorder --- not even the outer path's order.  This is true because theexecutor might have to split the join into multiple batches.  Thereforea Hashjoin is always given NIL pathkeys.  (Also, we need to use onlymergejoinable operators when deducing which inner vars are now sorted,because a mergejoin operator tells us which left- and right-datatypesortops can be considered equivalent, whereas a hashjoin operatordoesn't imply anything about sort order.)Pathkeys are also useful to represent an ordering that we wish to achieve,since they are easily compared to the pathkeys of a potential candidatepath.  So, SortClause lists are turned into pathkeys lists for use insidethe optimizer.OK, now for how it *really* works:We did implement pathkeys just as described above, and found that theplanner spent a huge amount of time comparing pathkeys, because therepresentation of pathkeys as unordered lists made it expensive to decidewhether two were equal or not.  So, we've modified the representationas described next.If we scan the WHERE clause for equijoin clauses (mergejoinable clauses)during planner startup, we can construct lists of equivalent pathkey itemsfor the query.  There could be more than two items per equivalence set;for example, WHERE A.X = B.Y AND B.Y = C.Z AND D.R = E.S creates theequivalence sets { A.X B.Y C.Z } and { D.R E.S } (plus associated sortops).Any pathkey item that belongs to an equivalence set implies that all theother items in its set apply to the relation too, or at least all the onesthat are for fields present in the relation.  (Some of the items in theset might be for as-yet-unjoined relations.)  Furthermore, any multi-itempathkey sublist that appears at any stage of planning the query *must* bea subset of one or another of these equivalence sets; there's no way we'dhave put two items in the same pathkey sublist unless they were equijoinedin WHERE.Now suppose that we allow a pathkey sublist to contain pathkey items forvars that are not yet part of the pathkey's relation.  This introducesno logical difficulty, because such items can easily be seen to beirrelevant; we just mandate that they be ignored.  But having allowedthis, we can declare (by fiat) that any multiple-item pathkey sublistmust be "equal()" to the appropriate equivalence set.  In effect,whenever we make a pathkey sublist that mentions any var appearing in anequivalence set, we instantly add all the other vars equivalenced to it,whether they appear yet in the pathkey's relation or not.  And we alsomandate that the pathkey sublist appear in the same order as theequivalence set it comes from.In fact, we can go even further, and say that the canonical representationof a pathkey sublist is a pointer directly to the relevant equivalence set,which is kept in a list of pathkey equivalence sets for the query.  Thenpathkey sublist comparison reduces to pointer-equality checking!  To do thiswe also have to add single-element pathkey sublists to the query's list ofequivalence sets, but that's a small price to pay.By the way, it's OK and even useful for us to build equivalence setsthat mention multiple vars from the same relation.  For example, ifwe have WHERE A.X = A.Y and we are scanning A using an index on X,we can legitimately conclude that the path is sorted by Y as well;and this could be handy if Y is the variable used in other join clausesor ORDER BY.  So, any WHERE clause with a mergejoinable operator cancontribute to an equivalence set, even if it's not a join clause.As sketched so far, equijoin operators allow us to conclude thatA.X = B.Y and B.Y = C.Z together imply A.X = C.Z, even when differentdatatypes are involved.  What is not immediately obvious is that to usethe "canonical pathkey" representation, we *must* make this deduction.An example (from a real bug in Postgres 7.0) is a mergejoin for a querylike	SELECT * FROM t1, t2 WHERE t1.f2 = t2.f3 AND t1.f1 = t2.f3;The canonical-pathkey mechanism is able to deduce that t1.f1 = t1.f2(ie, both appear in the same canonical pathkey set).  If we sort t1and then apply a mergejoin, we *must* filter the t1 tuples using theimplied qualification f1 = f2, because otherwise the output of the sortwill be ordered by f1 or f2 (whichever we sort on) but not both.  Themerge will then fail since (depending on which qual clause it appliesfirst) it's expecting either ORDER BY f1,f2 or ORDER BY f2,f1, but theactual output of the sort has neither of these orderings.  The best fixfor this is to generate all the implied equality constraints for eachequijoin set and add these clauses to the query's qualification list.In other words, we *explicitly* deduce f1 = f2 and add this to the WHEREclause.  The constraint will be applied as a qpqual to the output of thescan on t1, resulting in sort output that is indeed ordered by both vars.This approach provides more information to the selectivity estimationcode than it would otherwise have, and reduces the number of tuplesprocessed in join stages, so it's a win to make these deductions evenif we weren't forced to.When we generate implied equality constraints, we may find ourselvesadding redundant clauses to specific relations.  For example, consider	SELECT * FROM t1, t2, t3 WHERE t1.a = t2.b AND t2.b = t3.c;We will generate the implied clause t1.a = t3.c and add it to the tree.This is good since it allows us to consider joining t1 and t3 directly,which we otherwise wouldn't do.  But when we reach the stage of joiningall three relations, we will have redundant join clauses --- eg, if wejoin t1 and t2 first, then the path that joins (t1 t2) to t3 will haveboth t2.b = t3.c and t1.a = t3.c as restriction clauses.  This is bad;not only is evaluation of the extra clause useless work at runtime,but the selectivity estimator routines will underestimate the numberof tuples produced since they won't know that the two clauses areperfectly redundant.  We fix this by detecting and removing redundantclauses as the restriction clause list is built for each join.  (Wecan't do it sooner, since which clauses are redundant will vary dependingon the join order.)Yet another implication of all this is that mergejoinable operatorsmust form closed equivalence sets.  For example, if "int2 = int4"and "int4 = int8" are both marked mergejoinable, then there had betterbe a mergejoinable "int2 = int8" operator as well.  Otherwise, whenwe're given WHERE int2var = int4var AND int4var = int8var, we'll failwhile trying to create a representation of the implied clauseint2var = int8var.An additional refinement we can make is to insist that canonical pathkeylists (sort orderings) do not mention the same pathkey set more than once.For example, a pathkey list ((A) (B) (A)) is redundant --- the secondoccurrence of (A) does not change the ordering, since the data must alreadybe sorted by A.  Although a user probably wouldn't write ORDER BY A,B,Adirectly, such redundancies are more probable once equijoin equivalenceshave been considered.  Also, the system is likely to generate redundantpathkey lists when computing the sort ordering needed for a mergejoin.  Byeliminating the redundancy, we save time and improve planning, since theplanner will more easily recognize equivalent orderings as being equivalent.Though Bob Devine <bob.devine@worldnet.att.net> was not involved in the coding of our optimizer, he is available to field questions aboutoptimizer topics.-- bjm & tgl
上一页 12
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -