📄 readme

📁 postgresql8.3.4源码,开源数据库
💻
📖 第 1 页 / 共 3 页
字号:
上一页 1 23
btbulkdelete has to get super-exclusive lock on every leaf page, not onlythe ones where it actually sees items to delete.WAL considerations------------------The insertion and deletion algorithms in themselves don't guarantee btreeconsistency after a crash.  To provide robustness, we depend on WALreplay.  A single WAL entry is effectively an atomic action --- we canredo it from the log if it fails to complete.Ordinary item insertions (that don't force a page split) are of coursesingle WAL entries, since they only affect one page.  The same forleaf-item deletions (if the deletion brings the leaf page to zero items,it is now a candidate to be deleted, but that is a separate action).An insertion that causes a page split is logged as a single WAL entry forthe changes occuring on the insertion's level --- including update of theright sibling's left-link --- followed by a second WAL entry for theinsertion on the parent level (which might itself be a page split, requiringan additional insertion above that, etc).For a root split, the followon WAL entry is a "new root" entry rather thanan "insertion" entry, but details are otherwise much the same.Because insertion involves multiple atomic actions, the WAL replay logichas to detect the case where a page split isn't followed by a matchinginsertion on the parent level, and then do that insertion on its own (andrecursively for any subsequent parent insertion, of course).  This isfeasible because the WAL entry for the split contains enough info to knowwhat must be inserted in the parent level.When splitting a non-root page that is alone on its level, the requiredmetapage update (of the "fast root" link) is performed and logged as partof the insertion into the parent level.  When splitting the root page, themetapage update is handled as part of the "new root" action.A page deletion is logged as a single WAL entry covering all fourrequired page updates (target page, left and right siblings, and parent)as an atomic event.  (Any required fast-root link update is also partof the WAL entry.)  If the parent page becomes half-dead but is notimmediately deleted due to a subsequent crash, there is no loss ofconsistency, and the empty page will be picked up by the next VACUUM.Other things that are handy to know-----------------------------------Page zero of every btree is a meta-data page.  This page stores thelocation of the root page --- both the true root and the current effectiveroot ("fast" root).  To avoid fetching the metapage for every single indexsearch, we cache a copy of the meta-data information in the index'srelcache entry (rd_amcache).  This is a bit ticklish since using the cacheimplies following a root page pointer that could be stale.  We requireevery metapage update to send out a SI "relcache inval" message on theindex relation.  That ensures that each backend will flush its cached copynot later than the start of its next transaction.  Therefore, stalepointers cannot be used for longer than the current transaction, whichreduces the problem to the same one already dealt with for concurrentVACUUM --- we can just imagine that each open transaction is potentially"already in flight" to the old root.The algorithm assumes we can fit at least three items per page(a "high key" and two real data items).  Therefore it's unsafeto accept items larger than 1/3rd page size.  Larger items wouldwork sometimes, but could cause failures later on depending onwhat else gets put on their page."ScanKey" data structures are used in two fundamentally different waysin this code, which we describe as "search" scankeys and "insertion"scankeys.  A search scankey is the kind passed to btbeginscan() orbtrescan() from outside the btree code.  The sk_func pointers in a searchscankey point to comparison functions that return boolean, such as int4lt.There might be more than one scankey entry for a given index column, ornone at all.  (We require the keys to appear in index column order, butthe order of multiple keys for a given column is unspecified.)  Aninsertion scankey uses the same array-of-ScanKey data structure, but thesk_func pointers point to btree comparison support functions (ie, 3-waycomparators that return int4 values interpreted as <0, =0, >0).  In aninsertion scankey there is exactly one entry per index column.  Insertionscankeys are built within the btree code (eg, by _bt_mkscankey()) and areused to locate the starting point of a scan, as well as for locating theplace to insert a new index tuple.  (Note: in the case of an insertionscankey built from a search scankey, there might be fewer keys thanindex columns, indicating that we have no constraints for the remainingindex columns.)  After we have located the starting point of a scan, theoriginal search scankey is consulted as each index entry is sequentiallyscanned to decide whether to return the entry and whether the scan canstop (see _bt_checkkeys()).Notes about data representation-------------------------------The right-sibling link required by L&Y is kept in the page "opaquedata" area, as is the left-sibling link, the page level, and some flags.The page level counts upwards from zero at the leaf level, to the treedepth minus 1 at the root.  (Counting up from the leaves ensures that wedon't need to renumber any existing pages when splitting the root.)The Postgres disk block data format (an array of items) doesn't fitLehman and Yao's alternating-keys-and-pointers notion of a disk page,so we have to play some games.On a page that is not rightmost in its tree level, the "high key" iskept in the page's first item, and real data items start at item 2.The link portion of the "high key" item goes unused.  A page that isrightmost has no "high key", so data items start with the first item.Putting the high key at the left, rather than the right, may seem odd,but it avoids moving the high key as we add data items.On a leaf page, the data items are simply links to (TIDs of) tuplesin the relation being indexed, with the associated key values.On a non-leaf page, the data items are down-links to child pages withbounding keys.  The key in each data item is the *lower* bound forkeys on that child page, so logically the key is to the left of thatdownlink.  The high key (if present) is the upper bound for the lastdownlink.  The first data item on each such page has no lower bound--- or lower bound of minus infinity, if you prefer.  The comparisonroutines must treat it accordingly.  The actual key stored in theitem is irrelevant, and need not be stored at all.  This arrangementcorresponds to the fact that an L&Y non-leaf page has one more pointerthan key.Notes to operator class implementors------------------------------------With this implementation, we require each supported combination ofdatatypes to supply us with a comparison procedure via pg_amproc.This procedure must take two nonnull values A and B and return an int32 < 0,0, or > 0 if A < B, A = B, or A > B, respectively.  The procedure mustnot return INT_MIN for "A < B", since the value may be negated beforebeing tested for sign.  A null result is disallowed, too.  See nbtcompare.cfor examples.There are some basic assumptions that a btree operator family must satisfy:An = operator must be an equivalence relation; that is, for all non-nullvalues A,B,C of the datatype:	A = A is true						reflexive law	if A = B, then B = A					symmetric law	if A = B and B = C, then A = C				transitive lawA < operator must be a strong ordering relation; that is, for all non-nullvalues A,B,C:	A < A is false						irreflexive law	if A < B and B < C, then A < C				transitive lawFurthermore, the ordering is total; that is, for all non-null values A,B:	exactly one of A < B, A = B, and B < A is true		trichotomy law(The trichotomy law justifies the definition of the comparison supportprocedure, of course.)The other three operators are defined in terms of these two in the obvious way,and must act consistently with them.For an operator family supporting multiple datatypes, the above laws must holdwhen A,B,C are taken from any datatypes in the family.  The transitive lawsare the trickiest to ensure, as in cross-type situations they representstatements that the behaviors of two or three different operators areconsistent.  As an example, it would not work to put float8 and numeric intoan opfamily, at least not with the current semantics that numerics areconverted to float8 for comparison to a float8.  Because of the limitedaccuracy of float8, this means there are distinct numeric values that willcompare equal to the same float8 value, and thus the transitive law fails.It should be fairly clear why a btree index requires these laws to hold withina single datatype: without them there is no ordering to arrange the keys with.Also, index searches using a key of a different datatype require comparisonsto behave sanely across two datatypes.  The extensions to three or moredatatypes within a family are not strictly required by the btree indexmechanism itself, but the planner relies on them for optimization purposes.
上一页 1 23
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -