however, because insertions and deletions on upper tree levels are always
done by reference to child page numbers, not keys.  The only cost is that
searches may sometimes descend to the half-dead page and then have to move
right, rather than going directly to the sibling page.

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings).  These
processes must observe that the page is marked dead and recover
accordingly.  Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

Stepping left in a backward scan is complicated because we must consider
the possibility that the left sibling was just split (meaning we must find
the rightmost page derived from the left sibling), plus the possibility
that the page we were just on has now been deleted and hence isn't in the
sibling chain at all anymore.  So the move-left algorithm becomes:

0. Remember the page we are on as the "original page".
1. Follow the original page's left-link (we're done if this is zero).
2. If the current page is live and its right-link matches the "original
   page", we are done.
3. Otherwise, move right one or more times looking for a live page whose
   right-link matches the "original page".  If found, we are done.  (In
   principle we could scan all the way to the right end of the index, but
   in practice it seems better to give up after a small number of tries.
   It's unlikely the original page's sibling split more than a few times
   while we were in flight to it; if we do not find a matching link in a
   few tries, then most likely the original page is deleted.)
4. Return to the "original page".  If it is still live, return to step 1
   (we guessed wrong about it being deleted, and should restart with its
   current left-link).  If it is dead, move right until a non-dead page
   is found (there must be one, since rightmost pages are never deleted),
   mark that as the new "original page", and return to step 1.

This algorithm is correct because the live page found by step 4 will have
the same left keyspace boundary as the page we started from.  Therefore,
when we ultimately exit, it must be on a page whose right keyspace
boundary matches the left boundary of where we started --- which is what
we need to be sure we don't miss or re-scan any items.

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed.  We implement this by waiting until all
transactions that were running at the time of deletion are dead; which is
overly strong, but is simple to implement within Postgres.  When marked
dead, a deleted page is labeled with the next-transaction counter value.
VACUUM can reclaim the page for re-use when this transaction number is
older than the oldest open transaction.  (NOTE: VACUUM FULL can reclaim
such pages immediately.)

Reclaiming a page doesn't actually change its state on disk --- we simply
record it in the shared-memory free space map, from which it will be
handed out the next time a new page is needed for a page split.  The
deleted page's contents will be overwritten by the split operation.
(Note: if we find a deleted page with an extremely old transaction
number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
a later xid wraparound can't cause us to think the page is unreclaimable.
But in more normal situations this would be a waste of a disk write.)

Because we never delete the rightmost page of any level (and in particular
never delete the root), it's impossible for the height of the tree to
decrease.
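The move-left algorithm above can be sketched in C.  This is a toy
in-memory model only: the ToyPage struct, the page-number array, and the
toy_move_left/demo function names are illustrative inventions, not the
real implementation, which works on locked disk buffers and defends
against pages changing concurrently.  Here each concurrent race is frozen
into an already-materialized chain shape.

```c
#include <stdbool.h>

/* Toy stand-in for a btree page's sibling links; page number 0 = "none". */
typedef struct ToyPage
{
    int  left;   /* left-link (page number of left sibling) */
    int  right;  /* right-link (page number of right sibling) */
    bool live;   /* false if the page has been deleted */
} ToyPage;

#define MOVE_LEFT_MAX_TRIES 4   /* give up after a few right-steps (step 3) */

/* Return the live left sibling of page "orig", or 0 at the left end. */
static int
toy_move_left(ToyPage *pages, int orig)
{
    for (;;)
    {
        /* Step 1: follow the original page's left-link. */
        int cur = pages[orig].left;

        if (cur == 0)
            return 0;

        /* Steps 2-3: look right for a live page linking back to orig. */
        for (int tries = 0; cur != 0 && tries < MOVE_LEFT_MAX_TRIES; tries++)
        {
            if (pages[cur].live && pages[cur].right == orig)
                return cur;
            cur = pages[cur].right;
        }

        /* Step 4: return to the original page. */
        if (!pages[orig].live)
        {
            /* Move right to a live page (the rightmost is never deleted). */
            do
                orig = pages[orig].right;
            while (!pages[orig].live);
        }
        /* Either way, retry from step 1 with the (new) original page. */
    }
}

/* Scenario: left sibling split while we were in flight: 1 <-> 2 <-> 4 <-> 3,
 * but page 3's left-link still points at the old left sibling, page 2. */
static int
demo_left_sibling_split(void)
{
    ToyPage pages[5] = {
        {0, 0, false},   /* page 0 is "no page" */
        {0, 2, true},    /* page 1 */
        {1, 4, true},    /* page 2: left half of the split */
        {2, 0, true},    /* page 3: stale left-link to page 2 */
        {2, 3, true},    /* page 4: new right half of old page 2 */
    };
    return toy_move_left(pages, 3);   /* steps 2-3 find page 4 */
}

/* Scenario: the page we were on (2) was deleted out of the chain 1 <-> 3;
 * its links survive so in-flight scans can recover. */
static int
demo_original_deleted(void)
{
    ToyPage pages[4] = {
        {0, 0, false},   /* page 0 is "no page" */
        {0, 3, true},    /* page 1 */
        {1, 3, false},   /* page 2: deleted, links left undisturbed */
        {1, 0, true},    /* page 3 */
    };
    return toy_move_left(pages, 2);   /* step 4 re-anchors on 3, then finds 1 */
}
```

The static scenarios can only show the recovery paths, not the locking
that makes each step atomic in the real code.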
After massive deletions we might have a scenario in which the
tree is "skinny", with several single-page levels below the root.
Operations will still be correct in this case, but we'd waste cycles
descending through the single-page levels.  To handle this we use an idea
from Lanin and Shasha: we keep track of the "fast root" level, which is
the lowest single-page level.  The meta-data page keeps a pointer to this
level as well as the true root.  All ordinary operations initiate their
searches at the fast root not the true root.  When we split a page that is
alone on its level or delete the next-to-last page on a level (both cases
are easily detected), we have to make sure that the fast root pointer is
adjusted appropriately.  In the split case, we do this work as part of the
atomic update for the insertion into the parent level; in the delete case
as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks).  This
avoids race conditions if two such operations are executing concurrently.

VACUUM needs to do a linear scan of an index to search for empty leaf
pages and half-dead parent pages that can be deleted, as well as deleted
pages that can be reclaimed because they are older than all open
transactions.

WAL considerations
------------------

The insertion and deletion algorithms in themselves don't guarantee btree
consistency after a crash.  To provide robustness, we depend on WAL
replay.  A single WAL entry is effectively an atomic action --- we can
redo it from the log if it fails to complete.

Ordinary item insertions (that don't force a page split) are of course
single WAL entries, since they only affect one page.
The same for
leaf-item deletions (if the deletion brings the leaf page to zero items,
it is now a candidate to be deleted, but that is a separate action).

An insertion that causes a page split is logged as a single WAL entry for
the changes occurring on the insertion's level --- including update of the
right sibling's left-link --- followed by a second WAL entry for the
insertion on the parent level (which might itself be a page split, requiring
an additional insertion above that, etc).

For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.

Because insertion involves multiple atomic actions, the WAL replay logic
has to detect the case where a page split isn't followed by a matching
insertion on the parent level, and then do that insertion on its own (and
recursively for any subsequent parent insertion, of course).  This is
feasible because the WAL entry for the split contains enough info to know
what must be inserted in the parent level.

When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
of the insertion into the parent level.  When splitting the root page, the
metapage update is handled as part of the "new root" action.

A page deletion is logged as a single WAL entry covering all four
required page updates (target page, left and right siblings, and parent)
as an atomic event.  (Any required fast-root link update is also part
of the WAL entry.)  If the parent page becomes half-dead but is not
immediately deleted due to a subsequent crash, there is no loss of
consistency, and the empty page will be picked up by the next VACUUM.

Other things that are handy to know
-----------------------------------

Page zero of every btree is a meta-data page.
This page stores the
location of the root page --- both the true root and the current effective
root ("fast" root).

The algorithm assumes we can fit at least three items per page
(a "high key" and two real data items).  Therefore it's unsafe
to accept items larger than 1/3rd page size.  Larger items would
work sometimes, but could cause failures later on depending on
what else gets put on their page.

"ScanKey" data structures are used in two fundamentally different ways
in this code.  Searches for the initial position for a scan, as well as
insertions, use scankeys in which the comparison function is a 3-way
comparator (<0, =0, >0 result).  These scankeys are built within the
btree code (eg, by _bt_mkscankey()) and used by _bt_compare().  Once we
are positioned, sequential examination of tuples in a scan is done by
_bt_checkkeys() using scankeys in which the comparison functions return
booleans --- for example, int4lt might be used.  These scankeys are the
ones originally passed in from outside the btree code.  Same
representation, but different comparison functions!

Notes about data representation
-------------------------------

The right-sibling link required by L&Y is kept in the page "opaque
data" area, as is the left-sibling link, the page level, and some flags.
The page level counts upwards from zero at the leaf level, to the tree
depth minus 1 at the root.  (Counting up from the leaves ensures that we
don't need to renumber any existing pages when splitting the root.)

The Postgres disk block data format (an array of items) doesn't fit
Lehman and Yao's alternating-keys-and-pointers notion of a disk page,
so we have to play some games.

On a page that is not rightmost in its tree level, the "high key" is
kept in the page's first item, and real data items start at item 2.
The link portion of the "high key" item goes unused.
A page that is
rightmost has no "high key", so data items start with the first item.
Putting the high key at the left, rather than the right, may seem odd,
but it avoids moving the high key as we add data items.

On a leaf page, the data items are simply links to (TIDs of) tuples
in the relation being indexed, with the associated key values.

On a non-leaf page, the data items are down-links to child pages with
bounding keys.  The key in each data item is the *lower* bound for
keys on that child page, so logically the key is to the left of that
downlink.  The high key (if present) is the upper bound for the last
downlink.  The first data item on each such page has no lower bound
--- or lower bound of minus infinity, if you prefer.  The comparison
routines must treat it accordingly.  The actual key stored in the
item is irrelevant, and need not be stored at all.  This arrangement
corresponds to the fact that an L&Y non-leaf page has one more pointer
than key.

Notes to operator class implementors
------------------------------------

With this implementation, we require each supported datatype to supply
us with a comparison procedure via pg_amproc.  This procedure must take
two nonnull values A and B and return an int32 < 0, 0, or > 0 if A < B,
A = B, or A > B, respectively.  See nbtcompare.c for examples.
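As a rough illustration of the required comparison semantics, here is the
logic for a signed 32-bit key written as a plain C function.  The function
name toy_int4_cmp is hypothetical; the real support procedures (such as
btint4cmp in nbtcompare.c) use the fmgr Datum calling convention rather
than raw C arguments.

```c
#include <stdint.h>

/*
 * 3-way comparator for nonnull int32 keys: < 0, 0, or > 0 as A < B,
 * A = B, or A > B.  Note we compare explicitly rather than returning
 * a - b, which could overflow and give a wrong-signed result.
 */
static int32_t
toy_int4_cmp(int32_t a, int32_t b)
{
    if (a > b)
        return 1;
    else if (a == b)
        return 0;
    else
        return -1;
}
```

Any total ordering works, as long as the procedure is consistent: the same
procedure serves both the 3-way scankeys built inside the btree code and,
indirectly, the boolean operators of the operator class.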