📄 readme
字号:
$PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.8 2003/11/29 19:51:40 pgsql Exp $This directory contains a correct implementation of Lehman and Yao'shigh-concurrency B-tree management algorithm (P. Lehman and S. Yao,Efficient Locking for Concurrent Operations on B-Trees, ACM Transactionson Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We alsouse a simplified version of the deletion logic described in Lanin andShasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm,Proceedings of 1986 Fall Joint Computer Conference, pp 380-389).The Lehman and Yao algorithm and insertions-------------------------------------------We have made the following changes in order to incorporate the L&Y algorithminto Postgres:The requirement that all btree keys be unique is too onerous,but the algorithm won't work correctly without it. Fortunately, it isonly necessary that keys be unique on a single tree level, because L&Yonly use the assumption of key uniqueness when re-finding a key in aparent page (to determine where to insert the key for a split page).Therefore, we can use the link field to disambiguate multipleoccurrences of the same user key: only one entry in the parent levelwill be pointing at the page we had split. (Indeed we need not look atthe real "key" at all, just at the link field.) We can distinguishitems at the leaf level in the same way, by examining their links toheap tuples; we'd never have two items for the same heap tuple.Lehman and Yao assume that the key range for a subtree S is describedby Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parentpage. This does not work for nonunique keys (for example, if we haveenough equal keys to spread across several leaf pages, there *must* besome equal bounding keys in the first level up). Therefore we assumeKi <= v <= Ki+1 instead. A search that finds exact equality to abounding key in an upper tree level must descend to the left of thatkey to ensure it finds any equal keys in the preceding page. Aninsertion that sees the high key of its target page is equal to the keyto be inserted has a choice whether or not to move right, since the newkey could go on either page. (Currently, we try to find a page wherethere is room for the new key without a split.)Lehman and Yao don't require read locks, but assume that in-memorycopies of tree pages are unshared. Postgres shares in-memory buffersamong backends. As a result, we do page-level read locking on btreepages in order to guarantee that no record is modified while we areexamining it. This reduces concurrency but guaranteees correctbehavior. An advantage is that when trading in a read lock for awrite lock, we need not re-read the page after getting the write lock.Since we're also holding a pin on the shared buffer containing thepage, we know that buffer still contains the page and is up-to-date.We support the notion of an ordered "scan" of an index as well asinsertions, deletions, and simple lookups. A scan in the forwarddirection is no problem, we just use the right-sibling pointers thatL&Y require anyway. (Thus, once we have descended the tree to thecorrect start point for the scan, the scan looks only at leaf pagesand never at higher tree levels.) To support scans in the backwarddirection, we also store a "left sibling" link much like the "rightsibling". (This adds an extra step to the L&Y split algorithm: whileholding the write lock on the page being split, we also lock its formerright sibling to update that page's left-link. This is safe since nowriter of that page can be interested in acquiring a write lock on ourpage.) A backwards scan has one additional bit of complexity: afterfollowing the left-link we must account for the possibility that theleft sibling page got split before we could read it. So, we have tomove right until we find a page whose right-link matches the page wecame from. (Actually, it's even harder than that; see deletion discussionbelow.)Read locks on a page are held for as long as a scan is examining a page.But nbtree.c arranges to drop the read lock, but not the buffer pin,on the current page of a scan before control leaves nbtree. When wecome back to resume the scan, we have to re-grab the read lock andthen move right if the current item moved (see _bt_restscan()). Keepingthe pin ensures that the current item cannot move left or be deleted(see btbulkdelete).In most cases we release our lock and pin on a page before attemptingto acquire pin and lock on the page we are moving to. In a few placesit is necessary to lock the next page before releasing the current one.This is safe when moving right or up, but not when moving left or down(else we'd create the possibility of deadlocks).Lehman and Yao fail to discuss what must happen when the root pagebecomes full and must be split. Our implementation is to split theroot in the same way that any other page would be split, then constructa new root page holding pointers to both of the resulting pages (whichnow become siblings on the next level of the tree). The new root pageis then installed by altering the root pointer in the meta-data page (seebelow). This works because the root is not treated specially in anyother way --- in particular, searches will move right using its linkpointer if the link is set. Therefore, searches will find the datathat's been moved into the right sibling even if they read the meta-datapage before it got updated. This is the same reasoning that makes asplit of a non-root page safe. The locking considerations are similar too.When an inserter recurses up the tree, splitting internal pages to insertlinks to pages inserted on the level below, it is possible that it willneed to access a page above the level that was the root when it began itsdescent (or more accurately, the level that was the root when it read themeta-data page). In this case the stack it made while descending does nothelp for finding the correct page. When this happens, we find the correctplace by re-descending the tree until we reach the level one above thelevel we need to insert a link to, and then moving right as necessary.(Typically this will take only two fetches, the meta-data page and the newroot, but in principle there could have been more than one root splitsince we saw the root. We can identify the correct tree level by means ofthe level numbers stored in each page. The situation is rare enough thatwe do not need a more efficient solution.)Lehman and Yao assume fixed-size keys, but we must deal withvariable-size keys. Therefore there is not a fixed maximum number ofkeys per page; we just stuff in as many as will fit. When we split apage, we try to equalize the number of bytes, not items, assigned toeach of the resulting pages. Note we must include the incoming item inthis calculation, otherwise it is possible to find that the incomingitem doesn't fit on the split page where it needs to go!The deletion algorithm----------------------Deletions of leaf items are handled by getting a super-exclusive lock onthe target page, so that no other backend has a pin on the page when thedeletion starts. This means no scan is pointing at the page, so no otherbackend can lose its place due to the item deletion.The above does not work for deletion of items in internal pages, sinceother backends keep no lock nor pin on a page they have descended past.Instead, when a backend is ascending the tree using its stack, it mustbe prepared for the possibility that the item it wants is to the left ofthe recorded position (but it can't have moved left out of the recordedpage). Since we hold a lock on the lower page (per L&Y) until we havere-found the parent item that links to it, we can be assured that theparent item does still exist and can't have been deleted. Also, becausewe are matching downlink page numbers and not data keys, we don't have anyproblem with possibly misidentifying the parent item.We consider deleting an entire page from the btree only when it's becomecompletely empty of items. (Merging partly-full pages would allow betterspace reuse, but it seems impractical to move existing data items left orright to make this happen --- a scan moving in the opposite directionmight miss the items if so. We could do it during VACUUM FULL, though.)Also, we *never* delete the rightmost page on a tree level (thisrestriction simplifies the traversal algorithms, as explained below).To delete an empty page, we acquire write lock on its left sibling (ifany), the target page itself, the right sibling (there must be one), andthe parent page, in that order. The parent page must be found using thesame type of search as used to find the parent during an insertion split.Then we update the side-links in the siblings, mark the target pagedeleted, and remove the downlink from the parent, as well as the parent'supper bounding key for the target (the one separating it from its rightsibling). This causes the target page's key space to effectively belongto its right sibling. (Neither the left nor right sibling pages need tochange their "high key" if any; so there is no problem with possibly nothaving enough space to replace a high key.) The side-links in the targetpage are not changed.(Note: Lanin and Shasha prefer to make the key space move left, but theirargument for doing so hinges on not having left-links, which we haveanyway. So we simplify the algorithm by moving key space right.)To preserve consistency on the parent level, we cannot merge the key spaceof a page into its right sibling unless the right sibling is a child ofthe same parent --- otherwise, the parent's key space assignment changestoo, meaning we'd have to make bounding-key updates in its parent, andperhaps all the way up the tree. Since we can't possibly do thatatomically, we forbid this case. That means that the rightmost child of aparent node can't be deleted unless it's the only remaining child.When we delete the last remaining child of a parent page, we mark theparent page "half-dead" as part of the atomic update that deletes thechild page. This implicitly transfers the parent's key space to its rightsibling (which it must have, since we never delete the overall-rightmostpage of a level). No future insertions into the parent level are allowedto insert keys into the half-dead page --- they must move right to itssibling, instead. The parent remains empty and can be deleted in aseparate atomic action. (However, if it's the rightmost child of its ownparent, it might have to stay half-dead for awhile, until it's also theonly child.)Note that an empty leaf page is a valid tree state, but an empty interiorpage is not legal (an interior page must have children to delegate itskey space to). So an interior page *must* be marked half-dead as soonas its last child is deleted.The notion of a half-dead page means that the key space relationship betweenthe half-dead page's level and its parent's level may be a little out ofwhack: key space that appears to belong to the half-dead page's parent on theparent level may really belong to its right sibling. We can tolerate this,
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -