📄 readme.hot
字号:
$PostgreSQL: pgsql/src/backend/access/heap/README.HOT,v 1.2 2007/09/21 21:25:42 tgl Exp $ Heap Only Tuples (HOT)Introduction------------The Heap Only Tuple (HOT) feature eliminates redundant index entries andallows the re-use of space taken by DELETEd or obsoleted UPDATEd tupleswithout performing a table-wide vacuum. It does this by allowingsingle-page vacuuming, also called "defragmentation".Note: there is a Glossary at the end of this document that may be helpfulfor first-time readers.Technical Challenges--------------------Page-at-a-time vacuuming is normally impractical because of the costs offinding and removing the index entries that link to the tuples to bereclaimed. Standard vacuuming scans the indexes to ensure all such indexentries are removed, amortizing the index scan cost across as many deadtuples as possible; this approach does not scale down well to the case ofreclaiming just a few tuples. In principle one could recompute the indexkeys and do standard index searches to find the index entries, but this isrisky in the presence of possibly-buggy user-defined functions infunctional indexes. An allegedly immutable function that in fact is notimmutable might prevent us from re-finding an index entry (and we cannotthrow an error for not finding it, in view of the fact that dead indexentries are sometimes reclaimed early). That would lead to a seriouslycorrupt index, in the form of entries pointing to tuple slots that by nowcontain some unrelated content. In any case we would prefer to be ableto do vacuuming without invoking any user-written code.HOT solves this problem for a restricted but useful special case:where a tuple is repeatedly updated in ways that do not change itsindexed columns. (Here, "indexed column" means any column referencedat all in an index definition, including for example columns that aretested in a partial-index predicate but are not stored in the index.)An additional property of HOT is that it reduces index size by avoidingthe creation of identically-keyed index entries. This improves searchspeeds.Update Chains With a Single Index Entry---------------------------------------Without HOT, every version of a row in an update chain has its own indexentries, even if all indexed columns are the same. With HOT, a new tupleplaced on the same page and with all indexed columns the same as itsparent row version does not get new index entries. This means there isonly one index entry for the entire update chain on the heap page.An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag.The prior row version is marked HEAP_HOT_UPDATED, and (as always in anupdate chain) its t_ctid field links forward to the newer version.For example: Index points to 1 lp [1] [2] [111111111]->[2222222222]In the above diagram, the index points to line pointer 1, and tuple 1 ismarked as HEAP_HOT_UPDATED. Tuple 2 is a HOT tuple, meaning it hasno index entry pointing to it, and is marked as HEAP_ONLY_TUPLE.Although tuple 2 is not directly referenced by the index, it can still befound by an index search: after traversing from the index to tuple 1,the index search proceeds forward to child tuples as long as it sees theHEAP_HOT_UPDATED flag set. Since we restrict the HOT chain to lie withina single page, this requires no additional page fetches and doesn'tintroduce much performance penalty.Eventually, tuple 1 will no longer be visible to any transaction.At that point its space could be reclaimed, but its line pointer cannot,since the index still links to that line pointer and we still need tobe able to find tuple 2 in an index search. HOT handles this by turningline pointer 1 into a "redirecting line pointer", which links to tuple 2but has no actual tuple attached. This state of affairs looks like Index points to 1 lp [1]->[2] [2222222222]If now the row is updated again, to version 3, the page looks like this: Index points to 1 lp [1]->[2] [3] [2222222222]->[3333333333]At some later time when no transaction can see tuple 2 in its snapshot,tuple 2 and its line pointer can be pruned entirely: Index points to 1 lp [1]------>[3] [3333333333]This is safe because no index entry points to line pointer 2. Subsequentinsertions into the page can now recycle both line pointer 2 and thespace formerly used by tuple 2.If an update changes any indexed column, or there is not room on thesame page for the new tuple, then the HOT chain ends: the last memberhas a regular t_ctid link to the next version and is not markedHEAP_HOT_UPDATED. (In principle we could continue a HOT chain acrosspages, but this would destroy the desired property of being able toreclaim space with just page-local manipulations. Anyway, we don'twant to have to chase through multiple heap pages to get from an indexentry to the desired tuple, so it seems better to create a new indexentry for the new tuple.) If further updates occur, the next versioncould become the root of a new HOT chain.Line pointer 1 has to remain as long as there is any non-dead member ofthe chain on the page. When there is not, it is marked "dead".This lets us reclaim the last child line pointer and associated tupleimmediately. The next regular VACUUM pass can reclaim the index entriespointing at the line pointer and then the line pointer itself. Since aline pointer is small compared to a tuple, this does not represent anundue space cost.Note: we can use a "dead" line pointer for any DELETEd tuple,whether it was part of a HOT chain or not. This allows space reclamationin advance of running VACUUM for plain DELETEs as well as HOT updates.The requirement for doing a HOT update is that none of the indexedcolumns are changed. This is checked at execution time by comparing thebinary representation of the old and new values. We insist on bitwiseequality rather than using datatype-specific equality routines. Themain reason to avoid the latter is that there might be multiple notionsof equality for a datatype, and we don't know exactly which one isrelevant for the indexes at hand. We assume that bitwise equalityguarantees equality for all purposes.Abort Cases-----------If a heap-only tuple's xmin is aborted, then it can be removed immediately:it was never visible to any other transaction, and all descendant rowversions must be aborted as well. Therefore we need not consider it partof a HOT chain. By the same token, if a HOT-updated tuple's xmax isaborted, there is no need to follow the chain link. However, there is arace condition here: the transaction that did the HOT update might abortbetween the time we inspect the HOT-updated tuple and the time we reachthe descendant heap-only tuple. It is conceivable that someone prunesthe heap-only tuple before that, and even conceivable that the line pointeris re-used for another purpose. Therefore, when following a HOT chain,it is always necessary to be prepared for the possibility that thelinked-to item pointer is unused, dead, or redirected; and if it is anormal item pointer, we still have to check that XMIN of the tuple matchesthe XMAX of the tuple we left. Otherwise we should assume that we havecome to the end of the HOT chain. Note that this sort of XMIN/XMAXmatching is required when following ordinary update chains anyway.(Early versions of the HOT code assumed that holding pin on the pagebuffer while following a HOT link would prevent this type of problem,but checking XMIN/XMAX matching is a much more robust solution.)Index/Sequential Scans----------------------When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whosexmax is not aborted, we need to follow its t_ctid link and check thatentry as well; possibly repeatedly until we reach the end of the HOTchain. (When using an MVCC snapshot it is possible to optimize this abit: there can be at most one visible tuple in the chain, so we can stopwhen we find it. This rule does not work for non-MVCC snapshots, though.)Sequential scans do not need to pay attention to the HOT links becausethey scan every item pointer on the page anyway. The same goes for abitmap heap scan with a lossy bitmap.Pruning-------HOT pruning means updating item pointers so that HOT chains arereduced in length, by collapsing out line pointers for intermediate deadtuples. Although this makes those line pointers available for re-use,it does not immediately make the space occupied by their tuples available.Defragmentation---------------Defragmentation centralizes unused space. After we have converted rootline pointers to redirected line pointers and pruned away any deadintermediate line pointers, the tuples they linked to are free space.But unless that space is adjacent to the central "hole" on the page(the pd_lower-to-pd_upper area) it cannot be used by tuple insertion.Defragmentation moves the surviving tuples to coalesce all the freespace into one "hole". This is done with the same PageRepairFragmentationfunction that regular VACUUM uses.When can/should we prune or defragment?---------------------------------------This is the most interesting question in HOT implementation, since thereis no simple right answer: we must use heuristics to determine when it'smost efficient to perform pruning and/or defragmenting.We cannot prune or defragment unless we can get a "buffer cleanup lock"on the target page; otherwise, pruning might destroy line pointers thatother backends have live references to, and defragmenting might movetuples that other backends have live pointers to. Thus the generalapproach must be to heuristically decide if we should try to pruneor defragment, and if so try to acquire the buffer cleanup lock withoutblocking. If we succeed we can proceed with our housekeeping work.If we cannot get the lock (which should not happen often, except undervery heavy contention) then the housekeeping has to be postponed tillsome other time. The worst-case consequence of this is only that anUPDATE cannot be made HOT but has to link to a new tuple version placed onsome other page, for lack of centralized space on the original page.Ideally we would do defragmenting only when we are about to attemptheap_update on a HOT-safe tuple. The difficulty with this approachis that the update query has certainly got a pin on the old tuple, andtherefore our attempt to acquire a buffer cleanup lock will always fail.(This corresponds to the idea that we don't want to move the old tupleout from under where the query's HeapTuple pointer points. It mightbe possible to finesse that, but it seems fragile.)Pruning, however, is potentially useful even when we are not about toinsert a new tuple, since shortening a HOT chain reduces the cost ofsubsequent index searches. However it is unclear that this gain islarge enough to accept any extra maintenance burden for.The currently planned heuristic is to prune and defrag when first accessinga page that potentially has prunable tuples (as flagged by the pd_prune_xidpage hint field) and that either has free space less than MAX(fillfactortarget free space, BLCKSZ/10) *or* has recently had an UPDATE fail tofind enough free space to store an updated tuple version. (These rulesare subject to change.)We have effectively implemented the "truncate dead tuples to just linepointer" idea that has been proposed and rejected before because of fearof line pointer bloat: we might end up with huge numbers of line pointersand just a few actual tuples on a page. To limit the damage in the worst
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -