5. Build a WAL log record and pass it to XLogInsert(); then update the page's
LSN and TLI using the returned XLOG location. For instance,

    recptr = XLogInsert(rmgr_id, info, rdata);

    PageSetLSN(dp, recptr);
    PageSetTLI(dp, ThisTimeLineID);

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

XLogInsert's "rdata" argument is an array of pointer/size items identifying
chunks of data to be written in the XLOG record, plus optional shared-buffer
IDs for chunks that are in shared buffers rather than temporary variables.
The "rdata" array must mention (at least once) each of the shared buffers
being modified, unless the action is such that the WAL replay routine can
reconstruct the entire page contents. XLogInsert includes the logic that
tests to see whether a shared buffer has been modified since the last
checkpoint. If not, the entire page contents are logged rather than just the
portion(s) pointed to by "rdata".

Because XLogInsert drops the rdata components associated with buffers it
chooses to log in full, the WAL replay routines normally need to test to see
which buffers were handled that way --- otherwise they may be misled about
what the XLOG record actually contains. XLOG records that describe multi-page
changes therefore require some care to design: you must be certain that you
know what data is indicated by each "BKP" bit. An example of the trickiness
is that in a HEAP_UPDATE record, BKP(1) normally is associated with the source
page and BKP(2) is associated with the destination page --- but if these are
the same page, only BKP(1) would have been set.

For this reason as well as the risk of deadlocking on buffer locks, it's best
to design WAL records so that they reflect small atomic actions involving just
one or a few pages.
The current XLOG infrastructure cannot handle WAL records involving
references to more than three shared buffers, anyway.

In the case where the WAL record contains enough information to re-generate
the entire contents of a page, do *not* show that page's buffer ID in the
rdata array, even if some of the rdata items point into the buffer. This is
because you don't want XLogInsert to log the whole page contents. The
standard replay-routine pattern for this case is

    reln = XLogOpenRelation(rnode);
    buffer = XLogReadBuffer(reln, blkno, true);
    Assert(BufferIsValid(buffer));
    page = (Page) BufferGetPage(buffer);

    ... initialize the page ...

    PageSetLSN(page, lsn);
    PageSetTLI(page, ThisTimeLineID);
    MarkBufferDirty(buffer);
    UnlockReleaseBuffer(buffer);

In the case where the WAL record provides only enough information to
incrementally update the page, the rdata array *must* mention the buffer
ID at least once; otherwise there is no defense against torn-page problems.
The standard replay-routine pattern for this case is

    if (record->xl_info & XLR_BKP_BLOCK_n)
        << do nothing, page was rewritten from logged copy >>;

    reln = XLogOpenRelation(rnode);
    buffer = XLogReadBuffer(reln, blkno, false);
    if (!BufferIsValid(buffer))
        << do nothing, page has been deleted >>;
    page = (Page) BufferGetPage(buffer);

    if (XLByteLE(lsn, PageGetLSN(page)))
    {
        /* changes are already applied */
        UnlockReleaseBuffer(buffer);
        return;
    }

    ... apply the change ...

    PageSetLSN(page, lsn);
    PageSetTLI(page, ThisTimeLineID);
    MarkBufferDirty(buffer);
    UnlockReleaseBuffer(buffer);

As noted above, for a multi-page update you need to be able to determine
which XLR_BKP_BLOCK_n flag applies to each page. If a WAL record reflects
a combination of fully-rewritable and incremental updates, then the rewritable
pages don't count for the XLR_BKP_BLOCK_n numbering.
(XLR_BKP_BLOCK_n is associated with the n'th distinct buffer ID seen in the
"rdata" array, and per the above discussion, fully-rewritable buffers
shouldn't be mentioned in "rdata".)

Due to all these constraints, complex changes (such as a multilevel index
insertion) normally need to be described by a series of atomic-action WAL
records. What do you do if the intermediate states are not self-consistent?
The answer is that the WAL replay logic has to be able to fix things up.
In btree indexes, for example, a page split requires insertion of a new key in
the parent btree level, but for locking reasons this has to be reflected by
two separate WAL records. The replay code has to remember "unfinished" split
operations, and match them up to subsequent insertions in the parent level.
If no matching insert has been found by the time the WAL replay ends, the
replay code has to do the insertion on its own to restore the index to
consistency. Such insertions occur after WAL is operational, so they can
and should write WAL records for the additional generated actions.


Asynchronous Commit
-------------------

As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
we don't wait while the WAL record for the commit is fsync'ed.
We perform an asynchronous commit when synchronous_commit = off. Instead
of performing an XLogFlush() up to the LSN of the commit, we merely note
the LSN in shared memory. The backend then continues with other work.
We record the LSN only for an asynchronous commit, not an abort; there's
never any need to flush an abort record, since the presumption after a
crash would be that the transaction aborted anyway.

We always force synchronous commit when the transaction is deleting
relations, to ensure the commit record is down to disk before the relations
are removed from the filesystem.
Also, certain utility commands that have non-roll-backable side effects
(such as filesystem changes) force sync commit to minimize the window in
which the filesystem change has been made but the transaction isn't
guaranteed committed.

Every wal_writer_delay milliseconds, the walwriter process performs an
XLogBackgroundFlush(). This checks the location of the last completely
filled WAL page. If that has moved forwards, then we write all the changed
buffers up to that point, so that under full load we write only whole
buffers. If there has been a break in activity and the current WAL page is
the same as before, then we find out the LSN of the most recent
asynchronous commit, and flush up to that point, if required (i.e.,
if it's in the current WAL page). This arrangement in itself would
guarantee that an async commit record reaches disk during at worst the
second walwriter cycle after the transaction completes. However, we also
allow XLogFlush to flush full buffers "flexibly" (i.e., not wrapping around
at the end of the circular WAL buffer area), so as to minimize the number
of writes issued under high load when multiple WAL pages are filled per
walwriter cycle. This makes the worst-case delay three walwriter cycles.

There are some other subtle points to consider with asynchronous commits.
First, for each page of CLOG we must remember the LSN of the latest commit
affecting the page, so that we can enforce the same flush-WAL-before-write
rule that we do for ordinary relation pages. Otherwise the record of the
commit might reach disk before the WAL record does. Again, abort records
need not factor into this consideration.

In fact, we store more than one LSN for each clog page.
This relates to the way we set transaction status hint bits during
visibility tests. We must not set a transaction-committed hint bit on a
relation page and have that record make it to disk prior to the WAL record
of the commit. Since visibility tests are normally made while holding buffer
share locks, we do not have the option of changing the page's LSN to
guarantee WAL synchronization. Instead, we defer the setting of the hint bit
if we have not yet flushed WAL as far as the LSN associated with the
transaction. This requires tracking the LSN of each unflushed async commit.
It is convenient to associate this data with clog buffers: because we will
flush WAL before writing a clog page, we know that we do not need to remember
a transaction's LSN longer than the clog page holding its commit status
remains in memory. However, the naive approach of storing an LSN for each
clog position is unattractive: the LSNs are 32x bigger than the two-bit
commit status fields, and so we'd need 256K of additional shared memory for
each 8K clog buffer page. We choose instead to store a smaller number of
LSNs per page, where each LSN is the highest LSN associated with any
transaction commit in a contiguous range of transaction IDs on that page.
This saves storage at the price of some possibly-unnecessary delay in
setting transaction hint bits.

How many transactions should share the same cached LSN (N)? If the
system's workload consists only of small async-commit transactions, then
it's reasonable to have N similar to the number of transactions per
walwriter cycle, since that is the granularity with which transactions will
become truly committed (and thus hintable) anyway. The worst case is where
a sync-commit xact shares a cached LSN with an async-commit xact that
commits a bit later; even though we paid to sync the first xact to disk,
we won't be able to hint its outputs until the second xact is sync'd, up to
three walwriter cycles later. This argues for keeping N (the group size)
as small as possible.
For the moment we are setting the group size to 32, which makes the LSN
cache space the same size as the actual clog buffer space (independently
of BLCKSZ).

It is useful that we can run both synchronous and asynchronous commit
transactions concurrently, but the safety of this is perhaps not
immediately obvious. Assume we have two transactions, T1 and T2. The Log
Sequence Number (LSN) is the point in the WAL sequence where a transaction
commit is recorded, so LSN1 and LSN2 are the commit records of those
transactions. If T2 can see changes made by T1 then when T2 commits it
must be true that LSN2 follows LSN1. Thus when T2 commits it is certain
that all of the changes made by T1 are also now recorded in the WAL. This
is true whether T1 was asynchronous or synchronous. As a result, it is
safe for asynchronous commits and synchronous commits to work concurrently
without endangering data written by synchronous commits. Sub-transactions
are not important here since the final write to disk only occurs at the
commit of the top level transaction.

Changes to data blocks cannot reach disk unless WAL is flushed up to the
point of the LSN of the data blocks. Any attempt to write unsafe data to
disk will trigger a write which ensures the safety of all data written by
that and prior transactions. Data blocks and clog pages are both protected
by LSNs.

Changes to a temp table are not WAL-logged, hence could reach disk in
advance of T1's commit, but we don't care since temp table contents don't
survive crashes anyway.

Database writes made via any of the paths we have introduced to avoid WAL
overhead for bulk updates are also safe. In these cases it's entirely
possible for the data to reach disk before T1's commit, because T1 will
fsync it down to disk without any sort of interlock, as soon as it finishes
the bulk update. However, all these paths are designed to write data that
no other transaction can see until after T1 commits. The situation is thus
not different from ordinary WAL-logged updates.
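
As an illustration of the clog LSN grouping discussed earlier in this
section, the following is a minimal sketch of the storage arithmetic and of
the "highest LSN per group" bookkeeping. It is not the actual clog code:
every identifier is hypothetical, and the 8-byte LSN size is an assumption
inferred from the statement that an LSN is 32x bigger than a two-bit status
field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical and exist only for this sketch. */
#define SKETCH_BITS_PER_XACT        2   /* two-bit commit status per xact */
#define SKETCH_XACTS_PER_BYTE       (8 / SKETCH_BITS_PER_XACT)
#define SKETCH_XACTS_PER_LSN_GROUP  32  /* group size N from the text */
#define SKETCH_LSN_BYTES            8   /* assumed: 32 x 2 bits = 64 bits */

/* Number of transaction status slots on one clog page. */
size_t
sketch_xacts_per_page(size_t page_size)
{
    return page_size * SKETCH_XACTS_PER_BYTE;
}

/* LSN storage per clog page if every transaction had its own LSN. */
size_t
sketch_naive_lsn_bytes(size_t page_size)
{
    return sketch_xacts_per_page(page_size) * SKETCH_LSN_BYTES;
}

/* LSN storage per clog page when one LSN is shared by each group. */
size_t
sketch_grouped_lsn_bytes(size_t page_size)
{
    return sketch_xacts_per_page(page_size) / SKETCH_XACTS_PER_LSN_GROUP
        * SKETCH_LSN_BYTES;
}

/* Index of the shared LSN slot, within its clog page, for a given xid. */
size_t
sketch_lsn_slot(uint32_t xid, size_t page_size)
{
    size_t      xacts_per_page = sketch_xacts_per_page(page_size);

    return (xid % xacts_per_page) / SKETCH_XACTS_PER_LSN_GROUP;
}

/* Remember a commit: each slot keeps the highest commit LSN in its group. */
void
sketch_record_commit_lsn(uint64_t *group_lsns, uint32_t xid,
                         size_t page_size, uint64_t commit_lsn)
{
    size_t      slot = sketch_lsn_slot(xid, page_size);

    assert(group_lsns != NULL);
    if (group_lsns[slot] < commit_lsn)
        group_lsns[slot] = commit_lsn;
}

/* A hint bit may be set only once WAL is flushed through the group's LSN. */
int
sketch_can_set_hint(const uint64_t *group_lsns, uint32_t xid,
                    size_t page_size, uint64_t flushed_lsn)
{
    return group_lsns[sketch_lsn_slot(xid, page_size)] <= flushed_lsn;
}
```

With an 8K page the naive scheme needs 256K of LSNs per page, while the
grouped scheme needs exactly one page's worth, whatever the page size. The
sketch also shows the worst case described above: if a commit at LSN 100 and
a later commit at LSN 250 fall in the same group, the first transaction
cannot be hinted until WAL has been flushed through 250.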