endures. Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction. The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up. (Zero is reserved for
InvalidSubTransactionId.) Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.


Interlocking transaction begin, transaction end, and snapshots
--------------------------------------------------------------

We try hard to minimize the amount of overhead and lock contention involved
in the frequent activities of beginning/ending a transaction and taking a
snapshot. Unfortunately, we must have some interlocking for this, because
we must ensure consistency about the commit order of transactions.
For example, suppose an UPDATE in xact A is blocked by xact B's prior
update of the same row, and xact B is doing commit while xact C gets a
snapshot. Xact A can complete and commit as soon as B releases its locks.
If xact C's GetSnapshotData sees xact B as still running, then it had
better see xact A as still running as well, or it will be able to see two
tuple versions - one deleted by xact B and one inserted by xact A. Another
reason why this would be bad is that C would see (in the row inserted by A)
earlier changes by B, and it would be inconsistent for C not to see any
of B's changes elsewhere in the database.

Formally, the correctness requirement is "if a snapshot A considers
transaction X as committed, and any of transaction X's snapshots considered
transaction Y as committed, then snapshot A must consider transaction Y as
committed".

What we actually enforce is strict serialization of commits and rollbacks
with snapshot-taking: we do not allow any transaction to exit the set of
running transactions while a snapshot is being taken.
(This rule is stronger than necessary for consistency, but is relatively
simple to enforce, and it assists with some other issues as explained
below.) The implementation of this is that GetSnapshotData takes the
ProcArrayLock in shared mode (so that multiple backends can take snapshots
in parallel), but ProcArrayEndTransaction must take the ProcArrayLock in
exclusive mode while clearing MyProc->xid at transaction end (either commit
or abort). ProcArrayEndTransaction also holds the lock while advancing the
shared latestCompletedXid variable. This allows GetSnapshotData to use
latestCompletedXid + 1 as xmax for its snapshot: there can be no
transaction >= this xid value that the snapshot needs to consider as
completed.

In short, then, the rule is that no transaction may exit the set of
currently-running transactions between the time we fetch latestCompletedXid
and the time we finish building our snapshot. However, this restriction
only applies to transactions that have an XID --- read-only transactions
can end without acquiring ProcArrayLock, since they don't affect anyone
else's snapshot nor latestCompletedXid.

Transaction start, per se, doesn't have any interlocking with these
considerations, since we no longer assign an XID immediately at transaction
start. But when we do decide to allocate an XID, GetNewTransactionId must
store the new XID into the shared ProcArray before releasing XidGenLock.
This ensures that all top-level XIDs <= latestCompletedXid are either
present in the ProcArray, or not running anymore. (This guarantee doesn't
apply to subtransaction XIDs, because of the possibility that there's not
room for them in the subxid array; instead we guarantee that they are
present or the overflow flag is set.) If a backend released XidGenLock
before storing its XID into MyProc, then it would be possible for another
backend to allocate and commit a later XID, causing latestCompletedXid to
pass the first backend's XID, before that value became visible in the
ProcArray.
That would break GetOldestXmin, as discussed below.

We allow GetNewTransactionId to store the XID into MyProc->xid (or the
subxid array) without taking ProcArrayLock. This was once necessary to
avoid deadlock; while that is no longer the case, it's still beneficial for
performance. We are thereby relying on fetch/store of an XID to be atomic,
else other backends might see a partially-set XID. This also means that
readers of the ProcArray xid fields must be careful to fetch a value only
once, rather than assume they can read it multiple times and get the same
answer each time. (Use volatile-qualified pointers when doing this, to
ensure that the C compiler does exactly what you tell it to.)

Another important activity that uses the shared ProcArray is GetOldestXmin,
which must determine a lower bound for the oldest xmin of any active MVCC
snapshot, system-wide. Each individual backend advertises the smallest
xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
live snapshots (eg, if it's between transactions or hasn't yet set a
snapshot for a new transaction). GetOldestXmin takes the MIN() of the
valid xmin fields. It does this with only shared lock on ProcArrayLock,
which means there is a potential race condition against other backends
doing GetSnapshotData concurrently: we must be certain that a concurrent
backend that is about to set its xmin does not compute an xmin less than
what GetOldestXmin returns. We ensure that by including all the active
XIDs into the MIN() calculation, along with the valid xmins. The rule that
transactions can't exit without taking exclusive ProcArrayLock ensures that
concurrent holders of shared ProcArrayLock will compute the same minimum of
currently-active XIDs: no xact, in particular not the oldest, can exit
while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
active XID will be the same as that of any concurrent GetSnapshotData, and
so it can't produce an overestimate.
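
As a rough illustration of the MIN() calculation described above, the
following C sketch folds both the advertised xmins and the active XIDs into
one minimum. It is not the actual GetOldestXmin code: the ProcEntry struct
and the sketch_oldest_xmin function are hypothetical names, the caller is
assumed to already hold ProcArrayLock in shared mode, and plain integer
comparison is used on the assumption that XID wraparound can be ignored.
The running minimum starts at latestCompletedXid + 1, the same bound that
snapshots use as xmax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration only; not the real PGPROC layout. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

typedef struct ProcEntry
{
    TransactionId xid;      /* XID of the running transaction, or zero */
    TransactionId xmin;     /* smallest xmin of live snapshots, or zero */
} ProcEntry;

/*
 * Hypothetical sketch: lower bound for the oldest xmin, system-wide.
 *
 * Assumptions (not taken from the real code): the caller holds
 * ProcArrayLock in shared mode, and XIDs have not wrapped around, so
 * ordinary "<" is a valid ordering.
 */
TransactionId
sketch_oldest_xmin(const volatile ProcEntry *procs, size_t nprocs,
                   TransactionId latestCompletedXid)
{
    TransactionId result = latestCompletedXid + 1;
    size_t      i;

    assert(procs != NULL || nprocs == 0);

    for (i = 0; i < nprocs; i++)
    {
        /* Fetch each shared field exactly once into a local copy. */
        TransactionId xid = procs[i].xid;
        TransactionId xmin = procs[i].xmin;

        /* Active XIDs take part in the minimum ... */
        if (xid != InvalidTransactionId && xid < result)
            result = xid;
        /* ... along with every valid advertised xmin. */
        if (xmin != InvalidTransactionId && xmin < result)
            result = xmin;
    }
    return result;
}
```

The volatile-qualified pointer and the local copies mirror the advice above
about fetching each ProcArray field only once; a backend whose xid and xmin
are both zero simply contributes nothing to the minimum.
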
If there is no active transaction at all, GetOldestXmin returns
latestCompletedXid + 1, which is a lower bound for the xmin that might be
computed by concurrent or later GetSnapshotData calls. (We know that no XID
less than this could be about to appear in the ProcArray, because of the
XidGenLock interlock discussed above.)

GetSnapshotData also performs an oldest-xmin calculation (which had better
match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
too expensive. Note that while it is certain that two concurrent
executions of GetSnapshotData will compute the same xmin for their own
snapshots, as argued above, it is not certain that they will arrive at the
same estimate of RecentGlobalXmin. This is because we allow XID-less
transactions to clear their MyProc->xmin asynchronously (without taking
ProcArrayLock), so one execution might see what had been the oldest xmin,
and another not. This is OK since RecentGlobalXmin need only be a valid
lower bound. As noted above, we are already assuming that fetch/store
of the xid fields is atomic, so assuming it for xmin as well is no extra
risk.


pg_clog and pg_subtrans
-----------------------

pg_clog and pg_subtrans are permanent (on-disk) storage of transaction related
information. There is a limited number of pages of each kept in memory, so
in many cases there is no need to actually read from disk. However, if
there's a long running transaction or a backend sitting idle with an open
transaction, it may be necessary to be able to read and write this information
from disk. They also allow information to be permanent across server restarts.

pg_clog records the commit status for each transaction that has been assigned
an XID. A transaction can be in progress, committed, aborted, or
"sub-committed".
This last state means that it's a subtransaction that's no longer running,
but its parent has not updated its state yet (either it is still running, or
the backend crashed without updating its status). A sub-committed
transaction's status will be updated again to the final value as soon as the
parent commits or aborts, or when the parent is detected to be aborted.

Savepoints are implemented using subtransactions. A subtransaction is a
transaction inside a transaction; its commit or abort status is not only
dependent on whether it committed itself, but also whether its parent
transaction committed. To implement multiple savepoints in a transaction we
allow unlimited transaction nesting depth, so any particular subtransaction's
commit state is dependent on the commit status of each and every ancestor
transaction.

The "subtransaction parent" (pg_subtrans) mechanism records, for each
transaction with an XID, the TransactionId of its parent transaction. This
information is stored as soon as the subtransaction is assigned an XID.
Top-level transactions do not have a parent, so they leave their pg_subtrans
entries set to the default value of zero (InvalidTransactionId).

pg_subtrans is used to check whether the transaction in question is still
running --- the main Xid of a transaction is recorded in the PGPROC struct,
but since we allow arbitrary nesting of subtransactions, we can't fit all Xids
in shared memory, so we have to store them on disk. Note, however, that for
each transaction we keep a "cache" of Xids that are known to be part of the
transaction tree, so we can skip looking at pg_subtrans unless we know the
cache has been overflowed. See storage/ipc/procarray.c for the gory details.

slru.c is the supporting mechanism for both pg_clog and pg_subtrans. It
implements the LRU policy for in-memory buffer pages. The high-level routines
for pg_clog are implemented in transam.c, while the low-level functions are in
clog.c.
pg_subtrans is contained completely in subtrans.c.


Write-Ahead Log coding
----------------------

The WAL subsystem (also called XLOG in the code) exists to guarantee crash
recovery. It can also be used to provide point-in-time recovery, as well as
hot-standby replication via log shipping. Here are some notes about
non-obvious aspects of its design.

A basic assumption of a write AHEAD log is that log entries must reach stable
storage before the data-page changes they describe. This ensures that
replaying the log to its end will bring us to a consistent state where there
are no partially-performed transactions. To guarantee this, each data page
(either heap or index) is marked with the LSN (log sequence number --- in
practice, a WAL file location) of the latest XLOG record affecting the page.
Before the bufmgr can write out a dirty page, it must ensure that xlog has
been flushed to disk at least up to the page's LSN. This low-level
interaction improves performance by not waiting for XLOG I/O until necessary.
The LSN check exists only in the shared-buffer manager, not in the local
buffer manager used for temp tables; hence operations on temp tables must not
be WAL-logged.

During WAL replay, we can check the LSN of a page to detect whether the change
recorded by the current log entry is already applied (it has been, if the page
LSN is >= the log entry's WAL location).

Usually, log entries contain just enough information to redo a single
incremental update on a page (or small group of pages). This will work only
if the filesystem and hardware implement data page writes as atomic actions,
so that a page is never left in a corrupt partly-written state. Since that's
often an untenable assumption in practice, we log additional information to
allow complete reconstruction of modified pages. The first WAL record
affecting a given page after a checkpoint is made to contain a copy of the
entire page, and we implement replay by restoring that page copy instead of
redoing the update.
(This is more reliable than the data storage itself would be because we can
check the validity of the WAL record's CRC.) We can detect the "first change
after checkpoint" by noting whether the page's old LSN precedes the end of
WAL as of the last checkpoint (the RedoRecPtr).

The general schema for executing a WAL-logged action is

1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
to be modified.

2. START_CRIT_SECTION() (Any error during the next three steps must cause a
PANIC because the shared buffers will contain unlogged changes, which we
have to ensure don't get to disk. Obviously, you should check conditions
such as whether there's enough free space on the page before you start the
critical section.)

3. Apply the required changes to the shared buffer(s).

4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must
happen before the WAL record is inserted; see notes in SyncOneBuffer().)