📄 atomiccommit.html
字号:
order the rollback journals are deleted or the database files areunlocked.</p><h2>6.0 Additional Details Of The Commit Process</h2><p><a href="#section_3_0">Section 3.0</a> above provides an overview ofhow atomic commit works in SQLite. But it glosses over a number ofimportant details. The following subsections will attempt to fillin the gaps.</p><h3>6.1 Always Journal Complete Sectors</h3><p>When the original content of a database page is written intothe rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),SQLite always writes a complete sectors worth of data, even if thepage size of the database is smaller than the sector size. Historically, the sector size in SQLite has been hard coded to 512bytes and since the minimum page size is also 512 bytes, this has neverbeen an issue. But beginning with SQLite version 3.3.14, it is possiblefor SQLite to use mass storage devices with a sector size larger than 512bytes. So, beginning with version 3.3.14, whenever any page within asector is written into the journal file, all pages in that same sectorare stored with it.</p><p>It is important to store all pages of a sector in the rollbackjournal in order to prevent database corruption following a powerloss while writing the sector. Suppose that pages 1, 2, 3, and 4 areall stored in sector 1 and that page 2 is modified. In order to writethe changes to page 2, the underlying hardware must also rewrite thecontent of pages 1, 3, and 4 since the hardware must write the completesector. If this write operation is interrupted by a power outage,one or more of the pages 1, 3, or 4 might be left with incorrect data.Hence, to avoid lasting corruption to the database, the original contentof all of those pages must be contained in the rollback journal.</p><h3>6.2 Dealing With Garbage Written Into Journal Files</h3><p>When data is appended to the end of the rollback journal,SQLite normally makes the pessimistic assumption that the fileis first extended with invalid "garbage" data and that afterwardsthe correct data replaces the garbage. In other words, SQLite assumesthat the file size is increased first and then afterwards the contentis written into the file. If a power failure occurs after the filesize has been increased but before the file content has been written,the rollback journal can be left containing garbage data. If afterpower is restored, another SQLite process sees the rollback journalcontaining the garbage data and tries to roll it back into the originaldatabase file, it might copy some of the garbage into the database fileand thus corrupt the database file.</p><p>SQLite uses two defenses against this problem. In the first place,SQLite records the number of pages in the rollback journal in the headerof the rollback journal. This number is initially zero. So during anattempt to rollback an incomplete (and possibly corrupt) rollbackjournal, the process doing the rollback will see that the journalcontains zero pages and will thus make no changes to the database. Priorto a commit, the rollback journal is flushed to disk to ensure thatall content has been synched to disk and there is no "garbage" leftin the file, and only then is the page count in the header changed fromzero to true number of pages in the rollback journal. The rollback journalheader is always kept in a separate sector from any page data so thatit can be overwritten and flushed without risking damage to a datapage if a power outage occurs. Notice that the rollback journalis flushed to disk twice: once to write the page content and a secondtime to write the page count in the header.</p><p>The previous paragraph describes what happens when thesynchronous pragma setting is "full".</p><blockquote>PRAGMA synchronous=FULL;</blockquote><p>The default synchronous setting is full so the above is what usuallyhappens. However, if the synchronous setting is lowered to "normal",SQLite only flushes the rollback journal once, after the page count hasbeen written.This carries a risk of corruption because it might happen that the modified (non-zero) page count reaches the disk surface before allof the data does. The data will have been written first, but SQLiteassumes that the underlying filesystem can reorder write requests andthat the page count can be burned into oxide first even though itswrite request occurred last. So as a second line of defense, SQLitealso uses a 32-bit checksum on every page of data in the rollbackjournal. This checksum is evaluated for each page during rollbackwhile rolling back a journal as described in <a href="#section_4_4">section 4.4</a>. If an incorrect checksumis seen, the rollback is abandoned. Note that the checksum doesnot guarantee that the page data is correct since there is a smallbut finite probability that the checksum might be right even if the data iscorrupt. But the checksum does at least make such an error unlikely.</p><p>Note that the checksums in the rollback journal are not necessaryif the synchronous setting is FULL. We only depend on the checksumswhen synchronous is lowered to NORMAL. Nevertheless, the checksumsnever hurt and so they are included in the rollback journal regardlessof the synchronous setting.</p><h3>6.3 Cache Spill Prior To Commit</h3><p>The commit process shown in <a href="#section_3_0">section 3.0</a>assumes that all database changes fit in memory until it is time tocommit. This is the common case. But sometimes a larger change willoverflow the user-space cache prior to transaction commit. In thosecases, the cache must spill to the database before the transactionis complete.</p><p>At the beginning of a cache spill, the status of the databaseconnection is as shown in <a href="#section_3_6">step 3.6</a>.Original page content has been saved in the rollback journal andmodifications of the pages exist in user memory. To spill the cache,SQLite executes steps <a href="#section_3_7">3.7</a> through<a href="#section_3_9">3.9</a>. In other words, the rollback journalis flushed to disk, an exclusive lock is acquired, and changes arewritten into the database. But the remaining steps are deferreduntil the transaction really commits. A new journal header isappended to the end of the rollback journal (in its own sector)and the exclusive database lock is retained, but otherwise processingreturns to <a href="#section_3_6">step 3.6</a>. When the transactioncommits, or if another cache spill occurs, steps<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> arerepeated. (Step <a href="#section_3_8">3.8</a> is omitted on secondand subsequent passes since an exclusive database lock is already helddue to the first pass.)</p><p>A cache spill causes the lock on the database file toescalate from reserved to exclusive. This reduces concurrency.A cache spill also causes extra disk flush or fsync operations tooccur and these operations are slow, hence a cache spill canseriously reduce performance.For these reasons a cache spill is avoided whenever possible.</p><h2>7.0 Optimizations</h2><p>Profiling indicates that for most systems and in most circumstancesSQLite spends most of its time doing disk I/O. It follows then thatanything we can do to reduce the amount of disk I/O will likely have alarge positive impact on the performance of SQLite. This sectiondescribes some of the techniques used by SQLite to try to reduce theamount of disk I/O to a minimum while still preserving atomic commit.</p><h3>7.1 Cache Retained Between Transactions</h3><p><a href="#section_3_12">Step 3.12</a> of the commit process showsthat once the shared lock has been released, all user-space cacheimages of database content must be discarded. This is done becausewithout a shared lock, other processes are free to modify the databasefile content and so any user-space image of that content might becomeobsolete. Consequently, each new transaction would begin by rereadingdata which had previously been read. This is not as bad as it soundsat first since the data being read is still likely in the operatingsystems file cache. So the "read" is really just a copy of datafrom kernel space into user space. But even so, it still takes time.</p><p>Beginning with SQLite version 3.3.14 a mechanism has been addedto try to reduce the needless rereading of data. In newer versionsof SQLite, the data in the user-space pager cache is retained whenthe lock on the database file is released. Later, after theshared lock is acquired at the beginning of the next transaction,SQLite checks to see if any other process has modified the databasefile. If the database has been changed in any way since the lockwas last released, the user-space cache is erased at that point.But commonly the database file is unchanged and the user-space cachecan be retained, and some unnecessary read operations can be avoided.</p><p>In order to determine whether or not the database file has changed,SQLite uses a counter in the database header (in bytes 24 through 27)which is incremented during every change operation. SQLite saves a copyof this counter prior to releasing its database lock. Then afteracquiring the next database lock it compares the saved counter valueagainst the current counter value and erases the cache if the valuesare different, or reuses the cache if they are the same.</p><a name="section_7_2"></a><h3>7.2 Exclusive Access Mode</h3><p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".In exclusive access mode, SQLite retains the exclusivedatabase lock at the conclusion of each transaction. This preventsother processes for accessing the database, but in many deploymentsonly a single process is using a database so this is not aserious problem. The advantage of exclusive access mode is thatdisk I/O can be reduced in three ways:</p><ol><li><p>It is not necessary to increment the change counter in thedatabase header for transactions after the first transaction. Thiswill often save a write of page one to both the rollbackjournal and the main database file.</p></li><li><p>No other processes can change the database so there is nevera need to check the change counter and clear the user-space cacheat the beginning of a transaction.</p></li><li><p>Each transaction can be committed by overwriting the rollbackjournal header with zeros rather than deleting the journal file.This avoids having to modify the directory entry for the journal fileand it avoids having to deallocate disk sectors associated with the journal. Furthermore, the next transaction will overwrite existingjournal file content rather than append new content and on most systemsoverwriting is much faster than appending.</p></li></ol><p>The third optimization, zeroing the journal file header rather thandeleting the rollback journal file,does not depend on holding an exclusive lock at all times.This optimization can be set independently of exclusive lock modeusing the <a href="pragma.html#pragma_journal_mode">journal_mode</a>pragma as described in <a href="#section_7_6">section 7.6</a> below.</p><h3>7.3 Do Not Journal Freelist Pages</h3><p>When information is deleted from an SQLite database, the pages usedto hold the deleted information are added to a "freelist". Subsequentinserts will draw pages off of this freelist rather than expanding thedatabase file.</p><p>Some freelist pages contain critical data; specifically the locationsof other freelist pages. But most freelist pages contain nothing useful.These latter freelist pages are called "leaf" pages. We are free tomodify the content of a leaf freelist page in the database withoutchanging the meaning of the database in any way.</p><p>Because the content of leaf freelist pages is unimportant, SQLiteavoids storing leaf freelist page content in the rollback journalin <a href="#section_3_5">step 3.5</a> of the commit process.If a leaf freelist page is changed and that change does not get rolled backduring a transaction recovery, the database is not harmed by the omission.Similarly, the content of a new freelist page is never written backinto the database at <a href="#section_3_9">step 3.9</a> norread from the database at <a href="#section_3_3">step 3.3</a>.These optimizations can greatly reduce the amount of I/O that occurswhen making changes to a database file that contains free space.</p><h3>7.4 Single Page Updates And Atomic Sector Writes</h3><p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)interface contains a method named xDeviceCharacteristics which reportson special properties that the underlying mass storage devicemight have. Among the special properties thatxDeviceCharacteristics might report is the ability of to do anatomic sector write.</p><p>Recall that by default SQLite assumes that sector writes arelinear but not atomic. A linear write starts at one end of thesector and changes information byte by byte until it gets to theother end of the sector. If a power loss occurs in the middle ofa linear write then part of the sector might be modified while theother end is unchanged. In an atomic sector write, either the entiresector is overwritten or else nothing in the sector is changed.</p><p>We believe that most modern disk drives implement atomic sectorwrites. When power is lost, the drive uses energy stored in capacitorsand/or the angular momentum of the disk platter to provide power to complete any operation in progress. Nevertheless, there are so manylayers in between the write system call and the on-board disk driveelectronics that we take the safe approach in both Unix and w32 VFSimplementations and assume that sector writes are not atomic. On theother hand, devicemanufactures with more control over their filesystems might wantto consider enabling the atomic write property of xDeviceCharacteristicsif their hardware really does do atomic writes.</p><p>When sector writes are atomic and the page size of a database isthe same as a sector size, and when there is a database change thatonly touches a single database page, then SQLite skips the whole
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -