📄 atomiccommit.html
字号:
<p>SQLite allows a single <a href="c3ref/sqlite3.html">database connection</a> to talk totwo or more database files simultaneously through the use ofthe <a href="lang_attach.html">ATTACH DATABASE</a> command.When multiple database files are modified within a singletransaction, all files are updated atomically. In other words, either all of the database files are updated orelse none of them are.Achieving an atomic commit across multiple database files ismore complex that doing so for a single file. This sectiondescribes how SQLite works that bit of magic.</p><h3>5.1 Separate Rollback Journals For Each Database</h3><img src="images/ac/multi-0.gif" align="right" hspace="15"><p>When multiple database files are involved in a transaction,each database has its own rollback journal and each databaseis locked separately. The diagram at the right shows a scenariowhere three different database files have been modified withinone transaction. The situation at this step is analogous to the single-file transaction scenario at <a href="#section_3_6">step 3.6</a>. Each database file hasa reserved lock. For each database, the original content of pages that are being changed have been written into the rollback journalfor that database, but the content of the journals have not yetbeen flushed to disk. No changes have been made to the databasefile itself yet, though presumably there are changes being heldin user memory.</p><p>For brevity, the diagrams in this section are simplified fromthose that came before. Blue color still signifies original contentand pink still signifies new content. But the individual pagesin the rollback journal and the database file are not shown andwe are not making the distinction between information in theoperating system cache and information that is on disk. All ofthese factors still apply in a multi-file commit scenario. Theyjust take up a lot of space in the diagrams and they do not addany new information, so they are omitted here.</p><br clear="both"><h3>5.2 The Master Journal File</h3><img src="images/ac/multi-1.gif" align="right" hspace="15"><p>The next step in a multi-file commit is the creation of a"master journal" file. The name of the master journal file isthe same name as the original database filename (the databasethat was opened using the <a href="c3ref/open.html">sqlite3_open()</a> interface,not one of the <a href="lang_attach.html">ATTACHed</a> auxiliarydatabases) with the text "<b>-mj</b><i>HHHHHHHH</i>" appended where<i>HHHHHHHH</i> is a random 32-bit hexadecimal number. Therandom <i>HHHHHHHH</i> suffix changes for every new master journal.</p><p><i>(Nota bene: The formula for computing the master journal filenamegiven in the previous paragraph corresponds to the implementation asof SQLite version 3.5.0. But this formula is not part of the SQLitespecification and is subject to change in future releases.)</i></p><p>Unlike the rollback journals, the master journal does not containany original database page content. Instead, the master journal containsthe full pathnames for rollback journals for every database that isparticipating in the transaction.</p><p>After the master journal is constructed, its content is flushedto disk before any further actions are taken. On unix, the directorythat contains the master journal is also synced in order to make surethe master journal file will appear in the directory following apower failure.</p><br clear="both"><h3>5.3 Updating Rollback Journal Headers</h3><img src="images/ac/multi-2.gif" align="right" hspace="15"><p>The next step is to record the full pathname of the master journal filein the header of every rollback journal. Space to hold the masterjournal filename was reserved at the beginning of each rollback journalas the rollback journals were created.</p><p>The content of each rollback journal is flushed to disk both beforeand after the master journal filename is written into the rollbackjournal header. It is important to do both of these flushes. Fortunately,the second flush is usually inexpensive since typically only a singlepage of the journal file (the first page) has changed.</p><p>This step is analogous to <a href="#section_3_7">step 3.7</a> in the single-file commitscenario described above.</p><br clear="both"><h3>5.4 Updating The Database Files</h3><img src="images/ac/multi-3.gif" align="right" hspace="15"><p>Once all rollback journal files have been flushed to disk, itis safe to begin updating database files. We have to obtain anexclusive lock on all database files before writing the changes.After all the changes are written, it is important to flush thechanges to disk so that they will be preserved in the event ofa power failure or operating system crash.</p><p>This step corresponds to steps<a href="#section_3_8">3.8</a>,<a href="#section_3_9">3.9</a>, and<a href="#section_3_10">3.10</a> in the single-file commitscenario described previously.</p><br clear="both"><a name="section_5_5"></a><h3>5.5 Delete The Master Journal File</h3><img src="images/ac/multi-4.gif" align="right" hspace="15"><p>The next step is to delete the master journal file.This is the point where the multi-file transaction commits.This step corresponds to <a href="#section_3_11">step 3.11</a> in the single-filecommit scenario where the rollback journal is deleted.</p><p>If a power failure or operating system crash occurs at thispoint, the transaction will not rollback when the system rebootseven though there are rollback journals present. Thedifference is the master journal pathname in the header of therollback journal. Upon restart, SQLite only considers a journalto be hot and will only playback the journal if there is nomaster journal filename in the header (which is the case fora single-file commit) or if the master journal file stillexists on disk.</p><br clear="both"><h3>5.6 Clean Up The Rollback Journals</h3><img src="images/ac/multi-5.gif" align="right" hspace="15"><p>The final step in a multi-file commit is to delete theindividual rollback journals and drop the exclusive locks onthe database files so that other processes can see the changes.This corresponds to <a href="#section_3_12">step 3.12</a> in the single-filecommit sequence.</p><p>The transaction has already committed at this point so timingis not critical in the deletion of the rollback journals.The current implementation deletes a single rollback journalthen unlocks the corresponding database file before proceedingto the next rollback journal. But in the future we might changethis so that all rollback journals are deleted before any databasefiles are unlocked. As long as the rollback journal is deleted beforeits corresponding database file is unlocked it does not matter in whatorder the rollback journals are deleted or the database files areunlocked.</p><h2>6.0 Additional Details Of The Commit Process</h2><p><a href="#section_3_0">Section 3.0</a> above provides an overview ofhow atomic commit works in SQLite. But it glosses over a number ofimportant details. The following subsections will attempt to fillin the gaps.</p><h3>6.1 Always Journal Complete Sectors</h3><p>When the original content of a database page is written intothe rollback journal (as shown in <a href="#section_3_5">section 3.5</a>),SQLite always writes a complete sectors worth of data, even if thepage size of the database is smaller than the sector size. Historically, the sector size in SQLite has been hard coded to 512bytes and since the minimum page size is also 512 bytes, this has neverbeen an issue. But beginning with SQLite version 3.3.14, it is possiblefor SQLite to use mass storage devices with a sector size larger than 512bytes. So, beginning with version 3.3.14, whenever any page within asector is written into the journal file, all pages in that same sectorare stored with it.</p><p>It is important to store all pages of a sector in the rollbackjournal in order to prevent database corruption following a powerloss while writing the sector. Suppose that pages 1, 2, 3, and 4 areall stored in sector 1 and that page 2 is modified. In order to writethe changes to page 2, the underlying hardware must also rewrite thecontent of pages 1, 3, and 4 since the hardware must write the completesector. If this write operation is interrupted by a power outage,one or more of the pages 1, 3, or 4 might be left with incorrect data.Hence, to avoid lasting corruption to the database, the original contentof all of those pages must be contained in the rollback journal.</p><h3>6.2 Dealing With Garbage Written Into Journal Files</h3><p>When data is appended to the end of the rollback journal,SQLite normally makes the pessimistic assumption that the fileis first extended with invalid "garbage" data and that afterwardsthe correct data replaces the garbage. In other words, SQLite assumesthat the file size is increased first and then afterwards the contentis written into the file. If a power failure occurs after the filesize has been increased but before the file content has been written,the rollback journal can be left containing garbage data. If afterpower is restored, another SQLite process sees the rollback journalcontaining the garbage data and tries to roll it back into the originaldatabase file, it might copy some of the garbage into the database fileand thus corrupt the database file.</p><p>SQLite uses two defenses against this problem. In the first place,SQLite records the number of pages in the rollback journal in the headerof the rollback journal. This number is initially zero. So during anattempt to rollback an incomplete (and possibly corrupt) rollbackjournal, the process doing the rollback will see that the journalcontains zero pages will thus make no changes to the database. Priorto a commit, the rollback journal is flushed to disk to ensure thatall content has been synched to disk and there is no "garbage" leftin the file, and only then is the page count in the header increasedtrue number of pages in the rollback journal. The rollback journalheader is always kept in a separate sector from any page data so thatit can be overwritten and flushed without risking damage to a datapage if a power outage occurs. Notice that the rollback journalis flushed to disk twice: once to write the page content and a secondtime to write the page count in the header.</p><p>The previous paragraph describes what happens when thesynchronous pragma setting is "full".</p><blockquote>PRAGMA synchronous=FULL;</blockquote><p>The default synchronous setting is full so the above is what usuallyhappens. However, if the synchronous setting is lowered to "normal",SQLite only flushes the rollback journal once, after the page count hasbeen written.This carries a risk of corruption because it might happen that the modified (non-zero) page count reaches the disk surface before allof the data does. The data will have been written first, but SQLiteassumes that the underlying filesystem can reorder write requests andthat the page count can be burned into oxide first even though itswrite request occurred last. So as a second line of defense, SQLitealso uses a 32-bit checksum on every page of data in the rollbackjournal. This checksum is evaluated for each page during rollbackwhile rolling back a journal as described in <a href="#section_4_4">section 4.4</a>. If an incorrect checksumis seen, the rollback is abandoned. Note that the checksum doesnot guarantee that the page data is correct since there is a smallbut finite probability that the checksum might be right even if the data iscorrupt. But the checksum does at least make such an error unlikely.</p><p>Note that the checksums in the rollback journal are not necessaryif the synchronous setting is FULL. We only depend on the checksumswhen synchronous is lowered to NORMAL. Nevertheless, the checksumsnever hurt and so they are included in the rollback journal regardlessof the synchronous setting.</p><h3>6.3 Cache Spill Prior To Commit</h3><p>The commit process shown in <a href="#section_3_0">section 3.0</a>assumes that all database changes fit in memory until it is time tocommit. This is the common case. But sometimes a larger change willoverflow the user-space cache prior to transaction commit. In thosecases, the cache must spill to the database before the transactionis complete.</p><p>At the beginning of a cache spill, the status of the databaseconnection is as shown in <a href="#section_3_6">step 3.6</a>.Original page content has been saved in the rollback journal andmodifications of the pages exist in user memory. To spill the cache,SQLite executes steps <a href="#section_3_7">3.7</a> through<a href="#section_3_9">3.9</a>. In other words, the rollback journalis flushed to disk, an exclusive lock is acquired, and changes arewritten into the database. But the remaining steps are deferreduntil the transaction really commits. A new journal header isappended to the end of the rollback journal (in its own sector)and the exclusive database lock is retained, but otherwise processingreturns to <a href="#section_3_6">step 3.6</a>. When the transactioncommits, or if another cache spill occurs, steps<a href="#section_3_7">3.7</a> and <a href="#section_3_9">3.9</a> arerepeated. (Step <a href="#section_3_8">3.8</a> is omitted on secondand subsequent passes since an exclusive database lock is already helddue to the first pass.)</p><p>A cache spill causes the lock on the database file toescalate from reserved to exclusive. This reduces concurrency.A cache spill also causes extra disk flush or fsync operations tooccur and these operations are slow, hence a cache spill canseriously reduce performance.For these reasons a cache spill is avoided whenever possible.</p><h2>7.0 Optimizations</h2><p>Profiling indicates that for most systems and in most circumstancesSQLite spends most of its time doing disk I/O. It follows then thatanything we can do to reduce the amount of disk I/O will likely have alarge positive impact on the performance of SQLite. This sectiondescribes some of the techniques used by SQLite to try to reduce theamount of disk I/O to a minimum while still preserving atomic commit.</p><h3>7.1 Cache Retained Between Transactions</h3><p><a href="#section_3_12">Step 3.12</a> of the commit process showsthat once the shared lock has been released, all user-space cacheimages of database content must be discarded. This is done becausewithout a shared lock, other processes are free to modify the databasefile content and so any user-space image of that content might becomeobsolete. Consequently, each new transaction would begin by rereadingdata which had previously been read. This is not as bad as it soundsat first since the data being read is still likely in the operatingsystems file cache. So the "read" is really just a copy of datafrom kernel space into user space. But even so, it still takes time.</p><p>Beginning with SQLite version 3.3.14 a mechanism has been addedto try to reduce the needless rereading of data. In newer versionsof SQLite, the data in the user-space pager cache is retained whenthe lock on the database file is released. Later, after theshared lock is acquired at the beginning of the next transaction,SQLite checks to see if any other process has modified the databasefile. If the database has been changed in any way since the lockwas last released, the user-space cache is erased at that point.But commonly the database file is unchanged and the user-space cachecan be retained, and some unnecessary read operations can be avoided.</p><p>In order to determine whether or not the database file has changed,SQLite uses a counter in the database header (in bytes 24 through 27)which is incremented during every change operation. SQLite saves a copy
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -