📄 atomiccommit.html
字号:
journaling and syncing process and simply writes the modified pagedirectly into the database file. The change counter in the firstpage of the database file is modified separately since no harm isdone if power is lost before the change counter can be updated.</p><h3>7.5 Filesystems With Safe Append Semantics</h3><p>Another optimization introduced in SQLite version 3.5.0 makesuse of "safe append" behavior of the underlying disk.Recall that SQLite assumes that when data is appended to a file(specifically to the rollback journal) that the size of the fileis increased first and that the content is written second. Soif power is lost after the file size is increased but before thecontent is written, the file is left containing invalid "garbage"data. The xDeviceCharacteristics method of the VFS might, however,indicate that the filesystem implements "safe append" semantics.This means that the content is written before the file size isincreased so that it is impossible for garbage to be introducedinto the rollback journal by a power loss or system crash.</p><p>When safe append semantics are indicated for a filesystem,SQLite always stores the special value of -1 for the page countin the header of the rollback journal. The -1 page count valuetells any process attempting to rollback the journal that thenumber of pages in the journal should be computed from the journalsize. This -1 value is never changed. So that when a commitoccurs, we save a single flush operation and a sector write ofthe first page of the journal file. Furthermore, when a cachespill occurs we no longer need to append a new journal headerto the end of the journal; we can simply continue appendingnew pages to the end of the existing journal.</p><a name="section_7_6"></a><h3>7.6 Persistent Rollback Journals</h3><p>Deleting a file is an expensive operation on many systems.So as an optimization, SQLite can be configured to avoid thedelete operation of <a href="#section_3_11">section 3.11</a>.Instead of deleting the journal file in order to commit a transaction,the file is either truncated to zero bytes in length or itsheader is overwritten with zeros. Truncating the file to zerolength saves having to make modifications to the directory containingthe file since the file is not removed from the directory. Overwriting the header has the additional savings of not havingto update the length of the file (in the "inode" on many systems)and not having to deal with newly freed disk sectors. Furthermore,at the next transaction the journal will be created by overwritingexisting content rather than appending new content onto the endof a file, and overwriting is often much faster than appending.</p><p>SQLite can be configured to commit transactions by overwritingthe journal header with zeros instead of deleting the journal fileby setting the "PERSIST" journaling mode using the <a href="pragma.html#pragma_journal_mode">journal_mode</a> PRAGMA.For example:</p><blockquote><pre>PRAGMA journal_mode=PERSIST;</per></blockquote><p>The use of persistent journal mode provide a noticable performanceimprovement on many systems. Of course, the drawback is that the journal files remain on the disk, using disk space and clutteringdirectories, long after the transaction commits. The only safe wayto delete a persistent journal file is to commit a transactionwith journaling mode set to DELETE:</p><blockquote><pre>PRAGMA journal_mode=DELETE;BEGIN EXCLUSIVE;COMMIT;</per></blockquote><p>Beware of deleting persistent journal files by any other meanssince the journal file might be hot, in which case deleting it willcorrupt the corresponding database file.</p><h2>8.0 Testing Atomic Commit Behavior</h2><p>The developers of SQLite are confident that it is robustin the face of power failures and system crashes because theautomatic test procedures do extensive checks onthe ability of SQLite to recover from simulated power loss.We call these the "crash tests".</p><p>Crash tests in SQLite use a modified VFS that can simulatethe kinds of filesystem damage that occur during a powerloss or operating system crash. The crash-test VFS can simulateincomplete sector writes, pages filled with garbage data becausea write has not completed, and out of order writes, all occurringat varying points during a test scenario. Crash tests executetransactions over and over, varying the time at which a simulatedpower loss occurs and the properties of the damage inflicted.Each test then reopens the database after the simulated crash andverifies that the transaction either occurred completelyor not at all and that the database is in a completelyconsistent state.</p><p>The crash tests in SQLite have discovered a number of verysubtle bugs (now fixed) in the recovery mechanism. Some of these bugs were very obscure and unlikely to have been foundusing only code inspection and analysis techniques. From thisexperience, the developers of SQLite feel confident that any otherdatabase system that does not use a similar crash test systemlikely contains undetected bugs that will lead to databasecorruption following a system crash or power failure.</p><h2>9.0 Things That Can Go Wrong</h2><p>The atomic commit mechanism in SQLite has proven to be robust,but it can be circumvented by a sufficiently creativeadversary or a sufficiently broken operating system implementation.This section describes a few of the ways in which an SQLite databasemight be corrupted by a power failure or system crash.</p><h3>9.1 Broken Locking Implementations</h3><p>SQLite uses filesystem locks to make sure that only oneprocess and database connection is trying to modify the databaseat a time. The filesystem locking mechanism is implementedin the VFS layer and is different for every operating system.SQLite depends on this implementation being correct. If somethinggoes wrong and two or more processes are able to write the samedatabase file at the same time, severe damage can result.</p><p>We have received reports of implementations of bothwindows network filesystems and NFS in which locking wassubtly broken. We can not verify these reports, but aslocking is difficult to get right on a network filesystemwe have no reason to doubt them. You are advised to avoid using SQLite on a network filesystem in the first place,since performance will be slow. But if you must use a network filesystem to store SQLite database files, considerusing a secondary locking mechanism to prevent simultaneouswrites to the same database even if the native filesystemlocking mechanism malfunctions.</p><p>The versions of SQLite that come preinstalled on AppleMac OS X computers contain a version of SQLite that has beenextended to use alternative locking strategies that work onall network filesystems that Apple supports. These extensionsused by Apple work great as long as all processes are accessingthe database file in the same way. Unfortunately, the lockingmechanisms do not exclude one another, so if one process isaccessing a file using (for example) AFP locking and anotherprocess (perhaps on a different machine) is using dot-file locks,the two processes might collide because AFP locks do not excludedot-file locks or vice versa.</p><h3>9.2 Incomplete Disk Flushes</h3><p>SQLite uses the fsync() system call on unix and the FlushFileBuffers()system call on w32 in order to sync the file system buffers onto diskoxide as shown in <a href="#section_3_7">step 3.7</a> and<a href="#section_3_10">step 3.10</a>. Unfortunately, we have receivedreports that neither of these interfaces works as advertised on manysystems. We hear that FlushFileBuffers() can be completely disabledusing registry settings on some windows versions. Some historicalversions of Linux contain versions of fsync() which are no-ops onsome filesystems, we are told. Even on systems where FlushFileBuffers() and fsync() are said to be working, oftenthe IDE disk control lies and says that data has reached oxidewhile it is still held only in the volatile control cache.</p><p>On the Mac, you can set this pragma:</p><blockquote>PRAGMA fullfsync=ON;</blockquote><p>Setting fullfsync on a Mac will guarantee that data really doesget pushed out to the disk platter on a flush. But the implementationof fullfsync involves resetting the disk controller. And so not onlyis it profoundly slow, it also slows down other unrelated disk I/O.So its use is not recommended.</p><h3>9.3 Partial File Deletions</h3><p>SQLite assumes that file deletion is an atomic operation from thepoint of view of a user process. If power fails in the middle ofa file deletion, then after power is restored SQLite expects to seeeither the entire file with all of its original data intact, or itexpects not to find the file at all. Transactions may not be atomicon systems that do not work this way.</p><h3>9.4 Garbage Written Into Files</h3><p>SQLite database files are ordinary disk files that can beopened and written by ordinary user processes. A rogue processcan open an SQLite database and fill it with corrupt data. Corrupt data might also be introduced into an SQLite databaseby bugs in the operating system or disk controller; especiallybugs triggered by a power failure. There is nothing SQLite cando to defend against these kinds of problems.</p><h3>9.5 Deleting Or Renaming A Hot Journal</h3><p>If a crash or power loss does occur and a hot journal is left onthe disk, it is essential that the original database file and the hotjournal remain on disk with their original names until the databasefile is opened by another SQLite process and rolled back. During recovery at <a href="section_4_2">step 4.2</a> SQLite locatesthe hot journal by looking for a file in the same directory as thedatabase being opened and whose name is derived from the name of thefile being opened. If either the original database file or thehot journal have been moved or renamed, then the hot journal willnot be seen and the database will not be rolled back.</p><p>We suspect that a common failure mode for SQLite recovery happenslike this: A power failure occurs. After power is restored, a well-meaninguser or system administrator begins looking around on the disk fordamage. They see their database file named "important.data". This fileis perhaps familiar to them. But after the crash, there is also ahot journal named "important.data-journal". The user then deletesthe hot journal, thinking that they are helping to cleanup the system.We know of no way to prevent this other than user education.</p><p>If there are multiple (hard or symbolic) links to a database file,the journal will be created using the name of the link through whichthe file was opened. If a crash occurs and the database is opened againusing a different link, the hot journal will not be located and norollback will occur.</p><p>Sometimes a power failure will cause a filesystem to be corruptedsuch that recently changed filenames are forgotten and the file ismoved into a "/lost+found" directory. When that happens, the hotjournal will not be found and recovery will not occur.SQLite tries to prevent thisby opening and syncing the directory containing the rollback journalat the same time it syncs the journal file itself. However, themovement of files into /lost+found can be caused by unrelated processescreating unrelated files in the same directory as the main database file.And since this is out from under the control of SQLite, there is nothingthat SQLite can do to prevent it. If you are running on a system thatis vulnerable to this kind of filesystem namespace corruption (mostmodern journalling filesystems are immune, we believe) then you mightwant to consider putting each SQLite database file in its own privatesubdirectory.</p><h2>10.0 Future Directions And Conclusion</h2><p>Every now and then someone discovers a new failure mode forthe atomic commit mechanism in SQLite and the developers have toput in a patch. This is happening less and less and thefailure modes are becoming more and more obscure. But it wouldstill be foolish to suppose that the atomic commit logic ofSQLite is entirely bug-free. The developers are committed to fixingthese bugs as quickly as they might be found.</p><p>The developers are also on the lookout for new ways tooptimize the commit mechanism. The current VFS implementationsfor Linux, MacOSX, and w32 make pessimistic assumptions aboutthe behavior of those systems. After consultation with expertson how these systems work, we might be able to relax some of theassumptions on these systems and allow them to run faster. Inparticular, we suspect that most modern filesystems exhibit thesafe append property and that many of them might support atomicsector writes. But until this is known for certain, SQLite willtake the conservative approach and assume the worst.</p><hr><small><i>This page last modified 2008/04/26 11:22:07 UTC</i></small></div></body></html>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -