📄 atomiccommit.html
字号:
of this counter prior to releasing its database lock. Then afteracquiring the next database lock it compares the saved counter valueagainst the current counter value and erases the cache if the valuesare different, or reuses the cache if they are the same.</p><h3>7.2 Exclusive Access Mode</h3><p>SQLite version 3.3.14 adds the concept of "Exclusive Access Mode".In exclusive access mode, SQLite retains the exclusivedatabase lock at the conclusion of each transaction. This preventsother processes for accessing the database, but in many deploymentsonly a single process is using a database so often this is not aserious problem. The advantage of exclusive access mode is thatdisk I/O can be reduced in three ways:</p><ol><li><p>It is not necessary to increment the change counter in thedatabase header for transactions after the first transaction. Thiswill often save a write of page one to both the rollbackjournal and the main database file.</p></li><li><p>No other processes can change the database so there is nevera need to check the change counter and clear the user-space cacheat the beginning of a transaction.</p></li><li><p>The rollback journal file can be truncated rather than deletedat the end of each transaction. On many operating systems, truncatinga file is much faster than deleting it.</p></li></ol><p>The third optimization, truncating rather than deleting the rollbackjournal file, does not depend on holding an exclusive lock at all times.We could, in theory, do that optimization at all times, not just whenexclusive access mode is enabled, and we may well choose to do so in some future release of SQLite. But for now (version 3.5.0) therollback journal truncation optimization is only enabled in conjunctionwith exclusive access mode.</p><h3>7.3 Do Not Journal Freelist Pages</h3><p>When information is deleted from an SQLite database, the pages usedto old the deleted information are added to a "freelist". Subsequentinserts will draw pages off of this freelist rather than expanding thedatabase file.</p><p>Some freelist pages contain critical data; specifically the locationsof other freelist pages. But most freelist pages contain nothing useful.These latter freelist pages are called "leaf" pages. We are free tomodify the content of a leaf freelist page in the database withoutchanging the meaning of the database in any way.</p><p>Because the content of leaf freelist pages is unimportant, SQLiteavoids storing leaf freelist page content in the rollback journalin <a href="#section_3_5">step 3.5</a> of the commit process.If a leaf freelist page is changed and that change does not get rolled backduring a transaction recovery, the database is not harmed by the omission.Similarly, the content of a new freelist page is never written backinto the database at <a href="#section_3_9">step 3.9</a> norread from the database at <a href="#section_3_3">step 3.3</a>.These optimizations can greatly reduce the amount of I/O that occurswhen making changes to a database file that contains free space.</p><h3>7.4 Single Page Updates And Atomic Sector Writes</h3><p>Beginning in SQLite version 3.5.0, the new Virtual File System (VFS)interface contains a method named xDeviceCharacteristics which reportson special properties that the underlying mass storage devicemight have. Among the special properties thatxDeviceCharacteristics might report is the ability of to do anatomic sector write.</p><p>Recall that by default SQLite assumes that sector writes arelinear but not atomic. A linear write starts at one end of thesector and changes information byte by byte until it gets to theother end of the sector. If a power loss occurs in the middle ofa linear write then part of the sector might be modified while theother end is unchanged. In an atomic sector write, either the entiresector is overwritten or else nothing in the sector is changed.</p><p>We believe that most modern disk drives implement atomic sectorwrites. When power is lost, the drive uses energy stored in capacitorsand/or the angular momentum of the disk platter to provide power to complete any operation in progress. Nevertheless, there are so manylayers in between the write system call and the on-board disk driveelectronics that we take the safe approach in both Unix and w32 VFSimplementations and assume that sector writes are not atomic. On theother hand, devicemanufactures with more control over their filesystems might wantto consider enabling the atomic write property of xDeviceCharacteristicsif their hardware really does do atomic writes.</p><p>When sector writes are atomic and the page size of a database isthe same as a sector size, and when there is a database change thatonly touches a single database page, then SQLite skips the wholejournaling and syncing process and simply writes the modified pagedirectly into the database file. The change counter in the firstpage of the database file is modified separately since no harm isdone if power is lost before the change counter can be updated.</p><h3>7.5 Filesystems With Safe Append Semantics</h3><p>Another optimization introduced in SQLite version 3.5.0 makesuse of "safe append" behavior of the underlying disk.Recall that SQLite assumes that when data is appended to a file(specifically to the rollback journal) that the size of the fileis increased first and that the content is written second. Soif power is lost after the file size is increased but before thecontent is written, the file is left containing invalid "garbage"data. The xDeviceCharacteristics method of the VFS might, however,indicate that the filesystem implements "safe append" semantics.This means that the content is written before the file size isincreased so that it is impossible for garbage to be introducedinto the rollback journal by a power loss or system crash.</p><p>When safe append semantics are indicated for a filesystem,SQLite always stores the special value of -1 for the page countin the header of the rollback journal. The -1 page count valuetells any process attempting to rollback the journal that thenumber of pages in the journal should be computed from the journalsize. This -1 value is never changed. So that when a commitoccurs, we save a single flush operation and a sector write ofthe first page of the journal file. Furthermore, when a cachespill occurs we no longer need to append a new journal headerto the end of the journal; we can simply continue appendingnew pages to the end of the existing journal.</p><h2>8.0 Testing Atomic Commit Behavior</h2><p>The developers of SQLite are confident that it is robustin the face of power failures and system crashes because theautomatic test procedures do extensive checks onthe ability of SQLite to recover from simulated power loss.We call these the "crash tests".</p><p>Crash tests in SQLite use a modified VFS that can simulatethe kinds of filesystem damage that occur during a powerloss or operating system crash. The crash-test VFS can simulateincomplete sector writes, pages filled with garbage data becausea write has not completed, and out of order writes, all occurringat varying points during a test scenario. Crash tests executetransactions over and over, varying the time at which a simulatedpower loss occurs and the properties of the damage inflicted.Each test then reopens the database after the simulated crash andverifies that the transaction either occurred completelyor not at all and that the database is in a completelyconsistent state.</p><p>The crash tests in SQLite have discovered a number of verysubtle bugs (now fixed) in the recovery mechanism. Some of these bugs were very obscure and unlikely to have been foundusing only code inspection and analysis techniques. From thisexperience, the developers of SQLite feel confident that any otherdatabase system that does not use a similar crash test systemlikely contains undetected bugs that will lead to databasecorruption following a system crash or power failure.</p><h2>9.0 Things That Can Go Wrong</h2><p>The atomic commit mechanism in SQLite has proven to be robust,but it can be circumvented by a sufficiently creativeadversary or a sufficiently broken operating system implementation.This section describes a few of the ways in which an SQLite databasemight be corrupted by a power failure or system crash.</p><h3>9.1 Broken Locking Implementations</h3><p>SQLite uses filesystem locks to make sure that only oneprocess and database connection is trying to modify the databaseat a time. The filesystem locking mechanism is implementedin the VFS layer and is different for every operating system.SQLite depends on this implementation being correct. If somethinggoes wrong and two or more processes are able to write the samedatabase file at the same time, severe damage can result.</p><p>We have received reports of implementations of bothwindows network filesystems and NFS in which locking wassubtly broken. We can not verify these reports, but aslocking is difficult to get right on a network filesystemwe have no reason to doubt them. You are advised to avoid using SQLite on a network filesystem in the first place,since performance will be slow. But if you must use a network filesystem to store SQLite database files, considerusing a secondary locking mechanism to prevent simultaneouswrites to the same database even if the native filesystemlocking mechanism malfunctions.</p><p>The versions of SQLite that come preinstalled on AppleMac OS X computers contain a version of SQLite that has beenextended to use alternative locking strategies that work onall network filesystems that Apple supports. These extensionsused by Apple work great as long as all processes are accessingthe database file in the same way. Unfortunately, the lockingmechanisms do not exclude one another, so if one process isaccessing a file using (for example) AFP locking and anotherprocess (perhaps on a different machine) is using dot-file locks,the two processes might collide because AFP locks do not excludedot-file locks or vice versa.</p><h3>9.2 Incomplete Disk Flushes</h3><p>SQLite uses the fsync() system call on unix and the FlushFileBuffers()system call on w32 in order to sync the file system buffers onto diskoxide as shown in <a href="#section_3_7">step 3.7</a> and<a href="#section_3_10">step 3.10</a>. Unfortunately, we have receivedreports that neither of these interfaces works as advertised on manysystems. We hear that FlushFileBuffers() can be completely disabledusing registry settings on some windows versions. Some historicalversions of Linux contain versions of fsync() which are no-ops onsome filesystems, we are told. Even on systems where FlushFileBuffers() and fsync() are said to be working, oftenthe IDE disk control lies and says that data has reached oxidewhile it is still held only in the volatile control cache.</p><p>On the Mac, you can set this pragma:</p><blockquote>PRAGMA fullfsync=ON;</blockquote><p>Setting fullfsync on a Mac will guarantee that data really doesget pushed out to the disk platter on a flush. But the implementationof fullfsync involves resetting the disk controller. And so not onlyis it profoundly slow, it also slows down other unrelated disk I/O.So its use is not recommended.</p><h3>9.3 Partial File Deletions</h3><p>SQLite assumes that file deletion is an atomic operation from thepoint of view of a user process. If power fails in the middle ofa file deletion, then after power is restored SQLite expects to seeeither the entire file with all of its original data intact, or itexpects not to find the file at all. Transactions may not be atomicon systems that do not work this way.</p><h3>9.4 Garbage Written Into Files</h3><p>SQLite database files are ordinary disk files that can beopened and written by ordinary user processes. A rogue processcan open an SQLite database and fill it with corrupt data. Corrupt data might also be introduced into an SQLite databaseby bugs in the operating system or disk controller; especiallybugs triggered by a power failure. There is nothing SQLite cando to defend against these kinds of problems.</p><h3>9.5 Deleting Or Renaming A Hot Journal</h3><p>If a crash or power loss does occur and a hot journal is left onthe disk, it is essential that the original database file and the hotjournal remain on disk with their original names until the databasefile is opened by another SQLite process and rolled back. During recovery at <a href="section_4_2">step 4.2</a> SQLite locatesthe hot journal by looking for a file in the same directory as thedatabase being opened and whose name is derived from the name of thefile being opened. If either the original database file or thehot journal have been moved or renamed, then the hot journal willnot be seen and the database will not be rolled back.</p><p>We suspect that a common failure mode for SQLite recovery happenslike this: A power failure occurs. After power is restored, a well-meaninguser or system administrator begins looking around on the disk fordamage. They see their database file named "important.data". This fileis perhaps familiar to them. But after the crash, there is also ahot journal named "important.data-journal". The user then deletesthe hot journal, thinking that they are helping to cleanup the system.We know of no way to prevent this other than user education.</p><p>If there are multiple (hard or symbolic) links to a database file,the journal will be created using the name of the link through whichthe file was opened. If a crash occurs and the database is opened againusing a different link, the hot journal will not be located and norollback will occur.</p><p>Sometimes a power failure will cause a filesystem to be corruptedsuch that recently changed filenames are forgotten and the file ismoved into a "/lost+found" directory. When that happens, the hotjournal will not be found and recovery will not occur.SQLite tries to prevent thisby opening and syncing the directory containing the rollback journalat the same time it syncs the journal file itself. However, themovement of files into /lost+found can be caused by unrelated processescreating unrelated files in the same directory as the main database file.And since this is out from under the control of SQLite, there is nothingthat SQLite can do to prevent it. If you are running on a system thatis vulnerable to this kind of filesystem namespace corruption (mostmodern journalling filesystems are immune, we believe) then you mightwant to consider putting each SQLite database file in its own privatesubdirectory.</p><h2>10.0 Future Directions And Conclusion</h2><p>Every now and then someone discovers a new failure mode forthe atomic commit mechanism in SQLite and the developers have toput in a patch. This is happening less and less and thefailure modes are becoming more and more obscure. But it wouldstill be foolish to suppose that the atomic commit logic ofSQLite is entirely bug-free. The developers are committed to fixingthese bugs as quickly as they might be found.</p><p>The developers are also on the lookout for new ways tooptimize the commit mechanism. The current VFS implementationsfor Linux, MacOSX, and w32 make pessimistic assumptions aboutthe behavior of those systems. After consultation with expertson how these systems work, we might be able to relax some of theassumptions on these systems and allow them to run faster. Inparticular, we suspect that most modern filesystems exhibit thesafe append property and that many of them might support atomicsector writes. But until this is known for certain, SQLite willtake the conservative approach and assume the worst.</p><hr><small<i>This page last modified 2007/11/28 16:23:23 UTC</i></small></div></body></html>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -