📄 atomiccommit.html
字号:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html><head><title>Atomic Commit In SQLite</title><style type="text/css">body { margin: auto; font-family: "Verdana" "sans-serif"; padding: 8px 1%;}a { color: #45735f }a:visited { color: #734559 }.logo { position:absolute; margin:3px; }.tagline { float:right; text-align:right; font-style:italic; width:240px; margin:12px; margin-top:58px;}.toolbar { font-variant: small-caps; text-align: center; line-height: 1.6em; margin: 0; padding:1px 8px;}.toolbar a { color: white; text-decoration: none; padding: 6px 12px; }.toolbar a:visited { color: white; }.toolbar a:hover { color: #80a796; background: white; }.content { margin: 5%; }.content dt { font-weight:bold; }.content dd { margin-bottom: 25px; margin-left:20%; }.content ul { padding:0px; padding-left: 15px; margin:0px; }/* rounded corners */.se { background: url(images/se.png) 100% 100% no-repeat #80a796}.sw { background: url(images/sw.png) 0% 100% no-repeat }.ne { background: url(images/ne.png) 100% 0% no-repeat }.nw { background: url(images/nw.png) 0% 0% no-repeat }</style><meta http-equiv="content-type" content="text/html; charset=UTF-8"> </head><body><div><!-- container div to satisfy validator --><a href="index.html"><img class="logo" src="images/SQLite.gif" alt="SQLite Logo" border="0"></a><div><!-- IE hack to prevent disappearing logo--></div><div class="tagline">Small. Fast. Reliable.<br>Choose any three.</div><table width=100% style="clear:both"><tr><td> <div class="se"><div class="sw"><div class="ne"><div class="nw"> <div class="toolbar"> <a href="about.html">About</a> <a href="sitemap.html">Sitemap</a> <a href="docs.html">Documentation</a> <a href="download.html">Download</a> <a href="copyright.html">License</a> <a href="news.html">News</a> <a href="http://www.sqlite.org/cvstrac/index">Developers</a> <a href="support.html">Support</a> </div></div></div></div></div></td></tr></table> <h1 align="center">Atomic Commit In SQLite</h1><h2>1.0 Introduction</h2><p>An important feature of transactional databases like SQLiteis "atomic commit". Atomic commit means that either all database changes within a single transaction occur or none of them occur. With atomic commit, itis as if many different writes to different sections of the databasefile occur instantaneously and simultaneously.Real hardware serializes writes to mass storage, and writinga single sector takes a finite amount of time.So it is impossible to truly write many different sectors of a database file simultaneously and/or instantaneously.But the atomic commit logic withinSQLite makes it appear as if the changes for a transactionare all written instantaneously and simultaneously.</p><p>SQLite has the important property that transactions appearto be atomic even if the transaction is interrupted by anoperating system crash or power failure.</p><p>This article describes the techniques used by SQLite to create theillusion of atomic commit.</p><h2>2.0 Hardware Assumptions</h2><p>Throughout this article, we will call the mass storage device "disk"even though the mass storage device might really be flash memory.</p><p>We assume that disk is written in chunks which we call a "sector".It is not possible to modify any part of the disk smaller than a sector.To change a part of the disk smaller than a sector, you have to read inthe full sector that contains the part you want to change, make thechange, then write back out the complete sector.</p><p>On a traditional spinning disk, a sector is the minimum unit of transferin both directions, both reading and writing. On flash memory, however,the minimum size of a read is typically much smaller than a minimum write.SQLite is only concerned with the minimum write amount and so for thepurposes of this article, when we say "sector" we mean the minimum amountof data that can be written to mass storage in a single go.</p><p>Prior to SQLite version 3.3.14, a sector size of 512 bytes wasassumed in all cases. There was a compile-time option to changethis but the code had never been tested with a larger value. The512 byte sector assumption seemed reasonable since until very recentlyall disk drives used a 512 byte sector internally. However, therehas recently been a push to increase the sector size of disks to4096 bytes. Also the sector sizefor flash memory is usually larger than 512 bytes. For these reasons,versions of SQLite beginning with 3.3.14 have a method in the OSinterface layer that interrogates the underlying filesystem to findthe true sector size. As currently implemented (version 3.5.0) thismethod still returns a hard-coded value of 512 bytes, since thereis no standard way of discovering the true sector size on eitherwin32 or unix. But the method is available for embedded devicemanufactures to tweak according to their own needs. And we haveleft open the possibility of filling in a more meaningful implementationon unix and win32 in the future.</p><p>SQLite does <u>not</u> assume that a sector write is atomic.However, it does assume that a sector write is linear. By "linear"we mean that SQLite assumes that when writing a sector, the hardware beginsat one end of the data and writes byte by byte until it gets tothe other end. The write might go from beginning to end or fromend to beginning. If a power failure occurs in the middle of asector write it might be that part of the sector was modifiedand another part was left unchanged. The key assumption by SQLiteis that if any part of the sector gets changed, then either thefirst or the last bytes will be changed. So the hardware willnever start writing a sector in the middle and work towards theends. We do not know if this assumption is always true but itseems reasonable.</p><p>The previous paragraph states that SQLite does not assume thatsector writes are atomic. This is true by default. But as ofSQLite version 3.5.0, there is a new interface called theVirtual File System (VFS) interface. The VFS is the only meansby which SQLite communicates to the underlying filesystem. Thecode comes with default VFS implementations for unix and windowsand there is a mechanism for creating new custom VFS implementationsat runtime. In this new VFS interface there is a method calledxDeviceCharacteristics. This method interrogates the underlyingfilesystem to discover various properties and behaviors that thefilesystem may or may not exhibit. The xDeviceCharacteristicsmethod might indicate that sector writes are atomic, and if it doesso indicate, SQLite will try to take advantage of that fact. Butthe default xDeviceCharacteristics method for both unix and windowsdoes not indicate atomic sector writes and so these optimizationsare normally omitted.</p><p>SQLite assumes that the operating system will buffer writes andthat a write request will return before data has actually been storedin the mass storage device.SQLite further assumes that write operations will be reordered bythe operating system.For this reason, SQLite does a "flush" or "fsync" operation at keypoints. SQLite assumes that the flush or fsync will not return untilall pending write operations for the file that is being flushed havecompleted. We are told that the flush and fsync primitivesare broken on some versions of windows and Linux. This is unfortunate.It opens SQLite up to the possibility of database corruption followinga power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLiteassumes that the operating system that it is running on works asadvertised. If that is not quite the case, well then hopefully youwill not lose power too often.</p><p>SQLite assumes that when a file grows in length that the newfile space originally contains garbage and then later is filled inwith the data actually written. In other words, SQLite assumes thatthe file size is updated before the file content. This is a pessimistic assumption and SQLite has to do some extra work to makesure that it does not cause database corruption if power is lostbetween the time when the file size is increased and when thenew content is written. The xDeviceCharacteristics method ofthe VFS might indicate that the filesystem will always write thedata before updating the file size. (This is the SQLITE_IOCAP_SAFE_APPEND property for those readers who are lookingat the code.) When the xDeviceCharacteristics method indicatesthat files content is written before the file size is increased,SQLite can forego some of its pedantic database protection stepsand thereby decrease the amount of disk I/O needed to perform acommit. The current implementation, however, makes no such assumptionsfor the default VFSes for windows and unix.</p><p>SQLite assumes that a file deletion is atomic from thepoint of view of a user process. By this we mean that if SQLiterequests that a file be deleted and the power is lost during thedelete operation, once power is restored either the file willexist completely with all if its original content unaltered, orelse the file will not be seen in the filesystem at all. Ifafter power is restored the file is only partially deleted,if some of its data has been altered or erased,or the file has been truncated but not completely removed, thendatabase corruption will likely result.</p><p>SQLite assumes that the detection and/or correction of bit errors caused by cosmic rays, thermal noise, quantumfluctuations, device driver bugs, or other mechanisms, is the responsibility of the underlying hardware and operating system. SQLite does not add any redundancy to the database file forthe purpose of detecting corruption or I/O errors.SQLite assumes that the data it reads is exactly the same data that it previously wrote.</p><a name="section_3_0"></a><h2>3.0 Single File Commit</h2><p>We begin with an overview of the steps SQLite takes in order toperform an atomic commit of a transaction against a single databasefile. The details of file formats used to guard against damage frompower failures and techniques for performing an atomic commit acrossmultiple databases are discussed in later sections.</p><h3>3.1 Initial State</h3><img src="images/ac/commit-0.gif" align="right" hspace="15"><p>The state of the computer when a database connection isfirst opened is shown conceptually by the diagram at theright.The area of the diagram on the extreme right (labeled "Disk") representsinformation stored on the mass storage device. Each rectangle isa sector. The blue color represents that the sectors containoriginal data.The middle area is the operating systems disk cache. At theonset of our example, the cache is cold and this is representedby leaving the rectangles of the disk cache empty.The left area of the diagram shows the content of memory forthe process that is using SQLite. The database connection hasjust been opened and no information has been read yet, so theuser space is empty.</p><br clear="both"><h3>3.2 Acquiring A Read Lock</h3><img src="images/ac/commit-1.gif" align="right" hspace="15"><p>Before SQLite can write to a database, it must first readthe database to see what is there already. Even if it is justappending new data, SQLite still has to read in the databaseschema from the <b>sqlite_master</b> table so that it can knowhow to parse the INSERT statements and discover where in thedatabase file the new information should be stored.</p><p>The first step toward reading from the database fileis obtaining a shared lock on the database file. A "shared"
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -