m4_comment([$Id: reclimit.so,v 11.32 2005/06/16 17:13:55 bostic Exp $])

m4_ref_title(m4_tam Applications,
    m4_db recoverability,
    m4_db @recoverability,
    transapp/filesys, transapp/tune)

m4_p([dnl
m4_db recovery is based on write-ahead logging.  This means that
when a change is made to a database page, a description of the change is
written into a log file.  This description in the log file is guaranteed
to be written to stable storage before the database pages that were
changed are written to stable storage.  This is the fundamental feature
of the logging system that makes durability and rollback work.])

m4_p([dnl
If the application or system crashes, the log is reviewed during
recovery.  Any database changes described in the log that were part of
committed transactions and that were never written to the actual
database itself are written to the database as part of recovery.  Any
database changes described in the log that were never committed and that
were written to the actual database itself are backed out of the
database as part of recovery.  This design allows the database to be
written lazily, and only blocks from the log file have to be forced to
disk as part of transaction commit.])

m4_p([dnl
There are two interfaces that are a concern when considering m4_db
recoverability:])

m4_nlistbegin
m4_nlist([The interface between m4_db and the operating system/filesystem.])
m4_nlistns([The interface between the operating system/filesystem and the
underlying stable storage hardware.])
m4_nlistend

m4_p([dnl
m4_db uses the operating system interfaces and its underlying filesystem
when writing its files.  This means that m4_db can fail if the underlying
filesystem fails in some unrecoverable way.
Otherwise, the interface
requirements here are simple: the system call that m4_db uses to flush
data to disk (normally fsync or fdatasync) must guarantee that all the
information necessary for a file's recoverability has been written to
stable storage before it returns to m4_db, and that no possible
application or system crash can cause that file to be unrecoverable.])

m4_p([dnl
In addition, m4_db implicitly uses the interface between the operating
system and the underlying hardware.  The interface requirements here are
not as simple.])

m4_p([dnl
First, it is necessary to consider the underlying page size of the m4_db
databases.  The m4_db library performs all database writes using the
page size specified by the application, and m4_db assumes pages are
written atomically.  This means that if the operating system performs
filesystem I/O in blocks of a different size than the database page size,
it may increase the possibility of database corruption.  For example,
assume that m4_db is writing 32KB pages for a database, and the
operating system does filesystem I/O in 16KB blocks.  If the operating
system writes the first 16KB of the database page successfully, but
crashes before being able to write the second 16KB of the database page,
the database has been corrupted and this corruption may or may not be
detected during recovery.  For this reason, it may be important to
select database page sizes that will be written as single block
transfers by the underlying operating system.  If you do not select a
page size that the underlying operating system will write as a single
block, you may want to configure the database to use checksums (see the
m4_ref(DB_CHKSUM) flag for more information).  By configuring checksums,
you guarantee this kind of corruption will be detected at the expense
of the CPU required to generate the checksums.
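])

m4_p([dnl
The 32KB page and 16KB block case above can be checked mechanically.
The following C sketch is illustrative only and is not part of m4_db:
the function names fs_block_size and page_fits_single_block are
hypothetical, and it assumes that the st_blksize value reported by stat
is a reasonable stand-in for the filesystem I/O block size, which the
operating system does not promise.])

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * fs_block_size --
 *	Return the preferred I/O block size the filesystem reports for
 *	path, or -1 on error.  Assumption: st_blksize is only a hint; it
 *	does not promise that a write of that size is atomic.
 */
long
fs_block_size(const char *path)
{
	struct stat sb;

	if (stat(path, &sb) != 0)
		return (-1);
	return ((long)sb.st_blksize);
}

/*
 * page_fits_single_block --
 *	Return 1 if every page-aligned page of page_size bytes lies
 *	entirely inside one block of block_size bytes, otherwise 0.
 */
int
page_fits_single_block(size_t page_size, size_t block_size)
{
	assert(page_size > 0 && block_size > 0);

	/* A page straddles a block boundary unless it divides the block. */
	return (page_size <= block_size && block_size % page_size == 0);
}
```

m4_p([dnl
Under these assumptions, a 32KB page on 16KB blocks fails the check,
while 16KB and 4KB pages on 16KB blocks pass it.  A page size that fails
the check is a case where checksums may be worth their CPU cost, because
they allow a partially written page to be detected.])

m4_p([dnl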
When such an error is
detected, the only course of recovery is to perform catastrophic
recovery to restore the database.])

m4_p([dnl
Second, if you are copying database files (either as part of doing a
hot backup or creation of a hot failover area), there is an additional
question related to the page size of the m4_db databases.  You must copy
databases atomically, in units of the database page size.  In other
words, the reads made by the copy program must not be interleaved with
writes by other threads of control, and the copy program must read the
databases in multiples of the underlying database page size.  Generally,
this is not a problem, as operating systems already make this guarantee
and system utilities normally read in power-of-2 sized chunks, which
are larger than the largest possible m4_db database page size.])

m4_p([dnl
One problem we have seen in this area was in some releases of Solaris
where the cp utility was implemented using the mmap system call rather
than the read system call.  Because the Solaris mmap system call did
not make the same guarantee of read atomicity as the read system call,
using the cp utility could create corrupted copies of the databases.
Another problem we have seen is implementations of the tar utility doing
10KB block reads by default, and even when an output block size was
specified to that utility, not reading from the underlying databases in
multiples of the block size.  Using the dd utility instead of the cp or
tar utilities (and specifying an appropriate block size) fixes these
problems.  If you plan to use a system utility to copy database files,
you may want to use a system call trace utility (for example, ktrace or
truss) to check for an I/O size smaller than or not a multiple of the
database page size and system calls other than read.])

m4_p([dnl
Third, it is necessary to consider the behavior of the system's
underlying stable storage hardware.
For example, consider a SCSI
controller that has been configured to cache data and return to the
operating system that the data has been written to stable storage, when,
in fact, it has only been written into the controller RAM cache.  If
power is lost before the controller is able to flush its cache to disk,
and the controller cache is not stable (that is, the writes will not be
flushed to disk when power returns), the writes will be lost.  If the
writes include database blocks, there is no loss because recovery will
correctly update the database.  If the writes include log file blocks,
it is possible that transactions that were already committed may not
appear in the recovered database, although the recovered database will
be coherent after a crash.])

m4_p([dnl
If the underlying hardware can fail in any way so that only part of the
block was written, the failure conditions are the same as those
described previously for an operating system failure that writes only
part of a logical database block.  In such cases, configuring the
database for checksums will ensure the corruption is detected.])

m4_p([dnl
For these reasons, it may be important to select hardware that does not
do partial writes and does not cache data writes (or does not return
that the data has been written to stable storage until it has either
been written to stable storage or the actual writing of all of the data
is guaranteed, barring catastrophic hardware failure -- that is, your
disk drive exploding).])

m4_p([dnl
If the disk drive on which you are storing your databases explodes, you
can perform normal m4_db catastrophic recovery, because it requires only
a snapshot of your databases plus the log files you have archived since
those snapshots were taken.
In this case, you should lose no database
changes at all.])

m4_p([dnl
If the disk drive on which you are storing your log files explodes, you
can also perform catastrophic recovery, but you will lose any database
changes made as part of transactions committed since your last archival
of the log files.  Alternatively, if your database environment and
databases are still available after you lose the log file disk, you
should be able to dump your databases.  However, you may see an
inconsistent snapshot of your data after doing the dump, because
changes that were part of transactions that were not yet committed
may appear in the database dump.  Depending on the value of the data,
a reasonable alternative may be to perform both the database dump and
the catastrophic recovery and then compare the databases created by
the two methods.])

m4_p([dnl
Regardless, for these reasons, storing your databases and log files on
different disks should be considered a safety measure as well as a
performance enhancement.])

m4_p([dnl
Finally, you should be aware that m4_db does not protect against all
cases of stable storage hardware failure, nor does it protect against
simple hardware misbehavior (for example, a disk controller writing
incorrect data to the disk).  However, configuring the database for
checksums will ensure that any such corruption is detected.])

m4_page_footer
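
m4_p([dnl
The rule given earlier for copying database files (use the read system
call, and read in multiples of the database page size) can be sketched
in C.  This is a minimal illustration and not an m4_db utility: the
function name copy_in_page_multiples and its parameters are hypothetical,
the caller is assumed to know the database page size, and the sketch
relies on the operating system for the atomicity of each read relative
to concurrent writers.])

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * copy_in_page_multiples --
 *	Copy src to dst using only read(2) and write(2).  Each read
 *	requests pages_per_read * page_size bytes, so the request size is
 *	always a multiple of the page size.  Returns 0 on success and -1
 *	on error.
 */
int
copy_in_page_multiples(const char *src, const char *dst,
    size_t page_size, size_t pages_per_read)
{
	size_t chunk;
	ssize_t nr, nw, off;
	char *buf;
	int in, out, ret;

	assert(page_size > 0 && pages_per_read > 0);

	chunk = page_size * pages_per_read;
	in = out = -1;
	ret = -1;
	nw = 0;

	if ((buf = malloc(chunk)) == NULL)
		return (-1);
	if ((in = open(src, O_RDONLY)) < 0)
		goto done;
	if ((out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
		goto done;

	/* Only the final read of the file may return less than chunk. */
	while ((nr = read(in, buf, chunk)) > 0)
		for (off = 0; off < nr; off += nw) {
			nw = write(out, buf + off, (size_t)(nr - off));
			if (nw <= 0)
				goto done;
		}
	if (nr == 0)
		ret = 0;

done:	if (in >= 0)
		(void)close(in);
	if (out >= 0 && close(out) != 0)
		ret = -1;
	free(buf);
	return (ret);
}
```

m4_p([dnl
For a database with 4KB pages, a call such as
copy_in_page_multiples(src, dst, 4096, 16) reads in 64KB requests.  As
with a system utility, a system call trace is still the way to confirm
the I/O sizes that are actually issued.])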