<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.25 2003/09/20 20:12:05 tgl Exp $ -->

<chapter id="wal">
 <title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>

 <indexterm zone="wal">
  <primary>WAL</primary>
 </indexterm>

 <indexterm>
  <primary>transaction log</primary>
  <see>WAL</see>
 </indexterm>

 <para>
  <firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
  is a standard approach to transaction logging.  Its detailed
  description may be found in most (if not all) books about
  transaction processing.  Briefly, <acronym>WAL</acronym>'s central
  concept is that changes to data files (where tables and indexes
  reside) must be written only after those changes have been logged,
  that is, when log records have been flushed to permanent storage.
  If we follow this procedure, we do not need to flush data pages to
  disk on every transaction commit, because we know that in the event
  of a crash we will be able to recover the database using the log:
  any changes that have not been applied to the data pages will first
  be redone from the log records (this is roll-forward recovery, also
  known as REDO) and then changes made by uncommitted transactions
  will be removed from the data pages (roll-backward recovery, UNDO).
 </para>

 <sect1 id="wal-benefits-now">
  <title>Benefits of <acronym>WAL</acronym></title>

  <indexterm zone="wal-benefits-now">
   <primary>fsync</primary>
  </indexterm>

  <para>
   The first obvious benefit of using <acronym>WAL</acronym> is a
   significantly reduced number of disk writes, since only the log
   file needs to be flushed to disk at the time of transaction
   commit; in multiuser environments, commits of many transactions
   may be accomplished with a single <function>fsync()</function> of
   the log file.  Furthermore, the log file is written sequentially,
   and so the cost of syncing the log is much less than the cost of
   flushing the data pages.
  </para>

  <para>
   The next benefit is consistency of the data pages.  The truth is
   that, before <acronym>WAL</acronym>,
   <productname>PostgreSQL</productname> was never able to guarantee
   consistency in the case of a crash.  Before
   <acronym>WAL</acronym>, any crash during writing could result in:

   <orderedlist>
    <listitem>
     <simpara>index rows pointing to nonexistent table rows</simpara>
    </listitem>

    <listitem>
     <simpara>index rows lost in split operations</simpara>
    </listitem>

    <listitem>
     <simpara>totally corrupted table or index page content, because
     of partially written data pages</simpara>
    </listitem>
   </orderedlist>

   Problems with indexes (problems 1 and 2) could possibly have been
   fixed by additional <function>fsync()</function> calls, but it is
   not obvious how to handle the last case without
   <acronym>WAL</acronym>; <acronym>WAL</acronym> saves the entire
   data page content in the log if that is required to ensure page
   consistency for after-crash recovery.
  </para>
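  <para>
   As a minimal illustration of this commit behavior (the
   <structname>accounts</structname> table and the names used here are
   purely hypothetical; <varname>fsync</varname> is shown with its
   default setting):
<programlisting>
-- WAL is flushed to disk at commit time; this relies on fsync
-- being enabled (the default).
SHOW fsync;

-- However many data pages this transaction dirties, committing it
-- requires flushing only the WAL, not the modified data pages.
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE name = 'alice';
UPDATE accounts SET balance = balance + 100.00 WHERE name = 'bob';
COMMIT;
</programlisting>
  </para>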
 </sect1>

 <sect1 id="wal-benefits-later">
  <title>Future Benefits</title>

  <para>
   The UNDO operation is not implemented.  This means that changes
   made by aborted transactions will still occupy disk space and that
   a permanent <filename>pg_clog</filename> file to hold the status
   of transactions is still needed, since transaction identifiers
   cannot be reused.  Once UNDO is implemented,
   <filename>pg_clog</filename> will no longer be required to be
   permanent; it will be possible to remove
   <filename>pg_clog</filename> at shutdown.  (However, the urgency
   of this concern has decreased greatly with the adoption of a
   segmented storage method for <filename>pg_clog</filename>: it is
   no longer necessary to keep old <filename>pg_clog</filename>
   entries around forever.)
  </para>

  <para>
   With UNDO, it will also be possible to implement
   <firstterm>savepoints</firstterm><indexterm><primary>savepoint</></>
   to allow partial rollback of invalid transaction operations
   (parser errors caused by mistyping commands, insertion of
   duplicate primary/unique keys and so on) with the ability to
   continue or commit valid operations made by the transaction before
   the error.  At present, any error will invalidate the whole
   transaction and require a transaction abort.
  </para>

  <para>
   <acronym>WAL</acronym> offers the opportunity for a new method for
   database on-line backup and restore (<acronym>BAR</acronym>).  To
   use this method, one would have to make periodic saves of data
   files to another disk, a tape, or another host and also archive
   the <acronym>WAL</acronym> log files.  The database file copy and
   the archived log files could be used to restore just as if one
   were restoring after a crash.  Each time a new database file copy
   was made the old log files could be removed.  Implementing this
   facility will require the logging of data file and index creation
   and deletion; it will also require development of a method for
   copying the data files (operating system copy commands are not
   suitable).
  </para>

  <para>
   A difficulty standing in the way of realizing these benefits is
   that they require saving <acronym>WAL</acronym> entries for
   considerable periods of time (e.g., as long as the longest
   possible transaction if transaction UNDO is wanted).  The present
   <acronym>WAL</acronym> format is extremely bulky since it includes
   many disk page snapshots.  This is not a serious concern at
   present, since the entries only need to be kept for one or two
   checkpoint intervals; but to achieve these future benefits some
   sort of compressed <acronym>WAL</acronym> format will be needed.
  </para>
 </sect1>

 <sect1 id="wal-configuration">
  <title><acronym>WAL</acronym> Configuration</title>

  <para>
   There are several <acronym>WAL</acronym>-related configuration
   parameters that affect database performance.  This section
   explains their use.  Consult <xref linkend="runtime-config"> for
   details about setting configuration parameters.
  </para>

  <para>
   <firstterm>Checkpoints</firstterm><indexterm><primary>checkpoint</></>
   are points in the sequence of transactions at which it is
   guaranteed that the data files have been updated with all
   information logged before the checkpoint.  At checkpoint time, all
   dirty data pages are flushed to disk and a special checkpoint
   record is written to the log file.  As a result, in the event of a
   crash, the recoverer knows from what record in the log (known as
   the redo record) it should start the REDO operation, since any
   changes made to data files before that record are already on disk.
   After a checkpoint has been made, any log segments written before
   the redo record are no longer needed and can be recycled or
   removed.  (When <acronym>WAL</acronym>-based <acronym>BAR</acronym>
   is implemented, the log segments would be archived before being
   recycled or removed.)
  </para>

  <para>
   The server spawns a special process every so often to create the
   next checkpoint.  A checkpoint is created every
   <varname>checkpoint_segments</varname> log segments, or every
   <varname>checkpoint_timeout</varname> seconds, whichever comes
   first.  The default settings are 3 segments and 300 seconds
   respectively.  It is also possible to force a checkpoint by using
   the SQL command <command>CHECKPOINT</command>.
  </para>
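  <para>
   For example, one can inspect the current checkpoint settings and
   force an immediate checkpoint by hand (a sketch only; when and
   whether to do this depends entirely on the installation):
<programlisting>
-- Show the current checkpoint parameters.
SHOW checkpoint_segments;
SHOW checkpoint_timeout;

-- Force a checkpoint immediately, for instance just before taking
-- a file-level copy of the data directory.
CHECKPOINT;
</programlisting>
  </para>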
  <para>
   Reducing <varname>checkpoint_segments</varname> and/or
   <varname>checkpoint_timeout</varname> causes checkpoints to be
   done more often.  This allows faster after-crash recovery (since
   less work will need to be redone).  However, one must balance this
   against the increased cost of flushing dirty data pages more
   often.  In addition, to ensure data page consistency, the first
   modification of a data page after each checkpoint results in
   logging the entire page content.  Thus a smaller checkpoint
   interval increases the volume of output to the log, partially
   negating the goal of using a smaller interval, and in any case
   causing more disk I/O.
  </para>
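  <para>
   As an illustrative <filename>postgresql.conf</filename> excerpt
   (the values are examples only, not recommendations; the right
   trade-off depends on the workload):
<programlisting>
# Checkpoint less often: longer after-crash recovery, but fewer
# full-page images written to WAL.
checkpoint_segments = 10        # number of log segments; default is 3
checkpoint_timeout = 600        # seconds; default is 300
</programlisting>
  </para>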