📄 storage.sgml
字号:
<!--$PostgreSQL: pgsql/doc/src/sgml/storage.sgml,v 1.8 2005/11/04 23:14:02 petere Exp $--><chapter id="storage"><title>Database Physical Storage</title><para>This chapter provides an overview of the physical storage format used by<productname>PostgreSQL</productname> databases.</para><sect1 id="storage-file-layout"><title>Database File Layout</title><para>This section describes the storage format at the level of files anddirectories.</para><para>All the data needed for a database cluster is stored within the cluster's datadirectory, commonly referred to as <varname>PGDATA</> (after the name of theenvironment variable that can be used to define it). A common location for<varname>PGDATA</> is <filename>/var/lib/pgsql/data</>. Multiple clusters,managed by different postmasters, can exist on the same machine.</para><para>The <varname>PGDATA</> directory contains several subdirectories and controlfiles, as shown in <xref linkend="pgdata-contents-table">. In addition tothese required items, the cluster configuration files<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and<filename>pg_ident.conf</filename> are traditionally stored in<varname>PGDATA</> (although in <productname>PostgreSQL</productname> 8.0 andlater, it is possible to keep them elsewhere). </para><table tocentry="1" id="pgdata-contents-table"><title>Contents of <varname>PGDATA</></title><tgroup cols="2"><thead><row><entry>Item</entry><entry>Description</entry></row></thead><tbody><row> <entry><filename>PG_VERSION</></entry> <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry></row><row> <entry><filename>base</></entry> <entry>Subdirectory containing per-database subdirectories</entry></row><row> <entry><filename>global</></entry> <entry>Subdirectory containing cluster-wide tables, such as <structname>pg_database</></entry></row><row> <entry><filename>pg_clog</></entry> <entry>Subdirectory containing transaction commit status data</entry></row><row> <entry><filename>pg_multixact</></entry> <entry>Subdirectory containing multitransaction status data (used for shared row locks)</entry> </row><row> <entry><filename>pg_subtrans</></entry> <entry>Subdirectory containing subtransaction status data</entry></row><row> <entry><filename>pg_tblspc</></entry> <entry>Subdirectory containing symbolic links to tablespaces</entry></row><row> <entry><filename>pg_twophase</></entry> <entry>Subdirectory containing state files for prepared transactions</entry></row><row> <entry><filename>pg_xlog</></entry> <entry>Subdirectory containing WAL (Write Ahead Log) files</entry></row><row> <entry><filename>postmaster.opts</></entry> <entry>A file recording the command-line options the postmaster waslast started with</entry></row><row> <entry><filename>postmaster.pid</></entry> <entry>A lock file recording the current postmaster PID and shared memorysegment ID (not present after postmaster shutdown)</entry></row></tbody></tgroup></table><para>For each database in the cluster there is a subdirectory within<varname>PGDATA</><filename>/base</>, named after the database's OID in<structname>pg_database</>. This subdirectory is the default locationfor the database's files; in particular, its system catalogs are storedthere.</para><para>Each table and index is stored in a separate file, named after the tableor index's <firstterm>filenode</> number, which can be found in<structname>pg_class</>.<structfield>relfilenode</>.</para><caution><para>Note that while a table's filenode often matches its OID, this is<emphasis>not</> necessarily the case; some operations, like<command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some formsof <command>ALTER TABLE</>, can change the filenode while preserving the OID.Avoid assuming that filenode and table OID are the same.</para></caution><para>When a table or index exceeds 1Gb, it is divided into gigabyte-sized<firstterm>segments</>. The first segment's file name is the same as thefilenode; subsequent segments are named filenode.1, filenode.2, etc.This arrangement avoids problems on platforms that have file size limitations.The contents of tables and indexes are discussed further in<xref linkend="storage-page-layout">.</para><para>A table that has columns with potentially large entries will have anassociated <firstterm>TOAST</> table, which is used for out-of-line storage offield values that are too large to keep in the table rows proper.<structname>pg_class</>.<structfield>reltoastrelid</> links from a table toits <acronym>TOAST</> table, if any.See <xref linkend="storage-toast"> for more information.</para><para>Tablespaces make the scenario more complicated. Each user-defined tablespacehas a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>directory, which points to the physical tablespace directory (as specified inits <command>CREATE TABLESPACE</> command). The symbolic link is named afterthe tablespace's OID. Inside the physical tablespace directory there isa subdirectory for each database that has elements in the tablespace, namedafter the database's OID. Tables within that directory follow the filenodenaming scheme. The <literal>pg_default</> tablespace is not accessed through<filename>pg_tblspc</>, but corresponds to<varname>PGDATA</><filename>/base</>. Similarly, the <literal>pg_global</>tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to<varname>PGDATA</><filename>/global</>.</para></sect1><sect1 id="storage-toast"><title>TOAST</title> <indexterm> <primary>TOAST</primary> </indexterm> <indexterm><primary>sliced bread</><see>TOAST</></indexterm><para>This section provides an overview of <acronym>TOAST</> (TheOversized-Attribute Storage Technique).</para><para>Since <productname>PostgreSQL</productname> uses a fixed page size (commonly8Kb), and does not allow tuples to span multiple pages, it's not possible tostore very large field values directly. Before <productname>PostgreSQL</> 7.1there was a hard limit of just under one page on the total amount of data thatcould be put into a table row. In release 7.1 and later, this limit isovercome by allowing large field values to be compressed and/or broken up intomultiple physical rows. This happens transparently to the user, with onlysmall impact on most of the backend code. The technique is affectionatelyknown as <acronym>TOAST</> (or <quote>the best thing since sliced bread</>).</para><para>Only certain data types support <acronym>TOAST</> — there is no need toimpose the overhead on data types that cannot produce large field values.To support <acronym>TOAST</>, a data type must have a variable-length(<firstterm>varlena</>) representation, in which the first 32-bit word of anystored value contains the total length of the value in bytes (includingitself). <acronym>TOAST</> does not constrain the rest of the representation.All the C-level functions supporting a <acronym>TOAST</>-able data type mustbe careful to handle <acronym>TOAST</>ed input values. (This is normally doneby invoking <function>PG_DETOAST_DATUM</> before doing anything with an inputvalue, but in some cases more efficient approaches are possible.)</para><para><acronym>TOAST</> usurps the high-order two bits of the varlena length word,thereby limiting the logical size of any value of a <acronym>TOAST</>-abledata type to 1Gb (2<superscript>30</> - 1 bytes). When both bits are zero,the value is an ordinary un-<acronym>TOAST</>ed value of the data type. Oneof these bits, if set, indicates that the value has been compressed and mustbe decompressed before use. The other bit, if set, indicates that the valuehas been stored out-of-line. In this case the remainder of the value isactually just a pointer, and the correct data has to be found elsewhere. Whenboth bits are set, the out-of-line data has been compressed too. In each casethe length in the low-order bits of the varlena word indicates the actual sizeof the datum, not the size of the logical value that would be extracted bydecompression or fetching of the out-of-line data.</para><para>If any of the columns of a table are <acronym>TOAST</>-able, the table willhave an associated <acronym>TOAST</> table, whose OID is stored in the table's<structname>pg_class</>.<structfield>reltoastrelid</> entry. Out-of-line<acronym>TOAST</>ed values are kept in the <acronym>TOAST</> table, asdescribed in more detail below.</para><para>The compression technique used is a fairly simple and very fast memberof the LZ family of compression techniques. See<filename>src/backend/utils/adt/pg_lzcompress.c</> for the details.</para><para>Out-of-line values are divided (after compression if used) into chunks of atmost <literal>TOAST_MAX_CHUNK_SIZE</> bytes (this value is a little less than<literal>BLCKSZ/4</>, or about 2000 bytes by default). Each chunk is storedas a separate row in the <acronym>TOAST</> table for the owning table. Every<acronym>TOAST</> table has the columns <structfield>chunk_id</> (an OIDidentifying the particular <acronym>TOAST</>ed value),<structfield>chunk_seq</> (a sequence number for the chunk within its value),and <structfield>chunk_data</> (the actual data of the chunk). A unique indexon <structfield>chunk_id</> and <structfield>chunk_seq</> provides fastretrieval of the values. A pointer datum representing an out-of-line<acronym>TOAST</>ed value therefore needs to store the OID of the<acronym>TOAST</> table in which to look and the OID of the specific value(its <structfield>chunk_id</>). For convenience, pointer datums also store thelogical datum size (original uncompressed data length) and actual stored size(different if compression was applied). Allowing for the varlena header word,the total size of a <acronym>TOAST</> pointer datum is therefore 20 bytesregardless of the actual size of the represented value.</para><para>The <acronym>TOAST</> code is triggered onlywhen a row value to be stored in a table is wider than <literal>BLCKSZ/4</>bytes (normally 2Kb). The <acronym>TOAST</> code will compress and/or movefield values out-of-line until the row value is shorter than<literal>BLCKSZ/4</> bytes or no more gains can be had. During an UPDATEoperation, values of unchanged fields are normally preserved as-is; so anUPDATE of a row with out-of-line values incurs no <acronym>TOAST</> costs ifnone of the out-of-line values change.</para><para>The <acronym>TOAST</> code recognizes four different strategies for storing<acronym>TOAST</>-able columns: <itemizedlist> <listitem> <para> <literal>PLAIN</literal> prevents either compression or out-of-line storage. This is the only possible strategy for columns of non-<acronym>TOAST</>-able data types. </para> </listitem> <listitem> <para> <literal>EXTENDED</literal> allows both compression and out-of-line storage. This is the default for most <acronym>TOAST</>-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big. </para> </listitem> <listitem> <para> <literal>EXTERNAL</literal> allows out-of-line storage but not compression. Use of <literal>EXTERNAL</literal> will make substring operations on wide <type>text</type> and <type>bytea</type> columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed. </para> </listitem> <listitem> <para> <literal>MAIN</literal> allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough.) </para> </listitem> </itemizedlist>Each <acronym>TOAST</>-able data type specifies a default strategy for columnsof that data type, but the strategy for a given table column can be alteredwith <command>ALTER TABLE SET STORAGE</>.</para><para>This scheme has a number of advantages compared to a more straightforwardapproach such as allowing row values to span pages. Assuming that queries areusually qualified by comparisons against relatively small key values, most ofthe work of the executor will be done using the main row entry. The big valuesof <acronym>TOAST</>ed attributes will only be pulled out (if selected at all)at the time the result set is sent to the client. Thus, the main table is muchsmaller and more of its rows fit in the shared buffer cache than would be thecase without any out-of-line storage. Sort sets shrink also, and sorts willmore often be done entirely in memory. A little test showed that a tablecontaining typical HTML pages and their URLs was stored in about half of theraw data size including the <acronym>TOAST</> table, and that the main tablecontained only about 10% of the entire data (the URLs and some small HTMLpages). There was no run time difference compared to an un-<acronym>TOAST</>edcomparison table, in which all the HTML pages were cut down to 7Kb to fit.</para></sect1><sect1 id="storage-page-layout"><title>Database Page Layout</title><para>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -