📄 btree.c
/*
** 2004 April 6
**
** The author disclaims copyright to this source code.  In place of
** a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
** $Id: btree.c,v 1.261 2005/05/24 20:19:58 drh Exp $
**
** This file implements an external (disk-based) database using BTrees.
** For a detailed discussion of BTrees, refer to
**
**     Donald E. Knuth, THE ART OF COMPUTER PROGRAMMING, Volume 3:
**     "Sorting And Searching", pages 473-480. Addison-Wesley
**     Publishing Company, Reading, Massachusetts.
**
** The basic idea is that each page of the file contains N database
** entries and N+1 pointers to subpages.
**
**   ----------------------------------------------------------------
**   |  Ptr(0) | Key(0) | Ptr(1) | Key(1) | ... | Key(N) | Ptr(N+1) |
**   ----------------------------------------------------------------
**
** All of the keys on the page that Ptr(0) points to have values less
** than Key(0).  All of the keys on page Ptr(1) and its subpages have
** values greater than Key(0) and less than Key(1).  All of the keys
** on Ptr(N+1) and its subpages have values greater than Key(N).  And
** so forth.
**
** Finding a particular key requires reading O(log(M)) pages from the
** disk where M is the number of entries in the tree.
**
** In this implementation, a single file can hold one or more separate
** BTrees.  Each BTree is identified by the index of its root page.  The
** key and data for any entry are combined to form the "payload".  A
** fixed amount of payload can be carried directly on the database
** page.  If the payload is larger than the preset amount then surplus
** bytes are stored on overflow pages.  The payload for an entry
** and the preceding pointer are combined to form a "Cell".
** Each page has a small header which contains the Ptr(N+1) pointer and
** other information such as the size of key and data.
**
** FORMAT DETAILS
**
** The file is divided into pages.  The first page is called page 1,
** the second is page 2, and so forth.  A page number of zero indicates
** "no such page".  The page size can be anything between 512 and 65536.
** Each page can be either a btree page, a freelist page or an overflow
** page.
**
** The first page is always a btree page.  The first 100 bytes of the first
** page contain a special header (the "file header") that describes the file.
** The format of the file header is as follows:
**
**   OFFSET   SIZE    DESCRIPTION
**      0      16     Header string: "SQLite format 3\000"
**     16       2     Page size in bytes.
**     18       1     File format write version
**     19       1     File format read version
**     20       1     Bytes of unused space at the end of each page
**     21       1     Max embedded payload fraction
**     22       1     Min embedded payload fraction
**     23       1     Min leaf payload fraction
**     24       4     File change counter
**     28       4     Reserved for future use
**     32       4     First freelist page
**     36       4     Number of freelist pages in the file
**     40      60     15 4-byte meta values passed to higher layers
**
** All of the integer values are big-endian (most significant byte first).
**
** The file change counter is incremented when the database is changed more
** than once within the same second.  This counter, together with the
** modification time of the file, allows other processes to know
** when the file has changed and thus when they need to flush their
** cache.
**
** The max embedded payload fraction is the amount of the total usable
** space in a page that can be consumed by a single cell for standard
** B-tree (non-LEAFDATA) tables.  A value of 255 means 100%.  The default
** is to limit the maximum cell size so that at least 4 cells will fit
** on one page.  Thus the default max embedded payload fraction is 64.
**
** If the payload for a cell is larger than the max payload, then extra
** payload is spilled to overflow pages.
** Once an overflow page is allocated, as many bytes as possible are moved
** into the overflow pages without letting the cell size drop below the
** min embedded payload fraction.
**
** The min leaf payload fraction is like the min embedded payload fraction
** except that it applies to leaf nodes in a LEAFDATA tree.  The maximum
** payload fraction for a LEAFDATA tree is always 100% (or 255) and it
** is not specified in the header.
**
** Each btree page is divided into three sections:  The header, the
** cell pointer array, and the cell content area.  Page 1 also has a 100-byte
** file header that occurs before the page header.
**
**      |----------------|
**      | file header    |   100 bytes.  Page 1 only.
**      |----------------|
**      | page header    |   8 bytes for leaves.  12 bytes for interior nodes
**      |----------------|
**      | cell pointer   |   |  2 bytes per cell.  Sorted order.
**      | array          |   |  Grows downward
**      |                |   v
**      |----------------|
**      | unallocated    |
**      | space          |
**      |----------------|   ^  Grows upwards
**      | cell content   |   |  Arbitrary order interspersed with freeblocks
**      | area           |   |  and free space fragments.
**      |----------------|
**
** The page header looks like this:
**
**   OFFSET   SIZE     DESCRIPTION
**      0       1      Flags. 1: intkey, 2: zerodata, 4: leafdata, 8: leaf
**      1       2      byte offset to the first freeblock
**      3       2      number of cells on this page
**      5       2      first byte of the cell content area
**      7       1      number of fragmented free bytes
**      8       4      Right child (the Ptr(N+1) value).  Omitted on leaves.
**
** The flags define the format of this btree page.  The leaf flag means that
** this page has no children.  The zerodata flag means that this page carries
** only keys and no data.  The intkey flag means that the key is an integer
** which is stored in the key size entry of the cell header rather than in
** the payload area.
**
** The cell pointer array begins on the first byte after the page header.
** The cell pointer array contains zero or more 2-byte numbers which are
** offsets from the beginning of the page to the cell content in the cell
** content area.
** The cell pointers occur in sorted order.  The system strives to keep
** free space after the last cell pointer so that new cells can be easily
** added without having to defragment the page.
**
** Cell content is stored at the very end of the page and grows toward the
** beginning of the page.
**
** Unused space within the cell content area is collected into a linked list of
** freeblocks.  Each freeblock is at least 4 bytes in size.  The byte offset
** to the first freeblock is given in the header.  Freeblocks occur in
** increasing order.  Because a freeblock must be at least 4 bytes in size,
** any group of 3 or fewer unused bytes in the cell content area cannot
** exist on the freeblock chain.  A group of 3 or fewer free bytes is called
** a fragment.  The total number of bytes in all fragments is recorded
** in the page header at offset 7.
**
**    SIZE    DESCRIPTION
**      2     Byte offset of the next freeblock
**      2     Bytes in this freeblock
**
** Cells are of variable length.  Cells are stored in the cell content area at
** the end of the page.  Pointers to the cells are in the cell pointer array
** that immediately follows the page header.  Cells are not necessarily
** contiguous or in order, but cell pointers are contiguous and in order.
**
** Cell content makes use of variable length integers.  A variable
** length integer is 1 to 9 bytes where the lower 7 bits of each
** byte are used.  The integer consists of all bytes that have bit 8 set and
** the first byte with bit 8 clear.  The most significant byte of the integer
** appears first.  A variable-length integer may not be more than 9 bytes long.
** As a special case, all 8 bits of the 9th byte are used as data.
** This allows a 64-bit integer to be encoded in 9 bytes.
**
**    0x00                      becomes  0x00000000
**    0x7f                      becomes  0x0000007f
**    0x81 0x00                 becomes  0x00000080
**    0x82 0x00                 becomes  0x00000100
**    0x80 0x7f                 becomes  0x0000007f
**    0x81 0x91 0xd1 0xac 0x78  becomes  0x12345678
**    0x81 0x81 0x81 0x81 0x01  becomes  0x10204081
**
** Variable length integers are used for rowids and to hold the number of
** bytes of key and data in a btree cell.
**
** The content of a cell looks like this:
**
**    SIZE    DESCRIPTION
**      4     Page number of the left child. Omitted if leaf flag is set.
**     var    Number of bytes of data. Omitted if the zerodata flag is set.
**     var    Number of bytes of key. Or the key itself if intkey flag is set.
**      *     Payload
**      4     First page of the overflow chain.  Omitted if no overflow
**
** Overflow pages form a linked list.  Each page except the last is completely
** filled with data (pagesize - 4 bytes).  The last page can have as little
** as 1 byte of data.
**
**    SIZE    DESCRIPTION
**      4     Page number of next overflow page
**      *     Data
**
** Freelist pages come in two subtypes: trunk pages and leaf pages.  The
** file header points to the first in a linked list of trunk pages.  Each
** trunk page points to multiple leaf pages.  The content of a leaf page is
** unspecified.  A trunk page looks like this:
**
**    SIZE    DESCRIPTION
**      4     Page number of next trunk page
**      4     Number of leaf pointers on this page
**      *     zero or more page numbers of leaves
*/
#include "sqliteInt.h"
#include "pager.h"
#include "btree.h"
#include "os.h"
#include <assert.h>

/* Round up a number to the next larger multiple of 8.  This is used
** to force 8-byte alignment on 64-bit architectures.
*/
#define ROUND8(x)   ((x+7)&~7)


/* The following value is the maximum cell size assuming a maximum page
** size given above.
*/
#define MX_CELL_SIZE(pBt)  (pBt->pageSize-8)

/* The maximum number of cells on a single page of the database.  This
** assumes a minimum cell size of 3 bytes.
** Such small cells will be exceedingly rare, but they are possible.
*/
#define MX_CELL(pBt) ((pBt->pageSize-8)/3)

/* Forward declarations */
typedef struct MemPage MemPage;

/*
** This is a magic string that appears at the beginning of every
** SQLite database in order to identify the file as a real database.
**                                  123456789 123456 */
static const char zMagicHeader[] = "SQLite format 3";

/*
** Page type flags.  An ORed combination of these flags appears as the
** first byte of every BTree page.
*/
#define PTF_INTKEY    0x01
#define PTF_ZERODATA  0x02
#define PTF_LEAFDATA  0x04
#define PTF_LEAF      0x08

/*
** As each page of the file is loaded into memory, an instance of the following
** structure is appended and initialized to zero.  This structure stores
** information about the page that is decoded from the raw file page.
**
** The pParent field points back to the parent page.  This allows us to
** walk up the BTree from any leaf to the root.  Care must be taken to
** unref() the parent page pointer when this page is no longer referenced.
** The pageDestructor() routine handles that chore.
*/
struct MemPage {
  u8 isInit;           /* True if previously initialized.  MUST BE FIRST! */
  u8 idxShift;         /* True if Cell indices have changed */
  u8 nOverflow;        /* Number of overflow cell bodies in aCell[] */
  u8 intKey;           /* True if intkey flag is set */
  u8 leaf;             /* True if leaf flag is set */
  u8 zeroData;         /* True if table stores keys only */
  u8 leafData;         /* True if table stores data on leaves only */
  u8 hasData;          /* True if this page stores data */
  u8 hdrOffset;        /* 100 for page 1.  0 otherwise */
  u8 childPtrSize;     /* 0 if leaf==1.  4 if leaf==0 */
  u16 maxLocal;        /* Copy of Btree.maxLocal or Btree.maxLeaf */
  u16 minLocal;        /* Copy of Btree.minLocal or Btree.minLeaf */
  u16 cellOffset;      /* Index in aData of first cell pointer */
  u16 idxParent;       /* Index in parent of this node */
  u16 nFree;           /* Number of free bytes on the page */
  u16 nCell;           /* Number of cells on this page, local and ovfl */
  struct _OvflCell {   /* Cells that will not fit on aData[] */
    u8 *pCell;           /* Pointers to the body of the overflow cell */
    u16 idx;             /* Insert this cell before idx-th non-overflow cell */
  } aOvfl[5];
  struct Btree *pBt;   /* Pointer back to BTree structure */
  u8 *aData;           /* Pointer back to the start of the page */
  Pgno pgno;           /* Page number for this page */
  MemPage *pParent;    /* The parent of this page.  NULL for root */
};

/*
** The in-memory image of a disk page has the auxiliary information appended
** to the end.  EXTRA_SIZE is the number of bytes of space needed to hold
** that extra information.
*/
#define EXTRA_SIZE sizeof(MemPage)

/*
** Everything we need to know about an open database
*/
struct Btree {
  Pager *pPager;        /* The page cache */
  BtCursor *pCursor;    /* A list of all open cursors */
  MemPage *pPage1;      /* First page of the database */
  u8 inTrans;           /* True if a transaction is in progress */
  u8 inStmt;            /* True if we are in a statement subtransaction */
  u8 readOnly;          /* True if the underlying file is readonly */
  u8 maxEmbedFrac;      /* Maximum payload as % of total page size */
  u8 minEmbedFrac;      /* Minimum payload as % of total page size */
  u8 minLeafFrac;       /* Minimum leaf payload as % of total page size */
  u8 pageSizeFixed;     /* True if the page size can no longer be changed */
#ifndef SQLITE_OMIT_AUTOVACUUM
  u8 autoVacuum;        /* True if database supports auto-vacuum */
#endif
  u16 pageSize;         /* Total number of bytes on a page */
  u16 usableSize;       /* Number of usable bytes on each page */
  int maxLocal;         /* Maximum local payload in non-LEAFDATA tables */
  int minLocal;         /* Minimum local payload in non-LEAFDATA tables */
  int maxLeaf;          /* Maximum local payload in a LEAFDATA table */
  int minLeaf;          /* Minimum local payload in a LEAFDATA table */
  BusyHandler *pBusyHandler;   /* Callback for when there is lock contention */
};
typedef Btree Bt;

/*
** Btree.inTrans may take one of the following values.
*/
#define TRANS_NONE  0
#define TRANS_READ  1
#define TRANS_WRITE 2

/*
** An instance of the following structure is used to hold information
** about a cell.  The parseCellPtr() function fills in this structure
** based on information extracted from the raw disk page.
*/
typedef struct CellInfo CellInfo;
struct CellInfo {
  u8 *pCell;     /* Pointer to the start of cell content */
  i64 nKey;      /* The key for INTKEY tables, or number of bytes in key */
  u32 nData;     /* Number of bytes of data */
  u16 nHeader;   /* Size of the cell content header in bytes */
  u16 nLocal;    /* Amount of payload held locally */
  u16 iOverflow; /* Offset to overflow page number.  Zero if no overflow */
  u16 nSize;     /* Size of the cell content on the main b-tree page */
};

/*
** A cursor is a pointer to a particular entry in the BTree.
** The entry is identified by its MemPage and the index in
** MemPage.aCell[] of the entry.
*/
struct BtCursor {
  Btree *pBt;               /* The Btree to which this cursor belongs */
  BtCursor *pNext, *pPrev;  /* Forms a linked list of all cursors */
  int (*xCompare)(void*,int,const void*,int,const void*); /* Key comp func */
  void *pArg;               /* First arg to xCompare() */
  Pgno pgnoRoot;            /* The root page of this tree */
  MemPage *pPage;           /* Page that contains the entry */
  int idx;                  /* Index of the entry in pPage->aCell[] */
  CellInfo info;            /* A parse of the cell we are pointing at */
  u8 wrFlag;                /* True if writable */
  u8 isValid;               /* TRUE if points to a valid entry */
};

/*
** The TRACE macro will print high-level status information about the
** btree operation when the global variable sqlite3_btree_trace is
** enabled.
*/
#if SQLITE_TEST
# define TRACE(X)   if( sqlite3_btree_trace )\
                        { sqlite3DebugPrintf X; fflush(stdout); }
#else
# define TRACE(X)
#endif
int sqlite3_btree_trace=0;  /* True to enable tracing */

/*
** Forward declaration
*/
static int checkReadLocks(Btree*,Pgno,BtCursor*);

/*
** Read or write two- and four-byte big-endian integer values.
*/
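
The listing is cut off before the routines that the last comment introduces. As a rough illustration only, the sketch below shows one way the big-endian accessors and the variable-length integer decoding described in the header comment could be written. Every name in it (`sketch_get2`, `sketch_put2`, `sketch_get4`, `sketch_put4`, `sketch_get_varint`) is hypothetical and is not taken from the file, and it uses `<stdint.h>` types instead of the `u8`/`u32` typedefs from sqliteInt.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not the routines of btree.c. */

/* Read a 2-byte big-endian integer starting at p. */
static uint32_t sketch_get2(const unsigned char *p){
  return ((uint32_t)p[0]<<8) | p[1];
}

/* Write the low 16 bits of v at p, most significant byte first. */
static void sketch_put2(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>8);
  p[1] = (unsigned char)v;
}

/* Read a 4-byte big-endian integer starting at p. */
static uint32_t sketch_get4(const unsigned char *p){
  return ((uint32_t)p[0]<<24) | ((uint32_t)p[1]<<16)
       | ((uint32_t)p[2]<<8)  | p[3];
}

/* Write v at p as 4 bytes, most significant byte first. */
static void sketch_put4(unsigned char *p, uint32_t v){
  p[0] = (unsigned char)(v>>24);
  p[1] = (unsigned char)(v>>16);
  p[2] = (unsigned char)(v>>8);
  p[3] = (unsigned char)v;
}

/*
** Decode a variable-length integer starting at p.  Store the value in
** *pv and return the number of bytes consumed (1 to 9).  Each of the
** first 8 bytes contributes its lower 7 bits; a clear high bit ends
** the integer.  If a 9th byte is reached, all 8 of its bits are data.
*/
static int sketch_get_varint(const unsigned char *p, uint64_t *pv){
  uint64_t v = 0;
  int i;
  assert( p!=0 && pv!=0 );
  for(i=0; i<8; i++){
    v = (v<<7) | (uint64_t)(p[i] & 0x7f);
    if( (p[i] & 0x80)==0 ){
      *pv = v;
      return i+1;
    }
  }
  v = (v<<8) | p[8];
  *pv = v;
  return 9;
}
```

With this sketch, decoding the two bytes 0x81 0x00 stores 0x80 and returns 2, and decoding 0x81 0x81 0x81 0x81 0x01 stores 0x10204081 and returns 5, in line with the table in the header comment.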