$PostgreSQL: pgsql/src/backend/access/hash/README,v 1.4 2003/11/29 19:51:40 pgsql Exp $

This directory contains an implementation of hash indexing for Postgres.

A hash index consists of two or more "buckets", into which tuples are placed whenever their hash key maps to the bucket number.  The key-to-bucket-number mapping is chosen so that the index can be incrementally expanded.  When a new bucket is to be added to the index, exactly one existing bucket will need to be "split", with some of its tuples being transferred to the new bucket according to the updated key-to-bucket-number mapping.  This is essentially the same hash table management technique embodied in src/backend/utils/hash/dynahash.c for in-memory hash tables.

Each bucket in the hash index comprises one or more index pages.  The bucket's first page is permanently assigned to it when the bucket is created.  Additional pages, called "overflow pages", are added if the bucket receives too many tuples to fit in the primary bucket page.  The pages of a bucket are chained together in a doubly-linked list using fields in the index page special space.

There is currently no provision to shrink a hash index, other than by rebuilding it with REINDEX.  Overflow pages can be recycled for reuse in other buckets, but we never give them back to the operating system.  There is no provision for reducing the number of buckets, either.

Page addressing
---------------

There are four kinds of pages in a hash index: the meta page (page zero), which contains statically allocated control information; primary bucket pages; overflow pages; and bitmap pages, which keep track of overflow pages that have been freed and are available for re-use.  For addressing purposes, bitmap pages are regarded as a subset of the overflow pages.

Primary bucket pages and overflow pages are allocated independently (since any given index might need more or fewer overflow pages relative to its number of buckets).  The hash code uses an interesting set of addressing rules to support a variable number of overflow pages while not having to move primary bucket pages around after they are created.

Primary bucket pages (henceforth just "bucket pages") are allocated in power-of-2 groups, called "split points" in the code.  Buckets 0 and 1 are created when the index is initialized.  At the first split, buckets 2 and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated; when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket pages of a power-of-2 group appear consecutively in the index.  This addressing scheme allows the physical location of a bucket page to be computed from the bucket number relatively easily, using only a small amount of control information.  We take the log2() of the bucket number to determine which split point S the bucket belongs to, and then simply add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the metapage) to compute the physical address.  hashm_spares[S] can be interpreted as the total number of overflow pages that have been allocated before the bucket pages of splitpoint S.  hashm_spares[0] is always 0, so that buckets 0 and 1 (which belong to splitpoint 0) always appear at block numbers 1 and 2, just after the meta page.  We always have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the former.
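As a concrete illustration of this addressing rule, here is a minimal C sketch (an illustration only, not the actual source; "spares" stands in for the metapage's hashm_spares[] array):

    #include <stdint.h>

    /*
     * Splitpoint S of a bucket is floor(log2(bucket number));
     * buckets 0 and 1 both belong to splitpoint 0.
     */
    static uint32_t
    splitpoint_of_bucket(uint32_t bucket)
    {
        uint32_t s = 0;

        while (bucket > 1)
        {
            bucket >>= 1;
            s++;
        }
        return s;
    }

    /*
     * Block number = bucket number + spares[S] + 1: skip the meta page
     * (block 0) and every overflow page allocated before the bucket
     * pages of splitpoint S.
     */
    static uint32_t
    bucket_to_blkno(const uint32_t *spares, uint32_t bucket)
    {
        return bucket + spares[splitpoint_of_bucket(bucket)] + 1;
    }

With spares[0] = 0, this places buckets 0 and 1 at blocks 1 and 2, matching the statement above.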
The difference between hashm_spares[N] and hashm_spares[N+1] represents the number of overflow pages appearing between the bucket page groups of splitpoints N and N+1.

When S splitpoints exist altogether, the array entries hashm_spares[0] through hashm_spares[S] are valid; hashm_spares[S] records the current total number of overflow pages.  New overflow pages are created as needed at the end of the index, and recorded by incrementing hashm_spares[S].  When it is time to create a new splitpoint's worth of bucket pages, we copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is stored in the hashm_ovflpoint field of the meta page).  This has the effect of reserving the correct number of bucket pages at the end of the index, and preparing to allocate additional overflow pages after those bucket pages.  hashm_spares[] entries before S cannot change anymore, since that would require moving already-created bucket pages.

Since overflow pages may be recycled if enough tuples are deleted from their bucket, we need a way to keep track of currently-free overflow pages.  The state of each overflow page (0 = available, 1 = not available) is recorded in "bitmap" pages dedicated to this purpose.  The entries in the bitmap are indexed by "bit number", a zero-based count in which every overflow page has a unique entry.  We can convert between an overflow page's physical block number and its bit number using the information in hashm_spares[] (see hashovfl.c for details).  The bit number sequence includes the bitmap pages, which is the reason for saying that bitmap pages are a subset of the overflow pages.  It turns out in fact that each bitmap page's first bit represents itself --- this is not an essential property, but falls out of the fact that we only allocate another bitmap page when we really need one.  Bit number zero always corresponds to block number 3, which is the first bitmap page and is allocated during index creation.
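The conversion itself lives in hashovfl.c; what follows is only a sketch of the bit-number-to-block-number direction, reconstructed from the layout rules above ("spares" and "ovflpoint" stand in for hashm_spares[] and hashm_ovflpoint; treat it as illustrative, not authoritative):

    #include <stdint.h>

    /*
     * Convert an overflow page's zero-based bit number into its physical
     * block number.
     */
    static uint32_t
    ovfl_bitno_to_blkno(const uint32_t *spares, uint32_t ovflpoint, uint32_t bitno)
    {
        uint32_t page = bitno + 1;  /* 1-based position in the overflow sequence */
        uint32_t i;

        /* find the splitpoint group whose overflow region contains this page */
        for (i = 1; i < ovflpoint && page > spares[i]; i++)
            ;

        /*
         * Before this page lie the meta page, the 2^i bucket pages of
         * splitpoint groups 0 .. i-1, and "bitno" earlier overflow pages,
         * so its block number is 1 + 2^i + bitno = (1 << i) + page.
         */
        return (uint32_t) ((1 << i) + page);
    }

For bit number zero this yields block 3, consistent with the statement above.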
Lock definitions
----------------

We use both lmgr locks ("heavyweight" locks) and buffer context locks (LWLocks) to control access to a hash index.  lmgr locks are needed for long-term locking since there is a (small) risk of deadlock, which we must be able to detect.  Buffer context locks are used for short-term access control to individual pages of the index.

We define the following lmgr locks for a hash index:

LockPage(rel, 0) represents the right to modify the hash-code-to-bucket mapping.  A process attempting to enlarge the hash table by splitting a bucket must exclusive-lock this lock before modifying the metapage data representing the mapping.  Processes intending to access a particular bucket must share-lock this lock until they have acquired lock on the correct target bucket.

LockPage(rel, page), where page is the page number of a hash bucket page, represents the right to split or compact an individual bucket.  A process splitting a bucket must exclusive-lock both old and new halves of the bucket until it is done.  A process doing VACUUM must exclusive-lock the bucket it is currently purging tuples from.  Processes doing scans or insertions must share-lock the bucket they are scanning or inserting into.  (It is okay to allow concurrent scans and insertions.)

The lmgr lock IDs corresponding to overflow pages are currently unused.  These are available for possible future refinements.

Note that these lock definitions are conceptually distinct from any sort of lock on the pages whose numbers they share.  A process must also obtain read or write buffer lock on the metapage or bucket page before accessing said page.

Processes performing hash index scans must hold share lock on the bucket they are scanning throughout the scan.  This seems to be essential, since there is no reasonable way for a scan to cope with its bucket being split underneath it.  This creates a possibility of deadlock external to the hash index code, since a process holding one of these locks could block waiting for an unrelated lock held by another process.  If that process then does something that requires exclusive lock on the bucket, we have deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock can be detected and recovered from.  This also forces the page-zero lock to be an lmgr lock, because as we'll see below it is held while attempting to acquire a bucket lock, and so it could also participate in a deadlock.

Processes must obtain read (share) buffer context lock on any hash index page while reading it, and write (exclusive) lock while modifying it.  To prevent deadlock we enforce these coding rules: no buffer lock may be held long term (across index AM calls), nor may any buffer lock be held while waiting for an lmgr lock, nor may more than one buffer lock be held at a time by any one process.  (The third restriction is probably stronger than necessary, but it makes the proof of no deadlock obvious.)

Pseudocode algorithms
---------------------

The operations we need to support are: readers scanning the index for entries of a particular hash code (which by definition are all in the same bucket); insertion of a new tuple into the correct bucket; enlarging the hash table by splitting an existing bucket; and garbage collection (deletion of dead tuples and compaction of buckets).  Bucket splitting is done at conclusion of any insertion that leaves the hash table more full than the target load factor, but it is convenient to consider it as an independent operation.  Note that we do not have a bucket-merge operation --- the number of buckets never shrinks.  Insertion, splitting, and garbage collection may all need access to freelist management, which keeps track of available overflow pages.

The reader algorithm is:

    share-lock page 0 (to prevent active split)
    read/sharelock meta page
    compute bucket number for target hash key
    release meta page
    share-lock bucket page (to prevent split/compact of this bucket)
    release page 0 share-lock
-- then, per read request:
    read/sharelock current page of bucket
        step to next page if necessary (no chaining of locks)
    get tuple
    release current page
-- at scan shutdown:
    release bucket share-lock

By holding the page-zero lock until lock on the target bucket is obtained, the reader ensures that the target bucket calculation is valid (otherwise the bucket might be split before the reader arrives at it, and the target entries might go into the new bucket).  Holding the bucket sharelock for the remainder of the scan prevents the reader's current-tuple pointer from being invalidated by other processes.  Notice though that the reader need not prevent other buckets from being split or compacted.
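The "compute bucket number for target hash key" step relies on the incrementally expandable key-to-bucket mapping described at the top of this file.  A minimal sketch of such a mapping, in the spirit of the dynahash.c technique mentioned there (the maxbucket/highmask/lowmask names are illustrative, standing in for control data kept in the meta page):

    #include <stdint.h>

    /*
     * Map a hash key to a bucket number in a way that stays valid as
     * buckets are added one at a time.
     */
    static uint32_t
    hashkey_to_bucket(uint32_t hashkey,
                      uint32_t maxbucket,   /* highest bucket number in use */
                      uint32_t highmask,    /* power-of-2-minus-1 mask covering all current buckets */
                      uint32_t lowmask)     /* the previous such mask, i.e. highmask >> 1 */
    {
        uint32_t bucket = hashkey & highmask;

        /*
         * Buckets beyond maxbucket have not been split off yet; their
         * tuples still live in the lower-numbered bucket they will
         * eventually be split from.
         */
        if (bucket > maxbucket)
            bucket = bucket & lowmask;

        return bucket;
    }

Splitting a bucket under this scheme only has to move the tuples whose keys map to the new bucket under the updated masks, as described at the top of this file.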
The insertion algorithm is rather similar:

    share-lock page 0 (to prevent active split)
    read/sharelock meta page
    compute bucket number for target hash key
    release meta page
    share-lock bucket page (to prevent split/compact of this bucket)
    release page 0 share-lock
-- (so far same as reader)
    read/exclusive-lock current page of bucket
    if full, release, read/exclusive-lock next page; repeat as needed
    >> see below if no space in any page of bucket
    insert tuple
    write/release current page
    release bucket share-lock
    read/exclusive-lock meta page
    increment tuple count, decide if split needed
    write/release meta page
    done if no split needed, else enter Split algorithm below

It is okay for an insertion to take place in a bucket that is being actively scanned, because it does not change the position of any existing item in the bucket, so scan states are not invalidated.  We only need the short-term buffer locks to ensure that readers do not see a partially-updated page.
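The "decide if split needed" step compares the index's fullness against the target load factor mentioned at the start of this section.  A minimal sketch of that decision (assuming the meta page keeps a running tuple count and a target number of tuples per bucket; the names below are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * After an insertion, report whether the table is now fuller than the
     * target load factor, i.e. whether a bucket split should be attempted.
     */
    static bool
    split_needed(double ntuples,      /* total tuples currently in the index */
                 uint32_t maxbucket,  /* highest bucket number in use */
                 uint32_t ffactor)    /* target tuples per bucket */
    {
        /* maxbucket + 1 is the current number of buckets */
        return ntuples > (double) ffactor * (maxbucket + 1);
    }

If this returns true, the inserter proceeds to the Split algorithm below.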
