📄 fts2.c
字号:
/*** 2006 Oct 10**** The author disclaims copyright to this source code. In place of** a legal notice, here is a blessing:**** May you do good and not evil.** May you find forgiveness for yourself and forgive others.** May you share freely, never taking more than you give.************************************************************************************ This is an SQLite module implementing full-text search.*//*** The code in this file is only compiled if:**** * The FTS2 module is being built as an extension** (in which case SQLITE_CORE is not defined), or**** * The FTS2 module is being built into the core of** SQLite (in which case SQLITE_ENABLE_FTS2 is defined).*//* TODO(shess) Consider exporting this comment to an HTML file or the** wiki.*//* The full-text index is stored in a series of b+tree (-like)** structures called segments which map terms to doclists. The** structures are like b+trees in layout, but are constructed from the** bottom up in optimal fashion and are not updatable. Since trees** are built from the bottom up, things will be described from the** bottom up.******** Varints ****** The basic unit of encoding is a variable-length integer called a** varint. We encode variable-length integers in little-endian order** using seven bits * per byte as follows:**** KEY:** A = 0xxxxxxx 7 bits of data and one flag bit** B = 1xxxxxxx 7 bits of data and one flag bit**** 7 bits - A** 14 bits - BA** 21 bits - BBA** and so on.**** This is identical to how sqlite encodes varints (see util.c).******** Document lists ****** A doclist (document list) holds a docid-sorted list of hits for a** given term. Doclists hold docids, and can optionally associate** token positions and offsets with docids.**** A DL_POSITIONS_OFFSETS doclist is stored like this:**** array {** varint docid;** array { (position list for column 0)** varint position; (delta from previous position plus POS_BASE)** varint startOffset; (delta from previous startOffset)** varint endOffset; (delta from startOffset)** }** array {** varint POS_COLUMN; (marks start of position list for new column)** varint column; (index of new column)** array {** varint position; (delta from previous position plus POS_BASE)** varint startOffset;(delta from previous startOffset)** varint endOffset; (delta from startOffset)** }** }** varint POS_END; (marks end of positions for this document.** }**** Here, array { X } means zero or more occurrences of X, adjacent in** memory. A "position" is an index of a token in the token stream** generated by the tokenizer, while an "offset" is a byte offset,** both based at 0. Note that POS_END and POS_COLUMN occur in the** same logical place as the position element, and act as sentinals** ending a position list array.**** A DL_POSITIONS doclist omits the startOffset and endOffset** information. A DL_DOCIDS doclist omits both the position and** offset information, becoming an array of varint-encoded docids.**** On-disk data is stored as type DL_DEFAULT, so we don't serialize** the type. Due to how deletion is implemented in the segmentation** system, on-disk doclists MUST store at least positions.******** Segment leaf nodes ****** Segment leaf nodes store terms and doclists, ordered by term. Leaf** nodes are written using LeafWriter, and read using LeafReader (to** iterate through a single leaf node's data) and LeavesReader (to** iterate through a segment's entire leaf layer). Leaf nodes have** the format:**** varint iHeight; (height from leaf level, always 0)** varint nTerm; (length of first term)** char pTerm[nTerm]; (content of first term)** varint nDoclist; (length of term's associated doclist)** char pDoclist[nDoclist]; (content of doclist)** array {** (further terms are delta-encoded)** varint nPrefix; (length of prefix shared with previous term)** varint nSuffix; (length of unshared suffix)** char pTermSuffix[nSuffix];(unshared suffix of next term)** varint nDoclist; (length of term's associated doclist)** char pDoclist[nDoclist]; (content of doclist)** }**** Here, array { X } means zero or more occurrences of X, adjacent in** memory.**** Leaf nodes are broken into blocks which are stored contiguously in** the %_segments table in sorted order. This means that when the end** of a node is reached, the next term is in the node with the next** greater node id.**** New data is spilled to a new leaf node when the current node** exceeds LEAF_MAX bytes (default 2048). New data which itself is** larger than STANDALONE_MIN (default 1024) is placed in a standalone** node (a leaf node with a single term and doclist). The goal of** these settings is to pack together groups of small doclists while** making it efficient to directly access large doclists. The** assumption is that large doclists represent terms which are more** likely to be query targets.**** TODO(shess) It may be useful for blocking decisions to be more** dynamic. For instance, it may make more sense to have a 2.5k leaf** node rather than splitting into 2k and .5k nodes. My intuition is** that this might extend through 2x or 4x the pagesize.******** Segment interior nodes ****** Segment interior nodes store blockids for subtree nodes and terms** to describe what data is stored by the each subtree. Interior** nodes are written using InteriorWriter, and read using** InteriorReader. InteriorWriters are created as needed when** SegmentWriter creates new leaf nodes, or when an interior node** itself grows too big and must be split. The format of interior** nodes:**** varint iHeight; (height from leaf level, always >0)** varint iBlockid; (block id of node's leftmost subtree)** optional {** varint nTerm; (length of first term)** char pTerm[nTerm]; (content of first term)** array {** (further terms are delta-encoded)** varint nPrefix; (length of shared prefix with previous term)** varint nSuffix; (length of unshared suffix)** char pTermSuffix[nSuffix]; (unshared suffix of next term)** }** }**** Here, optional { X } means an optional element, while array { X }** means zero or more occurrences of X, adjacent in memory.**** An interior node encodes n terms separating n+1 subtrees. The** subtree blocks are contiguous, so only the first subtree's blockid** is encoded. The subtree at iBlockid will contain all terms less** than the first term encoded (or all terms if no term is encoded).** Otherwise, for terms greater than or equal to pTerm[i] but less** than pTerm[i+1], the subtree for that term will be rooted at** iBlockid+i. Interior nodes only store enough term data to** distinguish adjacent children (if the rightmost term of the left** child is "something", and the leftmost term of the right child is** "wicked", only "w" is stored).**** New data is spilled to a new interior node at the same height when** the current node exceeds INTERIOR_MAX bytes (default 2048).** INTERIOR_MIN_TERMS (default 7) keeps large terms from monopolizing** interior nodes and making the tree too skinny. The interior nodes** at a given height are naturally tracked by interior nodes at** height+1, and so on.******** Segment directory ****** The segment directory in table %_segdir stores meta-information for** merging and deleting segments, and also the root node of the** segment's tree.**** The root node is the top node of the segment's tree after encoding** the entire segment, restricted to ROOT_MAX bytes (default 1024).** This could be either a leaf node or an interior node. If the top** node requires more than ROOT_MAX bytes, it is flushed to %_segments** and a new root interior node is generated (which should always fit** within ROOT_MAX because it only needs space for 2 varints, the** height and the blockid of the previous root).**** The meta-information in the segment directory is:** level - segment level (see below)** idx - index within level** - (level,idx uniquely identify a segment)** start_block - first leaf node** leaves_end_block - last leaf node** end_block - last block (including interior nodes)** root - contents of root node**** If the root node is a leaf node, then start_block,** leaves_end_block, and end_block are all 0.******** Segment merging ****** To amortize update costs, segments are groups into levels and** merged in matches. Each increase in level represents exponentially** more documents.**** New documents (actually, document updates) are tokenized and** written individually (using LeafWriter) to a level 0 segment, with** incrementing idx. When idx reaches MERGE_COUNT (default 16), all** level 0 segments are merged into a single level 1 segment. Level 1** is populated like level 0, and eventually MERGE_COUNT level 1** segments are merged to a single level 2 segment (representing** MERGE_COUNT^2 updates), and so on.**** A segment merge traverses all segments at a given level in** parallel, performing a straightforward sorted merge. Since segment** leaf nodes are written in to the %_segments table in order, this** merge traverses the underlying sqlite disk structures efficiently.** After the merge, all segment blocks from the merged level are** deleted.**** MERGE_COUNT controls how often we merge segments. 16 seems to be** somewhat of a sweet spot for insertion performance. 32 and 64 show** very similar performance numbers to 16 on insertion, though they're** a tiny bit slower (perhaps due to more overhead in merge-time** sorting). 8 is about 20% slower than 16, 4 about 50% slower than** 16, 2 about 66% slower than 16.**** At query time, high MERGE_COUNT increases the number of segments** which need to be scanned and merged. For instance, with 100k docs** inserted:**** MERGE_COUNT segments** 16 25** 8 12** 4 10** 2 6**** This appears to have only a moderate impact on queries for very** frequent terms (which are somewhat dominated by segment merge** costs), and infrequent and non-existent terms still seem to be fast** even with many segments.**** TODO(shess) That said, it would be nice to have a better query-side** argument for MERGE_COUNT of 16. Also, it's possible/likely that** optimizations to things like doclist merging will swing the sweet** spot around.********** Handling of deletions and updates ****** Since we're using a segmented structure, with no docid-oriented** index into the term index, we clearly cannot simply update the term** index when a document is deleted or updated. For deletions, we** write an empty doclist (varint(docid) varint(POS_END)), for updates** we simply write the new doclist. Segment merges overwrite older** data for a particular docid with newer data, so deletes or updates** will eventually overtake the earlier data and knock it out. The** query logic likewise merges doclists so that newer data knocks out** older data.**** TODO(shess) Provide a VACUUM type operation to clear out all** deletions and duplications. This would basically be a forced merge** into a single segment.*/#if !defined(SQLITE_CORE) || defined(SQLITE_ENABLE_FTS2)#if defined(SQLITE_ENABLE_FTS2) && !defined(SQLITE_CORE)# define SQLITE_CORE 1#endif#include <assert.h>#if !defined(__APPLE__)#include <malloc.h>#endif#include <stdlib.h>#include <stdio.h>#include <string.h>#include <ctype.h>#include "fts2.h"#include "fts2_hash.h"#include "fts2_tokenizer.h"#include "sqlite3.h"#include "sqlite3ext.h"SQLITE_EXTENSION_INIT1/* TODO(shess) MAN, this thing needs some refactoring. At minimum, it** would be nice to order the file better, perhaps something along the** lines of:**** - utility functions** - table setup functions** - table update functions** - table query functions**** Put the query functions last because they're likely to reference** typedefs or functions from the table update section.*/#if 0# define TRACE(A) printf A; fflush(stdout)#else# define TRACE(A)#endif/* It is not safe to call isspace(), tolower(), or isalnum() on** hi-bit-set characters. This is the same solution used in the** tokenizer.*//* TODO(shess) The snippet-generation code should be using the** tokenizer-generated tokens rather than doing its own local** tokenization.*//* TODO(shess) Is __isascii() a portable version of (c&0x80)==0? */static int safe_isspace(char c){ return (c&0x80)==0 ? isspace(c) : 0;}static int safe_tolower(char c){ return (c&0x80)==0 ? tolower(c) : c;}static int safe_isalnum(char c){ return (c&0x80)==0 ? isalnum(c) : 0;}typedef enum DocListType { DL_DOCIDS, /* docids only */ DL_POSITIONS, /* docids + positions */ DL_POSITIONS_OFFSETS /* docids + positions + offsets */} DocListType;/*** By default, only positions and not offsets are stored in the doclists.** To change this so that offsets are stored too, compile with**** -DDL_DEFAULT=DL_POSITIONS_OFFSETS**** If DL_DEFAULT is set to DL_DOCIDS, your table can only be inserted** into (no deletes or updates).*/#ifndef DL_DEFAULT# define DL_DEFAULT DL_POSITIONS#endifenum { POS_END = 0, /* end of this position list */ POS_COLUMN, /* followed by new column number */ POS_BASE};/* MERGE_COUNT controls how often we merge segments (see comment at** top of file).*/#define MERGE_COUNT 16/* utility functions *//* CLEAR() and SCRAMBLE() abstract memset() on a pointer to a single** record to prevent errors of the form:**** my_function(SomeType *b){** memset(b, '\0', sizeof(b)); // sizeof(b)!=sizeof(*b)** }*//* TODO(shess) Obvious candidates for a header file. */#define CLEAR(b) memset(b, '\0', sizeof(*(b)))#ifndef NDEBUG# define SCRAMBLE(b) memset(b, 0x55, sizeof(*(b)))#else# define SCRAMBLE(b)#endif/* We may need up to VARINT_MAX bytes to store an encoded 64-bit integer. */#define VARINT_MAX 10/* Write a 64-bit variable-length integer to memory starting at p[0]. * The length of data written will be between 1 and VARINT_MAX bytes. * The number of bytes written is returned. */static int putVarint(char *p, sqlite_int64 v){ unsigned char *q = (unsigned char *) p; sqlite_uint64 vu = v; do{ *q++ = (unsigned char) ((vu & 0x7f) | 0x80); vu >>= 7;
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -