📄 fts2.c

📁 sqlite-3.4.1,嵌入式数据库.是一个功能强大的开源数据库,给学习和研发以及小型公司的发展带来了全所未有的好处.
💻 C
📖 第 1 页 / 共 5 页
字号:
12 3 4 5 下一页
/*** 2006 Oct 10**** The author disclaims copyright to this source code.  In place of** a legal notice, here is a blessing:****    May you do good and not evil.**    May you find forgiveness for yourself and forgive others.**    May you share freely, never taking more than you give.************************************************************************************ This is an SQLite module implementing full-text search.*//*** The code in this file is only compiled if:****     * The FTS2 module is being built as an extension**       (in which case SQLITE_CORE is not defined), or****     * The FTS2 module is being built into the core of**       SQLite (in which case SQLITE_ENABLE_FTS2 is defined).*//* TODO(shess) Consider exporting this comment to an HTML file or the** wiki.*//* The full-text index is stored in a series of b+tree (-like)** structures called segments which map terms to doclists.  The** structures are like b+trees in layout, but are constructed from the** bottom up in optimal fashion and are not updatable.  Since trees** are built from the bottom up, things will be described from the** bottom up.******** Varints ****** The basic unit of encoding is a variable-length integer called a** varint.  We encode variable-length integers in little-endian order** using seven bits * per byte as follows:**** KEY:**         A = 0xxxxxxx    7 bits of data and one flag bit**         B = 1xxxxxxx    7 bits of data and one flag bit****  7 bits - A** 14 bits - BA** 21 bits - BBA** and so on.**** This is identical to how sqlite encodes varints (see util.c).******** Document lists ****** A doclist (document list) holds a docid-sorted list of hits for a** given term.  Doclists hold docids, and can optionally associate** token positions and offsets with docids.**** A DL_POSITIONS_OFFSETS doclist is stored like this:**** array {**   varint docid;**   array {                (position list for column 0)**     varint position;     (delta from previous position plus POS_BASE)**     varint startOffset;  (delta from previous startOffset)**     varint endOffset;    (delta from startOffset)**   }**   array {**     varint POS_COLUMN;   (marks start of position list for new column)**     varint column;       (index of new column)**     array {**       varint position;   (delta from previous position plus POS_BASE)**       varint startOffset;(delta from previous startOffset)**       varint endOffset;  (delta from startOffset)**     }**   }**   varint POS_END;        (marks end of positions for this document.** }**** Here, array { X } means zero or more occurrences of X, adjacent in** memory.  A "position" is an index of a token in the token stream** generated by the tokenizer, while an "offset" is a byte offset,** both based at 0.  Note that POS_END and POS_COLUMN occur in the** same logical place as the position element, and act as sentinals** ending a position list array.**** A DL_POSITIONS doclist omits the startOffset and endOffset** information.  A DL_DOCIDS doclist omits both the position and** offset information, becoming an array of varint-encoded docids.**** On-disk data is stored as type DL_DEFAULT, so we don't serialize** the type.  Due to how deletion is implemented in the segmentation** system, on-disk doclists MUST store at least positions.******** Segment leaf nodes ****** Segment leaf nodes store terms and doclists, ordered by term.  Leaf** nodes are written using LeafWriter, and read using LeafReader (to** iterate through a single leaf node's data) and LeavesReader (to** iterate through a segment's entire leaf layer).  Leaf nodes have** the format:**** varint iHeight;             (height from leaf level, always 0)** varint nTerm;               (length of first term)** char pTerm[nTerm];          (content of first term)** varint nDoclist;            (length of term's associated doclist)** char pDoclist[nDoclist];    (content of doclist)** array {**                             (further terms are delta-encoded)**   varint nPrefix;           (length of prefix shared with previous term)**   varint nSuffix;           (length of unshared suffix)**   char pTermSuffix[nSuffix];(unshared suffix of next term)**   varint nDoclist;          (length of term's associated doclist)**   char pDoclist[nDoclist];  (content of doclist)** }**** Here, array { X } means zero or more occurrences of X, adjacent in** memory.**** Leaf nodes are broken into blocks which are stored contiguously in** the %_segments table in sorted order.  This means that when the end** of a node is reached, the next term is in the node with the next** greater node id.**** New data is spilled to a new leaf node when the current node** exceeds LEAF_MAX bytes (default 2048).  New data which itself is** larger than STANDALONE_MIN (default 1024) is placed in a standalone** node (a leaf node with a single term and doclist).  The goal of** these settings is to pack together groups of small doclists while** making it efficient to directly access large doclists.  The** assumption is that large doclists represent terms which are more** likely to be query targets.**** TODO(shess) It may be useful for blocking decisions to be more** dynamic.  For instance, it may make more sense to have a 2.5k leaf** node rather than splitting into 2k and .5k nodes.  My intuition is** that this might extend through 2x or 4x the pagesize.******** Segment interior nodes ****** Segment interior nodes store blockids for subtree nodes and terms** to describe what data is stored by the each subtree.  Interior** nodes are written using InteriorWriter, and read using** InteriorReader.  InteriorWriters are created as needed when** SegmentWriter creates new leaf nodes, or when an interior node** itself grows too big and must be split.  The format of interior** nodes:**** varint iHeight;           (height from leaf level, always >0)** varint iBlockid;          (block id of node's leftmost subtree)** optional {**   varint nTerm;           (length of first term)**   char pTerm[nTerm];      (content of first term)**   array {**                                (further terms are delta-encoded)**     varint nPrefix;            (length of shared prefix with previous term)**     varint nSuffix;            (length of unshared suffix)**     char pTermSuffix[nSuffix]; (unshared suffix of next term)**   }** }**** Here, optional { X } means an optional element, while array { X }** means zero or more occurrences of X, adjacent in memory.**** An interior node encodes n terms separating n+1 subtrees.  The** subtree blocks are contiguous, so only the first subtree's blockid** is encoded.  The subtree at iBlockid will contain all terms less** than the first term encoded (or all terms if no term is encoded).** Otherwise, for terms greater than or equal to pTerm[i] but less** than pTerm[i+1], the subtree for that term will be rooted at** iBlockid+i.  Interior nodes only store enough term data to** distinguish adjacent children (if the rightmost term of the left** child is "something", and the leftmost term of the right child is** "wicked", only "w" is stored).**** New data is spilled to a new interior node at the same height when** the current node exceeds INTERIOR_MAX bytes (default 2048).** INTERIOR_MIN_TERMS (default 7) keeps large terms from monopolizing** interior nodes and making the tree too skinny.  The interior nodes** at a given height are naturally tracked by interior nodes at** height+1, and so on.******** Segment directory ****** The segment directory in table %_segdir stores meta-information for** merging and deleting segments, and also the root node of the** segment's tree.**** The root node is the top node of the segment's tree after encoding** the entire segment, restricted to ROOT_MAX bytes (default 1024).** This could be either a leaf node or an interior node.  If the top** node requires more than ROOT_MAX bytes, it is flushed to %_segments** and a new root interior node is generated (which should always fit** within ROOT_MAX because it only needs space for 2 varints, the** height and the blockid of the previous root).**** The meta-information in the segment directory is:**   level               - segment level (see below)**   idx                 - index within level**                       - (level,idx uniquely identify a segment)**   start_block         - first leaf node**   leaves_end_block    - last leaf node**   end_block           - last block (including interior nodes)**   root                - contents of root node**** If the root node is a leaf node, then start_block,** leaves_end_block, and end_block are all 0.******** Segment merging ****** To amortize update costs, segments are groups into levels and** merged in matches.  Each increase in level represents exponentially** more documents.**** New documents (actually, document updates) are tokenized and** written individually (using LeafWriter) to a level 0 segment, with** incrementing idx.  When idx reaches MERGE_COUNT (default 16), all** level 0 segments are merged into a single level 1 segment.  Level 1** is populated like level 0, and eventually MERGE_COUNT level 1** segments are merged to a single level 2 segment (representing** MERGE_COUNT^2 updates), and so on.**** A segment merge traverses all segments at a given level in** parallel, performing a straightforward sorted merge.  Since segment** leaf nodes are written in to the %_segments table in order, this** merge traverses the underlying sqlite disk structures efficiently.** After the merge, all segment blocks from the merged level are** deleted.**** MERGE_COUNT controls how often we merge segments.  16 seems to be** somewhat of a sweet spot for insertion performance.  32 and 64 show** very similar performance numbers to 16 on insertion, though they're** a tiny bit slower (perhaps due to more overhead in merge-time** sorting).  8 is about 20% slower than 16, 4 about 50% slower than** 16, 2 about 66% slower than 16.**** At query time, high MERGE_COUNT increases the number of segments** which need to be scanned and merged.  For instance, with 100k docs** inserted:****    MERGE_COUNT   segments**       16           25**        8           12**        4           10**        2            6**** This appears to have only a moderate impact on queries for very** frequent terms (which are somewhat dominated by segment merge** costs), and infrequent and non-existent terms still seem to be fast** even with many segments.**** TODO(shess) That said, it would be nice to have a better query-side** argument for MERGE_COUNT of 16.  Also, it's possible/likely that** optimizations to things like doclist merging will swing the sweet** spot around.********** Handling of deletions and updates ****** Since we're using a segmented structure, with no docid-oriented** index into the term index, we clearly cannot simply update the term** index when a document is deleted or updated.  For deletions, we** write an empty doclist (varint(docid) varint(POS_END)), for updates** we simply write the new doclist.  Segment merges overwrite older** data for a particular docid with newer data, so deletes or updates** will eventually overtake the earlier data and knock it out.  The** query logic likewise merges doclists so that newer data knocks out** older data.**** TODO(shess) Provide a VACUUM type operation to clear out all** deletions and duplications.  This would basically be a forced merge** into a single segment.*/#if !defined(SQLITE_CORE) || defined(SQLITE_ENABLE_FTS2)#if defined(SQLITE_ENABLE_FTS2) && !defined(SQLITE_CORE)# define SQLITE_CORE 1#endif#include <assert.h>#if !defined(__APPLE__)#include <malloc.h>#endif#include <stdlib.h>#include <stdio.h>#include <string.h>#include <ctype.h>#include "fts2.h"#include "fts2_hash.h"#include "fts2_tokenizer.h"#include "sqlite3.h"#include "sqlite3ext.h"SQLITE_EXTENSION_INIT1/* TODO(shess) MAN, this thing needs some refactoring.  At minimum, it** would be nice to order the file better, perhaps something along the** lines of:****  - utility functions**  - table setup functions**  - table update functions**  - table query functions**** Put the query functions last because they're likely to reference** typedefs or functions from the table update section.*/#if 0# define TRACE(A)  printf A; fflush(stdout)#else# define TRACE(A)#endif/* It is not safe to call isspace(), tolower(), or isalnum() on** hi-bit-set characters.  This is the same solution used in the** tokenizer.*//* TODO(shess) The snippet-generation code should be using the** tokenizer-generated tokens rather than doing its own local** tokenization.*//* TODO(shess) Is __isascii() a portable version of (c&0x80)==0? */static int safe_isspace(char c){  return (c&0x80)==0 ? isspace(c) : 0;}static int safe_tolower(char c){  return (c&0x80)==0 ? tolower(c) : c;}static int safe_isalnum(char c){  return (c&0x80)==0 ? isalnum(c) : 0;}typedef enum DocListType {  DL_DOCIDS,              /* docids only */  DL_POSITIONS,           /* docids + positions */  DL_POSITIONS_OFFSETS    /* docids + positions + offsets */} DocListType;/*** By default, only positions and not offsets are stored in the doclists.** To change this so that offsets are stored too, compile with****          -DDL_DEFAULT=DL_POSITIONS_OFFSETS**** If DL_DEFAULT is set to DL_DOCIDS, your table can only be inserted** into (no deletes or updates).*/#ifndef DL_DEFAULT# define DL_DEFAULT DL_POSITIONS#endifenum {  POS_END = 0,        /* end of this position list */  POS_COLUMN,         /* followed by new column number */  POS_BASE};/* MERGE_COUNT controls how often we merge segments (see comment at** top of file).*/#define MERGE_COUNT 16/* utility functions *//* CLEAR() and SCRAMBLE() abstract memset() on a pointer to a single** record to prevent errors of the form:**** my_function(SomeType *b){**   memset(b, '\0', sizeof(b));  // sizeof(b)!=sizeof(*b)** }*//* TODO(shess) Obvious candidates for a header file. */#define CLEAR(b) memset(b, '\0', sizeof(*(b)))#ifndef NDEBUG#  define SCRAMBLE(b) memset(b, 0x55, sizeof(*(b)))#else#  define SCRAMBLE(b)#endif/* We may need up to VARINT_MAX bytes to store an encoded 64-bit integer. */#define VARINT_MAX 10/* Write a 64-bit variable-length integer to memory starting at p[0]. * The length of data written will be between 1 and VARINT_MAX bytes. * The number of bytes written is returned. */static int putVarint(char *p, sqlite_int64 v){  unsigned char *q = (unsigned char *) p;  sqlite_uint64 vu = v;  do{    *q++ = (unsigned char) ((vu & 0x7f) | 0x80);    vu >>= 7;
12 3 4 5 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -