📄 regularexp.c
字号:
static const char CVSID[] = "$Id: regularExp.c,v 1.22 2003/05/07 10:51:52 edg Exp $";/*------------------------------------------------------------------------* * `CompileRE', `ExecRE', and `substituteRE' -- regular expression parsing * * This is a HIGHLY ALTERED VERSION of Henry Spencer's `regcomp' and * `regexec' code adapted for NEdit. * * .-------------------------------------------------------------------. * | ORIGINAL COPYRIGHT NOTICE: | * | | * | Copyright (c) 1986 by University of Toronto. | * | Written by Henry Spencer. Not derived from licensed software. | * | | * | Permission is granted to anyone to use this software for any | * | purpose on any computer system, and to redistribute it freely, | * | subject to the following restrictions: | * | | * | 1. The author is not responsible for the consequences of use of | * | this software, no matter how awful, even if they arise | * | from defects in it. | * | | * | 2. The origin of this software must not be misrepresented, either | * | by explicit claim or by omission. | * | | * | 3. Altered versions must be plainly marked as such, and must not | * | be misrepresented as being the original software. | * `-------------------------------------------------------------------' * * BEWARE that some of this code is subtly aware of the way operator * precedence is structured in regular expressions. Serious changes in * regular-expression syntax might require a total rethink. * -- Henry Spencer * (Yes, it did!) -- Christopher Conrad, Dec. 1999 * * January, 1994, Mark Edel * Consolidated files, changed names of external functions to avoid * potential conflicts with native regcomp and regexec functions, changed * error reporting to NEdit form, added multi-line and reverse searching, * and added \n \t \u \U \l \L. * * June, 1996, Mark Edel * Bug in NEXT macro, didn't work for expressions which compiled to over * 256 bytes. * * December, 1999, Christopher Conrad * Reformatted code for readability, improved error output, added octal and * hexadecimal escapes, added back-references (\1-\9), added positive look * ahead: (?=...), added negative lookahead: (?!...), added non-capturing * parentheses: (?:...), added case insensitive constructs (?i...) and * (?I...), added newline matching constructs (?n...) and (?N...), added * regex comments: (?#...), added shortcut escapes: \d\D\l\L\s\S\w\W\y\Y. * Added "not a word boundary" anchor \B. * * July, 2002, Eddy De Greef * Added look behind, both positive (?<=...) and negative (?<!...) for * bounded-length patterns. */#ifdef HAVE_CONFIG_H#include "../config.h"#endif#include "regularExp.h"#include <stdio.h>#include <stdlib.h>#include <string.h>#include <ctype.h>#include <limits.h>#ifdef HAVE_DEBUG_H#include "../debug.h"#endif/* The first byte of the regexp internal `program' is a magic number to help gaurd against corrupted data; the compiled regex code really begins in the second byte. */#define MAGIC 0234/* The "internal use only" fields in `regexp.h' are present to pass info from * `CompileRE' to `ExecRE' which permits the execute phase to run lots faster on * simple cases. They are: * * match_start Character that must begin a match; '\0' if none obvious. * anchor Is the match anchored (at beginning-of-line only)? * * `match_start' and `anchor' permit very fast decisions on suitable starting * points for a match, considerably reducing the work done by ExecRE. *//* STRUCTURE FOR A REGULAR EXPRESSION (regex) `PROGRAM'. * * This is essentially a linear encoding of a nondeterministic finite-state * machine or NFA (aka syntax charts or `railroad normal form' in parsing * technology). Each node is an opcode plus a NEXT pointer, possibly * followed by operands. NEXT pointers of all nodes except BRANCH implement * concatenation; a NEXT pointer with a BRANCH on both ends of it is * connecting two alternatives. (Here we have one of the subtle syntax * dependencies: an individual BRANCH (as opposed to a collection of them) is * never concatenated with anything because of operator precedence.) The * operand of some types of nodes is a literal string; for others, it is a node * leading into a sub-FSM. In particular, the operand of a BRANCH node is the * first node of the branch. (NB this is _NOT_ a tree structure: the tail of * the branch connects to the thing following the set of BRANCHes.) * * The opcodes are: *//* DEFINITION VALUE MEANING */#define END 1 /* End of program. *//* Zero width positional assertions. */#define BOL 2 /* Match position at beginning of line. */#define EOL 3 /* Match position at end of line. */#define BOWORD 4 /* Match "" representing word delimiter or BOL */#define EOWORD 5 /* Match "" representing word delimiter or EOL */#define NOT_BOUNDARY 6 /* Not word boundary (\B, opposite of < and >) *//* Op codes with null terminated string operands. */#define EXACTLY 7 /* Match this string. */#define SIMILAR 8 /* Match this case insensitive string */#define ANY_OF 9 /* Match any character in the set. */#define ANY_BUT 10 /* Match any character not in the set. *//* Op codes to match any character. */#define ANY 11 /* Match any one character (implements '.') */#define EVERY 12 /* Same as ANY but matches newline. *//* Shortcut escapes, \d, \D, \l, \L, \s, \S, \w, \W, \y, \Y. */#define DIGIT 13 /* Match any digit, i.e. [0123456789] */#define NOT_DIGIT 14 /* Match any non-digit, i.e. [^0123456789] */#define LETTER 15 /* Match any letter character [a-zA-Z] */#define NOT_LETTER 16 /* Match any non-letter character [^a-zA-Z] */#define SPACE 17 /* Match any whitespace character EXCEPT \n */#define SPACE_NL 18 /* Match any whitespace character INCLUDING \n */#define NOT_SPACE 19 /* Match any non-whitespace character */#define NOT_SPACE_NL 20 /* Same as NOT_SPACE but matches newline. */#define WORD_CHAR 21 /* Match any word character [a-zA-Z0-9_] */#define NOT_WORD_CHAR 22 /* Match any non-word character [^a-zA-Z0-9_] */#define IS_DELIM 23 /* Match any character that's a word delimiter */#define NOT_DELIM 24 /* Match any character NOT a word delimiter *//* Quantifier nodes. (Only applied to SIMPLE nodes. Quantifiers applied to non SIMPLE nodes or larger atoms are implemented using complex constructs.)*/#define STAR 25 /* Match this (simple) thing 0 or more times. */#define LAZY_STAR 26 /* Minimal matching STAR */#define QUESTION 27 /* Match this (simple) thing 0 or 1 times. */#define LAZY_QUESTION 28 /* Minimal matching QUESTION */#define PLUS 29 /* Match this (simple) thing 1 or more times. */#define LAZY_PLUS 30 /* Minimal matching PLUS */#define BRACE 31 /* Match this (simple) thing m to n times. */#define LAZY_BRACE 32 /* Minimal matching BRACE *//* Nodes used to build complex constructs. */#define NOTHING 33 /* Match empty string (always matches) */#define BRANCH 34 /* Match this alternative, or the next... */#define BACK 35 /* Always matches, NEXT ptr points backward. */#define INIT_COUNT 36 /* Initialize {m,n} counter to zero */#define INC_COUNT 37 /* Increment {m,n} counter by one */#define TEST_COUNT 38 /* Test {m,n} counter against operand *//* Back Reference nodes. */#define BACK_REF 39 /* Match latest matched parenthesized text */#define BACK_REF_CI 40 /* Case insensitive version of BACK_REF */#define X_REGEX_BR 41 /* Cross-Regex Back-Ref for syntax highlighting */#define X_REGEX_BR_CI 42 /* Case insensitive version of X_REGEX_BR_CI *//* Various nodes used to implement parenthetical constructs. */#define POS_AHEAD_OPEN 43 /* Begin positive look ahead */#define NEG_AHEAD_OPEN 44 /* Begin negative look ahead */#define LOOK_AHEAD_CLOSE 45 /* End positive or negative look ahead */#define POS_BEHIND_OPEN 46 /* Begin positive look behind */#define NEG_BEHIND_OPEN 47 /* Begin negative look behind */#define LOOK_BEHIND_CLOSE 48 /* Close look behind */#define OPEN 49 /* Open for capturing parentheses. */ /* OPEN+1 is number 1, etc. */#define CLOSE (OPEN + NSUBEXP) /* Close for capturing parentheses. */#define LAST_PAREN (CLOSE + NSUBEXP)#if (LAST_PAREN > UCHAR_MAX)#error "Too many parentheses for storage in an unsigned char (LAST_PAREN too big.)"#endif/* The next_ptr () function can consume up to 30% of the time during matching because it is called an immense number of times (an average of 25 next_ptr() calls per match() call was witnessed for Perl syntax highlighting). Therefore it is well worth removing some of the function call overhead by selectively inlining the next_ptr() calls. Moreover, the inlined code can be simplified for matching because one of the tests, only necessary during compilation, can be left out. The net result of using this inlined version at two critical places is a 25% speedup (again, witnesses on Perl syntax highlighting). */ #define NEXT_PTR(in_ptr, out_ptr)\ next_ptr_offset = GET_OFFSET (in_ptr);\ if (next_ptr_offset == 0)\ out_ptr = NULL;\ else {\ if (GET_OP_CODE (in_ptr) == BACK)\ out_ptr = in_ptr - next_ptr_offset;\ else \ out_ptr = in_ptr + next_ptr_offset;\ } /* OPCODE NOTES: ------------ All nodes consist of an 8 bit op code followed by 2 bytes that make up a 16 bit NEXT pointer. Some nodes have a null terminated character string operand following the NEXT pointer. Other nodes may have an 8 bit index operand. The TEST_COUNT node has an index operand followed by a 16 bit test value. The BRACE and LAZY_BRACE nodes have two 16 bit values for min and max but no index value. SIMILAR Operand(s): null terminated string Implements a case insensitive match of a string. Mostly intended for use in syntax highlighting patterns for keywords of languages like FORTRAN and Ada that are case insensitive. The regex text in this node is converted to lower case during regex compile. DIGIT, NOT_DIGIT, LETTER, NOT_LETTER, SPACE, NOT_SPACE, WORD_CHAR, NOT_WORD_CHAR Operand(s): None Implements shortcut escapes \d, \D, \l, \L, \s, \S, \w, \W. The locale aware ANSI functions isdigit(), isalpha(), isalnun(), and isspace() are used to implement these in the hopes of increasing portability. NOT_BOUNDARY Operand(s): None Implements \B as a zero width assertion that the current character is NOT on a word boundary. Word boundaries are defined to be the position between two characters where one of those characters is one of the dynamically defined word delimiters, and the other character is not. IS_DELIM Operand(s): None Implements \y as any character that is one of the dynamically specified word delimiters. NOT_DELIM Operand(s): None Implements \Y as any character that is NOT one of the dynamically specified word delimiters. STAR, PLUS, QUESTION, and complex '*', '+', and '?' Operand(s): None (Note: NEXT pointer is usually zero. The code that processes this node skips over it.) Complex (parenthesized) versions implemented as circular BRANCH structures using BACK. SIMPLE versions (one character per match) are implemented separately for speed and to minimize recursion. BRACE, LAZY_BRACE Operand(s): minimum value (2 bytes), maximum value (2 bytes) Implements the {m,n} construct for atoms that are SIMPLE. BRANCH Operand(s): None The set of branches constituting a single choice are hooked together with their NEXT pointers, since precedence prevents anything being concatenated to any individual branch. The NEXT pointer of the last BRANCH in a choice points to the thing following the whole choice. This is also where the final NEXT pointer of each individual branch points; each branch starts with the operand node of a BRANCH node. BACK Operand(s): None Normal NEXT pointers all implicitly point forward. Back implicitly points backward. BACK exists to make loop structures possible. INIT_COUNT Operand(s): index (1 byte) Initializes the count array element referenced by the index operand. This node is used to build general (i.e. parenthesized) {m,n} constructs. INC_COUNT Operand(s): index (1 byte) Increments the count array element referenced by the index operand. This node is used to build general (i.e. parenthesized) {m,n} constructs. TEST_COUNT Operand(s): index (1 byte), test value (2 bytes) Tests the current value of the count array element specified by the index operand against the test value. If the current value is less than the test value, control passes to the node after that TEST_COUNT node. Otherwise control passes to the node referenced by the NEXT pointer for the TEST_COUNT node. This node is used to build general (i.e. parenthesized) {m,n} constructs. BACK_REF, BACK_REF_CI Operand(s): index (1 byte, value 1-9) Implements back references. This node will attempt to match whatever text was most recently captured by the index'th set of parentheses. BACK_REF_CI is case insensitive version. X_REGEX_BR, X_REGEX_BR_CI (NOT IMPLEMENTED YET) Operand(s): index (1 byte, value 1-9)
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -