regularexp.c

来自「nedit 是一款linux下的开发源码的功能强大的编辑器」· C语言代码 · 共 1,624 行 · 第 1/5 页
1,624 行
static const char CVSID[] = "$Id: regularExp.c,v 1.22 2003/05/07 10:51:52 edg Exp $";/*------------------------------------------------------------------------* * `CompileRE', `ExecRE', and `substituteRE' -- regular expression parsing * * This is a HIGHLY ALTERED VERSION of Henry Spencer's `regcomp' and * `regexec' code adapted for NEdit. * * .-------------------------------------------------------------------. * | ORIGINAL COPYRIGHT NOTICE:                                        | * |                                                                   | * | Copyright (c) 1986 by University of Toronto.                      | * | Written by Henry Spencer.  Not derived from licensed software.    | * |                                                                   | * | Permission is granted to anyone to use this software for any      | * | purpose on any computer system, and to redistribute it freely,    | * | subject to the following restrictions:                            | * |                                                                   | * | 1. The author is not responsible for the consequences of use of   | * |      this software, no matter how awful, even if they arise       | * |      from defects in it.                                          | * |                                                                   | * | 2. The origin of this software must not be misrepresented, either | * |      by explicit claim or by omission.                            | * |                                                                   | * | 3. Altered versions must be plainly marked as such, and must not  | * |      be misrepresented as being the original software.            | * `-------------------------------------------------------------------' * * BEWARE that some of this code is subtly aware of the way operator * precedence is structured in regular expressions.  Serious changes in * regular-expression syntax might require a total rethink. *   -- Henry Spencer * (Yes, it did!) -- Christopher Conrad, Dec. 1999 * * January, 1994, Mark Edel *    Consolidated files, changed names of external functions to avoid *    potential conflicts with native regcomp and regexec functions, changed *    error reporting to NEdit form, added multi-line and reverse searching, *    and added \n \t \u \U \l \L. * * June, 1996, Mark Edel *    Bug in NEXT macro, didn't work for expressions which compiled to over *    256 bytes. * * December, 1999, Christopher Conrad *    Reformatted code for readability, improved error output, added octal and *    hexadecimal escapes, added back-references (\1-\9), added positive look *    ahead: (?=...), added negative lookahead: (?!...),  added non-capturing *    parentheses: (?:...), added case insensitive constructs (?i...) and *    (?I...), added newline matching constructs (?n...) and (?N...), added *    regex comments: (?#...), added shortcut escapes: \d\D\l\L\s\S\w\W\y\Y. *    Added "not a word boundary" anchor \B. * * July, 2002, Eddy De Greef *    Added look behind, both positive (?<=...) and negative (?<!...) for *    bounded-length patterns. */#ifdef HAVE_CONFIG_H#include "../config.h"#endif#include "regularExp.h"#include <stdio.h>#include <stdlib.h>#include <string.h>#include <ctype.h>#include <limits.h>#ifdef HAVE_DEBUG_H#include "../debug.h"#endif/* The first byte of the regexp internal `program' is a magic number to help   gaurd against corrupted data; the compiled regex code really begins in the   second byte. */#define MAGIC   0234/* The "internal use only" fields in `regexp.h' are present to pass info from * `CompileRE' to `ExecRE' which permits the execute phase to run lots faster on * simple cases.  They are: * *   match_start     Character that must begin a match; '\0' if none obvious. *   anchor          Is the match anchored (at beginning-of-line only)? * * `match_start' and `anchor' permit very fast decisions on suitable starting * points for a match, considerably reducing the work done by ExecRE. *//* STRUCTURE FOR A REGULAR EXPRESSION (regex) `PROGRAM'. * * This is essentially a linear encoding of a nondeterministic finite-state * machine or NFA (aka syntax charts or `railroad normal form' in parsing * technology).  Each node is an opcode plus a NEXT pointer, possibly * followed by operands.  NEXT pointers of all nodes except BRANCH implement * concatenation; a NEXT pointer with a BRANCH on both ends of it is * connecting two alternatives.  (Here we have one of the subtle syntax * dependencies:  an individual BRANCH (as opposed to a collection of them) is * never concatenated with anything because of operator precedence.)  The * operand of some types of nodes is a literal string; for others, it is a node * leading into a sub-FSM.  In particular, the operand of a BRANCH node is the * first node of the branch. (NB this is _NOT_ a tree structure:  the tail of * the branch connects to the thing following the set of BRANCHes.) * * The opcodes are: *//* DEFINITION            VALUE  MEANING */#define END                1  /* End of program. *//* Zero width positional assertions. */#define BOL                2  /* Match position at beginning of line. */#define EOL                3  /* Match position at end of line. */#define BOWORD             4  /* Match "" representing word delimiter or BOL */#define EOWORD             5  /* Match "" representing word delimiter or EOL */#define NOT_BOUNDARY       6  /* Not word boundary (\B, opposite of < and >) *//* Op codes with null terminated string operands. */#define EXACTLY            7  /* Match this string. */#define SIMILAR            8  /* Match this case insensitive string */#define ANY_OF             9  /* Match any character in the set. */#define ANY_BUT           10  /* Match any character not in the set. *//* Op codes to match any character. */#define ANY               11  /* Match any one character (implements '.') */#define EVERY             12  /* Same as ANY but matches newline. *//* Shortcut escapes, \d, \D, \l, \L, \s, \S, \w, \W, \y, \Y. */#define DIGIT             13  /* Match any digit, i.e. [0123456789] */#define NOT_DIGIT         14  /* Match any non-digit, i.e. [^0123456789] */#define LETTER            15  /* Match any letter character [a-zA-Z] */#define NOT_LETTER        16  /* Match any non-letter character [^a-zA-Z] */#define SPACE             17  /* Match any whitespace character EXCEPT \n */#define SPACE_NL          18  /* Match any whitespace character INCLUDING \n */#define NOT_SPACE         19  /* Match any non-whitespace character */#define NOT_SPACE_NL      20  /* Same as NOT_SPACE but matches newline. */#define WORD_CHAR         21  /* Match any word character [a-zA-Z0-9_] */#define NOT_WORD_CHAR     22  /* Match any non-word character [^a-zA-Z0-9_] */#define IS_DELIM          23  /* Match any character that's a word delimiter */#define NOT_DELIM         24  /* Match any character NOT a word delimiter *//* Quantifier nodes. (Only applied to SIMPLE nodes.  Quantifiers applied to non   SIMPLE nodes or larger atoms are implemented using complex constructs.)*/#define STAR              25  /* Match this (simple) thing 0 or more times. */#define LAZY_STAR         26  /* Minimal matching STAR */#define QUESTION          27  /* Match this (simple) thing 0 or 1 times. */#define LAZY_QUESTION     28  /* Minimal matching QUESTION */#define PLUS              29  /* Match this (simple) thing 1 or more times. */#define LAZY_PLUS         30  /* Minimal matching PLUS */#define BRACE             31  /* Match this (simple) thing m to n times. */#define LAZY_BRACE        32  /* Minimal matching BRACE *//* Nodes used to build complex constructs. */#define NOTHING           33  /* Match empty string (always matches) */#define BRANCH            34  /* Match this alternative, or the next... */#define BACK              35  /* Always matches, NEXT ptr points backward. */#define INIT_COUNT        36  /* Initialize {m,n} counter to zero */#define INC_COUNT         37  /* Increment {m,n} counter by one */#define TEST_COUNT        38  /* Test {m,n} counter against operand *//* Back Reference nodes. */#define BACK_REF          39  /* Match latest matched parenthesized text */#define BACK_REF_CI       40  /* Case insensitive version of BACK_REF */#define X_REGEX_BR        41  /* Cross-Regex Back-Ref for syntax highlighting */#define X_REGEX_BR_CI     42  /* Case insensitive version of X_REGEX_BR_CI *//* Various nodes used to implement parenthetical constructs. */#define POS_AHEAD_OPEN    43  /* Begin positive look ahead */#define NEG_AHEAD_OPEN    44  /* Begin negative look ahead */#define LOOK_AHEAD_CLOSE  45  /* End positive or negative look ahead */#define POS_BEHIND_OPEN   46  /* Begin positive look behind */#define NEG_BEHIND_OPEN   47  /* Begin negative look behind */#define LOOK_BEHIND_CLOSE 48  /* Close look behind */#define OPEN              49  /* Open for capturing parentheses. */                              /*  OPEN+1 is number 1, etc. */#define CLOSE       (OPEN + NSUBEXP)  /* Close for capturing parentheses. */#define LAST_PAREN  (CLOSE + NSUBEXP)#if (LAST_PAREN > UCHAR_MAX)#error "Too many parentheses for storage in an unsigned char (LAST_PAREN too big.)"#endif/* The next_ptr () function can consume up to 30% of the time during matching   because it is called an immense number of times (an average of 25    next_ptr() calls per match() call was witnessed for Perl syntax    highlighting). Therefore it is well worth removing some of the function   call overhead by selectively inlining the next_ptr() calls. Moreover,   the inlined code can be simplified for matching because one of the tests,    only necessary during compilation, can be left out.   The net result of using this inlined version at two critical places is    a 25% speedup (again, witnesses on Perl syntax highlighting). */   #define NEXT_PTR(in_ptr, out_ptr)\   next_ptr_offset = GET_OFFSET (in_ptr);\   if (next_ptr_offset == 0)\       out_ptr = NULL;\   else {\       if (GET_OP_CODE (in_ptr) == BACK)\           out_ptr = in_ptr - next_ptr_offset;\       else \           out_ptr = in_ptr + next_ptr_offset;\   }   /* OPCODE NOTES:   ------------   All nodes consist of an 8 bit op code followed by 2 bytes that make up a 16   bit NEXT pointer.  Some nodes have a null terminated character string operand   following the NEXT pointer.  Other nodes may have an 8 bit index operand.   The TEST_COUNT node has an index operand followed by a 16 bit test value.   The BRACE and LAZY_BRACE nodes have two 16 bit values for min and max but no   index value.   SIMILAR      Operand(s): null terminated string      Implements a case insensitive match of a string.  Mostly intended for use      in syntax highlighting patterns for keywords of languages like FORTRAN      and Ada that are case insensitive.  The regex text in this node is      converted to lower case during regex compile.   DIGIT, NOT_DIGIT, LETTER, NOT_LETTER, SPACE, NOT_SPACE, WORD_CHAR,   NOT_WORD_CHAR      Operand(s): None      Implements shortcut escapes \d, \D, \l, \L, \s, \S, \w, \W.  The locale      aware ANSI functions isdigit(), isalpha(), isalnun(), and isspace() are      used to implement these in the hopes of increasing portability.   NOT_BOUNDARY      Operand(s): None      Implements \B as a zero width assertion that the current character is      NOT on a word boundary.  Word boundaries are defined to be the position      between two characters where one of those characters is one of the      dynamically defined word delimiters, and the other character is not.   IS_DELIM      Operand(s): None      Implements \y as any character that is one of the dynamically      specified word delimiters.   NOT_DELIM      Operand(s): None      Implements \Y as any character that is NOT one of the dynamically      specified word delimiters.   STAR, PLUS, QUESTION, and complex '*', '+', and '?'      Operand(s): None (Note: NEXT pointer is usually zero.  The code that                        processes this node skips over it.)      Complex (parenthesized) versions implemented as circular BRANCH      structures using BACK.  SIMPLE versions (one character per match) are      implemented separately for speed and to minimize recursion.   BRACE, LAZY_BRACE      Operand(s): minimum value (2 bytes), maximum value (2 bytes)      Implements the {m,n} construct for atoms that are SIMPLE.   BRANCH      Operand(s): None      The set of branches constituting a single choice are hooked together      with their NEXT pointers, since precedence prevents anything being      concatenated to any individual branch.  The NEXT pointer of the last      BRANCH in a choice points to the thing following the whole choice.  This      is also where the final NEXT pointer of each individual branch points;      each branch starts with the operand node of a BRANCH node.   BACK      Operand(s): None      Normal NEXT pointers all implicitly point forward.  Back implicitly      points backward.  BACK exists to make loop structures possible.   INIT_COUNT      Operand(s): index (1 byte)      Initializes the count array element referenced by the index operand.      This node is used to build general (i.e. parenthesized) {m,n} constructs.   INC_COUNT      Operand(s): index (1 byte)      Increments the count array element referenced by the index operand.      This node is used to build general (i.e. parenthesized) {m,n} constructs.   TEST_COUNT      Operand(s): index (1 byte), test value (2 bytes)      Tests the current value of the count array element specified by the      index operand against the test value.  If the current value is less than      the test value, control passes to the node after that TEST_COUNT node.      Otherwise control passes to the node referenced by the NEXT pointer for      the TEST_COUNT node.  This node is used to build general (i.e.      parenthesized) {m,n} constructs.   BACK_REF, BACK_REF_CI      Operand(s): index (1 byte, value 1-9)      Implements back references.  This node will attempt to match whatever text      was most recently captured by the index'th set of parentheses.      BACK_REF_CI is case insensitive version.   X_REGEX_BR, X_REGEX_BR_CI      (NOT IMPLEMENTED YET)      Operand(s): index (1 byte, value 1-9)
regularexp.c - 源码说明

本页面展示了「nedit 是一款linux下的开发源码的功能强大的编辑器」中的 regularexp.c 源码文件，采用 C语言编程语言编写，共 1,624 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与nedit相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?