📄 regx.c

📁 一个开源著名的TDE编辑器源代码
💻 C
📖 第 1 页 / 共 3 页
字号:
12 3 下一页
/* * These functions compile a regular expression into an NFA and recognize * a pattern. * * Regular expressions are often used to describe a pattern that might * match several different or similar words.  An example of a regular * expression subset is the wild card characters '*' and '?' used to list * files in a directory. * * Might as well read the books and papers written by those who first wrote * the various [a-z]?greps.  Ken Thompson was the author of grep.  In one * weekend, Al Aho wrote egrep and fgrep.  Incidentally, when Dr. Aho was * the guest speaker at the Distinguished Lecture Series at Georgia Tech, * he personally autographed my copy of his book  _Compilers:  Principles, * Techniques, and Tools_, aka the "Dragon Book." * * See: * *   Ken Thompson, "Regular Expression Search Algorithm."  _Communications *      of the ACM_, 11 (No. 6): 419-422, 1968. * *   Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, _Compilers: Principles, *      Techniques, and Tools_, Addison-Wesley, Reading, Mass., 1986, *      Chapter 3, "Lexical Analysis", pp 83-158.  ISBN 0-201-10088-6. * *   Robert Sedgewick, _Algorithms in C_, Addison-Wesley, Reading, Mass., *      1990, Chapter 20, "Pattern Matching", pp. 293-303, and Chapter 21, *      "Parsing", pp. 305-317.  ISBN 0-201-51425-7. * *   Andrew Hume, "A Tale of Two Greps."  _Software--Practice and Experience_, *      18 (No. 11): 1063-1072, 1988. * * *                         Regular Expressions in TDE * * The core of the regular expression search algorithm in TDE is based on * the nondeterministic finite-state machine in Dr. Segdewick's book. * Additional parsing, operator precedence, and nfa construction and * simulation info is from Dr. Aho's book.  The nfa node naming convention * and additional nfa construction guidelines are from Ken Thompson's paper. * I'm much too stupid to compile a regular expression into assembly language * as suggested in Ken Thompson's paper, but I did get the idea to do the * next best thing and added C library functions as NNODE terminal types. * * The local global NFA is builded with two arrays and a deque.  The * first array, next1, in the NFA is used to hold the path for lambda * transitions.  The second array, next2, is used to hold alternate paths. * Next1 can hold all types of nodes, but it gets priority for lambda * transitions.  The deque is a circular array where future states are queued * in the head and present states are pushed in the tail. * * jmh: To be consistent with other regular expression implementations, the * longest match should be returned, not the shortest which the above does. * Fortunately, returning the longest match is a trivial change, simply * continue scanning if the deque is not empty.  However, other implementations * have "greedy" operators - it is the operator itself that matches as much as * possible, rather than the entire expression.  As far as the overall match is * concerned, this is just semantics, but it makes quite a difference when * capturing the substrings used by replacement.  Unfortunately, the deque * method prevents the reliable capture of substrings, nor does it allow the * operators to determine their own greediness.  So instead of using a deque to * store states, a recursive function is used to test one state; if it fails, * the scan continues with the other state.  Greediness is controlled by simply * choosing which state is done first. * * Searching for regular expressions in TDE is not very fast.  The worst * case search, .*x, where x is any word or letter, is probably the slowest * of any implementation with a regular expression search function.  However, * TDE can handle a large regular expression subset. * * In version 3.1, I added BOW and EOW (beginning of word and end of word.) * * In version 3.2, the macro letters were moved to letters.h per Byrial Jensen. * We also use the bj_isxxx( ) functions to test whether a letter falls into * a predefined macro class. * * jmh 050729: added BOB and EOB (beginning & end of bracket/backreference). *              This is used by the replacement to substitute bracketed *              expressions and for backreferences. * jmh 050803: added BLANK, separate from WHITESPACE. * jmh 050811: added BOS and EOS (beginning & end of string). * jmh 050817: added a closing lookahead assertion. * jmh 050908: replaced the deque with a recursive function.  This was *              necessary to allow for separate greedy and non-greedy operators *              and to correctly obtain backreferences. * * New editor name:  TDE, the Thomson-Davis Editor. * Author:           Frank Davis * Date:             June 5, 1991, version 1.0 * Date:             July 29, 1991, version 1.1 * Date:             October 5, 1991, version 1.2 * Date:             January 20, 1992, version 1.3 * Date:             February 17, 1992, version 1.4 * Date:             April 1, 1992, version 1.5 * Date:             June 5, 1992, version 2.0 * Date:             October 31, 1992, version 2.1 * Date:             April 1, 1993, version 2.2 * Date:             June 5, 1993, version 3.0 * Date:             August 29, 1993, version 3.1 * Date:             November 13, 1993, version 3.2 * Date:             June 5, 1994, version 4.0 * Date:             December 5, 1998, version 5.0 (jmh) * * This code is released into the public domain, Frank Davis. * You may use and distribute it freely. */#include "tdestr.h"#include "common.h"#include "tdefunc.h"#include "define.h"#ifndef min #define min(a,b)       (((a) < (b)) ? (a) : (b))#endif/* * types of nodes in the NFA. * * let's use the node naming convention in Ken Thompson's reg ex paper. */#define CNODE           0#define NNODE           1/* * types of terminals in NNODEs. * * starting with simple ASCII terminals, it's easy to build in other types of *  terminals, e.g. wild chars, BOL, EOL, and character classes.  actually, *  we can easily build ANSI C ctype library function NNODEs. */#define STRAIGHT_ASCII  0#define IGNORE_ASCII    1#define WILD            2#define CLASS           3#define NOTCLASS        4#define BLANK           5#define WHITESPACE      6#define ALPHA           7#define UPPER           8#define LOWER           9#define ALPHANUM        10#define DECIMAL         11#define HEX             12#define BOL             13#define EOL             14#define BOW             15#define EOW             16#define BOS             17#define EOS             18#define BOB             19#define EOB             20#define BACKREF         21/* * types of terminals in CNODEs. */#define START           0#define ACCEPT          1#define OR_NODE         2#define JUXTA           3#define CLOSURE         4#define ZERO_OR_ONE     5#define ASSERT          6#define ASSERTNOT       7static int  lookahead;static int  regx_rc;static int  parser_state;static char class_bits[32];static int  bit[8] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };static int  c1;static int  c2;static int  ttype;static NFA_TYPE  *nfa_pointer;       int  nfa_status;static int  search_len;static int  search_col;static text_ptr search_string;extern int  found_len;                  /* defined in findrep.c */extern int  block_replace_offset;extern long last_replace_line;extern int  add_search_line( line_list_ptr, long, int );static int  subs_found;       int  subs_pos[10];       int  subs_len[10];/* * Name:    find_regx * Purpose: To set up and find a regular expression. * Date:    June 5, 1993 * Passed:  window:  pointer to current window * Notes:   Keep the regular expression until changed. */int  find_regx( TDE_WIN *window ){int  direction;int  new_string;register TDE_WIN *win;  /* put window pointer in a register */int  rc;char answer[MAX_COLS+2];   switch (g_status.command) {      case RegXForward :         direction  = FORWARD;         new_string = TRUE;         break;      case RegXBackward :         direction  = BACKWARD;         new_string = TRUE;         break;      case RepeatRegXForward :         direction  = FORWARD;         new_string =  regx.search_defined != OK ? TRUE : FALSE;         break;      case RepeatRegXBackward :         direction  = BACKWARD;         new_string =  regx.search_defined != OK ? TRUE : FALSE;         break;      default :         assert( FALSE );         return( ERROR );   }   win = window;   if (un_copy_line( win->ll, win, TRUE, TRUE ) == ERROR)      return( ERROR );   /*    * get search text, using previous as default    */   rc = OK;   if (new_string == TRUE) {      *answer = '\0';      if (regx.search_defined == OK) {         assert( strlen( (char *)regx.pattern ) < (size_t)g_display.ncols );         strcpy( answer, (char *)regx.pattern );      }      /*       * jmh 991006: if there was an error processing the pattern ask again.       */      do {         /*          * string to find:          */         if (get_name( reg1, win->bottom_line, answer, &h_find ) <= 0)            return( ERROR );         strcpy( (char *)regx.pattern, answer );         rc = regx.search_defined = build_nfa( );      } while (regx.search_defined == ERROR);   }   g_status.search = SEARCH_REGX;   if (direction == BACKWARD)      g_status.search |= SEARCH_BACK;   rc = perform_search( win );   return( rc );}/* * Name:    regx_search_forward * Purpose: search forward for pattern using nfa * Date:    June 5, 1993 * Passed:  ll:     pointer to current node in linked list *          rline:  pointer to real line counter *          col:    column in ll->line to begin search * Returns: position in file if found or NULL if not found * Notes:   run the nfa machine on each character in each line * * jmh 010704: fixed bug when on TOF. * jmh 031030: optimise the BOL search (don't check each character). */line_list_ptr regx_search_forward( line_list_ptr ll, long *rline, int *col ){file_infos *file;int bc, ec;int anchor;   if (ll->next == NULL)      return( NULL );   if (g_status.command == DefineGrep  ||  g_status.command == RepeatGrep) {      nfa_pointer = &sas_nfa;   } else {      nfa_pointer = &nfa;   }   nfa_status = OK;   search_string = ll->line;   search_len = ll->len;   search_col = *col;   file = g_status.current_file;   if ((g_status.search & SEARCH_BLOCK)  &&       (file->block_type == BOX  ||        (file->block_type == STREAM  &&         (*rline == file->block_br ||          (*rline == file->block_er && file->block_ec != -1))))) {      bc = file->block_bc;      ec = (file->block_ec == -1) ? search_len : file->block_ec;      if (last_replace_line == *rline) {         ec += block_replace_offset;         block_replace_offset = 0;         last_replace_line = 0;      }      if (file->inflate_tabs) {         bc = entab_adjust_rcol( search_string, search_len, bc,                                 file->ptab_size );         if (file->block_ec != -1)            ec = entab_adjust_rcol( search_string, search_len, ec,                                    file->ptab_size );      }      if (bc > search_len)         bc = search_len;      if (ec >= search_len)         ec = search_len - 1;      if (file->block_type == STREAM) {         if (*rline == file->block_br) {            if (search_col < bc)               search_col = bc;         }         if (*rline == file->block_er) {            if (search_col > ec + 1)               return( NULL );            search_len = ec + 1;            if (*rline != file->block_br)               bc = 0;         }      } else {         if (search_col < bc)            search_col = bc;         search_len = ec + 1;      }      search_string += bc;      search_len    -= bc;      search_col    -= bc;   } else      bc = 0;   anchor = nfa_pointer->next1[0];   anchor = (nfa_pointer->node_type[anchor] == NNODE &&             nfa_pointer->term_type[anchor] == BOL);   while (!g_status.control_break) {      if (anchor) {         if (nfa_match( ) != ERROR) {            if (g_status.search & SEARCH_RESULTS) {               if (!add_search_line( ll, *rline, 0 ))                  return( NULL );            } else {               *col = bc;               return( ll );            }         }      } else {         for (; search_col <= search_len; search_col++) {            if (nfa_match( ) != ERROR) {               if (g_status.search & SEARCH_RESULTS) {                  if (!add_search_line( ll, *rline, search_col ))                     return( NULL );                  break;               } else {                  *col = search_col + bc;                  return( ll );               }            }         }      }      ll = ll->next;      if (ll->len == EOF)         return( NULL );      ++*rline;      search_string = ll->line;      search_len = ll->len;      search_col = 0;      if ((g_status.search & SEARCH_BLOCK)  &&          (file->block_type == BOX  ||           (file->block_type == STREAM  &&            *rline == file->block_er && file->block_ec != -1))) {         bc = file->block_bc;         ec = file->block_ec;         if (file->inflate_tabs) {            bc = entab_adjust_rcol( search_string, search_len, bc,                                    file->ptab_size );            ec = entab_adjust_rcol( search_string, search_len, ec,                                    file->ptab_size );         }         if (bc > search_len)            bc = search_len;         if (ec >= search_len)            ec = search_len - 1;         if (file->block_type == STREAM)            search_len = ec + 1;         else {            search_string += bc;            search_len     = ec - bc + 1;         }      }   }   return( NULL );}/* * Name:    regx_search_backward * Purpose: search backward for pattern using regular expression * Date:    June 5, 1993 * Passed:  ll:     pointer to current node in linked list *          rline:  pointer to real line counter *          col:    column in ll->line to begin search * Returns: position in file if found or NULL if not found * Notes:   run the nfa machine on each character in each line */line_list_ptr regx_search_backward( line_list_ptr ll, long *rline, int *col ){file_infos *file;int  bc, ec;int  anchor;   if (ll == NULL || ll->len == EOF)      return( NULL );   nfa_pointer = &nfa;   search_string = ll->line;   search_len = ll->len;   search_col = *col;   file = g_status.current_file;   if ((g_status.search & SEARCH_BLOCK)  &&       (file->block_type == BOX  ||        (file->block_type == STREAM  &&         (*rline == file->block_br ||          (*rline == file->block_er && file->block_ec != -1))))) {      bc = file->block_bc;      ec = (file->block_ec == -1) ? search_len : file->block_ec;      if (file->inflate_tabs) {         bc = entab_adjust_rcol( search_string, search_len, bc,                                 file->ptab_size );         if (file->block_ec != -1)            ec = entab_adjust_rcol( search_string, search_len, ec,                                    file->ptab_size );      }      if (bc > search_len)         bc = search_len;      if (ec >= search_len)         ec = search_len - 1;      if (file->block_type == STREAM) {         if (*rline == file->block_er) {            search_len = ec + 1;            if (*rline != file->block_br)               bc = 0;            if (search_col > search_len)               search_col = search_len;         }
12 3 下一页
💿 文件大小 343 K
👤 上传用户 qqpp2q
📂 所属分类编辑器/阅读器
🏷️ 相关标签

#TDE #开源 #编辑器 #源代码
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -