📄 regx.c
字号:
/* * These functions compile a regular expression into an NFA and recognize * a pattern. * * Regular expressions are often used to describe a pattern that might * match several different or similar words. An example of a regular * expression subset is the wild card characters '*' and '?' used to list * files in a directory. * * Might as well read the books and papers written by those who first wrote * the various [a-z]?greps. Ken Thompson was the author of grep. In one * weekend, Al Aho wrote egrep and fgrep. Incidentally, when Dr. Aho was * the guest speaker at the Distinguished Lecture Series at Georgia Tech, * he personally autographed my copy of his book _Compilers: Principles, * Techniques, and Tools_, aka the "Dragon Book." * * See: * * Ken Thompson, "Regular Expression Search Algorithm." _Communications * of the ACM_, 11 (No. 6): 419-422, 1968. * * Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, _Compilers: Principles, * Techniques, and Tools_, Addison-Wesley, Reading, Mass., 1986, * Chapter 3, "Lexical Analysis", pp 83-158. ISBN 0-201-10088-6. * * Robert Sedgewick, _Algorithms in C_, Addison-Wesley, Reading, Mass., * 1990, Chapter 20, "Pattern Matching", pp. 293-303, and Chapter 21, * "Parsing", pp. 305-317. ISBN 0-201-51425-7. * * Andrew Hume, "A Tale of Two Greps." _Software--Practice and Experience_, * 18 (No. 11): 1063-1072, 1988. * * * Regular Expressions in TDE * * The core of the regular expression search algorithm in TDE is based on * the nondeterministic finite-state machine in Dr. Segdewick's book. * Additional parsing, operator precedence, and nfa construction and * simulation info is from Dr. Aho's book. The nfa node naming convention * and additional nfa construction guidelines are from Ken Thompson's paper. * I'm much too stupid to compile a regular expression into assembly language * as suggested in Ken Thompson's paper, but I did get the idea to do the * next best thing and added C library functions as NNODE terminal types. * * The local global NFA is builded with two arrays and a deque. The * first array, next1, in the NFA is used to hold the path for lambda * transitions. The second array, next2, is used to hold alternate paths. * Next1 can hold all types of nodes, but it gets priority for lambda * transitions. The deque is a circular array where future states are queued * in the head and present states are pushed in the tail. * * jmh: To be consistent with other regular expression implementations, the * longest match should be returned, not the shortest which the above does. * Fortunately, returning the longest match is a trivial change, simply * continue scanning if the deque is not empty. However, other implementations * have "greedy" operators - it is the operator itself that matches as much as * possible, rather than the entire expression. As far as the overall match is * concerned, this is just semantics, but it makes quite a difference when * capturing the substrings used by replacement. Unfortunately, the deque * method prevents the reliable capture of substrings, nor does it allow the * operators to determine their own greediness. So instead of using a deque to * store states, a recursive function is used to test one state; if it fails, * the scan continues with the other state. Greediness is controlled by simply * choosing which state is done first. * * Searching for regular expressions in TDE is not very fast. The worst * case search, .*x, where x is any word or letter, is probably the slowest * of any implementation with a regular expression search function. However, * TDE can handle a large regular expression subset. * * In version 3.1, I added BOW and EOW (beginning of word and end of word.) * * In version 3.2, the macro letters were moved to letters.h per Byrial Jensen. * We also use the bj_isxxx( ) functions to test whether a letter falls into * a predefined macro class. * * jmh 050729: added BOB and EOB (beginning & end of bracket/backreference). * This is used by the replacement to substitute bracketed * expressions and for backreferences. * jmh 050803: added BLANK, separate from WHITESPACE. * jmh 050811: added BOS and EOS (beginning & end of string). * jmh 050817: added a closing lookahead assertion. * jmh 050908: replaced the deque with a recursive function. This was * necessary to allow for separate greedy and non-greedy operators * and to correctly obtain backreferences. * * New editor name: TDE, the Thomson-Davis Editor. * Author: Frank Davis * Date: June 5, 1991, version 1.0 * Date: July 29, 1991, version 1.1 * Date: October 5, 1991, version 1.2 * Date: January 20, 1992, version 1.3 * Date: February 17, 1992, version 1.4 * Date: April 1, 1992, version 1.5 * Date: June 5, 1992, version 2.0 * Date: October 31, 1992, version 2.1 * Date: April 1, 1993, version 2.2 * Date: June 5, 1993, version 3.0 * Date: August 29, 1993, version 3.1 * Date: November 13, 1993, version 3.2 * Date: June 5, 1994, version 4.0 * Date: December 5, 1998, version 5.0 (jmh) * * This code is released into the public domain, Frank Davis. * You may use and distribute it freely. */#include "tdestr.h"#include "common.h"#include "tdefunc.h"#include "define.h"#ifndef min #define min(a,b) (((a) < (b)) ? (a) : (b))#endif/* * types of nodes in the NFA. * * let's use the node naming convention in Ken Thompson's reg ex paper. */#define CNODE 0#define NNODE 1/* * types of terminals in NNODEs. * * starting with simple ASCII terminals, it's easy to build in other types of * terminals, e.g. wild chars, BOL, EOL, and character classes. actually, * we can easily build ANSI C ctype library function NNODEs. */#define STRAIGHT_ASCII 0#define IGNORE_ASCII 1#define WILD 2#define CLASS 3#define NOTCLASS 4#define BLANK 5#define WHITESPACE 6#define ALPHA 7#define UPPER 8#define LOWER 9#define ALPHANUM 10#define DECIMAL 11#define HEX 12#define BOL 13#define EOL 14#define BOW 15#define EOW 16#define BOS 17#define EOS 18#define BOB 19#define EOB 20#define BACKREF 21/* * types of terminals in CNODEs. */#define START 0#define ACCEPT 1#define OR_NODE 2#define JUXTA 3#define CLOSURE 4#define ZERO_OR_ONE 5#define ASSERT 6#define ASSERTNOT 7static int lookahead;static int regx_rc;static int parser_state;static char class_bits[32];static int bit[8] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };static int c1;static int c2;static int ttype;static NFA_TYPE *nfa_pointer; int nfa_status;static int search_len;static int search_col;static text_ptr search_string;extern int found_len; /* defined in findrep.c */extern int block_replace_offset;extern long last_replace_line;extern int add_search_line( line_list_ptr, long, int );static int subs_found; int subs_pos[10]; int subs_len[10];/* * Name: find_regx * Purpose: To set up and find a regular expression. * Date: June 5, 1993 * Passed: window: pointer to current window * Notes: Keep the regular expression until changed. */int find_regx( TDE_WIN *window ){int direction;int new_string;register TDE_WIN *win; /* put window pointer in a register */int rc;char answer[MAX_COLS+2]; switch (g_status.command) { case RegXForward : direction = FORWARD; new_string = TRUE; break; case RegXBackward : direction = BACKWARD; new_string = TRUE; break; case RepeatRegXForward : direction = FORWARD; new_string = regx.search_defined != OK ? TRUE : FALSE; break; case RepeatRegXBackward : direction = BACKWARD; new_string = regx.search_defined != OK ? TRUE : FALSE; break; default : assert( FALSE ); return( ERROR ); } win = window; if (un_copy_line( win->ll, win, TRUE, TRUE ) == ERROR) return( ERROR ); /* * get search text, using previous as default */ rc = OK; if (new_string == TRUE) { *answer = '\0'; if (regx.search_defined == OK) { assert( strlen( (char *)regx.pattern ) < (size_t)g_display.ncols ); strcpy( answer, (char *)regx.pattern ); } /* * jmh 991006: if there was an error processing the pattern ask again. */ do { /* * string to find: */ if (get_name( reg1, win->bottom_line, answer, &h_find ) <= 0) return( ERROR ); strcpy( (char *)regx.pattern, answer ); rc = regx.search_defined = build_nfa( ); } while (regx.search_defined == ERROR); } g_status.search = SEARCH_REGX; if (direction == BACKWARD) g_status.search |= SEARCH_BACK; rc = perform_search( win ); return( rc );}/* * Name: regx_search_forward * Purpose: search forward for pattern using nfa * Date: June 5, 1993 * Passed: ll: pointer to current node in linked list * rline: pointer to real line counter * col: column in ll->line to begin search * Returns: position in file if found or NULL if not found * Notes: run the nfa machine on each character in each line * * jmh 010704: fixed bug when on TOF. * jmh 031030: optimise the BOL search (don't check each character). */line_list_ptr regx_search_forward( line_list_ptr ll, long *rline, int *col ){file_infos *file;int bc, ec;int anchor; if (ll->next == NULL) return( NULL ); if (g_status.command == DefineGrep || g_status.command == RepeatGrep) { nfa_pointer = &sas_nfa; } else { nfa_pointer = &nfa; } nfa_status = OK; search_string = ll->line; search_len = ll->len; search_col = *col; file = g_status.current_file; if ((g_status.search & SEARCH_BLOCK) && (file->block_type == BOX || (file->block_type == STREAM && (*rline == file->block_br || (*rline == file->block_er && file->block_ec != -1))))) { bc = file->block_bc; ec = (file->block_ec == -1) ? search_len : file->block_ec; if (last_replace_line == *rline) { ec += block_replace_offset; block_replace_offset = 0; last_replace_line = 0; } if (file->inflate_tabs) { bc = entab_adjust_rcol( search_string, search_len, bc, file->ptab_size ); if (file->block_ec != -1) ec = entab_adjust_rcol( search_string, search_len, ec, file->ptab_size ); } if (bc > search_len) bc = search_len; if (ec >= search_len) ec = search_len - 1; if (file->block_type == STREAM) { if (*rline == file->block_br) { if (search_col < bc) search_col = bc; } if (*rline == file->block_er) { if (search_col > ec + 1) return( NULL ); search_len = ec + 1; if (*rline != file->block_br) bc = 0; } } else { if (search_col < bc) search_col = bc; search_len = ec + 1; } search_string += bc; search_len -= bc; search_col -= bc; } else bc = 0; anchor = nfa_pointer->next1[0]; anchor = (nfa_pointer->node_type[anchor] == NNODE && nfa_pointer->term_type[anchor] == BOL); while (!g_status.control_break) { if (anchor) { if (nfa_match( ) != ERROR) { if (g_status.search & SEARCH_RESULTS) { if (!add_search_line( ll, *rline, 0 )) return( NULL ); } else { *col = bc; return( ll ); } } } else { for (; search_col <= search_len; search_col++) { if (nfa_match( ) != ERROR) { if (g_status.search & SEARCH_RESULTS) { if (!add_search_line( ll, *rline, search_col )) return( NULL ); break; } else { *col = search_col + bc; return( ll ); } } } } ll = ll->next; if (ll->len == EOF) return( NULL ); ++*rline; search_string = ll->line; search_len = ll->len; search_col = 0; if ((g_status.search & SEARCH_BLOCK) && (file->block_type == BOX || (file->block_type == STREAM && *rline == file->block_er && file->block_ec != -1))) { bc = file->block_bc; ec = file->block_ec; if (file->inflate_tabs) { bc = entab_adjust_rcol( search_string, search_len, bc, file->ptab_size ); ec = entab_adjust_rcol( search_string, search_len, ec, file->ptab_size ); } if (bc > search_len) bc = search_len; if (ec >= search_len) ec = search_len - 1; if (file->block_type == STREAM) search_len = ec + 1; else { search_string += bc; search_len = ec - bc + 1; } } } return( NULL );}/* * Name: regx_search_backward * Purpose: search backward for pattern using regular expression * Date: June 5, 1993 * Passed: ll: pointer to current node in linked list * rline: pointer to real line counter * col: column in ll->line to begin search * Returns: position in file if found or NULL if not found * Notes: run the nfa machine on each character in each line */line_list_ptr regx_search_backward( line_list_ptr ll, long *rline, int *col ){file_infos *file;int bc, ec;int anchor; if (ll == NULL || ll->len == EOF) return( NULL ); nfa_pointer = &nfa; search_string = ll->line; search_len = ll->len; search_col = *col; file = g_status.current_file; if ((g_status.search & SEARCH_BLOCK) && (file->block_type == BOX || (file->block_type == STREAM && (*rline == file->block_br || (*rline == file->block_er && file->block_ec != -1))))) { bc = file->block_bc; ec = (file->block_ec == -1) ? search_len : file->block_ec; if (file->inflate_tabs) { bc = entab_adjust_rcol( search_string, search_len, bc, file->ptab_size ); if (file->block_ec != -1) ec = entab_adjust_rcol( search_string, search_len, ec, file->ptab_size ); } if (bc > search_len) bc = search_len; if (ec >= search_len) ec = search_len - 1; if (file->block_type == STREAM) { if (*rline == file->block_er) { search_len = ec + 1; if (*rline != file->block_br) bc = 0; if (search_col > search_len) search_col = search_len; }
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -