📄 regexp.cpp
字号:
// Win32 porting notes.
#if defined( _MBCS )
#pragma message( __FILEINFO__ "This code is broken under _MBCS, " \
"see the comments at the top of this file." )
#endif //_MBCS
//
//
// In case this isn't obvious from the later comments this is an ALTERED
// version of the software. If you like my changes then cool, but nearly
// all of the functionality here is derived from Henry Spencer's original
// work.
//
// This code should work correctly under both _SBCS and _UNICODE, I did
// start working on making it work with _MBCS but gave up after a while
// since I don't need this particular port and it's not going to be as
// straight forward as the other two.
//
// The problem stems from the compiled program being stored as TCHARS,
// the individual items need to be wide enough to hold whatever character
// is thrown at them, but currently they are accessed as an array of
// whatever size integral type is appropriate. _MBCS would cause this
// to be char, but at times it would need to be larger. This would
// require making the program be an array of short with the appropriate
// conversions used everywhere. Certainly it's doable, but it's a pain.
// What's worse is that the current code will compile and run under _MBCS,
// only breaking when it gets wide characters thrown against it.
//
// I've marked at least one bit of code with #pragma messages, I may not
// get all of them, but they should be a start
//
// Guy Gascoigne - Piggford (ggp@bigfoot.com) Friday, February 27, 1998
// regcomp and regexec -- regsub and regerror are elsewhere
// @(#)regexp.c 1.3 of 18 April 87
//
// Copyright (c) 1986 by University of Toronto.
// Written by Henry Spencer. Not derived from licensed software.
//
// Permission is granted to anyone to use this software for any
// purpose on any computer system, and to redistribute it freely,
// subject to the following restrictions:
//
// 1. The author is not responsible for the consequences of use of
// this software, no matter how awful, even if they arise
// from defects in it.
//
// 2. The origin of this software must not be misrepresented, either
// by explicit claim or by omission.
//
// 3. Altered versions must be plainly marked as such, and must not
// be misrepresented as being the original software.
// *** THIS IS AN ALTERED VERSION. It was altered by John Gilmore,
// *** hoptoad!gnu, on 27 Dec 1986, to add \< and \> for word-matching
// *** as in BSD grep and ex.
// *** THIS IS AN ALTERED VERSION. It was altered by John Gilmore,
// *** hoptoad!gnu, on 28 Dec 1986, to optimize characters quoted with \.
// *** THIS IS AN ALTERED VERSION. It was altered by James A. Woods,
// *** ames!jaw, on 19 June 1987, to quash a regcomp() redundancy.
// *** THIS IS AN ALTERED VERSION. It was altered by Geoffrey Noer,
// *** THIS IS AN ALTERED VERSION. It was altered by Guy Gascoigne - Piggford
// *** guy@wyrdrune.com, on 15 March 1998, porting it to C++ and converting
// *** it to be the engine for the Regexp class
//
// Beware that some of this code is subtly aware of the way operator
// precedence is structured in regular expressions. Serious changes in
// regular-expression syntax might require a total rethink.
#include "stdafx.h"
#include "regexp.h"
// The first byte of the regexp internal "program" is actually this magic
// number; the start node begins in the second byte.
const TCHAR MAGIC = ((TCHAR)'\234');
#define new DEBUG_NEW
#pragma warning( disable : 4711 ) // automatic inline selected
// The "internal use only" fields in regexp.h are present to pass info from
// compile to execute that permits the execute phase to run lots faster on
// simple cases. They are:
//
// regstart char that must begin a match; '\0' if none obvious
// reganch is the match anchored (at beginning-of-line only)?
// regmust string (pointer into program) that match must include, or NULL
// regmlen length of regmust string
//
// Regstart and reganch permit very fast decisions on suitable starting
// points for a match, cutting down the work a lot. Regmust permits fast
// rejection of lines that cannot possibly match. The regmust tests are
// costly enough that regcomp() supplies a regmust only if the
// r.e. contains something potentially expensive (at present, the only
// such thing detected is * or + at the start of the r.e., which can
// involve a lot of backup). Regmlen is supplied because the test in
// regexec() needs it and regcomp() is computing it anyway.
// Structure for regexp "program". This is essentially a linear encoding
// of a nondeterministic finite-state machine (aka syntax charts or
// "railroad normal form" in parsing technology). Each node is an opcode
// plus a "next" pointer, possibly plus an operand. "Next" pointers of
// all nodes except BRANCH implement concatenation; a "next" pointer with
// a BRANCH on both ends of it is connecting two alternatives. (Here we
// have one of the subtle syntax dependencies: an individual BRANCH (as
// opposed to a collection of them) is never concatenated with anything
// because of operator precedence.) The operand of some types of node is
// a literal string; for others, it is a node leading into a sub-FSM. In
// particular, the operand of a BRANCH node is the first node of the
// branch. (NB this is *not* a tree structure: the tail of the branch
// connects to the thing following the set of BRANCHes.) The opcodes
// are:
enum {
// definition number opnd? meaning
END = 0, // no End of program.
BOL = 1, // no Match beginning of line.
EOL = 2, // no Match end of line.
ANY = 3, // no Match any character.
ANYOF = 4, // str Match any of these.
ANYBUT = 5, // str Match any but one of these.
BRANCH = 6, // node Match this, or the next..\&.
BACK = 7, // no "next" ptr points backward.
EXACTLY = 8, // str Match this string.
NOTHING = 9, // no Match empty string.
STAR = 10, // node Match this 0 or more times.
PLUS = 11, // node Match this 1 or more times.
WORDA = 12, // no Match "" at wordchar, where prev is nonword
WORDZ = 13, // no Match "" at nonwordchar, where prev is word
OPEN = 20, // no Sub-RE starts here.
// OPEN+1 is number 1, etc.
CLOSE = 30 // no Analogous to OPEN.
};
// Opcode notes:
//
// BRANCH The set of branches constituting a single choice are hooked
// together with their "next" pointers, since precedence prevents
// anything being concatenated to any individual branch. The
// "next" pointer of the last BRANCH in a choice points to the
// thing following the whole choice. This is also where the
// final "next" pointer of each individual branch points; each
// branch starts with the operand node of a BRANCH node.
//
// BACK Normal "next" pointers all implicitly point forward; BACK
// exists to make loop structures possible.
//
// STAR,PLUS '?', and complex '*' and '+', are implemented as circular
// BRANCH structures using BACK. Simple cases (one character
// per match) are implemented with STAR and PLUS for speed
// and to minimize recursive plunges.
//
// OPEN,CLOSE ...are numbered at compile time.
// A node is one char of opcode followed by two chars of "next" pointer.
// "Next" pointers are stored as two 8-bit pieces, high order first. The
// value is a positive offset from the opcode of the node containing it.
// An operand, if any, simply follows the node. (Note that much of the
// code generation knows about this implicit relationship.)
//
// Using two bytes for the "next" pointer is vast overkill for most things,
// but allows patterns to get big without disasters.
enum
{
REGERR_SENTINEL_VALUE = 0,
REGERR_NULLARG = 1, REGERR_CORRUPTED, REGERR_CORRUPTION, REGERR_CORRUPTED_POINTERS,
REGERR_BAD_REGREPEAT, REGERR_CORRUPTED_OPCODE, REGERR_NULL_TO_REGSUB,
REGERR_DAMAGED_REGEXP_REGSUB, REGERR_DAMAGED_MATCH_STRING, REGERR_NULL_TO_REGCOMP,
REGERR_TO_BIG, REGERR_TO_MANY_PAREN, REGERR_UNTERMINATED_PAREN, REGERR_UNMATCHED_PAREN,
REGERR_INTERNAL_ERROR_JUNK, REGERR_OP_COULD_BE_EMPTY, REGERR_NESTED_OP, REGERR_INVALID_RANGE,
REGERR_UNMATCHED_BRACE, REGERR_INTERNAL_UNEXPECTED_CHAR, REGERR_OP_FOLLOWS_NOTHING,
REGERR_TRAILING_ESC, REGERR_INTERNAL_STRSCSPN, REGERR_NO_REGEXP
};
struct regErr
{
int m_id;
LPCTSTR m_err;
} errors[] = {
{ REGERR_NULLARG, _T( "NULL argument to regexec" ) },
{ REGERR_CORRUPTED, _T( "corrupted regexp" ) },
{ REGERR_CORRUPTION, _T( "regexp corruption" ) },
{ REGERR_CORRUPTED_POINTERS, _T( "corrupted pointers" ) },
{ REGERR_BAD_REGREPEAT, _T( "internal error: bad call of regrepeat" ) },
{ REGERR_CORRUPTED_OPCODE, _T( "corrupted opcode" ) },
{ REGERR_NULL_TO_REGSUB, _T( "NULL parm to regsub" ) },
{ REGERR_DAMAGED_REGEXP_REGSUB, _T( "damaged regexp fed to regsub" ) },
{ REGERR_DAMAGED_MATCH_STRING, _T( "damaged match string" ) },
{ REGERR_NULL_TO_REGCOMP, _T( "NULL argument to regcomp" ) },
{ REGERR_TO_BIG, _T( "regexp too big" ) },
{ REGERR_TO_MANY_PAREN, _T( "too many ()" ) },
{ REGERR_UNTERMINATED_PAREN, _T( "unterminated ()" ) },
{ REGERR_UNMATCHED_PAREN, _T( "unmatched ()" ) },
{ REGERR_INTERNAL_ERROR_JUNK, _T( "internal error: junk on end" ) },
{ REGERR_OP_COULD_BE_EMPTY, _T( "*+ operand could be empty" ) },
{ REGERR_NESTED_OP, _T( "nested *?+" ) },
{ REGERR_INVALID_RANGE, _T( "invalid [] range" ) },
{ REGERR_UNMATCHED_BRACE, _T( "unmatched []" ) },
{ REGERR_INTERNAL_UNEXPECTED_CHAR, _T( "internal error: \\0|) unexpected" ) },
{ REGERR_OP_FOLLOWS_NOTHING, _T( "?+* follows nothing" ) },
{ REGERR_TRAILING_ESC, _T( "trailing \\" ) },
{ REGERR_INTERNAL_STRSCSPN, _T( "internal error: strcspn 0" ) },
{ REGERR_NO_REGEXP, _T( "NULL regexp" ) },
{ REGERR_SENTINEL_VALUE, _T( "Unknown error") } // must be last value
};
// Flags to be passed up and down.
enum {
HASWIDTH = 01, // Known never to match null string.
SIMPLE = 02, // Simple enough to be STAR/PLUS operand.
SPSTART = 04, // Starts with * or +.
WORST = 0 // Worst case.
};
///////////////////////////////////////////////////////////////////////////////
class CRegErrorHandler
{
friend Regexp;
mutable CString m_szError;
static LPCTSTR FindErr( int id );
protected:
void ClearErrorString() const;
void regerror( LPCTSTR s ) const;
void regerror( int id ) const;
public:
CRegErrorHandler() { }
CRegErrorHandler( const CRegErrorHandler & reh ) : m_szError( reh.m_szError ) {}
const CString & GetErrorString() const;
};
void CRegErrorHandler::regerror( LPCTSTR s ) const
{
TRACE1( "regerror: %s\n", s );
m_szError = s;
}
void CRegErrorHandler::regerror( int id ) const
{
regerror( FindErr( id ) );
}
const CString & CRegErrorHandler::GetErrorString() const
{
return m_szError;
}
void CRegErrorHandler::ClearErrorString() const
{
m_szError.Empty();
}
LPCTSTR CRegErrorHandler::FindErr( int id )
{
struct regErr * perr = NULL;
for ( perr = errors; perr->m_id != REGERR_SENTINEL_VALUE; perr++ )
if ( perr->m_id == id )
return perr->m_err;
return perr->m_err; // since we've fallen off the array, perr->m_id == 0
}
///////////////////////////////////////////////////////////////////////////////
// All of the functions required to directly access the 'program'
class CRegProgramAccessor : public CRegErrorHandler
{
public:
static inline TCHAR OP( LPTSTR p )
{
return (*(p));
}
static inline LPTSTR OPERAND( LPTSTR p )
{
return p + 3;
}
static inline LPTSTR regnext( LPTSTR p )
{
const short &offset = *((short*)(p+1));
if (offset == 0)
return(NULL);
return((OP(p) == BACK) ? p-offset : p+offset);
}
#ifdef _RE_DEBUG
LPTSTR CRegProgramAccessor::regprop( LPTSTR op );
#endif
};
///////////////////////////////////////////////////////////////////////////////
// The internal interface to the regexp, wrapping the compilation as well as the
// execution of the regexp (matching)
class regexp : public CRegProgramAccessor
{
friend class CRegExecutor;
friend class Regexp;
int m_programSize;
LPTSTR startp[Regexp::NSUBEXP];
LPTSTR endp[Regexp::NSUBEXP];
TCHAR regstart; // Internal use only.
TCHAR reganch; // Internal use only.
LPTSTR regmust; // Internal use only.
int regmlen; // Internal use only.
LPTSTR program;
bool status;
int count; // used by Regexp to manage the reference counting of regexps
int numSubs;
public:
regexp( LPCTSTR exp, BOOL iCase );
regexp( const regexp & r );
~regexp();
void ignoreCase( const TCHAR * in, TCHAR * out );
bool regcomp( LPCTSTR exp );
bool regexec( LPCTSTR string );
bool Status() const { return status; }
CString GetReplaceString( const TCHAR* sReplaceExp ) const;
regexp * getCopy();
#ifdef _RE_DEBUG
void regdump();
#endif
#ifdef _DEBUG
CString m_originalPattern;
CString m_modifiedPattern;
#endif
};
///////////////////////////////////////////////////////////////////////////////
// Compile / Validate the regular expression - ADT
class CRegCompilerBase : public CRegProgramAccessor
{
public:
CRegCompilerBase( LPCTSTR parse );
LPTSTR reg(int paren, int *flagp);
protected:
LPTSTR regparse; // Input-scan pointer.
int regnpar; // () count.
LPTSTR regbranch(int *flagp);
LPTSTR regpiece(int *flagp);
LPTSTR regatom(int *flagp);
inline bool ISREPN( TCHAR c ) { return ((c) == '*' || (c) == '+' || (c) == '?'); }
virtual void regc(int c) = 0;
virtual LPTSTR regnode(int op) = 0;
virtual void reginsert(TCHAR op, LPTSTR opnd) = 0;
virtual void regtail(LPTSTR p, LPTSTR val) = 0;
virtual void regoptail(LPTSTR p, LPTSTR val) = 0;
};
///////////////////////////////////////////////////////////////////////////////
// First pass over the expression, testing for validity and returning the
// program size
class CRegValidator : public CRegCompilerBase
{
public:
CRegValidator( LPCTSTR parse );
inline long Size() const { return regsize; }
private:
long regsize; // Code size.
TCHAR regdummy[3]; // NOTHING, 0 next ptr
protected:
LPTSTR regnode(int) { regsize += 3; return regdummy; }
void regc(int) { regsize++; }
void reginsert(TCHAR, LPTSTR) { regsize += 3; }
void regtail(LPTSTR, LPTSTR) { return; }
void regoptail(LPTSTR, LPTSTR) { return; }
};
///////////////////////////////////////////////////////////////////////////////
// Second pass, actually generating the program
class CRegCompiler : public CRegCompilerBase
{
public:
CRegCompiler( LPCTSTR parse, LPTSTR prog );
private:
LPTSTR regcode;
protected:
// regc - emit (if appropriate) a byte of code
void regc(int b)
{
*regcode++ = (TCHAR)b;
}
LPTSTR regnode(int op);
void reginsert(TCHAR op, LPTSTR opnd);
void regtail(LPTSTR p, LPTSTR val);
void regoptail(LPTSTR p, LPTSTR val);
};
// regnode - emit a node
LPTSTR CRegCompiler::regnode(int op)
{
LPTSTR const ret = regcode;
LPTSTR ptr = ret;
*ptr++ = (TCHAR)op;
*ptr++ = '\0'; // Null next pointer.
*ptr++ = '\0';
regcode = ptr;
return(ret);
}
// reginsert - insert an operator in front of already-emitted operand
//
// Means relocating the operand.
void CRegCompiler::reginsert(TCHAR op, LPTSTR opnd)
{
LPTSTR place;
(void) memmove(opnd+3, opnd, (size_t)((regcode - opnd)*sizeof(TCHAR)));
regcode += 3;
place = opnd; // Op node, where operand used to be.
*place++ = op;
*place++ = '\0';
*place++ = '\0';
}
// regtail - set the next-pointer at the end of a node chain
void CRegCompiler::regtail(LPTSTR p, LPTSTR val)
{
LPTSTR scan;
LPTSTR temp;
// Find last node.
for (scan = p; (temp = regnext(scan)) != NULL; scan = temp)
continue;
*((short *)(scan+1)) = (short)((OP(scan) == BACK) ? scan - val : val - scan);
}
// regoptail - regtail on operand of first argument; nop if operandless
void CRegCompiler::regoptail(LPTSTR p, LPTSTR val)
{
// "Operandless" and "op != BRANCH" are synonymous in practice.
if (OP(p) == BRANCH)
regtail(OPERAND(p), val);
}
///////////////////////////////////////////////////////////////////////////////
CRegCompilerBase::CRegCompilerBase( LPCTSTR parse )
: regparse( (LPTSTR)parse ),
regnpar(1)
{
}
CRegValidator::CRegValidator( LPCTSTR parse )
: CRegCompilerBase( parse ),
regsize(0)
{
regc(MAGIC);
regdummy[0] = NOTHING;
regdummy[1] = regdummy[2] = 0;
}
CRegCompiler::CRegCompiler( LPCTSTR parse, LPTSTR prog )
: CRegCompilerBase( parse ),
regcode(prog)
{
regc(MAGIC);
}
///////////////////////////////////////////////////////////////////////////////
regexp::regexp( LPCTSTR exp, BOOL iCase )
: regstart(0),
reganch(0),
regmust(0),
regmlen(0),
program(0),
m_programSize(0)
{
#if _DEBUG
m_originalPattern = exp; // keep a version of the pattern for debugging
#endif
if ( iCase )
{
TCHAR * out = new TCHAR[(_tcslen( exp ) * 4) + 1];
ignoreCase( exp, out );
#if _DEBUG
m_modifiedPattern = out; // and the modified version if there is one
#endif
status = regcomp( out );
delete [] out;
}
else
status = regcomp( exp );
count = numSubs = 0;
}
regexp::regexp( const regexp & orig )
: regstart(orig.regstart),
reganch(orig.reganch),
regmlen(orig.regmlen),
m_programSize(orig.m_programSize),
numSubs(orig.numSubs),
regmust(0)
{
#if _DEBUG
m_originalPattern = orig.m_originalPattern;
m_modifiedPattern = orig.m_modifiedPattern;
#endif
status = orig.status;
count = 0;
program = new TCHAR[m_programSize];
memcpy( program, orig.program, m_programSize * sizeof( TCHAR ) );
if ( orig.regmust )
regmust = program + ( orig.regmust - orig.program );
for ( int i = Regexp::NSUBEXP; i > 0; i--)
{
startp[i] = orig.startp[i];
endp[i] = orig.endp[i];
}
}
regexp::~regexp()
{
delete [] program;
}
// regcomp - compile a regular expression into internal code
//
// We can't allocate space until we know how big the compiled form will
// be, but we can't compile it (and thus know how big it is) until we've
// got a place to put the code. So we cheat: we compile it twice, once
// with code generation turned off and size counting turned on, and once
// "for real". This also means that we don't allocate space until we are
// sure that the thing really will compile successfully, and we never
// have to move the code and thus invalidate pointers into it. (Note
// that it has to be in one piece because free() must be able to free it
// all.)
//
// Beware that the optimization-preparation code in here knows about some
// of the structure of the compiled regexp.
bool regexp::regcomp(LPCTSTR exp)
{
LPTSTR scan;
int flags;
if (exp == NULL)
{
regerror( REGERR_NULL_TO_REGCOMP );
return NULL;
}
// First pass: determine size, legality.
CRegValidator tester( exp );
if (tester.reg(0, &flags) == NULL)
return false;
// Small enough for pointer-storage convention?
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -