⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 regexp.cpp

📁 CodezBank is just a small application that stores source code snippets organized in a hierarhichal m
💻 CPP
📖 第 1 页 / 共 3 页
字号:
// Win32 porting notes.

#if defined( _MBCS )
#pragma message( __FILEINFO__ "This code is broken under _MBCS, " \
				 "see the comments at the top of this file." )
#endif //_MBCS
//
//
// In case this isn't obvious from the later comments this is an ALTERED
// version of the software. If you like my changes then cool, but nearly
// all of the functionality here is derived from Henry Spencer's original
// work.
//
// This code should work correctly under both _SBCS and _UNICODE, I did
// start working on making it work with _MBCS but gave up after a while
// since I don't need this particular port and it's not going to be as
// straight forward as the other two.
//
// The problem stems from the compiled program being stored as TCHARS,
// the individual items need to be wide enough to hold whatever character
// is thrown at them, but currently they are accessed as an array of
// whatever size integral type is appropriate.  _MBCS would cause this
// to be char, but at times it would need to be larger.  This would
// require making the program be an array of short with the appropriate
// conversions used everywhere.  Certainly it's doable, but it's a pain.
// What's worse is that the current code will compile and run under _MBCS,
// only breaking when it gets wide characters thrown against it.
//
// I've marked at least one bit of code with #pragma messages, I may not
// get all of them, but they should be a start
//
// Guy Gascoigne - Piggford (ggp@bigfoot.com) Friday, February 27, 1998


// regcomp and regexec -- regsub and regerror are elsewhere
// @(#)regexp.c	1.3 of 18 April 87
//
//	Copyright (c) 1986 by University of Toronto.
//	Written by Henry Spencer.  Not derived from licensed software.
//
//	Permission is granted to anyone to use this software for any
//	purpose on any computer system, and to redistribute it freely,
//	subject to the following restrictions:
//
//	1. The author is not responsible for the consequences of use of
//		this software, no matter how awful, even if they arise
//		from defects in it.
//
//	2. The origin of this software must not be misrepresented, either
//		by explicit claim or by omission.
//
//	3. Altered versions must be plainly marked as such, and must not
//		be misrepresented as being the original software.
// *** THIS IS AN ALTERED VERSION.  It was altered by John Gilmore,
// *** hoptoad!gnu, on 27 Dec 1986, to add \< and \> for word-matching
// *** as in BSD grep and ex.
// *** THIS IS AN ALTERED VERSION.  It was altered by John Gilmore,
// *** hoptoad!gnu, on 28 Dec 1986, to optimize characters quoted with \.
// *** THIS IS AN ALTERED VERSION.  It was altered by James A. Woods,
// *** ames!jaw, on 19 June 1987, to quash a regcomp() redundancy.
// *** THIS IS AN ALTERED VERSION.  It was altered by Geoffrey Noer,
// *** THIS IS AN ALTERED VERSION.  It was altered by Guy Gascoigne - Piggford
// *** guy@wyrdrune.com, on 15 March 1998, porting it to C++ and converting
// *** it to be the engine for the Regexp class
//
// Beware that some of this code is subtly aware of the way operator
// precedence is structured in regular expressions.  Serious changes in
// regular-expression syntax might require a total rethink.

#include "stdafx.h"
#include "regexp.h"

// The first byte of the regexp internal "program" is actually this magic
// number; the start node begins in the second byte.

const TCHAR	MAGIC = ((TCHAR)'\234');

#define new DEBUG_NEW

#pragma warning( disable : 4711 )	// automatic inline selected

// The "internal use only" fields in regexp.h are present to pass info from
// compile to execute that permits the execute phase to run lots faster on
// simple cases.  They are:
//
// regstart	char that must begin a match; '\0' if none obvious
// reganch	is the match anchored (at beginning-of-line only)?
// regmust	string (pointer into program) that match must include, or NULL
// regmlen	length of regmust string
//
// Regstart and reganch permit very fast decisions on suitable starting
// points for a match, cutting down the work a lot.  Regmust permits fast
// rejection of lines that cannot possibly match.  The regmust tests are
// costly enough that regcomp() supplies a regmust only if the
// r.e. contains something potentially expensive (at present, the only
// such thing detected is * or + at the start of the r.e., which can
// involve a lot of backup).  Regmlen is supplied because the test in
// regexec() needs it and regcomp() is computing it anyway.

// Structure for regexp "program".  This is essentially a linear encoding
// of a nondeterministic finite-state machine (aka syntax charts or
// "railroad normal form" in parsing technology).  Each node is an opcode
// plus a "next" pointer, possibly plus an operand.  "Next" pointers of
// all nodes except BRANCH implement concatenation; a "next" pointer with
// a BRANCH on both ends of it is connecting two alternatives.  (Here we
// have one of the subtle syntax dependencies: an individual BRANCH (as
// opposed to a collection of them) is never concatenated with anything
// because of operator precedence.)  The operand of some types of node is
// a literal string; for others, it is a node leading into a sub-FSM.  In
// particular, the operand of a BRANCH node is the first node of the
// branch.  (NB this is *not* a tree structure: the tail of the branch
// connects to the thing following the set of BRANCHes.)  The opcodes
// are:

enum {
//  definition	number		opnd?	meaning
	END		=	0,		//	no		End of program.
	BOL		=	1,		//	no		Match beginning of line.
	EOL		=	2,		//	no		Match end of line.
	ANY		=	3,		//	no		Match any character.
	ANYOF	=	4,		//	str		Match any of these.
	ANYBUT	=	5,		//	str		Match any but one of these.
	BRANCH	=	6,		//	node	Match this, or the next..\&.
	BACK	=	7,		//	no		"next" ptr points backward.
	EXACTLY	=	8,		//	str		Match this string.
	NOTHING	=	9,		//	no		Match empty string.
	STAR	=	10,		//	node	Match this 0 or more times.
	PLUS	=	11,		//	node	Match this 1 or more times.
	WORDA	=	12,		//	no		Match "" at wordchar, where prev is nonword
	WORDZ	=	13,		//	no		Match "" at nonwordchar, where prev is word
	OPEN	=	20,		//	no		Sub-RE starts here.
						//			OPEN+1 is number 1, etc.
	CLOSE	=	30		//	no		Analogous to OPEN.
};

// Opcode notes:
//
// BRANCH	The set of branches constituting a single choice are hooked
//		together with their "next" pointers, since precedence prevents
//		anything being concatenated to any individual branch.  The
//		"next" pointer of the last BRANCH in a choice points to the
//		thing following the whole choice.  This is also where the
//		final "next" pointer of each individual branch points; each
//		branch starts with the operand node of a BRANCH node.
//
// BACK		Normal "next" pointers all implicitly point forward; BACK
//		exists to make loop structures possible.
//
// STAR,PLUS	'?', and complex '*' and '+', are implemented as circular
//		BRANCH structures using BACK.  Simple cases (one character
//		per match) are implemented with STAR and PLUS for speed
//		and to minimize recursive plunges.
//
// OPEN,CLOSE	...are numbered at compile time.

// A node is one char of opcode followed by two chars of "next" pointer.
// "Next" pointers are stored as two 8-bit pieces, high order first.  The
// value is a positive offset from the opcode of the node containing it.
// An operand, if any, simply follows the node.  (Note that much of the
// code generation knows about this implicit relationship.)
//
// Using two bytes for the "next" pointer is vast overkill for most things,
// but allows patterns to get big without disasters.


enum
{
	REGERR_SENTINEL_VALUE = 0,
	REGERR_NULLARG = 1, REGERR_CORRUPTED, REGERR_CORRUPTION, REGERR_CORRUPTED_POINTERS,
	REGERR_BAD_REGREPEAT, REGERR_CORRUPTED_OPCODE, REGERR_NULL_TO_REGSUB,
	REGERR_DAMAGED_REGEXP_REGSUB, REGERR_DAMAGED_MATCH_STRING, REGERR_NULL_TO_REGCOMP,
	REGERR_TO_BIG, REGERR_TO_MANY_PAREN, REGERR_UNTERMINATED_PAREN, REGERR_UNMATCHED_PAREN,
	REGERR_INTERNAL_ERROR_JUNK, REGERR_OP_COULD_BE_EMPTY, REGERR_NESTED_OP, REGERR_INVALID_RANGE,
	REGERR_UNMATCHED_BRACE, REGERR_INTERNAL_UNEXPECTED_CHAR, REGERR_OP_FOLLOWS_NOTHING,
	REGERR_TRAILING_ESC, REGERR_INTERNAL_STRSCSPN, REGERR_NO_REGEXP
};

struct regErr
{
	int m_id;
	LPCTSTR m_err;
} errors[] = {
	{ REGERR_NULLARG,					_T( "NULL argument to regexec" ) },
	{ REGERR_CORRUPTED,					_T( "corrupted regexp" ) },
	{ REGERR_CORRUPTION,				_T( "regexp corruption" ) },
	{ REGERR_CORRUPTED_POINTERS,		_T( "corrupted pointers" ) },
	{ REGERR_BAD_REGREPEAT,				_T( "internal error: bad call of regrepeat" ) },
	{ REGERR_CORRUPTED_OPCODE,			_T( "corrupted opcode" ) },
	{ REGERR_NULL_TO_REGSUB,			_T( "NULL parm to regsub" ) },
	{ REGERR_DAMAGED_REGEXP_REGSUB,		_T( "damaged regexp fed to regsub" ) },
	{ REGERR_DAMAGED_MATCH_STRING,		_T( "damaged match string" ) },
	{ REGERR_NULL_TO_REGCOMP,			_T( "NULL argument to regcomp" ) },
	{ REGERR_TO_BIG,					_T( "regexp too big" ) },
	{ REGERR_TO_MANY_PAREN,				_T( "too many ()" ) },
	{ REGERR_UNTERMINATED_PAREN,		_T( "unterminated ()" ) },
	{ REGERR_UNMATCHED_PAREN,			_T( "unmatched ()" ) },
	{ REGERR_INTERNAL_ERROR_JUNK,		_T( "internal error: junk on end" ) },
	{ REGERR_OP_COULD_BE_EMPTY,			_T( "*+ operand could be empty" ) },
	{ REGERR_NESTED_OP,					_T( "nested *?+" ) },
	{ REGERR_INVALID_RANGE,				_T( "invalid [] range" ) },
	{ REGERR_UNMATCHED_BRACE,			_T( "unmatched []" ) },
	{ REGERR_INTERNAL_UNEXPECTED_CHAR,	_T( "internal error: \\0|) unexpected" ) },
	{ REGERR_OP_FOLLOWS_NOTHING,		_T( "?+* follows nothing" ) },
	{ REGERR_TRAILING_ESC,				_T( "trailing \\" ) },
	{ REGERR_INTERNAL_STRSCSPN,			_T( "internal error: strcspn 0" ) },
	{ REGERR_NO_REGEXP,					_T( "NULL regexp" ) },
	{ REGERR_SENTINEL_VALUE,			_T( "Unknown error") }							// must be last value
};

// Flags to be passed up and down.

enum {
	HASWIDTH	=	01,	// Known never to match null string.
	SIMPLE		=	02,	// Simple enough to be STAR/PLUS operand.
	SPSTART		=	04,	// Starts with * or +.
	WORST		=	0	// Worst case.
};

///////////////////////////////////////////////////////////////////////////////

class CRegErrorHandler
{
	friend Regexp;
	mutable CString m_szError;
	static LPCTSTR FindErr( int id );
protected:
	void ClearErrorString() const;
	void regerror( LPCTSTR s ) const;
	void regerror( int id ) const;
public:
	CRegErrorHandler() { }
	CRegErrorHandler( const CRegErrorHandler & reh ) : m_szError( reh.m_szError ) {}

	const CString & GetErrorString() const;
};

void CRegErrorHandler::regerror( LPCTSTR s ) const
{
	TRACE1( "regerror: %s\n", s );
	m_szError = s;
}

void CRegErrorHandler::regerror( int id ) const
{
	regerror( FindErr( id ) );
}

const CString & CRegErrorHandler::GetErrorString() const
{
	return m_szError;
}

void CRegErrorHandler::ClearErrorString() const
{
	m_szError.Empty();
}

LPCTSTR CRegErrorHandler::FindErr( int id )
{
   struct regErr * perr = NULL;
	for ( perr = errors; perr->m_id != REGERR_SENTINEL_VALUE; perr++ )
		if ( perr->m_id == id )
			return perr->m_err;

	return perr->m_err;		// since we've fallen off the array, perr->m_id == 0
}

///////////////////////////////////////////////////////////////////////////////

// All of the functions required to directly access the 'program'
class CRegProgramAccessor : public CRegErrorHandler
{
public:
	static inline TCHAR OP( LPTSTR p )
	{
		return (*(p));
	}
	static inline LPTSTR OPERAND( LPTSTR p )
	{
		return p + 3;
	}
	static inline LPTSTR regnext( LPTSTR p )
	{
		const short &offset = *((short*)(p+1));

		if (offset == 0)
			return(NULL);
		
		return((OP(p) == BACK) ? p-offset : p+offset);
	}
#ifdef _RE_DEBUG
	LPTSTR CRegProgramAccessor::regprop( LPTSTR op );
#endif
};

///////////////////////////////////////////////////////////////////////////////

// The internal interface to the regexp, wrapping the compilation as well as the
// execution of the regexp (matching)

class regexp : public CRegProgramAccessor
{
	friend class CRegExecutor;
	friend class Regexp;
	
	int m_programSize;
	LPTSTR startp[Regexp::NSUBEXP];
	LPTSTR endp[Regexp::NSUBEXP];
	TCHAR regstart;		// Internal use only.
	TCHAR reganch;		// Internal use only.
	LPTSTR regmust;		// Internal use only.
	int regmlen;		// Internal use only.
	LPTSTR program;
	
	bool status;
	int count;			// used by Regexp to manage the reference counting of regexps
	int numSubs;
public:
	
	regexp( LPCTSTR exp, BOOL iCase );
	regexp( const regexp & r );
	~regexp();
	
	void ignoreCase( const TCHAR * in, TCHAR * out );
	
	bool regcomp( LPCTSTR exp );
	bool regexec( LPCTSTR string );
	bool Status() const { return status; }

	CString GetReplaceString( const TCHAR* sReplaceExp ) const;

	regexp * getCopy();

#ifdef _RE_DEBUG
	void  regdump();
#endif
	
#ifdef _DEBUG
	CString m_originalPattern;
	CString m_modifiedPattern;
#endif
};

///////////////////////////////////////////////////////////////////////////////
// Compile / Validate the regular expression - ADT

class CRegCompilerBase : public CRegProgramAccessor
{
public:
	CRegCompilerBase( LPCTSTR parse );

	LPTSTR reg(int paren, int *flagp);
protected:
	LPTSTR regparse;	// Input-scan pointer. 
	int regnpar;		// () count. 

	LPTSTR regbranch(int *flagp);
	LPTSTR regpiece(int *flagp);
	LPTSTR regatom(int *flagp);
	inline bool ISREPN( TCHAR c )	{ return ((c) == '*' || (c) == '+' || (c) == '?'); }

	virtual void regc(int c) = 0;
	virtual LPTSTR regnode(int op) = 0;
	virtual void reginsert(TCHAR op, LPTSTR opnd) = 0;
	virtual void regtail(LPTSTR p, LPTSTR val) = 0;
	virtual void regoptail(LPTSTR p, LPTSTR val) = 0;
};

///////////////////////////////////////////////////////////////////////////////
// First pass over the expression, testing for validity and returning the 
// program size

class CRegValidator : public CRegCompilerBase
{
public:
	CRegValidator( LPCTSTR parse );

	inline long Size() const { return regsize; }
private:
	long regsize;		// Code size. 
	TCHAR regdummy[3];	// NOTHING, 0 next ptr 
protected:
	LPTSTR regnode(int) { regsize += 3; return regdummy; }
	void regc(int) { regsize++; }
	void reginsert(TCHAR, LPTSTR) { regsize += 3; }
	void regtail(LPTSTR, LPTSTR) { return; }
	void regoptail(LPTSTR, LPTSTR) { return; }
};

///////////////////////////////////////////////////////////////////////////////
// Second pass, actually generating the program

class CRegCompiler : public CRegCompilerBase
{
public:
	CRegCompiler( LPCTSTR parse, LPTSTR prog );
private:
	LPTSTR regcode;
protected:
	// regc - emit (if appropriate) a byte of code
	void regc(int b)
	{
		*regcode++ = (TCHAR)b;
	}
	LPTSTR regnode(int op);
	void reginsert(TCHAR op, LPTSTR opnd);
	void regtail(LPTSTR p, LPTSTR val);
	void regoptail(LPTSTR p, LPTSTR val);
};

// regnode - emit a node
LPTSTR CRegCompiler::regnode(int op)
{
	LPTSTR const ret = regcode;

	LPTSTR ptr = ret;
	*ptr++ = (TCHAR)op;
	*ptr++ = '\0';		// Null next pointer. 
	*ptr++ = '\0';
	regcode = ptr;

	return(ret);
}

// reginsert - insert an operator in front of already-emitted operand
//
// Means relocating the operand.
void CRegCompiler::reginsert(TCHAR op, LPTSTR opnd)
{
	LPTSTR place;

	(void) memmove(opnd+3, opnd, (size_t)((regcode - opnd)*sizeof(TCHAR)));
	regcode += 3;

	place = opnd;		// Op node, where operand used to be.
	*place++ = op;
	*place++ = '\0';
	*place++ = '\0';
}

// regtail - set the next-pointer at the end of a node chain
void CRegCompiler::regtail(LPTSTR p, LPTSTR val)
{
	LPTSTR scan;
	LPTSTR temp;

	// Find last node. 
	for (scan = p; (temp = regnext(scan)) != NULL; scan = temp)
		continue;

	*((short *)(scan+1)) = (short)((OP(scan) == BACK) ? scan - val : val - scan);
}

// regoptail - regtail on operand of first argument; nop if operandless
void CRegCompiler::regoptail(LPTSTR p, LPTSTR val)
{
	// "Operandless" and "op != BRANCH" are synonymous in practice. 
	if (OP(p) == BRANCH)
		regtail(OPERAND(p), val);
}

///////////////////////////////////////////////////////////////////////////////

CRegCompilerBase::CRegCompilerBase( LPCTSTR parse )
	: regparse( (LPTSTR)parse ),
	regnpar(1)
{
}

CRegValidator::CRegValidator( LPCTSTR parse )
	: CRegCompilerBase( parse ),
	regsize(0)
{
	regc(MAGIC);
	regdummy[0] = NOTHING;
	regdummy[1] = regdummy[2] = 0;
}

CRegCompiler::CRegCompiler( LPCTSTR parse, LPTSTR prog )
	: CRegCompilerBase( parse ),
	regcode(prog)
{
	regc(MAGIC);
}

///////////////////////////////////////////////////////////////////////////////

regexp::regexp( LPCTSTR exp, BOOL iCase )
	: regstart(0),
	reganch(0),
	regmust(0),
	regmlen(0),
	program(0),
	m_programSize(0)
{
#if _DEBUG
	m_originalPattern = exp;		// keep a version of the pattern for debugging
#endif

	if ( iCase )
	{
		TCHAR * out = new TCHAR[(_tcslen( exp ) * 4) + 1];
		ignoreCase( exp, out );

#if _DEBUG
		m_modifiedPattern = out;	// and the modified version if there is one
#endif
		status = regcomp( out );
		delete [] out;
	}
	else
		status = regcomp( exp );

	count = numSubs = 0;
}

regexp::regexp( const regexp & orig )
	: regstart(orig.regstart),
	reganch(orig.reganch),
	regmlen(orig.regmlen),
	m_programSize(orig.m_programSize),
	numSubs(orig.numSubs),
	regmust(0)
{
#if _DEBUG
	m_originalPattern = orig.m_originalPattern;
	m_modifiedPattern = orig.m_modifiedPattern;
#endif
	status = orig.status;
	count = 0;
	program = new TCHAR[m_programSize];
	memcpy( program, orig.program, m_programSize * sizeof( TCHAR ) );
	if ( orig.regmust )
		regmust = program + ( orig.regmust - orig.program );

	for ( int i = Regexp::NSUBEXP; i > 0; i--)
	{
		startp[i] = orig.startp[i];
		endp[i] = orig.endp[i];
	}
}

regexp::~regexp()
{
	delete [] program;
}


// regcomp - compile a regular expression into internal code
//
// We can't allocate space until we know how big the compiled form will
// be, but we can't compile it (and thus know how big it is) until we've
// got a place to put the code.  So we cheat: we compile it twice, once
// with code generation turned off and size counting turned on, and once
// "for real".  This also means that we don't allocate space until we are
// sure that the thing really will compile successfully, and we never
// have to move the code and thus invalidate pointers into it.  (Note
// that it has to be in one piece because free() must be able to free it
// all.)
//
// Beware that the optimization-preparation code in here knows about some
// of the structure of the compiled regexp.

bool regexp::regcomp(LPCTSTR exp)
{
	LPTSTR scan;
	int flags;

	if (exp == NULL)
	{
		regerror( REGERR_NULL_TO_REGCOMP );
		return NULL;
	}

	// First pass: determine size, legality.
	CRegValidator tester( exp );

	if (tester.reg(0, &flags) == NULL)
		return false;

	// Small enough for pointer-storage convention? 

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -