<td><tt>[a-z&&[aeiou]]</tt></td></tr>
<tr><th>5 </th>
<td>Union</td>
<td><tt>[a-e][i-u]</tt></td></tr>
</table></blockquote>
<p> Note that a different set of metacharacters is in effect inside
a character class than outside a character class. For instance, the
regular expression <tt>.</tt> loses its special meaning inside a
character class, while the expression <tt>-</tt> becomes a range-forming
metacharacter.
<a name="lt"></a>
<a name="cg"></a>
<h4> Groups and capturing </h4>
<p> Capturing groups are numbered by counting their opening parentheses from
left to right. In the expression <tt>((A)(B(C)))</tt>, for example, there
are four such groups: </p>
<blockquote><table cellpadding=1 cellspacing=0 summary="Capturing group numberings">
<tr><th>1 </th>
<td><tt>((A)(B(C)))</tt></td></tr>
<tr><th>2 </th>
<td><tt>(A)</tt></td></tr>
<tr><th>3 </th>
<td><tt>(B(C))</tt></td></tr>
<tr><th>4 </th>
<td><tt>(C)</tt></td></tr>
</table></blockquote>
<p> Group zero always stands for the entire expression.
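<p> The numbering rule above can be sketched as a small stand-alone
function. This is only an illustration: <code>listCapturingGroups</code> is a
hypothetical helper, not part of <code>WCPattern</code>, and it ignores
character classes and <code>\Q...\E</code> quoting for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of WCPattern. Returns the text of every
// capturing group in expr, indexed by group number; index 0 holds the
// whole expression. Groups starting with "(?" are not numbered.
std::vector<std::wstring> listCapturingGroups(const std::wstring & expr)
{
  std::vector<std::wstring> groups(1, expr);  // group zero
  std::vector<long> open;                     // group number per open paren, -1 if non-capturing
  std::vector<std::size_t> starts;            // index of each open paren
  for (std::size_t i = 0; i < expr.size(); ++i)
  {
    if (expr[i] == L'\\') { ++i; continue; }  // skip the escaped character
    if (expr[i] == L'(')
    {
      const bool capturing = !(i + 1 < expr.size() && expr[i + 1] == L'?');
      open.push_back(capturing ? static_cast<long>(groups.size()) : -1L);
      if (capturing) groups.push_back(std::wstring());
      starts.push_back(i);
    }
    else if (expr[i] == L')' && !open.empty())
    {
      if (open.back() >= 0)
        groups[static_cast<std::size_t>(open.back())] =
          expr.substr(starts.back(), i - starts.back() + 1);
      open.pop_back();
      starts.pop_back();
    }
  }
  return groups;
}
```

<p> Called on <tt>L"((A)(B(C)))"</tt>, the sketch returns five entries: the
whole expression at index zero followed by the four groups from the table.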
<p> Capturing groups are so named because, during a match, each subsequence
of the input sequence that matches such a group is saved. The captured
subsequence may be used later in the expression, via a back reference, and
may also be retrieved from the matcher once the match operation is complete.
<p> The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second time
because of quantification then its previously-captured value, if any, will
be retained if the second evaluation fails. Matching the string
<tt>L"aba"</tt> against the expression <tt>(a(b)?)+</tt>, for example, leaves
group two set to <tt>L"b"</tt>. All captured input is discarded at the
beginning of each match.
<p> Groups beginning with <tt>(?</tt> are pure, <i>non-capturing</i> groups
that do not capture text and do not count towards the group total.
<h4> WC support </h4>
<p> Coming Soon.
<h4> Comparison to Perl 5 </h4>
<p>The <code>WCPattern</code> engine performs traditional NFA-based matching
with ordered alternation as occurs in Perl 5.
<p> Perl constructs not supported by this class: </p>
<ul>
<li><p> The conditional constructs
<tt>(?(</tt><i>condition</i><tt>)</tt><i>X</i><tt>)</tt> and
<tt>(?(</tt><i>condition</i><tt>)</tt><i>X</i><tt>|</tt><i>Y</i><tt>)</tt>,
</p></li>
<li><p> The embedded code constructs <tt>(?{</tt><i>code</i><tt>})</tt>
and <tt>(??{</tt><i>code</i><tt>})</tt>,</p></li>
<li><p> The embedded comment syntax <tt>(?#comment)</tt>, </p></li>
<li><p> The preprocessing operations <tt>\l</tt>, <tt>\u</tt>,
<tt>\L</tt>, and <tt>\U</tt>, and </p></li>
<li><p> Embedded flags. </p></li>
</ul>
<p> Constructs supported by this class but not by Perl: </p>
<ul>
<li><p> Possessive quantifiers, which greedily match as much as they can
and do not back off, even when doing so would allow the overall match to
succeed. </p></li>
<li><p> Character-class union and intersection as described
above.</p></li>
</ul>
<p> Notable differences from Perl: </p>
<ul>
<li><p> In Perl, <tt>\1</tt> through <tt>\9</tt> are always interpreted
as back references; a backslash-escaped number greater than <tt>9</tt> is
treated as a back reference if at least that many subexpressions exist,
otherwise it is interpreted, if possible, as an octal escape. In this
class, octal escapes must always begin with a zero. <tt>\1</tt> through
<tt>\9</tt> are always interpreted as back references, and a larger number
is accepted as a back reference if at least that many subexpressions exist
at that point in the regular expression; otherwise the parser drops digits
until the number is less than or equal to the existing number of groups or
only one digit remains.
</p></li>
<li><p> Perl uses the <tt>g</tt> flag to request a match that resumes
where the last match left off. This functionality is provided implicitly
by the <CODE>WCMatcher</CODE> class: Repeated invocations of the
<code>find</code> method will resume where the last match left off,
unless the matcher is reset. </p></li>
<li><p> Perl is forgiving about malformed matching constructs, as in the
expression <tt>*a</tt>, as well as dangling brackets, as in the
expression <tt>abc]</tt>, and treats them as literals. This
class is strict and will not compile a pattern when dangling characters
are encountered.</p></li>
</ul>
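<p> The back-reference rule in the first item above can be sketched as
follows. This is an illustration only: <code>resolveBackref</code> is a
hypothetical helper, not the parser's actual code, and it assumes that the
digits dropped are trailing ones. It takes the longest digit prefix that names
an existing group, which gives the same result as dropping trailing digits one
at a time.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper, not part of WCPattern. digits is the run of decimal
// digits that follows the backslash (assumed to start with 1-9), and
// groupsSoFar is the number of capturing groups opened so far.
// Returns the pair (group number, number of digits consumed).
std::pair<int, std::size_t> resolveBackref(const std::wstring & digits, int groupsSoFar)
{
  int value = 0;
  std::size_t used = 0;
  while (used < digits.size() && digits[used] >= L'0' && digits[used] <= L'9')
  {
    const int next = value * 10 + static_cast<int>(digits[used] - L'0');
    // The first digit is always accepted; later digits are accepted only
    // while the number still refers to an existing group.
    if (used > 0 && next > groupsSoFar) break;
    value = next;
    ++used;
  }
  return std::make_pair(value, used);
}
```

<p> Under these assumptions, <tt>\11</tt> in a pattern with three groups so far
resolves to group 1 with one digit consumed, so the second <tt>1</tt> is left
for the parser to handle separately. With eleven groups it resolves to
group 11.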
<p> For a more precise description of the behavior of regular expression
constructs, please see <a href="http://www.oreilly.com/catalog/regex2/">
<i>Mastering Regular Expressions, 2nd Edition</i>, Jeffrey E. F. Friedl,
O'Reilly and Associates, 2002.</a>
</p>
<P>
<i>End Text Extracted And Modified From java.util.regex.Pattern documentation</i>
<hr>
@author Jeffery Stuart
@since March 2003, Stable Since November 2004
@version 1.05.01
@memo A class used to represent "PERL 5"-ish regular expressions
*/
class WCPattern
{
friend class WCMatcher;
friend class NFAUNode;
friend class NFAQuantifierUNode;
private:
/**
This constructor should not be called directly. Those wishing to use the
WCPattern class should instead use the {@link compile compile} method.
@param rhs The pattern to compile
@memo Creates a new pattern from the regular expression in <code>rhs</code>.
*/
WCPattern(const std::wstring & rhs);
protected:
/**
This currently is not used, so don't try to do anything with it.
@memo Holds all the compiled patterns for quick access.
*/
static std::map<std::wstring, WCPattern *> compiledWCPatterns;
/**
Holds all of the registered patterns as strings. Due to certain problems
with compilation of patterns, especially with capturing groups, this seemed
to be the best way to do it.
*/
static std::map<std::wstring, std::pair<std::wstring, unsigned long> > registeredWCPatterns;
protected:
/**
Holds all the NFA nodes used. This makes deletion of a pattern, as well as
clean-up from an unsuccessful compile much easier and faster.
*/
std::map<NFAUNode*, bool> nodes;
/**
Used when methods like split are called. The matcher class uses a lot of
dynamic memory, so keeping an instance around speeds up certain
operations.
*/
WCMatcher * matcher;
/**
The front node of the NFA.
*/
NFAUNode * head;
/**
The actual regular expression we represent.
*/
std::wstring pattern;
/**
Flag used during compilation. Once the pattern is successfully compiled,
<code>error</code> is no longer used.
*/
bool error;
/**
Used during compilation to keep track of the current index into
<code>{@link pattern pattern}</code>. Once the pattern is successfully
compiled, <code>curInd</code> is no longer used.
*/
int curInd;
/**
The number of capture groups this contains.
*/
int groupCount;
/**
The number of non-capture groups this contains.
*/
int nonCapGroupCount;
/**
The flags specified when this was compiled.
*/
unsigned long flags;
protected:
/**
Raises an error during compilation. Compilation will cease at that point
and compile will return <code>NULL</code>.
*/
void raiseError();
/**
Convenience function for registering a node in <code>nodes</code>.
@param node The node to register
@return The registered node
*/
NFAUNode * registerNode(NFAUNode * node);
/**
Calculates the union of two strings. This function will first sort the
strings and then use a simple selection algorithm to find the union.
@param s1 The first "class" to union
@param s2 The second "class" to union
@return A new string containing all unique characters. Each character
must have appeared in one or both of <code>s1</code> and
<code>s2</code>.
*/
std::wstring classUnion (std::wstring s1, std::wstring s2) const;
/**
Calculates the intersection of two strings. This function will first sort
the strings and then use a simple selection algorithm to find the
intersection.
@param s1 The first "class" to intersect
@param s2 The second "class" to intersect
@return A new string containing all unique characters. Each character
must have appeared in both <code>s1</code> and <code>s2</code>.
*/
std::wstring classIntersect (std::wstring s1, std::wstring s2) const;
/**
Calculates the negation of a string. The negation is the set of all
characters between <code>\x00</code> and <code>\xFF</code> not
contained in <code>s1</code>.
@param s1 The "class" to be negated.
@return A new string containing all unique characters between
<code>\x00</code> and <code>\xFF</code> that do not appear in
<code>s1</code>.
*/
std::wstring classNegate (std::wstring s1) const;
/**
Creates a new "class" representing the range from <code>low</code> through
<code>hi</code>. This function will wrap if <code>low</code> >
<code>hi</code>. This is a feature, not a bug. Sometimes it is useful
to be able to say [\x70-\x10] instead of [\x70-\x7F\x00-\x10].
@param low The beginning character
@param hi The ending character
@return A new string containing all the characters from low thru hi.
*/
std::wstring classCreateRange(wchar_t low, wchar_t hi) const;
/**
Extracts a decimal number from the substring of member-variable
<code>{@link pattern pattern}</code> starting at <code>start</code> and
ending at <code>end</code>.
@param start The starting index in <code>{@link pattern pattern}</code>
@param end The last index in <code>{@link pattern pattern}</code>
@return The decimal number in <code>{@link pattern pattern}</code>
*/
int getInt(int start, int end);
/**
Parses a <code>{n,m}</code> string out of the member-variable
<code>{@link pattern pattern}</code> and stores the results in
<code>sNum</code> and <code>eNum</code>.
@param sNum Output parameter. The minimum number of matches required
by the curly quantifier is stored here.
@param eNum Output parameter. The maximum number of matches allowed
by the curly quantifier is stored here.
@return Success/Failure. Fails when the curly does not have the proper
syntax.
*/
bool quantifyCurly(int & sNum, int & eNum);
/**
Tries to quantify the currently parsed group. If the group being parsed
is indeed quantified in the member-variable
<code>{@link pattern pattern}</code>, then the NFA is modified accordingly.
@param start The starting node of the current group being parsed
@param stop The ending node of the current group being parsed
@param gn The group number of the current group being parsed
@return The node representing the starting node of the group. If the
group becomes quantified, then this node is not necessarily
a GroupHead node.
*/
NFAUNode * quantifyGroup(NFAUNode * start, NFAUNode * stop, const int gn);
/**
Tries to quantify the last parsed expression. If the expression was indeed
quantified, then the NFA is modified accordingly.
@param newNode The recently created expression node
@return The node representing the last parsed expression. If the
expression was quantified, <code>return value != newNode</code>
*/
NFAUNode * quantify(NFAUNode * newNode);
/**
Parses the current class being examined in
<code>{@link pattern pattern}</code>.
@return A string of unique characters contained in the current class being
parsed
*/
std::wstring parseClass();
/**
Parses the current POSIX class being examined in
<code>{@link pattern pattern}</code>.
@return A string of unique characters representing the POSIX class being
parsed
*/
std::wstring parsePosix();
/**
Returns a string containing the octal character being parsed.
@return The string containing the octal value being parsed
*/
std::wstring parseOctal();
/**
Returns a string containing the hex character being parsed.
@return The string containing the hex value being parsed
*/
std::wstring parseHex();
/**
Returns a new node representing the back reference being parsed
@return The new node representing the back reference being parsed
*/
NFAUNode * parseBackref();
/**
Parses the escape sequence currently being examined. Determines if the
escape sequence is a class, a single character, or the beginning of a
quotation sequence.
@param inv Output parameter. Whether or not to invert the returned class
@param quo Output parameter. Whether or not this sequence starts a
quotation.
@return The characters represented by the class
*/
std::wstring parseEscape(bool & inv, bool & quo);
/**
Parses what is expected to be a reference to a registered pattern in the
pattern currently under compilation. If the sequence of characters does
name a registered pattern, then the registered pattern is appended to
<code>*end</code>. The registered pattern is parsed with the current
compilation flags.
@param end The ending node of the thus-far compiled pattern
@return The new end node of the current pattern
*/
NFAUNode * parseRegisteredWCPattern(NFAUNode ** end);
/**
Parses a lookbehind expression. Appends the necessary nodes to
<code>*end</code>.
@param pos Positive or negative look behind
@param end The ending node of the current pattern
@return The new end node of the current pattern
*/
NFAUNode * parseBehind(const bool pos, NFAUNode ** end);
/**
Parses the current expression and tacks on nodes until a \E is found.
@return The end of the current pattern
*/
NFAUNode * parseQuote();
/**
Parses <code>{@link pattern pattern}</code>. This function is called
recursively when an or (<code>|</code>) or a group is encountered.
@param inParen Are we currently parsing inside a group
@param inOr Are we currently parsing one side of an or (<code>|</code>)
@param end The end of the current expression
@return The starting node of the NFA constructed from this parse
*/
NFAUNode * parse(const bool inParen = 0, const bool inOr = 0, NFAUNode ** end = NULL);
public:
/// We should match regardless of case
const static unsigned long CASE_INSENSITIVE;
/// We are implicitly quoted
const static unsigned long LITERAL;
/// @memo We should treat a <code><b>.</b></code> as [\x00-\x7F]
const static unsigned long DOT_MATCHES_ALL;
/** <code>^</code> and <code>$</code> should anchor to the beginning and
ending of lines, not all input
*/
const static unsigned long MULTILINE_MATCHING;
/** When enabled, only instances of <code>\n</code> are recognized as
line terminators
*/
const static unsigned long UNIX_LINE_MODE;
/// The absolute minimum number of matches a quantifier can match (0)
const static int MIN_QMATCH;
/// The absolute maximum number of matches a quantifier can match (0x7FFFFFFF)
const static int MAX_QMATCH;
public:
/**
Call this function to compile a regular expression into a
<code>WCPattern</code> object. Special values can be assigned to
<code>mode</code> when certain non-standard behaviors are expected from
the <code>WCPattern</code> object.
@param pattern The regular expression to compile
@param mode A bitwise or of flags signalling what special behaviors are
wanted from this <code>WCPattern</code> object
@return If successful, <code>compile</code> returns a <code>WCPattern</code>
pointer. Upon failure, <code>compile</code> returns
<code>NULL</code>
*/
static WCPattern * compile (const std::wstring & pattern,
const unsigned long mode = 0);
/**
Do not use this function. This function will compile a pattern, and cache
the result. This will eventually be used as an optimization when people
just want to call static methods using the same pattern over and over
instead of first compiling the pattern and then using the compiled
instance for matching.