<td colspan="2">
&nbsp;
</td>
</tr>
<tr>
<td colspan="2">
<b>Quoting</b>
</td>
</tr>
<tr>
<td>
<code>\Q</code>
</td>
<td>
Nothing, but treats every character (including backslashes) literally until a
matching <code>\E</code>
</td>
</tr>
<tr>
<td>
<code>\E</code>
</td>
<td>
Nothing, but ends its matching <code>\Q</code>
</td>
</tr>
<tr>
<td>
&nbsp;
</td>
</tr>
<tr>
<td colspan="2">
<b>Special Constructs</b>
</td>
</tr>
<tr>
<td>
<code>(?:<i>x</i>)</code>
</td>
<td>
<code><i>x</i></code>, but not as a capturing group
</td>
</tr>
<tr>
<td>
<code>(?=<i>x</i>)</code>
</td>
<td>
<code><i>x</i></code>, via positive lookahead. The expression matches
only if it is followed by <code><i>x</i></code>, and it does not
consume ("eat") any of the characters matched by
<code><i>x</i></code>.
</td>
</tr>
<tr>
<td>
<code>(?!<i>x</i>)</code>
</td>
<td>
<code><i>x</i></code>, via negative lookahead. The expression matches
only if it is not followed by <code><i>x</i></code>, and it does not
consume ("eat") any characters at the position where
<code><i>x</i></code> is tested.
</td>
</tr>
<tr>
<td>
<code>(?&lt;=<i>x</i>)</code>
</td>
<td>
<code><i>x</i></code>, via positive lookbehind. <code><i>x</i></code>
cannot contain any quantifiers.
</td>
</tr>
<tr>
<td>
<code>(?&lt;!<i>x</i>)</code>
</td>
<td>
<code><i>x</i></code>, via negative lookbehind. <code><i>x</i></code>
cannot contain any quantifiers.
</td>
</tr>
<tr>
<td>
<code>(?&gt;<i>x</i>)</code>
</td>
<td>
<code><i>x</i>{1}+</code>
</td>
</tr>
<tr>
<td colspan="2">
&nbsp;
</td>
</tr>
<tr>
<td colspan="2">
<b>Registered Expression Matching</b>
</td>
</tr>
<tr>
<td>
<code>{<i>x</i>}</code>
</td>
<td>
The registered pattern <code><i>x</i></code>
</td>
</tr>
</table>

<P><hr>

<P><i>Begin Text Extracted And Modified From java.util.regex.Pattern documentation</i>

<P><h4> Backslashes, escapes, and quoting </h4>

<P><p> The backslash character (<tt>'\'</tt>) serves to introduce escaped
constructs, as defined in the table above, as well as to quote characters
that otherwise would be interpreted as unescaped constructs.  Thus the
expression <tt>\\</tt> matches a single backslash and <tt>\{</tt> matches a
left brace.

<P><p> It is an error to use a backslash prior to any alphabetic character that
does not denote an escaped construct; these are reserved for future
extensions to the regular-expression language.  A backslash may be used
prior to a non-alphabetic character regardless of whether that character is
part of an unescaped construct.

<P><p>It is necessary to double backslashes in string literals that represent
regular expressions to protect them from interpretation by a compiler.  The
string literal <tt>"&#92;b"</tt>, for example, matches a single backspace
character when interpreted as a regular expression, while
<tt>"&#92;&#92;b"</tt> matches a word boundary.  The string literal
<tt>"&#92;(hello&#92;)"</tt> is illegal and leads to a compile-time error;
in order to match the string <tt>(hello)</tt> the string literal
<tt>"&#92;&#92;(hello&#92;&#92;)"</tt> must be used.
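
<P><p> The doubling rule can be sketched in C++ as follows. This is only an
illustration: it uses the standard library's <tt>std::regex</tt> (default
ECMAScript grammar) as a stand-in, because the calling convention of
<code>Pattern</code> is not shown in this section, and the helper names
<tt>fullMatch</tt>, <tt>matchesLiteralParens</tt>,
<tt>unescapedParensAreAGroup</tt> and <tt>containsWordCat</tt> are
hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: true if all of `text` matches `expr`.
// Uses std::regex as a stand-in engine, not the Pattern class.
bool fullMatch(const std::string& text, const std::string& expr) {
    return std::regex_match(text, std::regex(expr));
}

// The source literal "\\(hello\\)" is the regular expression \(hello\),
// which matches the text (hello) including the parentheses.
bool matchesLiteralParens() {
    return fullMatch("(hello)", "\\(hello\\)");
}

// Without the backslashes the parentheses form a capturing group,
// so the expression matches hello but not (hello).
bool unescapedParensAreAGroup() {
    return fullMatch("hello", "(hello)") && !fullMatch("(hello)", "(hello)");
}

// The source literal "\\bcat\\b" is the regular expression \bcat\b:
// the word "cat" delimited by word boundaries.
bool containsWordCat(const std::string& text) {
    return std::regex_search(text, std::regex("\\bcat\\b"));
}
```

<P><p> Each backslash intended for the regular-expression engine is written
twice in the source so that one survives the compiler's own escape
processing.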

<P><h4> Character Classes </h4>

<P><p> Character classes may appear within other character classes, and
may be composed by the union operator (implicit) and the intersection
operator (<tt>&amp;&amp;</tt>).
The union operator denotes a class that contains every character that is
in at least one of its operand classes.  The intersection operator
denotes a class that contains every character that is in both of its
operand classes.

<P><p> The precedence of character-class operators is as follows, from
highest to lowest:

<P><blockquote><table border="0" cellpadding="1" cellspacing="0"
summary="Precedence of character class operators.">

<P><tr><th>1&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td>Literal escape&nbsp;&nbsp;&nbsp;&nbsp;</td>
<td><tt>\x</tt></td></tr>
<tr><th>2&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td>Range</td>
<td><tt>a-z</tt></td></tr>
<tr><th>3&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td>Grouping</td>
<td><tt>[...]</tt></td></tr>
<tr><th>4&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td>Intersection</td>
<td><tt>[a-z&amp;&amp;[aeiou]]</tt></td></tr>
<tr><th>5&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td>Union</td>
<td><tt>[a-e][i-u]</tt></td></tr>
</table></blockquote>

<P><p> Note that a different set of metacharacters is in effect inside
a character class than outside a character class. For instance, the
regular expression <tt>.</tt> loses its special meaning inside a
character class, while the expression <tt>-</tt> becomes a
range-forming metacharacter.

<P><a name="lt"></a>

<P><a name="cg"></a>
<h4> Groups and capturing </h4>

<P><p> Capturing groups are numbered by counting their opening parentheses from
left to right.  In the expression <tt>((A)(B(C)))</tt>, for example, there
are four such groups: </p>

<P><blockquote><table cellpadding=1 cellspacing=0 summary="Capturing group numberings">

<P><tr><th>1&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td><tt>((A)(B(C)))</tt></td></tr>
<tr><th>2&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td><tt>(A)</tt></td></tr>
<tr><th>3&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td><tt>(B(C))</tt></td></tr>

<P><tr><th>4&nbsp;&nbsp;&nbsp;&nbsp;</th>
<td><tt>(C)</tt></td></tr>
</table></blockquote>

<P><p> Group zero always stands for the entire expression.
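
<P><p> The numbering above can be checked with a short sketch. It assumes the
standard library's <tt>std::regex</tt> as a stand-in engine, which numbers
groups by opening parenthesis in the same way; <tt>captureGroups</tt> is a
hypothetical helper and not part of <code>Pattern</code> or
<code>Matcher</code>.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: matches all of `text` against `expr` and returns
// the text of group 0, group 1, ... in order. The result is empty when
// the expression does not match.
std::vector<std::string> captureGroups(const std::string& text,
                                       const std::string& expr) {
    std::regex re(expr);
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_match(text, m, re)) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            groups.push_back(m[i].str());
        }
    }
    return groups;
}
```

<P><p> Calling <tt>captureGroups("ABC", "((A)(B(C)))")</tt> yields five
entries: group zero (the whole match) followed by the four capturing groups
in the order listed in the table.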

<P><p> Capturing groups are so named because, during a match, each subsequence
of the input sequence that matches such a group is saved.  The captured
subsequence may be used later in the expression, via a back reference, and
may also be retrieved from the matcher once the match operation is complete.

<P><p> The captured input associated with a group is always the subsequence
that the group most recently matched.  If a group is evaluated a second time
because of quantification then its previously-captured value, if any, will
be retained if the second evaluation fails.  Matching the string
<tt>"aba"</tt> against the expression <tt>(a(b)?)+</tt>, for example, leaves
group two set to <tt>"b"</tt>.  All captured input is discarded at the
beginning of each match.

<P><p> Groups beginning with <tt>(?</tt> are pure, <i>non-capturing</i> groups
that do not capture text and do not count towards the group total.

<P>
<h4> Unicode support </h4>

<P><p> Coming Soon.

<P><h4> Comparison to Perl 5 </h4>

<P><p>The <code>Pattern</code> engine performs traditional NFA-based matching
with ordered alternation as occurs in Perl 5.

<P><p> Perl constructs not supported by this class: </p>

<P><ul>

<P><li><p> The conditional constructs <tt>(?{</tt><i>X</i><tt>})</tt> and
<tt>(?(</tt><i>condition</i><tt>)</tt><i>X</i><tt>|</tt><i>Y</i><tt>)</tt>,
</p></li>

<P><li><p> The embedded code constructs <tt>(?{</tt><i>code</i><tt>})</tt>
and <tt>(??{</tt><i>code</i><tt>})</tt>,</p></li>

<P><li><p> The embedded comment syntax <tt>(?#comment)</tt>, </p></li>

<P><li><p> The preprocessing operations <tt>\l</tt>, <tt>&#92;u</tt>,
<tt>\L</tt>, and <tt>\U</tt>, and </p></li>

<P><li><p> Embedded flags.</p></li>

<P></ul>

<P><p> Constructs supported by this class but not by Perl: </p>

<P><ul>

<P><li><p> Possessive quantifiers, which greedily match as much as they can
and do not back off, even when doing so would allow the overall match to
succeed.  </p></li>

<P><li><p> Character-class union and intersection as described
above.</p></li>

<P></ul>

<P><p> Notable differences from Perl: </p>

<P><ul>

<P><li><p> In Perl, <tt>\1</tt> through <tt>\9</tt> are always interpreted
as back references; a backslash-escaped number greater than <tt>9</tt> is
treated as a back reference if at least that many subexpressions exist,
otherwise it is interpreted, if possible, as an octal escape.  In this
class octal escapes must always begin with a zero, and
<tt>\1</tt> through <tt>\9</tt> are always interpreted as back
references.  A larger number is accepted as a back reference if at
least that many subexpressions exist at that point in the regular
expression; otherwise the parser drops digits until the number is
less than or equal to the existing number of groups or it is one digit.
</p></li>

<P><li><p> Perl uses the <tt>g</tt> flag to request a match that resumes
where the last match left off.  This functionality is provided implicitly
by the <CODE>Matcher</CODE> class: Repeated invocations of the
<code>find</code> method will resume where the last match left off,
unless the matcher is reset.  </p></li>

<P><li><p> Perl is forgiving about malformed matching constructs, as in the
expression <tt>*a</tt>, as well as dangling brackets, as in the
expression <tt>abc]</tt>, and treats them as literals.  This
class is strict and will not compile a pattern when dangling characters
are encountered.</p></li>

<P></ul>

<P>
<p> For a more precise description of the behavior of regular expression
constructs, please see <a href="http://www.oreilly.com/catalog/regex2/">
<i>Mastering Regular Expressions, 2nd Edition</i>, Jeffrey E. F. Friedl,
O'Reilly and Associates, 2002.</a>
</p>
<P>

<P><i>End Text Extracted And Modified From java.util.regex.Pattern documentation</i>

<P><hr>

<P></BLOCKQUOTE>
<DL>

<A NAME="compiledPatterns"></A>
<A NAME="DOC.3.2"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>static   std::map&lt;std::string, <!1><A HREF="Pattern.html">Pattern</A> *&gt;  compiledPatterns</B></TT>
<DD>
This currently is not used, so don't try to do anything with it.

<DL><DT><DD></DL><P>
<A NAME="registeredPatterns"></A>
<A NAME="DOC.3.3"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>static   std::map&lt;std::string, std::pair&lt;std::string, unsigned long&gt; &gt;  registeredPatterns</B></TT>
<DD>
Holds all of the registered patterns as strings. Due to certain problems
with compilation of patterns, especially with capturing groups, this seemed
to be the best way to do it.
<DL><DT><DD></DL><P>
<A NAME="nodes"></A>
<A NAME="DOC.3.4"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>std::map&lt;NFANode*, bool&gt;  nodes</B></TT>
<DD>
Holds all the NFA nodes used. This makes deletion of a pattern, as well as
clean-up from an unsuccessful compile much easier and faster.
<DL><DT><DD></DL><P>
<A NAME="matcher"></A>
<A NAME="DOC.3.5"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B><!1><A HREF="Matcher.html">Matcher</A>* matcher</B></TT>
<DD>
Used when methods like split are called. The matcher class uses a lot of
dynamic memory, so keeping an instance around speeds up certain
operations.
<DL><DT><DD></DL><P>
<A NAME="head"></A>
<A NAME="DOC.3.6"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>NFANode* head</B></TT>
<DD>
The front node of the NFA
<DL><DT><DD></DL><P>
<A NAME="pattern"></A>
<A NAME="DOC.3.7"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>std::string pattern</B></TT>
<DD>
The actual regular expression we represent
<DL><DT><DD></DL><P>
<A NAME="error"></A>
<A NAME="DOC.3.8"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>bool error</B></TT>
<DD>
Flag used during compilation. Once the pattern is successfully compiled,
<code>error</code> is no longer used.
<DL><DT><DD></DL><P>
<A NAME="curInd"></A>
<A NAME="DOC.3.9"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>int curInd</B></TT>
<DD>
Used during compilation to keep track of the current index into
<code><!1><A HREF="Pattern.html#DOC.3.7">pattern</A></code>.  Once the pattern is successfully
compiled, <code>curInd</code> is no longer used.
<DL><DT><DD></DL><P>
<A NAME="groupCount"></A>
<A NAME="DOC.3.10"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>int groupCount</B></TT>
<DD>
The number of capture groups this contains
<DL><DT><DD></DL><P>
<A NAME="nonCapGroupCount"></A>
<A NAME="DOC.3.11"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>int nonCapGroupCount</B></TT>
<DD>
The number of non-capture groups this contains
<DL><DT><DD></DL><P>
<A NAME="flags"></A>
<A NAME="DOC.3.12"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>unsigned long flags</B></TT>
<DD>
The flags specified when this was compiled
<DL><DT><DD></DL><P>
<A NAME="raiseError"></A>
<A NAME="DOC.3.13"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>void raiseError()</B></TT>
<DD>
Raises an error during compilation. Compilation will cease at that point
and compile will return <code>NULL</code>.
<DL><DT><DD></DL><P>
<A NAME="registerNode"></A>
<A NAME="DOC.3.14"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>NFANode* registerNode(NFANode*  node)</B></TT>
<DD>
Convenience function for registering a node in <code>nodes</code>.

<DL><DT><DT><B>Parameters:</B><DD><B>node</B> -  The node to register
<BR><DT><B>Returns:</B><DD>  The registered node<BR><DD></DL><P>
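
<P><p> A minimal sketch of how such a registry could work is shown below. It
assumes that registration only records the pointer in the map and that the
owner deletes every recorded node on destruction; the
<tt>NodeRegistry</tt> class and the empty <tt>NFANode</tt> stand-in are
hypothetical and are not taken from the library's source.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Stand-in for the library's NFA node type (assumption: polymorphic base).
struct NFANode {
    virtual ~NFANode() {}
};

// Hypothetical owner of all nodes created while compiling one pattern.
class NodeRegistry {
public:
    NodeRegistry() {}
    NodeRegistry(const NodeRegistry&) = delete;
    NodeRegistry& operator=(const NodeRegistry&) = delete;

    // Deleting every recorded node covers both normal destruction and
    // clean-up after a failed compile.
    ~NodeRegistry() {
        for (auto& entry : nodes) {
            delete entry.first;
        }
    }

    // Records the node and returns it, so a call can wrap a `new`
    // expression inline. Registering the same pointer twice is harmless.
    NFANode* registerNode(NFANode* node) {
        assert(node != nullptr);
        nodes[node] = true;
        return node;
    }

    std::size_t size() const { return nodes.size(); }

private:
    std::map<NFANode*, bool> nodes;
};
```

<P><p> Returning the pointer allows a call such as
<tt>registerNode(new NFANode())</tt> to be used directly where a node is
needed, with no separate bookkeeping step.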
<A NAME="classUnion"></A>
<A NAME="DOC.3.15"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>std::string classUnion(std::string s1, std::string s2) const </B></TT>
<DD>
Calculates the union of two strings. This function will first sort the
strings and then use a simple selection algorithm to find the union.

<DL><DT><DT><B>Parameters:</B><DD><B>s1</B> -  The first "class" to union
<BR><B>s2</B> -  The second "class" to union
<BR><DT><B>Returns:</B><DD>  A new string containing all unique characters. Each character
must have appeared in one or both of <code>s1</code> and
<code>s2</code>.<BR><DD></DL><P>
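
<P><p> The sort-then-merge approach described for <code>classUnion</code> can
be sketched with standard algorithms. This is an illustration under
assumptions, not the library's implementation: <tt>classUnionSketch</tt> is a
hypothetical free function, and it uses <tt>std::set_union</tt> in place of
whatever selection loop the class actually contains.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical sketch: returns the sorted set of characters that appear
// in s1, in s2, or in both.
std::string classUnionSketch(std::string s1, std::string s2) {
    // Sort both inputs so they can be merged in a single linear pass.
    std::sort(s1.begin(), s1.end());
    std::sort(s2.begin(), s2.end());

    // Merge the two sorted ranges, keeping characters found in either.
    std::string merged;
    std::set_union(s1.begin(), s1.end(), s2.begin(), s2.end(),
                   std::back_inserter(merged));

    // Drop duplicates that were repeated within a single input.
    merged.erase(std::unique(merged.begin(), merged.end()), merged.end());
    return merged;
}
```

<P><p> Taking the arguments by value, as the documented signature does, lets
the function sort its own copies without modifying the caller's strings.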
<A NAME="classIntersect"></A>
<A NAME="DOC.3.16"></A>
<DT><IMG ALT="o" BORDER=0 SRC=icon2.gif><TT><B>std::string classIntersect(std::string s1, std::string s2) const </B></TT>
<DD>
Calculates the intersection of two strings. This function will first sort
the strings and then use a simple selection algorithm to find the
intersection.

<DL><DT><DT><B>Parameters:</B><DD><B>s1</B> -  The first "class" to intersect
<BR><B>s2</B> -  The second "class" to intersect