{
	ASSERT( re.SubStart(0) == 0 );
	ASSERT( re.SubLength(0) == 26 );

	ASSERT( re.SubStart(1) == 0 );
	ASSERT( re.SubLength(1) == 19 );

	ASSERT( re.SubStart(2) == 20 );
	ASSERT( re.SubLength(2) == 5 );
}
</tt></pre>
</blockquote>

<p><strong><tt>CString Regexp::GetReplaceString( LPCTSTR source ) const; </tt></strong></p>

<blockquote>
  <p>After a successful Match you can retrieve a replacement string as an alternative to
  building up the various substrings by hand. </p>
  <p>Each character in the source string will be copied to the return value except for the
  following special characters:</p>
  <table border="0" width="690">
    <tr>
      <td width="88">&amp;&nbsp;&nbsp; </td>
      <td width="594">The complete matched string (sub-string 0).</td>
    </tr>
    <tr>
      <td width="88">\1&nbsp; </td>
      <td width="594">Sub-string 1</td>
    </tr>
    <tr>
      <td width="88">...</td>
      <td width="594">and so on until...</td>
    </tr>
    <tr>
      <td width="88">\9</td>
      <td width="594">Sub-string 9</td>
    </tr>
  </table>
  <p>So, taking the now ubiquitous example:</p>
  <p><tt>CString repl = re.GetReplaceString( &quot;\\2 == \\1&quot; );</tt></p>
</blockquote>

<blockquote>
  <p>Will give:</p>
</blockquote>

<blockquote>
  <p><tt>repl == &quot;Kelly == wyrdrune.com!kelly&quot;; </tt></p>
</blockquote>
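<blockquote>
  <p>The substitution rules in the table can be sketched in standard C++. This is only an
  illustration of the documented behaviour, not the library's own code:
  <tt>expandReplacement</tt>, <tt>demoExpand</tt> and the <tt>subs</tt> vector are
  hypothetical names, the text used for sub-string 0 is made up, and treating a missing
  sub-string as empty is an assumption.</p>
</blockquote>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expand a replacement template: '&' becomes subs[0] and "\1".."\9" become
// subs[1]..subs[9]; every other character is copied unchanged.
// A sub-string number with no entry in subs expands to nothing (assumption).
std::string expandReplacement(const std::string& tmpl,
                              const std::vector<std::string>& subs)
{
    std::string out;
    for (std::size_t i = 0; i < tmpl.size(); ++i) {
        const char c = tmpl[i];
        int index = -1;                       // -1 means "ordinary character"
        if (c == '&') {
            index = 0;
        } else if (c == '\\' && i + 1 < tmpl.size()
                   && tmpl[i + 1] >= '1' && tmpl[i + 1] <= '9') {
            index = tmpl[i + 1] - '0';
            ++i;                              // consume the digit as well
        }
        if (index < 0)
            out += c;
        else if (static_cast<std::size_t>(index) < subs.size())
            out += subs[index];
    }
    return out;
}

// Usage: note the doubled backslashes needed inside a C++ string literal.
void demoExpand()
{
    const std::vector<std::string> subs = {
        "whole match (made up)", "wyrdrune.com!kelly", "Kelly"};
    assert(expandReplacement("\\2 == \\1", subs) == "Kelly == wyrdrune.com!kelly");
}
```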

<blockquote>
  <p>As an implementation note: the <tt>CRegExp</tt> version of a similarly named function
  returned a pointer to a newly allocated array. Whilst this is efficient, it puts the onus
  upon the user of the class to delete it (correctly, with <tt>delete []</tt>) once it is no
  longer needed. Because the MFC <tt>CString</tt> class is reference counted, passing
  <tt>CString</tt>s around by value isn't that expensive: the allocation only happens when
  the string data is first created, and ownership of that data is then handed from one
  <tt>CString</tt> instance to another as needed. When the last <tt>CString</tt> referring
  to the data goes out of scope, the data is deleted. This is efficient, and much more
  robust than having to keep track of which functions are allocators and which ones are
  not.</p>
</blockquote>

<p><strong><tt>CString Regexp::GetErrorString() const; </tt></strong></p>

<blockquote>
  <p>Returns a description of the most recent error raised by this <tt>Regexp</tt>. Errors
  include, but are not limited to, compilation errors (usually syntax errors) and calling
  <tt>Match</tt> when the <tt>Regexp</tt> hasn't been initialized correctly (or at all).
  A fair number of these should never occur if every regular expression comes from your
  own code, but when the user can type in regular expressions that you then have to
  compile, checking this can be very important.</p>
</blockquote>

<p><strong><tt>bool Regexp::CompiledOK() const; </tt></strong></p>

<blockquote>
  <p>Return the status of the last regular expression compilation.</p>
</blockquote>
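<blockquote>
  <p>The compile-then-check pattern for user-supplied expressions can be sketched as
  follows. The sketch uses the standard library's <tt>std::regex</tt> as a stand-in, not
  the <tt>Regexp</tt> class itself; <tt>CompileResult</tt>, <tt>tryCompile</tt> and
  <tt>demoCompileCheck</tt> are hypothetical names, and the wording of the error text
  depends on the standard library in use.</p>
</blockquote>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Result of trying to compile a user-supplied pattern.
struct CompileResult {
    bool ok;             // plays the role of CompiledOK()
    std::string error;   // plays the role of GetErrorString()
};

// Compile `pattern` with std::regex and report a failure as data instead of
// letting the exception escape to the caller.
CompileResult tryCompile(const std::string& pattern)
{
    try {
        std::regex compiled(pattern);
        (void)compiled;                 // only the compile step matters here
        return {true, ""};
    } catch (const std::regex_error& e) {
        return {false, e.what()};
    }
}

// Usage: a well-formed pattern compiles, an unbalanced parenthesis does not.
void demoCompileCheck()
{
    assert( tryCompile("a+b").ok);
    assert(!tryCompile("(ab").ok);
}
```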
<p><font FACE="Arial" SIZE="4"><b><a name="Regular Expression Syntax">Regular Expression Syntax</a></b></font></p>

<p>A regular expression is zero or more branches, separated by '<strong><font
color="#FF0000">|</font></strong>'. It matches anything that matches one of the branches.</p>

<p>A branch is zero or more pieces, concatenated. It matches a match for the first,
followed by a match for the second, etc.</p>

<p>A piece is an atom possibly followed by '<strong><font color="#FF0000">*</font></strong>',
'<strong><font color="#FF0000">+</font></strong>', or '<strong><font color="#FF0000">?</font></strong>'.
An atom followed by '<strong><font color="#FF0000">*</font></strong>' matches a sequence
of 0 or more matches of the atom. An atom followed by '<strong><font color="#FF0000">+</font></strong>'
matches a sequence of 1 or more matches of the atom. An atom followed by '<strong><font
color="#FF0000">?</font></strong>' matches a match of the atom, or the null string.</p>
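<p>The three suffixes can be tried out with a small sketch. It uses <tt>std::regex</tt>
with its default grammar as a stand-in, which is assumed to agree with this library on
these operators; <tt>matchesAll</tt> and <tt>demoQuantifiers</tt> are hypothetical
names.</p>

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when the whole of `text` matches `pattern`.
bool matchesAll(const std::string& pattern, const std::string& text)
{
    return std::regex_match(text, std::regex(pattern));
}

void demoQuantifiers()
{
    assert( matchesAll("ab*c", "ac"));      // '*': zero or more b's
    assert( matchesAll("ab*c", "abbbc"));
    assert(!matchesAll("ab+c", "ac"));      // '+': at least one b
    assert( matchesAll("ab+c", "abc"));
    assert( matchesAll("ab?c", "ac"));      // '?': one b, or the null string
    assert(!matchesAll("ab?c", "abbc"));
}
```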

<p>An atom is a regular expression in parentheses (matching a match for the regular
expression), a range (see below), '<strong><font color="#FF0000">.</font></strong>'
(matching any single character), '<strong><font color="#FF0000">^</font></strong>'
(matching the null string at the beginning of the input string), '<strong><font
color="#FF0000">$</font></strong>' (matching the null string at the end of the input
string), a '<strong><font color="#FF0000">\</font></strong>' followed by a single
character (matching that character), or a single character with no other significance
(matching that character).</p>

<p>A range is a sequence of characters enclosed in '<strong><font color="#FF0000">[]</font></strong>'.
It normally matches any single character from the sequence. If the sequence begins with '<strong><font
color="#FF0000">^</font></strong>', it matches any single character not from the rest of
the sequence. If two characters in the sequence are separated by '<strong><font
color="#FF0000">-</font></strong>', this is shorthand for the full list of ASCII
characters between them (e.g. '<strong><font color="#FF0000">[0-9]</font></strong>'
matches any decimal digit). To include a literal '<strong><font color="#FF0000">]</font></strong>'
in the sequence, make it the first character (following a possible '<strong><font
color="#FF0000">^</font></strong>'). To include a literal '<strong><font color="#FF0000">-</font></strong>',
make it the first or last character.</p>
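<p>A few atoms and ranges in action, again sketched with <tt>std::regex</tt> as a
stand-in rather than this library; <tt>found</tt> and <tt>demoAtomsAndRanges</tt> are
hypothetical names. The rules for a literal ']' or '-' are left out because the default
<tt>std::regex</tt> grammar may not treat them the same way.</p>

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when `pattern` matches somewhere inside `text`.
bool found(const std::string& pattern, const std::string& text)
{
    return std::regex_search(text, std::regex(pattern));
}

void demoAtomsAndRanges()
{
    assert( found("[0-9]", "room 7"));    // range: any decimal digit
    assert(!found("[0-9]", "room"));
    assert( found("[^0-9]", "7a"));       // negated range: any non-digit
    assert(!found("[^0-9]", "777"));
    assert( found("^a.c$", "abc"));       // '^' and '$' anchor, '.' is any character
    assert(!found("^a.c$", "xabc"));
    assert( found("a\\.c", "a.c"));       // '\' makes the '.' literal
    assert(!found("a\\.c", "abc"));
}
```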
<p><font FACE="Arial" SIZE="4"><b><a name="Ambiguity">Ambiguity</a></b></font></p>

<p>If a regular expression could match two different parts of the input string, it will
match the one which begins earliest. If both begin in the same place but match different
lengths, or match the same length in different ways, life gets messier, as follows. </p>

<p>In general, the possibilities in a list of branches are considered in left-to-right
order, the possibilities for '<strong><font color="#FF0000">*</font></strong>', '<strong><font
color="#FF0000">+</font></strong>', and '<strong><font color="#FF0000">?</font></strong>'
are considered longest-first, nested constructs are considered from the outermost in, and
concatenated constructs are considered leftmost-first. The match that will be chosen is
the one that uses the earliest possibility in the first choice that has to be made. If
there is more than one choice, the next will be made in the same manner (earliest
possibility) subject to the decision on the first choice. And so forth.</p>

<p>For example, '<strong><font color="#FF0000">(ab|a)b*c</font></strong>' could match '<strong><font
color="#FF0000">abc</font></strong>' in one of two ways. The first choice is between '<strong><font
color="#FF0000">ab</font></strong>' and '<strong><font color="#FF0000">a</font></strong>';
since '<strong><font color="#FF0000">ab</font></strong>' is earlier, and does lead to a
successful overall match, it is chosen. Since the '<strong><font color="#FF0000">b</font></strong>'
is already spoken for, the '<strong><font color="#FF0000">b*</font></strong>' must match
its last possibility--the empty string--since it must respect the earlier choice.</p>
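<p>The branch-order rule can be observed through the captured sub-string. This sketch
uses <tt>std::regex</tt> with its default grammar, which also tries branches left to
right, as a stand-in for this library; <tt>capture</tt> and <tt>demoBranchOrder</tt> are
hypothetical names.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Text captured by group `group` when `pattern` is searched for in `text`,
// or an empty string when there is no match or no such group.
std::string capture(const std::string& pattern, const std::string& text,
                    std::size_t group)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re) || group >= m.size())
        return "";
    return m[group].str();
}

void demoBranchOrder()
{
    assert(capture("(ab|a)b*c", "abc", 0) == "abc");
    assert(capture("(ab|a)b*c", "abc", 1) == "ab");  // first branch wins, b* is empty
    assert(capture("(a|ab)b*c", "abc", 1) == "a");   // branches swapped, b* takes the b
}
```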

<p>In the particular case where the regular expression does not use '<strong><font
color="#FF0000">|</font></strong>' and does not apply '<strong><font color="#FF0000">*</font></strong>',
'<strong><font color="#FF0000">+</font></strong>', or '<strong><font color="#FF0000">?</font></strong>'
to parenthesized subexpressions, the net effect is that the longest possible match will be
chosen. So '<strong><font color="#FF0000">ab*</font></strong>', presented with '<strong><font
color="#FF0000">xabbbby</font></strong>', will match '<strong><font color="#FF0000">abbbb</font></strong>'.
Note that if '<strong><font color="#FF0000">ab*</font></strong>' is tried against '<strong><font
color="#FF0000">xabyabbbz</font></strong>', it will match '<strong><font color="#FF0000">ab</font></strong>'
just after '<strong><font color="#FF0000">x</font></strong>', due to the begins-earliest
rule. (In effect, the decision on where to start the match is the first choice to be made,
hence subsequent choices must respect it even if this leads them to less-preferred
alternatives.)</p>
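<p>Both effects, longest match and begins-earliest, show up in a short sketch. It uses
<tt>std::regex</tt> as a stand-in for this library; <tt>Span</tt>, <tt>firstMatch</tt>
and <tt>demoEarliestThenLongest</tt> are hypothetical names.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Where the first match starts and what it covers; start is -1 on no match.
struct Span {
    std::ptrdiff_t start;
    std::string text;
};

Span firstMatch(const std::string& pattern, const std::string& text)
{
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return {-1, ""};
    return {m.position(0), m.str(0)};
}

void demoEarliestThenLongest()
{
    const Span longest = firstMatch("ab*", "xabbbby");
    assert(longest.start == 1 && longest.text == "abbbb");  // takes every b

    const Span earliest = firstMatch("ab*", "xabyabbbz");
    assert(earliest.start == 1 && earliest.text == "ab");   // not the later "abbb"
}
```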
<p><font FACE="Arial" SIZE="4"><b><a name="The Source">The Source</a></b></font></p>

<p>The accompanying archive contains the regexp library, as well as two separate test
programs.</p>

<p>The first (originally enough called Test1) is a C++ port of the original test program
that came with the C code. I&#146;ve updated it to use the C++ constructs that the new
library exposes. It acts as a useful sanity check and regression test when I&#146;ve been
modifying the source.</p>

<p>The second test is much simpler and uses the library's substring extraction function to
chop fields out of an email header. It is less of a test program and more of a simple
sample.</p>

<p><a href="regexp_source.zip" tppabs="http://www.codeguru.com/string/regexp_source.zip">Download Source.</a></p>
<p><font FACE="Arial" SIZE="3"><b><a name="A Note about Character Size">A Note about Character Size</a></b></font></p>

<p>This code (and the samples) works and has been tested pretty thoroughly under Single
Byte Character Sets (SBCS) and UNICODE. It will NOT work under Multi Byte Character Sets
(MBCS), though it will compile, which is very misleading. The problem (for anyone
interested in fixing it) is that the internal representation of the 'program'
requires a fixed size character. It manipulates this using <tt>memcpy()</tt> and <tt>memmove()</tt>
without any knowledge of whether a particular element in its array is some internal
code or a character. Making this use variable width characters would be a real pain, since
much more of the code would have to decode the program itself in order to determine
whether a specific point in the program was looking at an operator or part of a character.
Certainly this is doable, but it is more work than I want right now. The code works under
UNICODE and that's good enough for me. BTW, even if the code is compiled with <tt>_MBCS</tt>
it will only fail when it's actually presented with multi-byte text; it'll work
just fine with plain single-byte ASCII text.</p>
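<p>One way to see why byte-at-a-time code and multi-byte text don't mix is the sketch
below. It is not taken from the library: it only shows the general hazard, using the
Shift-JIS encoding as an assumed example of an MBCS code page, and
<tt>findOperatorByte</tt> and <tt>demoTrailByte</tt> are hypothetical names.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte offset of the first byte equal to `op`, or std::string::npos.
// This is how code that assumes one element per character scans its input.
std::size_t findOperatorByte(const std::string& bytes, char op)
{
    return bytes.find(op);
}

void demoTrailByte()
{
    // Shift-JIS encoding of one katakana character: lead byte 0x83, trail byte 0x5C.
    const std::string oneCharacter = "\x83\x5C";

    // 0x5C is also the code for '\', so a byte-wise scan reports an escape
    // character in the middle of what is really a single character.
    assert(findOperatorByte(oneCharacter, '\\') == 1);
}
```

<p>Text that contains only single-byte characters never produces this kind of false hit,
which fits the observation that the code only fails once real multi-byte text turns
up.</p>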

<p>&nbsp;</p>
</body>

<P>Posted on: April 10, 98.
<P>
<HR>
<TABLE BORDER=0 WIDTH="100%" >
<TR>
<TD WIDTH="33%"><FONT SIZE=-1><A HREF="../index.htm" tppabs="http://www.codeguru.com/">Goto HomePage</A></FONT></TD>
<TD WIDTH="33%"> <CENTER><FONT SIZE=-2>&copy; 1997 Zafir Anjum</FONT>&nbsp;</CENTER></TD>
<TD WIDTH="34%"><DIV ALIGN=right><FONT SIZE=-1>Contact me: <A HREF="mailto:zafir@home.com">zafir@home.com</A>&nbsp;</FONT></DIV></TD>
</TR>
</TABLE>
</BODY>
</HTML>