📄 ch08_01.htm

📁 by Randal L. Schwartz and Tom Phoenix ISBN 0-596-00132-0 Third Edition, published July 2001. (See
💻 HTM
字号:
<html><head><title>More About Regular Expressions (Learning Perl, 3rd Edition)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" />

<meta name="DC.Creator" content="Randal L. Schwartz and Tom Phoenix" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="0596001320L" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Learning Perl, 3rd Edition" /><meta name="DC.Type" content="Text.Monograph" />

</head><body bgcolor="#ffffff">

<img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Learning Perl, 3rd Edition" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map>

<div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch07_04.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"></a></td><td align="right" valign="top" width="228"><a href="ch08_02.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div>




<h1 class="chapter">Chapter 8. More About Regular Expressions</h1>
<div class="htmltoc"><h4 class="tochead">Contents:</h4>
  <p> <a href="#lperl3-CHP-8-SECT-1">Character Classes</a><br />
<a href="ch08_02.htm">General Quantifiers</a><br />
<a href="ch08_03.htm">Anchors</a><br />
<a href="ch08_04.htm">Memory Parentheses</a><br />
<a href="ch08_05.htm">Precedence</a><br />
<a href="ch08_06.htm">Exercises</a><br /></p></div>

<p>In the previous chapter, we saw the beginnings of what regular
expressions can do. Here we'll see some of their other common
features.
</p>

<div class="sect1"><a name="lperl3-CHP-8-SECT-1" /></a>
<h2 class="sect1">8.1. Character Classes</h2>

<p><a name="INDEX-524" /></a>A <em class="firstterm">character
class</em>, a list of possible characters inside
<a name="INDEX-525" /></a>
<a name="INDEX-526" /></a>square brackets
(<tt class="literal">[]</tt>), matches any single character from within the
class. It matches just one single character, but that one character
may be any of the ones listed.
</p>

<p>For example, the character class <tt class="literal">[abcwxyz]</tt> may
match any one of those seven characters. For convenience, you may
specify a range of characters with a <a name="INDEX-527" /></a> <a name="INDEX-528" /></a>hyphen (<tt class="literal">-</tt>), so
that class may also be written as <tt class="literal">[a-cw-z]</tt>. That
didn't save much typing, but it's more usual to make a
character class like <tt class="literal">[a-zA-Z]</tt>, to match any one
letter out of that set of 52.<a href="#FOOTNOTE-173">[173]</a> You may use the same character
shortcuts as in any double-quotish string to define a character, so
the class <tt class="literal">[\000-\177]</tt> matches any seven-bit ASCII
character.<a href="#FOOTNOTE-174">[174]</a>
</p><blockquote class="footnote"> <a name="FOOTNOTE-173" /></a><p>[173]Notice that those 52
don't include letters like &Aring; and &Eacute; and &Icirc; and
&Oslash; and &Uuml;. But when Unicode processing is available, that
particular character range is noticed and enhanced to automatically
do the right thing.</p> </blockquote><blockquote class="footnote"> <a name="FOOTNOTE-174" /></a><p>[174]At least, if you use ASCII and not
EBCDIC.</p> </blockquote>

<p>Of course, a character class will be just part of a full pattern; it
will never stand on its own in Perl. For example, you might see code
that says something like this:
</p>

<blockquote><pre class="code">$_ = "The HAL-9000 requires authorization to continue.";
if (/HAL-[0-9]+/) {
  print "The string mentions some model of HAL computer.\n";
}</pre></blockquote>

<p>Sometimes, it's easier to specify the characters left out,
rather than the ones within the character class. A caret
("<tt class="literal">^</tt>") at the start of the character
class negates it. That is, <tt class="literal">[^def]</tt> will match any
single character <em class="emphasis">except</em> one of those three. And
<tt class="literal">[^n\-z]</tt> matches any character except for
<tt class="literal">n</tt>, hyphen, or <tt class="literal">z</tt>. (Note that the
hyphen is backslashed, because it's special inside a character
class. But the first hyphen in <tt class="literal">/HAL-[0-9]+/</tt>
doesn't need a backslash, because hyphens aren't special
<em class="emphasis">outside</em> a character class.)
</p>

<a name="lperl3-CHP-8-SECT-1.1" /></a><div class="sect2">
<h3 class="sect2">8.1.1. Character Class Shortcuts</h3>

<p>Some character classes appear so frequently that they have
<a name="INDEX-529" /></a>shortcuts. For example, the character
class for any digit, <tt class="literal">[0-9]</tt>, may be abbreviated as
<tt class="literal">\d</tt><a name="INDEX-530" /></a> <a name="INDEX-531" /></a>. Thus, the pattern from the
example about HAL could be written <tt class="literal">/HAL-\d+/</tt>
instead.
</p>

<p>The shortcut <tt class="literal">\w</tt><a name="INDEX-532" /></a> <a name="INDEX-533" /></a> is a so-called "word"
character: <tt class="literal">[A-Za-z0-9_]</tt>. If your
"words" are made up of ordinary letters, digits, and
underscores, you'll be happy with this. Most of the rest of us
have words made up of ordinary letters, hyphens, and
apostrophes,<a href="#FOOTNOTE-175">[175]</a> and we'd like to change
this. As of this writing, the Perl developers are working on it, but
it's not available yet.<a href="#FOOTNOTE-176">[176]</a> So use
this one only when you want ordinary letters, digits, and
underscores.
</p><blockquote class="footnote"> <a name="FOOTNOTE-175" /></a><p>[175]At least, in usual English we do. In
other languages, you may have different components of words. And when
looking at ASCII-encoded English text, we have the problem that the
single quote and the apostrophe are the same character, so it's
not possible in isolation to tell whether <tt class="literal">cats'</tt> is
a word with an apostrophe or a word at the end of a quotation. This
is probably one reason that computers haven't been able to take
over the world yet.</p> </blockquote><blockquote class="footnote"> <a name="FOOTNOTE-176" /></a><p>[176]Except to a limited
(but nevertheless useful) extent in connection with locales; see the
<em class="emphasis">perllocale</em> manpage.</p> </blockquote>

<p>Of course, <tt class="literal">\w</tt> doesn't match a
"word"; it merely matches a single "word"
character. To match an entire word, though, the plus modifier is
handy. A pattern like <tt class="literal">/fred \w+ barney/</tt> will match
<tt class="literal">fred</tt> and a space, then a "word", then
a space and <tt class="literal">barney</tt>. That is, it'll match if
there's one word<a href="#FOOTNOTE-177">[177]</a> between <tt class="literal">fred</tt> and
<tt class="literal">barney</tt>, set off by single spaces.
</p><blockquote class="footnote"> <a name="FOOTNOTE-177" /></a><p>[177]We're going to stop
saying "word" in quotes so much; you know by now that
these letter-digit-underscore words are the ones we mean.</p>
</blockquote>

<p>As you may have noticed in that previous example, it might be handy
to be able to match spaces more flexibly. The
<tt class="literal">\s</tt><a name="INDEX-534" /></a>
<a name="INDEX-535" /></a>
<a name="INDEX-536" /></a>
<a name="INDEX-537" /></a> shortcut is good for
whitespace; it's the same as <tt class="literal">[\f\t\n\r ]</tt>.
That is, it's the same as a class containing the five
whitespace characters form-feed, tab, newline, carriage return, and
the space character itself. These are the characters that merely move
the printing position around; they don't use any ink. Still,
like the other shortcuts we've just seen, <tt class="literal">\s</tt>
matches just a single character from the class, so it's usual
to use either <tt class="literal">\s*</tt> for any amount of whitespace
(including none at all), or <tt class="literal">\s+</tt> for one or more
whitespace characters. (In fact, it's rare to see
<tt class="literal">\s</tt> without one of those quantifiers.) Since all of
those whitespace characters look about the same to us humans, we can
treat them all in the same way with this shortcut.
</p>

</div>
<a name="lperl3-CHP-8-SECT-1.2" /></a><div class="sect2">
<h3 class="sect2">8.1.2. Negating the Shortcuts</h3>

<p>Sometimes you may want the opposite of one of these three
<a name="INDEX-538" /></a>
<a name="INDEX-539" /></a>shortcuts. That is, you may want
<tt class="literal">[^\d]</tt>, <tt class="literal">[^\w]</tt>, or
<tt class="literal">[^\s]</tt>, meaning a nondigit character, a nonword
character, or a nonwhitespace character. That's easy enough to
accomplish by using their uppercase counterparts:
<tt class="literal">\D</tt><a name="INDEX-540" /></a>,
<tt class="literal">\W</tt><a name="INDEX-541" /></a>, or
<tt class="literal">\S</tt><a name="INDEX-542" /></a>. These match any character that their
counterpart would <em class="emphasis">not</em> match.
</p>

<p>Any of these shortcuts will work either in place of a character class
(standing on their own in a pattern), or inside the square brackets
of a larger character class. That means that you could now use
<tt class="literal">/[\dA-Fa-f]+/</tt> to match hexadecimal (base 16)
numbers, which use letters <tt class="literal">ABCDEF</tt> (or the same
letters in lowercase) as additional digits.
</p>

<p>Another compound character class is <tt class="literal">[\d\D]</tt>, which
means any digit, or any non-digit. That is to say, any character at
all! This is a common way to match any character, even a newline. (As
opposed to <tt class="literal">.</tt>, which matches any character
<em class="emphasis">except</em> a newline.) And then there's the
totally useless <tt class="literal">[^\d\D]</tt>, which matches anything
that's not either a digit or a non-digit.
Right -- nothing!<a name="INDEX-543" /></a> 
</p>

</div>
</div>












<hr width="684" align="left" />
<div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch07_04.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"><img alt="Home" border="0" src="../gifs/txthome.gif" /></a></td><td align="right" valign="top" width="228"><a href="ch08_02.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr><tr><td align="left" valign="top" width="228">7.4. Exercises</td><td align="center" valign="top" width="228"><a href="index/index.htm"><img alt="Book Index" border="0" src="../gifs/index.gif" /></a></td><td align="right" valign="top" width="228">8.2. General Quantifiers</td></tr></table></div>
<hr width="684" align="left" />

<img alt="Library Navigation Links" border="0" src="../gifs/navbar.gif" usemap="#library-map" />
<p><p><font size="-1"><a href="copyrght.htm">Copyright &copy; 2002</a> O'Reilly &amp; Associates. All rights reserved.</font></p>

<map name="library-map"><area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map>

</body></html>
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -