<html><head><title>Regular Expressions (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly & Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Regular Expressions"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch01_06.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="ch01_01.htm">Chapter 1: An Overview of Perl</a></td><td align="right" valign="top" width="172"><a href="ch01_08.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h2 class="sect1">1.7. Regular Expressions</h2><p><a name="INDEX-325"></a><a name="INDEX-326"></a><a name="INDEX-327"></a><a name="INDEX-328"></a><a name="INDEX-329"></a><a name="INDEX-330"></a><a name="INDEX-331"></a><a name="INDEX-332"></a><a name="INDEX-333"></a><a name="INDEX-334"></a><a name="INDEX-335"></a><em class="emphasis">Regular expressions</em> (a.k.a. 
regexes, regexps, or REs) are used by many search programs such as <em class="emphasis">grep</em> and <em class="emphasis">findstr</em>, text-munging programs like <em class="emphasis">sed</em> and <em class="emphasis">awk</em>, and editors like <em class="emphasis">vi</em> and <em class="emphasis">emacs</em>. A regular expression is a way of describing a set of strings without having to list all the strings in your set.<a href="#FOOTNOTE-22">[22]</a></p>
<blockquote class="footnote"><a name="FOOTNOTE-22"></a><p>[22] A good source of information on regular expression concepts is Jeffrey Friedl's book, <em class="emphasis">Mastering Regular Expressions</em> (O'Reilly &amp; Associates).</p></blockquote>
<p><a name="INDEX-336"></a>Many other computer languages incorporate regular expressions (some of them even advertise "Perl5 regular expressions"!), but none of these languages integrates regular expressions into the language the way Perl does. Regular expressions are used in several ways in Perl. First and foremost, they're used in conditionals to determine whether a string matches a particular pattern, because in a Boolean context they return true or false. So when you see something that looks like <tt class="literal">/foo/</tt> in a conditional, you know you're looking at an ordinary <em class="emphasis">pattern-matching</em> operator:<blockquote><pre class="programlisting">if (/Windows 95/) { print "Time to upgrade?\n" }</pre></blockquote><a name="INDEX-337"></a><a name="INDEX-338"></a></p>
<p>Second, if you can locate patterns within a string, you can replace them with something else. So when you see something that looks like <tt class="literal">s/foo/bar/</tt>, you know it's asking Perl to substitute "bar" for "foo", if possible. We call that the <em class="emphasis">substitution</em> operator. 
It also happens to return true or false depending on whether it succeeded, but usually it's evaluated for its side effect:<blockquote><pre class="programlisting">s/Windows/Linux/;</pre></blockquote><a name="INDEX-339"></a></p>
<p><a name="INDEX-340"></a>Finally, patterns can specify not only where something is, but also where it <em class="emphasis">isn't</em>. So the <tt class="literal">split</tt> operator uses a regular expression to specify where the data isn't. That is, the regular expression defines the <em class="emphasis">separators</em> that delimit the fields of data. Our Average Example has a couple of trivial examples of this. Lines 5 and 12 each split strings on the space character in order to return a list of words. But you can split on any separator you can specify with a regular expression:<blockquote><pre class="programlisting">($good, $bad, $ugly) = split(/,/, "vi,emacs,teco");</pre></blockquote>(There are various modifiers you can use in each of these situations to do exotic things like ignore case when matching alphabetic characters, but these are the sorts of gory details that we'll cover later when we get to the gory details.)</p>
<p><a name="INDEX-341"></a><a name="INDEX-342"></a>The simplest use of regular expressions is to match a literal expression. In the case of the <tt class="literal">split</tt> above, we matched on a single comma character. But if you match on several characters in a row, they all have to match sequentially. That is, the pattern looks for a substring, much as you'd expect. Let's say we want to show all the lines of an HTML file that contain HTTP links (as opposed to FTP links). Let's imagine we're working with HTML for the first time, and we're being a little na&#239;ve. We know that these links will always have "<tt class="literal">http:</tt>" in them somewhere. 
We could loop through our file with this:<blockquote><pre class="programlisting">while ($line = &lt;FILE&gt;) {
    if ($line =~ /http:/) {
        print $line;
    }
}</pre></blockquote><a name="INDEX-343"></a><a name="INDEX-344"></a><a name="INDEX-345"></a><a name="INDEX-346"></a><a name="INDEX-347"></a>Here, the <tt class="literal">=~</tt> (pattern-binding operator) is telling Perl to look for a match of the regular expression "<tt class="literal">http:</tt>" in the variable <tt class="literal">$line</tt>. If it finds the expression, the operator returns a true value and the block (a <tt class="literal">print</tt> statement) is executed.<a href="#FOOTNOTE-23">[23]</a></p>
<blockquote class="footnote"><a name="FOOTNOTE-23"></a><p>[23] This is very similar to what the Unix command <tt class="literal">grep 'http:' file</tt> would do. On MS-DOS you could use the <em class="emphasis">find</em> command, but it doesn't know how to do more complicated regular expressions. (However, the misnamed <em class="emphasis">findstr</em> program of Windows NT does know about regular expressions.)</p></blockquote>
<p><a name="INDEX-348"></a><a name="INDEX-349"></a>By the way, if you don't use the <tt class="literal">=~</tt> binding operator, Perl will search a default string instead of <tt class="literal">$line</tt>. It's like when you say, "Eek! Help me find my contact lens!" People automatically know to look around near you without your actually having to tell them that. Likewise, Perl knows that there is a default place to search for things when you don't say where to search for them. 
This default string is actually a special scalar variable that goes by the odd name of <tt class="literal">$_</tt>. In fact, it's not the default just for pattern matching; many operators in Perl default to using the <tt class="literal">$_</tt> variable, so a veteran Perl programmer would likely write the last example as:<blockquote><pre class="programlisting">while (&lt;FILE&gt;) {
    print if /http:/;
}</pre></blockquote>(Hmm, another one of those statement modifiers seems to have snuck in there. Insidious little beasties.)</p>
<p>This stuff is pretty handy, but what if we wanted to find all of the link types, not just the HTTP links? We could give a list of link types, like "<tt class="literal">http:</tt>", "<tt class="literal">ftp:</tt>", "<tt class="literal">mailto:</tt>", and so on. But that list could get long, and what would we do when a new kind of link was added?<blockquote><pre class="programlisting">while (&lt;FILE&gt;) {
    print if /http:/;
    print if /ftp:/;
    print if /mailto:/;
    # What next?
}</pre></blockquote><a name="INDEX-350"></a><a name="INDEX-351"></a><a name="INDEX-352"></a><a name="INDEX-353"></a></p>
<p>Since regular expressions are descriptive of a set of strings, we can just describe what we are looking for: a number of alphabetic characters followed by a colon. In regular expression talk (Regexese?), that would be <tt class="literal">/[a-zA-Z]+:/</tt>, where the brackets define a <em class="emphasis">character class</em>. The <tt class="literal">a-z</tt> and <tt class="literal">A-Z</tt> represent all alphabetic characters (the dash means the range of all characters between the starting and ending character, inclusive). And the <tt class="literal">+</tt> is a special character that says "one or more of whatever was before me". It's what we call a <em class="emphasis">quantifier</em>, meaning a gizmo that says how many times something is allowed to repeat. (The slashes aren't really part of the regular expression, but rather part of the pattern-match operator. 
The slashes are acting like quotes that just happen to contain a regular expression.)</p>
<p><a name="INDEX-354"></a><a name="INDEX-355"></a><a name="INDEX-356"></a><a name="INDEX-357"></a><a name="INDEX-358"></a><a name="INDEX-359"></a><a name="INDEX-360"></a>Because certain classes like the alphabetics are so commonly used, Perl defines shortcuts for them:</p>
<a name="perl3-tab-over-re-meta"></a><table border="1"><tr><th>Name</th><th>ASCII Definition</th><th>Code</th></tr><tr><td>Whitespace</td><td><tt class="literal">[ \t\n\r\f]</tt></td><td><tt class="literal">\s</tt></td></tr><tr><td>Word character</td><td><tt class="literal">[a-zA-Z_0-9]</tt></td><td><tt class="literal">\w</tt></td></tr><tr><td>Digit</td><td><tt class="literal">[0-9]</tt></td><td><tt class="literal">\d</tt></td></tr></table>
<p>Note that these match <em class="emphasis">single</em> characters. A <tt class="literal">\w</tt> will match any single word character, not an entire word. (Remember that <tt class="literal">+</tt> quantifier? You can say <tt class="literal">\w+</tt> to match a word.) Perl also provides the negation of these classes by using the uppercased character, such as <tt class="literal">\D</tt> for a nondigit character.</p>
<p><a name="INDEX-361"></a><a name="INDEX-362"></a>We should note that <tt class="literal">\w</tt> is not always equivalent to <tt class="literal">[a-zA-Z_0-9]</tt> (and <tt class="literal">\d</tt> is not always <tt class="literal">[0-9]</tt>). Some locales define additional alphabetic characters outside the ASCII sequence, and <tt class="literal">\w</tt> respects them. Newer versions of Perl also know about Unicode letter and digit properties and treat Unicode characters with those properties accordingly. 
(Perl also considers ideographs to be <tt class="literal">\w</tt> characters.)</p>
<p><a name="INDEX-363"></a>There is one other very special character class, written with a "<tt class="literal">.</tt>", that will match any character whatsoever.<a href="#FOOTNOTE-24">[24]</a> For example, <tt class="literal">/a./</tt> will match any string containing an "<tt class="literal">a</tt>" that is not the last character in the string. Thus it will match "<tt class="literal">at</tt>" or "<tt class="literal">am</tt>" or even "<tt class="literal">a!</tt>", but not "<tt class="literal">a</tt>", since there's nothing after the "<tt class="literal">a</tt>" for the dot to match. Since it's searching for the pattern anywhere in the string, it'll match "<tt class="literal">oasis</tt>" and "<tt class="literal">camel</tt>", but not "<tt class="literal">sheba</tt>". It matches "<tt class="literal">caravan</tt>" on the first "<tt class="literal">a</tt>". It could match on the second "<tt class="literal">a</tt>", but it stops after it finds the first suitable match, searching from left to right.</p>
<blockquote class="footnote"><a name="FOOTNOTE-24"></a><p>[24] Except that it won't normally match a newline. When you think about it, a "<tt class="literal">.</tt>" doesn't normally match a newline in <em class="emphasis">grep</em>(1) either.</p></blockquote>
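<p>The leftmost-match behavior of <tt class="literal">/a./</tt> described above can be checked with a short script. What follows is only an illustrative sketch: the helper name <tt class="literal">first_match</tt> is our own invention, and we assume a Perl recent enough (5.6 or later) to provide the <tt class="literal">@-</tt> array of match offsets. The pattern is wrapped in parentheses so that the matched text lands in <tt class="literal">$1</tt>:</p>
<blockquote><pre class="programlisting">use strict;
use warnings;

# Hypothetical helper (not part of Perl): return the text and starting
# offset of the first place /a./ matches, or an empty list if it fails.
sub first_match {
    my ($string) = @_;
    if ($string =~ /(a.)/) {
        # $1 holds what the parentheses matched; $-[0] is the offset
        # at which the whole match began.
        return ($1, $-[0]);
    }
    return;
}

# The sample words come from the paragraph above.
for my $word (qw(at am a! a oasis camel sheba caravan)) {
    my ($text, $offset) = first_match($word);
    if (defined $text) {
        print "$word: matched '$text' at offset $offset\n";
    }
    else {
        print "$word: no match\n";
    }
}</pre></blockquote>
<p>Run as written, this reports no match for "<tt class="literal">a</tt>" and "<tt class="literal">sheba</tt>", and for "<tt class="literal">caravan</tt>" it reports "<tt class="literal">ar</tt>" at offset 1, which is the first "<tt class="literal">a</tt>" rather than either of the later ones.</p>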