<html><head><title>Staying in Control (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly & Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Staying in Control"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_08.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="ch05_01.htm">Chapter 5: Pattern Matching</a></td><td align="right" valign="top" width="172"><a href="ch05_10.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h2 class="sect1">5.9. Staying in Control</h2><p>As any good manager knows, you shouldn't micromanage your employees. Just tell them what you want, and let them figure out the best way of doing it. 
Similarly, it's often best to think of a regular expression as a kind of specification: "Here's what I want; go find a string that fits the bill."</p><p><a name="INDEX-1674"></a>On the other hand, the best managers also understand the job their employees are trying to do. The same is true of pattern matching in Perl. The more thoroughly you understand how Perl goes about the task of matching any particular pattern, the more wisely you'll be able to make use of Perl's pattern matching capabilities.</p><p>One of the most important things to understand about Perl's pattern matching is when <em class="emphasis">not</em> to use it.</p><h3 class="sect2">5.9.1. Letting Perl Do the Work</h3><p><a name="INDEX-1675"></a>When people of a certain temperament first learn regular expressions, they're often tempted to see everything as a problem in pattern matching. And while that may even be true in the larger sense, pattern matching is about more than just evaluating regular expressions. It's partly about looking for your car keys where you dropped them, not just under the streetlamp where you can see better. In real life, we all know that it's a lot more efficient to look in the right places than the wrong ones.</p><p>Similarly, you should use Perl's control flow to decide which patterns to execute, and which ones to skip. A regular expression is pretty smart, but it's smart like a horse. It can get distracted if it sees too much at once. So sometimes you have to put blinders onto it. For example, you'll recall our earlier example of alternation:<blockquote><pre class="programlisting">/Gandalf|Saruman|Radagast/</pre></blockquote><a name="INDEX-1676"></a><a name="INDEX-1677"></a><a name="INDEX-1678"></a>That works as advertised, but not as well as it might, because it searches every position in the string for every name before it moves on to the next position. 
Astute readers of <em class="emphasis">The Lord of the Rings</em> will recall that, of the three wizards named above, Gandalf is mentioned much more frequently than Saruman, and Saruman is mentioned much more frequently than Radagast. So it's generally more efficient to use Perl's logical operators to do the alternation:<blockquote><pre class="programlisting">/Gandalf/ || /Saruman/ || /Radagast/</pre></blockquote>This is yet another way of defeating the "leftmost" policy of the Engine. It only searches for <tt class="literal">Saruman</tt> if <tt class="literal">Gandalf</tt> was nowhere to be seen. And it only searches for <tt class="literal">Radagast</tt> if <tt class="literal">Saruman</tt> is also absent.</p><p>Not only does this change the order in which things are searched, but it sometimes allows the regular expression optimizer to work better. It's generally easier to optimize searching for a single string than for several strings simultaneously. Similarly, anchored searches can often be optimized if they're not too complicated.</p><p><a name="INDEX-1679"></a><a name="INDEX-1680"></a>You don't have to limit your control of the control flow to the <tt class="literal">||</tt> operator. Often you can control things at the statement level. You should always think about weeding out the common cases first. Suppose you're writing a loop to process a configuration file. Many configuration files are mostly comments. 
It's often best to discard comments and blank lines early before doing any heavy-duty processing, even if the heavy-duty processing would throw out the comments and blank lines in the course of things:<blockquote><pre class="programlisting">while (&lt;CONF&gt;) {
    next if /^#/;
    next if /^\s*(#|$)/;
    chomp;
    munchabunch($_);
}</pre></blockquote>Even if you're not trying to be efficient, you often need to alternate ordinary Perl expressions with regular expressions simply because you want to take some action that is not possible (or very difficult) from within the regular expression, such as printing things out. Here's a useful number classifier:<blockquote><pre class="programlisting">warn "has nondigits"        if     /\D/;
warn "not a natural number" unless /^\d+$/;             # rejects -3
warn "not an integer"       unless /^-?\d+$/;           # rejects +3
warn "not an integer"       unless /^[+-]?\d+$/;
warn "not a decimal number" unless /^-?\d+\.?\d*$/;     # rejects .2
warn "not a decimal number" unless /^-?(?:\d+(?:\.\d*)?|\.\d+)$/;
warn "not a C float"
    unless /^([+-]?)(?=\d|\.\d)\d*(\.\d*)?([Ee]([+-]?\d+))?$/;</pre></blockquote>We could stretch this section out a lot longer, but really, that sort of thing is what this whole book is about. You'll see many more examples of the interplay of Perl code and pattern matching as we go along. In particular, see the later section <a href="ch05_10.htm#ch05-sect-pp">Section 5.10.3, "Programmatic Patterns"</a>. (It's okay to read the intervening material first, of course.)</p><a name="ch05-sect-vi"></a><h3 class="sect2">5.9.2. Variable Interpolation</h3><p><a name="INDEX-1681"></a><a name="INDEX-1682"></a>Using Perl's control flow mechanisms to control regular expression matching has its limits. The main difficulty is that it's an "all or nothing" approach; either you run the pattern, or you don't. Sometimes you know the general outlines of the pattern you want, but you'd like to have the capability of parameterizing it. 
Variable interpolation provides that capability, much like parameterizing a subroutine lets you have more influence over its behavior than just deciding whether to call it or not. (More about subroutines in the next chapter.)</p><p>One nice use of interpolation is to provide a little abstraction, along with a little readability. With regular expressions you may certainly write things concisely:<blockquote><pre class="programlisting">if ($num =~ /^[-+]?\d+\.?\d*$/) { ... }</pre></blockquote>But what you mean is more apparent when you write:<blockquote><pre class="programlisting">$sign = '[-+]?';
$digits = '\d+';
$decimal = '\.?';
$more_digits = '\d*';
$number = "$sign$digits$decimal$more_digits";
...
if ($num =~ /^$number$/o) { ... }</pre></blockquote><a name="INDEX-1683"></a>We'll cover this use of interpolation more under "Generated patterns" later in this chapter. We'll just point out that we used the <tt class="literal">/o</tt> modifier to suppress recompilation because we don't expect <tt class="literal">$number</tt> to change its value over the course of the program.</p><p><a name="INDEX-1684"></a>Another cute trick is to turn your tests inside out and use the variable string to pattern-match against a set of known strings:<blockquote><pre class="programlisting">chomp($answer = &lt;STDIN&gt;);
if    ("SEND"  =~ /^\Q$answer/i) { print "Action is send\n"  }
elsif ("STOP"  =~ /^\Q$answer/i) { print "Action is stop\n"  }
elsif ("ABORT" =~ /^\Q$answer/i) { print "Action is abort\n" }
elsif ("LIST"  =~ /^\Q$answer/i) { print "Action is list\n"  }
elsif ("EDIT"  =~ /^\Q$answer/i) { print "Action is edit\n"  }</pre></blockquote>This lets your user perform the "send" action by typing any of <tt class="literal">S</tt>, <tt class="literal">SE</tt>, <tt class="literal">SEN</tt>, or <tt class="literal">SEND</tt> (in any mixture of upper- and lowercase). 
To "stop", they'd have to type at least <tt class="literal">ST</tt> (or <tt class="literal">St</tt>, or <tt class="literal">sT</tt>, or <tt class="literal">st</tt>).</p><h3 class="sect3">5.9.2.1. When backslashes happen</h3><p><a name="INDEX-1685"></a>When you think of double-quote interpolation, you usually think of both variable and backslash interpolation. But as we mentioned earlier, for regular expressions there are two passes, and the interpolation pass defers most of the backslash interpretation to the regular expression parser (which we discuss later). Ordinarily, you don't notice the difference, because Perl takes pains to hide the difference. (One sequence that's obviously different is the <tt class="literal">\b</tt> metasymbol, which turns into a word boundary assertion--outside of character classes, anyway. Inside a character class where assertions make no sense, it reverts to being a backspace, as it is normally.)</p><p>It's actually fairly important that the regex parser handle the backslashes. Suppose you're searching for tab characters in a pattern with a <tt class="literal">/x</tt> modifier:<blockquote><pre class="programlisting">($col1, $col2) = /(.*?) \t+ (.*?)/x;</pre></blockquote>If Perl didn't defer the interpretation of <tt class="literal">\t</tt> to the regex parser, the <tt class="literal">\t</tt> would have turned into whitespace, which the regex parser would have ignorantly ignored because of the <tt class="literal">/x</tt>. But Perl is not so ignoble, or tricky.</p><p>You can trick yourself though. Suppose you abstracted out the column separator, like this:<blockquote><pre class="programlisting">$colsep = "\t+";                         # (double quotes)
($col1, $col2) = /(.*?) $colsep (.*?)/x;</pre></blockquote>Now you've just blown it, because the <tt class="literal">\t</tt> turns into a real tab before it gets to the regex parser, which will think you said <tt class="literal">/(.*?)+(.*?)/</tt> after it discards the whitespace. Oops. 
To fix, avoid <tt class="literal">/x</tt>, or use single quotes. Or better, use <tt class="literal">qr//</tt>. (See the next section.)</p><p><a name="INDEX-1686"></a><a name="INDEX-1687"></a><a name="INDEX-1688"></a>The only double-quote escapes that are processed as such are the six translation escapes: <tt class="literal">\U</tt>, <tt class="literal">\u</tt>, <tt class="literal">\L</tt>, <tt class="literal">\l</tt>, <tt class="literal">\Q</tt>, and <tt class="literal">\E</tt>. If you ever look into the inner workings of the Perl regular expression compiler, you'll find code for handling escapes like <tt class="literal">\t</tt> for tab, <tt class="literal">\n</tt> for newline, and so on. But you won't find code for those six translation escapes. (We only listed them in <a href="ch05_03.htm#perl3-tab-regex-meta-alpha">Table 5-7</a> because people expect to find them there.) If you somehow manage to sneak any of them into the pattern without going through double-quotish evaluation, they won't be recognized.</p><p><a name="INDEX-1689"></a><a name="INDEX-1690"></a>How could they find their way in? Well, you can defeat interpolation by using single quotes as your pattern delimiter. In <tt class="literal">m'...'</tt>, <tt class="literal">qr'...'</tt>, and <tt class="literal">s'...'...'</tt>, the single quotes suppress variable interpolation and the processing of translation escapes, just as they would in a single-quoted string. Saying <tt class="literal">m'\ufrodo'</tt> won't find a capitalized version of poor frodo. However, since the "normal" backslash characters aren't really processed on that level anyway, <tt class="literal">m'\t\d'</tt> still matches a real tab followed by any digit.</p><p>Another way to defeat interpolation is through interpolation itself. If you say:<blockquote><pre class="programlisting">$var = '\U';
/${var}frodo/;</pre></blockquote>poor frodo remains uncapitalized. 
Perl won't redo the interpolation pass for you just because you interpolated something that looks like it might want to be reinterpolated. You can't expect that to work any more than you'd expect this double interpolation to work:<blockquote><pre class="programlisting">$hobbit = 'Frodo';
$var = '$hobbit';           # (single quotes)
/$var/;                     # means m'$hobbit', not m'Frodo'.</pre></blockquote><a name="INDEX-1691"></a></p><p>Here's another example that shows how most backslashes are interpreted by the regex parser, not by variable interpolation. Imagine you have a simple little <em class="emphasis">grep</em>-style program written in Perl:<a href="#FOOTNOTE-10">[10]</a><blockquote><pre class="programlisting">#!/usr/bin/perl
$pattern = shift;
while (&lt;&gt;) {
    print if /$pattern/o;
}</pre></blockquote>If you name that program <em class="emphasis">pgrep</em> and call it this way:<blockquote><pre class="programlisting">% <tt class="userinput"><b>pgrep '\t\d' *.c</b></tt></pre></blockquote>then you'll find that it prints out all lines of all your C source files in which a digit follows a tab. You didn't have to do anything