<a name="ch05-sect-pp"></a><h3 class="sect2">5.10.3. Programmatic Patterns</h3>
<p><a name="INDEX-1751"></a><a name="INDEX-1752"></a><a name="INDEX-1753"></a><a name="INDEX-1754"></a>Most Perl programs tend to follow an imperative (also called procedural) programming style, like a series of discrete commands laid out in a readily observable order: "Preheat oven, mix, glaze, heat, cool, serve to aliens." Sometimes into this mix you toss a few dollops of functional programming ("Use a little more glaze than you think you need, even after taking this into account, recursively"), or sprinkle it with bits of object-oriented techniques ("but please hold the anchovy objects"). Often it's a combination of all of these.</p>
<p><a name="INDEX-1755"></a>But the regular expression Engine takes a completely different approach to problem solving, more of a declarative approach. You describe goals in the language of regular expressions, and the Engine implements whatever logic is needed to solve your goals. Logic programming languages (such as Prolog) don't always get as much exposure as the other three styles, but they're more common than you'd think. Perl couldn't even be built without <em class="emphasis">make</em>(1) or <em class="emphasis">yacc</em>(1), both of which could be considered, if not purely declarative languages, at least hybrids that blend imperative and logic programming together.</p>
<p>You can do this sort of thing in Perl, too, by blending goal declarations and imperative code together more miscibly than we've done so far, drawing upon the strengths of both. You can programmatically build up the string you'll eventually present to the regex Engine, in a sense creating a program that writes a new program on the fly.</p>
<p><a name="INDEX-1756"></a>You can also supply ordinary Perl expressions as the replacement part of <tt class="literal">s///</tt> via the <tt class="literal">/e</tt> modifier. 
This allows you to dynamically generate the replacement string by executing a bit of code every time the pattern matches.</p>
<p><a name="INDEX-1757"></a>Even more elaborately, you can interject bits of code wherever you'd like in the middle of a pattern using the <tt class="literal">(?{</tt><em class="replaceable">CODE</em><tt class="literal">})</tt> extension, and that code will be executed every time the Engine encounters that code as it advances and recedes in its intricate backtracking dance.</p>
<p>Finally, you can use <tt class="literal">s///ee</tt> or <tt class="literal">(??{</tt><em class="replaceable">CODE</em><tt class="literal">})</tt> to add another level of indirection: the <em class="emphasis">results</em> of executing those code snippets will themselves be re-evaluated for further use, creating bits of program and pattern on the fly, just in time.</p>
<h3 class="sect3">5.10.3.1. Generated patterns</h3>
<p><a name="INDEX-1758"></a><a name="INDEX-1759"></a>It has been said<a href="#FOOTNOTE-14">[14]</a> that programs that write programs are the happiest programs in the world. In Jeffrey Friedl's book, <em class="citetitle">Mastering Regular Expressions</em>, the final tour de force demonstrates how to write a program that produces a regular expression to determine whether a string conforms to the RFC 822 standard; that is, whether it contains a standards-compliant, valid mail header. The pattern produced is several thousand characters long, and about as easy to read as a crash dump in pure binary. But Perl's pattern matcher doesn't care about that; it just compiles up the pattern without a hitch and, even more interestingly, executes the match very quickly--much more quickly, in fact, than many short patterns with complex backtracking requirements.</p>
<blockquote class="footnote"><a name="FOOTNOTE-14"></a><p>[14] By Andrew Hume, the famous Unix philosopher.</p></blockquote>
<p>That's a very complicated example. 
Earlier we showed you a very simple example of the same technique when we built up a <tt class="literal">$number</tt> pattern out of its components (see <a href="ch05_09.htm#ch05-sect-vi">Section 5.9.2, "Variable Interpolation"</a>). But to show you the power of this programmatic approach to producing a pattern, let's work out a problem of medium complexity.</p>
<p>Suppose you wanted to pull out all the words with a certain vowel-consonant sequence; for example, "audio" and "eerie" both follow a VVCVV pattern. Although describing what counts as a consonant or a vowel is easy, you wouldn't ever want to type that in more than once. Even for our simple VVCVV case, you'd need to type in a pattern that looked something like this:<blockquote><pre class="programlisting">^[aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy]$</pre></blockquote>A more general-purpose program would accept a string like "<tt class="literal">VVCVV</tt>" and programmatically generate that pattern for you. For even more flexibility, it could accept a word like "<tt class="literal">audio</tt>" as input and use that as a template to infer "<tt class="literal">VVCVV</tt>", and from that, the long pattern above. It sounds complicated, but really isn't, because we'll let the program generate the pattern for us. 
Here's a simple <em class="emphasis">cvmap</em> program that does all of that:<blockquote><pre class="programlisting">#!/usr/bin/perl
$vowels = 'aeiouy';
$cons   = 'cbdfghjklmnpqrstvwxzy';
%map = (C => $cons, V => $vowels);  # init map for C and V

for $class ($vowels, $cons) {       # now for each type
    for (split //, $class) {        # get each letter of that type
        $map{$_} .= $class;         # and map the letter back to the type
    }
}

for $char (split //, shift) {       # for each letter in template word
    $pat .= "[$map{$char}]";        # add appropriate character class
}

$re = qr/^${pat}$/i;                # compile the pattern
print "REGEX is $re\n";             # debugging output
@ARGV = ('/usr/dict/words')         # pick a default dictionary
    if -t && !@ARGV;

while (<>) {                        # and now blaze through the input
    print if /$re/;                 # printing any line that matches
}</pre></blockquote>The <tt class="literal">%map</tt> variable holds all the interesting bits. Its keys are each letter of the alphabet, and the corresponding value is all the letters of its type. We throw in C and V, too, so you can specify either "<tt class="literal">VVCVV</tt>" or "<tt class="literal">audio</tt>", and still get out "<tt class="literal">eerie</tt>". Each character in the argument supplied to the program is used to pull out the right character class to add to the pattern. Once the pattern is created and compiled up with <tt class="literal">qr//</tt>, the match (even a very long one) will run quickly. 
Here's what you might get if you run this program on "fortuitously":<blockquote><pre class="programlisting">% <tt class="userinput"><b>cvmap fortuitously /usr/dict/wordses</b></tt>
REGEX is (?i-xsm:^[cbdfghjklmnpqrstvwxzy][aeiouy][cbdfghjklmnpqrstvwxzy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][aeiouy][aeiouy][cbdfghjklmnpqrstvwxzy][cbdfghjklmnpqrstvwxzy][aeiouycbdfghjklmnpqrstvwxzy]$)
carriageable
circuitously
fortuitously
languorously
marriageable
milquetoasts
sesquiquarta
sesquiquinta
villainously</pre></blockquote>Looking at that <tt class="literal">REGEX</tt>, you can see just how much villainous typing you saved by programming languorously, albeit circuitously.</p>
<h3 class="sect3">5.10.3.2. Substitution evaluations</h3>
<p><a name="INDEX-1760"></a><a name="INDEX-1761"></a><a name="INDEX-1762"></a><a name="INDEX-1763"></a>When the <tt class="literal">/e</tt> modifier ("e" is for expression evaluation) is used on an <tt class="literal">s/</tt><em class="replaceable">PATTERN</em><tt class="literal">/</tt><em class="replaceable">CODE</em><tt class="literal">/e</tt> expression, the replacement portion is interpreted as a Perl expression, not just as a double-quoted string. It's like an embedded <tt class="literal">do {</tt><em class="replaceable">CODE</em><tt class="literal">}</tt>. Even though it looks like a string, it's really just a code block that gets compiled up at the same time as the rest of your program, long before the substitution actually happens.</p>
<p>You can use the <tt class="literal">/e</tt> modifier to build replacement strings with fancier logic than double-quote interpolation allows. This shows the difference:<blockquote><pre class="programlisting">s/(\d+)/$1 * 2/;     # Replaces "42" with "42 * 2"
s/(\d+)/$1 * 2/e;    # Replaces "42" with "84"</pre></blockquote>And this converts Celsius temperatures into Fahrenheit:<blockquote><pre class="programlisting">$_ = "Preheat oven to 233C.\n";
s/\b(\d+\.?\d*)C\b/int($1 * 1.8 + 32) . "F"/e;   # convert to 451F</pre></blockquote>Applications of this technique are limitless. Here's a filter that modifies its files in place (like an editor) by adding 100 to every number that starts a line (and that is followed by a colon, which we only peek at, but don't actually match, or replace):<blockquote><pre class="programlisting">% <tt class="userinput"><b>perl -pi -e 's/^(\d+)(?=:)/100 + $1/e' filename</b></tt></pre></blockquote>Now and then, you want to do more than just use the string you matched in another computation. Sometimes you want that string to <em class="emphasis">be</em> a computation, whose own evaluation you'll use for the replacement value. Each additional <tt class="literal">/e</tt> modifier after the first wraps an <tt class="literal">eval</tt> around the code to execute. The following two lines do the same thing, but the first one is easier to read:<blockquote><pre class="programlisting">s/<em class="replaceable">PATTERN</em>/<em class="replaceable">CODE</em>/ee
s/<em class="replaceable">PATTERN</em>/eval(<em class="replaceable">CODE</em>)/e</pre></blockquote>You could use this technique to replace mentions of simple scalar variables with their values:<blockquote><pre class="programlisting">s/(\$\w+)/$1/eeg;       # Interpolate most scalars' values</pre></blockquote>Because it's really an <tt class="literal">eval</tt>, the <tt class="literal">/ee</tt> even finds lexical variables. 
A slightly more elaborate example calculates a replacement for simple arithmetical expressions on (nonnegative) integers:<blockquote><pre class="programlisting">$_ = "I have 4 + 19 dollars and 8/2 cents.\n";
s{ (
        \d+ \s*         # find an integer
        [+*/-]          # and an arithmetical operator
        \s* \d+         # and another integer
   )
}{ $1 }eegx;            # then expand $1 and run that code
print;                  # "I have 23 dollars and 4 cents."</pre></blockquote>Like any other <tt class="literal">eval</tt> <em class="replaceable">STRING</em>, compile-time errors (like syntax problems) and run-time exceptions (like dividing by zero) are trapped. If so, the <tt class="literal">$@</tt> (<tt class="literal">$EVAL_ERROR</tt>) variable says what went wrong.</p>
<a name="ch05-sect-mt"></a><h3 class="sect3">5.10.3.3. Match-time code evaluation</h3>
<p><a name="INDEX-1764"></a><a name="INDEX-1765"></a>In most programs that use regular expressions, the surrounding program's run-time control structure drives the logical execution flow. You write <tt class="literal">if</tt> or <tt class="literal">while</tt> loops, or make function or method calls, that wind up calling a pattern-matching operation now and then. Even with <tt class="literal">s///e</tt>, it's the substitution operator that is in control, executing the replacement code only after a successful match.</p>
<p>With <em class="emphasis">code subpatterns</em>, the normal relationship between regular expression and program code is inverted. As the Engine is applying its Rules to your pattern at match time, it may come across a regex extension of the form <tt class="literal">(?{</tt><em class="replaceable">CODE</em><tt class="literal">})</tt>. When triggered, this subpattern doesn't do any matching or any looking about. It's a zero-width assertion that always "succeeds", evaluated only for its side effects. 
Whenever the Engine needs to progress over the code subpattern as it executes the pattern, it runs that code.<blockquote><pre class="programlisting">"glyph" =~ /.+ (?{ print "hi" }) ./x;  # Prints "hi" twice.</pre></blockquote>As the Engine tries to match <tt class="literal">glyph</tt> against this pattern, it first lets the <tt class="literal">.+</tt> eat up all five letters. Then it prints "<tt class="literal">hi</tt>". When it finds that final dot, all five letters have been eaten, so it needs to backtrack back to the <tt class="literal">.+</tt> and make it give up one of the letters. Then it moves forward through the pattern again, stopping to print "<tt class="literal">hi</tt>" again before assigning <tt class="literal">h</tt> to the final dot and completing the match successfully.</p>
<p>The braces around the <em class="replaceable">CODE</em> fragment are intended to remind you that it is a block of Perl code, and it certainly behaves like a block in the lexical sense. That is, if you use <tt class="literal">my</tt> to declare a lexically scoped variable in it, it is private to the block. But if you use <tt class="literal">local</tt> to localize a dynamically scoped variable, it may not do what you expect. A <tt class="literal">(?{</tt><em class="replaceable">CODE</em><tt class="literal">})</tt> subpattern creates an implicit dynamic scope that is valid throughout the rest of the pattern, until it either succeeds or backtracks through the code subpattern. One way to think of it is that the block doesn't actually return when it gets to the end. Instead, it makes an invisible recursive call to the Engine to try to match the rest of the pattern. 
Only when that recursive call is finished does it return from the block, delocalizing the localized variables.<a href="#FOOTNOTE-15">[15]</a></p>
<blockquote class="footnote"><a name="FOOTNOTE-15"></a><p>[15] People who are familiar with recursive descent parsers may find this behavior confusing because such compilers return from a recursive function call whenever they figure something out. The Engine doesn't do that--when it figures something out, it goes <em class="emphasis">deeper</em> into recursion (even when exiting a parenthetical group!). A recursive descent parser is at a minimum of recursion when it succeeds at the end, but the Engine is at a local <em class="emphasis">maximum</em> of recursion when it succeeds at the end of the pattern. You might find it helpful to dangle the pattern from its left end and think of it as a skinny representation of a call graph tree. If you can get that picture into your head, the dynamic scoping of local variables will make more sense. (And if you can't, you're no worse off than before.)</p></blockquote>
<p>In the next example, we initialize <tt class="literal">$i</tt> to <tt class="literal">0</tt> by including a code subpattern at the beginning of the pattern. Then we match any number of characters with <tt class="literal">.*</tt>--but we place another code subpattern in between the <tt class="literal">.</tt> and the <tt class="literal">*</tt> so we can count how many times <tt class="literal">.</tt> matches.<blockquote><pre class="programlisting">$_ = 'lothlorien';
m/  (?{ $i = 0 })                    # Set $i to 0
    (. (?{ $i++ })    )*             # Update $i, even after backtracking
    lori                             # Forces a backtrack
 /x;</pre></blockquote>The Engine merrily goes along, setting <tt class="literal">$i</tt> to <tt class="literal">0</tt> and letting the <tt class="literal">.*</tt> gobble up all 10 characters in the string. When it encounters the literal <tt class="literal">lori</tt> in the pattern, it backtracks, and the <tt class="literal">.*</tt> gives up characters one at a time until <tt class="literal">lori</tt> can match. 
After the match, <tt class="literal">$i</tt> will still be <tt class="literal">10</tt>.</p>
<p>If you wanted <tt class="literal">$i</tt> to reflect how many characters the <tt class="literal">.*</tt> actually ended up with, you could make use of the dynamic scope within the pattern:<blockquote><pre class="programlisting">$_ = 'lothlorien';
m/  (?{ $i = 0 })
    (. (?{ local $i = $i + 1; }) )*  # Update $i, backtracking-safe.
    lori
    (?{ $result = $i })              # Copy to non-localized location.
 /x;</pre></blockquote>Here, we use <tt class="literal">local</tt> to ensure that <tt class="literal">$i</tt> contains the number of characters matched by <tt class="literal">.*</tt>, regardless of backtracking. <tt class="literal">$i</tt> will be forgotten after the regular expression ends, so the code