📄 ch05_10.htm
字号:
<html><head><title>Fancy Patterns (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly & Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Fancy Patterns"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_09.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="ch05_01.htm">Chapter 5: Pattern Matching</a></td><td align="right" valign="top" width="172"><a href="ch06_01.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h2 class="sect1">5.10. Fancy Patterns</h2><a name="ch05-sect-la"></a><h3 class="sect2">5.10.1. Lookaround Assertions</h3><a name="INDEX-1733"></a><a name="INDEX-1734"></a><a name="INDEX-1735"></a><a name="INDEX-1736"></a><a name="INDEX-1737"></a><a name="INDEX-1738"></a><a name="INDEX-1739"></a><p><a name="INDEX-1740"></a><a name="INDEX-1741"></a><a name="INDEX-1742"></a>Sometimes you just need to sneak a peek. There are four regexextensions that help you do just that, and we call them <em class="emphasis">lookaround</em>assertions because they let you scout around in a hypothetical sort ofway, without committing to matching any characters. What theseassertions assert is that some pattern would (or would not) match if wewere to try it. The Engine works it all out for us by actually tryingto match the hypothetical pattern, and then pretending that it didn'tmatch (if it did).</p><p>When the Engine peeks ahead from its current position in the string, wecall it a <em class="emphasis">lookahead</em> assertion. If it peeks backward, we call it a<em class="emphasis">lookbehind</em> assertion. The lookahead patterns can be any regular expression,but the lookbehind patterns may only be fixed width, since they have to knowwhere to start the hypothetical match from.</p><p>While these four extensions are all zero-width assertions, and hence donot consume characters (at least, not officially), you can in factcapture substrings within them if you supply extra levels of capturingparentheses.</p><dl><dt><b><tt class="literal">(?=</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt><em class="emphasis">(positive lookahead)</em></b></dt><dd><p> When the Engine encounters<tt class="literal">(?=</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt>,it looks ahead in the string to ensure that<em class="replaceable">PATTERN</em> occurs. If you'll recall, in ourearlier duplicate word remover, we had to write a loop because thepattern ate too much each time through:<blockquote><pre class="programlisting">$_ = "Paris in THE THE THE THE spring.";# remove duplicate words (and triplicate (and quadruplicate...))1 while s/\b(\w+) \1\b/$1/gi;</pre></blockquote>Whenever you hear the phrase "ate too much", you should always think"lookahead assertion". (Well, almost always.) By peeking aheadinstead of gobbling up the second word, you can write a one-passduplicate word remover like this:<blockquote><pre class="programlisting">s/ \b(\w+) \s (?= \1\b ) //gxi;</pre></blockquote><a name="INDEX-1743"></a><a name="INDEX-1744"></a>Of course, this isn't quite right, since it will mess up valid phrases like"The clothes you DON DON't fit."</p></dd><dt><b><tt class="literal">(?!</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt><em class="emphasis">(negative lookahead)</em></b></dt><dd><p> When the Engine encounters<tt class="literal">(?!</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt>,it looks ahead in the string to ensure that<em class="replaceable">PATTERN</em> does <em class="emphasis">not</em>occur. To fix our previous example, we can add a negative lookaheadassertion after the positive assertion to weed out the case ofcontractions:<blockquote><pre class="programlisting">s/ \b(\w+) \s (?= \1\b (?! '\w))//xgi;</pre></blockquote><a name="INDEX-1745"></a><a name="INDEX-1746"></a>That final <tt class="literal">\w</tt> is necessary to avoid confusingcontractions with words at the ends of single-quoted strings. We cantake this one step further, since earlier in this chapter weintentionally used "that that particular", and we'd like our programto not "fix" that for us. So we can add an alternative to thenegative lookahead in order to pre-unfix that "that", (therebydemonstrating that any pair of parentheses can be used to clusteralternatives):<blockquote><pre class="programlisting">s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular))//gix;</pre></blockquote>Now we know that that particular phrase is safe. Unfortunately, theGettysburg Address is still broken. So we add another exception:<blockquote><pre class="programlisting">s/ \b(\w+) \s (?= \1\b (?! '\w | \s particular | \s nation))//igx;</pre></blockquote>This is just starting to get out of hand. So let's do an Official Listof Exceptions, using a cute interpolation trick with the <tt class="literal">$"</tt>variable to separate the alternatives with the <tt class="literal">|</tt> character:<blockquote><pre class="programlisting">@thatthat = qw(particular nation);local $" = '|';s/ \b(\w+) \s (?= \1\b (?! '\w | \s (?: @thatthat )))//xig;</pre></blockquote></p></dd><dt><b><tt class="literal">(?<=</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt> <em class="emphasis">(positive lookbehind)</em></b></dt><dd><p>When the Engine encounters <tt class="literal">(?<=</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt>, it looksbackward in the string to ensure that <em class="replaceable">PATTERN</em> already occurred.</p><p>Our example still has a problem. Although it now lets Honest Abe say thingslike "that that nation", it also allows "Paris, in the the nation ofFrance". We can add a positive lookbehind assertion in front of ourexception list to make sure that we apply our <tt class="literal">@thatthat</tt> exceptionsonly to a real "that that".<blockquote><pre class="programlisting">s/ \b(\w+) \s (?= \1\b (?! '\w | (?<= that) \s (?: @thatthat )))//ixg;</pre></blockquote>Yes, it's getting terribly complicated, but that's why this section iscalled "Fancy Patterns", after all. If you need to complicate thepattern any more than we've done so far, judicious use of comments and<tt class="literal">qr//</tt> will help keep you sane. Or at least saner.</p></dd><dt><b><tt class="literal">(?<!</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt> <em class="emphasis">(negative lookbehind)</em></b></dt><dd><p>When the Engine encounters <tt class="literal">(?<!</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt>, it looksbackward in the string to ensure that <em class="replaceable">PATTERN</em> did not occur.</p><p>Let's go for a really simple example this time. How about the easyversion of that old spelling rule, "I before E except after C"? In Perl,you spell it:<blockquote><pre class="programlisting">s/(?<!c)ei/ie/g</pre></blockquote>You'll have to weigh for yourself whether you want to handle any ofthe exceptions. (For example, "weird" is spelled weird, especially whenyou spell it "wierd".)</p></dd></dl><a name="INDEX-1747"></a><h3 class="sect2">5.10.2. Nonbacktracking Subpatterns</h3><a name="INDEX-1748"></a><p><a name="INDEX-1749"></a><a name="INDEX-1750"></a>As described in "The Little Engine That /Could(n't)?/", the Engineoften backtracks as it proceeds through the pattern. You can blockthe Engine from backtracking back through a particular set of choicesby creating a <em class="emphasis">nonbacktracking subpattern</em>. Anonbacktracking subpattern looks like<tt class="literal">(?></tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt>,and it works exactly like a simple<tt class="literal">(?:</tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt>,except that once <em class="replaceable">PATTERN</em> has found a match,it suppresses backtracking on any of the quantifiers or alternativesinside the subpattern. (Hence, it is meaningless to use this on a<em class="replaceable">PATTERN</em> that doesn't contain quantifiers oralternatives.) The only way to get it to change its mind is tobacktrack to something before the subpattern and reenter thesubpattern from the left.</p><p>It's like going into a car dealership. After a certainamount of haggling over the price, you deliver an ultimatum: "Here's mybest offer; take it or leave it." If they don't take it, you don't goback to haggling again. Instead, you backtrack clear out the door.Maybe you go to another dealership, and start haggling again. You'reallowed to haggle again, but only because you reentered thenonbacktracking pattern again in a different context.</p><p>For devotees of Prolog or SNOBOL, you can think of this as a scopedcut or fence operator.</p><p>Consider how in <tt class="literal">"aaab" =~ /(?:a*)ab/</tt>, the<tt class="literal">a*</tt> first matches three <tt class="literal">a</tt>'s, butthen gives up one of them because the last <tt class="literal">a</tt> isneeded later. The subgroup sacrifices some of what it wants in orderfor the whole match to succeed. (Which is like letting the carsalesman talk you into giving him more of your money because you'reafraid to walk away from the deal.) In contrast, the subpattern in<tt class="literal">"aaab" =~ /(?>a*)ab/</tt> will never give upwhat it grabs, even though this behavior causes the whole match tofail. (As the song says, you have to know when to hold 'em, when tofold 'em, and when to walk away.)</p><p>Although<tt class="literal">(?></tt><em class="replaceable">PATTERN</em><tt class="literal">)</tt>is useful for changing the behavior of a pattern, it's mostly used forspeeding up the failure of certain matches that you know will failanyway (unless they succeed outright). The Engine can take aspectacularly long time to fail, particular with nested quantifiers.The following pattern will succeed almost instantly:<blockquote><pre class="programlisting">$_ = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab";/a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*[b]/;</pre></blockquote>But success is not the problem. Failure is. If you remove that final"<tt class="literal">b</tt>" from the string, the pattern will probably run for many, manyyears before failing. Many, many millennia. Actually, billions andbillions of years.<a href="#FOOTNOTE-13">[13]</a> You can see by inspection that the pattern can'tsucceed if there's no "<tt class="literal">b</tt>" on the end of the string, but the regexoptimizer is not smart enough (as of this writing) to figure out that<tt class="literal">/[b]/</tt> is equivalent to <tt class="literal">/b/</tt>. But if you give it a hint, you can getit to fail quickly while still letting it succeed where it can:<blockquote><pre class="programlisting">/(?>a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*)[b]/;</pre></blockquote>For a (hopefully) more realistic example, imagine a program that'ssupposed to read in a paragraph at a time and show just the lines thatare continued, where contination lines are specified with trailingbackslashes. Here's a sample from Perl's<em class="emphasis">Makefile</em> that uses this line-continuationconvention:<blockquote><pre class="programlisting"># Files to be built with variable substitution before miniperl# is available.sh = Makefile.SH cflags.SH config_h.SH makeaperl.SH makedepend.SH \ makedir.SH myconfig.SH writemain.SH</pre></blockquote>You could write your simple program this way:<blockquote><pre class="programlisting">#!/usr/bin/perl -00pwhile ( /( (.+) ( (?<=\\) \n .* )+ ) /gx) { print "GOT $.: $1\n\n";}</pre></blockquote>That works, but it's really quite slow. That's because the Enginebacktracks a character at a time from the end of the line, shrinkingwhat's in <tt class="literal">$1</tt>. This is pointless. And writing it withoutthe extraneous captures doesn't help much. Using:<blockquote><pre class="programlisting">(.+(?:(?<=\\)\n.*)+)</pre></blockquote>for a pattern is somewhat faster, but not much. This is where anonbacktracking subpattern helps a lot. The pattern:<blockquote><pre class="programlisting">((?>.+)(?:(?<=\\)\n.*)+)</pre></blockquote>does the same thing, but more than an order of magnitude fasterbecause it doesn't waste time backtracking in search of somethingthat isn't there.</p><blockquote class="footnote"><a name="FOOTNOTE-13"></a><p>[13] Actually, it's more on the order ofseptillions and septillions. We don't know exactly how long it wouldtake. We didn't care to wait around watching it not fail. In anyevent, your computer is likely to crash before the heat death of theuniverse, and this regular expression takes longer than either ofthose.</p></blockquote><p>You'll never get a success with <tt class="literal">(?>...)</tt>that you wouldn't get with <tt class="literal">(?:...)</tt> or even asimple <tt class="literal">(...)</tt>. But if you're going to fail,it's best to fail quickly and get on with your life.</p>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -