📄 ch05_09.htm

📁 编程珍珠,里面很多好用的代码,大家可以参考学习呵呵,
💻 HTM
📖 第 1 页 / 共 3 页
字号:
上一页 1 23
doesn't keep looking for a "better" match, even though the patternmight match in many different ways.</p><p>If it is unable to match the pattern at the first position in thestring, it admits temporary defeat and moves to the next position inthe string, between the first and second characters, and tries all thepossibilities again.  If it succeeds, it stops.  If it fails, itcontinues on down the string.  The pattern match as a whole doesn'tfail until it has tried to match the entire regular expression at everyposition in the string, including after the last character.</p><p>A string of <em class="emphasis">n</em> characters actually provides<em class="emphasis">n</em><tt class="literal">+ 1</tt> positions to match at.That's because the beginnings and the ends of matches are<em class="emphasis">between</em> the characters of the string.  This rulesometimes surprises people when they write a pattern like<tt class="literal">/x*/</tt> that can match zero or more"<tt class="literal">x</tt>" characters. If you try that pattern on a stringlike "<tt class="literal">fox</tt>", it won't find the"<tt class="literal">x</tt>".  Instead, it will immediately match the nullstring before the "<tt class="literal">f</tt>" and never look further.  Ifyou want it to match one or more <tt class="literal">x</tt> characters, youneed to use <tt class="literal">/x+/</tt> instead.  See the quantifiersunder Rule&nbsp;5.</p><p>A corollary to this rule is that any pattern matching the null stringis guaranteed to match at the leftmost position in the string (in theabsence of any zero-width assertions to the contrary).</p></dd><dt><b>Rule 2</b></dt><dd><p>When the Engine encounters a set of alternatives (separated by <tt class="literal">|</tt>symbols), either at the top level or at the current "cluster" level, ittries them left-to-right, stopping on the first successful match thatallows successful completion of the entire pattern.</p><p><a name="INDEX-1706"></a>A set of alternatives matches a string if any of the alternatives matchunder Rule&nbsp;3.  If none of the alternatives matches, it backtracks tothe Rule that invoked this Rule, which is usually Rule&nbsp;1, but could beRule&nbsp;4 or 6, if we're within a cluster.  That rule will then look for anew position at which to apply Rule&nbsp;2.</p><p>If there's only one alternative, then either it matches or it doesn't,and Rule&nbsp;2 still applies.  (There's no such thing as zero alternatives,because a null string always matches.)</p></dd><dt><b>Rule 3</b></dt><dd><p>Any particular alternative matches if every <em class="emphasis">item</em> listed in the alternativematches sequentially according to Rules&nbsp;4 and 5 (such that the entireregular expression can be satisfied).</p><p>An item consists of either an <em class="emphasis">assertion</em>, which is covered in Rule&nbsp;4,or a <em class="emphasis">quantified atom</em>, covered by Rule&nbsp;5.  Items that have choices onhow to match are given a "pecking order" from left to right.  If theitems cannot be matched in order, the Engine backtracks to the nextalternative under Rule&nbsp;2.</p><p><a name="INDEX-1707"></a>Items that must be matched sequentially aren't separated in the regularexpression by anything syntactic--they're merely juxtaposed in theorder they must match. When you ask to match <tt class="literal">/^foo/</tt>, you're actuallyasking for four items to be matched one after the other.  The first isa zero-width assertion, matched under Rule&nbsp;4, and the other three areordinary characters that must match themselves, one after the other,under Rule&nbsp;5.</p><p>The left-to-right pecking order means that in a pattern like:<blockquote><pre class="programlisting">/x*y*/</pre></blockquote><tt class="literal">x*</tt> gets to pick one way to match, and then <tt class="literal">y*</tt> tries all its ways. If that fails, then <tt class="literal">x*</tt> gets to pick its second choice, and make <tt class="literal">y*</tt>try all of its ways again.  And so on.  The items to the right "varyfaster", to borrow a phrase from multi-dimensional arrays.</p></dd><dt><b>Rule 4</b></dt><dd><p><a name="INDEX-1708"></a>If an assertion does not match at the current position, the Enginebacktracks to Rule&nbsp;3 and retries higher-pecking-order items withdifferent choices.</p><p><a name="INDEX-1709"></a><a name="INDEX-1710"></a>Some assertions are fancier than others.  Perl supports many regexextensions, some of which are zero-width assertions.  For example, thepositive lookahead <tt class="literal">(?=...)</tt> and the negative lookahead <tt class="literal">(?!...)</tt> don't actually match any characters, but merely assert that the regularexpression represented by <tt class="literal">...</tt> would (or would not) match at thispoint, were we to attempt it, hypothetically speaking.<a href="#FOOTNOTE-12">[12]</a><a name="INDEX-1711"></a><a name="INDEX-1712"></a><a name="INDEX-1713"></a></p><blockquote class="footnote"><a name="FOOTNOTE-12"></a><p>[12]In actualfact, the Engine <em class="emphasis">does</em> attempt it.  The Engine goes back to Rule&nbsp;2 totest the subpattern, and then wipes out any record of how much stringwas eaten, returning only the success or failure of the subpattern asthe value of the assertion.  (It does, however, remember any capturedsubstrings.)</p></blockquote></dd><dt><b>Rule 5</b></dt><dd><p>A quantified atom matches only if the atom itself matches some numberof times that is allowed by the quantifier.  (The atom itself is matchedaccording to Rule&nbsp;6.)</p><p>Different quantifiers require different numbers of matches, and mostof them allow a range of numbers of matches.  Multiple matches mustall match in a row; that is, they must be adjacent within the string.An unquantified atom is assumed to have a quantifier requiring exactlyone match (that is, <tt class="literal">/x/</tt> is the same as<tt class="literal">/x{1}/</tt>).  If no match can be found at the currentposition for any allowed quantity of the atom in question, the Enginebacktracks to Rule&nbsp;3 and retries higher-pecking-order items withdifferent choices.</p><p>The quantifiers are <tt class="literal">*</tt>, <tt class="literal">+</tt>, <tt class="literal">?</tt>, <tt class="literal">*?</tt>, <tt class="literal">+?</tt>, <tt class="literal">??</tt>, and the variousbrace forms.  If you use the <tt class="literal">{</tt><em class="replaceable">COUNT</em><tt class="literal">}</tt> form, then there is nochoice, and the atom must match exactly that number of times or not atall.  Otherwise, the atom can match over a range of quantities, andthe Engine keeps track of all the choices so that it can backtrack ifnecessary.  But then the question arises as to which of these choicesto try first.  One could start with the maximal number of matches andwork down, or the minimal number of matches and work up.</p><p><a name="INDEX-1714"></a><a name="INDEX-1715"></a><a name="INDEX-1716"></a>The traditional quantifiers (without a trailing question mark) specify<em class="emphasis">greedy</em> matching; that is, they attempt to match as many charactersas possible.  To find the greediest match, the Engine has to be alittle bit careful.  Bad guesses are potentially rather expensive, sothe Engine doesn't actually count down from the maximum value, whichafter all could be Very Large and cause millions of bad guesses.  Whatthe Engine actually does is a little bit smarter: it first counts <em class="emphasis">up</em>to find out how many matching atoms (in a row) are really there in thestring, and then it uses <em class="emphasis">that</em> actual maximum as its first choice.(It also remembers all the shorter choices in case the longest onedoesn't pan out.)  It then (at long last) tries to match the rest ofthe pattern, assuming the longest choice to be the best.  If thelongest choice fails to produce a match for the rest of the pattern, itbacktracks and tries the next longest.</p><p>If you say <tt class="literal">/.*foo/</tt>, for example, it will try to match the maximalnumber of "any" characters (represented by the dot) clear out to theend of the line before it ever tries looking for "<tt class="literal">foo</tt>"; and thenwhen the "<tt class="literal">foo</tt>" doesn't match there (and it can't, because there'snot enough room for it at the end of the string), the Engine will backoff one character at a time until it finds a "<tt class="literal">foo</tt>".  If there ismore than one "<tt class="literal">foo</tt>" in the line, it'll stop on the last one, sincethat will really be the <em class="emphasis">first</em> one it encounters as it backtracks. When the entire pattern succeeds using some particular length of <tt class="literal">.*</tt>,the Engine knows it can throw away all the other shorter choices for<tt class="literal">.*</tt> (the ones it would have used had the current "<tt class="literal">foo</tt>" not panned out).</p><p>By placing a question mark after any greedy quantifier, you turn it intoa frugal quantifier that chooses the smallest quantity for the first try.So if you say <tt class="literal">/.*?foo/</tt>, the <tt class="literal">.*?</tt> first tries to match 0characters, then 1 character, then 2, and so on until it can match the"<tt class="literal">foo</tt>".  Instead of backtracking backward, it backtracks forward, soto speak, and ends up finding the first "<tt class="literal">foo</tt>" on the line instead ofthe last.</p></dd><dt><b>Rule 6</b></dt><dd><p><a name="INDEX-1717"></a>Each atom matches according to the designated semantics of its type. If the atom doesn't match (or does match, but doesn't allow a match ofthe rest of the pattern), the Engine backtracks to Rule&nbsp;5 and tries thenext choice for the atom's quantity.</p><p>Atoms match according to the following types:</p><ul><li><p><a name="INDEX-1718"></a><a name="INDEX-1719"></a>A regular expression in parentheses, <tt class="literal">(...)</tt>, matches whatever theregular expression (represented by <tt class="literal">...</tt>) matches according toRule&nbsp;2.  Parentheses therefore serve as a clustering operator forquantification.  Bare parentheses also have the side effect of capturingthe matched substring for later use in a <em class="emphasis">backreference</em>.  This sideeffect can be suppressed by using <tt class="literal">(?:...)</tt> instead, which has onlythe clustering semantics--it doesn't store anything in <tt class="literal">$1</tt>, <tt class="literal">$2</tt>,and so on.  Other forms of parenthetical atoms (and assertions) arepossible--see the rest of this chapter.</p></li><li><p>A dot matches any character, except maybe newline.</p></li><li><p><a name="INDEX-1720"></a><a name="INDEX-1721"></a><a name="INDEX-1722"></a><a name="INDEX-1723"></a><a name="INDEX-1724"></a><a name="INDEX-1725"></a>A list of characters in square brackets (a <em class="emphasis">character class</em>)matches any one of the characters specified by the list.</p></li><li><p><a name="INDEX-1726"></a><a name="INDEX-1727"></a><a name="INDEX-1728"></a>A backslashed letter matches either a particular character or a characterfrom a set of characters, as listed in<a href="ch05_03.htm#perl3-tab-regex-meta-alpha">Table 5-7</a>.</p></li><li><p>Any other backslashed character matches that character.</p></li><li><p><a name="INDEX-1729"></a>Any character not mentioned above matches itself.</p></li></ul></dd></dl><p>That all sounds rather complicated, but the upshot of it is that, foreach set of choices given by a quantifier or alternation, the Enginehas a knob it can twiddle.  It will twiddle those knobs until theentire pattern matches.  The Rules just say in which order theEngine is allowed to twiddle those knobs.  Saying the Engine prefersthe leftmost match merely means it twiddles the start position knob theslowest.  And backtracking is just the process of untwiddling the knobyou just twiddled in order to try twiddling a knob higher in the peckingorder, that is, one that varies slower.</p><p>Here's a more concrete example, a program that detects whentwo consecutive words share a common ending and beginning:<blockquote><pre class="programlisting">$a = 'nobody';$b = 'bodysnatcher';if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {    print "$2 overlaps in $1-$2-$3\n";}</pre></blockquote>This prints:<blockquote><pre class="programlisting">body overlaps in no-body-snatcher</pre></blockquote>You might think that <tt class="literal">$1</tt> would first grab up all of "<tt class="literal">nobody</tt>" due togreediness.  And in fact, it does--at first.  But once it's done so,there aren't any further characters to put in <tt class="literal">$2</tt>, which needs charactersput into it because of the <tt class="literal">+</tt> quantifier.  So the Engine backs up and <tt class="literal">$1</tt>begrudgingly gives up one character to <tt class="literal">$2</tt>.  This time the spacecharacter matches successfully, but then it sees <tt class="literal">\2</tt>, whichrepresents a measly "<tt class="literal">y</tt>".  The next character in the string is not a"<tt class="literal">y</tt>", but a "<tt class="literal">b</tt>".  This makes the Engine back up all the way andtry several more times, eventually forcing <tt class="literal">$1</tt> to surrender the body to<tt class="literal">$2</tt>.  Habeas corpus, as it were.</p><p>Actually, that won't quite work out if the overlap is itself theproduct of a doubling, as in the two words "<tt class="literal">rococo</tt>" and"<tt class="literal">cocoon</tt>".  The algorithm above would have decided that theoverlapping string, <tt class="literal">$2</tt>, must be just "<tt class="literal">co</tt>" rather than "<tt class="literal">coco</tt>".But we don't want a "<tt class="literal">rocococoon</tt>"; we want a "<tt class="literal">rococoon</tt>".  Here'sone of those places you can outsmart the Engine.  Adding a minimalmatching quantifier to the <tt class="literal">$1</tt> part gives the much better pattern<tt class="literal">/^(\w+?)(\w+) \2(\w+)$/</tt>, which does exactly what we want.</p><p>For a much more detailed discussion of the pros and cons of variouskinds of regular expression engines, see Jeffrey Friedl's book,<em class="emphasis">Mastering Regular Expressions</em>.  Perl's regular expression Engineworks very well for many of the everyday problems you want to solvewith Perl, and it even works okay for those not-so-everyday problems,if you give it a little respect and understanding.</p><a name="INDEX-1730"></a><a name="INDEX-1731"></a><a name="INDEX-1732"></a><!-- BOTTOM NAV BAR --><hr width="515" align="left"><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_08.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="index.htm"><img src="../gifs/txthome.gif" alt="Home" border="0"></a></td><td align="right" valign="top" width="172"><a href="ch05_10.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr><tr><td align="left" valign="top" width="172">5.8. Alternation</td><td align="center" valign="top" width="171"><a href="index/index.htm"><img src="../gifs/index.gif" alt="Book Index" border="0"></a></td><td align="right" valign="top" width="172">5.10. Fancy Patterns</td></tr></table></div><hr width="515" align="left"><!-- LIBRARY NAV BAR --><img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p><font size="-1"><a href="copyrght.htm">Copyright &copy; 2001</a> O'Reilly &amp; Associates. All rights reserved.</font></p><map name="library-map"> <area shape="rect" coords="2,-1,79,99" href="../index.htm"><area shape="rect" coords="84,1,157,108" href="../perlnut/index.htm"><area shape="rect" coords="162,2,248,125" href="../prog/index.htm"><area shape="rect" coords="253,2,326,130" href="../advprog/index.htm"><area shape="rect" coords="332,1,407,112" href="../cookbook/index.htm"><area shape="rect" coords="414,2,523,103" href="../sysadmin/index.htm"></map><!-- END OF BODY --></body></html>
上一页 1 23
💿 文件大小 1969 K
👤 上传用户 ccuading
📂 所属分类电子书籍
🏷️ 相关标签

#编程 #代码 #家
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -