<html><head><title>Positions (Programming Perl)</title>
<!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css">
<!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Positions"><meta name="DC.Type" content="Text.Monograph"></head>
<body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map>
<!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_05.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="ch05_01.htm">Chapter 5: Pattern Matching</a></td><td align="right" valign="top" width="172"><a href="ch05_07.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left">
<!-- SECTION BODY --><h2 class="sect1">5.6. Positions</h2><a name="INDEX-1602"></a>
<p><a name="INDEX-1603"></a><a name="INDEX-1604"></a><a name="INDEX-1605"></a><a name="INDEX-1606"></a>Some regex constructs represent <em class="emphasis">positions</em> in the string to be matched, each of which is a location just to the left or right of a real character. 
These metasymbols are examples of <em class="emphasis">zero-width</em> assertions because they do not correspond to actual characters in the string. We often just call them "assertions". (They're also known as "anchors" because they tie some part of the pattern to a particular position.)</p>
<p><a name="INDEX-1607"></a><a name="INDEX-1608"></a>You can always manipulate positions in a string without using patterns. The built-in <tt class="literal">substr</tt> function lets you extract and assign to substrings, measured from the beginning of the string, the end of the string, or from a particular numeric offset. This might be all you need if you were working with fixed-length records, for instance. Patterns are only necessary when a numeric offset isn't sufficient. But most of the time, offsets aren't sufficient--at least, not sufficiently convenient, compared to patterns.</p>
<h3 class="sect2">5.6.1. Beginnings: The \A and ^ Assertions</h3><a name="INDEX-1609"></a><a name="INDEX-1610"></a>
<p><a name="INDEX-1611"></a><a name="INDEX-1612"></a><a name="INDEX-1613"></a><a name="INDEX-1614"></a><a name="INDEX-1615"></a>The <tt class="literal">\A</tt> assertion matches only at the beginning of the string, no matter what. However, the <tt class="literal">^</tt> assertion is the traditional beginning-of-line assertion as well as a beginning-of-string assertion. 
Therefore, if the pattern uses the <tt class="literal">/m</tt> modifier<a href="#FOOTNOTE-8">[8]</a> and the string has embedded newlines, <tt class="literal">^</tt> also matches anywhere inside the string immediately following a newline character:
<blockquote><pre class="programlisting">/\Abar/      # Matches "bar" and "barstool"
/^bar/       # Matches "bar" and "barstool"
/^bar/m      # Matches "bar" and "barstool" and "sand\nbar"</pre></blockquote>
Used in conjunction with <tt class="literal">/g</tt>, the <tt class="literal">/m</tt> modifier lets <tt class="literal">^</tt> match many times in the same string:
<blockquote><pre class="programlisting">s/^\s+//gm;             # Trim leading whitespace on each line
$total++ while /^./mg;  # Count nonblank lines</pre></blockquote></p>
<blockquote class="footnote"><a name="FOOTNOTE-8"></a><p>[8] Or you've set the deprecated <tt class="literal">$*</tt> variable to <tt class="literal">1</tt> and you're not overriding <tt class="literal">$*</tt> with the <tt class="literal">/s</tt> modifier.</p></blockquote>
<h3 class="sect2">5.6.2. Endings: The \z, \Z, and $ Assertions</h3>
<p><a name="INDEX-1616"></a><a name="INDEX-1617"></a><a name="INDEX-1618"></a><a name="INDEX-1619"></a><a name="INDEX-1620"></a><a name="INDEX-1621"></a></p>
<p>The <tt class="literal">\z</tt> metasymbol matches at the end of the string, no matter what's inside. <tt class="literal">\Z</tt> matches right before the newline at the end of the string if there is a newline, or at the end if there isn't. The <tt class="literal">$</tt> metacharacter usually means the same as <tt class="literal">\Z</tt>. 
However, if the <tt class="literal">/m</tt> modifier was specified and the string has embedded newlines, then <tt class="literal">$</tt> can also match anywhere inside the string right in front of a newline:
<blockquote><pre class="programlisting">/bot\z/      # Matches "robot"
/bot\Z/      # Matches "robot" and "abbot\n"
/bot$/       # Matches "robot" and "abbot\n"
/bot$/m      # Matches "robot" and "abbot\n" and "robot\nrules"

/^robot$/    # Matches "robot" and "robot\n"
/^robot$/m   # Matches "robot" and "robot\n" and "this\nrobot\n"

/\Arobot\Z/  # Matches "robot" and "robot\n"
/\Arobot\z/  # Matches only "robot" -- but why didn't you use eq?</pre></blockquote>
As with <tt class="literal">^</tt>, the <tt class="literal">/m</tt> modifier lets <tt class="literal">$</tt> match many times in the same string when used with <tt class="literal">/g</tt>. (These examples assume that you've read a multiline record into <tt class="literal">$_</tt>, perhaps by setting <tt class="literal">$/</tt> to <tt class="literal">""</tt> before reading.)
<blockquote><pre class="programlisting">s/\s*$//gm;   # Trim trailing whitespace on each line in paragraph

while (/^([^:]+):\s*(.*)/gm ) {    # get mail header
    $headers{$1} = $2;
}</pre></blockquote>
In "Variable Interpolation" later in this chapter, we'll discuss how you can interpolate variables into patterns: if <tt class="literal">$foo</tt> is "<tt class="literal">bc</tt>", then <tt class="literal">/a$foo/</tt> is equivalent to <tt class="literal">/abc/</tt>. Here, the <tt class="literal">$</tt> does not match the end of the string. For a <tt class="literal">$</tt> to match the end of the string, it must be at the end of the pattern or be immediately followed by a vertical bar or closing parenthesis.</p>
<h3 class="sect2">5.6.3. 
Boundaries: The \b and \B Assertions</h3><a name="INDEX-1622"></a><a name="INDEX-1623"></a>
<p><a name="INDEX-1624"></a><a name="INDEX-1625"></a><a name="INDEX-1626"></a>The <tt class="literal">\b</tt> assertion matches at any word boundary, defined as the position between a <tt class="literal">\w</tt> character and a <tt class="literal">\W</tt> character, in either order. If the order is <tt class="literal">\W\w</tt>, it's a beginning-of-word boundary, and if the order is <tt class="literal">\w\W</tt>, it's an end-of-word boundary. (The ends of the string count as <tt class="literal">\W</tt> characters here.) The <tt class="literal">\B</tt> assertion matches any position that is <em class="emphasis">not</em> a word boundary, that is, the middle of either <tt class="literal">\w\w</tt> or <tt class="literal">\W\W</tt>.
<blockquote><pre class="programlisting">/\bis\b/   # matches "what it is" and "that is it"
/\Bis\B/   # matches "thistle" and "artist"
/\bis\B/   # matches "istanbul" and "so--isn't that butter?"
/\Bis\b/   # matches "confutatis" and "metropolis near you"</pre></blockquote>
Because <tt class="literal">\W</tt> includes all punctuation characters (except the underscore), there are <tt class="literal">\b</tt> boundaries in the middle of strings like "isn't", "booktech@oreilly.com", "M.I.T.", and "key/value".<a name="INDEX-1627"></a></p>
<p><a name="INDEX-1628"></a>Inside a character class (<tt class="literal">[\b]</tt>), a <tt class="literal">\b</tt> represents a backspace rather than a word boundary.</p>
<h3 class="sect2">5.6.4. 
Progressive Matching</h3>
<p><a name="INDEX-1629"></a>When used with the <tt class="literal">/g</tt> modifier, the <tt class="literal">pos</tt> function allows you to read or set the offset where the next progressive match will start:
<blockquote><pre class="programlisting">$burglar = "Bilbo Baggins";
while ($burglar =~ /b/gi) {
    printf "Found a B at %d\n", pos($burglar)-1;
}</pre></blockquote>
(We subtract one from the position because that was the length of the string we were looking for, and <tt class="literal">pos</tt> is always the position just past the match.)</p>
<p>The code above prints:
<blockquote><pre class="programlisting">Found a B at 0
Found a B at 3
Found a B at 6</pre></blockquote>
<a name="INDEX-1630"></a><a name="INDEX-1631"></a>After a failure, the match position normally resets back to the start. If you also apply the <tt class="literal">/c</tt> (for "continue") modifier, then when the <tt class="literal">/g</tt> runs out, the failed match doesn't reset the position pointer. This lets you continue your search past that point without starting over at the very beginning.
<blockquote><pre class="programlisting">$burglar = "Bilbo Baggins";
while ($burglar =~ /b/gci) {        # ADD /c
    printf "Found a B at %d\n", pos($burglar)-1;
}
while ($burglar =~ /i/gi) {
    printf "Found an I at %d\n", pos($burglar)-1;
}</pre></blockquote>
Besides the three <tt class="literal">B</tt>'s it found earlier, Perl now reports finding an <tt class="literal">i</tt> at position 10. Without the <tt class="literal">/c</tt>, the second loop's match would have restarted from the beginning and found another <tt class="literal">i</tt> at position 1 first.</p>
<h3 class="sect2">5.6.5. 
Where You Left Off: The \G Assertion</h3><a name="INDEX-1632"></a><a name="INDEX-1633"></a><a name="INDEX-1634"></a><a name="INDEX-1635"></a><a name="INDEX-1636"></a>
<p>Whenever you start thinking in terms of the <tt class="literal">pos</tt> function, it's tempting to start carving your string up with <tt class="literal">substr</tt>, but this is rarely the right thing to do. More often, if you started with pattern matching, you should continue with pattern matching. However, if you're looking for a positional assertion, you're probably looking for <tt class="literal">\G</tt>.</p>
<p>The <tt class="literal">\G</tt> assertion represents within the pattern the same point that <tt class="literal">pos</tt> represents outside of it. When you're progressively matching a string with the <tt class="literal">/g</tt> modifier (or you've used the <tt class="literal">pos</tt> function to directly select the starting point), you can use <tt class="literal">\G</tt> to specify the position just after the previous match. That is, it matches the location immediately before whatever character would be identified by <tt class="literal">pos</tt>. This allows you to remember where you left off:
<blockquote><pre class="programlisting">($recipe = &lt;&lt;'DISH') =~ s/^\s+//gm;
    Preheat oven to 451 deg. fahrenheit.
    Mix 1 ml. dilithium with 3 oz. NaCl and
    stir in 4 anchovies.  Glaze with 1 g.
    mercury.  Heat for 4 hours and let cool
    for 3 seconds.  Serves 10 aliens.
DISH

$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;           # $1 is now "deg"
$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;           # $1 is now "ml"
$recipe =~ /\d+ /g;
$recipe =~ /\G(\w+)/;           # $1 is now "oz"</pre></blockquote>
<a name="INDEX-1637"></a>The <tt class="literal">\G</tt> metasymbol is often used in a loop, as we demonstrate in our next example. We "pause" after every digit sequence, and at that position, we test whether there's an abbreviation. If so, we grab the next two words. 
Otherwise, we just grab the next word:
<blockquote><pre class="programlisting">pos($recipe) = 0;                 # Just to be safe, reset \G to 0
while ( $recipe =~ /(\d+) /g ) {
    my $amount = $1;
    if ($recipe =~ / \G (\w{0,3}) \. \s+ (\w+) /x) {  # abbrev. + word
        print "$amount $1 of $2\n";
    } else {
        $recipe =~ / \G (\w+) /x;                     # just a word
        print "$amount $1\n";
    }
}</pre></blockquote>
That produces:
<blockquote><pre class="programlisting">451 deg of fahrenheit
1 ml of dilithium
3 oz of NaCl
4 anchovies
1 g of mercury
4 hours
3 seconds
10 aliens</pre></blockquote></p>
<a name="INDEX-1638"></a><a name="INDEX-1639"></a><a name="INDEX-1640"></a>
<!-- BOTTOM NAV BAR --><hr width="515" align="left"><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_05.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="index.htm"><img src="../gifs/txthome.gif" alt="Home" border="0"></a></td><td align="right" valign="top" width="172"><a href="ch05_07.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr><tr><td align="left" valign="top" width="172">5.5. Quantifiers</td><td align="center" valign="top" width="171"><a href="index/index.htm"><img src="../gifs/index.gif" alt="Book Index" border="0"></a></td><td align="right" valign="top" width="172">5.7. Capturing and Clustering</td></tr></table></div><hr width="515" align="left">
<!-- LIBRARY NAV BAR --><img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p><font size="-1"><a href="copyrght.htm">Copyright © 2001</a> O'Reilly &amp; Associates. 
All rights reserved.</font></p><map name="library-map"> <area shape="rect" coords="2,-1,79,99" href="../index.htm"><area shape="rect" coords="84,1,157,108" href="../perlnut/index.htm"><area shape="rect" coords="162,2,248,125" href="../prog/index.htm"><area shape="rect" coords="253,2,326,130" href="../advprog/index.htm"><area shape="rect" coords="332,1,407,112" href="../cookbook/index.htm"><area shape="rect" coords="414,2,523,103" href="../sysadmin/index.htm"></map><!-- END OF BODY --></body></html>