<html><head><title>Capturing and Clustering (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Capturing and Clustering"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_06.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="ch05_01.htm">Chapter 5: Pattern Matching</a></td><td align="right" valign="top" width="172"><a href="ch05_08.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h2 class="sect1">5.7. Capturing and Clustering</h2><p><a name="INDEX-1641"></a>Patterns allow you to group portions of your pattern together into subpatterns and to remember the strings matched by those subpatterns.  We call the first behavior <em class="emphasis">clustering</em> and the second one <em class="emphasis">capturing</em>.</p><h3 class="sect2">5.7.1. 
Capturing</h3><a name="INDEX-1642"></a><a name="INDEX-1643"></a><a name="INDEX-1644"></a><p><a name="INDEX-1645"></a>To capture a substring for later use, put parentheses around the subpattern that matches it.  The first pair of parentheses stores its substring in <tt class="literal">$1</tt>, the second pair in <tt class="literal">$2</tt>, and so on.  You may use as many parentheses as you like; Perl just keeps defining more numbered variables for you to represent these captured strings.</p><p>Some examples:<blockquote><pre class="programlisting">/(\d)(\d)/  # Match two digits, capturing them into $1 and $2
/(\d+)/     # Match one or more digits, capturing them all into $1
/(\d)+/     # Match a digit one or more times, capturing the last into $1</pre></blockquote>Note the difference between the second and third patterns.  The second form is usually what you want.  The third form does <em class="emphasis">not</em> create multiple variables for multiple digits.  Parentheses are numbered when the pattern is compiled, not when it is matched.</p><p><a name="INDEX-1646"></a>Captured strings are often called <em class="emphasis">backreferences</em> because they refer back to parts of the captured text.  There are actually two ways to get at these backreferences.  The numbered variables you've seen are how you get at backreferences outside of a pattern, but inside the pattern, that doesn't work.  You have to use <tt class="literal">\1</tt>, <tt class="literal">\2</tt>, etc.<a href="#FOOTNOTE-9">[9]</a> So to find doubled words like "<tt class="literal">the the</tt>" or "<tt class="literal">had had</tt>", you might use this pattern:<blockquote><pre class="programlisting">/\b(\w+) \1\b/i</pre></blockquote>But most often, you'll be using the <tt class="literal">$1</tt> form, because you'll usually apply a pattern and then do something with the substrings.
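For instance, here is a minimal sketch of that apply-then-use idiom.  The sample string and the variable names <tt class="literal">$line</tt>, <tt class="literal">$hour</tt>, <tt class="literal">$min</tt>, and <tt class="literal">$sec</tt> are invented for illustration; the point is only that it is safest to check that the match succeeded before reading the numbered variables:<blockquote><pre class="programlisting">my $line = "Date: Mon, 17 Jul 2000 09:00:00 -1000";  # hypothetical input
if ($line =~ /(\d\d):(\d\d):(\d\d)/) {               # three pairs of parentheses
    my ($hour, $min, $sec) = ($1, $2, $3);           # copy them before another match runs
    print "hour=$hour min=$min sec=$sec\n";          # prints "hour=09 min=00 sec=00"
}</pre></blockquote>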
Suppose you have some text (a mail header) that looks like this:<blockquote><pre class="programlisting">From: gnat@perl.com
To: camelot@oreilly.com
Date: Mon, 17 Jul 2000 09:00:00 -1000
Subject: Eye of the needle</pre></blockquote>and you want to construct a hash that maps the text before each colon to the text afterward.  If you were looping through this text line by line (say, because you were reading it from a file) you could do that as follows:<blockquote><pre class="programlisting">while (&lt;&gt;) {
    /^(.*?): (.*)$/;    # Pre-colon text into $1, post-colon into $2
    $fields{$1} = $2;
}</pre></blockquote>Like <tt class="literal">$`</tt>, <tt class="literal">$&amp;</tt>, and <tt class="literal">$'</tt>, these numbered variables are dynamically scoped through the end of the enclosing block or <tt class="literal">eval</tt> string, or to the next successful pattern match, whichever comes first.  You can use them in the righthand side (the replacement part) of a substitute, too:<blockquote><pre class="programlisting">s/^(\S+) (\S+)/$2 $1/;  # Swap first two words</pre></blockquote><a name="INDEX-1647"></a>Groupings can nest, and when they do, the groupings are counted by the location of the left parenthesis.  So given the string "Primula Brandybuck", the pattern:</p><blockquote class="footnote"><a name="FOOTNOTE-9"></a><p>[9] You can't use <tt class="literal">$1</tt> for a backreference within the pattern because that would already have been interpolated as an ordinary variable back when the regex was compiled.  So we use the traditional <tt class="literal">\1</tt> backreference notation inside patterns.  For two- and three-digit backreference numbers, there is some ambiguity with octal character notation, but that is neatly solved by considering how many captured patterns are available.  For instance, if Perl sees a <tt class="literal">\11</tt> metasymbol, it's equivalent to <tt class="literal">$11</tt> only if there are at least 11 substrings captured earlier in the pattern.
Otherwise, it's equivalent to <tt class="literal">\011</tt>, that is, a tab character.</p></blockquote><blockquote><pre class="programlisting">/^((\w+) (\w+))$/</pre></blockquote><p>would capture "<tt class="literal">Primula Brandybuck</tt>" into <tt class="literal">$1</tt>, "<tt class="literal">Primula</tt>" into <tt class="literal">$2</tt>, and "<tt class="literal">Brandybuck</tt>" into <tt class="literal">$3</tt>.  This is depicted in <a href="ch05_07.htm#perl3-backrefs">Figure 5-1</a>.</p><a name="perl3-backrefs"></a><div class="figure"></div><h4 class="objtitle">Figure 5.1. Creating backreferences with parentheses</h4><p><a name="INDEX-1648"></a><a name="INDEX-1649"></a>Patterns with captures are often used in list context to populate a list of values, since the pattern is smart enough to return the captured substrings as a list:<blockquote><pre class="programlisting">($first, $last)        =  /^(\w+) (\w+)$/;
($full, $first, $last) =  /^((\w+) (\w+))$/;</pre></blockquote>With the <tt class="literal">/g</tt> modifier, a pattern can return multiple substrings from multiple matches, all in one list.  Suppose you had the mail header we saw earlier all in one string (in <tt class="literal">$_</tt>, say).  You could do the same thing as our line-by-line loop, but with one statement:<blockquote><pre class="programlisting">%fields = /^(.*?): (.*)$/gm;</pre></blockquote>The pattern matches four times, and each time it matches, it finds two substrings.  The <tt class="literal">/gm</tt> match returns all of these as a flat list of eight strings, which the list assignment to <tt class="literal">%fields</tt> will conveniently interpret as four key/value pairs, thus restoring harmony to the universe.</p><p><a name="INDEX-1650"></a><a name="INDEX-1651"></a><a name="INDEX-1652"></a><a name="INDEX-1653"></a>Several other special variables deal with text captured in pattern matches.
<tt class="literal">$&amp;</tt> contains the entire matched string, <tt class="literal">$`</tt> everything to the left of the match, <tt class="literal">$'</tt> everything to the right.  <tt class="literal">$+</tt> contains the contents of the last backreference.<blockquote><pre class="programlisting">$_ = "Speak, &lt;EM&gt;friend&lt;/EM&gt;, and enter.";
m[ (&lt;.*?&gt;) (.*?) (&lt;/.*?&gt;) ]x;     # A tag, then chars, then an end tag
print "prematch: $`\n";           # Speak,
print "match: $&amp;\n";              # &lt;EM&gt;friend&lt;/EM&gt;