
📄 ch05_09.htm

special to get Perl to realize that <tt class="literal">\t</tt> was a tab.
If Perl's patterns <em class="emphasis">were</em> just double-quote
interpolated, you would have; fortunately, they aren't.  They're
recognized directly by the regex parser.</p>
<blockquote class="footnote"><a name="FOOTNOTE-10"></a><p>[10] If you
didn't know what a <em class="emphasis">grep</em> program was before, you will now.  No system
should be without <em class="emphasis">grep</em>--we believe <em class="emphasis">grep</em> is the most useful small
program ever invented.  (It logically follows that we don't believe
Perl is a small program.)</p></blockquote>
<p><a name="INDEX-1692"></a>The real <em class="emphasis">grep</em> program has a <span class="option">-i</span> switch that turns off
case-sensitive matching.  You don't have to add such a switch to your
<em class="emphasis">pgrep</em> program; it can already handle that without modification.  You
just pass it a slightly fancier pattern, with an embedded <tt class="literal">/i</tt>
modifier:<blockquote><pre class="programlisting">% <tt class="userinput"><b>pgrep '(?i)ring' LotR*.pod</b></tt></pre></blockquote>
That now searches for any of "Ring", "ring", "RING", and so on.
You don't see this feature too much in literal patterns, since you can
always just write <tt class="literal">/ring/i</tt>.  But for patterns passed in on the
command line, in web search forms, or embedded in configuration files,
it can be a lifesaver.  (Speaking of rings.)</p>
<h3 class="sect3">5.9.2.2. The qr// quote regex operator</h3>
<p><a name="INDEX-1693"></a>Variables that interpolate into patterns necessarily do so at runtime, not compile time.
This slows down execution because Perl
has to check whether you've changed the contents of the variable;
if so, it would have to recompile the regular expression.
As mentioned in "Pattern-Matching Operators", if you promise never
to change the pattern, you can use the <tt class="literal">/o</tt> option to interpolate
and compile only once:<blockquote><pre class="programlisting">print if /$pattern/o;</pre></blockquote>
Although that works fine in our <em class="emphasis">pgrep</em> program, in the general
case, it doesn't.  Imagine you have a slew of patterns, and you
want to match each of them in a loop, perhaps like this:<blockquote><pre class="programlisting">foreach $item (@data) {
    foreach $pat (@patterns) {
        if ($item =~ /$pat/) { ... }
    }
}</pre></blockquote>
You couldn't write <tt class="literal">/$pat/o</tt> because the meaning of <tt class="literal">$pat</tt>
varies each time through the inner loop.</p>
<p>The solution to this is the <tt class="literal">qr/</tt><em class="replaceable">PATTERN</em><tt class="literal">/imosx</tt> operator.  This
operator quotes--and compiles--its <em class="replaceable">PATTERN</em> as a regular expression.
<em class="replaceable">PATTERN</em> is interpolated the same way as in
<tt class="literal">m/</tt><em class="replaceable">PATTERN</em><tt class="literal">/</tt>.  If
<tt class="literal">'</tt> is used as the delimiter, no interpolation of variables (or
the six translation escapes) is done.  The operator returns a Perl value that
may be used instead of the equivalent literal in a corresponding
pattern match or substitute.
For example:<blockquote><pre class="programlisting">$regex = qr/my.STRING/is;
s/$regex/something else/;</pre></blockquote>
is equivalent to:<blockquote><pre class="programlisting">s/my.STRING/something else/is;</pre></blockquote>
So for our nested loop problem above, preprocess your pattern first
using a separate loop:<blockquote><pre class="programlisting">@regexes = ();
foreach $pat (@patterns) {
    push @regexes, qr/$pat/;
}</pre></blockquote>
Or all at once using Perl's <tt class="literal">map</tt> operator:<blockquote><pre class="programlisting">@regexes = map { qr/$_/ } @patterns;</pre></blockquote>
And then change the loop to use those precompiled regexes:<blockquote><pre class="programlisting">foreach $item (@data) {
    foreach $re (@regexes) {
        if ($item =~ /$re/) { ... }
    }
}</pre></blockquote>
Now when you run the match, Perl doesn't have to create a compiled
regular expression on each <tt class="literal">if</tt> test, because it sees that it
already has one.</p>
<p>The result of a <tt class="literal">qr//</tt> may even be interpolated into a larger
match, as though it were a simple string:<blockquote><pre class="programlisting">$regex = qr/$pattern/;
$string =~ /foo${regex}bar/;   # interpolate into larger patterns</pre></blockquote>
This time, Perl does recompile the pattern, but you could always chain
several <tt class="literal">qr//</tt> operators together into one.</p>
<p>The reason this works is because the <tt class="literal">qr//</tt> operator
returns a special kind of object that has a stringification overload
as described in <a href="ch13_01.htm">Chapter 13, "Overloading"</a>.  If you print
out the return value, you'll see the equivalent string:<blockquote><pre class="programlisting">$re = qr/my.STRING/is;
print $re;                  # prints (?si-xm:my.STRING)</pre></blockquote>
The <tt class="literal">/s</tt> and <tt class="literal">/i</tt> modifiers were enabled in the pattern because
they were supplied to <tt class="literal">qr//</tt>.
The <tt class="literal">/x</tt> and <tt class="literal">/m</tt>, however,
are disabled because they were not.</p>
<p>Any time you interpolate strings of unknown provenance into a pattern,
you should be prepared to handle any exceptions thrown by the regex
compiler, in case someone fed you a string containing untamable beasties:<blockquote><pre class="programlisting">$re = qr/$pat/is;                      # might escape and eat you
$re = eval { qr/$pat/is } || warn ...  # caught it in an outer cage</pre></blockquote>
For more on the <tt class="literal">eval</tt> operator, see <a href="ch29_01.htm">Chapter 29, "Functions"</a>.</p>
<a name="INDEX-1694"></a><a name="INDEX-1695"></a>
<h3 class="sect2">5.9.3. The Regex Compiler</h3>
<p><a name="INDEX-1696"></a><a name="INDEX-1697"></a>After the variable interpolation pass has had its way with the string, the
regex parser finally gets a shot at trying to understand your regular
expression.  There's not actually a great deal that can go wrong at
this point, apart from messing up the parentheses, or using a sequence
of metacharacters that doesn't mean anything.  The parser does a
recursive-descent analysis of your regular expression and, if it
parses, turns it into a form suitable for interpretation by the
Engine (see the next section).  Most of the interesting stuff that goes on
in the parser involves optimizing your regular expression to run as
fast as possible.  We're not going to explain that part.  It's a trade
secret.  (Rumors that looking at the regular expression code will drive
you insane are greatly exaggerated.  We hope.)</p>
<p>But you might like to know what the parser actually thought of your
regular expression, and if you ask it politely, it will tell you.  By
saying <tt class="literal">use re "debug"</tt>, you can examine how the regex parser
processes your pattern.
(You can also see the same information by
using the <span class="option">-Dr</span> command-line switch, which is available to you if your
Perl was compiled with the <span class="option">-DDEBUGGING</span> flag during installation.)<blockquote><pre class="programlisting">#!/usr/bin/perl
use re "debug";
"Smeagol" =~ /^Sm(.*)g[aeiou]l$/;</pre></blockquote>
The output is below.  You can see that prior to execution Perl
compiles the regex and assigns meaning to the components of the
pattern: <tt class="literal">BOL</tt> for the beginning of line
(<tt class="literal">^</tt>), <tt class="literal">REG_ANY</tt> for the dot, and so
on:<blockquote><pre class="programlisting">Compiling REx `^Sm(.*)g[aeiou]l$'
size 24 first at 2
rarest char l at 0
rarest char S at 0
   1: BOL(2)
   2: EXACT &lt;Sm&gt;(4)
   4: OPEN1(6)
   6:   STAR(8)
   7:     REG_ANY(0)
   8: CLOSE1(10)
  10: EXACT &lt;g&gt;(12)
  12: ANYOF[aeiou](21)
  21: EXACT &lt;l&gt;(23)
  23: EOL(24)
  24: END(0)
anchored `Sm' at 0 floating `l'$ at 4..2147483647
     (checking anchored) anchored(BOL) minlen 5
Omitting $` $&amp; $' support.</pre></blockquote>
Some of the lines summarize the conclusions of the regex optimizer.  It
knows that the string must start with "<tt class="literal">Sm</tt>", and that therefore
there's no reason to do the ordinary left-to-right scan.  It knows that
the string must end with an "<tt class="literal">l</tt>", so it can reject out of hand any
string that doesn't.  It knows that the string must be at least five
characters long, so it can ignore any string shorter than that right
off the bat.  It also knows what the rarest character in each constant
string is, which can help in searching "studied" strings.  (See
<tt class="literal">study</tt> in <a href="ch29_01.htm">Chapter 29, "Functions"</a>.)</p>
<p>It then goes on to trace how it executes the pattern:<blockquote><pre class="programlisting">EXECUTING...
Guessing start of match, REx `^Sm(.*)g[aeiou]l$' against `Smeagol'...
Guessed: match at offset 0
Matching REx `^Sm(.*)g[aeiou]l$' against `Smeagol'
  Setting an EVAL scope, savestack=3
   0 &lt;&gt; &lt;Smeagol&gt;         |  1:  BOL
   0 &lt;&gt; &lt;Smeagol&gt;         |  2:  EXACT &lt;Sm&gt;
   2 &lt;Sm&gt; &lt;eagol&gt;         |  4:  OPEN1
   2 &lt;Sm&gt; &lt;eagol&gt;         |  6:  STAR
                           REG_ANY can match 5 times out of 32767...
  Setting an EVAL scope, savestack=3
   7 &lt;Smeagol&gt; &lt;&gt;         |  8:    CLOSE1
   7 &lt;Smeagol&gt; &lt;&gt;         | 10:    EXACT &lt;g&gt;
                              failed...
   6 &lt;Smeago&gt; &lt;l&gt;         |  8:    CLOSE1
   6 &lt;Smeago&gt; &lt;l&gt;         | 10:    EXACT &lt;g&gt;
                              failed...
   5 &lt;Smeag&gt; &lt;ol&gt;         |  8:    CLOSE1
   5 &lt;Smeag&gt; &lt;ol&gt;         | 10:    EXACT &lt;g&gt;
                              failed...
   4 &lt;Smea&gt; &lt;gol&gt;         |  8:    CLOSE1
   4 &lt;Smea&gt; &lt;gol&gt;         | 10:    EXACT &lt;g&gt;
   5 &lt;Smeag&gt; &lt;ol&gt;         | 12:    ANYOF[aeiou]
   6 &lt;Smeago&gt; &lt;l&gt;         | 21:    EXACT &lt;l&gt;
   7 &lt;Smeagol&gt; &lt;&gt;         | 23:    EOL
   7 &lt;Smeagol&gt; &lt;&gt;         | 24:    END
Match successful!
Freeing REx: `^Sm(.*)g[aeiou]l$'</pre></blockquote>
If you follow the stream of whitespace down the middle of
<tt class="literal">Smeagol</tt>, you can actually see how the Engine
overshoots to let the <tt class="literal">.*</tt> be as greedy as possible,
then backtracks on that until it finds a way for the rest of the
pattern to match.  But that's what the next section is about.</p>
<a name="ch05-sect-engine"></a><h3 class="sect2">5.9.4.
The Little Engine That /Could(n't)?/</h3>
<a name="INDEX-1698"></a><a name="INDEX-1699"></a><a name="INDEX-1700"></a><a name="INDEX-1701"></a><a name="INDEX-1702"></a><a name="INDEX-1703"></a>
<p>And now we'd like to tell you the story of the Little Regex Engine that
says, "I think I can.  I think I can.  I think I can."</p>
<p>In this section, we lay out the rules used by Perl's regular expression
engine to match your pattern against a string.  The Engine is
extremely persistent and hardworking.  It's quite capable of working
even after you think it should quit.  The Engine doesn't give up until
it's certain there's no way to match the pattern against the string.
The Rules below explain how the Engine "thinks it can" for as
long as possible, until it <em class="emphasis">knows</em> it can or can't.  The problem for our
Engine is that its task is not merely to pull a train over a hill.  It
has to search a (potentially) very complicated space of possibilities,
keeping track of where it has been and where it hasn't.</p>
<p><a name="INDEX-1704"></a>The Engine uses a nondeterministic finite-state automaton (NFA, not to
be confused with NFL, a nondeterministic football league) to find a
match.  That just means that it keeps track of what it has tried and
what it hasn't, and when something doesn't pan out, it backs up and
tries something else.  This is known as <em class="emphasis">backtracking</em>.  (Er, sorry,
we didn't invent these terms.  Really.)  The Engine is capable of
trying a million subpatterns at one spot, then giving up on all those,
backing up to within one choice of the beginning, and trying the
million subpatterns again at a different spot.  The Engine is not
terribly intelligent; just persistent, and thorough.
If you're cagey,
you can give the Engine an efficient pattern that doesn't let it do a
lot of silly backtracking.</p>
<p>When someone trots out a phrase like "Regexes choose the leftmost,
longest match", that means that Perl generally prefers the leftmost
match over longest match.  But the Engine doesn't realize it's
"preferring" anything, and it's not really thinking at all, just
gutting it out.  The overall preferences are an emergent behavior
resulting from many individual and unrelated choices.  Here are those
choices:<a href="#FOOTNOTE-11">[11]</a></p>
<blockquote class="footnote"><a name="FOOTNOTE-11"></a><p>[11] Some of these choices may be skipped if the regex
optimizer has any say, which is equivalent to the Little Engine
simply jumping through the hill via quantum tunneling.  But for this
discussion we're pretending the optimizer doesn't exist.</p></blockquote>
<dl><dt><b>Rule 1</b></dt><dd><p><a name="INDEX-1705"></a>The Engine tries to match as far left in the string as it can, such
that the entire regular expression matches under Rule&nbsp;2.</p>
<p>The Engine starts just before the first character and tries to match
the entire pattern starting there.  The entire pattern matches if and
only if the Engine reaches the end of the pattern before it runs off
the end of the string.  If it matches, it quits immediately--it
