perlfaq6.html

these needs.  Two examples are HTML::Parser and XML::Parser. There
are many others.</p>
<p>An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
and possibly nested single chars, like <code>`</code> and <code>'</code>, <code>{</code> and <code>}</code>,
or <code>(</code> and <code>)</code> can be found in
<a href="http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz">http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz</a>.</p>
<p>The C::Scan module from CPAN also contains such subs for internal use,
but they are undocumented.</p>
<p>
</p>
<h2><a name="what_does_it_mean_that_regexes_are_greedy_how_can_i_get_around_it">What does it mean that regexes are greedy?  How can I get around it?</a></h2>
<p>Most people mean that greedy regexes match as much as they can.
Technically speaking, it's actually the quantifiers (<code>?</code>, <code>*</code>, <code>+</code>,
<code>{}</code>) that are greedy rather than the whole pattern; Perl prefers local
greed and immediate gratification to overall greed.  To get non-greedy
versions of the same quantifiers, use (<code>??</code>, <code>*?</code>, <code>+?</code>, <code>{}?</code>).</p>
<p>An example:</p>
<pre>
        <span class="variable">$s1</span> <span class="operator">=</span> <span class="variable">$s2</span> <span class="operator">=</span> <span class="string">"I am very very cold"</span><span class="operator">;</span>
        <span class="variable">$s1</span> <span class="operator">=~</span> <span class="regex">s/ve.*y //</span><span class="operator">;</span>      <span class="comment"># I am cold</span>
        <span class="variable">$s2</span> <span class="operator">=~</span> <span class="regex">s/ve.*?y //</span><span class="operator">;</span>     <span class="comment"># I am very cold</span>
</pre>
<p>Notice how the second substitution stopped matching as soon as it
encountered &quot;y &quot;.  The <code>*?</code> quantifier effectively tells the regular
expression engine to find a match as quickly as possible and pass
control on to whatever is next in line, like you would if you were
playing hot potato.</p>
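<p>To see exactly what each quantifier consumes, it can help to capture the
matched text instead of deleting it.  The following is a minimal sketch that
reuses the string from the example above; the variable names <code>$text</code>,
<code>$greedy</code>, and <code>$lazy</code> are only illustrative.</p>
<pre>
    use strict;
    use warnings;

    my $text = "I am very very cold";

    # Greedy: .* grabs as much as it can, then backs off to the last "y ".
    my ($greedy) = $text =~ /(ve.*y )/;

    # Non-greedy: .*? stops at the first "y " that lets the match succeed.
    my ($lazy) = $text =~ /(ve.*?y )/;

    print "greedy: [$greedy]\n";   # greedy: [very very ]
    print "lazy:   [$lazy]\n";     # lazy:   [very ]
</pre>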
<p>
</p>
<h2><a name="how_do_i_process_each_word_on_each_line">How do I process each word on each line?</a></h2>
<p>Use the split function:</p>
<pre>
    <span class="keyword">while</span> <span class="operator">(&lt;&gt;)</span> <span class="operator">{</span>
        <span class="keyword">foreach</span> <span class="variable">$word</span> <span class="operator">(</span> <span class="keyword">split</span> <span class="operator">)</span> <span class="operator">{</span>
            <span class="comment"># do something with $word here</span>
        <span class="operator">}</span>
    <span class="operator">}</span>
</pre>
<p>Note that these aren't really words in the English sense; they're just
chunks of consecutive non-whitespace characters.</p>
<p>To work with only alphanumeric sequences (including underscores), you
might consider</p>
<pre>
    <span class="keyword">while</span> <span class="operator">(&lt;&gt;)</span> <span class="operator">{</span>
        <span class="keyword">foreach</span> <span class="variable">$word</span> <span class="operator">(</span><span class="regex">m/(\w+)/g</span><span class="operator">)</span> <span class="operator">{</span>
            <span class="comment"># do something with $word here</span>
        <span class="operator">}</span>
    <span class="operator">}</span>
</pre>
<p>
</p>
<h2><a name="how_can_i_print_out_a_wordfrequency_or_linefrequency_summary">How can I print out a word-frequency or line-frequency summary?</a></h2>
<p>To do this, you have to parse out each word in the input stream.  We'll
pretend that by word you mean chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:</p>
<pre>
    <span class="keyword">while</span> <span class="operator">(&lt;&gt;)</span> <span class="operator">{</span>
        <span class="keyword">while</span> <span class="operator">(</span> <span class="regex">/(\b[^\W_\d][\w'-]+\b)/g</span> <span class="operator">)</span> <span class="operator">{</span>   <span class="comment"># misses "`sheep'"</span>
            <span class="variable">$seen</span><span class="operator">{</span><span class="variable">$1</span><span class="operator">}</span><span class="operator">++;</span>
        <span class="operator">}</span>
    <span class="operator">}</span>
    <span class="keyword">while</span> <span class="operator">(</span> <span class="operator">(</span><span class="variable">$word</span><span class="operator">,</span> <span class="variable">$count</span><span class="operator">)</span> <span class="operator">=</span> <span class="keyword">each</span> <span class="variable">%seen</span> <span class="operator">)</span> <span class="operator">{</span>
        <span class="keyword">print</span> <span class="string">"$count $word\n"</span><span class="operator">;</span>
    <span class="operator">}</span>
</pre>
<p>If you wanted to do the same thing for lines, you wouldn't need a
regular expression:</p>
<pre>
    <span class="keyword">while</span> <span class="operator">(&lt;&gt;)</span> <span class="operator">{</span>
        <span class="variable">$seen</span><span class="operator">{</span><span class="variable">$_</span><span class="operator">}</span><span class="operator">++;</span>
    <span class="operator">}</span>
    <span class="keyword">while</span> <span class="operator">(</span> <span class="operator">(</span><span class="variable">$line</span><span class="operator">,</span> <span class="variable">$count</span><span class="operator">)</span> <span class="operator">=</span> <span class="keyword">each</span> <span class="variable">%seen</span> <span class="operator">)</span> <span class="operator">{</span>
        <span class="keyword">print</span> <span class="string">"$count $line"</span><span class="operator">;</span>
    <span class="operator">}</span>
</pre>
<p>If you want these output in a sorted order, see <a href="../../lib/Pod/perlfaq4.html">the perlfaq4 manpage</a>: &quot;How do I
sort a hash (optionally by value instead of key)?&quot;.</p>
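<p>As a rough sketch of what such sorting might look like, the following
orders the words by descending count and breaks ties alphabetically.  The
contents of <code>%seen</code> here are made-up sample data, and
<code>@by_count</code> is just an illustrative name.</p>
<pre>
    use strict;
    use warnings;

    # Hypothetical sample counts standing in for the %seen hash built above.
    my %seen = ( perl =&gt; 3, regex =&gt; 5, hash =&gt; 1, match =&gt; 3 );

    # Highest count first; ties are broken alphabetically by word.
    my @by_count = sort { $seen{$b} &lt;=&gt; $seen{$a} or $a cmp $b } keys %seen;

    printf "%5d %s\n", $seen{$_}, $_ for @by_count;
</pre>
<p>With this sample data the words come out in the order regex, match,
perl, hash.</p>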
<p>
</p>
<h2><a name="how_can_i_do_approximate_matching">How can I do approximate matching?</a></h2>
<p>See the module String::Approx available from CPAN.</p>
<p>
</p>
<h2><a name="how_do_i_efficiently_match_many_regular_expressions_at_once">How do I efficiently match many regular expressions at once?</a></h2>
<p>( contributed by brian d foy )</p>
<p>Avoid asking Perl to compile a regular expression every time
you want to match it.  In this example, perl must recompile
the regular expression for every iteration of the <code>foreach()</code>
loop since it has no way to know what $pattern will be.</p>
<pre>
    <span class="variable">@patterns</span> <span class="operator">=</span> <span class="string">qw( foo bar baz )</span><span class="operator">;</span>
</pre>
<pre>
    <span class="variable">LINE</span><span class="operator">:</span> <span class="keyword">while</span><span class="operator">(</span> <span class="operator">&lt;&gt;</span> <span class="operator">)</span>
        <span class="operator">{</span>
        <span class="keyword">foreach</span> <span class="variable">$pattern</span> <span class="operator">(</span> <span class="variable">@patterns</span> <span class="operator">)</span>
            <span class="operator">{</span>
            <span class="comment"># print the line and move on only when a pattern matches</span>
            <span class="keyword">if</span><span class="operator">(</span> <span class="regex">/\b$pattern\b/i</span> <span class="operator">)</span>
                <span class="operator">{</span>
                <span class="keyword">print</span><span class="operator">;</span>
                <span class="keyword">next</span> <span class="variable">LINE</span><span class="operator">;</span>
                <span class="operator">}</span>
            <span class="operator">}</span>
        <span class="operator">}</span>
</pre>
<p>The qr// operator, introduced in perl 5.005, compiles a
regular expression without applying it.  When you use the
pre-compiled version of the regex, perl does less work. In
this example, I inserted a <a href="../../lib/Pod/perlfunc.html#item_map"><code>map()</code></a> to turn each pattern into
its pre-compiled form.  The loop is otherwise the same, except
that it matches against the pre-compiled $pattern directly:
wrapping it in \b and /i again would make perl compile a new,
larger regex on each pass.</p>
<pre>
    <span class="variable">@patterns</span> <span class="operator">=</span> <span class="keyword">map</span> <span class="operator">{</span> <span class="string">qr/\b$_\b/i</span> <span class="operator">}</span> <span class="string">qw( foo bar baz )</span><span class="operator">;</span>
</pre>
<pre>
    <span class="variable">LINE</span><span class="operator">:</span> <span class="keyword">while</span><span class="operator">(</span> <span class="operator">&lt;&gt;</span> <span class="operator">)</span>
        <span class="operator">{</span>
        <span class="keyword">foreach</span> <span class="variable">$pattern</span> <span class="operator">(</span> <span class="variable">@patterns</span> <span class="operator">)</span>
            <span class="operator">{</span>
            <span class="keyword">if</span><span class="operator">(</span> <span class="regex">/$pattern/</span> <span class="operator">)</span>
                <span class="operator">{</span>
                <span class="keyword">print</span><span class="operator">;</span>
                <span class="keyword">next</span> <span class="variable">LINE</span><span class="operator">;</span>
                <span class="operator">}</span>
            <span class="operator">}</span>
        <span class="operator">}</span>
</pre>
<p>In some cases, you may be able to make several patterns into
a single regular expression.  Beware of situations that require
backtracking though.</p>
<pre>
    <span class="variable">$regex</span> <span class="operator">=</span> <span class="keyword">join</span> <span class="string">'|'</span><span class="operator">,</span> <span class="string">qw( foo bar baz )</span><span class="operator">;</span>
</pre>
<pre>
    <span class="variable">LINE</span><span class="operator">:</span> <span class="keyword">while</span><span class="operator">(</span> <span class="operator">&lt;&gt;</span> <span class="operator">)</span>
        <span class="operator">{</span>
        <span class="keyword">print</span> <span class="keyword">if</span> <span class="regex">/\b(?:$regex)\b/i</span><span class="operator">;</span>
        <span class="operator">}</span>
</pre>
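<p>If the words come from outside the program, they may contain regex
metacharacters.  One possible variation, sketched here under that assumption,
escapes each word with <code>quotemeta()</code> and compiles the combined
alternation once with qr//.  The names <code>@words</code>,
<code>$alternation</code>, <code>@lines</code>, and <code>@matched</code> are
illustrative, and the sample lines are invented.</p>
<pre>
    use strict;
    use warnings;

    my @words = qw( foo bar baz );

    # quotemeta() escapes regex metacharacters so each word matches literally.
    my $alternation = join '|', map { quotemeta($_) } @words;

    # Compile the combined pattern once, outside the loop.
    my $regex = qr/\b(?:$alternation)\b/i;

    # Hypothetical sample input standing in for lines read from &lt;&gt;.
    my @lines   = ( "Foo fighters", "a crowbar", "BAZ!", "nothing here" );
    my @matched = grep { /$regex/ } @lines;

    print "$_\n" for @matched;
</pre>
<p>Only "Foo fighters" and "BAZ!" are printed: the "bar" in "crowbar" has a
word character before it, so there is no word boundary there.</p>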
<p>For more details on regular expression efficiency, see Mastering
Regular Expressions by Jeffrey Friedl.  He explains how regular
expression engines work and why some patterns are surprisingly
inefficient.  Once you understand how perl applies regular
expressions, you can tune them for individual situations.</p>
<p>
</p>
<h2><a name="why_don_t_wordboundary_searches_with__b_work_for_me">Why don't word-boundary searches with <code>\b</code> work for me?</a></h2>
<p>(contributed by brian d foy)</p>
<p>Ensure that you know what \b really does: it's the boundary between a
word character, \w, and something that isn't a word character. That
thing that isn't a word character might be \W, but it can also be the
start or end of the string.</p>
<p>It's not (not!) the boundary between whitespace and non-whitespace,
and it's not the stuff between words we use to create sentences.</p>
<p>In regex speak, a word boundary (\b) is a &quot;zero width assertion&quot;,
meaning that it doesn't represent a character in the string, but a
condition at a certain position.</p>
<p>For the regular expression /\bPerl\b/, there has to be a word
boundary before the &quot;P&quot; and after the &quot;l&quot;.  As long as something other
than a word character precedes the &quot;P&quot; and follows the &quot;l&quot;, the
pattern will match. These strings match /\bPerl\b/:</p>
<pre>
        &quot;Perl&quot;    # no word char before P or after l
        &quot;Perl &quot;   # same as previous (space is not a word char)
        &quot;'Perl'&quot;  # the ' char is not a word char
        &quot;Perl's&quot;  # no word char before P, non-word char after &quot;l&quot;</pre>
<p>These strings do not match /\bPerl\b/.</p>
<pre>
        &quot;Perl_&quot;   # _ is a word char!
        &quot;Perler&quot;  # no word char before P, but one after l</pre>
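<p>A quick way to check these cases yourself is to run the pattern over the
sample strings.  This is only a small sketch; <code>@samples</code> and
<code>$verdict</code> are illustrative names.</p>
<pre>
    use strict;
    use warnings;

    # The sample strings from the two lists above.
    my @samples = ( "Perl", "Perl ", "'Perl'", "Perl's", "Perl_", "Perler" );

    for my $string (@samples) {
        # \b asserts a position, so it consumes no characters itself.
        my $verdict = $string =~ /\bPerl\b/ ? "matches" : "does not match";
        print qq{"$string" $verdict\n};
    }
</pre>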
<p>You don't have to use \b only to match whole words, though.  You can
also use it to look for non-word characters surrounded by word characters.
These strings match the pattern /\b'\b/.</p>
<pre>
        &quot;don't&quot;   # the ' char is surrounded by &quot;n&quot; and &quot;t&quot;
