matches any occurrence of "foo" that isn't followed by "bar".  Note
however that lookahead and lookbehind are NOT the same thing.  You cannot
use this for lookbehind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want.  That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match.  You would have to do something like C</(?!foo)...bar/> for that.   We
say "like" because there's the case of your "bar" not having three characters
before it.  You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)

For lookbehind see below.
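
The pitfall described above can be checked directly.  The following is a
minimal sketch; the sample strings and the hash names C<%naive>, C<%fixed>
and C<%prematch> are assumptions chosen for illustration only.

```perl
use strict;
use warnings;

# Sample inputs (assumed for illustration).
my @samples = ('foobar', 'bazbar', 'bar', 'xbar');

my (%naive, %fixed, %prematch);
for my $s (@samples) {
    # Lookahead only inspects what comes next, so this accepts "foobar" too.
    $naive{$s} = $s =~ /(?!foo)bar/ ? 1 : 0;

    # Require three characters that are not "foo" in front of "bar", or
    # fewer than three characters between the string start and "bar".
    $fixed{$s} = $s =~ /(?:(?!foo)...|^.{0,2})bar/ ? 1 : 0;

    # Find "bar" first, then inspect the text before the match.
    $prematch{$s} = ($s =~ /bar/ && $` !~ /foo$/) ? 1 : 0;
}

printf "%-8s naive=%d fixed=%d prematch=%d\n",
    $_, $naive{$_}, $fixed{$_}, $prematch{$_} for @samples;
```

Only the first pattern accepts "foobar"; the other two forms reject it while
still accepting the remaining samples.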

=item C<(?E<lt>=pattern)>

A zero-width positive lookbehind assertion.  For example, C</(?E<lt>=\t)\w+/>
matches a word following a tab, without including the tab in C<$&>.
Works only for fixed-width lookbehind.

=item C<(?<!pattern)>

A zero-width negative lookbehind assertion.  For example, C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar".
Works only for fixed-width lookbehind.
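
Both lookbehind forms can be tried with a short script.  This is only a
sketch; the input strings and the variable names C<$after_tab> and C<@hits>
are assumptions made for the example.

```perl
use strict;
use warnings;

# Positive lookbehind: the tab must precede the word but is not part of $&.
my $line = "key\tvalue";
my $after_tab = $line =~ /(?<=\t)\w+/ ? $& : undef;

# Negative lookbehind: "foo" counts only when "bar" does not precede it.
my @hits = grep { /(?<!bar)foo/ } ('barfoo', 'bazfoo', 'foo');

print "after tab: $after_tab\n";
print "hits: @hits\n";
```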

=item C<(?{ code })>

Experimental "evaluate any Perl code" zero-width assertion.  Always
succeeds.  C<code> is not interpolated.  Currently the rules to
determine where the C<code> ends are somewhat convoluted.

The C<code> is properly scoped in the following sense: if the assertion
is backtracked (compare L<"Backtracking">), all the changes introduced after
C<local>isation are undone, so

  $_ = 'a' x 8;
  m< 
     (?{ $cnt = 0 })			# Initialize $cnt.
     (
       a 
       (?{
           local $cnt = $cnt + 1;	# Update $cnt, backtracking-safe.
       })
     )*  
     aaaa
     (?{ $res = $cnt })			# On success copy to non-localized
					# location.
   >x;

will set C<$res = 4>.  Note that after the match $cnt returns to the globally
introduced value 0, since the scopes which restrict C<local> statements
are unwound.

This assertion may be used as a L<C<(?(condition)yes-pattern|no-pattern)>>
switch.  If I<not> used in this way, the result of evaluating C<code>
is put into the variable $^R.  This happens immediately, so $^R can be used from
other C<(?{ code })> assertions inside the same regular expression.

The above assignment to $^R is properly localized, thus the old value of $^R
is restored if the assertion is backtracked (compare L<"Backtracking">).
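
As a rough illustration of how $^R carries a value from one code block to
the next, consider the sketch below.  The strings being matched and the
variables C<$count> and C<$tag> are assumptions introduced for the example.

```perl
use strict;
use warnings;

# Each code block's value lands in $^R; the second block reads the
# value left there by the first one.
my $count;
if ('ab' =~ / a (?{ 1 }) b (?{ $^R + 1 }) /x) {
    $count = $^R;
}

# The alternative that matched can be recorded the same way.
my $tag;
if ('xb' =~ / (?: a (?{ 'saw a' }) | b (?{ 'saw b' }) ) /x) {
    $tag = $^R;
}

print "count=$count tag=$tag\n";
```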

Due to security concerns, this construction is not allowed if the regular
expression involves run-time interpolation of variables, unless the
C<use re 'eval'> pragma is in effect (see L<re>), or the variables contain
results of the qr() operator (see L<perlop/"qr/STRING/imosx">).

This restriction is due to the wide-spread (questionable) practice of 
using the construct

    $re = <>;
    chomp $re;
    $string =~ /$re/;

without tainting.  While this code is frowned upon from a security point
of view, when C<(?{})> was introduced, it was considered bad to add
I<new> security holes to existing scripts.

B<NOTE:>  Use of the above insecure snippet without also enabling taint mode
is to be severely frowned upon.  C<use re 'eval'> does not disable tainting
checks, so to allow $re in the above snippet to contain C<(?{})>
I<with tainting enabled>, one needs both to C<use re 'eval'> and to untaint
$re.
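
The difference between interpolating a qr() result and interpolating a plain
string can be seen in a small experiment.  This is a sketch under stated
assumptions: the variable names are invented for the example, taint mode is
not enabled, and the wording of the error message may differ between Perl
versions.

```perl
use strict;
use warnings;

our $seen = 0;

# A code block compiled inside qr//: interpolating the result is allowed.
my $trusted = qr/(?{ $seen++ })/;
my $ok = 'abc' =~ /b$trusted/ ? 1 : 0;

# The same text held in a plain string is rejected at run time,
# because no "use re 'eval'" is in scope here.
my $untrusted = '(?{ $seen++ })';
my $err = '';
unless (eval { my $m = 'abc' =~ /b$untrusted/; 1 }) {
    $err = $@;
}

print "qr// version matched: $ok\n";
print "string version died: ", ($err ? 'yes' : 'no'), "\n";
```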

=item C<(?E<gt>pattern)>

An "independent" subexpression.  Matches the substring that a
I<standalone> C<pattern> would match if anchored at the given position,
B<and only this substring>.

Say, C<^(?E<gt>a*)ab> will never match, since C<(?E<gt>a*)> (anchored
at the beginning of string, as above) will match I<all> characters
C<a> at the beginning of string, leaving no C<a> for C<ab> to match.
In contrast, C<a*ab> will match the same as C<a+b>, since the match of
the subgroup C<a*> is influenced by the following group C<ab> (see
L<"Backtracking">).  In particular, C<a*> inside C<a*ab> will match
fewer characters than a standalone C<a*>, since this makes the tail match.

An effect similar to C<(?E<gt>pattern)> may be achieved by

   (?=(pattern))\1

since the lookahead is in I<"logical"> context, and thus matches the same
substring as a standalone C<pattern>.  The following C<\1> eats the matched
string, thus making a zero-length assertion into an analogue of
C<(?E<gt>...)>.  (The difference between these two constructs is that the
second one uses a capturing group, thus shifting ordinals of
backreferences in the rest of a regular expression.)
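
The three forms discussed above can be compared on one input.  The sketch
below assumes the test string C<aaab> and illustrative variable names; it
instantiates C<pattern> as C<a*>.

```perl
use strict;
use warnings;

my $s = 'aaab';

# a* keeps every "a" and never gives one back, so "ab" cannot match.
my $atomic = $s =~ /^(?>a*)ab/ ? 1 : 0;

# An ordinary a* backtracks by one "a", which lets "ab" match.
my $ordinary = $s =~ /^a*ab/ ? 1 : 0;

# Lookahead plus backreference behaves like the independent group.
my $lookahead = $s =~ /^(?=(a*))\1ab/ ? 1 : 0;

print "atomic=$atomic ordinary=$ordinary lookahead=$lookahead\n";
```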

This construct is useful for optimizations of "eternal"
matches, because it will not backtrack (see L<"Backtracking">).  

    m{ \(
          (
            [^()]+
          |
            \( [^()]* \)
          )+
       \)
     }x

That will efficiently match a nonempty group with matching
two-or-less-level-deep parentheses.  However, if there is no such group,
it will take virtually forever on a long string.  That's because there are
so many different ways to split a long string into several substrings.
This is what C<(.+)+> is doing, and C<(.+)+> is similar to a subpattern
of the above pattern.  Consider that the above pattern detects no-match
on C<((()aaaaaaaaaaaaaaaaaa> in several seconds, but that  each extra
letter doubles this time.  This exponential performance will make it
appear that your program has hung.

However, a tiny modification of this pattern 

    m{ \(
          (
            (?> [^()]+ )
          |
            \( [^()]* \)
          )+
       \)
     }x

which uses C<(?E<gt>...)> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth of
the time when used on a similar string with 1000000 C<a>s.  Be aware,
however, that this pattern currently triggers a warning message under
B<-w> saying it C<"matches the null string many times">.

On simple groups, such as the pattern C<(?E<gt> [^()]+ )>, a comparable
effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.
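
One way to start the suggested exercise is to run the three variants over a
handful of inputs and compare what each one matches.  The sketch below makes
some assumptions: the inputs are deliberately short so that backtracking
stays cheap, the inner group is written as non-capturing, and the names
C<$plain>, C<$atomic> and C<$ahead> are invented for the example.  It
checks agreement on these samples only and does not measure speed.

```perl
use strict;
use warnings;

my $plain  = qr{ \( (?: [^()]+             | \( [^()]* \) )+ \) }x;
my $atomic = qr{ \( (?: (?> [^()]+ )       | \( [^()]* \) )+ \) }x;
my $ahead  = qr{ \( (?: [^()]+ (?! [^()] ) | \( [^()]* \) )+ \) }x;

my @inputs = ('(abc)', 'x(a(b)c)y', '(())', '()', '((()aaa');
my @names  = qw(plain atomic ahead);
my %re     = (plain => $plain, atomic => $atomic, ahead => $ahead);

# Record the matched text, or undef when a variant finds no match.
my %result;
for my $in (@inputs) {
    for my $name (@names) {
        $result{$in}{$name} = $in =~ $re{$name} ? $& : undef;
    }
}

for my $in (@inputs) {
    my @got = map { defined $result{$in}{$_} ? $result{$in}{$_} : 'no match' } @names;
    print "$in: @got\n";
}
```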

=item C<(?(condition)yes-pattern|no-pattern)>

=item C<(?(condition)yes-pattern)>

Conditional expression.  C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), or a lookahead/lookbehind/evaluate zero-width assertion.

Say,

    m{ ( \( )? 
       [^()]+ 
       (?(1) \) ) 
     }x

matches a chunk of non-parentheses, possibly included in parentheses
themselves.
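
To watch the condition at work, the pattern can be anchored so that an
unbalanced string cannot match on a substring.  In the sketch below, the
C<\A> and C<\z> anchors, the sample strings and the names C<$chunk> and
C<%ok> are assumptions added for the example.

```perl
use strict;
use warnings;

# \A and \z are added here (not part of the pattern above) so that
# the whole string must be one chunk.
my $chunk = qr{ \A ( \( )? [^()]+ (?(1) \) ) \z }x;

my %ok;
$ok{$_} = ($_ =~ $chunk) ? 1 : 0 for ('abc', '(abc)', '(abc', 'abc)');

print "$_ => $ok{$_}\n" for sort keys %ok;
```

The closing parenthesis is demanded only when group 1 matched an opening
one, so the two unbalanced strings are rejected.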

=item C<(?imsx-imsx)>

One or more embedded pattern-match modifiers.  This is particularly
useful for patterns that are specified in a table somewhere, some of
which want to be case sensitive, and some of which don't.  The case
insensitive ones need to include merely C<(?i)> at the front of the
pattern.  For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { } 

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { } 

Letters after C<-> switch modifiers off.

These modifiers are localized inside an enclosing group (if any).  Say,

    ( (?i) blah ) \s+ \1

(assuming C<x> modifier, and no C<i> modifier outside of this group)
will match a repeated (I<including the case>!) word C<blah> in any
case.
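
The scoping rules above can be exercised with a few matches.  This is a
sketch only; the sample strings and variable names are assumptions made for
illustration.

```perl
use strict;
use warnings;

# Patterns kept as data; only the second asks for case-insensitivity.
my @table = ('foobar', '(?i)foobar');
my @table_hits;
push @table_hits, ('FooBar' =~ /$_/ ? 1 : 0) for @table;

# (?i) is confined to group 1, so \1 must repeat the word exactly as matched.
my $repeat = qr/ ( (?i) blah ) \s+ \1 /x;
my %repeat_hit;
$repeat_hit{$_} = ($_ =~ $repeat) ? 1 : 0
    for ('Blah Blah', 'Blah blah', 'BLAH BLAH');

# A letter after "-" switches that modifier off again.
my $mixed_ok   = 'FOObar' =~ /(?i)foo(?-i)bar/ ? 1 : 0;
my $mixed_fail = 'FOOBAR' =~ /(?i)foo(?-i)bar/ ? 1 : 0;

print "table: @table_hits\n";
print "$_ => $repeat_hit{$_}\n" for sort keys %repeat_hit;
print "mixed: $mixed_ok $mixed_fail\n";
```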

=back

A question mark was chosen for this and for the new minimal-matching
construct because 1) question mark is pretty rare in older regular
expressions, and 2) whenever you see one, you should stop and "question"
exactly what is going on.  That's psychology...

=head2 Backtracking

A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>.

For a regular expression to match, the I<entire> regular expression must
match, not just part of it.  So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking:  Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
	print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo".  However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match.  This time it goes all the way until the next occurrence
of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."

Sometimes minimal matching can help a lot.  Imagine you'd like to match
everything between "foo" and "bar".  Initially, you write something
like this:

    $_ =  "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
	print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

  got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar".  In this case, it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }

This time the output is:

  got <d is under the >

Here's another example: let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {				# Wrong!
	print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string.  As C<\d*> can match on an empty string, the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
	(.*)(\d*)
	(.*)(\d+)
	(.*?)(\d*)
	(.*?)(\d+)
	(.*)(\d+)$
	(.*?)(\d+)$
	(.*)\b(\d+)$
	(.*\D)(\d+)$
    };

    for $pat (@pats) {
	printf "%-12s ", $pat;
	if ( /$pat/ ) {
	    print "<$1> <$2>\n";
	} else {
	    print "FAIL\n";
	}
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky.  It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success.  There may be 0, 1, or several different ways that the
definition might succeed against a particular string.  And if there are
multiple ways it might succeed, you need to understand backtracking to