📄 perlre.pod

    m{ \(
          (
            [^()]+              # x+
          |
            \( [^()]* \)
          )+
       \)
     }x

That will efficiently match a nonempty group with matching parentheses
two levels deep or less.  However, if there is no such group, it
will take virtually forever on a long string.  That's because there
are so many different ways to split a long string into several
substrings.  This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern.  Consider how the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time.  This
exponential performance will make it appear that your program has
hung.  However, a tiny change to this pattern

    m{ \(
          (
            (?> [^()]+ )        # change x+ above to (?> x+ )
          |
            \( [^()]* \)
          )+
       \)
     }x

which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
of the time when used on a similar string with 1000000 C<a>s.  Be aware,
however, that this pattern currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches the null string many times">.

On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<a>s.

The "grab all you can, and do not give anything back" semantic is desirable
in many situations where at first sight a simple C<()*> looks like
the correct solution.  Suppose we parse text with comments being delimited
by C<#> followed by some optional (horizontal) whitespace.  Contrary to
its appearance, C<#[ \t]*> I<is not> the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way.
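The "give up" behavior can be seen in a small sketch.  The variable
names C<$line>, C<$naive>, and C<$atomic> are illustrative only and do
not come from the text above; the input is a comment delimiter followed
by three spaces and nothing else, so a comment-grabbing pattern arguably
should not find a non-empty comment in it.

```perl
use strict;
use warnings;

# A comment delimiter followed only by whitespace (hypothetical input).
my $line = "#   ";

# Plain quantifier: [ \t]* first takes all three spaces, then hands one
# back so that .+ has something to match.
my $naive  = $line =~ / \# [ \t]* ( .+ ) /x
           ? "matched <$1>" : "no match";

# Independent group: the whitespace, once taken, is never handed back.
my $atomic = $line =~ / (?> \# [ \t]* ) ( .+ ) /x
           ? "matched <$1>" : "no match";

print "naive:  $naive\n";    # naive:  matched < >
print "atomic: $atomic\n";   # atomic: no match
```

The naive pattern reports a one-space "comment" that is really part of
the delimiter, which is the failure mode described above.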
The correct
answer is either one of these:

    (?>#[ \t]*)
    #[ \t]*(?![ \t])

For example, to grab non-empty comments into $1, one should use either
one of these:

    / (?> \# [ \t]* ) (        .+ ) /x;
    /     \# [ \t]*   ( [^ \t] .* ) /x;

Which one you pick depends on which of these expressions better reflects
the above specification of comments.

=item C<(?(condition)yes-pattern|no-pattern)>

=item C<(?(condition)yes-pattern)>

B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.

Conditional expression.  C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), or a look-ahead/look-behind/evaluate zero-width assertion.

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly enclosed in parentheses
themselves.

=back

=head2 Backtracking

NOTE: This section presents an abstract approximation of regular
expression behavior.  For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L<Combining pieces together>.

A fundamental feature of regular expression matching involves the
notion called I<backtracking>, which is currently used (when needed)
by all regular expression quantifiers, namely C<*>, C<*?>, C<+>,
C<+?>, C<{n,m}>, and C<{n,m}?>.  Backtracking is often optimized
internally, but the general principle outlined here is valid.

For a regular expression to match, the I<entire> regular expression must
match, not just part of it.
So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.

Here is an example of backtracking:  Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
        print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
$1 with "Foo".  However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in $1, it realizes its
mistake and starts over again one character after where it had the
tentative match.  This time it goes all the way until the next occurrence
of "foo".  The complete regular expression matches this time, and you get
the expected output of "table follows foo."

Sometimes minimal matching can help a lot.  Imagine you'd like to match
everything between "foo" and "bar".  Initially, you write something
like this:

    $_ =  "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

  got <d is under the bar in the >

That's because C<.*> was greedy, so you get everything between the
I<first> "foo" and the I<last> "bar".  Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }

  got <d is under the >

Here's another example: let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {                                # Wrong!
        print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because C<.*> was greedy and gobbled up the
whole string.
As C<\d*> can match on an empty string, the complete
regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
        (.*)(\d*)
        (.*)(\d+)
        (.*?)(\d*)
        (.*?)(\d+)
        (.*)(\d+)$
        (.*?)(\d+)$
        (.*)\b(\d+)$
        (.*\D)(\d+)$
    };

    for $pat (@pats) {
        printf "%-12s ", $pat;
        if ( /$pat/ ) {
            print "<$1> <$2>\n";
        } else {
            print "FAIL\n";
        }
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky.  It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success.  There may be 0, 1, or several different ways that the
definition might succeed against a particular string.  And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.

When using look-ahead assertions and negations, this can all get even
trickier.  Imagine you'd like to find a sequence of non-digits not
followed by "123".  You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {              # Wrong!
        print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping.  It
claims that there is no 123 in the string.
Here's a clearer picture of
why that pattern matches, contrary to popular expectations:

    $x = 'ABC123' ;
    $y = 'ABC445' ;

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/ ;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/ ;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/ ;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/ ;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more
general purpose version of test 1.  The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not.  What's happening is
that you've asked "Is it true that at the start of $x, following 0 or more
non-digits, you have something that's not 123?"  If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.

The search engine will initially match C<\D*> with "ABC".  Then it will
try to match C<(?!123)> with "123", which fails.  But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.

The pattern really, I<really> wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time.  Now there's indeed something following "AB" that is not
"123".  It's "C123", which suffices.

We can deal with this by using both an assertion and a negation.
We'll say that the first part in $1 must be followed both by a digit
and by something that's not "123".  Remember that the look-aheads
are zero-width expressions--they only look, but don't consume any
of the string in their match.
So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/ ;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/ ;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions:  C</^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously.  The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar.  C</ab/> means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.

B<WARNING>: particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try to match.  For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used C<*>'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space.  Moreover, these internal optimizations are not
always applicable.  For example, if you put C<{0,5}> instead of C<*>
on the external group, no current optimization is applicable, and the
match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see L<C<< (?>pattern) >>>).  Note also that
zero-length look-ahead/look-behind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant.
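As a hedged illustration of the "logical context" point, the sketch
below compares a plain capture with the same capture placed inside a
look-ahead.  The string and the names C<$str>, C<$plain>, and C<$ahead>
are made up for this example.

```perl
use strict;
use warnings;

my $str = "aaab";

# Plain capture: a+ first grabs "aaa", then backtracks to "aa" so that
# the trailing "ab" can match.
my $plain = $str =~ /^(a+)ab/        ? "matched <$1>" : "no match";

# Capture inside a look-ahead: a+ grabs "aaa" and the assertion is
# settled.  When \1ab then fails, the engine does not go back into the
# look-ahead to try a shorter $1.
my $ahead = $str =~ /^(?=(a+))\1ab/  ? "matched <$1>" : "no match";

print "plain:      $plain\n";   # plain:      matched <aa>
print "look-ahead: $ahead\n";   # look-ahead: no match
```

Both patterns are anchored with C<^> so that the difference comes from
backtracking alone and not from retrying at a later start position.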
For an example
where side-effects of look-ahead I<might> have influenced the
following match, see L<C<< (?>pattern) >>>.

=head2 Version 8 Regular Expressions

In case you're not familiar with the "regular" Version 8 regex
routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a I<metacharacter>
with a special meaning described here or above.  You can cause
characters that normally function as metacharacters to be interpreted
literally by prefixing them with a "\" (e.g., "\." matches a ".", not any
character; "\\" matches a "\").  A series of characters matches that
series of characters in the target string, so the pattern C<blurfl>
would match "blurfl" in the target string.

You can specify a character class, by enclosing a list of characters
in C<[]>, which will match any one character from the list.  If the
first character after the "[" is "^", the class matches any character not
in the list.  Within a list, the "-" character specifies a
range, so that C<a-z> represents all characters between "a" and "z",
inclusive.  If you want either "-" or "]" itself to be a member of a
class, put it at the start of the list (possibly after a "^"), or
escape it with a backslash.  "-" is also taken literally when it is
at the end of the list, just before the closing "]".  (The
following all specify the same class of three characters: C<[-az]>,
C<[az-]>, and C<[a\-z]>.  All are different from C<[a-z]>, which
specifies a class containing twenty-six characters, even on EBCDIC
based coded character sets.)  Also, if you try to use the character
classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
a range, that's not a range, the "-" is understood literally.

Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
you probably didn't expect.  A sound principle is to use only ranges