position of the string 'a'.  If there were no matches at the first
position, Perl would move to the second character position 'b' and
attempt the match all over again.  Only when all possible paths at all
possible character positions have been exhausted does Perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false.

Even with all this work, regexp matching happens remarkably fast.  To
speed things up, Perl compiles the regexp into a compact sequence of
opcodes that can often fit inside a processor cache.  When the code is
executed, these opcodes can then run at full throttle and search very
quickly.

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched.  This is very useful to find out what matched and for
text processing in general.  For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, etc.  They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {    # match hh:mm:ss format
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/>> returns a true or false
value.  In list context, however, it returns the list of matched values
C<($1,$2,$3)>.  So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  Here is a regexp with nested groups:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

If this regexp matches, C<$1> contains a string starting with
C<'ab'>, C<$2> is either set to C<'cd'> or C<'ef'>, C<$3> equals either
C<'gi'> or C<'j'>, and C<$4> is either set to C<'gi'>, just like C<$3>,
or it remains undefined.

For convenience, Perl sets C<$+> to the string held by the highest numbered
C<$1>, C<$2>,...
that got assigned (and, somewhat related, C<$^N> to the
value of the C<$1>, C<$2>,... most-recently assigned; i.e. the C<$1>,
C<$2>,... associated with the rightmost closing parenthesis used in the
match).

=head2 Backreferences

Closely associated with the matching variables C<$1>, C<$2>, ... are
the I<backreferences> C<\1>, C<\2>,...  Backreferences are simply
matching variables that can be used I<inside> a regexp.  This is a
really nice feature -- what matches later in a regexp is made to depend on
what matched earlier in the regexp.  Suppose we wanted to look
for doubled words in a text, like 'the the'.  The following regexp finds
all 3-letter doubles with a space in between:

    /\b(\w\w\w)\s\1\b/;

The grouping assigns a value to C<\1>, so that the same 3-letter sequence
is used for both parts.

A similar task is to find words consisting of two identical parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc., and uses C<\1> to look for
a repeat.  Although C<$1> and C<\1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp
and backreferences C<\1>, C<\2>,... only I<inside> a regexp; not doing
so may lead to surprising and unsatisfactory results.

=head2 Relative backreferences

Counting the opening parentheses to get the correct number for a
backreference is error-prone as soon as there is more than one
capturing group.  A more convenient technique became available
with Perl 5.10: relative backreferences.
To refer to the immediately
preceding capture group one now may write C<\g{-1}>, the next but
last is available via C<\g{-2}>, and so on.

Another good reason in addition to readability and maintainability
for using relative backreferences is illustrated by the following example,
where a simple pattern for matching peculiar strings is used:

    $a99a = '([a-z])(\d)\2\1';   # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel
tempted to use it as a part of some other pattern:

    $line = "code=e99e";
    if ($line =~ /^(\w+)=$a99a$/){   # unexpected behavior!
        print "$1 is valid\n";
    } else {
        print "bad line: '$line'\n";
    }

But this doesn't match -- at least not the way one might expect.  Only
after inserting the interpolated C<$a99a> and looking at the resulting
full text of the regexp is it obvious that the backreferences have
backfired -- the subexpression C<(\w+)> has snatched number 1 and
demoted the groups in C<$a99a> by one rank.  This can be avoided by
using relative backreferences:

    $a99a = '([a-z])(\d)\g{-1}\g{-2}';  # safe for being interpolated

=head2 Named backreferences

Perl 5.10 also introduced named capture buffers and named backreferences.
To attach a name to a capturing group, you write either
C<< (?<name>...) >> or C<< (?'name'...) >>.  The backreference may
then be written as C<\g{name}>.  It is permissible to attach the
same name to more than one group, but then only the leftmost one of the
eponymous set can be referenced.  Outside of the pattern a named
capture buffer is accessible through the C<%+> hash.

Assuming that we have to match calendar dates which may be given in one
of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
three suitable patterns where we use 'd', 'm' and 'y' respectively as the
names of the buffers capturing the pertaining components of a date.
The
matching operation combines the three patterns as alternatives:

    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }

If any of the alternatives matches, the hash C<%+> is bound to contain the
three key-value pairs.

=head2 Alternative capture group numbering

Yet another capturing group numbering technique (also as of Perl 5.10)
deals with the problem of referring to groups within a set of alternatives.
Consider a pattern for matching a time of the day, civil or military style:

    if ( $time =~ /(\d\d|\d):(\d\d)|(\d\d)(\d\d)/ ){
        # process hour and minute
    }

Processing the results requires an additional if statement to determine
whether C<$1> and C<$2> or C<$3> and C<$4> contain the goodies.  It would
be easier if we could use buffer numbers 1 and 2 in the second alternative
as well, and this is exactly what the parenthesized construct C<(?|...)>,
set around an alternative, achieves.  Here is an extended version of the
previous pattern:

    if ( $time =~ /(?|(\d\d|\d):(\d\d)|(\d\d)(\d\d))\s+([A-Z][A-Z][A-Z])/ ){
        print "hour=$1 minute=$2 zone=$3\n";
    }

Within the alternative numbering group, buffer numbers start at the same
position for each alternative.  After the group, numbering continues
with one higher than the maximum reached across all the alternatives.

=head2 Position information

In addition to what was matched, Perl (since 5.6.0) also provides the
positions of what was matched as contents of the C<@-> and C<@+>
arrays.  C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end.  Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end.  If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>.
Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }

prints

    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)

Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string.  If you use them, Perl
will set C<$`> to the part of the string before the match, will set C<$&>
to the part of the string that matched, and will set C<$'> to the part
of the string after the match.  An example:

    $x = "the cat caught the mouse";
    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'

In the second match, C<$`> equals C<''> because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'.  It is important to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, while C<$&> slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program.  So if raw
performance is a goal of your application, they should be avoided.
If you need to extract the corresponding substrings, use C<@-> and
C<@+> instead:

    $` is the same as substr( $x, 0, $-[0] )
    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
    $' is the same as substr( $x, $+[0] )

=head2 Non-capturing groupings

A group that is required to bundle a set of alternatives may or may not be
useful as a capturing group.  If it isn't, it just creates a superfluous
addition to the set of available capture buffer values, inside as well as
outside the regexp.  Non-capturing groupings, denoted by C<(?:regexp)>,
still allow the regexp to be treated as a single unit, but don't establish
a capturing buffer at the same time.  Both capturing and non-capturing
groupings are allowed to co-exist in the same regexp.
Because there is
no extraction, non-capturing groupings are faster than capturing
groupings.  Non-capturing groupings are also handy for choosing exactly
which parts of a regexp are to be extracted to matching variables:

    # match a number, $1-$4 are set, but we only want $1
    /([+-]?\ *(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?)/;

    # match a number faster, only $1 is set
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)/;

    # match a number, get $1 = whole number, $2 = exponent
    /([+-]?\ *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE]([+-]?\d+))?)/;

Non-capturing groupings are also useful for removing nuisance
elements gathered from a split operation where parentheses are
required for some reason:

    $x = '12aba34ba5';
    @num = split /(a|b)+/, $x;    # @num = ('12','a','34','a','5')
    @num = split /(?:a|b)+/, $x;  # @num = ('12','34','5')

=head2 Matching repetitions

The examples in the previous section display an annoying weakness.  We
were only matching 3-letter words, or chunks of words of 4 letters or
less.  We'd like to be able to match words or, more generally, strings
of any length, without writing out tedious alternatives like
C<\w\w\w\w|\w\w\w|\w\w|\w>.

This is exactly the problem the I<quantifier> metacharacters C<?>,
C<*>, C<+>, and C<{}> were created for.  They allow us to delimit the
number of repeats for a portion of a regexp we consider to be a
match.  Quantifiers are put immediately after the character, character
class, or grouping that we want to specify.
They have the following
meanings:

=over 4

=item *

C<a?> means: match 'a' 1 or 0 times

=item *

C<a*> means: match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> means: match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> means: match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> means: match C<n> or more times

=item *

C<a{n}> means: match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least one space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
                               # this produces $1 and the other does not.

    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa

For all of these quantifiers, Perl will try to match as much of the
string as possible, while still allowing the regexp to succeed.  Thus
with C</a?.../>, Perl will first try to match the regexp with the C<a>
present; if that fails, Perl will try to match the regexp without the
C<a> present.  For the quantifier C<*>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

This is what we might expect: the match finds the only C<cat> in the
string and locks onto it.  Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 characters match)
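
As a quick check of the two matches above, the following minimal sketch
runs both patterns in list context and prints the captures joined by a
vertical bar.  The array names C<@cat> and C<@at> are made up for this
illustration and are not part of the tutorial's own examples.

```perl
use strict;
use warnings;

my $x = "the cat in the hat";

# Only one 'cat' exists, so the leading .* has a single place to stop.
# (@cat and @at are hypothetical names used only in this sketch.)
my @cat = $x =~ /^(.*)(cat)(.*)$/;

# The leading .* takes as much as it can while still leaving
# an 'at' for the second group to match.
my @at = $x =~ /^(.*)(at)(.*)$/;

print join('|', @cat), "\n";   # the |cat| in the hat
print join('|', @at),  "\n";   # the cat in the h|at|
```

In the second line of output nothing follows the final bar, because the
third group matched zero characters.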