📄 perlretut.1
.PP
For convenience, Perl sets \f(CW$+\fR to the string held by the highest numbered
\&\f(CW$1\fR, \f(CW$2\fR,... that got assigned (and, somewhat related, \f(CW$^N\fR to the
value of the \f(CW$1\fR, \f(CW$2\fR,... most-recently assigned; i.e. the \f(CW$1\fR,
\&\f(CW$2\fR,... associated with the rightmost closing parenthesis used in the
match).
.Sh "Backreferences"
.IX Subsection "Backreferences"
Closely associated with the matching variables \f(CW$1\fR, \f(CW$2\fR, ... are
the \fIbackreferences\fR \f(CW\*(C`\e1\*(C'\fR, \f(CW\*(C`\e2\*(C'\fR,... Backreferences are simply
matching variables that can be used \fIinside\fR a regexp. This is a
really nice feature \*(-- what matches later in a regexp is made to depend on
what matched earlier in the regexp. Suppose we wanted to look
for doubled words in a text, like 'the the'. The following regexp finds
all 3\-letter doubles with a space in between:
.PP
.Vb 1
\&    /\eb(\ew\ew\ew)\es\e1\eb/;
.Ve
.PP
The grouping assigns a value to \e1, so that the same 3\-letter sequence
is used for both parts.
.PP
A similar task is to find words consisting of two identical parts:
.PP
.Vb 7
\&    % simple_grep \*(Aq^(\ew\ew\ew\ew|\ew\ew\ew|\ew\ew|\ew)\e1$\*(Aq /usr/dict/words
\&    beriberi
\&    booboo
\&    coco
\&    mama
\&    murmur
\&    papa
.Ve
.PP
The regexp has a single grouping which considers 4\-letter
combinations, then 3\-letter combinations, etc., and uses \f(CW\*(C`\e1\*(C'\fR to look for
a repeat. Although \f(CW$1\fR and \f(CW\*(C`\e1\*(C'\fR represent the same thing, care should be
taken to use matching variables \f(CW$1\fR, \f(CW$2\fR,... only \fIoutside\fR a regexp
and backreferences \f(CW\*(C`\e1\*(C'\fR, \f(CW\*(C`\e2\*(C'\fR,... only \fIinside\fR a regexp; not doing
so may lead to surprising and unsatisfactory results.
.Sh "Relative backreferences"
.IX Subsection "Relative backreferences"
Counting the opening parentheses to get the correct number for a
backreference is error-prone as soon as there is more than one
capturing group.
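.PP
As a minimal sketch of why the counting gets awkward (the sample string and
the variable name are made up for illustration), note that an enclosing
group takes number 1, so the year captured by the second opening
parenthesis has to be referenced as \f(CW\*(C`\e2\*(C'\fR rather than \f(CW\*(C`\e1\*(C'\fR:
.PP
.Vb 6
\&    # Groups are numbered by the position of their opening parenthesis.
\&    my $date = "2006\-10\-21 2006";
\&    if ( $date =~ /^((\ed{4})\-(\ed\ed)\-(\ed\ed)) \e2$/ ) {
\&        # prints: whole=2006\-10\-21 year=2006 month=10 day=21
\&        print "whole=$1 year=$2 month=$3 day=$4\en";
\&    }
.Ve
.PP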
A more convenient technique became available
with Perl 5.10: relative backreferences. To refer to the immediately
preceding capture group one may now write \f(CW\*(C`\eg{\-1}\*(C'\fR; the
next-to-last one is available via \f(CW\*(C`\eg{\-2}\*(C'\fR, and so on.
.PP
Another good reason in addition to readability and maintainability
for using relative backreferences is illustrated by the following example,
where a simple pattern for matching peculiar strings is used:
.PP
.Vb 1
\&    $a99a = \*(Aq([a\-z])(\ed)\e2\e1\*(Aq;   # matches a11a, g22g, x33x, etc.
.Ve
.PP
Now that we have this pattern stored as a handy string, we might feel
tempted to use it as a part of some other pattern:
.PP
.Vb 6
\&    $line = "code=e99e";
\&    if ($line =~ /^(\ew+)=$a99a$/){   # unexpected behavior!
\&        print "$1 is valid\en";
\&    } else {
\&        print "bad line: \*(Aq$line\*(Aq\en";
\&    }
.Ve
.PP
But this doesn't match \*(-- at least not the way one might expect. Only
after inserting the interpolated \f(CW$a99a\fR and looking at the resulting
full text of the regexp is it obvious that the backreferences have
backfired \*(-- the subexpression \f(CW\*(C`(\ew+)\*(C'\fR has snatched number 1 and
demoted the groups in \f(CW$a99a\fR by one rank. This can be avoided by
using relative backreferences:
.PP
.Vb 1
\&    $a99a = \*(Aq([a\-z])(\ed)\eg{\-1}\eg{\-2}\*(Aq;  # safe for being interpolated
.Ve
.Sh "Named backreferences"
.IX Subsection "Named backreferences"
Perl 5.10 also introduced named capture buffers and named backreferences.
To attach a name to a capturing group, you write either
\&\f(CW\*(C`(?<name>...)\*(C'\fR or \f(CW\*(C`(?\*(Aqname\*(Aq...)\*(C'\fR. The backreference may
then be written as \f(CW\*(C`\eg{name}\*(C'\fR. It is permissible to attach the
same name to more than one group, but then only the leftmost one of the
eponymous set can be referenced.
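.PP
A minimal sketch of a named backreference used inside a pattern follows; the
group name \f(CW\*(C`word\*(C'\fR and the sample text are assumptions chosen for
illustration. A named group still receives a number, so \f(CW$1\fR is set as usual:
.PP
.Vb 5
\&    # Find a doubled word with a named group and a named backreference.
\&    my $text = "Paris in the the spring";
\&    if ( $text =~ /\eb(?<word>\ew+)\es+\eg{word}\eb/ ) {
\&        print "doubled word: $1\en";    # prints: doubled word: the
\&    }
.Ve
.PP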
Outside of the pattern a named
capture buffer is accessible through the \f(CW\*(C`%+\*(C'\fR hash.
.PP
Assuming that we have to match calendar dates which may be given in one
of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
three suitable patterns where we use 'd', 'm' and 'y' respectively as the
names of the buffers capturing the corresponding components of a date. The
matching operation combines the three patterns as alternatives:
.PP
.Vb 8
\&    $fmt1 = \*(Aq(?<y>\ed\ed\ed\ed)\-(?<m>\ed\ed)\-(?<d>\ed\ed)\*(Aq;
\&    $fmt2 = \*(Aq(?<m>\ed\ed)/(?<d>\ed\ed)/(?<y>\ed\ed\ed\ed)\*(Aq;
\&    $fmt3 = \*(Aq(?<d>\ed\ed)\e.(?<m>\ed\ed)\e.(?<y>\ed\ed\ed\ed)\*(Aq;
\&    for my $d (qw( 2006\-10\-21 15.01.2007 10/31/2005 )){
\&        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
\&            print "day=$+{d} month=$+{m} year=$+{y}\en";
\&        }
\&    }
.Ve
.PP
If any of the alternatives matches, the hash \f(CW\*(C`%+\*(C'\fR is bound to contain the
three key-value pairs.
.Sh "Alternative capture group numbering"
.IX Subsection "Alternative capture group numbering"
Yet another capturing group numbering technique (also available as of Perl 5.10)
deals with the problem of referring to groups within a set of alternatives.
Consider a pattern for matching a time of the day, civil or military style:
.PP
.Vb 3
\&    if ( $time =~ /(\ed\ed|\ed):(\ed\ed)|(\ed\ed)(\ed\ed)/ ){
\&        # process hour and minute
\&    }
.Ve
.PP
Processing the results requires an additional if statement to determine
whether \f(CW$1\fR and \f(CW$2\fR or \f(CW$3\fR and \f(CW$4\fR contain the goodies. It would
be easier if we could use buffer numbers 1 and 2 in the second alternative as
well, and this is exactly what the parenthesized construct \f(CW\*(C`(?|...)\*(C'\fR,
set around an alternative, achieves.
Here is an extended version of the
previous pattern:
.PP
.Vb 3
\&    if ( $time =~ /(?|(\ed\ed|\ed):(\ed\ed)|(\ed\ed)(\ed\ed))\es+([A\-Z][A\-Z][A\-Z])/ ){
\&        print "hour=$1 minute=$2 zone=$3\en";
\&    }
.Ve
.PP
Within the alternative numbering group, buffer numbers start at the same
position for each alternative. After the group, numbering continues
with one higher than the maximum reached across all the alternatives.
.Sh "Position information"
.IX Subsection "Position information"
In addition to what was matched, Perl (since 5.6.0) also provides the
positions of what was matched as contents of the \f(CW\*(C`@\-\*(C'\fR and \f(CW\*(C`@+\*(C'\fR
arrays. \f(CW\*(C`$\-[0]\*(C'\fR is the position of the start of the entire match and
\&\f(CW$+[0]\fR is the position of the end. Similarly, \f(CW\*(C`$\-[n]\*(C'\fR is the
position of the start of the \f(CW$n\fR match and \f(CW$+[n]\fR is the position
of the end. If \f(CW$n\fR is undefined, so are \f(CW\*(C`$\-[n]\*(C'\fR and \f(CW$+[n]\fR. Then
this code
.PP
.Vb 5
\&    $x = "Mmm...donut, thought Homer";
\&    $x =~ /^(Mmm|Yech)\e.\e.\e.(donut|peas)/; # matches
\&    foreach $expr (1..$#\-) {
\&        print "Match $expr: \*(Aq${$expr}\*(Aq at position ($\-[$expr],$+[$expr])\en";
\&    }
.Ve
.PP
prints
.PP
.Vb 2
\&    Match 1: \*(AqMmm\*(Aq at position (0,3)
\&    Match 2: \*(Aqdonut\*(Aq at position (6,11)
.Ve
.PP
Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string. If you use them, Perl
will set \f(CW\*(C`$\`\*(C'\fR to the part of the string before the match, will set \f(CW$&\fR
to the part of the string that matched, and will set \f(CW\*(C`$\*(Aq\*(C'\fR to the part
of the string after the match.
An example:
.PP
.Vb 3
\&    $x = "the cat caught the mouse";
\&    $x =~ /cat/;  # $\` = \*(Aqthe \*(Aq, $& = \*(Aqcat\*(Aq, $\*(Aq = \*(Aq caught the mouse\*(Aq
\&    $x =~ /the/;  # $\` = \*(Aq\*(Aq, $& = \*(Aqthe\*(Aq, $\*(Aq = \*(Aq cat caught the mouse\*(Aq
.Ve
.PP
In the second match, \f(CW\*(C`$\`\*(C'\fR equals \f(CW\*(Aq\*(Aq\fR because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'. It is important to note that using \f(CW\*(C`$\`\*(C'\fR and \f(CW\*(C`$\*(Aq\*(C'\fR
slows down regexp matching quite a bit, while \f(CW$&\fR slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for \fIall\fR regexps in the program. So if raw
performance is a goal of your application, they should be avoided.
If you need to extract the corresponding substrings, use \f(CW\*(C`@\-\*(C'\fR and
\&\f(CW\*(C`@+\*(C'\fR instead:
.PP
.Vb 3
\&    $\` is the same as substr( $x, 0, $\-[0] )
\&    $& is the same as substr( $x, $\-[0], $+[0]\-$\-[0] )
\&    $\*(Aq is the same as substr( $x, $+[0] )
.Ve
.Sh "Non-capturing groupings"
.IX Subsection "Non-capturing groupings"
A group that is required to bundle a set of alternatives may or may not be
useful as a capturing group. If it isn't, it just creates a superfluous
addition to the set of available capture buffer values, inside as well as
outside the regexp. Non-capturing groupings, denoted by \f(CW\*(C`(?:regexp)\*(C'\fR,
still allow the regexp to be treated as a single unit, but don't establish
a capturing buffer at the same time. Both capturing and non-capturing
groupings are allowed to co-exist in the same regexp. Because there is
no extraction, non-capturing groupings are faster than capturing
groupings.
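.PP
The difference can be observed in a small sketch (the variable names and
the sample string are assumptions for illustration) that reads \f(CW\*(C`$#+\*(C'\fR,
the number of capture buffers in the last successful match, after matching
the same string with a capturing and with a non-capturing grouping:
.PP
.Vb 7
\&    my $s = "abab";
\&    $s =~ /^(a|b)+$/ or die "no match";
\&    my $with = $#+;       # 1: the grouping created buffer $1
\&    $s =~ /^(?:a|b)+$/ or die "no match";
\&    my $without = $#+;    # 0: the grouping captured nothing
\&    # prints: capturing: 1, non\-capturing: 0
\&    print "capturing: $with, non\-capturing: $without\en";
.Ve
.PP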
Non-capturing groupings are also handy for choosing exactly
which parts of a regexp are to be extracted to matching variables:
.PP
.Vb 2
\&    # match a number, $1\-$4 are set, but we only want $1
\&    /([+\-]?\e *(\ed+(\e.\ed*)?|\e.\ed+)([eE][+\-]?\ed+)?)/;
\&
\&    # match a number faster, only $1 is set
\&    /([+\-]?\e *(?:\ed+(?:\e.\ed*)?|\e.\ed+)(?:[eE][+\-]?\ed+)?)/;
\&
\&    # match a number, get $1 = whole number, $2 = exponent
\&    /([+\-]?\e *(?:\ed+(?:\e.\ed*)?|\e.\ed+)(?:[eE]([+\-]?\ed+))?)/;
.Ve
.PP
Non-capturing groupings are also useful for removing nuisance
elements gathered from a split operation where parentheses are
required for some reason:
.PP
.Vb 3
\&    $x = \*(Aq12aba34ba5\*(Aq;
\&    @num = split /(a|b)+/, $x;    # @num = (\*(Aq12\*(Aq,\*(Aqa\*(Aq,\*(Aq34\*(Aq,\*(Aqa\*(Aq,\*(Aq5\*(Aq)
\&    @num = split /(?:a|b)+/, $x;  # @num = (\*(Aq12\*(Aq,\*(Aq34\*(Aq,\*(Aq5\*(Aq)
.Ve
.Sh "Matching repetitions"
.IX Subsection "Matching repetitions"
The examples in the previous section display an annoying weakness. We
were only matching 3\-letter words, or chunks of words of 4 letters or
less. We'd like to be able to match words or, more generally, strings
of any length, without writing out tedious alternatives like
\&\f(CW\*(C`\ew\ew\ew\ew|\ew\ew\ew|\ew\ew|\ew\*(C'\fR.
.PP
This is exactly the problem the \fIquantifier\fR metacharacters \f(CW\*(C`?\*(C'\fR,
\&\f(CW\*(C`*\*(C'\fR, \f(CW\*(C`+\*(C'\fR, and \f(CW\*(C`{}\*(C'\fR were created for. They allow us to delimit the
number of repeats for a portion of a regexp we consider to be a
match. Quantifiers are put immediately after the character, character
class, or grouping that we want to specify.
They have the following
meanings:
.IP "\(bu" 4
\&\f(CW\*(C`a?\*(C'\fR means: match 'a' 1 or 0 times
.IP "\(bu" 4
\&\f(CW\*(C`a*\*(C'\fR means: match 'a' 0 or more times, i.e., any number of times
.IP "\(bu" 4
\&\f(CW\*(C`a+\*(C'\fR means: match 'a' 1 or more times, i.e., at least once
.IP "\(bu" 4
\&\f(CW\*(C`a{n,m}\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times, but not more than \f(CW\*(C`m\*(C'\fR
times.
.IP "\(bu" 4
\&\f(CW\*(C`a{n,}\*(C'\fR means: match at least \f(CW\*(C`n\*(C'\fR times
.IP "\(bu" 4
\&\f(CW\*(C`a{n}\*(C'\fR means: match exactly \f(CW\*(C`n\*(C'\fR times
.PP
Here are some examples:
.PP
.Vb 9
\&    /[a\-z]+\es+\ed*/;  # match a lowercase word, at least one space, and
\&                     # any number of digits
\&    /(\ew+)\es+\e1/;    # match doubled words of arbitrary length
\&    /y(es)?/i;       # matches \*(Aqy\*(Aq, \*(AqY\*(Aq, or a case\-insensitive \*(Aqyes\*(Aq
\&    $year =~ /^\ed{2,4}$/;  # make sure year is at least 2 but not more
\&                          # than 4 digits
\&    $year =~ /^\ed{4}$|^\ed{2}$/;  # better match; throw out 3\-digit dates
\&    $year =~ /^\ed{2}(\ed{2})?$/;  # same thing written differently. However,
\&                                 # this produces $1 and the other does not.
\&
\&    % simple_grep \*(Aq^(\ew+)\e1$\*(Aq /usr/dict/words   # isn\*(Aqt this easier?
\&    beriberi
\&    booboo
\&    coco
\&    mama
\&    murmur
\&    papa
.Ve
.PP
For all of these quantifiers, Perl will try to match as much of the
string as possible, while still allowing the regexp to succeed. Thus
with \f(CW\*(C`/a?.../\*(C'\fR, Perl will first try to match the regexp with the \f(CW\*(C`a\*(C'\fR
present; if that fails, Perl will try to match the regexp without the
\&\f(CW\*(C`a\*(C'\fR present. For the quantifier \f(CW\*(C`*\*(C'\fR, we get the following:
.PP
.Vb 5
\&    $x = "the cat in the hat";
\&    $x =~ /^(.*)(cat)(.*)$/; # matches,
\&                             # $1 = \*(Aqthe \*(Aq
\&                             # $2 = \*(Aqcat\*(Aq
\&                             # $3 = \*(Aq in the hat\*(Aq
.Ve
.PP
This is what we might expect: the match finds the only \f(CW\*(C`cat\*(C'\fR in the
string and locks onto it.
Consider, however, this regexp:
.PP
.Vb 4
\&    $x =~ /^(.*)(at)(.*)$/; # matches,
\&                            # $1 = \*(Aqthe cat in the h\*(Aq
\&                            # $2 = \*(Aqat\*(Aq
\&                            # $3 = \*(Aq\*(Aq   (0 characters match)