\&\f(CW\*(C`\eb\*(C'\fR. This matches a boundary between a word character and a non-word
character \f(CW\*(C`\ew\eW\*(C'\fR or \f(CW\*(C`\eW\ew\*(C'\fR:
.PP
.Vb 5
\&    $x = "Housecat catenates house and cat";
\&    $x =~ /cat/;  # matches cat in \*(Aqhousecat\*(Aq
\&    $x =~ /\ebcat/;  # matches cat in \*(Aqcatenates\*(Aq
\&    $x =~ /cat\eb/;  # matches cat in \*(Aqhousecat\*(Aq
\&    $x =~ /\ebcat\eb/;  # matches \*(Aqcat\*(Aq at end of string
.Ve
.PP
Note in the last example, the end of the string is considered a word
boundary.
.PP
You might wonder why \f(CW\*(Aq.\*(Aq\fR matches everything but \f(CW"\en"\fR \- why not
every character? The reason is that often one is matching against
lines and would like to ignore the newline characters. For instance,
while the string \f(CW"\en"\fR represents one line, we would like to think
of it as empty. Then
.PP
.Vb 2
\&    "" =~ /^$/;  # matches
\&    "\en" =~ /^$/;  # matches, $ anchors before "\en"
\&
\&    "" =~ /./;  # doesn\*(Aqt match; it needs a char
\&    "" =~ /^.$/;  # doesn\*(Aqt match; it needs a char
\&    "\en" =~ /^.$/;  # doesn\*(Aqt match; it needs a char other than "\en"
\&    "a" =~ /^.$/;  # matches
\&    "a\en" =~ /^.$/;  # matches, $ anchors before "\en"
.Ve
.PP
This behavior is convenient, because we usually want to ignore
newlines when we count and match characters in a line. Sometimes,
however, we want to keep track of newlines. We might even want \f(CW\*(C`^\*(C'\fR
and \f(CW\*(C`$\*(C'\fR to anchor at the beginning and end of lines within the
string, rather than just the beginning and end of the string. Perl
allows us to choose between ignoring and paying attention to newlines
by using the \f(CW\*(C`//s\*(C'\fR and \f(CW\*(C`//m\*(C'\fR modifiers. \f(CW\*(C`//s\*(C'\fR and \f(CW\*(C`//m\*(C'\fR stand for
single line and multi-line and they determine whether a string is to
be treated as one continuous string, or as a set of lines.
The two
modifiers affect two aspects of how the regexp is interpreted: 1) how
the \f(CW\*(Aq.\*(Aq\fR character class is defined, and 2) where the anchors \f(CW\*(C`^\*(C'\fR
and \f(CW\*(C`$\*(C'\fR are able to match. Here are the four possible combinations:
.IP "\(bu" 4
no modifiers (//): Default behavior. \f(CW\*(Aq.\*(Aq\fR matches any character
except \f(CW"\en"\fR. \f(CW\*(C`^\*(C'\fR matches only at the beginning of the string and
\&\f(CW\*(C`$\*(C'\fR matches only at the end or before a newline at the end.
.IP "\(bu" 4
s modifier (//s): Treat string as a single long line. \f(CW\*(Aq.\*(Aq\fR matches
any character, even \f(CW"\en"\fR. \f(CW\*(C`^\*(C'\fR matches only at the beginning of
the string and \f(CW\*(C`$\*(C'\fR matches only at the end or before a newline at the
end.
.IP "\(bu" 4
m modifier (//m): Treat string as a set of multiple lines. \f(CW\*(Aq.\*(Aq\fR
matches any character except \f(CW"\en"\fR. \f(CW\*(C`^\*(C'\fR and \f(CW\*(C`$\*(C'\fR are able to match
at the start or end of \fIany\fR line within the string.
.IP "\(bu" 4
both s and m modifiers (//sm): Treat string as a single long line, but
detect multiple lines. \f(CW\*(Aq.\*(Aq\fR matches any character, even
\&\f(CW"\en"\fR. \f(CW\*(C`^\*(C'\fR and \f(CW\*(C`$\*(C'\fR, however, are able to match at the start or end
of \fIany\fR line within the string.
.PP
Here are examples of \f(CW\*(C`//s\*(C'\fR and \f(CW\*(C`//m\*(C'\fR in action:
.PP
.Vb 1
\&    $x = "There once was a girl\enWho programmed in Perl\en";
\&
\&    $x =~ /^Who/;  # doesn\*(Aqt match, "Who" not at start of string
\&    $x =~ /^Who/s;  # doesn\*(Aqt match, "Who" not at start of string
\&    $x =~ /^Who/m;  # matches, "Who" at start of second line
\&    $x =~ /^Who/sm;  # matches, "Who" at start of second line
\&
\&    $x =~ /girl.Who/;  # doesn\*(Aqt match, "." doesn\*(Aqt match "\en"
\&    $x =~ /girl.Who/s;  # matches, "." matches "\en"
\&    $x =~ /girl.Who/m;  # doesn\*(Aqt match, "." doesn\*(Aqt match "\en"
\&    $x =~ /girl.Who/sm;  # matches, "." matches "\en"
.Ve
.PP
Most of the time, the default behavior is what is wanted, but \f(CW\*(C`//s\*(C'\fR and
\&\f(CW\*(C`//m\*(C'\fR are occasionally very useful. If \f(CW\*(C`//m\*(C'\fR is being used, the start
of the string can still be matched with \f(CW\*(C`\eA\*(C'\fR and the end of the string
can still be matched with the anchors \f(CW\*(C`\eZ\*(C'\fR (matches both the end and
the newline before, like \f(CW\*(C`$\*(C'\fR), and \f(CW\*(C`\ez\*(C'\fR (matches only the end):
.PP
.Vb 2
\&    $x =~ /^Who/m;  # matches, "Who" at start of second line
\&    $x =~ /\eAWho/m;  # doesn\*(Aqt match, "Who" is not at start of string
\&
\&    $x =~ /girl$/m;  # matches, "girl" at end of first line
\&    $x =~ /girl\eZ/m;  # doesn\*(Aqt match, "girl" is not at end of string
\&
\&    $x =~ /Perl\eZ/m;  # matches, "Perl" is at newline before end
\&    $x =~ /Perl\ez/m;  # doesn\*(Aqt match, "Perl" is not at end of string
.Ve
.PP
We now know how to create choices among classes of characters in a
regexp. What about choices among words or character strings? Such
choices are described in the next section.
.Sh "Matching this or that"
.IX Subsection "Matching this or that"
Sometimes we would like our regexp to be able to match different
possible words or character strings. This is accomplished by using
the \fIalternation\fR metacharacter \f(CW\*(C`|\*(C'\fR. To match \f(CW\*(C`dog\*(C'\fR or \f(CW\*(C`cat\*(C'\fR, we
form the regexp \f(CW\*(C`dog|cat\*(C'\fR. As before, Perl will try to match the
regexp at the earliest possible point in the string. At each
character position, Perl will first try to match the first
alternative, \f(CW\*(C`dog\*(C'\fR. If \f(CW\*(C`dog\*(C'\fR doesn't match, Perl will then try the
next alternative, \f(CW\*(C`cat\*(C'\fR. If \f(CW\*(C`cat\*(C'\fR doesn't match either, then the
match fails and Perl moves to the next position in the string.
Some
examples:
.PP
.Vb 2
\&    "cats and dogs" =~ /cat|dog|bird/;  # matches "cat"
\&    "cats and dogs" =~ /dog|cat|bird/;  # matches "cat"
.Ve
.PP
Even though \f(CW\*(C`dog\*(C'\fR is the first alternative in the second regexp,
\&\f(CW\*(C`cat\*(C'\fR is able to match earlier in the string.
.PP
.Vb 2
\&    "cats" =~ /c|ca|cat|cats/;  # matches "c"
\&    "cats" =~ /cats|cat|ca|c/;  # matches "cats"
.Ve
.PP
Here, all the alternatives match at the first string position, so the
first alternative is the one that matches. If some of the
alternatives are truncations of the others, put the longest ones first
to give them a chance to match.
.PP
.Vb 2
\&    "cab" =~ /a|b|c/;  # matches "c"
\&                       # /a|b|c/ == /[abc]/
.Ve
.PP
The last example points out that character classes are like
alternations of characters. At a given character position, the first
alternative that allows the regexp match to succeed will be the one
that matches.
.Sh "Grouping things and hierarchical matching"
.IX Subsection "Grouping things and hierarchical matching"
Alternation allows a regexp to choose among alternatives, but by
itself it is unsatisfying. The reason is that each alternative is a whole
regexp, but sometimes we want alternatives for just part of a
regexp. For instance, suppose we want to search for housecats or
housekeepers. The regexp \f(CW\*(C`housecat|housekeeper\*(C'\fR fits the bill, but is
inefficient because we had to type \f(CW\*(C`house\*(C'\fR twice. It would be nice to
have parts of the regexp be constant, like \f(CW\*(C`house\*(C'\fR, and some
parts have alternatives, like \f(CW\*(C`cat|keeper\*(C'\fR.
.PP
The \fIgrouping\fR metacharacters \f(CW\*(C`()\*(C'\fR solve this problem. Grouping
allows parts of a regexp to be treated as a single unit. Parts of a
regexp are grouped by enclosing them in parentheses. Thus we could solve
the \f(CW\*(C`housecat|housekeeper\*(C'\fR problem by forming the regexp as
\&\f(CW\*(C`house(cat|keeper)\*(C'\fR.
The regexp \f(CW\*(C`house(cat|keeper)\*(C'\fR means match
\&\f(CW\*(C`house\*(C'\fR followed by either \f(CW\*(C`cat\*(C'\fR or \f(CW\*(C`keeper\*(C'\fR. Some more examples
are
.PP
.Vb 4
\&    /(a|b)b/;  # matches \*(Aqab\*(Aq or \*(Aqbb\*(Aq
\&    /(ac|b)b/;  # matches \*(Aqacb\*(Aq or \*(Aqbb\*(Aq
\&    /(^a|b)c/;  # matches \*(Aqac\*(Aq at start of string or \*(Aqbc\*(Aq anywhere
\&    /(a|[bc])d/;  # matches \*(Aqad\*(Aq, \*(Aqbd\*(Aq, or \*(Aqcd\*(Aq
\&
\&    /house(cat|)/;  # matches either \*(Aqhousecat\*(Aq or \*(Aqhouse\*(Aq
\&    /house(cat(s|)|)/;  # matches either \*(Aqhousecats\*(Aq or \*(Aqhousecat\*(Aq or
\&                        # \*(Aqhouse\*(Aq. Note groups can be nested.
\&
\&    /(19|20|)\ed\ed/;  # match years 19xx, 20xx, or the Y2K problem, xx
\&    "20" =~ /(19|20|)\ed\ed/;  # matches the null alternative \*(Aq()\ed\ed\*(Aq,
\&                              # because \*(Aq20\ed\ed\*(Aq can\*(Aqt match
.Ve
.PP
Alternations behave the same way in groups as out of them: at a given
string position, the leftmost alternative that allows the regexp to
match is taken. So in the last example at the first string position,
\&\f(CW"20"\fR matches the second alternative, but there is nothing left over
to match the next two digits \f(CW\*(C`\ed\ed\*(C'\fR. So Perl moves on to the next
alternative, which is the null alternative and that works, since
\&\f(CW"20"\fR is two digits.
.PP
The process of trying one alternative, seeing if it matches, and, if
it doesn't, going back in the string to where that alternative was
tried and moving on to the next alternative is called
\&\fIbacktracking\fR. The term 'backtracking' comes from the idea that
matching a regexp is like a walk in the woods. Successfully matching
a regexp is like arriving at a destination. There are many possible
trailheads, one for each string position, and each one is tried in
order, left to right. From each trailhead there may be many paths,
some of which get you there, and some of which are dead ends.
When you
walk along a trail and hit a dead end, you have to backtrack along the
trail to an earlier point to try another trail. If you hit your
destination, you stop immediately and forget about trying all the
other trails. You are persistent, and only if you have tried all the
trails from all the trailheads and not arrived at your destination, do
you declare failure. To be concrete, here is a step-by-step analysis
of what Perl does when it tries to match the regexp
.PP
.Vb 1
\&    "abcde" =~ /(abd|abc)(df|d|de)/;
.Ve
.IP "0" 4
Start with the first letter in the string 'a'.
.IP "1" 4
.IX Item "1"
Try the first alternative in the first group 'abd'.
.IP "2" 4
.IX Item "2"
Match 'a' followed by 'b'. So far so good.
.IP "3" 4
.IX Item "3"
\&'d' in the regexp doesn't match 'c' in the string \- a dead
end. So backtrack two characters and pick the second alternative in
the first group 'abc'.
.IP "4" 4
.IX Item "4"
Match 'a' followed by 'b' followed by 'c'. We are on a roll
and have satisfied the first group. Set \f(CW$1\fR to 'abc'.
.IP "5" 4
.IX Item "5"
Move on to the second group and pick the first alternative
\&'df'.
.IP "6" 4
.IX Item "6"
Match the 'd'.
.IP "7" 4
.IX Item "7"
\&'f' in the regexp doesn't match 'e' in the string, so a dead
end. Backtrack one character and pick the second alternative in the
second group 'd'.
.IP "8" 4
.IX Item "8"
\&'d' matches. The second grouping is satisfied, so set \f(CW$2\fR to
\&'d'.
.IP "9" 4
.IX Item "9"
We are at the end of the regexp, so we are done! We have
matched 'abcd' out of the string \*(L"abcde\*(R".
.PP
There are a couple of things to note about this analysis. First, the
third alternative in the second group 'de' also allows a match, but we
stopped before we got to it \- at a given character position, leftmost
wins. Second, we were able to get a match at the first character
position of the string 'a'. If there were no matches at the first
position, Perl would move to the second character position 'b' and
attempt the match all over again.
Only when all possible paths at all
possible character positions have been exhausted does Perl give
up and declare \f(CW\*(C`$string\ =~\ /(abd|abc)(df|d|de)/;\*(C'\fR to be false.
.PP
Even with all this work, regexp matching happens remarkably fast. To
speed things up, Perl compiles the regexp into a compact sequence of
opcodes that can often fit inside a processor cache. When the code is
executed, these opcodes can then run at full throttle and search very
quickly.
.Sh "Extracting matches"
.IX Subsection "Extracting matches"
The grouping metacharacters \f(CW\*(C`()\*(C'\fR also serve another completely
different function: they allow the extraction of the parts of a string
that matched. This is very useful to find out what matched and for
text processing in general. For each grouping, the part that matched
inside goes into the special variables \f(CW$1\fR, \f(CW$2\fR, etc. They can be
used just as ordinary variables:
.PP
.Vb 6
\&    # extract hours, minutes, seconds
\&    if ($time =~ /(\ed\ed):(\ed\ed):(\ed\ed)/) {  # match hh:mm:ss format
\&        $hours = $1;
\&        $minutes = $2;
\&        $seconds = $3;
\&    }
.Ve
.PP
Now, we know that in scalar context,
\&\f(CW\*(C`$time\ =~\ /(\ed\ed):(\ed\ed):(\ed\ed)/\*(C'\fR returns a true or false
value. In list context, however, it returns the list of matched values
\&\f(CW\*(C`($1,$2,$3)\*(C'\fR. So we could write the code more compactly as
.PP
.Vb 2
\&    # extract hours, minutes, seconds
\&    ($hours, $minutes, $seconds) = ($time =~ /(\ed\ed):(\ed\ed):(\ed\ed)/);
.Ve
.PP
If the groupings in a regexp are nested, \f(CW$1\fR gets the group with the
leftmost opening parenthesis, \f(CW$2\fR the next opening parenthesis,
etc.
Here is a regexp with nested groups:
.PP
.Vb 2
\&    /(ab(cd|ef)((gi)|j))/;
\&     1  2      34
.Ve
.PP
If this regexp matches, \f(CW$1\fR contains a string starting with
\&\f(CW\*(Aqab\*(Aq\fR, \f(CW$2\fR is either set to \f(CW\*(Aqcd\*(Aq\fR or \f(CW\*(Aqef\*(Aq\fR, \f(CW$3\fR equals either
\&\f(CW\*(Aqgi\*(Aq\fR or \f(CW\*(Aqj\*(Aq\fR, and \f(CW$4\fR is either set to \f(CW\*(Aqgi\*(Aq\fR, just like \f(CW$3\fR,
or it remains undefined.
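.PP
The numbering can be checked with a minimal sketch like the following.
The two sample strings and the variable names \f(CW@groups\fR and \f(CW@shown\fR are
assumptions chosen here for illustration. The sketch relies on the
list-context return value of the match to collect all four groups at once:
.PP
.Vb 11
\&    use strict;
\&    use warnings;
\&
\&    # sample strings are assumptions chosen for illustration
\&    for my $str ("abefgi", "abcdj") {
\&        if (my @groups = $str =~ /(ab(cd|ef)((gi)|j))/) {
\&            # show a group that took no part in the match as "undef"
\&            my @shown = map { defined $_ ? $_ : "undef" } @groups;
\&            print "$str => @shown\en";
\&        }
\&    }
.Ve
.PP
This should print \f(CW\*(C`abefgi => abefgi ef gi gi\*(C'\fR followed by
\&\f(CW\*(C`abcdj => abcdj cd j undef\*(C'\fR. In the first string, groups 3 and 4 both
capture \f(CW"gi"\fR. In the second, the \f(CW\*(C`j\*(C'\fR branch is taken, so group 4
stays undefined.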