📄 perlretut.pod

possible character positions have been exhausted does perl give
up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;> > to be false.

Even with all this work, regexp matching happens remarkably fast.  To
speed things up, during the compilation stage, perl compiles the regexp
into a compact sequence of opcodes that can often fit inside a
processor cache.  When the code is executed, these opcodes can then run
at full throttle and search very quickly.

=head2 Extracting matches

The grouping metacharacters C<()> also serve another completely
different function: they allow the extraction of the parts of a string
that matched.  This is very useful to find out what matched and for
text processing in general.  For each grouping, the part that matched
inside goes into the special variables C<$1>, C<$2>, etc.  They can be
used just as ordinary variables:

    # extract hours, minutes, seconds
    $time =~ /(\d\d):(\d\d):(\d\d)/;  # match hh:mm:ss format
    $hours = $1;
    $minutes = $2;
    $seconds = $3;

Now, we know that in scalar context,
S<C<$time =~ /(\d\d):(\d\d):(\d\d)/> > returns a true or false
value.  In list context, however, it returns the list of matched values
C<($1,$2,$3)>.  So we could write the code more compactly as

    # extract hours, minutes, seconds
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

If the groupings in a regexp are nested, C<$1> gets the group with the
leftmost opening parenthesis, C<$2> the next opening parenthesis,
etc.  For example, here is a complex regexp and the matching variables
indicated below it:

    /(ab(cd|ef)((gi)|j))/;
     1  2      34

so that if the regexp matched, e.g., C<$2> would contain 'cd' or 'ef'.
For convenience, perl sets C<$+> to the string held by the highest
numbered C<$1>, C<$2>, ... that got assigned.

Closely associated with the matching variables C<$1>, C<$2>, ... are
the B<backreferences> C<\1>, C<\2>, ... .  Backreferences are simply
matching variables that can be used I<inside> a regexp.
This is a
really nice feature - what matches later in a regexp can depend on
what matched earlier in the regexp.  Suppose we wanted to look
for doubled words in text, like 'the the'.  The following regexp finds
all 3-letter doubles with a space in between:

    /(\w\w\w)\s\1/;

The grouping assigns a value to C<\1>, so that the same 3-letter sequence
is used for both parts.  Here are some words with repeated parts:

    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
    beriberi
    booboo
    coco
    mama
    murmur
    papa

The regexp has a single grouping which considers 4-letter
combinations, then 3-letter combinations, etc., and uses C<\1> to look for
a repeat.  Although C<$1> and C<\1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>, ... only outside a regexp
and backreferences C<\1>, C<\2>, ... only inside a regexp; not doing
so may lead to surprising and/or undefined results.

In addition to what was matched, Perl 5.6.0 also provides the
positions of what was matched with the C<@-> and C<@+>
arrays.  C<$-[0]> is the position of the start of the entire match and
C<$+[0]> is the position of the end.  Similarly, C<$-[n]> is the
position of the start of the C<$n> match and C<$+[n]> is the position
of the end.  If C<$n> is undefined, so are C<$-[n]> and C<$+[n]>.  Then
this code

    $x = "Mmm...donut, thought Homer";
    $x =~ /^(Mmm|Yech)\.\.\.(donut|peas)/; # matches
    foreach $expr (1..$#-) {
        print "Match $expr: '${$expr}' at position ($-[$expr],$+[$expr])\n";
    }

prints

    Match 1: 'Mmm' at position (0,3)
    Match 2: 'donut' at position (6,11)

Even if there are no groupings in a regexp, it is still possible to
find out what exactly matched in a string.  If you use them, perl
will set C<$`> to the part of the string before the match, will set C<$&>
to the part of the string that matched, and will set C<$'> to the part
of the string after the match.
An example:

    $x = "the cat caught the mouse";
    $x =~ /cat/;  # $` = 'the ', $& = 'cat', $' = ' caught the mouse'
    $x =~ /the/;  # $` = '', $& = 'the', $' = ' cat caught the mouse'

In the second match, S<C<$` = ''> > because the regexp matched at the
first character position in the string and stopped; it never saw the
second 'the'.  It is important to note that using C<$`> and C<$'>
slows down regexp matching quite a bit, and C< $& > slows it down to a
lesser extent, because if they are used in one regexp in a program,
they are generated for I<all> regexps in the program.  So if raw
performance is a goal of your application, they should be avoided.
If you need them, use C<@-> and C<@+> instead:

    $` is the same as substr( $x, 0, $-[0] )
    $& is the same as substr( $x, $-[0], $+[0]-$-[0] )
    $' is the same as substr( $x, $+[0] )

=head2 Matching repetitions

The examples in the previous section display an annoying weakness.  We
were only matching 3-letter words, or syllables of 4 letters or
less.  We'd like to be able to match words or syllables of any length,
without writing out tedious alternatives like
C<\w\w\w\w|\w\w\w|\w\w|\w>.

This is exactly the problem the B<quantifier> metacharacters C<?>,
C<*>, C<+>, and C<{}> were created for.  They allow us to determine the
number of repeats of a portion of a regexp we consider to be a
match.  Quantifiers are put immediately after the character, character
class, or grouping that we want to specify.
They have the following
meanings:

=over 4

=item *

C<a?> = match 'a' 1 or 0 times

=item *

C<a*> = match 'a' 0 or more times, i.e., any number of times

=item *

C<a+> = match 'a' 1 or more times, i.e., at least once

=item *

C<a{n,m}> = match at least C<n> times, but not more than C<m>
times.

=item *

C<a{n,}> = match at least C<n> times

=item *

C<a{n}> = match exactly C<n> times

=back

Here are some examples:

    /[a-z]+\s+\d*/;  # match a lowercase word, at least some space, and
                     # any number of digits
    /(\w+)\s+\1/;    # match doubled words of arbitrary length
    /y(es)?/i;       # matches 'y', 'Y', or a case-insensitive 'yes'
    $year =~ /\d{2,4}/;  # make sure year is at least 2 but not more
                         # than 4 digits
    $year =~ /\d{4}|\d{2}/;    # better match; throw out 3 digit dates
    $year =~ /\d{2}(\d{2})?/;  # same thing written differently. However,
                               # this produces $1 and the other does not.

    % simple_grep '^(\w+)\1$' /usr/dict/words   # isn't this easier?
    beriberi
    booboo
    coco
    mama
    murmur
    papa

For all of these quantifiers, perl will try to match as much of the
string as possible, while still allowing the regexp to succeed.  Thus
with C</a?.../>, perl will first try to match the regexp with the C<a>
present; if that fails, perl will try to match the regexp without the
C<a> present.  For the quantifier C<*>, we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

Which is what we might expect: the match finds the only C<cat> in the
string and locks onto it.
Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = ''   (0 characters matched)

One might initially guess that perl would find the C<at> in C<cat> and
stop there, but that wouldn't give the longest possible string to the
first quantifier C<.*>.  Instead, the first quantifier C<.*> grabs as
much of the string as possible while still having the regexp match.  In
this example, that means matching the C<at> sequence against the final
C<at> in the string.  The other important principle illustrated here is that
when there are two or more elements in a regexp, the I<leftmost>
quantifier, if there is one, gets to grab as much of the string as
possible, leaving the rest of the regexp to fight over scraps.  Thus in
our example, the first quantifier C<.*> grabs most of the string, while
the second quantifier C<.*> gets the empty string.   Quantifiers that
grab as much of the string as possible are called B<maximal match> or
B<greedy> quantifiers.

When a regexp can match a string in several different ways, we can use
the principles above to predict which way the regexp will match:

=over 4

=item *

Principle 0: Taken as a whole, any regexp will be matched at the
earliest possible position in the string.

=item *

Principle 1: In an alternation C<a|b|c...>, the leftmost alternative
that allows a match for the whole regexp will be the one used.

=item *

Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
C<{n,m}> will in general match as much of the string as possible while
still allowing the whole regexp to match.

=item *

Principle 3: If there are two or more elements in a regexp, the
leftmost greedy quantifier, if any, will match as much of the string
as possible while still allowing the whole regexp to match.
The next
leftmost greedy quantifier, if any, will try to match as much of the
string remaining available to it as possible, while still allowing the
whole regexp to match.  And so on, until all the regexp elements are
satisfied.

=back

As we have seen above, Principle 0 overrides the others - the regexp
will be matched as early as possible, with the other principles
determining how the regexp matches at that earliest character
position.

Here is an example of these principles in action:

    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;  # matches,
                              # $1 = 'The programming republic of Pe'
                              # $2 = 'r'
                              # $3 = 'l'

This regexp matches at the earliest string position, C<'T'>.  One
might think that C<e>, being leftmost in the alternation, would be
matched, but C<r> produces the longest string in the first quantifier.

    $x =~ /(m{1,2})(.*)$/;  # matches,
                            # $1 = 'mm'
                            # $2 = 'ing republic of Perl'

Here, the earliest possible match is at the first C<'m'> in
C<programming>.  C<m{1,2}> is the first quantifier, so it gets to match
a maximal C<mm>.

    $x =~ /.*(m{1,2})(.*)$/;  # matches,
                              # $1 = 'm'
                              # $2 = 'ing republic of Perl'

Here, the regexp matches at the start of the string.  The first
quantifier C<.*> grabs as much as possible, leaving just a single
C<'m'> for the second quantifier C<m{1,2}>.

    $x =~ /(.?)(m{1,2})(.*)$/;  # matches,
                                # $1 = 'a'
                                # $2 = 'mm'
                                # $3 = 'ing republic of Perl'

Here, C<.?> eats its maximal one character at the earliest possible
position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
the opportunity to match both C<m>'s.  Finally,

    "aXXXb" =~ /(X*)/; # matches with $1 = ''

because it can match zero copies of C<'X'> at the beginning of the
string.
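
The matches above can be checked mechanically.  The following is a
minimal sketch, not part of the original tutorial: C<captures> is a
hypothetical helper, introduced here only for illustration, that
matches a regexp in list context and hands back the captured groups.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the tutorial): match $regex against
# $string and, when called in list context, return the captured groups.
# An empty list means the match failed.
sub captures {
    my ($string, $regex) = @_;
    return $string =~ $regex;
}

my $x = "The programming republic of Perl";

# The regexps from the examples above, in the same order.
my @regexps = (
    qr/^(.+)(e|r)(.*)$/,
    qr/(m{1,2})(.*)$/,
    qr/.*(m{1,2})(.*)$/,
    qr/(.?)(m{1,2})(.*)$/,
);

foreach my $re (@regexps) {
    my @groups = captures($x, $re);
    print "$re: ", join(', ', map { "'$_'" } @groups), "\n";
}

# The zero-length match at the start of "aXXXb".
my @empty = captures("aXXXb", qr/(X*)/);
print "X*: '$empty[0]'\n";
```

Each line of output lists the groups captured by one regexp, so the
behaviour predicted by Principles 0 through 3 can be compared with what
perl actually does.
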
If you definitely want to match at least one C<'X'>, use
C<X+>, not C<X*>.

Sometimes greed is not good.  At times, we would like quantifiers to
match a I<minimal> piece of string, rather than a maximal piece.  For
this purpose, Larry Wall created the S<B<minimal match> > or
B<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>.  These are
the usual quantifiers with a C<?> appended to them.  They have the
following meanings:

=over 4

=item *

C<a??> = match 'a' 0 or 1 times. Try 0 first, then 1.

=item *

C<a*?> = match 'a' 0 or more times, i.e., any number of times,
but as few times as possible

=item *

C<a+?> = match 'a' 1 or more times, i.e., at least once, but
as few times as possible

=item *

C<a{n,m}?> = match at least C<n> times, not more than C<m>
times, as few times as possible

=item *

C<a{n,}?> = match at least C<n> times, but as few times as
possible

=item *

C<a{n}?> = match exactly C<n> times.  Because we match exactly
C<n> times, C<a{n}?> is equivalent to C<a{n}> and is just there for
notational consistency.

=back

Let's look at the example above, but with minimal quantifiers:
