📄 perlre.pod

that begin from and end at either alphabets of equal case ([a-e],
[A-E]), or digits ([0-9]).  Anything else is unsafe.  If in doubt,
spell out the character sets in full.

Characters may be specified using a metacharacter syntax much like that
used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
"\f" a form feed, etc.  More generally, \I<nnn>, where I<nnn> is a string
of octal digits, matches the character whose coded character set value
is I<nnn>.  Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
matches the character whose numeric value is I<nn>.  The expression \cI<x>
matches the character control-I<x>.  Finally, the "." metacharacter
matches any character except "\n" (unless you use C</s>).

You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>).  The
first alternative includes everything from the last pattern delimiter
("(", "[", or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
pattern delimiter.  That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.

Alternatives are tried from left to right, so the first
alternative found for which the entire expression matches, is the one that
is chosen.  This means that alternatives are not necessarily greedy.  For
example: when matching C<foo|foot> against "barefoot", only the "foo"
part will match, as that is the first alternative tried, and it successfully
matches the target string.
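The left-to-right rule can be checked with a short script.  This is only
a minimal sketch: the variable names are illustrative and not part of
any API, and the anchored pattern is an extra case added to show what
"the entire expression matches" means in practice.

```perl
use strict;
use warnings;

my $text = 'barefoot';

# Alternatives are tried left to right: "foo" succeeds first,
# so the longer "foot" is never attempted.
my ($short_first) = $text =~ /(foo|foot)/;

# Listing the longer alternative first lets it win instead.
my ($long_first) = $text =~ /(foot|foo)/;

# When the rest of the pattern cannot match after "foo", the engine
# backtracks and tries the next alternative.
my ($anchored) = 'foot' =~ /^(foo|foot)$/;

print "foo|foot      captured: $short_first\n";   # foo
print "foot|foo      captured: $long_first\n";    # foot
print "^(foo|foot)\$  captured: $anchored\n";      # foot
```

Putting the longer alternative first, or anchoring the pattern, are the
two usual ways to get the longer capture.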
(This might not seem important, but it is
important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets,
so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.

Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
\I<n>.  Subpatterns are numbered based on the left to right order
of their opening parenthesis.  A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern.  Therefore, C<(0|0x)\d*\s\1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.

=head2 Warning on \1 vs $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered for the RHS of a substitute to avoid shocking the
B<sed> addicts, but it's a dirty habit to get into.  That's because in
PerlThink, the righthand side of a C<s///> is a double-quoted string.  C<\1> in
the usual double-quoted string means a control-A.  The customary Unix
meaning of C<\1> is kludged in for C<s///>.  However, if you get into the habit
of doing that, you get yourself into trouble if you then add an C</e>
modifier.

    s/(\d+)/ \1 + 1 /eg;        # causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying C<\{1}000>, whereas you can fix it with
C<${1}000>.  The operation of interpolation should not be confused
with the operation of matching a backreference.  Certainly they mean two
different things on the I<left> side of the C<s///>.

=head2 Repeated patterns matching zero-length substring

B<WARNING>: Difficult material (and prose) ahead.  This section needs a rewrite.

Regular expressions provide a terse and powerful programming language.
As with most other power tools, power comes together with the ability
to wreak havoc.

A common abuse of this power stems from the ability to make infinite
loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The C<o?> can match at the beginning of C<'foo'>, and since the position
in the string is not moved by the match, C<o?> would match again and again
because of the C<*> modifier.  Another common way to create a similar cycle
is with the looping modifier C<//g>:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may
be significantly simplified by using repeated subexpressions that
may match zero-length substrings.  Here are two simple examples:

    @chars = split //, $string;           # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by I<forcefully breaking
the infinite loop>.  The rules for this are different for lower-level
loops given by the greedy modifiers C<*+{}>, and for higher-level
ones like the C</g> modifier or split() operator.

The lower-level loops are I<interrupted> (that is, the loop is
broken) when Perl detects that a repeated expression matched a
zero-length substring.  Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{   (?: NON_ZERO_LENGTH )*
       |
         (?: ZERO_LENGTH )?
     }x;

The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length.  To break the loop, the match
that follows a zero-length match is prohibited from having a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
and so the I<second best> match is chosen if the I<best> match is of
zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in C<< <><b><><a><><r><> >>.
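The constructs above can be run safely to confirm that Perl breaks the
would-be infinite loops.  The following is a small sketch rather than a
definitive test; the variable names are made up for the illustration,
and the substitution is applied to a lexical variable instead of C<$_>.

```perl
use strict;
use warnings;

# The substitution from the text: zero-length and one-character
# matches alternate instead of looping forever at position 0.
my $subst = 'bar';
$subst =~ s/\w??/<$&>/g;

# An empty pattern in split() yields the individual characters.
my @chars = split //, 'foo';

# A starred group that can match the empty string still terminates.
my $terminates = ('foo' =~ m{ ( o? )* }x) ? 1 : 0;

print "$subst\n";                 # <><b><><a><><r><>
print join(',', @chars), "\n";    # f,o,o
print "$terminates\n";            # 1
```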
At each position of the string the best
match given by non-greedy C<??> is the zero-length match, and the
I<second best> match is what is matched by C<\w>.  Thus zero-length matches
alternate with one-character-long matches.

Similarly, for repeated C<m/()/g> the second-best match is the match at the
position one notch further in the string.

The additional state of being I<matched with zero-length> is associated with
the matched string, and is reset by each assignment to pos().
Zero-length matches at the end of the previous match are ignored
during C<split>.

=head2 Combining pieces together

Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string.  However, in a typical regular
expression these elementary pieces are combined into more complicated
patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice:
if we match a regular expression C<a|ab> against C<"abc">, will it match
substring C<"a"> or C<"ab">?  One way to describe which substring is
actually matched is the concept of backtracking (see L<"Backtracking">).
However, this description is too low-level and makes you think
in terms of a particular implementation.

Another description starts with notions of "better"/"worse".  All the
substrings which may be matched by the given regular expression can be
sorted from the "best" match to the "worst" match, and it is the "best"
match which is chosen.  This replaces the question of "what is chosen?"
with the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most
one match at a given position is possible.  This section describes the
notion of better/worse for combining operators.
In the description
below C<S> and C<T> are regular subexpressions.

=over 4

=item C<ST>

Consider two possible matches, C<AB> and C<A'B'>, where C<A> and C<A'> are
substrings which can be matched by C<S>, and C<B> and C<B'> are substrings
which can be matched by C<T>.

If C<A> is a better match for C<S> than C<A'>, then C<AB> is a better
match than C<A'B'>.

If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
C<B> is a better match for C<T> than C<B'>.

=item C<S|T>

When C<S> can match, it is a better match than when only C<T> can match.

The ordering of two matches for C<S> is the same as for C<S> alone;
likewise for two matches for C<T>.

=item C<S{REPEAT_COUNT}>

Matches as C<SSS...S> (repeated as many times as necessary).

=item C<S{min,max}>

Matches as C<S{max}|S{max-1}|...|S{min+1}|S{min}>.

=item C<S{min,max}?>

Matches as C<S{min}|S{min+1}|...|S{max-1}|S{max}>.

=item C<S?>, C<S*>, C<S+>

Same as C<S{0,1}>, C<S{0,BIG_NUMBER}>, C<S{1,BIG_NUMBER}> respectively.

=item C<S??>, C<S*?>, C<S+?>

Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively.

=item C<< (?>S) >>

Matches the best match for C<S> and only that.

=item C<(?=S)>, C<(?<=S)>

Only the best match for C<S> is considered.  (This is important only if
C<S> has capturing parentheses, and backreferences are used somewhere
else in the whole regular expression.)

=item C<(?!S)>, C<(?<!S)>

For this grouping operator there is no need to describe the ordering, since
only whether or not C<S> can match is important.

=item C<(??{ EXPR })>

The ordering is the same as for the regular expression which is
the result of EXPR.

=item C<(?(condition)yes-pattern|no-pattern)>

Recall that which of C<yes-pattern> or C<no-pattern> actually matches is
already determined.
The ordering of the matches is the same as for the
chosen subexpression.

=back

The above recipes describe the ordering of matches I<at a given position>.
One more rule is needed to understand how a match is determined for the
whole regular expression: a match at an earlier position is always better
than a match at a later position.

=head2 Creating custom RE engines

Overloaded constants (see L<overload>) provide a simple way to extend
the functionality of the RE engine.

Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
characters.  Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
at these positions, so we want to have each C<\Y|> in the place of the
more complicated version.  We can create a module C<customre> to do
this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # An escaped backslash must map to itself (two backslashes), so
    # that a literal \\Y| is left alone instead of being collapsed.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex;
      return $re;
    }

Now C<use customre> enables the new escape in constant regular
expressions, i.e., those without any runtime variable interpolations.
As documented in L<overload>, this conversion will work only over
literal parts of regular expressions.  For C<\Y|$re\Y|> the variable
part of this regular expression needs to be converted explicitly
(but only if the special meaning of C<\Y|> should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;

=head1 BUGS

This document varies from difficult to understand to completely
and utterly opaque.
The wandering prose riddled with jargon is
hard to fathom in several places.

This document needs a rewrite that separates the tutorial content
from the reference content.

=head1 SEE ALSO

L<perlop/"Regexp Quote-Like Operators">.

L<perlop/"Gory details of parsing quoted constructs">.

L<perlfaq6>.

L<perlfunc/pos>.

L<perllocale>.

L<perlebcdic>.

I<Mastering Regular Expressions> by Jeffrey Friedl, published
by O'Reilly and Associates.