=head1 NAME

perlre - Perl regular expressions

=head1 DESCRIPTION

This page describes the syntax of regular expressions in Perl.  For a
description of how to I<use> regular expressions in matching
operations, plus various examples of the same, see discussions
of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">.

Matching operations can have various modifiers.  Modifiers
that relate to the interpretation of the regular expression inside
are listed below.  Modifiers that alter the way a regular expression
is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and
L<perlop/"Gory details of parsing quoted constructs">.

=over 4

=item i

Do case-insensitive pattern matching.

If C<use locale> is in effect, the case map is taken from the current
locale.  See L<perllocale>.

=item m

Treat string as multiple lines.  That is, change "^" and "$" from matching
the start or end of the string to matching the start or end of any
line anywhere within the string.

=item s

Treat string as single line.  That is, change "." to match any character
whatsoever, even a newline, which normally it would not match.

The C</s> and C</m> modifiers both override the C<$*> setting.  That
is, no matter what C<$*> contains, C</s> without C</m> will force
"^" to match only at the beginning of the string and "$" to match
only at the end (or just before a newline at the end) of the string.
Together, as /ms, they let the "." match any character whatsoever,
while still allowing "^" and "$" to match, respectively, just after
and just before newlines within the string.

=item x

Extend your pattern's legibility by permitting whitespace and comments.

=back

These are usually written as "the C</x> modifier", even though the delimiter
in question might not really be a slash.  Any of these
modifiers may also be embedded within the regular expression itself using
the C<(?...)> construct.  See below.

The C</x> modifier itself needs a little more explanation.
It tells
the regular expression parser to ignore whitespace that is neither
backslashed nor within a character class.  You can use this to break up
your regular expression into (slightly) more readable parts.  The C<#>
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code.  This also means that if you want real
whitespace or C<#> characters in the pattern (outside a character
class, where they are unaffected by C</x>), you'll either have to
escape them or encode them using octal or hex escapes.  Taken together,
these features go a long way towards making Perl's regular expressions
more readable.  Note that you have to be careful not to include the
pattern delimiter in the comment--perl has no way of knowing you did
not intend to close the pattern early.  See the C-comment deletion code
in L<perlop>.

=head2 Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in
the Version 8 regex routines.  (The routines are derived
(distantly) from Henry Spencer's freely redistributable reimplementation
of the V8 routines.)  See L<Version 8 Regular Expressions> for
details.

In particular the following metacharacters have their standard I<egrep>-ish
meanings:

    \   Quote the next metacharacter
    ^   Match the beginning of the line
    .   Match any character (except newline)
    $   Match the end of the line (or before newline at the end)
    |   Alternation
    ()  Grouping
    []  Character class

By default, the "^" character is guaranteed to match only the
beginning of the string, the "$" character only the end (or before the
newline at the end), and Perl does certain optimizations with the
assumption that the string contains only one line.  Embedded newlines
will not be matched by "^" or "$".  You may, however, wish to treat a
string as a multi-line buffer, such that the "^" will match after any
newline within the string, and "$" will match before any newline.  At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator.
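
As a rough illustration of the default and C</m> behavior just described,
the following sketch counts how often "^" and "$" can match in a
three-line string.  The variable names and the sample text are made up
for this example and are not part of any Perl interface.

    use strict;
    use warnings;

    # A hypothetical three-line buffer, used only for this illustration.
    my $text = "alpha\nbeta\ngamma\n";

    # Without /m, "^" matches only at the very start of the string.
    my $starts_default = () = $text =~ /^\w+/g;

    # With /m, "^" also matches just after each embedded newline.
    my $starts_multi = () = $text =~ /^\w+/mg;

    # Without /m, "$" matches only at the end (or before a final newline).
    my @last_default = $text =~ /(\w+)$/g;

    # With /m, "$" matches before every newline.
    my @last_multi = $text =~ /(\w+)$/mg;

    print "default: $starts_default start(s), last: @last_default\n";
    print "with /m: $starts_multi start(s), last: @last_multi\n";

Run as written, this should report one start and C<gamma> for the default
case, and three starts and C<alpha beta gamma> with C</m>.
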
(Older programs did this by setting C<$*>,
but this practice is now deprecated.)

To simplify multi-line substitutions, the "." character never matches a
newline unless you use the C</s> modifier, which in effect tells Perl to pretend
the string is a single line--even if it isn't.  The C</s> modifier also
overrides the setting of C<$*>, in case you have some (badly behaved) older
code that sets it in another module.

The following standard quantifiers are recognized:

    *      Match 0 or more times
    +      Match 1 or more times
    ?      Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

(If a curly bracket occurs in any other context, it is treated
as a regular character.)  The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>.  n and m are limited
to integral values less than a preset limit defined when perl is built.
This is usually 32766 on the most common platforms.  The actual limit can
be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as
many times as possible (given a particular starting location) while still
allowing the rest of the pattern to match.  If you want it to match the
minimum number of times possible, follow the quantifier with a "?".  Note
that the meanings don't change, just the "greediness":

    *?     Match 0 or more times
    +?     Match 1 or more times
    ??     Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times

Because patterns are processed as double quoted strings, the following
also work:

    \t          tab                   (HT, TAB)
    \n          newline               (LF, NL)
    \r          return                (CR)
    \f          form feed             (FF)
    \a          alarm (bell)          (BEL)
    \e          escape (think troff)  (ESC)
    \033        octal char (think of a PDP-11)
    \x1B        hex char
    \x{263a}    wide hex char         (Unicode SMILEY)
    \c[         control char
    \N{name}    named char
    \l          lowercase next char (think vi)
    \u          uppercase next char (think vi)
    \L          lowercase till \E (think vi)
    \U          uppercase till \E (think vi)
    \E          end case modification (think vi)
    \Q          quote (disable) pattern metacharacters till \E

If C<use locale> is in effect, the case map used by C<\l>, C<\L>, C<\u>
and C<\U> is taken from the current locale.  See L<perllocale>.  For
documentation of C<\N{name}>, see L<charnames>.

You cannot include a literal C<$> or C<@> within a C<\Q> sequence.
An unescaped C<$> or C<@> interpolates the corresponding variable,
while escaping will cause the literal string C<\$> to be matched.
You'll need to write something like C<m/\Quser\E\@\Qhost/>.

In addition, Perl defines the following:

    \w  Match a "word" character (alphanumeric plus "_")
    \W  Match a non-"word" character
    \s  Match a whitespace character
    \S  Match a non-whitespace character
    \d  Match a digit character
    \D  Match a non-digit character
    \pP Match P, named property.  Use \p{Prop} for longer names.
    \PP Match non-P
    \X  Match eXtended Unicode "combining character sequence",
        equivalent to C<(?:\PM\pM*)>
    \C  Match a single C char (octet) even under utf8.

A C<\w> matches a single alphanumeric character or C<_>, not a whole word.
Use C<\w+> to match a string of Perl-identifier characters (which isn't
the same as matching an English word).  If C<use locale> is in effect, the
list of alphabetic characters generated by C<\w> is taken from the
current locale.  See L<perllocale>.
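
The difference between C<\w> and C<\w+> can be sketched as follows.  This
is only a minimal illustration; the sample string and variable names are
invented for the example.

    use strict;
    use warnings;

    # Hypothetical sample input for this illustration.
    my $line = 'my $foo_bar2 = 42;';

    # A bare \w matches exactly one "word" character.
    my ($one) = $line =~ /(\w)/;

    # \w+ matches a run of them, such as a Perl identifier.
    my ($ident) = $line =~ /\$(\w+)/;

    # \d is likewise a single-character class; /g collects each digit.
    my @digits = $line =~ /(\d)/g;

    print "one=$one ident=$ident digits=@digits\n";

Here C<$one> should hold C<m>, C<$ident> should hold C<foo_bar2>, and
C<@digits> should hold the three digits 2, 4 and 2.
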
You may use C<\w>, C<\W>, C<\s>, C<\S>,
C<\d>, and C<\D> within character classes, but if you try to use them
as endpoints of a range, no range is formed and the "-" is understood
literally.

See L<utf8> for details about C<\pP>, C<\PP>, and C<\X>.

The POSIX character class syntax

    [:class:]

is also available.  The available classes and their backslash
equivalents (if available) are as follows:

    alpha
    alnum
    ascii
    blank               [1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s      [2]
    upper
    word        \w      [3]
    xdigit

  [1] A GNU extension equivalent to C<[ \t]>, `all horizontal whitespace'.
  [2] Not I<exactly equivalent> to C<\s> since the C<[[:space:]]> includes
      also the (very rare) `vertical tabulator', "\ck", chr(11).
  [3] A Perl extension.

For example use C<[:upper:]> to match all the uppercase characters.
Note that the C<[]> are part of the C<[::]> construct, not part of the
whole character class.  For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.

If the C<utf8> pragma is used, the following equivalences to Unicode
\p{} constructs and equivalent backslash character classes (if available),
will hold:

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank       IsSpace
    cntrl       IsCntrl
    digit       IsDigit        \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl    \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

For example C<[:lower:]> and C<\p{IsLower}> are equivalent.

If the C<utf8> pragma is not used but the C<locale> pragma is, the
classes correlate with the usual isalpha(3) interface (except for
`word' and `blank').

The classes whose names may not be self-explanatory are:

=over 4

=item cntrl

Any control character.  Usually characters that don't produce output as
such but instead control the terminal somehow: for example newline and
backspace are control characters.
All characters with ord() less than
32 are most often classified as control characters (assuming ASCII,
the ISO Latin character sets, and Unicode).

=item graph

Any alphanumeric or punctuation (special) character.

=item print

Any alphanumeric or punctuation (special) character or space.

=item punct

Any punctuation (special) character.

=item xdigit

Any hexadecimal digit.  Though this may feel silly ([0-9A-Fa-f] would
work just fine) it is included for completeness.

=back

You can negate the [::] character classes by prefixing the class name
with a '^'.  This is a Perl extension.  For example:

    POSIX         trad. Perl    utf8 Perl

    [:^digit:]      \D          \P{IsDigit}
    [:^space:]      \S          \P{IsSpace}
    [:^word:]       \W          \P{IsWord}

The POSIX character classes [.cc.] and [=cc=] are recognized but
B<not> supported and trying to use them will cause an error.

Perl defines the following zero-width assertions:

    \b  Match a word boundary
    \B  Match a non-(word boundary)
    \A  Match only at beginning of string
    \Z  Match only at end of string, or before newline at the end
    \z  Match only at end of string
    \G  Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

A word boundary (C<\b>) is a spot between two characters
that has a C<\w> on one side of it and a C<\W> on the other side
of it (in either order), counting the imaginary characters off the
beginning and end of the string as matching a C<\W>.  (Within
character classes C<\b> represents backspace rather than a word
boundary, just as it normally does in any double-quoted string.)
The C<\A> and C<\Z> are just like "^" and "$", except that they
won't match multiple times when the C</m> modifier is used, while
"^" and "$" will match at every internal line boundary.
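
The word-boundary rule and the difference between "^" and C<\A> under
C</m> might be illustrated like this.  It is a small sketch only; the
sample text and variable names are assumptions made for the example.

    use strict;
    use warnings;

    # Hypothetical two-line input for this illustration.
    my $text = "cat concat\ncat catalog\n";

    # \b needs a \w on one side and a \W (or a string edge) on the
    # other, so only the free-standing "cat" words are counted.
    my $words = () = $text =~ /\bcat\b/g;

    # Under /m, "^" matches at the start of every line ...
    my $caret = () = $text =~ /^cat/mg;

    # ... while \A still matches only at the start of the string.
    my $anchor = () = $text =~ /\Acat/mg;

    print "words=$words caret=$caret anchor=$anchor\n";

With this input the counts should come out as 2, 2 and 1 respectively:
C<concat> and C<catalog> fail the C<\b> test on one side, and C<\A>
ignores the second line even though C</m> is in effect.
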
To match
the actual end of the string and not ignore an optional trailing
newline, use C<\z>.

The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consecutive substrings
of your string; see the previous reference.  The actual location