=head1 NAME

perlrecharclass - Perl Regular Expression Character Classes

=head1 DESCRIPTION

The top level documentation about Perl regular expressions
is found in L<perlre>.

This manual page discusses the syntax and use of character
classes in Perl Regular Expressions.

A character class is a way of denoting a set of characters,
in such a way that one character of the set is matched.
It's important to remember that matching a character class
consumes exactly one character in the source string. (The source
string is the string the regular expression is matched against.)

There are three types of character classes in Perl regular
expressions: the dot, backslashed sequences, and the bracketed form.

=head2 The dot

The dot (or period), C<.>, is probably the most used, and certainly
the most well-known, character class. By default, a dot matches any
character, except for the newline. The default can be changed to
add matching the newline with the I<single line> modifier: either
for the entire regular expression using the C</s> modifier, or
locally using C<(?s)>.

Here are some examples:

 "a"  =~  /./       # Match
 "."  =~  /./       # Match
 ""   =~  /./       # No match (dot has to match a character)
 "\n" =~  /./       # No match (dot does not match a newline)
 "\n" =~  /./s      # Match (global 'single line' modifier)
 "\n" =~  /(?s:.)/  # Match (local 'single line' modifier)
 "ab" =~  /^.$/     # No match (dot matches one character)

=head2 Backslashed sequences

Perl regular expressions contain many backslashed sequences that
constitute a character class. That is, they will match a single
character, if that character belongs to a specific set of characters
(defined by the sequence). A backslashed sequence is a sequence of
characters starting with a backslash. Not all backslashed sequences
are character classes; for a full list, see L<perlrebackslash>.

Here's a list of the backslashed sequences that are character classes.
They are discussed in more detail below.

 \d             Match a digit character.
 \D             Match a non-digit character.
 \w             Match a "word" character.
 \W             Match a non-"word" character.
 \s             Match a white space character.
 \S             Match a non-white space character.
 \h             Match a horizontal white space character.
 \H             Match a character that isn't horizontal white space.
 \v             Match a vertical white space character.
 \V             Match a character that isn't vertical white space.
 \pP, \p{Prop}  Match a character matching a Unicode property.
 \PP, \P{Prop}  Match a character that doesn't match a Unicode property.

=head3 Digits

C<\d> matches a single character that is considered to be a I<digit>.
What is considered a digit depends on the internal encoding of
the source string. If the source string is in UTF-8 format, C<\d>
not only matches the digits '0' - '9', but also digits from Arabic,
Devanagari and other scripts. Otherwise, if there is a locale in effect,
it will match whatever characters the locale considers digits. Without
a locale, C<\d> matches the digits '0' to '9'.
See L</Locale, Unicode and UTF-8>.

Any character that isn't matched by C<\d> will be matched by C<\D>.

=head3 Word characters

C<\w> matches a single I<word> character: an alphanumeric character
(that is, an alphabetic character, or a digit), or the underscore (C<_>).
What is considered a word character depends on the internal encoding
of the string. If it's in UTF-8 format, C<\w> matches those characters
that are considered word characters in the Unicode database. That is, it
not only matches ASCII letters, but also Thai letters, Greek letters, etc.
If the source string isn't in UTF-8 format, C<\w> matches those characters
that are considered word characters by the current locale. Without
a locale in effect, C<\w> matches the ASCII letters, digits and the
underscore.

Any character that isn't matched by C<\w> will be matched by C<\W>.

=head3 White space

C<\s> matches any single character that is considered white space. In the
ASCII range, C<\s> matches the horizontal tab (C<\t>), the new line
(C<\n>), the form feed (C<\f>), the carriage return (C<\r>), and the
space (the vertical tab, C<\cK>, is not matched by C<\s>).
The exact set
of characters matched by C<\s> depends on whether the source string is
in UTF-8 format. If it is, C<\s> matches what is considered white space
in the Unicode database. Otherwise, if there is a locale in effect, C<\s>
matches whatever is considered white space by the current locale. Without
a locale, C<\s> matches the five characters mentioned in the beginning
of this paragraph. Perhaps the most notable difference is that C<\s>
matches a non-breaking space only if the non-breaking space is in a
UTF-8 encoded string.

Any character that isn't matched by C<\s> will be matched by C<\S>.

C<\h> will match any character that is considered horizontal white space;
this includes the space and the tab characters. C<\H> will match any character
that is not considered horizontal white space.

C<\v> will match any character that is considered vertical white space;
this includes the carriage return and line feed characters (newline).
C<\V> will match any character that is not considered vertical white space.

C<\R> matches anything that can be considered a newline under Unicode
rules. It's not a character class, as it can match a multi-character
sequence. Therefore, it cannot be used inside a bracketed character
class. Details are discussed in L<perlrebackslash>.

C<\h>, C<\H>, C<\v>, C<\V>, and C<\R> are new in perl 5.10.0.

Note that unlike C<\s>, C<\d> and C<\w>, C<\h> and C<\v> always match
the same characters, regardless of whether the source string is in UTF-8
format or not. The set of characters they match is also not influenced
by locale.

One might think that C<\s> is equivalent to C<[\h\v]>. This is not true.
The vertical tab (C<"\x0b">) is not matched by C<\s>; it is, however,
considered vertical white space.
Furthermore, if the source string is
not in UTF-8 format, the next line (C<"\x85">) and the no-break space
(C<"\xA0">) are not matched by C<\s>, but are by C<\v> and C<\h> respectively.
If the source string is in UTF-8 format, both the next line and the
no-break space are matched by C<\s>.

The following table is a complete listing of characters matched by
C<\s>, C<\h> and C<\v>.

The first column gives the code point of the character (in hex format),
the second column gives the (Unicode) name. The third column indicates
by which class(es) the character is matched: C<h> for C<\h>, C<v> for
C<\v> and C<s> for C<\s>.

 0x00009  CHARACTER TABULATION       h s
 0x0000a  LINE FEED (LF)              vs
 0x0000b  LINE TABULATION             v
 0x0000c  FORM FEED (FF)              vs
 0x0000d  CARRIAGE RETURN (CR)        vs
 0x00020  SPACE                      h s
 0x00085  NEXT LINE (NEL)             vs  [1]
 0x000a0  NO-BREAK SPACE             h s  [1]
 0x01680  OGHAM SPACE MARK           h s
 0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
 0x02000  EN QUAD                    h s
 0x02001  EM QUAD                    h s
 0x02002  EN SPACE                   h s
 0x02003  EM SPACE                   h s
 0x02004  THREE-PER-EM SPACE         h s
 0x02005  FOUR-PER-EM SPACE          h s
 0x02006  SIX-PER-EM SPACE           h s
 0x02007  FIGURE SPACE               h s
 0x02008  PUNCTUATION SPACE          h s
 0x02009  THIN SPACE                 h s
 0x0200a  HAIR SPACE                 h s
 0x02028  LINE SEPARATOR              vs
 0x02029  PARAGRAPH SEPARATOR         vs
 0x0202f  NARROW NO-BREAK SPACE      h s
 0x0205f  MEDIUM MATHEMATICAL SPACE  h s
 0x03000  IDEOGRAPHIC SPACE          h s

=over 4

=item [1]

NEXT LINE and NO-BREAK SPACE only match C<\s> if the source string is in
UTF-8 format.

=back

It is worth noting that C<\d>, C<\w>, etc, match single characters, not
complete numbers or words. To match a number (that consists of digits),
use C<\d+>; to match a word, use C<\w+>.

=head3 Unicode Properties

C<\pP> and C<\p{Prop}> are character classes to match characters that
fit given Unicode classes. One letter classes can be used in the C<\pP>
form, with the class name following the C<\p>; otherwise, the property
name is enclosed in braces, and follows the C<\p>.
For instance, a
match for a number can be written as C</\pN/> or as C</\p{Number}/>.
Lowercase letters are matched by the property I<LowercaseLetter> which
has as short form I<Ll>. They have to be written as C</\p{Ll}/> or
C</\p{LowercaseLetter}/>. C</\pLl/> is valid, but means something different.
It matches a two character string: a letter (Unicode property C<\pL>),
followed by a lowercase C<l>.

For a list of possible properties, see
L<perlunicode/Unicode Character Properties>. It is also possible to
define your own properties. This is discussed in
L<perlunicode/User-Defined Character Properties>.

=head4 Examples

 "a"  =~  /\w/      # Match, "a" is a 'word' character.
 "7"  =~  /\w/      # Match, "7" is a 'word' character as well.
 "a"  =~  /\d/      # No match, "a" isn't a digit.
 "7"  =~  /\d/      # Match, "7" is a digit.
 " "  =~  /\s/      # Match, a space is white space.
 "a"  =~  /\D/      # Match, "a" is a non-digit.
 "7"  =~  /\D/      # No match, "7" is not a non-digit.
 " "  =~  /\S/      # No match, a space is not non-white space.

 " "  =~  /\h/      # Match, space is horizontal white space.
 " "  =~  /\v/      # No match, space is not vertical white space.
 "\r" =~  /\v/      # Match, a return is vertical white space.

 "a"  =~  /\pL/     # Match, "a" is a letter.
 "a"  =~  /\p{Lu}/  # No match, /\p{Lu}/ matches upper case letters.

 "\x{0e0b}" =~ /\p{Thai}/  # Match, \x{0e0b} is the character
                           # 'THAI CHARACTER SO SO', and that's in
                           # the Thai Unicode class.
 "a"  =~  /\P{Lao}/ # Match, as "a" is not a Laotian character.

=head2 Bracketed Character Classes

The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched inside square brackets, like this: C<[aeiou]>.
This matches one of C<a>, C<e>, C<i>, C<o> or C<u>. Just as with the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier.
For instance,
C<[aeiou]+> matches a string of one or more lowercase ASCII vowels.

Repeating a character in a character class has no
effect; it's considered to be in the set only once.

Examples:

 "e"  =~  /[aeiou]/        # Match, as "e" is listed in the class.
 "p"  =~  /[aeiou]/        # No match, "p" is not listed in the class.
 "ae" =~  /^[aeiou]$/      # No match, a character class only matches
                           # a single character.
 "ae" =~  /^[aeiou]+$/     # Match, due to the quantifier.

=head3 Special Characters Inside a Bracketed Character Class

Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like C<*> or C<(>) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, C<[()]> matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.

Characters that may carry a special meaning inside a character class are:
C<\>, C<^>, C<->, C<[> and C<]>, and are discussed below. They can be