📄 perlrecharclass.1
\&\*(L"User-Defined Character Properties\*(R" in perlunicode.
.PP
Examples
.IX Subsection "Examples"
.PP
.Vb 8
\&    "a"  =~ /\ew/      # Match, "a" is a \*(Aqword\*(Aq character.
\&    "7"  =~ /\ew/      # Match, "7" is a \*(Aqword\*(Aq character as well.
\&    "a"  =~ /\ed/      # No match, "a" isn\*(Aqt a digit.
\&    "7"  =~ /\ed/      # Match, "7" is a digit.
\&    " "  =~ /\es/      # Match, a space is white space.
\&    "a"  =~ /\eD/      # Match, "a" is a non\-digit.
\&    "7"  =~ /\eD/      # No match, "7" is not a non\-digit.
\&    " "  =~ /\eS/      # No match, a space is not non\-white space.
\&
\&    " "  =~ /\eh/      # Match, space is horizontal white space.
\&    " "  =~ /\ev/      # No match, space is not vertical white space.
\&    "\er" =~ /\ev/      # Match, a return is vertical white space.
\&
\&    "a"  =~ /\epL/     # Match, "a" is a letter.
\&    "a"  =~ /\ep{Lu}/  # No match, /\ep{Lu}/ matches upper case letters.
\&
\&    "\ex{0e0b}" =~ /\ep{Thai}/  # Match, \ex{0e0b} is the character
\&                              # \*(AqTHAI CHARACTER SO SO\*(Aq, and that\*(Aqs in
\&                              # the Thai Unicode class.
\&    "a"  =~ /\eP{Lao}/ # Match, as "a" is not a Lao character.
.Ve
.Sh "Bracketed Character Classes"
.IX Subsection "Bracketed Character Classes"
The third form of character class you can use in Perl regular expressions
is the bracketed form. In its simplest form, it lists the characters
that may be matched inside square brackets, like this: \f(CW\*(C`[aeiou]\*(C'\fR.
This matches one of \f(CW\*(C`a\*(C'\fR, \f(CW\*(C`e\*(C'\fR, \f(CW\*(C`i\*(C'\fR, \f(CW\*(C`o\*(C'\fR or \f(CW\*(C`u\*(C'\fR. Just as with the other
character classes, exactly one character will be matched. To match
a longer string consisting of characters mentioned in the character
class, follow the character class with a quantifier.
For instance,
\&\f(CW\*(C`[aeiou]+\*(C'\fR matches a string of one or more lowercase \s-1ASCII\s0 vowels.
.PP
Repeating a character in a character class has no
effect; it's considered to be in the set only once.
.PP
Examples:
.PP
.Vb 5
\&    "e"  =~ /[aeiou]/     # Match, as "e" is listed in the class.
\&    "p"  =~ /[aeiou]/     # No match, "p" is not listed in the class.
\&    "ae" =~ /^[aeiou]$/   # No match, a character class only matches
\&                          # a single character.
\&    "ae" =~ /^[aeiou]+$/  # Match, due to the quantifier.
.Ve
.PP
\fISpecial Characters Inside a Bracketed Character Class\fR
.IX Subsection "Special Characters Inside a Bracketed Character Class"
.PP
Most characters that are meta characters in regular expressions (that
is, characters that carry a special meaning like \f(CW\*(C`*\*(C'\fR or \f(CW\*(C`(\*(C'\fR) lose
their special meaning and can be used inside a character class without
the need to escape them. For instance, \f(CW\*(C`[()]\*(C'\fR matches either an opening
parenthesis, or a closing parenthesis, and the parens inside the character
class don't group or capture.
.PP
Characters that may carry a special meaning inside a character class are:
\&\f(CW\*(C`\e\*(C'\fR, \f(CW\*(C`^\*(C'\fR, \f(CW\*(C`\-\*(C'\fR, \f(CW\*(C`[\*(C'\fR and \f(CW\*(C`]\*(C'\fR, and are discussed below. They can be
escaped with a backslash, although this is sometimes not needed, in which
case the backslash may be omitted.
.PP
The sequence \f(CW\*(C`\eb\*(C'\fR is special inside a bracketed character class. While
outside the character class \f(CW\*(C`\eb\*(C'\fR is an assertion indicating a point
that does not have either two word characters or two non-word characters
on either side, inside a bracketed character class, \f(CW\*(C`\eb\*(C'\fR matches a
backspace character.
.PP
A \f(CW\*(C`[\*(C'\fR is not special inside a character class, unless it's the start
of a \s-1POSIX\s0 character class (see below).
It normally does not need escaping.
.PP
A \f(CW\*(C`]\*(C'\fR is either the end of a \s-1POSIX\s0 character class (see below), or it
signals the end of the bracketed character class. Normally it needs
escaping if you want to include a \f(CW\*(C`]\*(C'\fR in the set of characters.
However, if the \f(CW\*(C`]\*(C'\fR is the \fIfirst\fR (or the second if the first
character is a caret) character of a bracketed character class, it
does not denote the end of the class (as you cannot have an empty class)
and is considered part of the set of characters that can be matched without
escaping.
.PP
Examples:
.PP
.Vb 8
\&    "+"   =~ /[+?*]/  # Match, "+" in a character class is not special.
\&    "\ecH" =~ /[\eb]/   # Match, \eb inside a character class
\&                      # is equivalent to a backspace.
\&    "]"   =~ /[][]/   # Match, as the character class contains
\&                      # both [ and ].
\&    "[]"  =~ /[[]]/   # Match, the pattern contains a character class
\&                      # containing just [, and the character class is
\&                      # followed by a ].
.Ve
.PP
\fICharacter Ranges\fR
.IX Subsection "Character Ranges"
.PP
It is not uncommon to want to match a range of characters. Luckily, instead
of listing all the characters in the range, one may use the hyphen (\f(CW\*(C`\-\*(C'\fR).
If inside a bracketed character class you have two characters separated
by a hyphen, it's treated as if the two characters and all the characters
between them are in the class. For instance, \f(CW\*(C`[0\-9]\*(C'\fR matches any \s-1ASCII\s0 digit, and \f(CW\*(C`[a\-m]\*(C'\fR
matches any lowercase letter from the first half of the \s-1ASCII\s0 alphabet.
.PP
Note that the two characters on either side of the hyphen are not
necessarily both letters or both digits. Any character is possible,
although not advisable. \f(CW\*(C`[\*(Aq\-?]\*(C'\fR contains a range of characters, but
most people will not know which characters that will be.
Furthermore,
such ranges may lead to portability problems if the code has to run on
a platform that uses a different character set, such as \s-1EBCDIC\s0.
.PP
If a hyphen in a character class cannot be part of a range, for instance
because it is the first or the last character of the character class,
or if it immediately follows a range, the hyphen isn't special, and will be
considered a character that may be matched. You have to escape the hyphen
with a backslash if you want to have a hyphen in your set of characters to
be matched, and its position in the class is such that it can be considered
part of a range.
.PP
Examples:
.PP
.Vb 8
\&    [a\-z]    # Matches a character that is a lower case ASCII letter.
\&    [a\-fz]   # Matches any letter between \*(Aqa\*(Aq and \*(Aqf\*(Aq (inclusive) or the
\&             # letter \*(Aqz\*(Aq.
\&    [\-z]     # Matches either a hyphen (\*(Aq\-\*(Aq) or the letter \*(Aqz\*(Aq.
\&    [a\-f\-m]  # Matches any letter between \*(Aqa\*(Aq and \*(Aqf\*(Aq (inclusive), the
\&             # hyphen (\*(Aq\-\*(Aq), or the letter \*(Aqm\*(Aq.
\&    [\*(Aq\-?]    # Matches any of the characters \*(Aq()*+,\-./0123456789:;<=>?
\&             # (But not on an EBCDIC platform).
.Ve
.PP
\fINegation\fR
.IX Subsection "Negation"
.PP
It is also possible to instead list the characters you do not want to
match. You can do so by using a caret (\f(CW\*(C`^\*(C'\fR) as the first character in the
character class. For instance, \f(CW\*(C`[^a\-z]\*(C'\fR matches a character that is not a
lowercase \s-1ASCII\s0 letter.
.PP
This syntax makes the caret a special character inside a bracketed character
class, but only if it is the first character of the class.
So if you want
to have the caret as one of the characters you want to match, you either
have to escape the caret, or not list it first.
.PP
Examples:
.PP
.Vb 4
\&    "e"  =~ /[^aeiou]/  # No match, the \*(Aqe\*(Aq is listed.
\&    "x"  =~ /[^aeiou]/  # Match, as \*(Aqx\*(Aq isn\*(Aqt a lowercase vowel.
\&    "^"  =~ /[^^]/      # No match, matches anything that isn\*(Aqt a caret.
\&    "^"  =~ /[x^]/      # Match, caret is not special here.
.Ve
.PP
\fIBackslash Sequences\fR
.IX Subsection "Backslash Sequences"
.PP
You can put a backslash sequence character class inside a bracketed character
class, and it will act just as if you put all the characters matched by
the backslash sequence inside the character class. For instance,
\&\f(CW\*(C`[a\-f\ed]\*(C'\fR will match any digit, or any of the lowercase letters between
\&'a' and 'f' inclusive.
.PP
Examples:
.PP
.Vb 4
\&    /[\ep{Thai}\ed]/     # Matches a character that is either a Thai
\&                       # character, or a digit.
\&    /[^\ep{Arabic}()]/  # Matches a character that is neither an Arabic
\&                       # character, nor a parenthesis.
.Ve
.PP
Backslash sequence character classes cannot form one of the endpoints
of a range.
.PP
\fIPosix Character Classes\fR
.IX Subsection "Posix Character Classes"
.PP
\s-1POSIX\s0 character classes have the form \f(CW\*(C`[:class:]\*(C'\fR, where \fIclass\fR is
the name of the class, and \f(CW\*(C`[:\*(C'\fR and \f(CW\*(C`:]\*(C'\fR are the delimiters. \s-1POSIX\s0 character classes appear
\&\fIinside\fR bracketed character classes, and are a convenient and descriptive
way of listing a group of characters.
Be careful about the syntax:
.PP
.Vb 2
\&    # Correct:
\&    $string =~ /[[:alpha:]]/
\&
\&    # Incorrect (will warn):
\&    $string =~ /[:alpha:]/
.Ve
.PP
The latter pattern would be a character class consisting of a colon,
and the letters \f(CW\*(C`a\*(C'\fR, \f(CW\*(C`l\*(C'\fR, \f(CW\*(C`p\*(C'\fR and \f(CW\*(C`h\*(C'\fR.
.PP
Perl recognizes the following \s-1POSIX\s0 character classes:
.PP
.Vb 10
\&    alpha   Any alphabetical character.
\&    alnum   Any alphanumerical character.
\&    ascii   Any ASCII character.
\&    blank   A GNU extension, equal to a space or a horizontal tab (\et).
\&    cntrl   Any control character.
\&    digit   Any digit, equivalent to \ed.
\&    graph   Any printable character, excluding a space.
\&    lower   Any lowercase character.
\&    print   Any printable character, including a space.
\&    punct   Any punctuation character.
\&    space   Any white space character. \es plus the vertical tab (\ecK).
\&    upper   Any uppercase character.
\&    word    Any "word" character, equivalent to \ew.
\&    xdigit  Any hexadecimal digit, \*(Aq0\*(Aq \- \*(Aq9\*(Aq, \*(Aqa\*(Aq \- \*(Aqf\*(Aq, \*(AqA\*(Aq \- \*(AqF\*(Aq.
.Ve
.PP
The exact set of characters matched depends on whether the source string
is internally in \s-1UTF\-8\s0 format or not. See \*(L"Locale, Unicode and \s-1UTF\-8\s0\*(R".
.PP
Most \s-1POSIX\s0 character classes have \f(CW\*(C`\ep\*(C'\fR counterparts. The difference
is that the \f(CW\*(C`\ep\*(C'\fR classes will always match according to the Unicode
properties, regardless of whether the string is in \s-1UTF\-8\s0 format or not.
.PP
The following table shows the relation between \s-1POSIX\s0 character classes
and the Unicode properties:
.PP
.Vb 1
\&    [[:...:]]   \ep{...}      backslash
\&
\&    alpha       IsAlpha
\&    alnum       IsAlnum
\&    ascii       IsASCII
\&    blank
\&    cntrl       IsCntrl
\&    digit       IsDigit       \ed
\&    graph       IsGraph
\&    lower       IsLower
\&    print       IsPrint
\&    punct       IsPunct
\&    space       IsSpace
\&                IsSpacePerl   \es
\&    upper       IsUpper
\&    word        IsWord
\&    xdigit      IsXDigit
.Ve
.PP
Some character classes may have a non-obvious name:
.IP "cntrl" 4
.IX Item "cntrl"
Any control character.
Usually, control characters don't produce output
as such, but instead control the terminal somehow: for example newline
and backspace are control characters. All characters with \f(CW\*(C`ord()\*(C'\fR less
than 32 are usually classified as control characters (in \s-1ASCII\s0, the \s-1ISO\s0
Latin character sets, and Unicode), as is the character with an \f(CW\*(C`ord()\*(C'\fR value
of 127 (\f(CW\*(C`DEL\*(C'\fR).
.IP "graph" 4
.IX Item "graph"
Any character that is \fIgraphical\fR, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.
.IP "print" 4
.IX Item "print"
All printable characters, which is the set of all the graphical characters
plus the space.
.IP "punct" 4
.IX Item "punct"
Any punctuation (special) character.
.PP
Negation
.IX Subsection "Negation"
.PP
A Perl extension to the \s-1POSIX\s0 character class is the ability to
negate it. This is done by prefixing the class name with a caret (\f(CW\*(C`^\*(C'\fR).
Some examples:
.PP
.Vb 4
\&    POSIX         Unicode       Backslash
\&    [[:^digit:]]  \eP{IsDigit}   \eD
\&    [[:^space:]]  \eP{IsSpace}   \eS
\&    [[:^word:]]   \eP{IsWord}    \eW
.Ve
.PP
[= =] and [. .]
.IX Subsection "[= =] and [. .]"
.PP
Perl will recognize the \s-1POSIX\s0 character classes \f(CW\*(C`[=class=]\*(C'\fR, and
\&\f(CW\*(C`[.class.]\*(C'\fR, but does not (yet?) support this construct.
Use of
such a construct will lead to an error.
.PP
Examples
.IX Subsection "Examples"
.PP
.Vb 10
\&    /[[:digit:]]/            # Matches a character that is a digit.
\&    /[01[:lower:]]/          # Matches a character that is either a
\&                             # lowercase letter, or \*(Aq0\*(Aq or \*(Aq1\*(Aq.
\&    /[[:digit:][:^xdigit:]]/ # Matches a character that can be anything
\&                             # except the letters \*(Aqa\*(Aq to \*(Aqf\*(Aq in either case.
\&                             # This is because the character class contains
\&                             # all digits, and anything that isn\*(Aqt a
\&                             # hex digit, resulting in a class containing
\&                             # all characters except the letters \*(Aqa\*(Aq to \*(Aqf\*(Aq
\&                             # and \*(AqA\*(Aq to \*(AqF\*(Aq.
.Ve
.Sh "Locale, Unicode and \s-1UTF\-8\s0"
.IX Subsection "Locale, Unicode and UTF-8"
Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, and the locale that is
in effect.
.PP
\&\f(CW\*(C`\ew\*(C'\fR, \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\es\*(C'\fR and the \s-1POSIX\s0 character classes (and their negations,
including \f(CW\*(C`\eW\*(C'\fR, \f(CW\*(C`\eD\*(C'\fR, \f(CW\*(C`\eS\*(C'\fR) are affected by this behaviour.
.PP
The rule is that if the source string is in \s-1UTF\-8\s0 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale is
in effect. If there is no locale, they match the \s-1ASCII\s0 defaults
(52 letters, 10 digits and underscore for \f(CW\*(C`\ew\*(C'\fR, 0 to 9 for \f(CW\*(C`\ed\*(C'\fR, etc.).
.PP
This usually means that if you are matching against characters whose \f(CW\*(C`ord()\*(C'\fR
values are between 128 and 255 inclusive, your character class may match
or not depending on the current locale, and whether the source string is
in \s-1UTF\-8\s0 format. The string will be in \s-1UTF\-8\s0 format if it contains
characters whose \f(CW\*(C`ord()\*(C'\fR value exceeds 255.
But a string may be in \s-1UTF\-8\s0
format without it having such characters.
.PP
For portability reasons, it may be better to not use \f(CW\*(C`\ew\*(C'\fR, \f(CW\*(C`\ed\*(C'\fR, \f(CW\*(C`\es\*(C'\fR
or the \s-1POSIX\s0 character classes, and use the Unicode properties instead.
.PP
Examples
.IX Subsection "Examples"
.PP
.Vb 6
\&    $str =  "\exDF";      # $str is not in UTF\-8 format.
\&    $str =~ /^\ew/;       # No match, as $str isn\*(Aqt in UTF\-8 format.
\&    $str .= "\ex{0e0b}";  # Now $str is in UTF\-8 format.
\&    $str =~ /^\ew/;       # Match! $str is now in UTF\-8 format.
\&    chop $str;
\&    $str =~ /^\ew/;       # Still a match! $str remains in UTF\-8 format.
.Ve
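.PP
The dependence on the internal encoding described in this section can be tried out with a short script. What follows is a minimal sketch, not part of the original manual page. The helper name starts_with_word_char is hypothetical, and the sketch assumes that no locale, no unicode_strings feature and no /u, /a or /l modifier is in effect, so that the result depends only on how the string is stored. utf8::is_utf8 and utf8::upgrade are core functions that report and change the internal encoding without changing the characters in the string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the first character of $s matches
# the word-character class under the default matching rules, else 0.
sub starts_with_word_char {
    my ($s) = @_;
    return $s =~ /^\w/ ? 1 : 0;
}

# The same single character (ord 0xDF) stored in two different ways.
my $native   = "\xDF";        # not in UTF-8 format internally
my $upgraded = "\xDF";
utf8::upgrade($upgraded);     # now in UTF-8 format internally, same content

for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $s) = @$case;
    printf "%-8s utf8=%d word=%d\n",
        $label,
        (utf8::is_utf8($s) ? 1 : 0),
        starts_with_word_char($s);
}
```

Under those assumptions the native string is expected not to match while the upgraded copy does, even though the two strings compare equal. On Perl 5.14 or later, adding the /u modifier to the pattern should make both cases match.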