<tr><td><tt class="literal">\w</tt></td><td>Yes</td><td><p>Match any "word" character (alphanumerics plus "_").</p></td></tr>
<tr><td><tt class="literal">\W</tt></td><td>Yes</td><td><p>Match any nonword character.</p></td></tr>
<tr><td><tt class="literal">\x{</tt><em class="replaceable">abcd</em><tt class="literal">}</tt></td><td>Yes</td><td><p>Match the character given in hexadecimal.</p></td></tr>
<tr><td><tt class="literal">\X</tt></td><td>Yes</td><td><p>Match Unicode "combining character sequence" string.</p></td></tr>
<tr><td><tt class="literal">\z</tt></td><td>No</td><td><p>True at end of string only.</p></td></tr>
<tr><td><tt class="literal">\Z</tt></td><td>No</td><td><p>True at end of string or before optional newline.</p></td></tr>
</table>
<p>The braces are optional on <tt class="literal">\p</tt> and <tt class="literal">\P</tt> if the property name is one character. The braces are optional on <tt class="literal">\x</tt> if the hexadecimal number is two digits or less. The braces are never optional on <tt class="literal">\N</tt>.</p>
<p><a name="INDEX-1498"></a><a name="INDEX-1499"></a>Only metasymbols with "Match the..." or "Match any..." descriptions may be used within character classes (square brackets). That is, character classes are limited to containing specific sets of characters, so within them you may only use metasymbols that describe other specific sets of characters, or that describe specific individual characters. Of course, these metasymbols may also be used outside character classes, along with all the other nonclassificatory metasymbols. Note however that <tt class="literal">\b</tt> is two entirely different beasties: it's a backspace character inside the character class, but a word boundary assertion outside.</p>
<p>There is some amount of overlap between the characters that a pattern can match and the characters an ordinary double-quoted string can interpolate. Since regexes undergo two passes, it is sometimes ambiguous which pass should process a given character.
When there is ambiguity, the variable interpolation pass defers the interpretation of such characters to the regular expression parser.</p>
<p>But the variable interpolation pass can only defer to the regex parser when it knows it is parsing a regex. You can specify regular expressions as ordinary double-quoted strings, but then you must follow normal double-quote rules. Any of the previous metasymbols that happen to map to actual characters will still work, even though they're not being deferred to the regex parser. But you can't use any of the other metasymbols in ordinary double quotes (or in any similar constructs such as <tt class="literal">`...`</tt>, <tt class="literal">qq(...)</tt>, <tt class="literal">qx(...)</tt>, or the equivalent here documents). If you want your string to be parsed as a regular expression without doing any matching, you should be using the <tt class="literal">qr//</tt> (quote regex) operator.</p>
<p><a name="INDEX-1500"></a>Note that the case and metaquote translation escapes (<tt class="literal">\U</tt> and friends) must be processed during the variable interpolation pass because the purpose of those metasymbols is to influence how variables are interpolated. If you suppress variable interpolation with single quotes, you don't get the translation escapes either. Neither variables nor translation escapes (<tt class="literal">\U</tt>, etc.) are expanded in any single-quoted string, nor in single-quoted <tt class="literal">m'...'</tt> or <tt class="literal">qr'...'</tt> operators. Even when you do interpolation, these translation escapes are ignored if they show up as the <em class="emphasis">result</em> of variable interpolation, since by then it's too late to influence variable interpolation.</p>
<p>Although the transliteration operator doesn't take regular expressions, any metasymbol we've discussed that matches a single specific character also works in a <tt class="literal">tr///</tt> operation.
The rest do not (except for backslash, which continues to work in the backward way it always works).</p>
<h3 class="sect2">5.3.2. Specific Characters</h3>
<p><a name="INDEX-1501"></a><a name="INDEX-1502"></a>As mentioned before, everything that's not special in a pattern matches itself. That means <tt class="literal">/a/</tt> matches an "a", <tt class="literal">/=/</tt> matches an "=", and so on. Some characters, though, aren't very easy to type in from the keyboard or, even if you manage that, don't show up on a printout; control characters are notorious for this. In a regular expression, Perl recognizes the following double-quotish character aliases:</p>
<a name="perl3-tab-re-dblquotechars"></a><table border="1">
<tr><th>Escape</th><th>Meaning</th></tr>
<tr><td><tt class="literal">\0</tt></td><td>Null character (ASCII NUL)</td></tr>
<tr><td><tt class="literal">\a</tt></td><td>Alarm (BEL)</td></tr>
<tr><td><tt class="literal">\e</tt></td><td>Escape (ESC)</td></tr>
<tr><td><tt class="literal">\f</tt></td><td>Form feed (FF)</td></tr>
<tr><td><tt class="literal">\n</tt></td><td>Newline (NL, CR on Mac)</td></tr>
<tr><td><tt class="literal">\r</tt></td><td>Return (CR, NL on Mac)</td></tr>
<tr><td><tt class="literal">\t</tt></td><td>Tab (HT)</td></tr>
</table>
<p><a name="INDEX-1503"></a><a name="INDEX-1504"></a>Just as in double-quoted strings, Perl also honors the following four metasymbols in patterns:</p>
<dl>
<dt><b><tt class="literal">\c</tt><em class="replaceable">X</em></b></dt>
<dd><p>A named control character, like <tt class="literal">\cC</tt> for Control-C, <tt class="literal">\cZ</tt> for Control-Z, <tt class="literal">\c[</tt> for ESC, and <tt class="literal">\c?</tt> for DEL.<a name="INDEX-1505"></a></p></dd>
<dt><b><tt class="literal">\</tt><em class="replaceable">NNN</em></b></dt>
<dd><p>A character specified using its two- or three-digit octal code.
The leading <tt class="literal">0</tt> is optional, except for values less than <tt class="literal">010</tt> (8 decimal), since (unlike in double-quoted strings) the single-digit versions are always considered to be backreferences to captured strings within a pattern. Multiple digits are interpreted as the <em class="emphasis">n</em>th backreference if you've captured at least <em class="emphasis">n</em> substrings earlier in the pattern (where <em class="emphasis">n</em> is considered as a decimal number). Otherwise, they are interpreted as a character specified in octal.<a name="INDEX-1506"></a><a name="INDEX-1507"></a></p></dd>
<dt><b><tt class="literal">\x{</tt><em class="replaceable">LONGHEX</em><tt class="literal">}</tt></b></dt>
<dt><b><tt class="literal">\x</tt><em class="replaceable">HEX</em></b></dt>
<dd><p>A character number specified as one or two hex digits (<tt class="literal">[0-9a-fA-F]</tt>), as in <tt class="literal">\x1B</tt>. The one-digit form is usable only if the character following it is not a hex digit. If braces are used, you may use as many digits as you'd like, which may result in a Unicode character. For example, <tt class="literal">\x{262f}</tt> matches a Unicode YIN YANG.<a name="INDEX-1508"></a></p></dd>
<dt><b><tt class="literal">\N{</tt><em class="replaceable">NAME</em><tt class="literal">}</tt></b></dt>
<dd><p>A named character, such as <tt class="literal">\N{GREEK SMALL LETTER EPSILON}</tt>, <tt class="literal">\N{greek:epsilon}</tt>, or <tt class="literal">\N{epsilon}</tt>.
This requires the <tt class="literal">use charnames</tt> pragma described in <a href="ch31_01.htm">Chapter 31, "Pragmatic Modules"</a>, which also determines which flavors of those names you may use (<tt class="literal">":full"</tt>, <tt class="literal">":short"</tt>, or a script name such as <tt class="literal">"greek"</tt>, respectively, corresponding to the three styles just shown).<a name="INDEX-1509"></a></p>
<p>A list of all Unicode character names can be found in your closest Unicode standards document, or in <em class="replaceable">PATH_TO_PERLLIB</em><em class="emphasis">/unicode/Names.txt</em>.</p></dd>
</dl>
<h3 class="sect2">5.3.3. Wildcard Metasymbols</h3>
<p><a name="INDEX-1510"></a><a name="INDEX-1511"></a><a name="INDEX-1512"></a><a name="INDEX-1513"></a><a name="INDEX-1514"></a><a name="INDEX-1515"></a><a name="INDEX-1516"></a>Three special metasymbols serve as generic wildcards, each of them matching "any" character (for certain values of "any"). These are the dot ("<tt class="literal">.</tt>"), <tt class="literal">\C</tt>, and <tt class="literal">\X</tt>. None of these may be used in a character class. You can't use the dot there because it would match (nearly) any character in existence, so it's something of a universal character class in its own right. If you're going to include or exclude everything, there's not much point in having a character class. The special wildcards <tt class="literal">\C</tt> and <tt class="literal">\X</tt> have special structural meanings that don't map well to the notion of choosing a single Unicode character, which is the level at which character classes work.</p>
<p><a name="INDEX-1517"></a>The dot metacharacter matches any one character other than a newline. (And with the <tt class="literal">/s</tt> modifier, it matches that, too.) Like any of the dozen special characters in a pattern, to match a dot literally, you must escape it with a backslash.
For example, this checks whether a filename ends with a dot followed by a one-character extension:<blockquote><pre class="programlisting">if ($pathname =~ /\.(.)\z/s) {
    print "Ends in $1\n";
}</pre></blockquote>The first dot, the escaped one, is the literal character, and the second says "match any character". The <tt class="literal">\z</tt> says to match only at the end of the string, and the <tt class="literal">/s</tt> modifier lets the dot match a newline as well. (Yes, using a newline as a file extension Isn't Very Nice, but that doesn't mean it can't happen.)</p>
<p><a name="INDEX-1518"></a>The dot metacharacter is most often used with a quantifier. A <tt class="literal">.*</tt> matches a maximal number of characters, while a <tt class="literal">.*?</tt> matches a minimal number of characters. But it's also sometimes used without a quantifier for its width: <tt class="literal">/(..):(..):(..)/</tt> matches three colon-separated fields, each of which is two characters long.</p>
<p><a name="INDEX-1519"></a><a name="INDEX-1520"></a>If you use a dot in a pattern compiled under the lexically scoped <tt class="literal">use utf8</tt> pragma, then it will match any Unicode character. (You're not supposed to need a <tt class="literal">use utf8</tt> for that, but accidents will happen. The pragma may not be necessary by the time you read this.)<blockquote><pre class="programlisting">use utf8;
use charnames qw/:full/;
$BWV[887] = "G\N{MUSIC SHARP SIGN} minor";
($note, $black, $mode) = $BWV[887] =~ /^([A-G])(.)\s+(\S+)/;
print "That's lookin' sharp!\n" if $black eq chr(9839);</pre></blockquote></p>
<p><a name="INDEX-1521"></a>The <tt class="literal">\X</tt> metasymbol matches a character in a more extended sense. It really matches a string of one or more Unicode characters known as a "combining character sequence".
Such a sequence consists of a base character followed by any "mark" characters (diacritical markings like cedillas or diereses) that combine with that base character to form one logical unit. <tt class="literal">\X</tt> is exactly equivalent to <tt class="literal">(?:\PM\pM*)</tt>. This allows it to match one logical character, even when that really comprises several separate characters. The length of the match in <tt class="literal">/\X/</tt> would exceed one character if it matched any combining characters. (And that's character length, which has little to do with byte length.)</p>
<p>If you are using Unicode and really want to get at a single byte instead of a single character, you can use the <tt class="literal">\C</tt> metasymbol. This will always match one byte (specifically, one C language <tt class="literal">char</tt> type), even if this gets you out of sync with your Unicode character stream. See the appropriate warnings about doing this in <a href="ch15_01.htm">Chapter 15, "Unicode"</a>.</p>
<a name="INDEX-1522"></a><a name="INDEX-1523"></a><a name="INDEX-1524"></a>
<hr width="515" align="left">
<p><font size="-1"><a href="copyrght.htm">Copyright © 2001</a> O'Reilly &amp; Associates. All rights reserved.</font></p>
<!-- END OF BODY --></body></html>