<html><head><title>Character Classes (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly & Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Character Classes"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_03.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="ch05_01.htm">Chapter 5: Pattern Matching</a></td><td align="right" valign="top" width="172"><a href="ch05_05.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h2 class="sect1">5.4. Character Classes</h2><p><a name="INDEX-1525"></a><a name="INDEX-1526"></a><a name="INDEX-1527"></a><a name="INDEX-1528"></a><a name="INDEX-1529"></a><a name="INDEX-1530"></a>In a pattern match, you may match any character that has--or that does not have--a particular property. There are four ways to specify character classes. 
You may specify a character class in the traditional way using square brackets and enumerating the possible characters, or you may use any of three mnemonic shortcuts: the classic Perl classes, the new Perl Unicode properties, or the standard POSIX classes. Each of these shortcuts matches only one character from its set. Quantify them to match larger expanses, such as <tt class="literal">\d+</tt> to match one or more digits. (An easy mistake is to think that <tt class="literal">\w</tt> matches a word. Use <tt class="literal">\w+</tt> to match a word.)</p><h3 class="sect2">5.4.1. Custom Character Classes</h3><p><a name="INDEX-1531"></a><a name="INDEX-1532"></a>An enumerated list of characters in square brackets is called a <em class="emphasis">character class</em> and matches any one of the characters in the list. For example, <tt class="literal">[aeiouy]</tt> matches a letter that can be a vowel in English. (For Welsh add a "<tt class="literal">w</tt>", for Scottish an "<tt class="literal">r</tt>".) To match a right square bracket, either backslash it or place it first in the list.</p><p><a name="INDEX-1533"></a><a name="INDEX-1534"></a>Character ranges may be indicated using a hyphen and the <tt class="literal">a-z</tt> notation. Multiple ranges may be combined; for example, <tt class="literal">[0-9a-fA-F]</tt> matches one hex "digit". You may use a backslash to protect a hyphen that would otherwise be interpreted as a range delimiter, or just put it at the beginning or end of the class (a practice which is arguably less readable but more traditional).</p><p><a name="INDEX-1535"></a><a name="INDEX-1536"></a><a name="INDEX-1537"></a> A caret (or circumflex, or hat, or up arrow) at the front of the character class inverts the class, causing it to match any single character <em class="emphasis">not</em> in the list. (To match a caret, either <em class="emphasis">don't</em> put it first, or better, escape it with a backslash.) 
For example, <tt class="literal">[^aeiouy]</tt> matches any character that isn't a vowel. Be careful with character class negation, though, because the universe of characters is expanding. For example, that character class matches consonants--and also matches spaces, newlines, and anything (including vowels) in Cyrillic, Greek, or nearly any other script, not to mention every ideograph in Chinese, Japanese, and Korean. And someday maybe even Cirth, Tengwar, and Klingon. (Linear B and Etruscan, for sure.) So it might be better to specify your consonants explicitly, such as <tt class="literal">[cbdfghjklmnpqrstvwxyz]</tt>, or <tt class="literal">[b-df-hj-np-tv-z]</tt> for short. (This also solves the issue of "y" needing to be in two places at once, which a set complement would preclude.)</p><p><a name="INDEX-1538"></a><a name="INDEX-1539"></a><a name="INDEX-1540"></a>Normal character metasymbols are supported inside a character class (see "Specific Characters"), such as <tt class="literal">\n</tt>, <tt class="literal">\t</tt>, <tt class="literal">\c</tt><em class="replaceable">X</em>, <tt class="literal">\</tt><em class="replaceable">NNN</em>, and <tt class="literal">\N{</tt><em class="replaceable">NAME</em><tt class="literal">}</tt>. Additionally, you may use <tt class="literal">\b</tt> within a character class to mean a backspace, just as it does in a double-quoted string. Normally, in a pattern match, it means a word boundary. 
But zero-width assertions don't make any sense in character classes, so here <tt class="literal">\b</tt> returns to its normal meaning in strings. You may also use any predefined character class described later in the chapter (classic, Unicode, or POSIX), but don't try to use them as endpoints of a range--that doesn't make sense, so the "<tt class="literal">-</tt>" will be interpreted literally.</p><p><a name="INDEX-1541"></a><a name="INDEX-1542"></a><a name="INDEX-1543"></a><a name="INDEX-1544"></a>All other metasymbols lose their special meaning inside square brackets. In particular, you can't use any of the three generic wildcards: "<tt class="literal">.</tt>", <tt class="literal">\X</tt>, or <tt class="literal">\C</tt>. The first often surprises people, but it doesn't make much sense to use the universal character class within a restricted one, and you often want to match a literal dot as part of a character class--when you're matching filenames, for instance. It's also meaningless to specify quantifiers, assertions, or alternation inside a character class, since the characters are interpreted individually. For example, <tt class="literal">[fee|fie|foe|foo]</tt> means the same thing as <tt class="literal">[feio|]</tt>.</p><h3 class="sect2">5.4.2. Classic Perl Character Class Shortcuts</h3><p><a name="INDEX-1545"></a><a name="INDEX-1546"></a><a name="INDEX-1547"></a><a name="INDEX-1548"></a><a name="INDEX-1549"></a>Since the beginning, Perl has provided a number of character class shortcuts. These are listed in <a href="ch05_04.htm#perl3-tab-classiccharclass">Table 5-8</a>. All of them are backslashed alphabetic metasymbols, and in each case, the uppercase version is the negation of the lowercase version. The meanings of these are not quite as fixed as you might expect; the meanings can be influenced by locale settings. Even if you don't use locales, the meanings can change whenever a new Unicode standard comes out, adding scripts with new digits and letters. 
(To keep the old byte meanings, you can always <tt class="literal">use bytes</tt>. For explanations of the utf8 meanings, see "Unicode Properties" later in this chapter. In any case, the utf8 meanings are a superset of the byte meanings.)</p><a name="perl3-tab-classiccharclass"></a><h4 class="objtitle">Table 5.8. Classic Character Classes</h4><table border="1"><tr><th>Symbol</th><th>Meaning</th><th>As Bytes</th><th>As utf8</th></tr><tr><td><tt class="literal">\d</tt></td><td>Digit</td><td><tt class="literal">[0-9]</tt></td><td><tt class="literal">\p{IsDigit}</tt></td></tr><tr><td><tt class="literal">\D</tt></td><td>Nondigit</td><td><tt class="literal">[^0-9]</tt></td><td><tt class="literal">\P{IsDigit}</tt></td></tr><tr><td><tt class="literal">\s</tt></td><td>Whitespace</td><td><tt class="literal">[ \t\n\r\f]</tt></td><td><tt class="literal">\p{IsSpace}</tt></td></tr><tr><td><tt class="literal">\S</tt></td><td>Nonwhitespace</td><td><tt class="literal">[^ \t\n\r\f]</tt></td><td><tt class="literal">\P{IsSpace}</tt></td></tr><tr><td><tt class="literal">\w</tt></td><td>Word character</td><td><tt class="literal">[a-zA-Z0-9_]</tt></td><td><tt class="literal">\p{IsWord}</tt></td></tr><tr><td><tt class="literal">\W</tt></td><td>Non-(word character)</td><td><tt class="literal">[^a-zA-Z0-9_]</tt></td><td><tt class="literal">\P{IsWord}</tt></td></tr></table><p>(Yes, we know most words don't have numbers or underscores in them; <tt class="literal">\w</tt> is for matching "words" in the sense of tokens in a typical programming language. Or Perl, for that matter.)</p><p>These metasymbols may be used either outside or inside square brackets, that is, either standalone or as part of a constructed character class:<blockquote><pre class="programlisting">if ($var =~ /\D/)        { warn "contains non-digit" }
if ($var =~ /[^\w\s.]/)  { warn "contains non-(word, space, dot)" }</pre></blockquote></p><h3 class="sect2">5.4.3. 
Unicode Properties</h3><p><a name="INDEX-1550"></a><a name="INDEX-1551"></a>Unicode properties are available using <tt class="literal">\p{</tt><em class="replaceable">PROP</em><tt class="literal">}</tt> and its set complement, <tt class="literal">\P{</tt><em class="replaceable">PROP</em><tt class="literal">}</tt>. For the rare properties with one-character names, braces are optional, as in <tt class="literal">\pN</tt> to indicate a numeric character (not necessarily decimal--Roman numerals are numeric characters too). These property classes may be used by themselves or combined in a constructed character class:<blockquote><pre class="programlisting">if ($var =~ /^\p{IsAlpha}+$/)      { print "all alphabetic" }
if ($var =~ s/[\p{Zl}\p{Zp}]/\n/g) { print "fixed newline wannabes" }</pre></blockquote>Some properties are directly defined in the Unicode standard, and some properties are composites defined by Perl, based on the standard properties. <tt class="literal">Zl</tt> and <tt class="literal">Zp</tt> are standard Unicode properties representing line separators and paragraph separators, while <tt class="literal">IsAlpha</tt> is defined by Perl to be a property class combining the standard properties <tt class="literal">Ll</tt>, <tt class="literal">Lu</tt>, <tt class="literal">Lt</tt>, and <tt class="literal">Lo</tt> (that is, letters that are lowercase, uppercase, titlecase, or other). As of version 5.6.0 of Perl, you need to <tt class="literal">use utf8</tt> for these properties to work. This restriction will be relaxed in the future.</p><p>There are a great many properties. We'll list the ones we know about, but the list is necessarily incomplete. New properties are likely to be in new versions of Unicode, and you can even define your own properties. More about that later.</p><p><a name="INDEX-1552"></a>The Unicode Consortium produces the online resources that turn into the various files Perl uses in its Unicode implementation. 
For more about these files, see <a href="ch15_01.htm">Chapter 15, "Unicode"</a>. You can get a nice overview of Unicode in the document <em class="replaceable">PATH_TO_PERLLIB</em><em class="emphasis">/unicode/Unicode3.html</em> where <em class="replaceable">PATH_TO_PERLLIB</em> is what is printed out by:<blockquote><pre class="programlisting">perl -MConfig -le 'print $Config{privlib}'</pre></blockquote>Most Unicode properties are of the form <tt class="literal">\p{Is</tt><em class="replaceable">PROP</em><tt class="literal">}</tt>. The <tt class="literal">Is</tt> is optional, since it's so common, but you may prefer to leave it in for readability.</p><h3 class="sect3">5.4.3.1. Perl's Unicode properties</h3><p><a name="INDEX-1553"></a><a name="INDEX-1554"></a>First, <a href="ch05_04.htm#perl3-tab-prop-composite">Table 5-9</a> lists Perl's composite properties. They're defined to be reasonably close to the standard POSIX definitions for character classes.</p><a name="perl3-tab-prop-composite"></a><h4 class="objtitle">Table 5.9. 
Composite Unicode Properties</h4><table border="1"><tr><th>Property</th><th>Equivalent</th></tr><tr><td><tt class="literal">IsASCII</tt></td><td><tt class="literal">[\x00-\x7f]</tt></td></tr><tr><td><tt class="literal">IsAlnum</tt></td><td><tt class="literal">[\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}]</tt></td></tr><tr><td><tt class="literal">IsAlpha</tt></td><td><tt class="literal">[\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}]</tt></td></tr><tr><td><tt class="literal">IsCntrl</tt></td><td><tt class="literal">\p{IsC}</tt></td></tr><tr><td><tt class="literal">IsDigit</tt></td><td><tt class="literal">\p{Nd}</tt></td></tr><tr><td><tt class="literal">IsGraph</tt></td><td><tt class="literal">[^\pC\p{IsSpace}]</tt></td></tr><tr><td><tt class="literal">IsLower</tt></td><td><tt class="literal">\p{IsLl}</tt></td></tr><tr><td><tt class="literal">IsPrint</tt></td><td><tt class="literal">\P{IsC}</tt></td></tr><tr><td><tt class="literal">IsPunct</tt></td><td><tt class="literal">\p{IsP}</tt></td></tr><tr><td><tt class="literal">IsSpace</tt></td><td><tt class="literal">[\t\n\f\r\p{IsZ}]</tt></td></tr><tr><td><tt class="literal">IsUpper</tt></td><td><tt class="literal">[\p{IsLu}\p{IsLt}]</tt></td></tr><tr><td><tt class="literal">IsWord</tt></td><td><tt class="literal">[_\p{IsLl}\p{IsLu}\p{IsLt}\p{IsLo}\p{IsNd}]</tt></td></tr><tr><td><tt class="literal">IsXDigit</tt></td><td><tt class="literal">[0-9a-fA-F]</tt></td></tr></table><p>Perl also provides the following composites for each of the main categories of standard Unicode properties (see the next section):</p><table border="1"><tr><th>Property</th><th>Meaning</th><th>Normative</th></tr><tr><td><tt class="literal">IsC</tt></td><td>Crazy control codes and such</td><td>Yes</td></tr><tr><td><tt class="literal">IsL</tt></td><td>Letters</td><td>Partly</td></tr><tr><td><tt class="literal">IsM</tt></td><td>Marks</td><td>Yes</td></tr><tr><td><tt class="literal">IsN</tt></td><td>Numbers</td><td>Yes</td></tr><tr><td><tt 
class="literal">IsP</tt></td><td>Punctuation</td><td>No</td></tr><tr><td><tt class="literal">IsS</tt></td><td>Symbols</td><td>No</td></tr><tr><td><tt class="literal">IsZ</tt></td><td>Separators (Zeparators?)</td><td>Yes</td></tr></table><h3 class="sect3">5.4.3.2. Standard Unicode properties</h3><p><a href="ch05_04.htm#perl3-tab-prop-basic">Table 5-10</a> lists the most basic standard Unicode properties, derived from each character's category. No