<td>Reverse when used right-to-left</td></tr></table>
<p><a name="INDEX-1565"></a><a name="INDEX-1566"></a>The following properties classify various syllabaries according to vowel sounds:
<blockquote><pre class="programlisting">IsSylA    IsSylE   IsSylO   IsSylWAA  IsSylWII
IsSylAA   IsSylEE  IsSylOO  IsSylWC   IsSylWO
IsSylAAI  IsSylI   IsSylU   IsSylWE   IsSylWOO
IsSylAI   IsSylII  IsSylV   IsSylWEE  IsSylWU
IsSylC    IsSylN   IsSylWA  IsSylWI   IsSylWV</pre></blockquote>
For example, <tt class="literal">\p{IsSylA}</tt> would match <tt class="literal">\N{KATAKANA LETTER KA}</tt> but not <tt class="literal">\N{KATAKANA LETTER KU}</tt>.</p>
<p>Now that we've basically told you about all these Unicode 3.0 properties, we should point out that a few of the more esoteric ones aren't implemented in version 5.6.0 of Perl because its implementation was based in part on Unicode 2.0, and things like the bidirectional algorithm were still being worked out. However, by the time you read this, the missing properties may well be implemented, so we listed them anyway.</p>
<h3 class="sect3">5.4.3.3. Unicode block properties</h3>
<p><a name="INDEX-1567"></a><a name="INDEX-1568"></a><a name="INDEX-1569"></a><a name="INDEX-1570"></a>Some Unicode properties are of the form <tt class="literal">\p{In</tt><em class="replaceable">SCRIPT</em><tt class="literal">}</tt>. (Note the distinction between <tt class="literal">Is</tt> and <tt class="literal">In</tt>.) The <tt class="literal">In</tt> properties are for testing block ranges of a particular <em class="replaceable">SCRIPT</em>. If you have a character, and you wonder whether it is written in Greek script, you could test with:
<blockquote><pre class="programlisting">print "It's Greek to me!\n" if chr(931) =~ /\p{InGreek}/;</pre></blockquote>
That works by checking whether a character is "in" the valid range of that script type.
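As a rough sketch of what that membership test amounts to, assuming the Unicode 3.0 Greek block runs from U+0370 through U+03FF (the variable names are ours, for illustration only), the same check can be spelled out by hand:
<blockquote><pre class="programlisting">my $ch = chr(931);    # U+03A3, GREEK CAPITAL LETTER SIGMA
my $cp = ord($ch);    # numeric code point of the character

# Compare the code point against the assumed bounds of the Greek block.
print "It's Greek to me!\n" if $cp >= 0x0370 and 0x03FF >= $cp;</pre></blockquote>
In practice the <tt class="literal">\p{InGreek}</tt> test is shorter and follows Perl's own block tables.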
This may be negated with <tt class="literal">\P{In</tt><em class="replaceable">SCRIPT</em><tt class="literal">}</tt> to find out whether something <em class="emphasis">isn't</em> in a particular script's block, such as <tt class="literal">\P{InDingbats}</tt> to test whether a string contains a non-dingbat. Block properties include the following:
<blockquote><pre class="programlisting">InArabic      InCyrillic    InHangulJamo  InMalayalam  InSyriac
InArmenian    InDevanagari  InHebrew      InMongolian  InTamil
InArrows      InDingbats    InHiragana    InMyanmar    InTelugu
InBasicLatin  InEthiopic    InKanbun      InOgham      InThaana
InBengali     InGeorgian    InKannada     InOriya      InThai
InBopomofo    InGreek       InKatakana    InRunic      InTibetan
InBoxDrawing  InGujarati    InKhmer       InSinhala    InYiRadicals
InCherokee    InGurmukhi    InLao         InSpecials   InYiSyllables</pre></blockquote>
Not to mention jawbreakers like these:
<blockquote><pre class="programlisting">InAlphabeticPresentationForms     InHalfwidthandFullwidthForms
InArabicPresentationForms-A       InHangulCompatibilityJamo
InArabicPresentationForms-B       InHangulSyllables
InBlockElements                   InHighPrivateUseSurrogates
InBopomofoExtended                InHighSurrogates
InBraillePatterns                 InIdeographicDescriptionCharacters
InCJKCompatibility                InIPAExtensions
InCJKCompatibilityForms           InKangxiRadicals
InCJKCompatibilityIdeographs      InLatin-1Supplement
InCJKRadicalsSupplement           InLatinExtended-A
InCJKSymbolsandPunctuation        InLatinExtended-B
InCJKUnifiedIdeographs            InLatinExtendedAdditional
InCJKUnifiedIdeographsExtensionA  InLetterlikeSymbols
InCombiningDiacriticalMarks       InLowSurrogates
InCombiningHalfMarks              InMathematicalOperators
InCombiningMarksforSymbols        InMiscellaneousSymbols
InControlPictures                 InMiscellaneousTechnical
InCurrencySymbols                 InNumberForms
InEnclosedAlphanumerics           InOpticalCharacterRecognition
InEnclosedCJKLettersandMonths     InPrivateUse
InGeneralPunctuation              InSuperscriptsandSubscripts
InGeometricShapes                 InSmallFormVariants
InGreekExtended                   InSpacingModifierLetters</pre></blockquote>
And the winner is:
<blockquote><pre
class="programlisting">InUnifiedCanadianAboriginalSyllabics</pre></blockquote>
See <em class="replaceable">PATH_TO_PERLLIB</em><em class="emphasis">/unicode/In/*.pl</em> to get an up-to-date listing of all of these character block properties. Note that these <tt class="literal">In</tt> properties only test whether the character is in the block of characters allocated for that script. There is no guarantee that all characters in that range are defined; you also need to test against one of the <tt class="literal">Is</tt> properties discussed earlier to see if the character is defined. There is also no guarantee that a particular language doesn't use characters outside its assigned block. In particular, many European languages mix extended Latin characters with Latin-1 characters.</p>
<p>But hey, if you need a particular property that isn't provided, that's not a big problem. Read on.</p>
<h3 class="sect3">5.4.3.4. Defining your own character properties</h3>
<p><a name="INDEX-1571"></a><a name="INDEX-1572"></a>To define your own property, you need to write a subroutine with the name of the property you want (see <a href="ch06_01.htm">Chapter 6, "Subroutines"</a>). The subroutine should be defined in the package that needs the property (see <a href="ch10_01.htm">Chapter 10, "Packages"</a>), which means that if you want to use it in multiple packages, you'll either have to import it from a module (see <a href="ch11_01.htm">Chapter 11, "Modules"</a>), or inherit it as a class method from the package in which it is defined (see <a href="ch12_01.htm">Chapter 12, "Objects"</a>).</p>
<p>Once you've got that all settled, the subroutine should return data in the same format as the files in the <em class="replaceable">PATH_TO_PERLLIB</em><em class="emphasis">/unicode/Is</em> directory. That is, just return a list of characters or character ranges in hexadecimal, one per line. If there is a range, the two numbers are separated by a tab.
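As a minimal sketch of that format (the property name <tt class="literal">IsVowelOrDigit</tt> is hypothetical, chosen only for illustration), the following subroutine lists five single code points and one range, and then uses the property in a match:
<blockquote><pre class="programlisting"># Hypothetical property: the lowercase ASCII vowels plus the ASCII digits.
sub IsVowelOrDigit {
    # One hexadecimal entry per line; a range is written as start, tab, end.
    return "61\n65\n69\n6F\n75\n30\t39\n";
}

print "matched\n"     if "e" =~ /\p{IsVowelOrDigit}/;   # e is U+0065
print "not matched\n" if "x" =~ /\P{IsVowelOrDigit}/;   # x is U+0078, not listed</pre></blockquote>
A double-quoted string is used here only so that the newline and tab separators show up explicitly as <tt class="literal">\n</tt> and <tt class="literal">\t</tt>.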
Suppose you wanted a property that would be true if your character is in the range of either of the Japanese syllabaries, known as hiragana and katakana. (Together they're known as kana.) You can just put in the two ranges like this:
<blockquote><pre class="programlisting">sub InKana {
    return <<'END';
3040	309F
30A0	30FF
END
}</pre></blockquote>
Alternatively, you could define it in terms of existing property names:
<blockquote><pre class="programlisting">sub InKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
END
}</pre></blockquote>
<a name="INDEX-1573"></a><a name="INDEX-1574"></a>You can also do set subtraction using a "<tt class="literal">-</tt>" prefix. Suppose you only wanted the actual characters, not just the block ranges of characters. You could weed out all the undefined ones like this:
<blockquote><pre class="programlisting">sub IsKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}</pre></blockquote>
<a name="INDEX-1575"></a>You can also start with a complemented character set using the "<tt class="literal">!</tt>" prefix:
<blockquote><pre class="programlisting">sub IsNotKana {
    return <<'END';
!utf8::InHiragana
-utf8::InKatakana
+utf8::IsCn
END
}</pre></blockquote>
<a name="INDEX-1576"></a>Perl itself uses exactly the same tricks to define the meanings of its "classic" character classes (like <tt class="literal">\w</tt>) when you include them in your own custom character classes (like <tt class="literal">[-.\w\s]</tt>). You might think that the more complicated you get with your rules, the slower they will run, but in fact, once Perl has calculated the bit pattern for a particular 64-bit swatch of your property, it caches it so it never has to recalculate the pattern again. (It does it in 64-bit swatches so that it doesn't even have to decode your utf8 to do its lookups.) Thus, all character classes, built-in or custom, run at essentially the same speed (fast) once they get going.</p>
<a name="INDEX-1577"></a><a name="INDEX-1578"></a><h3 class="sect2">5.4.4.
POSIX-Style Character Classes</h3>
<p><a name="INDEX-1579"></a><a name="INDEX-1580"></a>Unlike Perl's other character class shortcuts, the POSIX-style character-class notation, <tt class="literal">[:</tt><em class="replaceable">CLASS</em><tt class="literal">:]</tt>, is available for use <em class="emphasis">only</em> when constructing other character classes, that is, inside an additional pair of square brackets. For example, <tt class="literal">/[.,[:alpha:][:digit:]]/</tt> will search for one character that is either a literal dot (because it's in a character class), a comma, an alphabetic character, or a digit.</p>
<p>The POSIX classes available as of revision 5.6 of Perl are shown in <a href="ch05_04.htm#perl3-tab-posix-char-class">Table 5-11</a>.</p>
<a name="perl3-tab-posix-char-class"></a><h4 class="objtitle">Table 5-11. POSIX Character Classes</h4>
<table border="1">
<tr><th>Class</th><th>Meaning</th></tr>
<tr><td><tt class="literal">alnum</tt></td><td><p>Any alphanumeric, that is, an <tt class="literal">alpha</tt> or a <tt class="literal">digit</tt>.</p></td></tr>
<tr><td><tt class="literal">alpha</tt></td><td><p>Any letter. (That's a lot more letters than you think, unless you're thinking Unicode, in which case it's still a lot.)</p></td></tr>
<tr><td><tt class="literal">ascii</tt></td><td>Any character with an ordinal value between 0 and 127.</td></tr>
<tr><td><tt class="literal">cntrl</tt></td><td><p>Any control character. Usually characters that don't produce output as such, but instead control the terminal somehow; for example, newline, form feed, and backspace are all control characters. Characters with an <tt class="literal">ord</tt> value less than 32 are most often classified as control characters.</p></td></tr>
<tr><td><tt class="literal">digit</tt></td><td><p>A character representing a decimal digit, such as <tt class="literal">0</tt> to <tt class="literal">9</tt>.
(Includes other characters under Unicode.) Equivalent to <tt class="literal">\d</tt>.</p></td></tr>
<tr><td><tt class="literal">graph</tt></td><td><p>Any alphanumeric or punctuation character.</p></td></tr>
<tr><td><tt class="literal">lower</tt></td><td><p>A lowercase letter.</p></td></tr>
<tr><td><tt class="literal">print</tt></td><td><p>Any alphanumeric or punctuation character or space.</p></td></tr>
<tr><td><tt class="literal">punct</tt></td><td><p>Any punctuation character.</p></td></tr>
<tr><td><tt class="literal">space</tt></td><td><p>Any space character. Includes tab, newline, form feed, and carriage return (and a lot more under Unicode). Equivalent to <tt class="literal">\s</tt>.</p></td></tr>
<tr><td><tt class="literal">upper</tt></td><td><p>Any uppercase (or titlecase) letter.</p></td></tr>
<tr><td><tt class="literal">word</tt></td><td><p>Any identifier character, either an <tt class="literal">alnum</tt> or underline.</p></td></tr>
<tr><td><tt class="literal">xdigit</tt></td><td><p>Any hexadecimal digit. Though this may seem silly (<tt class="literal">[0-9a-fA-F]</tt> works just fine), it is included for completeness.</p></td></tr>
</table>
<p><a name="INDEX-1581"></a>You can negate the POSIX character classes by prefixing the class name with a <tt class="literal">^</tt> following the <tt class="literal">[:</tt>.
(This is a Perl extension.) For example:</p>
<a name="perl3-tab-posixtrad"></a><table border="1">
<tr><th>POSIX</th><th>Classic</th></tr>
<tr><td><tt class="literal">[:^digit:]</tt></td><td><tt class="literal">\D</tt></td></tr>
<tr><td><tt class="literal">[:^space:]</tt></td><td><tt class="literal">\S</tt></td></tr>
<tr><td><tt class="literal">[:^word:]</tt></td><td><tt class="literal">\W</tt></td></tr>
</table>
<p>If the <tt class="literal">use utf8</tt> pragma is not requested, but the <tt class="literal">use locale</tt> pragma is, the classes correlate directly with the equivalent functions in the C library's <em class="emphasis">isalpha</em>(3) interface (except for <tt class="literal">word</tt>, which is a Perl extension, mirroring <tt class="literal">\w</tt>).<a name="INDEX-1582"></a></p>
<p>If the <tt class="literal">utf8</tt> pragma is used, POSIX character classes are exactly equivalent to the corresponding <tt class="literal">Is</tt> properties listed in <a href="ch05_04.htm#perl3-tab-prop-composite">Table 5-9</a>. For example, <tt class="literal">[:lower:]</tt> and <tt class="literal">\p{Lower}</tt> are equivalent, except that the POSIX classes may only be used within constructed character classes, whereas Unicode properties have no such restriction and may be used in patterns wherever Perl shortcuts like <tt class="literal">\s</tt> and <tt class="literal">\w</tt> may be used.</p>
<p>The brackets are part of the POSIX-style <tt class="literal">[::]</tt> construct, not part of the whole character class. This leads to writing patterns like <tt class="literal">/^[[:lower:][:digit:]]+$/</tt>, to match a string consisting entirely of lowercase letters or digits (plus an optional trailing newline). In particular, this does not work:
<blockquote><pre class="programlisting">42 =~ /^[:digit:]$/    # WRONG</pre></blockquote>
That's because it's not inside a character class.
Rather, it <em class="emphasis">is</em> a character class, the one representing the characters "<tt class="literal">:</tt>", "<tt class="literal">i</tt>", "<tt class="literal">t</tt>", "<tt class="literal">g</tt>", and "<tt class="literal">d</tt>". Perl doesn't care that you specified "<tt class="literal">:</tt>" twice.</p>
<p>Here's what you need instead:
<blockquote><pre class="programlisting">42 =~ /^[[:digit:]]+$/</pre></blockquote>
The POSIX collating-symbol and equivalence-class notations, <tt class="literal">[.cc.]</tt> and <tt class="literal">[=cc=]</tt>, are recognized but produce an error indicating they are not supported. Trying to use <em class="emphasis">any</em> POSIX character class in older versions of Perl is likely to fail miserably, and perhaps even silently. If you're going to use POSIX character classes, it's best to require a new version of Perl by saying:
<blockquote><pre class="programlisting">use 5.6.0;</pre></blockquote></p>
<a name="INDEX-1583"></a><a name="INDEX-1584"></a><a name="INDEX-1585"></a><a name="INDEX-1586"></a><!-- BOTTOM NAV BAR --><hr width="515" align="left"><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch05_03.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="index.htm"><img src="../gifs/txthome.gif" alt="Home" border="0"></a></td><td align="right" valign="top" width="172"><a href="ch05_05.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr><tr><td align="left" valign="top" width="172">5.3. Metacharacters and Metasymbols</td><td align="center" valign="top" width="171"><a href="index/index.htm"><img src="../gifs/index.gif" alt="Book Index" border="0"></a></td><td align="right" valign="top" width="172">5.5.
Quantifiers</td></tr></table></div><hr width="515" align="left"><!-- LIBRARY NAV BAR --><img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p><font size="-1"><a href="copyrght.htm">Copyright © 2001</a> O'Reilly & Associates. All rights reserved.</font></p><map name="library-map"> <area shape="rect" coords="2,-1,79,99" href="../index.htm"><area shape="rect" coords="84,1,157,108" href="../perlnut/index.htm"><area shape="rect" coords="162,2,248,125" href="../prog/index.htm"><area shape="rect" coords="253,2,326,130" href="../advprog/index.htm"><area shape="rect" coords="332,1,407,112" href="../cookbook/index.htm"><area shape="rect" coords="414,2,523,103" href="../sysadmin/index.htm"></map><!-- END OF BODY --></body></html>