work right.  This restriction will be lifted in the future.)  See <a href="ch05_01.htm">Chapter 5, "Pattern Matching"</a>, for details of matching on Unicode properties.<a name="INDEX-2845"></a></p></li>
<li><p><a name="INDEX-2846"></a><a name="INDEX-2847"></a>The special pattern <tt class="literal">\X</tt> matches any extended Unicode sequence (a "combining character sequence" in Standardese), where the first character is a base character and subsequent characters are mark characters that apply to the base character.  It is equivalent to <tt class="literal">(?:\PM\pM*)</tt>:<blockquote><pre class="programlisting">"o\N{COMBINING TILDE BELOW}" =~ /\X/</pre></blockquote>You may not use <tt class="literal">\X</tt> in square brackets, because it might match multiple characters and it doesn't match any particular character or set of characters.</p></li>
<li><p><a name="INDEX-2848"></a>The <tt class="literal">tr///</tt> operator transliterates characters instead of bytes.  To turn all characters outside the Latin-1 range into a question mark, you could say:<blockquote><pre class="programlisting">tr/\0-\x{10ffff}/\0-\xff?/;       # utf8 to latin1 char</pre></blockquote></p></li>
<li><p><a name="INDEX-2849"></a><a name="INDEX-2850"></a><a name="INDEX-2851"></a><a name="INDEX-2852"></a><a name="INDEX-2853"></a>Case translation operators use the Unicode case translation tables when provided character input.  Note that <tt class="literal">uc</tt> translates to uppercase, while <tt class="literal">ucfirst</tt> translates to titlecase (for languages that make the distinction).
Naturally the corresponding backslash sequences have the same semantics:<blockquote><pre class="programlisting">$x = "\u$word";       # titlecase first letter of $word
$x = "\U$word";       # uppercase $word
$x = "\l$word";       # lowercase first letter of $word
$x = "\L$word";       # lowercase $word</pre></blockquote>Be careful, because the Unicode case translation tables don't attempt to provide round-trip mappings in every instance, particularly for languages that use different numbers of characters for titlecase or uppercase than they do for the equivalent lowercase letter.  As they say in the standard, while the case properties themselves are normative, the case mappings are only informational.<a name="INDEX-2854"></a></p></li>
<li><p><a name="INDEX-2855"></a><a name="INDEX-2856"></a>Most operators that deal with positions or lengths in the string will automatically switch to using character positions, including <tt class="literal">chop</tt>, <tt class="literal">substr</tt>, <tt class="literal">pos</tt>, <tt class="literal">index</tt>, <tt class="literal">rindex</tt>, <tt class="literal">sprintf</tt>, <tt class="literal">write</tt>, and <tt class="literal">length</tt>.  Operators that deliberately don't switch include <tt class="literal">vec</tt>, <tt class="literal">pack</tt>, and <tt class="literal">unpack</tt>.
Operators that really don't care include <tt class="literal">chomp</tt>, as well as any other operator that treats a string as a bucket of bits, such as the default <tt class="literal">sort</tt> and the operators dealing with filenames.</p><blockquote><pre class="programlisting">use bytes;
$bytelen = length("I do <img src="figs/he2.gif">&nbsp;<img src="figs/qi4.gif">&nbsp; <img src="figs/dao4.gif"> &nbsp;.");   # 15 bytes
no bytes;
$charlen = length("I do <img src="figs/he2.gif">&nbsp;<img src="figs/qi4.gif">&nbsp;<img src="figs/dao4.gif">&nbsp;.");   # but 9 characters</pre></blockquote></li>
<li><p><a name="INDEX-2857"></a><a name="INDEX-2858"></a><a name="INDEX-2859"></a><a name="INDEX-2860"></a>The <tt class="literal">pack</tt>/<tt class="literal">unpack</tt> letters "<tt class="literal">c</tt>" and "<tt class="literal">C</tt>" do <em class="emphasis">not</em> change, since they're often used for byte-oriented formats.  (Again, think "<tt class="literal">char</tt>" in the C language.)  However, there is a new "<tt class="literal">U</tt>" specifier that will convert between UTF-8 characters and integers:<blockquote><pre class="programlisting">pack("U*", 1, 20, 300, 4000) eq v1.20.300.4000</pre></blockquote></p></li>
<li><p><a name="INDEX-2861"></a><a name="INDEX-2862"></a>The <tt class="literal">chr</tt> and <tt class="literal">ord</tt> functions work on characters:<blockquote><pre class="programlisting">chr(1).chr(20).chr(300).chr(4000) eq v1.20.300.4000</pre></blockquote>In other words, <tt class="literal">chr</tt> and <tt class="literal">ord</tt> are like <tt class="literal">pack("U")</tt> and <tt class="literal">unpack("U")</tt>, not like <tt class="literal">pack("C")</tt> and <tt class="literal">unpack("C")</tt>.
In fact, the latter two are how you now emulate byte-oriented <tt class="literal">chr</tt> and <tt class="literal">ord</tt> if you're too lazy to <tt class="literal">use bytes</tt>.</p></li>
<li><p><a name="INDEX-2863"></a>And finally, <tt class="literal">scalar reverse</tt> reverses by character rather than by byte:</p><p><blockquote><pre class="programlisting">"<img src="figs/righthand.gif">&nbsp;<img src="figs/lefthand.gif">" eq reverse "<img src="figs/lefthand.gif">&nbsp;<img src="figs/righthand.gif">"</pre></blockquote></p></li></ul>
<p><a name="INDEX-2864"></a><a name="INDEX-2865"></a>If you look in directory <em class="replaceable">PATH_TO_PERLLIB/unicode</em>, you'll find a number of files that have to do with defining the semantics above.  The Unicode properties database from the Unicode Consortium is in a file called <em class="emphasis">Unicode.300</em> (for Unicode 3.0).  This file has already been processed by <em class="emphasis">mktables.PL</em> into lots of little <em class="emphasis">.pl</em> files in the same directory (and in subdirectories <em class="emphasis">Is/</em>, <em class="emphasis">In/</em>, and <em class="emphasis">To/</em>), some of which are automatically slurped in by Perl to implement things like <tt class="literal">\p</tt> (see the <em class="emphasis">Is/</em> and <em class="emphasis">In/</em> directories) and <tt class="literal">uc</tt> (see the <em class="emphasis">To/</em> directory).  Other files are slurped in by modules like the <tt class="literal">use charnames</tt> pragma (see <em class="emphasis">Name.pl</em>).
But as of this writing, there are still a number of files that are just sitting there waiting for you to write an access module for them:<blockquote><pre class="programlisting"><em class="emphasis">ArabLink.pl</em>
<em class="emphasis">ArabLnkGrp.pl</em>
<em class="emphasis">Bidirectional.pl</em>
<em class="emphasis">Block.pl</em>
<em class="emphasis">Category.pl</em>
<em class="emphasis">CombiningClass.pl</em>
<em class="emphasis">Decomposition.pl</em>
<em class="emphasis">JamoShort.pl</em>
<em class="emphasis">Number.pl</em>
<em class="emphasis">To/Digit.pl</em></pre></blockquote>A much more readable summary of Unicode, with many hyperlinks, is in <em class="replaceable">PATH_TO_PERLLIB</em><em class="emphasis">/unicode/Unicode3.html</em>.<a name="INDEX-2866"></a></p>
<p>Note that when the Unicode Consortium comes out with a new version, some of these filenames are likely to change, so you'll have to poke around.  You can find <em class="replaceable">PATH_TO_PERLLIB</em> with the following incantation:<blockquote><pre class="programlisting">% <tt class="userinput"><b>perl -MConfig -le 'print $Config{privlib}'</b></tt></pre></blockquote>To find out just about everything there is to find out about Unicode, you should check out <em class="emphasis">The Unicode Standard, Version 3.0</em> (ISBN 0-201-61633-5). <a name="INDEX-2867"></a></p>
<!-- BOTTOM NAV BAR --><hr width="515" align="left"><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch15_01.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="index.htm"><img src="../gifs/txthome.gif" alt="Home" border="0"></a></td><td align="right" valign="top" width="172"><a href="ch15_03.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr><tr><td align="left" valign="top" width="172">15.1. 
Building Character</td><td align="center" valign="top" width="171"><a href="index/index.htm"><img src="../gifs/index.gif" alt="Book Index" border="0"></a></td><td align="right" valign="top" width="172">15.3. Caution, <img src="figs/ren2_bold.gif"> Working</td></tr></table></div><hr width="515" align="left"><!-- LIBRARY NAV BAR --><img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p><font size="-1"><a href="copyrght.htm">Copyright &copy; 2001</a> O'Reilly &amp; Associates. All rights reserved.</font></p><map name="library-map"> <area shape="rect" coords="2,-1,79,99" href="../index.htm"><area shape="rect" coords="84,1,157,108" href="../perlnut/index.htm"><area shape="rect" coords="162,2,248,125" href="../prog/index.htm"><area shape="rect" coords="253,2,326,130" href="../advprog/index.htm"><area shape="rect" coords="332,1,407,112" href="../cookbook/index.htm"><area shape="rect" coords="414,2,523,103" href="../sysadmin/index.htm"></map><!-- END OF BODY --></body></html>