<html><head><title>Effects of Character Semantics (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Effects of Character Semantics"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="ch15_01.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="ch15_01.htm">Chapter 15: Unicode</a></td><td align="right" valign="top" width="172"><a href="ch15_03.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h2 class="sect1">15.2. Effects of Character Semantics</h2><p><a name="INDEX-2823"></a><a name="INDEX-2824"></a><a name="INDEX-2825"></a>The upshot of all this is that a typical built-in operator will operate on characters unless it is in the scope of a <tt class="literal">use bytes</tt> pragma.  
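As a minimal sketch of that distinction (the variable names are illustrative assumptions, not taken from the text), <tt class="literal">length</tt> counts a wide character as one character under the default semantics, but counts its UTF-8 bytes inside a block that declares <tt class="literal">use bytes</tt>:<blockquote><pre class="programlisting">use strict;
use warnings;

my $smiley = "\x{263A}";          # One character wider than Latin-1.
my $chars  = length($smiley);     # Character semantics: 1.
my $bytes;
{
    use bytes;                    # Byte semantics in this block only.
    $bytes = length($smiley);     # Counts the UTF-8 bytes: 3.
}
print "$chars character, $bytes bytes\n";</pre></blockquote>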
However, even outside the scope of <tt class="literal">use bytes</tt>, if all of the operands of the operator are stored as 8-bit characters (that is, none of the operands are stored in utf8), then character semantics are indistinguishable from byte semantics, and the result of the operator will be stored in 8-bit form internally.  This preserves backward compatibility as long as you don't feed your program any characters wider than Latin-1.</p>
<p><a name="INDEX-2826"></a>The <tt class="literal">utf8</tt> pragma is primarily a compatibility device that enables recognition of UTF-8 in literals and identifiers encountered by the parser.  It may also be used for enabling some of the more experimental Unicode support features.  Our long-term goal is to turn the <tt class="literal">utf8</tt> pragma into a no-op.</p>
<p><a name="INDEX-2827"></a><a name="INDEX-2828"></a>The <tt class="literal">use bytes</tt> pragma will never turn into a no-op.  Not only is it necessary for byte-oriented code, but it also has the side effect of defining byte-oriented wrappers around certain functions for use outside the scope of <tt class="literal">use bytes</tt>.  As of this writing, the only defined wrapper is for <tt class="literal">length</tt>, but there are likely to be more as time goes by.  
To use such a wrapper, say:<blockquote><pre class="programlisting">use bytes ();   # Load wrappers without importing byte semantics.
...
$charlen =        length("\x{ffff_ffff}");   # Returns 1.
$bytelen = bytes::length("\x{ffff_ffff}");   # Returns 7.</pre></blockquote>Outside the scope of a <tt class="literal">use bytes</tt> declaration, Perl version 5.6 works (or at least, is intended to work) like this:</p>
<ul><li><p>Strings and patterns may now contain characters that have an ordinal value larger than 255:<blockquote><pre class="programlisting">use utf8;
$convergence = "<img src="figs/righthand.gif">&nbsp;<img src="figs/lefthand.gif">";</pre></blockquote><a name="INDEX-2829"></a><a name="INDEX-2830"></a>Presuming you have a Unicode-capable editor to edit your program, such characters will typically occur directly within the literal strings as UTF-8 characters.  For now, you have to declare a <tt class="literal">use utf8</tt> at the top of your program to enable the use of UTF-8 in literals.</p>
<p><a name="INDEX-2831"></a>If you don't have a Unicode editor, you can always specify a particular character in ASCII with an extension of the <tt class="literal">\x</tt> notation.  A character in the Latin-1 range may be written either as <tt class="literal">\x{ab}</tt> or as <tt class="literal">\xab</tt>, but if the number exceeds two hexadecimal digits, you must use braces.  Unicode characters are specified by putting the hexadecimal code within braces after the <tt class="literal">\x</tt>.  For instance, a Unicode smiley face is <tt class="literal">\x{263A}</tt>.  
There is no syntactic construct in Perl that assumes Unicode characters are exactly 16 bits, so you may not use <tt class="literal">\u263A</tt> as you can in other languages; <tt class="literal">\x{263A}</tt> is the closest equivalent.</p>
<p><a name="INDEX-2832"></a><a name="INDEX-2833"></a>For inserting named characters via <tt class="literal">\N{</tt><em class="replaceable">CHARNAME</em><tt class="literal">}</tt>, see the <tt class="literal">use charnames</tt> pragma in <a href="ch31_01.htm">Chapter 31, "Pragmatic Modules"</a>.</p></li>
<li><p> Identifiers within the Perl script may contain Unicode alphanumeric characters, including ideographs:<a name="INDEX-2834"></a></p><p></p><p><blockquote><pre class="programlisting">use utf8;
$<img src="figs/ren2_bold.gif">&nbsp;++;        # A child is born.</pre></blockquote>Again, <tt class="literal">use utf8</tt> is needed (for now) to recognize UTF-8 in your script. You are currently on your own when it comes to using the canonical forms of characters--Perl doesn't (yet) attempt to canonicalize variable names for you.  We recommend that you canonicalize your programs to Normalization Form C, since that's what Perl will someday canonicalize to by default.  See www.unicode.org for the latest technical report on canonicalization.<a name="INDEX-2835"></a></p></li>
<li><p><a name="INDEX-2836"></a><a name="INDEX-2837"></a>Regular expressions match characters instead of bytes.  For instance, dot matches a character instead of a byte.  If the Unicode Consortium ever gets around to approving the Tengwar script, then (despite the fact that such characters are represented in four bytes of UTF-8), this matches:<blockquote><pre class="programlisting">"\N{TENGWAR LETTER SILME NUQUERNA}" =~ /^.$/</pre></blockquote><a name="INDEX-2838"></a>The <tt class="literal">\C</tt> pattern is provided to force a match on a single byte ("<tt class="literal">char</tt>" in C, hence <tt class="literal">\C</tt>).  
Use <tt class="literal">\C</tt> with care, since it can put you out of sync with the character boundaries in your string, and you may get "Malformed UTF-8 character" errors.  You may not use <tt class="literal">\C</tt> in square brackets, since it doesn't represent any particular character or set of characters.</p></li>
<li><p><a name="INDEX-2839"></a><a name="INDEX-2840"></a><a name="INDEX-2841"></a> Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database.  So <tt class="literal">\w</tt> can be used to match an ideograph:<blockquote><pre class="programlisting">"<img src="figs/ren2_bold.gif">&nbsp;" =~ /\w/</pre></blockquote></p></li>
<li><p><a name="INDEX-2842"></a><a name="INDEX-2843"></a><a name="INDEX-2844"></a>Named Unicode properties and block ranges can be used as character classes via the new <tt class="literal">\p</tt> (matches property) and <tt class="literal">\P</tt> (doesn't match property) constructs.  For instance, <tt class="literal">\p{Lu}</tt> matches any character with the Unicode uppercase property, while <tt class="literal">\p{M}</tt> matches any mark character.  Single-letter properties may omit the braces, so mark characters can be matched by <tt class="literal">\pM</tt> also.  Many predefined character classes are available, such as <tt class="literal">\p{IsMirrored}</tt> and <tt class="literal">\p{InTibetan}</tt>:<blockquote><pre class="programlisting">"\N{greek:Iota}" =~ /\p{Lu}/</pre></blockquote>You may also use <tt class="literal">\p</tt> and <tt class="literal">\P</tt> within square bracket character classes. (In version 5.6.0 of Perl, you need to <tt class="literal">use utf8</tt> for character properties to