the same as ASCII in the lowest seven bits.</p>
<blockquote class="footnote"><a name="FOOTNOTE-2"></a><p>[2] You may prefer to call them "octets"; that's okay, but we think the two words are pretty much synonymous these days, so we'll stick with the blue-collar word.</p></blockquote>
<p><a name="INDEX-2810"></a><a name="INDEX-2811"></a><a name="INDEX-2812"></a>Perl uses UTF-8 only when it thinks it is beneficial, so if all the characters in your string are in the range <tt class="literal">0..255</tt>, there's a good chance the characters are all packed in bytes--but in the absence of other knowledge, you can't be sure because internally Perl converts between fixed 8-bit characters and variable-length UTF-8 characters as necessary. The point is, you shouldn't have to worry about it most of the time, because the character semantics are preserved at an abstract level regardless of representation.</p>
<p><a name="INDEX-2813"></a>In any event, if your string contains any character numbers larger than <tt class="literal">255</tt> decimal, the string is certainly stored in UTF-8. More accurately, it is stored in Perl's extended version of UTF-8, which we call <em class="emphasis">utf8</em>, in honor of a pragma by that name, but mostly because it's easier to type. (And because "real" UTF-8 is only allowed to contain character numbers blessed by the Unicode Consortium. Perl's utf8 is allowed to contain any character numbers you need to get your job done. Perl doesn't give a rip whether your character numbers are officially correct or just correct.)</p>
<p><a name="INDEX-2814"></a><a name="INDEX-2815"></a>We said you shouldn't worry about it most of the time, but people like to worry anyway.
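For the worriers, here is a minimal sketch of how you might peek at the representation. It assumes a newer Perl than the one described here (5.8.1 or later), where the <tt class="literal">utf8::is_utf8</tt> and <tt class="literal">utf8::upgrade</tt> functions are built in; the variable names are only illustrative:<blockquote><pre class="programlisting">use strict;
use warnings;

# Assumption: Perl 5.8.1 or later, where the utf8:: functions are built in.
my $narrow = "caf\x{E9}";       # every character number is in 0..255
my $wide   = "\x{263A} smile";  # U+263A is larger than 255

for my $str ($narrow, $wide) {
    printf "%d characters, stored as %s\n",
        length($str),
        utf8::is_utf8($str) ? "utf8" : "bytes";
}

# Force a copy of $narrow into the utf8 representation.
my $copy = $narrow;
utf8::upgrade($copy);

# Different storage, same abstract characters.
print "still equal\n" if $copy eq $narrow;</pre></blockquote>The final comparison should succeed even though the two copies are stored differently, because <tt class="literal">eq</tt> compares abstract character numbers rather than their storage.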
Suppose you use a v-string to represent an IPv4 address:<blockquote><pre class="programlisting">$locaddr = v127.0.0.1;     # Certainly stored as bytes.
$oreilly = v204.148.40.9;  # Might be stored as bytes or utf8.
$badaddr = v2004.148.40.9; # Certainly stored as utf8.</pre></blockquote>Everyone can figure out that <tt class="literal">$badaddr</tt> will not work as an IP address. So it's easy to think that if O'Reilly's network address gets forced into a UTF-8 representation, it will no longer work. But the characters in the string are abstract numbers, not bytes. Anything that uses an IPv4 address, such as the <tt class="literal">gethostbyaddr</tt> function, should automatically coerce the abstract character numbers back into a byte representation (and fail on <tt class="literal">$badaddr</tt>).</p>
<p><a name="INDEX-2816"></a><a name="INDEX-2817"></a><a name="INDEX-2818"></a><a name="INDEX-2819"></a><a name="INDEX-2820"></a>The interfaces between Perl and the real world have to deal with the details of the representation. To the extent possible, existing interfaces try to do the right thing without your having to tell them what to do. But you do occasionally have to give instructions to some interfaces (such as the <tt class="literal">open</tt> function), and if you write your own interface to the real world, it will need to be either smart enough to figure things out for itself or at least smart enough to follow instructions when you want it to behave differently than it would by default.<a href="#FOOTNOTE-3">[3]</a></p>
<blockquote class="footnote"><a name="FOOTNOTE-3"></a><p>[3] On some systems, there may be ways of switching all your interfaces at once. If the <span class="option">-C</span> command-line switch is used (or the global <tt class="literal">${^WIDE_SYSTEM_CALLS}</tt> variable is set to <tt class="literal">1</tt>), all system calls will use the corresponding wide character APIs. (This is currently only implemented on Microsoft Windows.)
The current plan of the Linux community is that all interfaces will switch to UTF-8 mode if <tt class="literal">$ENV{LC_CTYPE}</tt> is set to "<tt class="literal">UTF-8</tt>". Other communities may take other approaches. Our mileage may vary.</p></blockquote>
<p>Since Perl worries about maintaining transparent character semantics within the language itself, the only place you need to worry about byte versus character semantics is in your interfaces. By default, all your old Perl interfaces to the outside world are byte-oriented, so they produce and consume byte-oriented data. That is to say, on the abstract level, all your strings are sequences of numbers in the range <tt class="literal">0..255</tt>, so if nothing in the program forces them into utf8 representations, your old program continues to work on byte-oriented data just as it did before. So put a check mark by Goal #1 above.</p>
<p><a name="INDEX-2821"></a>If you want your old program to work on new character-oriented data, you must mark your character-oriented interfaces such that Perl knows to expect character-oriented data from those interfaces. Once you've done this, Perl should automatically do any conversions necessary to preserve the character abstraction. The only difference is that you've introduced some strings into your program that are marked as potentially containing characters higher than <tt class="literal">255</tt>, so if you perform an operation between a byte string and utf8 string, Perl will internally coerce the byte string into a utf8 string before performing the operation. Typically, utf8 strings are coerced back to byte strings only when you send them to a byte interface, at which point, if the string contains characters larger than <tt class="literal">255</tt>, you have a problem that can be handled in various ways depending on the interface in question.
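As a rough sketch of that round trip, assuming a Perl of 5.8.1 or later: here <tt class="literal">utf8::downgrade</tt> merely stands in for whatever a real byte interface would attempt, and the variable names are made up for illustration:<blockquote><pre class="programlisting">use strict;
use warnings;

# Assumption: Perl 5.8.1 or later, where utf8::downgrade() is built in.
my $bytes = "na\x{EF}ve ";    # character numbers all in 0..255
my $chars = "\x{3A9}mega";    # contains U+03A9, which is larger than 255
my $mixed = $bytes . $chars;  # Perl coerces the byte string for the result

print length($mixed), " characters\n";  # counts characters, not bytes

# Try to coerce each string back to bytes, as a byte interface would.
# The true second argument asks downgrade to return false instead of dying.
for my $str ($bytes, $mixed) {
    my $copy    = $str;
    my $ok      = utf8::downgrade($copy, 1);
    my $verdict = $ok ? "fits in bytes" : "has characters above 255";
    print "$verdict\n";
}</pre></blockquote>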
So you can put a check mark by Goal #2.</p>
<p><a name="INDEX-2822"></a>Sometimes you want to mix code that understands character semantics with code that has to run with byte semantics, such as I/O code that reads or writes fixed-size blocks. In this case, you may put a <tt class="literal">use bytes</tt> declaration around the byte-oriented code to force it to use byte semantics even on strings marked as utf8 strings. You are then responsible for any necessary conversions. But it's a way of enforcing a stricter local reading of Goal #1, at the expense of a looser global reading of Goal #2.</p>
<p>Goal #3 has largely been achieved, partly by doing lazy conversions between byte and utf8 representations and partly by being sneaky in how we implement potentially slow features of Unicode, such as character property lookups in huge tables.</p>
<p>Goal #4 has been achieved by sacrificing a small amount of interface compatibility in pursuit of the other Goals. By one way of looking at it, we didn't fork into two different Perls; but by another way of looking at it, revision 5.6 of Perl <em class="emphasis">is</em> a forked version of Perl with regard to earlier versions, and we don't expect people to switch from earlier versions until they're sure the new version will do what they want. But that's always the case with new versions, so we'll allow ourselves to put a check mark by Goal #4 as well.</p>
<a name="INDEX-2870"></a><a name="INDEX-2871"></a>
<p><font size="-1"><a href="copyrght.htm">Copyright © 2001</a> O'Reilly &amp; Associates. All rights reserved.</font></p>
<!-- END OF BODY --></body></html>