
<html><head><title>Unicode (Programming Perl)</title><!-- STYLESHEET --><link rel="stylesheet" type="text/css" href="../style/style1.css"><!-- METADATA --><!--Dublin Core Metadata--><meta name="DC.Creator" content=""><meta name="DC.Date" content=""><meta name="DC.Format" content="text/xml" scheme="MIME"><meta name="DC.Generator" content="XSLT stylesheet, xt by James Clark"><meta name="DC.Identifier" content=""><meta name="DC.Language" content="en-US"><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc."><meta name="DC.Source" content="" scheme="ISBN"><meta name="DC.Subject.Keyword" content=""><meta name="DC.Title" content="Unicode"><meta name="DC.Type" content="Text.Monograph"></head><body><!-- START OF BODY --><!-- TOP BANNER --><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home"><map name="banner-map"><AREA SHAPE="RECT" COORDS="0,0,466,71" HREF="index.htm" ALT="Programming Perl"><AREA SHAPE="RECT" COORDS="467,0,514,18" HREF="jobjects/fsearch.htm" ALT="Search this book"></map><!-- TOP NAV BAR --><div class="navbar"><table width="515" border="0"><tr><td align="left" valign="top" width="172"><a href="part3.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0"></a></td><td align="center" valign="top" width="171"><a href="part3.htm">Part 3: Perl as Technology</a></td><td align="right" valign="top" width="172"><a href="ch15_02.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0"></a></td></tr></table></div><hr width="515" align="left"><!-- SECTION BODY --><h1 class="chapter">Chapter 15.  
Unicode</h1><div class="htmltoc"><h4 class="tochead">Contents:</h4><p><a href="ch15_01.htm">Building Character</a><br><a href="ch15_02.htm">Effects of Character Semantics</a><br><a href="ch15_03.htm">Caution, <img border="0" src="figs/ren2_bold.gif"> Working</a><br></p></div><p><a name="INDEX-2799"></a><a name="INDEX-2800"></a>If you do not yet know what Unicode is, you will soon--even if you skip reading this chapter--because working with Unicode is becoming a necessity.  (Some people think of it as a necessary evil, but it's really more of a necessary good.  In either case, it's a necessary pain.)</p><p>Historically, people made up character sets to reflect what they needed to do in the context of their own culture.  Since people of all cultures are naturally lazy, they've tended to include only the symbols they needed, excluding the ones they didn't need.  That worked fine as long as we were only communicating with other people of our own culture, but now that we're starting to use the Internet for cross-cultural communication, we're running into problems with the exclusive approach.  It's hard enough to figure out how to type accented characters on an American keyboard.  How in the world (literally) can one write a multilingual web page?</p><p><a name="INDEX-2801"></a><a name="INDEX-2802"></a>Unicode is the answer, or at least part of the answer (see also XML).  Unicode is an inclusive rather than an exclusive character set.  While people can and do haggle over the various details of Unicode (and there are plenty of details to haggle over), the overall intent is to make everyone sufficiently happy<a href="#FOOTNOTE-1">[1]</a> with Unicode so that they'll willingly use Unicode as the international medium of exchange for textual data.  Nobody is forcing you to use Unicode, just as nobody is forcing you to read this chapter (we hope).  People will always be allowed to use their old exclusive character sets within their own culture.  
But in that case (as we say), portability suffers.</p><blockquote class="footnote"><a name="FOOTNOTE-1"></a><p>[1] Or in some cases, insufficiently unhappy.</p></blockquote><p><a name="INDEX-2803"></a><a name="INDEX-2804"></a><a name="INDEX-2805"></a>The Law of Conservation of Suffering says that if we reduce the suffering in one place, suffering must increase elsewhere.  In the case of Unicode, we must suffer the migration from byte semantics to character semantics.  Since, through an accident of history, Perl was invented by an American, Perl has historically confused the notions of bytes and characters.  In migrating to Unicode, Perl must somehow unconfuse them.</p><p>Paradoxically, by getting Perl itself to unconfuse bytes and characters, we can allow the Perl programmer to confuse them, relying on Perl to keep them straight, just as we allow programmers to confuse numbers and strings and rely on Perl to convert back and forth as necessary.  To the extent possible, Perl's approach to Unicode is the same as its approach to everything else: Just Do The Right Thing.  Ideally, we'd like to achieve these four Goals:</p><dl><dt><b>Goal #1:</b></dt><dd><p>Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.</p></dd><dt><b>Goal #2:</b></dt><dd><p>Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.</p></dd><dt><b>Goal #3:</b></dt><dd><p>Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.</p></dd><dt><b>Goal #4:</b></dt><dd><p>Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.</p></dd></dl><p>Taken together, these Goals are practically impossible to reach.  But we've come remarkably close.  Or rather, we're still in the process of coming remarkably close, since this is a work in progress.  As Unicode continues to evolve, so will Perl.  
But our overarching plan is to provide a safe migration path that gets us where we want to go with minimal casualties along the way.  How we do that is the subject of the next section.</p><h2 class="sect1">15.1. Building Character</h2><p><a name="INDEX-2806"></a><a name="INDEX-2807"></a><a name="INDEX-2808"></a><a name="INDEX-2809"></a>In releases of Perl prior to 5.6, all strings were viewed as sequences of bytes.<a href="#FOOTNOTE-2">[2]</a> In versions 5.6 and later, however, a string may contain characters wider than a byte.  We now view strings not as sequences of bytes, but as sequences of numbers in the range <tt class="literal">0 .. 2**32-1</tt> (or in the case of 64-bit computers, <tt class="literal">0 .. 2**64-1</tt>).  These numbers represent abstract characters, and the larger the number, the "wider" the character, in some sense; but unlike many languages, Perl is not tied to any particular width of character representation.  Perl uses a variable-length encoding (based on UTF-8), so these abstract character numbers may, or may not, be packed one number per byte.  Obviously, character number <tt class="literal">18,446,744,073,709,551,615</tt> (that is, "<tt class="literal">\x{ffff_ffff_ffff_ffff}</tt>") is never going to fit into a byte (in fact, it takes 13 bytes), but if all the characters in your string are in the range <tt class="literal">0..127</tt> decimal, then they are certainly packed one per byte, since UTF-8 is
