<html><head><title>Character Sets and Encodings (Perl and XML)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly & Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="059600205XL" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl and XML" /><meta name="DC.Type" content="Text.Monograph" /></head><body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl & XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch03_08.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch04_01.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div><h2 class="sect1">3.9. 
Character Sets and Encodings</h2><p>No<a name="INDEX-292" /> matter<a name="INDEX-293" /> how you choose to manage your program's output, you must keep in mind the concept of character encoding -- the protocol your output XML document uses to represent the various symbols of its language, be they an alphabet of letters or a catalog of ideographs and diacritical marks. Character encoding may represent the trickiest part of XML-slinging, perhaps especially so for programmers in Western Europe and the Americas, most of whom have not explored the universe of possible encodings beyond the 128 characters of ASCII.</p><p>While it's technically legal for an XML document's <tt class="literal">encoding</tt> declaration to contain the name of any text encoding scheme, the only ones that XML processors are, according to spec, required to understand are <a name="INDEX-294" />UTF-8 and <a name="INDEX-295" />UTF-16. UTF-8 and UTF-16 are two flavors of <em class="emphasis">Unicode</em><a name="INDEX-296" />, a recent and powerful character encoding architecture that embraces every funny little squiggle a person might care to make.</p><p>In this section, we conspire with Perl and XML to nudge you gently into thinking about Unicode, if you're not pondering it already. While you can do everything described in this book by using the legacy encoding of your choice, you'll find, as time passes, that you're swimming against the current.</p><a name="perlxml-CHP-3-SECT-9.1" /><div class="sect2"><h3 class="sect2">3.9.1. Unicode, Perl, and XML</h3><p><a name="INDEX-297" />Unicode has crept in as the digital age's way of uniting the thousands of different writing systems that have paid the salaries of monks and linguists for centuries. Of course, if you program in an environment where non-ASCII characters are found in abundance, you're probably already familiar with it. 
However, even then, much of your text processing work might be restricted to low-bit Latin alphanumerics, simply because that's been the character set of choice -- of fiat, really -- for the Internet. Unicode hopes to change this trend, Perl hopes to help, and sneaky little XML is already doing so.</p><p>As any Unicode-evangelizing document will tell you,<a href="#FOOTNOTE-20">[20]</a> Unicode is great for internationalizing code. It lets programmers come up with localization solutions without the additional worry of juggling different character architectures.</p><blockquote class="footnote"><a name="FOOTNOTE-20" /><p>[20] These documents include Chapter 15 of O'Reilly's <em class="citetitle">Programming Perl, Third Edition</em> and the FAQ that the Unicode Consortium hosts at <a href="http://unicode.org/unicode/faq/">http://unicode.org/unicode/faq/</a>.</p></blockquote><p>However, Unicode's importance increases by an order of magnitude when you introduce the question of data representation. The languages that a given program's users (or programmers) might prefer are one thing, but as computing becomes more ubiquitous, it touches more people's lives in more ways every day, and some of these people speak Kurku. By understanding the basics of Unicode, you can see how it can help to transparently keep all the data you'll ever work with, no matter the script, in one architecture.</p></div><a name="perlxml-CHP-3-SECT-9.2" /><div class="sect2"><h3 class="sect2">3.9.2. Unicode Encodings</h3><p>We are careful to separate the words "architecture" and "encoding" because Unicode actually represents one of the former that contains several of the latter.</p><p>In Unicode, every discrete squiggle that's gained official recognition, from A to <img align="absmiddle" src="figs/U03B1.gif" /> to <img src="figs/smile.gif">, has its own<a name="INDEX-298" /> <em class="emphasis">code point</em> -- a unique positive integer that serves as its <a name="INDEX-299" />address in the whole map of Unicode. 
For example, the first letter of the Latin alphabet, capitalized, lives at the hexadecimal address <tt class="literal">0x0041</tt> (as it does in ASCII and friends), and the other two symbols, the lowercase Greek alpha and the smiley face, are found at <tt class="literal">0x03B1</tt> and <tt class="literal">0x263A</tt>, respectively. A character can be constructed from any one of these code points, or by combining several of them. Many code points are dedicated to holding the various <a name="INDEX-300" />diacritical marks, such as accents and radicals, that many scripts use in conjunction with<a name="INDEX-301" /> base alphabetical or <a name="INDEX-302" />ideographic glyphs. </p><p>These addresses, as well as those of the tens of thousands (and, in time, hundreds of thousands) of other glyphs on the map, remain true across Unicode's encodings. The only difference lies in the way these numbers are encoded in the ones and zeros that make up the document at its lowest level.</p><p>Unicode officially supports three types of encoding, all named<a name="INDEX-303" /> <a name="INDEX-304" />UTF (short for Unicode Transformation Format), followed by a number representing the smallest bit-size any character might take. The encodings are UTF-8, UTF-16, and UTF-32. UTF-8 is the most flexible of all, and is therefore the one that Perl has adopted.</p><a name="perlxml-CHP-3-SECT-9.2.1" /><div class="sect3"><h3 class="sect3">3.9.2.1. UTF-8</h3><p>The <a name="INDEX-305" />UTF-8 encoding, arguably the most Perlish in its impish trickery, is also the most efficient since it's the only one that can pack characters into single bytes. For that reason, UTF-8 is the default encoding for XML documents: if XML documents specify no encoding in their declarations, then processors should assume that they use UTF-8.</p><p>Each character appearing within a document encoded with UTF-8 uses as many bytes as it has to in order to represent that character's code point, up to a maximum of six bytes in the original UTF-8 design (RFC 3629 has since capped this at four). 
Thus, the character A, with the itty-bitty address of <tt class="literal">0x41</tt>, gets one byte to represent it, while our friend <img src="figs/smile.gif"> lives way up the street in one of Unicode's blocks of miscellaneous doohickeys, with the address <tt class="literal">0x263A</tt>. It takes three bytes for itself -- two for the character's code point number and one that signals to text processors that there are, in fact, multiple bytes to this character. Several centuries from now, after Earth begrudgingly joins the Galactic Friendship Union and we find ourselves needing to encode the characters from countless off-planet civilizations, bytes four through six will come in quite handy. </p></div><a name="perlxml-CHP-3-SECT-9.2.2" /><div class="sect3"><h3 class="sect3">3.9.2.2. UTF-16</h3><p>The UTF-16 encoding uses a full two bytes to represent the character in question, even if its ordinal is small enough to fit into one (which is how UTF-8 would handle it). If, on the other hand, the character is rare enough to have a very high ordinal, then it gets an additional two bytes tacked onto it (the two 16-bit units together are called a surrogate pair), bringing that one character's total length to four bytes.</p><a name="ch03-25-fm2xml" /><blockquote><b>TIP:</b> Because Unicode 2.0 used a 16-bits-per-character style as its sole supported encoding, many people, and the programs they write, talk about the "Unicode encoding" when they really mean Unicode UTF-16. Even new applications' "Save As..." dialog boxes sometimes offer "Unicode" and "UTF-8" as separate choices, even though these labels don't make much sense in Unicode 3.2 terminology.</p></blockquote></div><a name="perlxml-CHP-3-SECT-9.2.3" /><div class="sect3"><h3 class="sect3">3.9.2.3. UTF-32</h3><p><a name="INDEX-306" />UTF-32 works a lot like UTF-16, but eliminates any question of variable character size by declaring that every invoked Unicode-mapped glyph shall occupy exactly four bytes. 
Because of its maximum maximosity, this encoding doesn't see much practical use, since all but the most unusual communication would have significantly more than half of its total mass made up of leading zeros, which doesn't work wonders for efficiency. However, if guaranteed character width is an inflexible issue, this encoding can handle all the million-plus glyph addresses that Unicode accommodates. Of the three major Unicode encodings, UTF-32 is the one that XML <a name="INDEX-307" />parsers aren't obliged to understand. Hence, you probably don't need to worry about it, either.</p></div></div><a name="perlxml-CHP-3-SECT-9.3" /><div class="sect2"><h3 class="sect2">3.9.3. Other Encodings</h3><p>The XML standard defines 21 names for character sets that parsers might use (beyond the two they're required to know, UTF-8 and UTF-16). These names range from <tt class="literal">ISO-8859-1</tt> (ASCII plus 128 more code positions, largely accented Latin letters and symbols used in Western European languages) to <tt class="literal">Shift_JIS</tt>, a Microsoftian encoding for Japanese ideographs. While they're not Unicode encodings per se, each character within them maps to one or more Unicode code points (and vice versa, allowing for round-tripping between common encodings by way of Unicode).</p><p>XML parsers in Perl all have their own ways of dealing with other encodings. Some may need an extra little nudge. <tt class="literal">XML::Parser</tt><a name="INDEX-308" />, for example, is weak in its raw state because its underlying library, Expat, understands only a handful of non-Unicode encodings. Fortunately, you can give it a helping hand by installing Clark Cooper's <tt class="literal">XML::Encoding</tt> module, an <tt class="literal">XML::Parser</tt> subclass that can read and understand map files (themselves XML documents) that bind the character code points of other encodings to their Unicode addresses.</p><a name="perlxml-CHP-3-SECT-9.3.1" /><div class="sect3"><h3 class="sect3">3.9.3.1. 
Core Perl support</h3><p>As with XML, Perl's relationship with Unicode has heated up at a cautious but inevitable pace.<a href="#FOOTNOTE-21">[21]</a> Generally, you should use Perl version 5.6 or greater to work with Unicode properly in your code. If you do have 5.6 or greater, consult its <tt class="literal">perlunicode</tt> manpage for details on how deep its support runs, as each release since then has gradually deepened its loving embrace with Unicode. If you have an even earlier Perl, whew, you really ought to consider upgrading it. You can eke by with some of the tools we'll mention later in this chapter, but hacking Perl and XML means hacking in Unicode, and you'll notice the lack of core support for it.