<html><head><title>Unicode, Character Sets, and Encodings (Perl and XML)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly & Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="059600205XL" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl and XML" /><meta name="DC.Type" content="Text.Monograph" /></head><body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl & XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch02_05.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch02_07.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div><h2 class="sect1">2.6. Unicode, Character Sets, and Encodings</h2><p>At<a name="INDEX-115" /> low levels, computers<a name="INDEX-116" /> see text as a series of non-negative<a name="INDEX-117" /> integers mapped onto character sets, which are collections of numbered characters (and sometimes control codes) that some standards body created. A very common collection is the venerable <a name="INDEX-118" />US-ASCII character set, which contains 128 characters, including upper- and lowercase letters of the Latin alphabet, numerals, various symbols and space characters, and a few special print codes inherited from the old days of teletype terminals. 
By adding on the eighth bit, this 7-bit system is extended into a larger set with twice as many characters, such as ISO Latin-1, used in many Unix systems. These characters include other European characters, such as Latin letters with accents, Icelandic characters, ligatures, footnote marks, and legal symbols. Alas, humanity, a species bursting with both creativity and pride, has invented many more linguistic symbols than can be mapped onto an 8-bit number.</p><p>For this reason, a new character encoding architecture called Unicode has gained acceptance as the standard way to represent every written script in which people might want to store data (or write computer code). Depending on the flavor used, it uses up to 32 bits to describe a character, giving the standard room for more than a million individual characters. Since the early 1990s, the <a name="INDEX-119" />Unicode Consortium has been filling up this space with characters ranging from the entire <a name="INDEX-120" />Han <a name="INDEX-121" />Chinese character set to various mathematical, notational, and signage symbols, and still leaves the encoding space with enough room to grow for the coming millennium or two.</p><p>Given all this effort we're putting into hyping it, it shouldn't surprise you to learn that, while an XML document can use any type of encoding, it will by default assume the Unicode-flavored, variable-length encoding known as <a name="INDEX-122" />UTF-8. This encoding uses between one and four bytes per character (its original design allowed up to six). A character whose Unicode address is 127 or lower takes a single byte; a higher address takes a multibyte sequence whose first byte also signals the sequence's length in bytes. 
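</p><p>The byte counts are easy to check for yourself. What follows is a minimal sketch in Python rather than code from this book; the sample characters and the helper name utf8_length are illustrative assumptions. It relies on Python's built-in UTF-8 codec, which follows the current definition of UTF-8 and so never produces more than four bytes per character.</p>
<pre>
```python
# Illustrative sample characters (assumptions, not from the book):
# U+0041 Latin capital A, U+00E9 e with acute accent,
# U+6F22 a Han character, U+1F600 an emoji outside the 16-bit range.
samples = ["\u0041", "\u00e9", "\u6f22", "\U0001F600"]


def utf8_length(ch):
    """Return how many bytes Python's UTF-8 codec uses for one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")

# Text that stays within US-ASCII is byte-for-byte identical in UTF-8.
ascii_text = "Perl and XML"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII and UTF-8 bytes identical:", same_bytes)

# A character above 127 is one byte in ISO Latin-1 but two bytes in UTF-8.
latin1_bytes = "\u00e9".encode("latin-1")
utf8_bytes = "\u00e9".encode("utf-8")
print("Latin-1:", latin1_bytes, "UTF-8:", utf8_bytes)
```
</pre>
<p>Run as written, the loop reports one, two, three, and four bytes for the four samples, and the last line shows the same accented letter stored as one byte in ISO Latin-1 and two bytes in UTF-8.</p><p>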
It's possible to write an entire document in 1-byte characters and have it be indistinguishable from US-ASCII, a humble address block with addresses ranging from 0 to 127. (The first 256 Unicode addresses match <a name="INDEX-123" />ISO Latin-1, but in UTF-8 the addresses above 127 take two bytes each.) If you need the occasional high character, or if you need a lot of them (as you would when storing Asian-language data, for example), it's easy to encode in UTF-8. Unicode-aware processors handle the encoding correctly and display the right glyphs, while many older applications simply pass the multibyte characters through unharmed. Since Version 5.6, Perl has handled UTF-8 characters with increasing finesse. We'll discuss Perl's handling of Unicode in more depth in <a href="ch03_01.htm">Chapter 3, "XML Basics: Reading and Writing"</a>.</p><hr width="684" align="left" /><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch02_05.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"><img alt="Home" border="0" src="../gifs/txthome.gif" /></a></td><td align="right" valign="top" width="228"><a href="ch02_07.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr><tr><td align="left" valign="top" width="228">2.5. Entities</td><td align="center" valign="top" width="228"><a href="index/index.htm"><img alt="Book Index" border="0" src="../gifs/index.gif" /></a></td><td align="right" valign="top" width="228">2.7. The XML Declaration</td></tr></table></div><hr width="684" align="left" /><img alt="Library Navigation Links" border="0" src="../gifs/navbar.gif" usemap="#library-map" /><p><p><font size="-1"><a href="copyrght.htm">Copyright © 2002</a> O'Reilly & Associates. 
All rights reserved.</font></p><map name="library-map"><area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map></body></html>