<html><head><title>Unicode, Character Sets, and Encodings (Perl and XML)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="059600205XL" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl and XML" /><meta name="DC.Type" content="Text.Monograph" /></head><body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl &amp; XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch02_05.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch02_07.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div><h2 class="sect1">2.6. Unicode, Character Sets, and Encodings</h2><p>At<a name="INDEX-115" /> low levels, computers<a name="INDEX-116" /> see text as a series of non-negative<a name="INDEX-117" /> integers mapped onto character sets, which are collections of numbered characters (and sometimes control codes) that some standards body created. A very common collection is the venerable <a name="INDEX-118" />US-ASCII character set, which contains 128 characters, including upper- and lowercase letters of the Latin alphabet, numerals, various symbols and space characters, and a few special control codes inherited from the old days of teletype terminals. 
By adding on the eighth bit, this 7-bit system is extended into a larger set with twice as many characters, such as ISO Latin-1, used in many Unix systems. These characters include other European characters, such as Latin letters with accents, Icelandic characters, ligatures, footnote marks, and legal symbols. Alas, humanity, a species bursting with both creativity and pride, has invented many more linguistic symbols than can be mapped onto an 8-bit number.</p><p>For this reason, a new character encoding architecture called Unicode has gained acceptance as the standard way to represent every written script in which people might want to store data (or write computer code). Depending on the flavor used, it uses up to 32 bits to describe a character, giving the standard room for more than a million individual characters. For over a decade, the <a name="INDEX-119" />Unicode Consortium has been filling up this space with characters ranging from the entire <a name="INDEX-120" />Han <a name="INDEX-121" />Chinese character set to various mathematical, notational, and signage symbols, and the encoding space still has enough room to grow for the coming millennium or two.</p><p>Given all this effort we're putting into hyping it, it shouldn't surprise you to learn that, while an XML document can use any type of encoding, it will by default assume the Unicode-flavored, variable-length encoding known as <a name="INDEX-122" />UTF-8. This encoding uses between one and four bytes per character (its original design allowed up to six). Characters in the US-ASCII range, with addresses from 0 to 127, take a single byte; any character with an address greater than 127 is spread across a multibyte sequence whose first byte also indicates the sequence's length in bytes. 
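</p><p>As a rough illustration of these byte counts, here is a minimal sketch, written in Python rather than Perl purely for brevity. The <code>samples</code> table and the <code>utf8_length</code> helper are illustrative names chosen for this example, not part of any library; the sketch only assumes Python's built-in <code>str.encode</code> and <code>ord</code>.</p>
<pre>
```python
# Sketch: how many bytes UTF-8 needs for characters from different ranges.
# The sample characters below are arbitrary picks for illustration.
samples = {
    "A": "US-ASCII letter",
    "\u00e9": "Latin-1 letter e with acute accent",
    "\u6f22": "Han character",
    "\U0001d11e": "musical symbol G clef, beyond U+FFFF",
}


def utf8_length(ch):
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch, description in samples.items():
    code_point = ord(ch)  # the character's Unicode address
    print(f"U+{code_point:04X} {description}: {utf8_length(ch)} byte(s) in UTF-8")

# Pure ASCII text is byte-for-byte identical in US-ASCII, ISO Latin-1, and UTF-8.
ascii_text = "Perl and XML"
assert ascii_text.encode("ascii") == ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# A Latin-1 character above 127 is one byte in ISO Latin-1 but two in UTF-8.
assert len("\u00e9".encode("latin-1")) == 1
assert len("\u00e9".encode("utf-8")) == 2
```
</pre>
<p>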
It's possible to write an entire document in 1-byte characters and have it be indistinguishable from US-ASCII (a humble address block with addresses ranging from 0 to 127); the upper half of <a name="INDEX-123" />ISO Latin-1 (addresses 128 to 255) already takes two bytes per character in UTF-8. If you need the occasional high character, or if you need a lot of them (as you would when storing Asian-language data, for example), it's easy to encode in UTF-8. Unicode-aware processors handle the encoding correctly and display the right glyphs, while older applications that treat text as plain 8-bit bytes typically pass the multibyte characters through unharmed, even if they cannot display them. Since Version 5.6, Perl has handled UTF-8 characters with increasing finesse. We'll discuss Perl's handling of Unicode in more depth in <a href="ch03_01.htm">Chapter 3, "XML Basics: Reading and Writing"</a>.</p><hr width="684" align="left" /><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch02_05.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"><img alt="Home" border="0" src="../gifs/txthome.gif" /></a></td><td align="right" valign="top" width="228"><a href="ch02_07.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr><tr><td align="left" valign="top" width="228">2.5. Entities</td><td align="center" valign="top" width="228"><a href="index/index.htm"><img alt="Book Index" border="0" src="../gifs/index.gif" /></a></td><td align="right" valign="top" width="228">2.7. The XML Declaration</td></tr></table></div><hr width="684" align="left" /><img alt="Library Navigation Links" border="0" src="../gifs/navbar.gif" usemap="#library-map" /><p><p><font size="-1"><a href="copyrght.htm">Copyright &copy; 2002</a> O'Reilly &amp; Associates. 
All rights reserved.</font></p><map name="library-map"><area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map></body></html>

Source: ch02_06.htm, from "Perl & XML" by Erik T. Ray and Jason McIntosh, ISBN 0-596-00205-X, First Edition, published April (HTML, 82 lines).
