</p><blockquote class="footnote"> <a name="FOOTNOTE-21" /><p>[21] The romantic metaphor may start to break down for you here, but you probably understand by now that Perl's polyamorous proclivities help make it the language that it is.</p> </blockquote>
<p>Currently, the most recent stable Perl release, 5.6.1, contains partial support for <a name="INDEX-309" />Unicode. Invoking the <tt class="literal">use utf8</tt> pragma tells Perl to use UTF-8 encoding with most of its string-handling functions. Perl also allows code to exist in UTF-8, allowing identifiers built from characters living beyond ASCII's one-byte reach. This can prove very useful for hackers who primarily think in glyphs outside the Latin alphabet.</p>
<p>Perl 5.8's Unicode support will be much more complete, allowing UTF-8 and regular expressions to play nice. The 5.8 distribution also introduces the <tt class="literal">Encode</tt> module to Perl's standard library, which will allow any Perl programmer to shift text from legacy encodings to Unicode without fuss:</p>
<blockquote><pre class="code">use Encode 'from_to';
from_to($data, "iso-8859-3", "utf-8"); # from legacy to utf-8</pre></blockquote>
<p>Finally, Perl 6, being a redesign of the whole language that includes everything the Perl community learned over the last dozen years, will naturally have an even more intimate relationship with Unicode (and will give us an excuse to print a second edition of this book in a few years). Stay tuned to the usual information channels for continuing developments on this front as we see what happens.</p></div></div>
<a name="perlxml-CHP-3-SECT-9.4" /><div class="sect2"><h3 class="sect2">3.9.4. Encoding Conversion</h3>
<p>If <a name="INDEX-310" /><a name="INDEX-311" /> <a name="INDEX-312" />you use a version of Perl older than 5.8, you'll need a little extra help when switching from one encoding to another.
Fortunately, your toolbox contains some ratchety little devices to assist you.</p>
<a name="perlxml-CHP-3-SECT-9.4.1" /><div class="sect3"><h3 class="sect3">3.9.4.1. iconv and Text::Iconv</h3>
<p><tt class="literal">iconv</tt><a name="INDEX-313" /> is a library and program available for Windows and Unix (including Mac OS X) that provides an easy interface for turning a document of type A into one of type B. On the Unix command line, you can use it like this:</p>
<blockquote><pre class="code">$ <tt class="userinput"><b>iconv -f latin1 -t utf8 my_file.txt > my_unicode_file.txt</b></tt></pre></blockquote>
<p>If you have <tt class="literal">iconv</tt> on your system, you can also grab the <tt class="literal">Text::Iconv</tt><a name="INDEX-314" /> Perl module from CPAN, which gives you a Perl API to this library. This allows you to quickly re-encode on-disk files or strings in memory.</p></div>
<a name="perlxml-CHP-3-SECT-9.4.2" /><div class="sect3"><h3 class="sect3">3.9.4.2. Unicode::String</h3>
<p>A more portable solution comes in the form of the <tt class="literal">Unicode::String</tt><a name="INDEX-315" /> module, which needs no underlying C library. The module's basic API is as blissfully simple as all basic APIs should be. Got a string? Feed it to the class's constructor method and get back an object holding that string, as well as a bevy of methods that let you squash and stretch it in useful and amusing ways. <a href="ch03_09.htm#perlxml-CHP-3-EX-12">Example 3-12</a> tests the module.</p>
<a name="perlxml-CHP-3-EX-12" /><div class="example"><h4 class="objtitle">Example 3-12. Unicode test </h4>
<blockquote><pre class="code">use Unicode::String;

my $string = "This sentence exists in ASCII and UTF-8, but not UTF-16. Darn!\n";
my $u = Unicode::String->new($string);

# $u now holds an object representing a stringful of 16-bit characters
# It uses overloading so Perl string operators do what you expect!
$u .= "\n\nOh, hey, it's Unicode all of a sudden. Hooray!!\n";

# print as UTF-16 (also known as UCS2)
print $u->ucs2;

# print as something more human-readable
print $u->utf8;</pre></blockquote></div>
<p>The module's many methods allow you to downgrade your strings, too. Specifically, the <tt class="literal">utf7</tt> method re-encodes the string as UTF-7, which represents every character using only 7-bit ASCII bytes. This is useful if you need to send text to a receiver that would flip out if it saw chains of 8-bit UTF-8 bytes marching proudly its way instead of the austere and solitary encodings of old.</p>
<a name="ch03-32-fm2xml" /><blockquote><b>WARNING:</b> <tt class="literal">XML::Parser</tt> sometimes seems a little too eager to get you into Unicode. No matter what a document's declared encoding is, it silently transforms all characters with higher Unicode code points into UTF-8, and if you ask the parser for your data back, it delivers those characters back to you in that manner. This silent transformation can be an unpleasant surprise. If you use <tt class="literal">XML::Parser</tt> as the core of any processing software you write, be aware that you may need to use the conversion tools mentioned in this section to massage your data into a more suitable format.</p></blockquote></div>
<a name="perlxml-CHP-3-SECT-9.4.3" /><div class="sect3"><h3 class="sect3">3.9.4.3. Byte order marks</h3>
<p>If, for some reason, you have an XML document from an unknown source and have no idea what its encoding might be, it may behoove you to check for the presence of a <em class="emphasis">byte order mark</em><a name="INDEX-316" /> <a name="INDEX-317" /> (BOM) at the start of the document. Documents that use Unicode's UTF-16 and UTF-32 encodings are endian-dependent (while UTF-8 escapes this fate by nature of its peculiar protocol).
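</p>
<p>As a rough illustration of what endian-dependent means, the following sketch uses Perl's built-in <tt class="literal">pack</tt> function to write a single 16-bit code unit in both byte orders. The <tt class="literal">hex_bytes</tt> helper and the sample code point <tt class="literal">U+263A</tt> are our own choices for the demonstration, not part of any module discussed here.</p>
<blockquote><pre class="code">use strict;

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", ord($_) } split //, $bytes;
}

my $code_point = 0x263A;               # a sample character, U+263A

my $big    = pack("n", $code_point);   # 16 bits, big-endian order
my $little = pack("v", $code_point);   # 16 bits, little-endian order

print "big-endian:    ", hex_bytes($big), "\n";      # 26 3A
print "little-endian: ", hex_bytes($little), "\n";   # 3A 26

# Reading the little-endian bytes as if they were big-endian
# yields a different code point altogether.
printf "misread as:    U+%04X\n", unpack("n", $little);   # U+3A26</pre></blockquote>
<p>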
Not knowing which byte of each multibyte unit comes first will make reading these documents similar to reading them in a mirror, rendering their content into a garble that your programs will not appreciate.</p>
<p>Unicode defines a special code point, <tt class="literal">U+FEFF</tt>, as the byte order mark. According to the Unicode specification, documents using the UTF-16 or UTF-32 encodings have the option of dedicating their first two or four bytes to this character.<a href="#FOOTNOTE-22">[22]</a> This way, if a program carefully inspecting the document scans the first two bytes and sees that they're <tt class="literal">0xFE</tt> and <tt class="literal">0xFF</tt>, in that order, it knows it's big-endian UTF-16. On the other hand, if it sees <tt class="literal">0xFF 0xFE</tt>, it knows that the document is little-endian because <tt class="literal">U+FFFE</tt> is not a valid Unicode character. (UTF-32's big- and little-endian BOMs have more padding: <tt class="literal">0x00 0x00 0xFE 0xFF</tt> and <tt class="literal">0xFF 0xFE 0x00 0x00</tt>, respectively.)</p>
<blockquote class="footnote"><a name="FOOTNOTE-22" /><p>[22] UTF-8 has its own byte order mark, but its purpose is to identify the document as UTF-8, and thus has little use in the XML world. The UTF-8 encoding doesn't have to worry about any of this endianness business since all its characters are made of strung-together byte sequences that are always read from first to last instead of little boxes holding byte pairs whose order may be questionable.</p> </blockquote>
<p>The XML specification states that UTF-16- and UTF-32-encoded documents must use a BOM, but, referring to the Unicode specification, we see that documents created by the engines of sane and benevolent masters will arrive to you in network order. In other words, they arrive to you in a big-endian fashion, which was some time ago declared as the order to use when transmitting data between machines.
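</p>
<p>To make the byte sequences above concrete, here is a small sketch that uses Perl's built-in <tt class="literal">pack</tt> function to write <tt class="literal">U+FEFF</tt> as a 16-bit and as a 32-bit unit in each byte order. The <tt class="literal">n</tt> and <tt class="literal">N</tt> templates are Perl's "network" (big-endian) formats, while <tt class="literal">v</tt> and <tt class="literal">V</tt> are their little-endian counterparts. The <tt class="literal">hex_bytes</tt> helper is our own invention for display purposes.</p>
<blockquote><pre class="code">use strict;

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf "%02X", ord($_) } split //, $bytes;
}

my $bom = 0xFEFF;   # the byte order mark code point

# Template letters: n/N are big-endian ("network") order, v/V little-endian.
my @variants = (
    [ "UTF-16 big-endian",    "n" ],
    [ "UTF-16 little-endian", "v" ],
    [ "UTF-32 big-endian",    "N" ],
    [ "UTF-32 little-endian", "V" ],
);

foreach my $variant (@variants) {
    my ($name, $template) = @$variant;
    printf "%-22s %s\n", $name, hex_bytes(pack($template, $bom));
}</pre></blockquote>
<p>Note that the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian one, so a check of only the first two bytes cannot tell the two apart.</p>
<p>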
Conversely, because you are sane and benevolent, you should always transmit documents in network order when you're not sure which order to use. However, if you ever find yourself in doubt that you've received a sane document, just close your eyes and hum this tune:</p>
<blockquote><pre class="code">open(XML_FILE, $filename) or die "Can't read $filename: $!";
binmode XML_FILE;   # read raw bytes, with no newline translation
my $bom; # will hold possible byte order mark

# read the first two bytes
read XML_FILE, $bom, 2;

# Fetch their numeric values, via Perl's ord() function
my $ord1 = ord(substr($bom,0,1));
my $ord2 = ord(substr($bom,1,1));

if ($ord1 == 0xFE && $ord2 == 0xFF) {
  # It looks like a UTF-16 big-endian document!
  # ... act accordingly here ...
} elsif ($ord1 == 0xFF && $ord2 == 0xFE) {
  # Oh, someone was naughty and sent us a UTF-16 little-endian document.
  # Probably we'll want to effect a byteswap on the thing before working with it.
} else {
  # No byte order mark detected.
}</pre></blockquote>
<p>You might run this example as a last-ditch effort if your parser complains that it can't find any XML in the document. The first line might indeed be a valid <tt class="literal">&lt;?xml ... &gt;</tt> declaration<a name="INDEX-318" /> <a name="INDEX-319" /> <a name="INDEX-320" />, but<a name="INDEX-321" /> your<a name="INDEX-322" /> parser sees<a name="INDEX-323" /> some<a name="INDEX-324" /> gobbledygook instead.</p></div></div>
<hr width="684" align="left" /><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch03_08.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"><img alt="Home" border="0" src="../gifs/txthome.gif" /></a></td><td align="right" valign="top" width="228"><a href="ch04_01.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr><tr><td align="left" valign="top" width="228">3.8. 
XML::Writer</td><td align="center" valign="top" width="228"><a href="index/index.htm"><img alt="Book Index" border="0" src="../gifs/index.gif" /></a></td><td align="right" valign="top" width="228">4. Event Streams</td></tr></table></div><hr width="684" align="left" /><img alt="Library Navigation Links" border="0" src="../gifs/navbar.gif" usemap="#library-map" /><p><p><font size="-1"><a href="copyrght.htm">Copyright © 2002</a> O'Reilly & Associates. All rights reserved.</font></p><map name="library-map"><area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map></body></html>