📁 Perl & XML. by Erik T. Ray and Jason McIntosh ISBN 0-596-00205-X First Edition, published April
<html><head><title>XML Basics: Reading and Writing (Perl and XML)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="059600205XL" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl and XML" /><meta name="DC.Type" content="Text.Monograph" /></head><body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl &amp; XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch02_12.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch03_02.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div><h1 class="chapter">Chapter 3. 
XML Basics: Reading and Writing</h1><div class="htmltoc"><h4 class="tochead">Contents:</h4><p><a href="ch03_01.htm">XML Parsers</a><br /><a href="ch03_02.htm">XML::Parser</a><br /><a href="ch03_03.htm">Stream-Based Versus Tree-Based Processing</a><br /><a href="ch03_04.htm">Putting Parsers to Work</a><br /><a href="ch03_05.htm">XML::LibXML</a><br /><a href="ch03_06.htm">XML::XPath</a><br /><a href="ch03_07.htm">Document Validation</a><br /><a href="ch03_08.htm">XML::Writer</a><br /><a href="ch03_09.htm">Character Sets and Encodings</a><br /></p></div><p>This<a name="INDEX-165" /> chapter covers the two most important tasks in working with XML: reading it into memory and writing it out again. XML is a structured, predictable, and standard data storage format, and as such carries a price. Unlike the line-by-line, make-it-up-as-you-go style that typifies text hacking in Perl, XML expects you to learn the rules of its game -- the structures and protocols outlined in <a href="ch02_01.htm">Chapter 2, "An XML Recap"</a> -- before you can play with it. Fortunately, much of the hard work is already done, in the form of module-based parsers and other tools that trailblazing Perl and XML hackers already created (some of which we touched on in <a href="ch01_01.htm">Chapter 1, "Perl and XML"</a>). </p><p>Knowing how to use parsers is very important. They typically drive the rest of the processing for you, or at least get the data into a state where you can work with it. Any good programmer knows that getting the data ready is half the battle. We'll look deeply into the parsing process and detail the strategies used to drive processing.</p><p>Parsers come with a bewildering array of options that let you configure the output to your needs. Which character set should you use? Should you validate the document or merely check if it's well formed? Do you need to expand entity references, or should you keep them as references? 
How can you set handlers for events or tell the parser to build a tree for you? We'll explain these options fully so you can get the most out of parsing.</p><p>Finally, we'll show you how to spit XML back out, which can be surprisingly tricky if one isn't aware of XML's expectations regarding text encoding. Getting this step right is vital if you ever want to be able to use your data again without painful hand fixing.</p><div class="sect1"><a name="perlxml-CHP-3-SECT-1" /><h2 class="sect1">3.1. XML Parsers</h2><p>File<a name="INDEX-166" /> I/O is an intrinsic part of any programming language, but it has always been done at a fairly low level: reading a character or a line at a time, running it through a regular expression filter, etc. Raw text is an unruly commodity, lacking any clear rules for how to separate discrete portions, other than basic, flat concepts such as newline-separated lines and tab-separated columns. Consequently, more data packaging schemes are available than even the chroniclers of Babel could have foreseen. It's from this cacophony that XML has risen, providing clear rules for how to create boundaries between data, assign hierarchy, and link resources in a predictable, unambiguous fashion. A program that relies on these rules can read any well-formed XML document, as if someone had jammed a babelfish into its ear.<a href="#FOOTNOTE-11">[11]</a></p><blockquote class="footnote"> <a name="FOOTNOTE-11" /><p>[11] Readers of Douglas Adams' book <em class="citetitle">The Hitchhiker's Guide to the Galaxy</em> will recall that a babelfish is a living, universal language-translation device, about the size of an anchovy, that fits, head-first, into a sentient being's aural canal.</p> </blockquote><p>Where can you get this babelfish to put in your program's ear? An <em class="emphasis">XML parser</em> is a program or code library that translates XML data into either a stream of events or a data object, giving your program direct access to structured data. 
The XML can come from one or more files or filehandles, a character stream, or a static string. It could be peppered with entity references that may or may not need to be resolved. Some of the parts could come from outside your computer system, living in some far corner of the Internet. It could be encoded in a Latin character set, or perhaps in a Japanese set. Fortunately for you, the developer, none of these details have to be accounted for in your program because they are all taken care of by the parser, an abstract tunnel between the physical state of data and the crystallized representation seen by your subroutines.</p><p>An XML parser acts as a bridge between marked-up data (data packaged with embedded XML instructions) and some predigested form your program can work with. In Perl's case, we mean hashes, arrays, scalars, and objects made of references to these old friends. XML can be complex, residing in many files or streams, and can contain unresolved regions (entities) that may need to be patched up. Also, a parser usually tries to accept only good XML, rejecting it if it contains well-formedness errors. Its output has to reflect the structure (order, containment, associative data) while ignoring irrelevant details such as what files the data came from and what character set was used. That's a lot of work. To itemize these points, an XML parser:</p><ul><li><p>Reads a stream of characters and distinguishes between markup and data</p></li><li><p>Optionally replaces entity references with their values </p></li><li><p>Assembles a complete, logical document from many disparate sources </p></li><li><p>Reports syntax errors and optionally reports grammatical (validation) errors</p></li><li><p>Serves data and structural information to a client program </p></li></ul><p>In XML, data and markup are mixed together, so the parser first has to sift through a character stream and tell the two apart. 
Certain characters delimit the instructions from data, primarily <a name="INDEX-167" /><a name="INDEX-168" />angle brackets (<tt class="literal">&lt;</tt> and <tt class="literal">&gt;</tt>) for elements, comments, and processing instructions, and <a name="INDEX-169" /><a name="INDEX-170" />ampersand (<tt class="literal">&amp;</tt>) and <a name="INDEX-171" /><a name="INDEX-172" />semicolon (<tt class="literal">;</tt>) for entity references. The parser also knows when to expect a certain instruction, or if a bad instruction has occurred; for example, an element that contains data must bracket the data in both a start and end tag. With this knowledge, the parser can quickly chop a character stream into discrete portions as encoded by the XML markup.</p><p>The next task is to fill in placeholders. <em class="emphasis">Entity references</em><a name="INDEX-173" /> may need to be resolved. Early in the process of reading XML, the processor will have encountered a list of placeholder definitions in the form of entity declarations, which associate a brief identifier with an entity. The identifier is some literal text defined in the document's DTD, and the entity itself can be defined right there or at the business end of a URL. These entities can themselves contain entity references, so the process of resolving an entity can take several iterations before the placeholders are filled in.</p><p>You may not always want <a name="INDEX-174" />entities to be resolved. If you're just spitting XML back out after some minor processing, then you may want to turn entity resolution off or substitute your own routine for handling entity references. For example, you may want to resolve external entity references (entities whose values are in locations external to the document, pointed to by URLs), but not resolve internal ones. Most parsers give you the ability to do this, but none will let you use entity references without declaring them.</p><p>That leads to the third task. 
If you allow the parser to resolve external entities, it will fetch all the documents, local or remote, that contain parts of the larger XML document. In doing so, it smushes all these entities into one unbroken document. Since your program usually doesn't need to know how the document is distributed physically, information about the physical origin of any piece of data goes away once the parser knits the whole document together.</p><p>While interpreting the markup, the parser may trip over a syntactic error. XML was designed to make it very easy to spot such errors. Everything from attributes to empty element tags has rigid rules for its construction so a parser doesn't have to think very hard about it. For example, the following piece of XML has an obvious error. The start tag for the <tt class="literal">&lt;decree&gt;</tt> element contains an attribute with a defective value assignment. The value "now" is missing a second quote character, and there's another error, somewhere in the end tag. Can you see it?</p><blockquote><pre class="code">&lt;decree effective="now&gt;All motorbikes shall be painted red.&lt;/decree&lt;</pre></blockquote><p>When such an error occurs, the parser has little choice but to shut down the operation. There's no point in trying to parse the rest of the document. The point of XML is to make things unambiguous. If the parser had to guess how the document should look,<a href="#FOOTNOTE-12">[12]</a> it would open up the data to uncertainty and you'd lose that precious level of confidence in your program. Instead, the XML framers (wisely, we feel) opted to make XML parsers choke and die on bad XML documents. If the parser likes your XML, the document is said to be <em class="emphasis">well formed</em><a name="INDEX-177" />.</p><blockquote class="footnote"> <a name="FOOTNOTE-12" /><p>[12] Most <a name="INDEX-175" />HTML<a name="INDEX-176" /> browsers try to ignore well-formedness errors in HTML documents, attempting to fix them and move on. 
While ignoring these errors may seem to be more convenient to the reader, it actually encourages sloppy documents and results in overall degradation of the quality of information on the Web. After all, would you fix parse errors if you didn't have to?</p> </blockquote><p>What do we mean by "grammatical errors"? You will encounter them only with so-called <em class="emphasis">validating</em><a name="INDEX-178" /> parsers. A document is considered to be <em class="emphasis">valid</em> if it passes a test defined in a DTD. XML-based languages and applications often have DTDs to set a minimal standard above well-formedness for how elements and data should be ordered. For example, the W3C has posted at least one DTD to describe XHTML (the XML-compliant flavor of HTML), listing all elements that can appear, where they can go, and what they can contain. It would be grammatically correct to put a <tt class="literal">&lt;p&gt;</tt> element inside a <tt class="literal">&lt;body&gt;</tt>, but putting <tt class="literal">&lt;p&gt;</tt> inside <tt class="literal">&lt;head&gt;</tt>, for example, would be incorrect. And don't even think about inserting an element <tt class="literal">&lt;blooby&gt;</tt> anywhere in the document, because it isn't declared anywhere in the DTD.<a href="#FOOTNOTE-13">[13]</a> If even one error of this type is in a document, then the whole document is considered <em class="emphasis">invalid</em><a name="INDEX-179" />. It may be well formed, but not valid against the particular DTD. Often, this level of checking is more of a burden than a help, but it's available if you need it.</p><blockquote class="footnote"> <a name="FOOTNOTE-13" /><p>[13] If you insist on authoring a <tt class="literal">&lt;blooby&gt;</tt>-enabled web page in XML, you can design your own extension by drafting a DTD that uses entity references to pull in the XHTML DTD, and then defines your own special elements on top of it. 
At this point it's not officially XHTML anymore, but a subclass thereof.</p></blockquote><p>Rounding out our list is the requirement that a parser ship the digested data to a program or end user. You can do this in many ways, and we devote much of the rest of the book to analyzing them. We can break up the forms into a few categories:</p><dl><dt><i>Event stream</i></dt><dd><p>First, a parser can generate an event stream: the parser converts a stream of markup characters into a new kind of stream that is more abstract, with data that is partially processed and easier to handle by your program.</p></dd><dt><i>Object representation</i></dt><dd><p>Second, a parser can construct a data structure that reflects the information in the XML markup. This construction requires more resources from your system, but may be more convenient because it creates a persistent object that will wait around while you work on it.</p></dd><dt><i>Hybrid form</i></dt><dd><p>We might call the third group "hybrid" output. It includes parsers that try to be smart about processing, using some advance knowledge about the document to construct an object representing only a portion of your document.</p></dd></dl><a name="perlxml-CHP-3-SECT-1.1" /><div class="sect2"><h3 class="sect2">3.1.1. Example (of What Not to Do): A Well-Formedness Checker</h3><p>We've <a name="INDEX-180" />described XML parsers abstractly, but now it's time to get our hands dirty. We're going to write our own <a name="INDEX-181" />parser whose sole purpose is to check whether a document is well-formed XML or if it fails the basic test. This is about the simplest a parser can get; it doesn't drive any further processing, but just returns a "yes" or "no."</p><p>Our mission here is twofold. First, we hope to shave some of the mystique off of XML processing -- at the end of the day, it's just pushing text around. 
However, we also want to emphasize that writing a proper parser in Perl (or any language) requires a lot of work, which would be better spent writing more