<html><head><title>SAX (Perl and XML)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly & Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="059600205XL" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl and XML" /><meta name="DC.Type" content="Text.Monograph" /></head><body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl & XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch04_06.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch05_02.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div><h1 class="chapter">Chapter 5. SAX</h1><div class="htmltoc"><h4 class="tochead">Contents:</h4><p><a href="ch05_01.htm">SAX Event Handlers</a><br /><a href="ch05_02.htm">DTD Handlers</a><br /><a href="ch05_03.htm">External Entity Resolution</a><br /><a href="ch05_04.htm">Drivers for Non-XML Sources</a><br /><a href="ch05_05.htm">A Handler Base Class</a><br /><a href="ch05_06.htm">XML::Handler::YAWriter as a Base Handler Class</a><br /><a href="ch05_07.htm">XML::SAX: The Second Generation</a><br /></p></div><p><tt class="literal">XML::Parser</tt> has<a name="INDEX-363" /> done remarkably well as a multipurpose XML parser and stream generator, but it really isn't the future of Perl and XML. 
The problem is that we don't want one standard parser for all ends and purposes; we want to be able to choose from multiple parsers, each serving a different purpose. One parser might be written completely in Perl for portability, while another is accelerated with a core written in C. Or, you might want a parser that translates one format (such as a spreadsheet) into an XML stream. You simply can't anticipate all the things a parser might be called on to do. Even <tt class="literal">XML::Parser</tt>, with its many options and multiple modes of operation, can't please everybody. The future, then, is a multiplicity of parsers that cover any situation you encounter.</p><p>An environment with multiple parsers demands some level of consistency. If every parser had its own interface, developers would go mad. Learning one interface and being able to expect all parsers to comply with it is better than having to learn a hundred different ways to do the same thing. We need a standard interface between parsers and code: a universal plug that is flexible and reliable, free from the individual quirks of any particular parser.</p><p>The XML development world has settled on an event-driven interface called SAX. SAX evolved from discussions on the XML-DEV mailing list and, shepherded by David <a name="INDEX-364" />Megginson,<a href="#FOOTNOTE-24">[24]</a> was quickly shaped into a useful specification. The first incarnation, called SAX Level 1 (or just SAX1), supports elements, attributes, and processing instructions. It doesn't handle some other things like namespaces or CDATA sections, so the second iteration, SAX2, was devised, adding support for just about any event you can imagine in generic XML.</p><blockquote class="footnote"> <a name="FOOTNOTE-24" /><p>[24] David Megginson maintains a web page about SAX at <a href="http://www.saxproject.org">http://www.saxproject.org</a>.</p> </blockquote><p>SAX has been a huge success. Its simplicity makes it easy to learn and work with. 
Early development with XML was mostly in the realm of Java, so SAX was codified as an interface construct. An interface construct is a special kind of class that declares an object's methods without implementing them, leaving the implementation up to the developer.</p><p>Enthusiasm for SAX soon infected the Perl community and implementations began to appear in CPAN, but there was a problem. Perl doesn't provide a rigorous way to define a standard interface like <a name="INDEX-365" />Java does. It has weak type checking and forgives all kinds of inconsistencies. Whereas Java compares argument types in functions with those defined in the interface construct at compile time, Perl quietly accepts any arguments you use. Thus, defining a standard in Perl is mostly a verbal activity, relying on the developer's experience and watchfulness to comply.</p><p>One of the first Perl implementations of SAX is Ken McLeod's <tt class="literal">XML::Parser::PerlSAX</tt> module. As a subclass of <tt class="literal">XML::Parser</tt>, it modifies the stream of events from Expat to repackage them as SAX events.</p><div class="sect1"><a name="perlxml-CHP-5-SECT-1" /><h2 class="sect1">5.1. SAX Event Handlers</h2><p>To<a name="INDEX-366" /><a name="INDEX-367" /> use a typical SAX module in a program, you must pass it an object whose methods implement handlers for SAX events. <a href="ch05_01.htm#perlxml-CHP-5-TABLE-1">Table 5-1</a> describes the methods in a typical handler object. A SAX parser passes a hash to each handler containing properties relevant to the event. For example, in this hash, an element handler would receive the element's name and a list of attributes.</p><a name="perlxml-CHP-5-TABLE-1" /><h4 class="objtitle">Table 5-1. 
PerlSAX handlers </h4><table border="1"><tr><th><p>Method name</p></th><th><p>Event</p></th><th><p>Properties</p></th></tr><tr><td><p><tt class="literal">start_document</tt><a name="INDEX-368" /></p></td><td><p>The document processing has started (this is the first event) </p></td><td><p>(none defined)</p></td></tr><tr><td><p><tt class="literal">end_document</tt><a name="INDEX-369" /></p></td><td><p>The document processing is complete (this is the last event) </p></td><td><p>(none defined)</p></td></tr><tr><td><p><tt class="literal">start_element</tt><a name="INDEX-370" /></p></td><td><p>An element start tag or empty element tag was found </p></td><td><p>Name, Attributes</p></td></tr><tr><td><p><tt class="literal">end_element</tt><a name="INDEX-371" /></p></td><td><p>An element end tag or empty element tag was found </p></td><td><p>Name</p></td></tr><tr><td><p><tt class="literal">characters</tt><a name="INDEX-372" /></p></td><td><p>A string of nonmarkup characters (character data) was found </p></td><td><p>Data</p></td></tr><tr><td><p><tt class="literal">processing_instruction</tt><a name="INDEX-373" /></p></td><td><p>A parser encountered a processing instruction </p></td><td><p>Target, Data</p></td></tr><tr><td><p><tt class="literal">comment</tt><a name="INDEX-374" /></p></td><td><p>A parser encountered a comment </p></td><td><p>Data</p></td></tr><tr><td><p><tt class="literal">start_cdata</tt><a name="INDEX-375" /></p></td><td><p>The beginning of a CDATA section was encountered (the following character data may contain reserved markup characters)</p></td><td><p>(none defined)</p></td></tr><tr><td><p><tt class="literal">end_cdata</tt><a name="INDEX-376" /></p></td><td><p>The end of a CDATA section was encountered </p></td><td><p>(none defined)</p></td></tr><tr><td><p><tt class="literal">entity_reference</tt><a name="INDEX-377" /></p></td><td><p>An internal entity reference was found (as opposed to an external entity reference, which would indicate that a file needs to 
be loaded)</p></td><td><p>Name, Value</p></td></tr></table><p><p>A few notes about handler methods: </p><ul><li><p>For an empty element, both the <tt class="literal">start_element( )</tt> and <tt class="literal">end_element( )</tt> handlers are called, in that order. No handler exists specifically for empty elements.</p></li><li><p>The <tt class="literal">characters( )</tt> handler may be called more than once for a string of contiguous character data, parceling it into pieces. For example, a parser might break text around an entity reference, which is often more efficient for the parser.</p></li><li><p>The <tt class="literal">characters( )</tt> handler will be called for any whitespace between elements, even if it doesn't seem like significant data. In XML, all characters are considered part of data. It's simply more efficient not to make a distinction otherwise.</p></li><li><p>Handling of processing instructions, comments, and CDATA sections is optional. In the absence of handlers, the data from processing instructions and comments is discarded. For CDATA sections, calls are still made to the <tt class="literal">characters(</tt> <tt class="literal">)</tt> handler as before so the data will not be lost.</p></li><li><p>The <tt class="literal">start_cdata( )</tt> and <tt class="literal">end_cdata( )</tt> handlers do not receive data. Instead, they merely act as signals to tell you whether reserved markup characters can be expected in future calls to the <tt class="literal">characters( )</tt> handler.</p></li><li><p>In the absence of an <tt class="literal">entity_reference( )</tt> handler, all internal entity references will be resolved automatically by the parser, and the resulting text or markup will be handled normally. If you do define an <tt class="literal">entity_reference( )</tt> handler, the entity references will not be expanded and you can do what you want with them.</p></li></ul><p>Let's show an example now. We'll write a program called a filter, a special processor that outputs a