<html><head><title>XML::Parser (Perl and XML)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="059600205XL" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl and XML" /><meta name="DC.Type" content="Text.Monograph" /></head>
<body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl &amp; XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map>
<div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch04_05.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch05_01.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div>
<h2 class="sect1">4.6. XML::Parser</h2>
<p>Another early parser is <tt class="literal">XML::Parser</tt><a name="INDEX-356" /><a name="INDEX-357" />, the first fast and efficient parser to hit CPAN. We detailed its many-faceted interface in <a href="ch03_01.htm">Chapter 3, "XML Basics: Reading and Writing"</a>. Its built-in stream mode is worth a closer look, though. Let's return to it now with a solid stream example.</p>
<p>We'll use <tt class="literal">XML::Parser</tt> to read a list of records encoded as an XML document. The records contain contact information for people, including their names, street addresses, and phone numbers. 
As the parser reads the file, our handlers will store the information in their own data structure for later processing. Finally, when the parser is done, the program sorts the records by the person's name and outputs them as an HTML table.</p>
<p>The source document is listed in <a href="ch04_06.htm#perlxml-CHP-4-EX-3">Example 4-3</a>. It has a <tt class="literal">&lt;list&gt;</tt> element as the root, with four <tt class="literal">&lt;entry&gt;</tt> elements inside it, each with an address, a name, and a phone number.</p>
<a name="perlxml-CHP-4-EX-3" /><div class="example"><h4 class="objtitle">Example 4-3. Address book file </h4><blockquote><pre class="code">&lt;list&gt;
  &lt;entry&gt;
    &lt;name&gt;&lt;first&gt;Thadeus&lt;/first&gt;&lt;last&gt;Wrigley&lt;/last&gt;&lt;/name&gt;
    &lt;phone&gt;716-505-9910&lt;/phone&gt;
    &lt;address&gt;
      &lt;street&gt;105 Marsupial Court&lt;/street&gt;
      &lt;city&gt;Fairport&lt;/city&gt;&lt;state&gt;NY&lt;/state&gt;&lt;zip&gt;14450&lt;/zip&gt;
    &lt;/address&gt;
  &lt;/entry&gt;
  &lt;entry&gt;
    &lt;name&gt;&lt;first&gt;Jill&lt;/first&gt;&lt;last&gt;Baxter&lt;/last&gt;&lt;/name&gt;
    &lt;address&gt;
      &lt;street&gt;818 S. Rengstorff Avenue&lt;/street&gt;
      &lt;zip&gt;94040&lt;/zip&gt;
      &lt;city&gt;Mountainview&lt;/city&gt;&lt;state&gt;CA&lt;/state&gt;
    &lt;/address&gt;
    &lt;phone&gt;217-302-5455&lt;/phone&gt;
  &lt;/entry&gt;
  &lt;entry&gt;
    &lt;name&gt;&lt;last&gt;Riccardo&lt;/last&gt; &lt;first&gt;Preston&lt;/first&gt;&lt;/name&gt;
    &lt;address&gt;
      &lt;street&gt;707 Foobah Drive&lt;/street&gt;
      &lt;city&gt;Mudhut&lt;/city&gt;&lt;state&gt;OR&lt;/state&gt;&lt;zip&gt;32777&lt;/zip&gt;
    &lt;/address&gt;
    &lt;phone&gt;111-222-333&lt;/phone&gt;
  &lt;/entry&gt;
  &lt;entry&gt;
    &lt;address&gt;
      &lt;street&gt;10 Jiminy Lane&lt;/street&gt;
      &lt;city&gt;Scrapheep&lt;/city&gt;&lt;state&gt;PA&lt;/state&gt;&lt;zip&gt;99001&lt;/zip&gt;
    &lt;/address&gt;
    &lt;name&gt;&lt;first&gt;Benn&lt;/first&gt;&lt;last&gt;Salter&lt;/last&gt;&lt;/name&gt;
    &lt;phone&gt;611-328-7578&lt;/phone&gt;
  &lt;/entry&gt;
&lt;/list&gt;</pre></blockquote></div>
<p>This simple structure lends itself naturally to event processing. Each <tt class="literal">&lt;entry&gt;</tt> start tag signals the preparation of a new part of the data structure for storing data. An <tt class="literal">&lt;/entry&gt;</tt> end tag indicates that all data for the record has been collected and can be saved. Similarly, start and end tags for <tt class="literal">&lt;entry&gt;</tt> subelements are cues that tell the handler when and where to save information. 
Each <tt class="literal">&lt;entry&gt;</tt> is self-contained, with no links to the outside, making it easy to process.</p>
<p>The program is listed in <a href="ch04_06.htm#perlxml-CHP-4-EX-4">Example 4-4</a>. At the top is code used to initialize the parser object with references to subroutines, each of which will serve as the handler for a single event. This style of event handling is called a <em class="emphasis">callback</em> because you write the subroutine first, and the parser then calls it back when it needs it to handle an event.</p>
<p>After the initialization, we declare some global variables to store information from XML elements for later processing. These variables give the handlers a memory, as mentioned earlier. Storing information for later retrieval is often called <em class="emphasis">saving state</em> because it helps the handlers preserve the state of the parsing up to the current point in the document.</p>
<p>After the code that reads in the data and applies the parser to it, the rest of the program defines the handler subroutines. We handle five events: the start and end of the document, the start and end of elements, and character data. Other events, such as comments, processing instructions, and document type declarations, will all be ignored.</p>
<a name="perlxml-CHP-4-EX-4" /><div class="example"><h4 class="objtitle">Example 4-4. 
Code for the address program </h4><blockquote><pre class="code"># initialize the parser with references to handler routines
#
use XML::Parser;
my $parser = XML::Parser-&gt;new( Handlers =&gt; {
    Init  =&gt; \&amp;handle_doc_start,
    Final =&gt; \&amp;handle_doc_end,
    Start =&gt; \&amp;handle_elem_start,
    End   =&gt; \&amp;handle_elem_end,
    Char  =&gt; \&amp;handle_char_data,
});

#
# globals
#
my $record;       # points to a hash of element contents
my $context;      # name of current element
my %records;      # set of address entries

#
# read in the data and run the parser on it
#
my $file = shift @ARGV;
if( $file ) {
    $parser-&gt;parsefile( $file );
} else {
    my $input = "";
    while( &lt;STDIN&gt; ) { $input .= $_; }
    $parser-&gt;parse( $input );
}
exit;

###
### Handlers
###

#
# As processing starts, output the beginning of an HTML file.
#
sub handle_doc_start {
    print "&lt;html&gt;&lt;head&gt;&lt;title&gt;addresses&lt;/title&gt;&lt;/head&gt;\n";
    print "&lt;body&gt;&lt;h1&gt;addresses&lt;/h1&gt;\n";
}

#
# save element name and attributes
#
sub handle_elem_start {
    my( $expat, $name, %atts ) = @_;
    $context = $name;
    $record = {} if( $name eq 'entry' );
}

# collect character data into the recent element's buffer
#
sub handle_char_data {
    my( $expat, $text ) = @_;

    # Perform some minimal entitizing of naughty characters
    $text =~ s/&amp;/&amp;amp;/g;
    $text =~ s/&lt;/&amp;lt;/g;

    $record-&gt;{ $context } .= $text;
}

#
# if this is an &lt;entry&gt;, collect all the data into a record
#
sub handle_elem_end {
    my( $expat, $name ) = @_;
    return unless( $name eq 'entry' );
    my $fullname = $record-&gt;{'last'} . $record-&gt;{'first'};
    $records{ $fullname } = $record;
}

#
# Output the table of records and the close of the file at the end of processing.
#
sub handle_doc_end {
    print "&lt;table border='1'&gt;\n";
    print "&lt;tr&gt;&lt;th&gt;name&lt;/th&gt;&lt;th&gt;phone&lt;/th&gt;&lt;th&gt;address&lt;/th&gt;&lt;/tr&gt;\n";
    foreach my $key ( sort( keys( %records ))) {
        print "&lt;tr&gt;&lt;td&gt;" . $records{ $key }-&gt;{ 'first' } . ' ';
        print $records{ $key }-&gt;{ 'last' } . "&lt;/td&gt;&lt;td&gt;";
        print $records{ $key }-&gt;{ 'phone' } . "&lt;/td&gt;&lt;td&gt;";
        print $records{ $key }-&gt;{ 'street' } . ', ';
        print $records{ $key }-&gt;{ 'city' } . ', ';
        print $records{ $key }-&gt;{ 'state' } . ' ';
        print $records{ $key }-&gt;{ 'zip' } . "&lt;/td&gt;&lt;/tr&gt;\n";
    }
    print "&lt;/table&gt;\n&lt;/body&gt;&lt;/html&gt;\n";
}</pre></blockquote></div>
<p>To understand how this program works, we need to study the handlers. All handlers called by <tt class="literal">XML::Parser</tt> receive a reference to the <tt class="literal">expat</tt> parser object as their first argument, a courtesy to developers in case they want to access its data (for example, to check the input file's current line number). Other arguments may be passed, depending on the kind of event. For example, the start-element event handler gets the name of the element as the second argument, and then gets a list of attribute names and values.</p>
<p>Our handlers use global variables to store information. If you don't like global variables (in larger programs, they can be a headache to debug), you can create an object that stores the information internally. You would then give the parser your object's methods as handlers. We'll stick with globals for now because they are easier to read in our example.</p>
<p>The first<a name="INDEX-358" /> handler is <tt class="literal">handle_doc_start</tt>, called at the start of parsing. This handler is a convenient way to do some work before processing the document. In our case, it just outputs HTML code to begin the HTML page in which the sorted address entries will be formatted. This subroutine has no special arguments.</p>
<p>The next handler, <tt class="literal">handle_elem_start</tt>, is called whenever the parser encounters the start of a new element. After the obligatory <tt class="literal">expat</tt> reference, the routine gets two arguments: <tt class="literal">$name</tt>, which is the element name, and <tt class="literal">%atts</tt>, a hash of attribute names and values. (Note that using a hash will not preserve the order of attributes, so if order is important to you, you should use an <tt class="literal">@atts</tt> array instead.) 
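</p>
<p>A minimal sketch of the array form follows, assuming hypothetical <tt class="literal">type</tt>, <tt class="literal">id</tt>, and <tt class="literal">added</tt> attributes (the address book itself has none). To keep the sketch runnable without a parser, we call the handler by hand with the kind of argument list a start handler receives; the <tt class="literal">undef</tt> stands in for the <tt class="literal">expat</tt> object, and the <tt class="literal">@seen</tt> list exists only for illustration:</p>
<blockquote><pre class="code">use strict;
use warnings;

my @seen;    # attribute names, in the order the handler received them

# Start handler that takes the attributes as a list rather than a hash.
sub handle_elem_start {
    my( $expat, $name, @atts ) = @_;

    # @atts is a flat list: name, value, name, value, ...
    while( my( $att, $value ) = splice( @atts, 0, 2 )) {
        push @seen, $att;
        print "$name: $att = $value\n";
    }
}

# Stand-in call with made-up attributes; normally the parser
# would invoke the handler itself.
handle_elem_start( undef, 'entry', 'type', 'home', 'id', 'e1', 'added', '2002' );</pre></blockquote>
<p>Walking the list two items at a time visits the attributes in the order they were passed, which a hash cannot guarantee.</p>
<p>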
For this simple example, we don't use attributes, but we leave open the possibility of using them later.</p>
<p>This routine sets up processing of an element by saving the name of the element in a variable called <tt class="literal">$context</tt>. Saving the element's name ensures that we will know what to do with character data events the parser will send later. When the element is an <tt class="literal">&lt;entry&gt;</tt>, the routine also points <tt class="literal">$record</tt> at a fresh, empty hash, which will contain the data for each of <tt class="literal">&lt;entry&gt;</tt>'s subelements in a convenient look-up table.</p>
<p>The handler <tt class="literal">handle_char_data</tt> takes care of nonmarkup data, basically all the character data in elements. This text is passed as the second argument, here called <tt class="literal">$text</tt>. The handler only needs to save the content in the buffer <tt class="literal">$record-&gt;{ $context }</tt>. Notice that we append the character data to the buffer, rather than assign it outright. <tt class="literal">XML::Parser</tt> has a funny quirk in which it calls the character handler after each line or newline-separated string of text.<a href="#FOOTNOTE-23">[23]</a> Thus, if the content of an element includes a newline character, this will result in two separate calls to the handler. If you didn't append the data, then the last call would overwrite the one before it.</p>
<blockquote class="footnote"> <a name="FOOTNOTE-23" /><p>[23] This way of reading text is uniquely Perlish. XML purists might be confused about this handling of character data. XML doesn't care about newlines, or any whitespace for that matter; it's all just character data and is treated the same way. </p></blockquote>
<p>Not surprisingly, <tt class="literal">handle_elem_end</tt> handles end-of-element events. The second argument is the element's name, as with the start-element event handler. For most elements, there's not much to do here, but for <tt class="literal">&lt;entry&gt;</tt>, we have a final housekeeping task. 
At this point, all the information for a record has been collected, so the record is complete. We only have to store it in a hash, indexed by the person's full name (last name followed by first name) so that we can easily sort the records later. The sorting can be done only after all the records are in, so we need to store the record for later processing. If we weren't interested in sorting, we could just output the record as HTML.</p>
<p>Finally, the <tt class="literal">handle_doc_end</tt> handler completes our set, performing any final tasks that remain after reading the document. It so happens that we do have something to do. We need to print out the records, sorted alphabetically by contact name. The subroutine generates an HTML table to format the entries nicely.</p>
<p>This example, which involved a flat sequence of records, was pretty simple, but not all XML is like that. In some complex document formats, you have to consider the parent, grandparent, and even distant ancestors of the current element to decide what to do with an event. Remembering an element's ancestry requires a more sophisticated state-saving structure, which<a name="INDEX-359" /> <a name="INDEX-360" /> we will show<a name="INDEX-361" /><a name="INDEX-362" /> in a later example.</p>
<hr width="684" align="left" />
<div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch04_05.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"><img alt="Home" border="0" src="../gifs/txthome.gif" /></a></td><td align="right" valign="top" width="228"><a href="ch05_01.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr><tr><td align="left" valign="top" width="228">4.5. XML::PYX</td><td align="center" valign="top" width="228"><a href="index/index.htm"><img alt="Book Index" border="0" src="../gifs/index.gif" /></a></td><td align="right" valign="top" width="228">5. 
SAX</td></tr></table></div>
<hr width="684" align="left" />
<img alt="Library Navigation Links" border="0" src="../gifs/navbar.gif" usemap="#library-map" />
<p><font size="-1"><a href="copyrght.htm">Copyright © 2002</a> O'Reilly &amp; Associates. All rights reserved.</font></p>
<map name="library-map"><area shape="rect" coords="1,0,85,94" href="../index.htm" /><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm" /><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm" /><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm" /><area shape="rect" coords="354,1,446,115" href="../prog/index.htm" /><area shape="rect" coords="448,0,526,132" href="../tk/index.htm" /><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm" /><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm" /></map></body></html>