<html><head><title>Optimized Tree Processing (Perl and XML)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly & Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="059600205XL" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl and XML" /><meta name="DC.Type" content="Text.Monograph" /></head><body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl & XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch08_03.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch09_01.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div>
<h2 class="sect1">8.4. Optimized Tree Processing</h2>
<p>The big<a name="INDEX-714" /> drawback to using trees for XML crunching is that they tend to consume scandalous amounts of memory and processor time. This might not be apparent with small documents, but it becomes noticeable as documents grow to many thousands of nodes. A typical book of a few hundred pages' length could easily have tens of thousands of nodes. Each one requires the allocation of an object, a process that takes considerable time and memory.</p>
<p>Perhaps you don't need to build the entire tree to get your work done, though. You might only want a small branch of the tree and can safely do all the processing inside of it.
If that's the case, then you can take advantage of the optimized parsing modes in <tt class="literal">XML::Twig</tt> (recall that we dealt with this module earlier in <a href="ch08_02.htm#perlxml-CHP-8-SECT-2">Section 8.2, "XPath"</a>). These modes allow you to specify ahead of time what parts (or "twigs") of the tree you'll be working with so that only those parts are assembled. The result is a hybrid of tree and event processing with highly optimized performance in speed and memory.</p>
<p><tt class="literal">XML::Twig</tt> has three modes of operation: the regular old tree mode, similar to what we've seen so far; "chunk" mode, which builds a whole tree, but has only a fraction of it in memory at a time (sort of like paged memory); and multiple roots mode, which builds only a few selected twigs from the tree.</p>
<p><a href="ch08_04.htm#perlxml-CHP-8-EX-11">Example 8-11</a> demonstrates the power of <tt class="literal">XML::Twig</tt> in chunk mode. The input to this program is a<a name="INDEX-715" /><a name="INDEX-716" /> DocBook book with some <tt class="literal">&lt;chapter&gt;</tt> elements. These documents can be enormous, sometimes a hundred megabytes or more. The program breaks up the processing per chapter so that only a fraction of the space is needed.</p>
<a name="perlxml-CHP-8-EX-11" /><div class="example"><h4 class="objtitle">Example 8-11. A chunking program </h4><blockquote><pre class="code">use XML::Twig;

# initialize the twig, parse, and output the revised twig
my $twig = new XML::Twig( TwigHandlers => { chapter => \&amp;process_chapter });
$twig->parsefile( shift @ARGV );
$twig->print;

# handler for chapter elements: process and then flush up the chapter
sub process_chapter {
  my( $tree, $elem ) = @_;
  &amp;process_element( $elem );
  $tree->flush_up_to( $elem );  # comment out this line to waste memory
}

# append 'foo' to the name of an element
sub process_element {
  my $elem = shift;
  $elem->set_gi( $elem->gi .
                 'foo' );
  my @children = $elem->children;
  foreach my $child ( @children ) {
    next if( $child->gi eq '#PCDATA' );
    &amp;process_element( $child );
  }
}</pre></blockquote></div>
<p>The program appends the string "foo" to every element name. Changing names is just busy work to keep the program running long enough to check the memory usage. Note the line in the function <tt class="literal">process_chapter( )</tt>:</p>
<blockquote><pre class="code">$tree->flush_up_to( $elem );</pre></blockquote>
<p>We get our memory savings from this command. Without it, the entire tree will be built and kept in memory until the document is finally printed out. But when it is called, the tree that has been built up to a given element is dismantled and its text is output (called <em class="emphasis">flushing</em><a name="INDEX-717" />). The memory usage never rises higher than what is needed for the largest chapter in the book.</p>
<p>To test this theory, we ran the program on a 3 MB document, first without and then with the line shown above. Without flushing, the program's heap space grew to over 30 MB. It's staggering to see how much memory an object-oriented tree processor needs: in this case, ten times the size of the file. But with flushing enabled, the program hovered around only a few MB of memory usage, a savings of about 90 percent. In both cases, the entire tree is eventually built, so the total processing time is about the same. To save CPU cycles as well as memory, we need to use multiple roots mode.</p>
<p>Multiple roots mode works by specifying, before parsing, the roots of the twigs that you want built. You will save significant time and memory if the twigs are much smaller than the document as a whole. In our chunk mode example, we probably can't do much to speed up the process, since the sum of <tt class="literal">&lt;chapter&gt;</tt> elements is about the same as the size of the document.
So let's focus on an example that fits the profile.</p>
<p>The program in <a href="ch08_04.htm#perlxml-CHP-8-EX-12">Example 8-12</a> reads in DocBook documents and outputs the titles of chapters, a table of contents of sorts. To get this information, we don't need to build a tree for the whole chapter; only the <tt class="literal">&lt;title&gt;</tt> element is necessary. So for roots, we specify titles of chapters, expressed in the XPath notation <tt class="literal">chapter/title</tt>.</p>
<a name="perlxml-CHP-8-EX-12" /><div class="example"><h4 class="objtitle">Example 8-12. A many-twigged program </h4><blockquote><pre class="code">use XML::Twig;

my $twig = new XML::Twig( TwigRoots => { 'chapter/title' => \&amp;output_title });
$twig->parsefile( shift @ARGV );

sub output_title {
  my( $tree, $elem ) = @_;
  print $elem->text, "\n";
}</pre></blockquote></div>
<p>The key line here is the one with the keyword <tt class="literal">TwigRoots</tt>. It's set to a hash of handlers and works very similarly to the <tt class="literal">TwigHandlers</tt> that we saw earlier. The difference is that instead of building the whole document tree, the program builds only trees whose roots are <tt class="literal">&lt;title&gt;</tt> elements. This is a small fraction of the whole document, so we can expect time and memory savings to be high.</p>
<p>How high? Running the program on the same test data, we saw memory usage barely reach 2 MB, and the total processing time was 13 seconds. Compare that to 30 MB memory usage (the size required to build the whole tree) and a full minute to grind out the titles. This conservation of resources is significant for both memory and CPU time.</p>
<p><tt class="literal">XML::Twig</tt> can give you a big performance boost for your tree processing programs, but you have to know when chunking and multiple roots will help. You won't save much time if the sum of twigs is almost as big as the document itself.
Chunking is not useful unless the chunks are significantly smaller than<a name="INDEX-718" /> the<a name="INDEX-719" /> document.</p>
<hr width="684" align="left" /><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch08_03.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"><img alt="Home" border="0" src="../gifs/txthome.gif" /></a></td><td align="right" valign="top" width="228"><a href="ch09_01.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr><tr><td align="left" valign="top" width="228">8.3. XSLT</td><td align="center" valign="top" width="228"><a href="index/index.htm"><img alt="Book Index" border="0" src="../gifs/index.gif" /></a></td><td align="right" valign="top" width="228">9. RSS, SOAP, and Other XML Applications </td></tr></table></div><hr width="684" align="left" /><img alt="Library Navigation Links" border="0" src="../gifs/navbar.gif" usemap="#library-map" /><p><p><font size="-1"><a href="copyrght.htm">Copyright © 2002</a> O'Reilly & Associates. All rights reserved.</font></p><map name="library-map"><area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map></body></html>