<html><head><title>XML::DOM (Perl and XML)</title>
<link rel="stylesheet" type="text/css" href="../style/style1.css" />
<meta name="DC.Creator" content="Erik T. Ray and Jason McIntosh" />
<meta name="DC.Format" content="text/xml" scheme="MIME" />
<meta name="DC.Language" content="en-US" />
<meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc." />
<meta name="DC.Source" scheme="ISBN" content="059600205XL" />
<meta name="DC.Subject.Keyword" content="stuff" />
<meta name="DC.Title" content="Perl and XML" />
<meta name="DC.Type" content="Text.Monograph" /></head>
<body bgcolor="#ffffff"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl &amp; XML" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map>
<div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch07_02.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch07_04.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr></table></div>
<h2 class="sect1">7.3. XML::DOM</h2>
<p>Enno <a name="INDEX-652" />Derksen's <tt class="literal">XML::DOM</tt><a name="INDEX-653" /> <a name="INDEX-654" />module is a good place to start exploring DOM in Perl. It's a complete implementation of Level 1 DOM with a few extra features thrown in for convenience. <tt class="literal">XML::DOM::Parser</tt> extends <tt class="literal">XML::Parser</tt> to build a document tree, which it stores in an <tt class="literal">XML::DOM::Document</tt> object and returns by reference. This reference gives you complete access to the tree. The rest, we happily report, works pretty much as you'd expect.</p>
<p>Here's a program that uses DOM to process an<a name="INDEX-655" /> XHTML file.
It looks inside <tt class="literal">&lt;p&gt;</tt> elements for the word "monkeys," replacing every instance with a link to <tt class="literal">monkeystuff.com</tt>. Sure, you could do it with a regular expression substitution, but this example is valuable because it shows how to search for and create new nodes, and read and change values, all in the unique DOM style.</p>
<p>The first part of the program creates a parser object and gives it a file to parse with the call to<a name="INDEX-656" /> <tt class="literal">parsefile( )</tt>:</p>
<blockquote><pre class="code">use XML::DOM;

process_file( shift @ARGV );

sub process_file {
    my $infile = shift;
    my $dom_parser = XML::DOM::Parser->new;         # create a parser object
    my $doc = $dom_parser->parsefile( $infile );    # make it parse a file
    add_links( $doc );                              # perform our changes
    print $doc->toString;                           # output the tree again
    $doc->dispose;                                  # clean up memory
}</pre></blockquote>
<p>The <tt class="literal">parsefile( )</tt> method returns a reference to an <tt class="literal">XML::DOM::Document</tt> object, which is our gateway to the nodes inside. We pass this reference along to a routine called <tt class="literal">add_links( )</tt>,<a name="INDEX-657" /> which will do all the processing we require. Finally, we output the tree with a call to <tt class="literal">toString( )</tt><a name="INDEX-658" />, and then dispose of the object.
This last step breaks any circular references between nodes, which would otherwise keep Perl from reclaiming the tree's memory.</p>
<p>The next part burrows into the tree to start processing paragraphs:</p>
<blockquote><pre class="code">sub add_links {
    my $doc = shift;

    # find all the &lt;p&gt; elements
    my $paras = $doc->getElementsByTagName( "p" );
    for( my $i = 0; $i &lt; $paras->getLength; $i++ ) {
        my $para = $paras->item( $i );

        # for each child of a &lt;p&gt;, if it is a text node, process it
        my @children = $para->getChildNodes;
        foreach my $node ( @children ) {
            # node types are numeric constants, so compare with ==
            fix_text( $node ) if( $node->getNodeType == TEXT_NODE );
        }
    }
}</pre></blockquote>
<p>The <tt class="literal">add_links( )</tt><a name="INDEX-659" /> routine starts with a call to the document object's <tt class="literal">getElementsByTagName( )</tt><a name="INDEX-660" /> method. It returns an <tt class="literal">XML::DOM::NodeList</tt> object containing all matching <tt class="literal">&lt;p&gt;</tt>s in the document (multilevel searching is so convenient) from which we can select nodes by index using <tt class="literal">item( )</tt><a name="INDEX-661" />.</p>
<p>The bit we're interested in will be hiding inside a text node inside the <tt class="literal">&lt;p&gt;</tt> element, so we have to iterate over the children to find text nodes and process them. The call to <tt class="literal">getChildNodes( )</tt><a name="INDEX-662" /> gives us the child nodes, either in a <tt class="literal">generic</tt> Perl list (when called in list context) or another <tt class="literal">XML::DOM::NodeList</tt> object; for variety's sake, we've selected the first option. For each node, we test its type with a call to <tt class="literal">getNodeType</tt> and compare the result to <tt class="literal">XML::DOM</tt>'s constant for text nodes, provided by<a name="INDEX-663" /> <tt class="literal">TEXT_NODE( )</tt>.
Nodes that pass the test are sent off to a routine for some node massaging.</p>
<p>The last part of the program targets text nodes and splits them around the word "monkeys" to create a link:</p>
<blockquote><pre class="code">sub fix_text {
    my $node = shift;
    my $text = $node->getNodeValue;
    if( $text =~ /(monkeys)/i ) {

        # split the text node into 2 text nodes around the monkey word
        my( $pre, $orig, $post ) = ( $`, $1, $' );
        my $tnode = $node->getOwnerDocument->createTextNode( $pre );
        $node->getParentNode->insertBefore( $tnode, $node );
        $node->setNodeValue( $post );

        # insert an &lt;a&gt; element between the two nodes
        my $link = $node->getOwnerDocument->createElement( 'a' );
        $link->setAttribute( 'href', 'http://www.monkeystuff.com/' );
        $tnode = $node->getOwnerDocument->createTextNode( $orig );
        $link->appendChild( $tnode );
        $node->getParentNode->insertBefore( $link, $node );

        # recurse on the rest of the text node
        # in case the word appears again
        fix_text( $node );
    }
}</pre></blockquote>
<p>First, the routine grabs the node's text value by calling its <tt class="literal">getNodeValue( )</tt> method. DOM specifies redundant accessor methods used to get and set values or names, either through the generic <tt class="literal">Node</tt><a name="INDEX-664" /> class or through the more specific class's methods. Instead of <tt class="literal">getNodeValue( )</tt><a name="INDEX-665" />, we could have used <tt class="literal">getData( )</tt>, which is specific to the text node class. For some nodes, such as elements, there is no defined value, so the generic <tt class="literal">getNodeValue( )</tt> method would return an undefined value.</p>
<p>Next, we slice the node in two. We do this by creating a new text node and inserting it before the existing one. After we set the text values of each node, the first will contain everything before the word "monkeys", and the other will have everything after the word.
Note the use of the <tt class="literal">XML::DOM::Document</tt> object as a factory to create the new text node. This DOM feature takes care of many administrative tasks behind the scenes, making the genesis of new nodes painless.</p>
<p>After that step, we create an <tt class="literal">&lt;a&gt;</tt> element and insert it between the text nodes. Like all good links, it needs a place to put the URL, so we set it up with an <tt class="literal">href</tt> attribute. To have something to click on, the link needs text, so we create a text node holding the matched word "monkeys" and append it to the element's child list. Then the routine will recurse on the text node after the link in case there are more instances of "monkeys" to process.</p>
<p>Does it work? Running the program on this file:</p>
<blockquote><pre class="code">&lt;html&gt;
&lt;head&gt;&lt;title&gt;Why I like Monkeys&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;&lt;h1&gt;Why I like Monkeys&lt;/h1&gt;
&lt;h2&gt;Monkeys are Cute&lt;/h2&gt;
&lt;p&gt;Monkeys are &lt;b&gt;cute&lt;/b&gt;. They are like small, hyper versions of
ourselves. They can make funny facial expressions and stick out their
tongues.&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;</pre></blockquote>
<p>produces this<a name="INDEX-666" /> <a name="INDEX-667" /> output:</p>
<blockquote><pre class="code">&lt;html&gt;
&lt;head&gt;&lt;title&gt;Why I like Monkeys&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;&lt;h1&gt;Why I like Monkeys&lt;/h1&gt;
&lt;h2&gt;Monkeys are Cute&lt;/h2&gt;
&lt;p&gt;&lt;a href="http://www.monkeystuff.com/"&gt;Monkeys&lt;/a&gt; are &lt;b&gt;cute&lt;/b&gt;. They are like small, hyper versions of
ourselves. They can make funny facial expressions and stick out their
tongues.&lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;</pre></blockquote>
<hr width="684" align="left" />
<div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch07_02.htm"><img alt="Previous" border="0" src="../gifs/txtpreva.gif" /></a></td><td align="center" valign="top" width="228"><a href="index.htm"><img alt="Home" border="0" src="../gifs/txthome.gif" /></a></td><td align="right" valign="top" width="228"><a href="ch07_04.htm"><img alt="Next" border="0" src="../gifs/txtnexta.gif" /></a></td></tr><tr><td align="left" valign="top" width="228">7.2. DOM Class Interface Reference</td><td align="center" valign="top" width="228"><a href="index/index.htm"><img alt="Book Index" border="0" src="../gifs/index.gif" /></a></td><td align="right" valign="top" width="228">7.4. XML::LibXML</td></tr></table></div>
<hr width="684" align="left" />
<img alt="Library Navigation Links" border="0" src="../gifs/navbar.gif" usemap="#library-map" />
<p><font size="-1"><a href="copyrght.htm">Copyright © 2002</a> O'Reilly &amp; Associates. All rights reserved.</font></p>
<map name="library-map"><area shape="rect" coords="1,0,85,94" href="../index.htm" /><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm" /><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm" /><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm" /><area shape="rect" coords="354,1,446,115" href="../prog/index.htm" /><area shape="rect" coords="448,0,526,132" href="../tk/index.htm" /><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm" /><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm" /></map>
</body></html>