replica of the original document with a few modifications. Specifically, it makes these changes to a document:</p>
<ul><li><p>Turns every XML comment into a <tt class="literal">&lt;comment&gt;</tt> element</p></li><li><p>Deletes processing instructions</p></li><li><p>Removes tags, but leaves the content, for <tt class="literal">&lt;literal&gt;</tt> elements that occur within <tt class="literal">&lt;programlisting&gt;</tt> elements at any level</p></li></ul>
<p>The code for this program is listed in <a href="ch05_01.htm#perlxml-CHP-5-EX-1">Example 5-1</a>. Like the last program, we initialize the parser with a set of handlers, except this time they are bundled together in a convenient package: an object of a class called <tt class="literal">MyHandler</tt>. Notice that we've implemented a few more handlers, since we want to be able to deal with comments, processing instructions, and the document prolog.</p>
<a name="perlxml-CHP-5-EX-1" /><div class="example"><h4 class="objtitle">Example 5-1. Filter program </h4><blockquote><pre class="code"># initialize the parser
#
use XML::Parser::PerlSAX;
my $parser = XML::Parser::PerlSAX-&gt;new( Handler =&gt; MyHandler-&gt;new( ) );

if( my $file = shift @ARGV ) {
    $parser-&gt;parse( Source =&gt; {SystemId =&gt; $file} );
} else {
    my $input = "";
    while( &lt;STDIN&gt; ) { $input .= $_; }
    $parser-&gt;parse( Source =&gt; {String =&gt; $input} );
}
exit;

#
# global variables
#
my @element_stack;                # remembers element names
my $in_intset;                    # flag: are we in the internal subset?

###
### Document Handler Package
###
package MyHandler;

#
# initialize the handler package
#
sub new {
    my $type = shift;
    return bless {}, $type;
}

#
# handle a start-of-element event: output start tag and attributes
#
sub start_element {
    my( $self, $properties ) = @_;
    # note: the hash %{$properties-&gt;{'Attributes'}} does not preserve
    # attribute order

    # close internal subset if still open
    output( "]&gt;\n" ) if( $in_intset );
    $in_intset = 0;

    # remember the name by pushing onto the stack
    push( @element_stack, $properties-&gt;{'Name'} );

    # output the tag and attributes UNLESS it's a &lt;literal&gt;
    # inside a &lt;programlisting&gt;
    unless( stack_top( 'literal' ) and
            stack_contains( 'programlisting' )) {
        output( "&lt;" . $properties-&gt;{'Name'} );
        my %attributes = %{$properties-&gt;{'Attributes'}};
        foreach( keys( %attributes )) {
            output( " $_=\"" . $attributes{$_} . "\"" );
        }
        output( "&gt;" );
    }
}

#
# handle an end-of-element event: output end tag UNLESS it's from a
# &lt;literal&gt; inside a &lt;programlisting&gt;
#
sub end_element {
    my( $self, $properties ) = @_;
    output( "&lt;/" . $properties-&gt;{'Name'} . "&gt;" )
        unless( stack_top( 'literal' ) and
                stack_contains( 'programlisting' ));
    pop( @element_stack );
}

#
# handle a character data event
#
sub characters {
    my( $self, $properties ) = @_;
    # parser unfortunately resolves some character entities for us,
    # so we need to replace them with entity references again;
    # the ampersand goes first so the other references are not
    # escaped twice, and /g catches every occurrence in the data
    my $data = $properties-&gt;{'Data'};
    $data =~ s/\&amp;/\&amp;amp;/g;
    $data =~ s/&lt;/\&amp;lt;/g;
    $data =~ s/&gt;/\&amp;gt;/g;
    output( $data );
}

#
# handle a comment event: turn into a &lt;comment&gt; element
#
sub comment {
    my( $self, $properties ) = @_;
    output( "&lt;comment&gt;" . $properties-&gt;{'Data'} . "&lt;/comment&gt;" );
}

#
# handle a PI event: delete it
#
sub processing_instruction {
    # do nothing!
}

#
# handle internal entity reference (we don't want them resolved)
#
sub entity_reference {
    my( $self, $properties ) = @_;
    output( "&amp;" . $properties-&gt;{'Name'} . ";" );
}

#
# true if the innermost open element has the given name
#
sub stack_top {
    my $guess = shift;
    return $element_stack[ $#element_stack ] eq $guess;
}

#
# true if any open element has the given name
#
sub stack_contains {
    my $guess = shift;
    foreach( @element_stack ) {
        return 1 if( $_ eq $guess );
    }
    return 0;
}

#
# write a string to standard output
#
sub output {
    my $string = shift;
    print $string;
}</pre></blockquote></div>
<p>Looking closely at the handlers, we see that one argument is passed, in addition to the obligatory object reference <tt class="literal">$self</tt>. This argument is a reference to a hash of properties about the event. This technique has one disadvantage: in the element start handler, the attributes are stored in a hash, which has no memory of the original attribute order.
Semantically, this is not a big deal, since XML attaches no significance to attribute order. However, there may be cases when you want to replicate that order.<a href="#FOOTNOTE-25">[25]</a></p>
<blockquote class="footnote"> <a name="FOOTNOTE-25" /><p>[25] In the case of our filter, we might want to compare the versions from before and after processing using a utility such as the Unix program <em class="emphasis">diff</em>. Such a comparison would yield many false differences where the order of attributes changed. Instead of using <em class="emphasis">diff</em>, you should consider using the module <tt class="literal">XML::SemanticDiff</tt> by Kip Hampton. This module would ignore syntactic differences and compare only the semantics of two documents.</p> </blockquote>
<p>As a filter, this program should preserve everything about the original document, except for the few details that have to be changed. That means it has to account for the document prolog, processing instructions, and comments instead of ignoring them. Even entity references should be preserved as they are instead of being resolved (as the parser may want to do). Therefore, the program has a few more handlers than the last example, in which we were interested only in extracting very specific information.</p>
<p>Let's test this program now. Our input datafile is listed in <a href="ch05_01.htm#perlxml-CHP-5-EX-2">Example 5-2</a>.</p>
<a name="perlxml-CHP-5-EX-2" /><div class="example"><h4 class="objtitle">Example 5-2. Data for the filter </h4><blockquote><pre class="code">&lt;?xml version="1.0"?&gt;
&lt;!DOCTYPE book
  SYSTEM "/usr/local/prod/sgml/db.dtd"
[
  &lt;!ENTITY thingy "hoo hah blah blah"&gt;
]&gt;
&lt;book id="mybook"&gt;
&lt;?print newpage?&gt;
  &lt;title&gt;GRXL in a Nutshell&lt;/title&gt;
  &lt;chapter id="intro"&gt;
    &lt;title&gt;What is GRXL?&lt;/title&gt;
&lt;!-- need a better title --&gt;
    &lt;para&gt;
Yet another acronym. That was our attitude at first, but then we saw
the amazing uses of this new technology called
&lt;literal&gt;GRXL&lt;/literal&gt;. Consider the following program:
    &lt;/para&gt;
&lt;?print newpage?&gt;
    &lt;programlisting&gt;AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
  print! &lt;lineannotation&gt;&lt;literal&gt;wow&lt;/literal&gt;&lt;/lineannotation&gt;
or not!&lt;/programlisting&gt;
&lt;!-- what font should we use? --&gt;
    &lt;para&gt;
What does it do? Who cares? It's just lovely to look at. In fact,
I'd have to say, "&amp;thingy;".
    &lt;/para&gt;
&lt;?print newpage?&gt;
  &lt;/chapter&gt;
&lt;/book&gt;</pre></blockquote></div>
<p>The result, after running the program on the data, is shown in <a href="ch05_01.htm#perlxml-CHP-5-EX-3">Example 5-3</a>.</p>
<a name="perlxml-CHP-5-EX-3" /><div class="example"><h4 class="objtitle">Example 5-3. Output from the filter </h4><blockquote><pre class="code">&lt;book id="mybook"&gt;
  &lt;title&gt;GRXL in a Nutshell&lt;/title&gt;
  &lt;chapter id="intro"&gt;
    &lt;title&gt;What is GRXL?&lt;/title&gt;
&lt;comment&gt; need a better title &lt;/comment&gt;
    &lt;para&gt;
Yet another acronym. That was our attitude at first, but then we saw
the amazing uses of this new technology called
&lt;literal&gt;GRXL&lt;/literal&gt;. Consider the following program:
    &lt;/para&gt;
    &lt;programlisting&gt;AH aof -- %%%%
{{{{{{ let x = 0 }}}}}}
  print! &lt;lineannotation&gt;wow&lt;/lineannotation&gt;
or not!&lt;/programlisting&gt;
&lt;comment&gt; what font should we use? &lt;/comment&gt;
    &lt;para&gt;
What does it do? Who cares? It's just lovely to look at. In fact,
I'd have to say, "&amp;thingy;".
    &lt;/para&gt;
  &lt;/chapter&gt;
&lt;/book&gt;</pre></blockquote></div>
<p>Here's what the filter did right. It turned each XML comment into a <tt class="literal">&lt;comment&gt;</tt> element and deleted the processing instructions. The <tt class="literal">&lt;literal&gt;</tt> element in the <tt class="literal">&lt;programlisting&gt;</tt> was removed, with its contents left intact, while other <tt class="literal">&lt;literal&gt;</tt> elements were preserved. Entity references were left unresolved, as we wanted. So far, so good. But something's missing. The XML declaration, document type declaration, and internal subset are gone. Without the declaration for the entity <tt class="literal">thingy</tt>, this document is not valid.
It looks like the handlers<a name="INDEX-378" /> <a name="INDEX-379" /> we had available to us were not sufficient.</p></div><p><p><font size="-1"><a href="copyrght.htm">Copyright © 2002</a> O'Reilly &amp; Associates. All rights reserved.</font></p></body></html>