interesting code that uses one of the many available XML-parsing Perl modules. To that end, we'll write only a fraction of a pure-Perl XML parser with a very specific goal in mind.</p>
<a name="ch03-4-fm2xml" /><blockquote><p><b>WARNING:</b> Feel free to play with this program, but please don't try to use this code in a production environment! It's not a real Perl and XML solution, but an illustration of the sorts of things that parsers do. Also, it's incomplete and will not always give correct results, as we'll show later. Don't worry; the rest of this book talks about real XML parsers and Perl tools you'll want to use.</p></blockquote>
<p>The program is a <a name="INDEX-182" />loop in which regular expressions match XML markup objects and pluck them out of the text. The loop runs until nothing is left to remove, meaning the document is well-formed, or until the regular expressions can't match anything in the remaining text, in which case it's not well-formed. A few other tests could abort the parsing, such as when an end tag is found that doesn't match the name of the currently open start tag. It won't be perfect, but it should give you a good idea of how a well-formedness parser might work.</p>
<p><a href="ch03_01.htm#perlxml-CHP-3-EX-1">Example 3-1</a> is a routine that parses a string of XML text, tests to see if it is well-formed, and returns a boolean value. We've added some pattern variables to make it easier to understand the regular expressions. For example, the string <tt class="literal">$ident</tt> contains regular expression code to match an XML identifier, which is used for elements, attributes, and processing instructions.</p>
<a name="perlxml-CHP-3-EX-1" /><div class="example"><h4 class="objtitle">Example 3-1. 
A rudimentary XML parser</h4>
<blockquote><pre class="code">sub is_well_formed {
    my $text = shift;                     # XML text to check

    # match patterns
    my $ident = '[:_A-Za-z][:A-Za-z0-9\-\._]*';   # identifier
    my $optsp = '\s*';                            # optional space
    my $att1 = "$ident$optsp=$optsp\"[^\"]*\"";   # attribute
    my $att2 = "$ident$optsp=$optsp'[^']*'";      # attr. variant
    my $att = "($att1|$att2)";                    # any attribute

    my @elements = ( );                   # stack of open elems

    # loop through the string to pull out XML markup objects
    while( length($text) ) {

        # match an empty element
        if( $text =~ /^&lt;($ident)(\s+$att)*\s*\/&gt;/ ) {
            $text = $';

        # match an element start tag
        } elsif( $text =~ /^&lt;($ident)(\s+$att)*\s*&gt;/ ) {
            push( @elements, $1 );
            $text = $';

        # match an element end tag
        } elsif( $text =~ /^&lt;\/($ident)\s*&gt;/ ) {
            return unless( $1 eq pop( @elements ));
            $text = $';

        # match a comment
        } elsif( $text =~ /^&lt;!--/ ) {
            $text = $';
            # bite off the rest of the comment
            if( $text =~ /--&gt;/ ) {
                $text = $';
                return if( $` =~ /--/ );  # comments can't
                                          # contain '--'
            } else {
                return;
            }

        # match a CDATA section
        } elsif( $text =~ /^&lt;!\[CDATA\[/ ) {
            $text = $';
            # bite off the rest of the CDATA section
            if( $text =~ /\]\]&gt;/ ) {
                $text = $';
            } else {
                return;
            }

        # match a processing instruction
        } elsif( $text =~ m|^&lt;\?$ident\s*[^\?]+\?&gt;| ) {
            $text = $';

        # match extra whitespace
        # (in case there is space outside the root element)
        } elsif( $text =~ m|^\s+| ) {
            $text = $';

        # match character data
        } elsif( $text =~ /(^[^&lt;&amp;&gt;]+)/ ) {
            my $data = $1;
            # make sure the data is inside an element
            return if( $data =~ /\S/ and not( @elements ));
            $text = $';

        # match entity reference
        } elsif( $text =~ /^&amp;$ident;+/ ) {
            $text = $';

        # something unexpected
        } else {
            return;
        }
    }

    return if( @elements );     # the stack should be empty
    return 1;
}</pre></blockquote></div>
<p>Perl's <a name="INDEX-183" />arrays are so useful partly because they can masquerade as more abstract computer science data structures.<a href="#FOOTNOTE-14">[14]</a> Here, we use a data structure called 
a <em class="emphasis">stack</em><a name="INDEX-184" />, which is really just an array that we access with <tt class="literal">push( )</tt> and <tt class="literal">pop( )</tt>. Items in a stack are<a name="INDEX-185" /> <a name="INDEX-186" /> last-in, first-out (LIFO), meaning that the last thing put into it will be the first thing to be removed from it. This arrangement is convenient for remembering the names of currently open elements because at any time, the next element to be closed was the last element pushed onto the stack. Whenever we encounter a start tag, its name is pushed onto the stack, and that name is popped from the stack when we find an end tag. To be well-formed, every end tag must match the most recent start tag that is still open, which is why we need the stack.</p>
<blockquote class="footnote"> <a name="FOOTNOTE-14" /><p>[14] The O'Reilly book <em class="citetitle">Mastering Algorithms with Perl</em> by Jon Orwant, Jarkko Hietaniemi, and John Macdonald devotes a chapter to this topic.</p> </blockquote>
<p>The stack represents all the elements along a branch of the XML tree, from the root down to the current element being processed. Elements are processed in the order in which they appear in a document; if you view the document as a tree, it looks like you're going from the root all the way down to the tip of a branch, then back up to another branch, and so on. This is called <em class="emphasis">depth-first order</em><a name="INDEX-187" />, the canonical way all XML documents are processed.</p>
<p>There are a few places where we deviate from the simple looping scheme to do some extra testing. The code for matching a comment takes several steps, since a comment ends with a three-character delimiter, and we also have to check for an illegal string of dashes "<tt class="literal">--</tt>" inside the comment. 
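</p>
<p>The comment check just described can be sketched in isolation. This is a minimal sketch under stated assumptions, not code from Example 3-1: the helper name <tt class="literal">skip_comment</tt> is hypothetical, and it uses <tt class="literal">index( )</tt> and <tt class="literal">substr( )</tt> in place of the match variables that the example relies on.</p>

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not part of
# Example 3-1). Given text that begins with "<!--", return the text
# that follows the comment. Return undef if the text does not start
# with a comment, the comment is unterminated, or its body contains
# the illegal string "--".
sub skip_comment {
    my $text = shift;

    return unless $text =~ s/^<!--//;   # strip the opening delimiter

    my $end = index( $text, '-->' );    # find the first closing delimiter
    return if $end < 0;                 # unterminated comment

    my $body = substr( $text, 0, $end );
    return if $body =~ /--/;            # "--" is illegal inside a comment

    return substr( $text, $end + 3 );   # everything after "-->"
}
```

<p>Called on <tt class="literal">&lt;!-- note --&gt;&lt;a/&gt;</tt>, the helper returns <tt class="literal">&lt;a/&gt;</tt>; called on <tt class="literal">&lt;!-- a -- b --&gt;</tt>, it returns an undefined value because of the embedded dashes.</p>
<p>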
The character data matcher is also noteworthy: it performs an extra check to see if the stack is empty. If it is, and the data contains anything other than whitespace, that's an error, because nonwhitespace text is not allowed outside of the root element. Here is a short list of well-formedness errors that would cause the parser to return a false result:</p>
<ul>
<li><p>An identifier in an element or attribute is malformed (examples: "<tt class="literal">12foo</tt>," "<tt class="literal">-bla</tt>," and "<tt class="literal">..</tt>").</p></li>
<li><p>A nonwhitespace character is found outside of the root element.</p></li>
<li><p>An element end tag doesn't match the last discovered start tag.</p></li>
<li><p>An attribute is unquoted or uses a bad combination of quote characters.</p></li>
<li><p>An empty element is missing a <a name="INDEX-188" /> <a name="INDEX-189" />slash character (<tt class="literal">/</tt>) at the end of its tag.</p></li>
<li><p>An illegal character, such as a lone <a name="INDEX-190" /> <a name="INDEX-191" />ampersand (<tt class="literal">&amp;</tt>) or an <a name="INDEX-192" /> <a name="INDEX-193" />angle bracket (<tt class="literal">&lt;</tt>), is found in character data.</p></li>
<li><p>A malformed markup tag (examples: "<tt class="literal">&lt;fooby&lt;</tt>" and "<tt class="literal">&lt;?bubba?&gt;</tt>") is found.</p></li>
</ul>
<p>Try the parser out on some test cases. Probably the simplest complete, well-formed XML document you will ever see is this:</p>
<blockquote><pre class="code">&lt;:-/&gt;</pre></blockquote>
<p>The next document should cause the parser to halt with an error. (Hint: look at the <tt class="literal">&lt;message&gt;</tt> end tag.)</p>
<blockquote><pre class="code">&lt;memo&gt;
  &lt;to&gt;self&lt;/to&gt;
  &lt;message&gt;Don't forget to mow the car and wash the lawn.&lt;message&gt;
&lt;/memo&gt;</pre></blockquote>
<p>Many other kinds of <a name="INDEX-194" />syntax errors could appear in a document, and our program picks up most of them. However, it does miss a few. 
For example, there should be exactly one root element, but our program will accept more than one:</p>
<blockquote><pre class="code">&lt;root&gt;I am the one, true root!&lt;/root&gt;
&lt;root&gt;No, I am!&lt;/root&gt;
&lt;root&gt;Uh oh...&lt;/root&gt;</pre></blockquote>
<p>Other problems? The parser cannot handle a document type declaration. This structure is sometimes seen at the top of a document that specifies a <a name="INDEX-195" />DTD for validating parsers, and it may also declare some entities. Because it has a specialized syntax of its own, we'd have to write another loop just for the document type declaration.</p>
<p>Our parser's most significant omission is the resolution of <a name="INDEX-196" />entity references. It can check basic entity reference syntax, but doesn't bother to expand the entity and insert it into the text. Why is that bad? Consider that an entity can contain more than just some character data. It can contain any amount of markup, too, from an element to a big, external file. Entities can also contain other entity references, so it might require many passes to resolve one entity reference completely. The parser doesn't even check to see if the entities are declared (it couldn't anyway, since it doesn't know how to read document type declaration syntax). Clearly, there is a lot of room for errors to creep into a document through entities, right under the nose of our parser. To fix the problems just mentioned, follow these steps:</p>
<ol>
<li><p>Add a parsing loop to read in a document type declaration before any other parsing occurs. Any entity declarations would be parsed and stored, so we can resolve entity references later in the document.</p></li>
<li><p>Parse the DTD, if the document type declaration mentions one, to read any entity declarations.</p></li>
<li><p>In the main loop, resolve all entity references when we come across them. These entities have to be parsed, and there may be entity references within them, too. 
The process can be rather loopy, with loops inside loops, recursion, or other complex programming stunts.</p></li>
</ol>
<p>What started out as a simple parser has now grown into a complex beast. That tells us two things: that the theory of parsing XML is easy to grasp; and that, in practice, it gets complicated very quickly. This exercise was useful because it showed issues involved in parsing XML, but we don't encourage you to write code like this. On the contrary, we expect you to take advantage of the exhaustive work already put into making ready-made parsers. Let's leave the dark ages and walk into the happy <a name="INDEX-197" />land of prepackaged<a name="INDEX-198" /> parsers.</p>
</div></div>
<p><font size="-1"><a href="copyrght.htm">Copyright © 2002</a> O'Reilly &amp; Associates. 
All rights reserved.</font></p></body></html>