<a name="INDEX-2865" />arrays whose elements are the URL itself, the title of the URL, plus another array pointing to second-level related URLs. The subarray of second-level related URLs contains only two elements: the URL and the title. <a href="ch14_05.htm#ch14-15165">Figure 14-4</a> illustrates this data structure.</p>
<a name="ch14-15165" /><div class="figure"><img width="427" src="figs/cgi2.1404.gif" height="206" alt="Figure 14-4" /></div><h4 class="objtitle">Figure 14-4. Perl data structure that contains the related URLs and subsequent related URLs</h4>
<p>If no related items are found for the submitted top-level URL, a message is printed to notify the user.</p>
<p>Later, we want to print out <a name="INDEX-2867" /><a name="INDEX-2868" /> <a name="INDEX-2,869" />self-referencing hypertext links back to this script. In preparation for this action, we create a variable called <tt class="literal">$scriptname</tt> that will hold the current script name for referencing in &lt;A HREF&gt; tags. CGI.pm's <tt class="function">script_name</tt> method provides a convenient way of getting this data.</p>
<p>Of course, we could have simply chosen a static name for this script. However, it is generally considered good practice to code for flexibility where possible. In this case, we can name the script anything we want and the code here will not have to change.</p>
<p><a name="INDEX-2870" /> <a name="INDEX-2,871" /><a name="INDEX-2872" />For each related URL, we print out "[*]" embedded in an &lt;A&gt; tag that will contain a reference to the script itself plus the current URL being passed to it as a search parameter.
If one element of <tt class="literal">@related</tt> contains <tt class="literal">["http://www.eff.org/", "The Electronic Frontier Foundation"]</tt>, the resulting HTML would look like this:</p>
<blockquote><pre class="code">&lt;A HREF="whatsrelated.cgi?url=http://www.eff.org/" &gt;[*]&lt;/A&gt;
&lt;A HREF="http://www.eff.org/"&gt;The Electronic Frontier Foundation&lt;/A&gt;</pre></blockquote>
<p>This will let the user pursue the "What's Related" trail another step by running this script on the chosen URL. Immediately afterwards, the title (<tt class="literal">$_->[1]</tt>) is printed with a hypertext reference to the URL that the title represents (<tt class="literal">$_->[0]</tt>).</p>
<p><tt class="literal">@subrelated</tt><a name="INDEX-2873" /> contains the URLs that are related to the URL we just printed for the user (<tt class="literal">$_->[2]</tt>). If there are second-level related URLs, we can proceed to print them. The second-level related URL array follows the same format as the related URL array except that there is no third element containing further references to more related URLs. <tt class="literal">$_->[0]</tt> is the URL and <tt class="literal">$_->[1]</tt> is the title of the URL itself. If <tt class="literal">@subrelated</tt> is empty, the user is told that there are no related items for the URL that is currently being displayed.</p>
<p>Finally, we output the footer for the What's Related query results page.
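</p>
<p>The traversal of <tt class="literal">@related</tt> and <tt class="literal">@subrelated</tt> described above can be sketched as follows. This is a minimal illustration rather than the script's actual code: the sample records and the <tt class="literal">@report</tt> array are hypothetical, and the output is plain text instead of the HTML links the script prints.</p>
<blockquote><pre class="code">use strict;
use warnings;

# Hypothetical sample data. Each record is [ URL, title, [ second-level records ] ],
# and each second-level record is [ URL, title ].
my @related = (
    [ 'http://www.eff.org/', 'The Electronic Frontier Foundation',
      [ [ 'http://www.example.org/', 'An Example Second-Level Site' ] ] ],
    [ 'http://www.example.com/', 'A Site With Nothing Related', [] ],
);

my @report;
foreach my $record (@related) {
    # Title first, then the URL it represents.
    push @report, "$record->[1] ($record->[0])";

    # The third element holds the second-level related URLs.
    my @subrelated = @{ $record->[2] || [] };
    if (@subrelated) {
        push @report, "    $_->[1] ($_->[0])" foreach @subrelated;
    } else {
        push @report, '    No related items';
    }
}

print "$_\n" foreach @report;</pre></blockquote>
<p>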
The user is also presented with another text field in which they can enter a new URL to search on.</p>
<p>The <tt class="function">get_whats_related_to_whats_related</tt><a name="INDEX-2874" /> subroutine contains logic to take a URL and construct a data structure that contains not only the URLs that are related to the passed URL, but also the second-level related URLs. <tt class="literal">@related</tt> contains the list of what's related to the first URL.</p>
<p>Then each record in <tt class="literal">@related</tt> is examined to see if there is anything related to that URL. If there is, the third element (<tt class="literal">$record->[2]</tt>) of the record is set to a reference to the second-level related URLs we are currently examining. Finally, the entire <tt class="literal">@related</tt> data structure is returned.</p>
<p>The <tt class="function">get_whats_related</tt> subroutine returns an array of references to arrays, each with two elements: a related URL and the title of that URL. The key to getting this information is to parse it from an XML document. <tt class="literal">$parser</tt> is the XML::Parser object that will be used to perform this task.</p>
<p><a name="INDEX-2875" /><a name="INDEX-2876" />XML parsers do not simply parse data in a linear fashion. After all, XML itself is hierarchical in nature. There are two different ways that XML parsers can look at XML data.</p>
<p>One way is to have the XML parser take the entire document and simply return a tree of objects that represents the XML document hierarchy. Perl supports this concept via the <a name="INDEX-2877" /><a name="INDEX-2878" />XML::Grove module by Ken MacLeod.
The second way to parse XML documents is to use a <a name="INDEX-2879" />SAX (Simple API for XML) style of parser. This type of parser is <a name="INDEX-2880" /> <a name="INDEX-2,881" />event-based and is the style that XML::Parser follows.</p>
<p>The event-based parser is popular because it starts returning data to the calling program as it parses the document. There is no need to wait until the whole document is parsed before getting a picture of how the XML elements are placed in the document. XML::Parser accepts a file handle or the text of an XML document and then goes through its structure looking for certain events. When a particular event is encountered, the parser calls the appropriate Perl subroutine to handle it on the fly.</p>
<p>For this program, we define a handler that looks for the <a name="INDEX-2882" /><a name="INDEX-2883" />start of any XML tag. This handler is declared as a reference to a <a name="INDEX-2884" />subroutine called <tt class="function">handle_start</tt>. The <tt class="function">handle_start</tt> subroutine is declared further below within the local context of the subroutine we are discussing.</p>
<p><a name="INDEX-2885" />XML::Parser can handle more than just start tags. It also supports handlers for other types of parsing events such as end tags, or even for specific tag names. However, in this program, we only need to declare a handler that will be triggered any time an XML start tag is encountered.</p>
<p><tt class="literal">$data</tt> contains the raw XML code to be parsed. The <tt class="function">get</tt><a name="INDEX-2886" /><a name="INDEX-2887" /> <a name="INDEX-2,888" />subroutine was previously imported by pulling the LWP::Simple module into the Perl script.
When we pass <tt class="literal">WHATS_RELATED_URL</tt> along with the URL we are looking for to the <tt class="function">get</tt> subroutine, <tt class="function">get</tt> will go out on the Internet and retrieve the output from the "What's Related" web server.</p>
<p>You will notice that as soon as <tt class="literal">$data</tt> is collected, there is some additional manipulation done to it. <a name="INDEX-2889" /><a name="INDEX-2890" />XML::Parser will parse only well-formed XML documents. Unfortunately, the Netscape XML server sometimes returns data that is not entirely well-formed, so a generic XML parser has difficulty with it.</p>
<p>To get around this problem, we filter out potentially bad data inside of the tags. The <a name="INDEX-2891" /><a name="INDEX-2892" />regular expressions in the above code respectively transform ampersands, double-quotes, HTML tags, and stray &lt; and &gt; characters into well-formed counterparts. The last regular expression filters out non-ASCII characters.</p>
<p>Before parsing the data, we reset the global variables: <tt class="literal">@RECORDS</tt> to the empty list and <tt class="literal">$RELATED_RECORDS</tt> to true (1).</p>
<p>Simply calling the <em class="filename">parse</em> method on the <tt class="literal">$parser</tt> object starts the parsing process. The <tt class="literal">$data</tt> variable that is passed to <tt class="function">parse</tt> is the XML subject to be read.
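</p>
<p>The handler declaration and the call to <tt class="function">parse</tt> described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the script itself: the sample XML in <tt class="literal">$data</tt> is hypothetical, the cleanup step is skipped because the sample is already well-formed, and the filter is simplified to keep any <tt class="literal">child</tt> element that has an <tt class="literal">href</tt> attribute.</p>
<blockquote><pre class="code">use strict;
use warnings;
use XML::Parser;

# Hypothetical, already well-formed sample input (an assumption for illustration).
my $data = '&lt;RDF&gt;&lt;child href="http://www.eff.org/" name="The Electronic Frontier Foundation"/&gt;&lt;/RDF&gt;';

my @RECORDS = ();

# XML::Parser calls this for every start tag it encounters.
sub handle_start {
    my ($expat, $element, %attributes) = @_;
    return unless $element eq 'child' and defined $attributes{href};
    push @RECORDS, [ $attributes{href}, $attributes{name} ];
}

# Register the start-tag handler, then parse the XML text.
my $parser = XML::Parser->new( Handlers => { Start => \&amp;handle_start } );
$parser->parse($data);

print "$_->[1]: $_->[0]\n" foreach @RECORDS;</pre></blockquote>
<p>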
The <tt class="function">parse</tt> method also accepts other types of data, including file handles to XML files.</p>
<p>Recall that the <tt class="function">handle_start</tt> subroutine was passed to the <tt class="literal">$parser</tt> object upon its creation. The <tt class="function">handle_start</tt> subroutine that is declared within the <tt class="function">get_whats_related</tt> subroutine is called by XML::Parser every time a start tag is encountered.</p>
<p><tt class="literal">$expat</tt><a name="INDEX-2893" /> is a reference to the XML::Parser::Expat object that XML::Parser uses to drive the parse. <tt class="literal">$element</tt><a name="INDEX-2894" /><a name="INDEX-2895" /> <a name="INDEX-2,896" />is the start element name and <tt class="literal">%attributes</tt> is a hash table of attributes that were declared inside the XML element.</p>
<p>For this example, we are concerned only with tags that begin with the name "child" and contain the <tt class="literal">href</tt> attribute. In addition, the <tt class="literal">$href</tt> value is filtered so that any non-URL information is stripped out of the URL.</p>
<p>If there is no name attribute, or if the name attribute contains the phrase "Smart Browsing", or if there were no related records found previously for this URL, we do not want to add anything to the <tt class="literal">@RECORDS</tt> array. In addition, if the name attribute contains the phrase "no related", the <tt class="literal">$RELATED_RECORDS</tt> flag is set to false (0).</p>
<p>Otherwise, we add the URL to the <tt class="literal">@RECORDS</tt> array. This is done by making a reference to an array with two elements: the URL and the title of the URL. At the end of the subroutine, the compiled <tt class="literal">@RECORDS</tt> array is returned.</p>
<p>This program was a simple example of using a CGI program to pull data automatically from an XML-based server.
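</p>
<p>The filtering rules applied by <tt class="function">handle_start</tt>, as described above, might look roughly like this. It is a hedged sketch: the exact regular expressions and the sample calls are assumptions, the stripping of non-URL text from <tt class="literal">$href</tt> is omitted, and the handler is invoked directly here to stand in for the calls XML::Parser would make.</p>
<blockquote><pre class="code">use strict;
use warnings;

my @RECORDS         = ();
my $RELATED_RECORDS = 1;

sub handle_start {
    my ($expat, $element, %attributes) = @_;

    # Only "child" elements carrying an href attribute are of interest.
    return unless $element =~ /^child/ and defined $attributes{href};
    my ($href, $name) = @attributes{qw(href name)};

    # A "no related" message switches off record collection.
    $RELATED_RECORDS = 0 if defined $name and $name =~ /no related/;

    # Skip entries with no name, "Smart Browsing" entries, and anything
    # seen after the server reported that nothing is related.
    return if !defined $name or $name =~ /Smart Browsing/ or !$RELATED_RECORDS;

    push @RECORDS, [ $href, $name ];
}

# Hypothetical calls standing in for parser events; $expat is unused here.
handle_start(undef, 'child', href => 'http://www.eff.org/',       name => 'The Electronic Frontier Foundation');
handle_start(undef, 'child', href => 'http://home.netscape.com/', name => 'Smart Browsing');
handle_start(undef, 'child', href => 'http://www.example.com/',   name => 'There are no related items');
handle_start(undef, 'child', href => 'http://www.example.org/',   name => 'Skipped because the flag is now false');

print scalar(@RECORDS), " record(s) kept\n";</pre></blockquote>
<p>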
While the What's Related server is just one XML server, it is conceivable that as XML grows, there will be more database engines on the Internet that deliver even more types of data. Since XML is a standard language for delivering data markup on the Web, extensions to this CGI script can be used to access those new <a name="INDEX-2897" />data repositories.</p>
<p>More information about <a name="INDEX-2898" /> <a name="INDEX-2,899" /> <a name="INDEX-2,900" /> <a name="INDEX-2,901" />XML, DTD, RDF, and even the Perl XML::Parser library can be found at <a href="http://www.xml.com/">http://www.xml.com/</a>. Of course, XML::Parser can also be found on <a name="INDEX-2902" /> <a name="INDEX-2,903" /> <a name="INDEX-2,904" />CPAN.</p>
<hr align="left" width="515" /><div class="navbar"><table border="0" width="515"><tr><td width="172" valign="top" align="left"><a href="ch14_04.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0" /></a></td><td width="171" valign="top" align="center"><a href="index.htm"><img src="../gifs/txthome.gif" alt="Home" border="0" /></a></td><td width="172" valign="top" align="right"><a href="ch15_01.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0" /></a></td></tr><tr><td width="172" valign="top" align="left">14.4. Writing an XML Parser</td><td width="171" valign="top" align="center"><a href="index/index.htm"><img src="../gifs/index.gif" alt="Book Index" border="0" /></a></td><td width="172" valign="top" align="right">15. Debugging CGI Applications</td></tr></table></div><hr align="left" width="515" /><img src="../gifs/navbar.gif" alt="Library Navigation Links" usemap="#library-map" border="0" /><p><font size="-1"><a href="copyrght.htm">Copyright © 2001</a> O'Reilly &amp; Associates.
All rights reserved.</font></p><map name="library-map"><area href="../index.htm" coords="1,1,83,102" shape="rect" /><area href="../lnut/index.htm" coords="81,0,152,95" shape="rect" /><area href="../run/index.htm" coords="172,2,252,105" shape="rect" /><area href="../apache/index.htm" coords="238,2,334,95" shape="rect" /><area href="../sql/index.htm" coords="336,0,412,104" shape="rect" /><area href="../dbi/index.htm" coords="415,0,507,101" shape="rect" /><area href="../cgi/index.htm" coords="511,0,601,99" shape="rect" /></map></body></html>