needed here is the LWP::UserAgent module. This powerful module acts as our communications
vehicle. The code to retrieve data from the URL looks as simple as Listing 9.2. <BR>
<BR>
<A HREF="11wpp01.jpg" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/11wpp01.jpg"><TT><B>Figure 9.1.</B></TT></A><TT> </TT>The Security
APL Quote Server, seen in Netscape.
<H3 ALIGN="CENTER"><A NAME="Heading5"></A><FONT COLOR="#000077">Listing 9.2. Automatic
stock quote retriever (getquote.pl).</FONT></H3>
<PRE><FONT COLOR="#0066FF">#!/public/bin/perl5
require LWP::UserAgent;
require HTTP::Request;
$anHour=60*60;
$symbols=join(`+', @ARGV);
$url="http://qs.secapl.com/cgi-bin/qso?tick=$symbols";
$ua = new LWP::UserAgent;
$request = new HTTP::Request `GET', $url;
while (1) {
$response = $ua->request($request);
if ($response->is_success) {
&handleResponse($response);
} else {
&handleError($response);
}
# We want to receive quotes every hour.
sleep $anHour;
}
</FONT></PRE>
<P>As you can see, the symbols are passed in as arguments to this Perl script. They
are then joined into a single string, with each symbol separated by a plus sign. This
string is appended to the URL, creating the full URL request for the specified stock
symbols. That leaves only the handleResponse() and handleError() subroutines to
implement. handleError() is rather easy: HTTP::Response provides a method called
error_as_HTML(), which returns a string containing the error nicely packaged as an
HTML document. You can either print the HTML as it is or ignore the error and continue.
In this example, we will just fail silently and continue.</P>
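<P>For completeness, here is one way handleError() might look. This is just a minimal
sketch, not part of the original listing; the commented-out line shows how
error_as_HTML() could be used if you would rather report the failure than ignore it.</P>
<PRE><FONT COLOR="#0066FF">sub handleError {
    my($response) = @_;
    # Fail silently, as described above. To report the problem instead,
    # you could print the HTML-formatted error:
    # print $response->error_as_HTML();
}
</FONT></PRE>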
<P>Handling the response should be fairly straightforward given the sample HTML you've
already seen in Listing 9.1. You simply need to write a loop that looks for the quote
indicator strings, which are </center><pre> and </pre><center>.
These strings indicate where you are in the document. You can then use regular expressions
to parse out the symbol, current trading price, and other important values. The handleResponse()
subroutine is implemented in Listing 9.3. The output appears in Figure 9.2.
<H3 ALIGN="CENTER"><A NAME="Heading6"></A><FONT COLOR="#000077">Listing 9.3. Subroutine
to extract the quote information.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub handleResponse {
my($response)=@_;
my(@lines)=split(/\n/,$response->content());
$insideQuote=0;
foreach (@lines) {
if ($insideQuote) {
if (/<\/pre><center>/) {
print "$symbol on $exchange is trading at $value on $dateTime\n";
$insideQuote=0;
} elsif (/^Symbol\s*:\s*(\S*)\s*Exchange\s*:\s*(.*)\s*$/) {
$symbol=$1;
$exchange=$2;
} elsif (/^Last Traded at\s*:\s*(\S*)\s*Date\/Time\s*:\s*(.*)$/) {
$value=$1;
$dateTime=$2;
}
}
if (/<\/center><pre>/) {
$insideQuote=1;
}
}
}
</FONT></PRE>
<P>Of course, you can add more code to parse out other returned values, such as the
52-week low and high values. This would involve just adding another elsif block and
a regular expression to match the particular pattern.
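<P>For example, a pattern along the following lines could be dropped into the foreach
loop of Listing 9.3. Note that the label text and the $yearLow/$yearHigh variable names
here are illustrative assumptions; check the actual quote page for the exact wording it uses.</P>
<PRE><FONT COLOR="#0066FF"># Hypothetical pattern for the 52-week range; adjust it to match
# the exact label text returned by the quote server.
} elsif (/^52-Week Low\s*:\s*(\S*)\s*High\s*:\s*(\S*)/) {
    $yearLow  = $1;
    $yearHigh = $2;
}
</FONT></PRE>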
<H4 ALIGN="CENTER"><A NAME="Heading7"></A><FONT COLOR="#000077">Adapting the Code
for General Purpose Use</FONT></H4>
<P>The UserAgent module can prove useful in other examples as well. The code in Listing
9.2 that retrieved the stock quote can be turned into a general-purpose URL retriever.
The next example does exactly that, adding the ability to send the request through
a firewall. The code in Listing 9.4 should look quite familiar. <BR>
<BR>
<A HREF="11wpp02.jpg" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/11wpp02.jpg"><TT><B>Figure 9.2.</B></TT></A><TT> </TT>Output from
the getquote.pl program.
<H3 ALIGN="CENTER"><A NAME="Heading8"></A><FONT COLOR="#000077">Listing 9.4. General
purpose URL retriever going through a firewall.</FONT></H3>
<PRE><FONT COLOR="#0066FF">#!/public/bin/perl5
require LWP::UserAgent;
require HTTP::Request;
$ua = new LWP::UserAgent;
$ua->proxy(`http',$ENV{`HTTP_PROXY'});
foreach $url (@ARGV) {
$request = new HTTP::Request `GET', $url;
$response = $ua->request($request);
if ($response->is_success) {
&handleResponse($response);
} else {
&handleError($response);
}
}
</FONT></PRE>
<P>Listing 9.4 simply replaces the infinite loop with a foreach loop whose iterator
is a list of URLs to retrieve. You may also have noticed the line</P>
<PRE><FONT COLOR="#0066FF">$ua->proxy('http', $ENV{'HTTP_PROXY'});
</FONT></PRE>
<P>This is how you can send a request through a firewall or proxy server. The mechanism
used here is to define an environment variable called HTTP_PROXY. However, you could
use a different approach, such as a hard-coded constant value, or the proxy server
could be passed into the script as an argument. The functions handleResponse() and
handleError() are left unimplemented. These are the functions that turn this
general-purpose URL retriever into something more useful, such as our stock quote
retriever or a Web spider, as you'll see next. They can be specific to whatever
suits your requirements.</P>
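<P>Returning briefly to the proxy line: if you would rather not rely on an environment
variable, the proxy could be taken from the command line instead. The -p handling below
is purely illustrative and is not part of Listing 9.4.</P>
<PRE><FONT COLOR="#0066FF"># Sketch: accept "-p http://proxy.host:8080" before the list of URLs.
if (@ARGV > 1 && $ARGV[0] eq '-p') {
    shift(@ARGV);                      # discard the -p flag
    $proxyURL = shift(@ARGV);          # next argument is the proxy URL
    $ua->proxy('http', $proxyURL);
} elsif (defined $ENV{'HTTP_PROXY'}) {
    $ua->proxy('http', $ENV{'HTTP_PROXY'});
}
</FONT></PRE>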
<P>You'll see how this general URL retriever can be applied to useful functionality
in the following examples. We will also explore some of the other powerful features
that the LWP::UserAgent module provides.
<H3 ALIGN="CENTER"><A NAME="Heading9"></A><FONT COLOR="#000077">Generating Web Indexes</FONT></H3>
<P>The ability to generate a thorough Web index is a very hot commodity these days.
Companies are now building an entire business case around the ability to provide
Web users with the best search engine. These search engines are made possible using
programs such as the one you'll see in the following example. Crawling through the
Web to find all of the useful (as well as useless) pages is only one aspect of a
good search engine. You then need to be able to categorize and index all of what
you find into an efficient searchable pile of data. Our example will simply focus
on the former.</P>
<P>As you can imagine, this kind of program could go on forever, so consider limiting
the level of search to some reasonable depth. It is also important to abide by certain
robot rules that include an exclusion protocol, identifying yourself with the User-agent
field and notifying the sites that you plan to target. This will make your Web spider
friendly to the rest of the Web community and will prevent you from being blacklisted.
An automated robot can generate an extremely large number of hits on a given
site, so please be considerate of the sites you wish to index.
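<P>With LWP::UserAgent, identifying your robot is straightforward: the agent() method
sets the User-Agent header and the from() method supplies a contact address for the From
header. The robot name and address below are, of course, placeholders. (The LWP
distribution also includes LWP::RobotUA, a UserAgent subclass that observes the robot
exclusion protocol for you.)</P>
<PRE><FONT COLOR="#0066FF">$ua = new LWP::UserAgent;
$ua->agent('FriendlySpider/0.1');        # identify the robot in the User-Agent field
$ua->from('webmaster@your.site.com');    # contact address sent in the From header
</FONT></PRE>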
<H4 ALIGN="CENTER"><A NAME="Heading10"></A><FONT COLOR="#000077">Web RobotsSpiders</FONT></H4>
<P>A Web robot is a program that silently visits Web sites, explores the links in
that site, writes the URLs of the linked sites to disk, and continues in a recursive
fashion until enough sites have been visited. Using the general purpose URL retriever
in Listing 9.4 and a few regular expressions, you can easily construct such a program.</P>
<P>There are several classes available in the LWP modules that provide an easy way
to parse HTML files to obtain the elements of interest. The HTML::Parse class allows
you to parse an entire HTML file into a tree of HTML::Element objects. These classes
can be used by our Web robot to easily obtain the title and all of the hyperlinks
of an HTML document. You will first call the parse_html function from HTML::Parse
to obtain a syntax tree of HTML::Element nodes. You can then use the extract_links
method to enumerate all of the links, or the traverse method to walk through
all of the tags. My example uses the traverse method so that we can locate the <TITLE>
tag if it exists. The only other tag we are interested in is the anchor element,
the <A> tag.</P>
<P>You can also make use of the URI::URL module to determine what components of the
URL are specified. This is useful for determining if the URL is relative or absolute.</P>
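<P>As a rough sketch of how URI::URL helps here (the full getAbsoluteURL() appears in
Listing 9.6), a link can be interpreted relative to the page it was found on and then
checked for the HTTP scheme. The $href and $base variables below are placeholders for
the link text and the URL of the page it came from.</P>
<PRE><FONT COLOR="#0066FF">use URI::URL;
$link = url($href, $base);               # parse $href relative to $base
$absolute = $link->abs;                  # resolve it to an absolute URL
if ($absolute->scheme eq 'http') {
    print $absolute->as_string, "\n";    # fully qualified http URL
}
</FONT></PRE>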
<P>Let's take a look at the crawlIt() function, which retrieves the URL, parses it,
and traverses through the elements looking for links and the title. Listing 9.5 should
look familiar--it's yet another way to reuse the code you've seen twice already.
<H3 ALIGN="CENTER"><A NAME="Heading11"></A><FONT COLOR="#000077">Listing 9.5. The
crawlIt() main function of the Web spider.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub crawlIt {
my($ua,$urlStr,$urlLog,$visitedAlready,$depth)=@_;
if ($depth++>$MAX_DEPTH) {
return;
}
$request = new HTTP::Request `GET', $urlStr;
$response = $ua->request($request);
if ($response->is_success) {
my($urlData)=$response->content();
my($html) = parse_html($urlData);
$title="";
$html->traverse(\&searchForTitle,1);
&writeToLog($urlLog,$urlStr,$title);
foreach (@{$html->extract_links(qw(a))}) {
($link,$linkelement)=@$_;
my($url)=&getAbsoluteURL($link,$urlStr);
if ($url ne "") {
$escapedURL=$url;
$escapedURL=~s/\//\\\//g;
$escapedURL=~s/\?/\\\?/g;
$escapedURL=~s/\+/\\\+/g;
if (eval "grep(/$escapedURL/,\@\$visitedAlready)" == 0) {
push(@$visitedAlready,$url);
&crawlIt($ua,$url,$urlLog,$visitedAlready,$depth);
}
}
}
}
}
sub searchForTitle {
my($node,$startflag,$depth)=@_;
$lwr_tag=$node->tag;
$lwr_tag=~tr/A-Z/a-z/;
if ($lwr_tag eq `title') {
foreach (@{$node->content()}) {
$title .= $_;
}
return 0;
}
return 1;
}
</FONT></PRE>
<H3 ALIGN="CENTER">
<HR WIDTH="85%">
<FONT COLOR="#0066FF"><BR>
</FONT><FONT COLOR="#000077">NOTE:</FONT></H3>
<BLOCKQUOTE>
<P>In this function, all of the my() qualifiers matter. Because the function is called
recursively, make sure you don't accidentally reuse any variables from the previous
call. Another thing to note is that errors are silently ignored; you could easily add
an error handler that reports any stale links found, as sketched below.<BR>
<HR>
</BLOCKQUOTE>
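<P>Such an error handler might be no more than an else branch on the is_success test
in crawlIt(). The following fragment is a sketch of that idea rather than part of
Listing 9.5.</P>
<PRE><FONT COLOR="#0066FF">    } else {
        # Report the stale or unreachable link along with the HTTP status.
        warn "Stale link: $urlStr (", $response->code, " ", $response->message, ")\n";
    }
</FONT></PRE>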
<P>The other important function you need to write is getAbsoluteURL(). This function
takes the parent URL string and the current URL string as arguments. It makes use
of the URI::URL module to determine whether or not the current URL is already an
absolute URL. If so, it returns the current URL as is; otherwise, it constructs a
new URL based on the parent URL. You also need to check that the protocol of the
URL is HTTP. Listing 9.6 shows how to convert a relative URL to an absolute URL.