needed here is the LWP::UserAgent module. This powerful module acts as our communications
vehicle. The code to retrieve data from the URL looks as simple as Listing 9.2. <BR>
<BR>
<A HREF="11wpp01.jpg" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/11wpp01.jpg"><TT><B>Figure 9.1.</B></TT></A><TT> </TT>The Security
APL Quote Server, seen in Netscape.
<H3 ALIGN="CENTER"><A NAME="Heading5"></A><FONT COLOR="#000077">Listing 9.2. Automatic
stock quote retriever (getquote.pl).</FONT></H3>
<PRE><FONT COLOR="#0066FF">#!/public/bin/perl5
require LWP::UserAgent;
require HTTP::Request;
$anHour=60*60;
$symbols=join(`+', @ARGV);
$url="http://qs.secapl.com/cgi-bin/qso?tick=$symbols";
$ua = new LWP::UserAgent;
$request = new HTTP::Request `GET', $url;
while (1) {
$response = $ua->request($request);
if ($response->is_success) {
&handleResponse($response);
} else {
&handleError($response);
}
# We want to receive quotes every hour.
sleep $anHour;
}
</FONT></PRE>
<P>As you can see, the symbols are passed in as arguments to this Perl script. They
are then joined into a single string, with each symbol separated by a plus sign. This
string is appended to the URL, creating the full URL request for the specified stock
symbols. That leaves only the handleResponse() and handleError() subroutines to
implement. handleError() is rather easy: HTTP::Response provides a method called
error_as_HTML(), which returns a string containing the error nicely packaged as an
HTML document. You can either print the HTML as it is or ignore the error and continue.
In this example, we will just fail silently and continue.</P>
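<P>For completeness, here is one way handleError() might look. This is just a minimal
sketch, not part of the original listing; the commented-out line shows how
error_as_HTML() could be used if you would rather report the failure than ignore it.</P>
<PRE><FONT COLOR="#0066FF">sub handleError {
    my($response) = @_;
    # Fail silently, as described above. To report the problem instead,
    # you could print the HTML-formatted error:
    # print $response->error_as_HTML();
}
</FONT></PRE>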
<P>Handling the response should be fairly straightforward given the sample HTML you've
already seen in Listing 9.1. You simply need to write a loop that looks for the quote
indicator strings, which are </center><pre> and </pre><center>.
These strings indicate where you are in the document. You can then use regular expressions
to parse out the symbol, current trading price, and other important values. The handleResponse()
subroutine is implemented in Listing 9.3. The output appears in Figure 9.2.
<H3 ALIGN="CENTER"><A NAME="Heading6"></A><FONT COLOR="#000077">Listing 9.3. Subroutine
to extract the quote information.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub handleResponse {
my($response)=@_;
my(@lines)=split(/\n/,$response->content());
$insideQuote=0;
foreach (@lines) {
if ($insideQuote) {
if (/<\/pre><center>/) {
print "$symbol on $exchange is trading at $value on $dateTime\n";
$insideQuote=0;
} elsif (/^Symbol\s*:\s*(\S*)\s*Exchange\s*:\s*(.*)\s*$/) {
$symbol=$1;
$exchange=$2;
} elsif (/^Last Traded at\s*:\s*(\S*)\s*Date\/Time\s*:\s*(.*)$/) {
$value=$1;
$dateTime=$2;
}
}
if (/<\/center><pre>/) {
$insideQuote=1;
}
}
}
</FONT></PRE>
<P>Of course, you can add more code to parse out other returned values, such as the
52-week low and high values. This would involve just adding another elsif block and
a regular expression to match the particular pattern.
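<P>For example, a pattern along the following lines could be dropped into the foreach
loop of Listing 9.3. Note that the label text and the $yearLow/$yearHigh variable names
here are illustrative assumptions; check the actual quote page for the exact wording it uses.</P>
<PRE><FONT COLOR="#0066FF"># Hypothetical pattern for the 52-week range; adjust it to match
# the exact label text returned by the quote server.
} elsif (/^52-Week Low\s*:\s*(\S*)\s*High\s*:\s*(\S*)/) {
    $yearLow  = $1;
    $yearHigh = $2;
}
</FONT></PRE>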
<H4 ALIGN="CENTER"><A NAME="Heading7"></A><FONT COLOR="#000077">Adapting the Code
for General Purpose Use</FONT></H4>
<P>The UserAgent module can prove useful in other examples as well. The code in Listing
9.2 that retrieved the stock quote can be turned into a general-purpose URL retriever.
The next example does exactly that, adding the ability to send the request through
a firewall. The code in Listing 9.4 should look quite familiar. <BR>
<BR>
<A HREF="11wpp02.jpg" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/11wpp02.jpg"><TT><B>Figure 9.2.</B></TT></A><TT> </TT>Output from
the getquote.pl program.
<H3 ALIGN="CENTER"><A NAME="Heading8"></A><FONT COLOR="#000077">Listing 9.4. General
purpose URL retriever going through a firewall.</FONT></H3>
<PRE><FONT COLOR="#0066FF">#!/public/bin/perl5
require LWP::UserAgent;
require HTTP::Request;
$ua = new LWP::UserAgent;
$ua->proxy(`http',$ENV{`HTTP_PROXY'});
foreach $url (@ARGV) {
$request = new HTTP::Request `GET', $url;
$response = $ua->request($request);
if ($response->is_success) {
&handleResponse($response);
} else {
&handleError($response);
}
}
</FONT></PRE>
<P>Listing 9.4 simply replaces the infinite loop with a foreach loop whose iterator
is a list of URLs to retrieve. You may also have noticed the line</P>
<PRE><FONT COLOR="#0066FF">$ua->proxy('http', $ENV{'HTTP_PROXY'});
</FONT></PRE>
<P>This is how you can send a request through a firewall or proxy server. The mechanism
used here is to define an environment variable called HTTP_PROXY. However, you could
use a different approach, such as a hard-coded constant value, or the proxy server
could be passed into the script as an argument. The functions handleResponse() and
handleError() are left unimplemented. These are the functions that turn this
general-purpose URL retriever into something more useful, such as our stock quote
retriever or a Web spider, as you'll see next. They can be specific to whatever
suits your requirements.</P>
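<P>Returning briefly to the proxy line: if you would rather not rely on an environment
variable, the proxy could be taken from the command line instead. The -p handling below
is purely illustrative and is not part of Listing 9.4.</P>
<PRE><FONT COLOR="#0066FF"># Sketch: accept "-p http://proxy.host:8080" before the list of URLs.
if (@ARGV > 1 && $ARGV[0] eq '-p') {
    shift(@ARGV);                      # discard the -p flag
    $proxyURL = shift(@ARGV);          # next argument is the proxy URL
    $ua->proxy('http', $proxyURL);
} elsif (defined $ENV{'HTTP_PROXY'}) {
    $ua->proxy('http', $ENV{'HTTP_PROXY'});
}
</FONT></PRE>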
<P>You'll see how this general URL retriever can be applied to useful functionality
in the following examples. We will also explore some of the other powerful features
that the LWP::UserAgent module provides.
<H3 ALIGN="CENTER"><A NAME="Heading9"></A><FONT COLOR="#000077">Generating Web Indexes</FONT></H3>
<P>The ability to generate a thorough Web index is a very hot commodity these days.
Companies are now building an entire business case around the ability to provide
Web users with the best search engine. These search engines are made possible using
programs such as the one you'll see in the following example. Crawling through the
Web to find all of the useful (as well as useless) pages is only one aspect of a
good search engine. You then need to be able to categorize and index all of what
you find into an efficient searchable pile of data. Our example will simply focus
on the former.</P>
<P>As you can imagine, this kind of program could go on forever, so consider limiting
the level of search to some reasonable depth. It is also important to abide by certain
robot rules that include an exclusion protocol, identifying yourself with the User-agent
field and notifying the sites that you plan to target. This will make your Web spider
friendly to the rest of the Web community and will prevent you from being blacklisted.
An automated robot can generate an extremely large number of hits on a given
site, so please be considerate of the sites you wish to index.
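<P>With LWP::UserAgent, identifying your robot is straightforward: the agent() method
sets the User-Agent header and the from() method supplies a contact address for the From
header. The robot name and address below are, of course, placeholders. (The LWP
distribution also includes LWP::RobotUA, a UserAgent subclass that observes the robot
exclusion protocol for you.)</P>
<PRE><FONT COLOR="#0066FF">$ua = new LWP::UserAgent;
$ua->agent('FriendlySpider/0.1');        # identify the robot in the User-Agent field
$ua->from('webmaster@your.site.com');    # contact address sent in the From header
</FONT></PRE>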
<H4 ALIGN="CENTER"><A NAME="Heading10"></A><FONT COLOR="#000077">Web RobotsSpiders</FONT></H4>
<P>A Web robot is a program that silently visits Web sites, explores the links in
that site, writes the URLs of the linked sites to disk, and continues in a recursive
fashion until enough sites have been visited. Using the general purpose URL retriever
in Listing 9.4 and a few regular expressions, you can easily construct such a program.</P>
<P>There are several classes available in the LWP modules that provide an easy way
to parse HTML files to obtain the elements of interest. The HTML::Parse class allows
you to parse an entire HTML file into a tree of HTML::Element objects. These classes
can be used by our Web robot to easily obtain the title and all of the hyperlinks
of an HTML document. You will first call the parse_html function from HTML::Parse
to obtain a syntax tree of HTML::Element nodes. You can then use the extract_links
method to enumerate all of the links, or the traverse method to walk through
all of the tags. My example uses the traverse method so that we can locate the <TITLE>
tag if it exists. The only other tag we are interested in is the anchor element,
the <A> tag.</P>
<P>You can also make use of the URI::URL module to determine what components of the
URL are specified. This is useful for determining if the URL is relative or absolute.</P>
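<P>As a rough sketch of how URI::URL helps here (the full getAbsoluteURL() appears in
Listing 9.6), a link can be interpreted relative to the page it was found on and then
checked for the HTTP scheme. The $href and $base variables below are placeholders for
the link text and the URL of the page it came from.</P>
<PRE><FONT COLOR="#0066FF">use URI::URL;
$link = url($href, $base);               # parse $href relative to $base
$absolute = $link->abs;                  # resolve it to an absolute URL
if ($absolute->scheme eq 'http') {
    print $absolute->as_string, "\n";    # fully qualified http URL
}
</FONT></PRE>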
<P>Let's take a look at the crawlIt() function, which retrieves the URL, parses it,
and traverses through the elements looking for links and the title. Listing 9.5 should
look familiar--it's yet another way to reuse the code you've seen twice already.
<H3 ALIGN="CENTER"><A NAME="Heading11"></A><FONT COLOR="#000077">Listing 9.5. The
crawlIt() main function of the Web spider.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub crawlIt {
my($ua,$urlStr,$urlLog,$visitedAlready,$depth)=@_;
if ($depth++>$MAX_DEPTH) {
return;
}
$request = new HTTP::Request `GET', $urlStr;
$response = $ua->request($request);
if ($response->is_success) {
my($urlData)=$response->content();
my($html) = parse_html($urlData);
$title="";
$html->traverse(\&searchForTitle,1);
&writeToLog($urlLog,$urlStr,$title);
foreach (@{$html->extract_links(qw(a))}) {
($link,$linkelement)=@$_;
my($url)=&getAbsoluteURL($link,$urlStr);
if ($url ne "") {
$escapedURL=$url;
$escapedURL=~s/\//\\\//g;
$escapedURL=~s/\?/\\\?/g;
$escapedURL=~s/\+/\\\+/g;
if (eval "grep(/$escapedURL/,\@\$visitedAlready)" == 0) {
push(@$visitedAlready,$url);
&crawlIt($ua,$url,$urlLog,$visitedAlready,$depth);
}
}
}
}
}
sub searchForTitle {
my($node,$startflag,$depth)=@_;
$lwr_tag=$node->tag;
$lwr_tag=~tr/A-Z/a-z/;
if ($lwr_tag eq `title') {
foreach (@{$node->content()}) {
$title .= $_;
}
return 0;
}
return 1;
}
</FONT></PRE>
<H3 ALIGN="CENTER">
<HR WIDTH="85%">
<FONT COLOR="#0066FF"><BR>
</FONT><FONT COLOR="#000077">NOTE:</FONT></H3>
<BLOCKQUOTE>
<P>In this function, all of the my() qualifiers matter. Because the function is called
recursively, make sure you don't accidentally reuse any variables from the previous
call. Another thing to note is that errors are silently ignored; you could easily add
an error handler that reports any stale links found, as sketched below.<BR>
<HR>
</BLOCKQUOTE>
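<P>Such an error handler might be no more than an else branch on the is_success test
in crawlIt(). The following fragment is a sketch of that idea rather than part of
Listing 9.5.</P>
<PRE><FONT COLOR="#0066FF">    } else {
        # Report the stale or unreachable link along with the HTTP status.
        warn "Stale link: $urlStr (", $response->code, " ", $response->message, ")\n";
    }
</FONT></PRE>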
<P>The other important function you need to write is getAbsoluteURL(). This function
takes the parent URL string and the current URL string as arguments. It makes use
of the URI::URL module to determine whether or not the current URL is already an
absolute URL. If so, it returns the current URL as is; otherwise, it constructs a
new URL based on the parent URL. You also need to check that the protocol of the
URL is HTTP. Listing 9.6 shows how to convert a relative URL to an absolute URL.