<H3 ALIGN="CENTER"><A NAME="Heading13"></A><FONT COLOR="#000077">Listing 9.6. Converting
a relative URL to an absolute URL.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub getAbsoluteURL {
my($parent,$current)=@_;
my($absURL)="";
$pURL = new URI::URL $parent;
$cURL = new URI::URL $current;
if ($cURL->scheme() eq 'http') {
if ($cURL->host() eq "") {
$absURL=$cURL->abs($pURL);
} else {
$absURL=$current;
}
}
return $absURL;
}
</FONT></PRE>
<P>The only remaining function besides the main program is writeToLog(). This is
a very straightforward function. All you need to do is open the log file and write
out the title and the URL. For simplicity, write each on its own line, which avoids
having to parse anything during lookup. All titles will be on odd-numbered lines,
and all URLs will be on the even-numbered lines immediately following their titles. If
a document has no title, a blank line will appear where the title would have been.
Listing 9.7 shows the writeToLog() function.
<H3 ALIGN="CENTER"><A NAME="Heading14"></A><FONT COLOR="#000077">Listing 9.7. Writing
the title and URL to the log file.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub writeToLog {
my($logFile,$url,$title)=@_;
if (open(OUT,">> $logFile")) {
print OUT "$title\n";
print OUT "$url\n";
close(OUT);
} else {
warn("Could not open $logFile for append! $!\n");
}
}
</FONT></PRE>
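<P>Because titles and URLs simply alternate line by line, reading the index back later
requires no parsing at all. The following sketch shows one way the log could be loaded
into a hash keyed by URL; the readLog() name and the %index hash are illustrative and
are not part of the chapter's listings.</P>
<PRE><FONT COLOR="#0066FF">sub readLog {
my($logFile)=@_;
my(%index)=();
if (open(IN,"&lt; $logFile")) {
while (defined($title=&lt;IN&gt;)) {
$url=&lt;IN&gt;;               # the URL always follows on the next line
last unless defined($url);
chomp($title);
chomp($url);
$index{$url}=$title;       # an empty title means the page had none
}
close(IN);
} else {
warn("Could not open $logFile for reading! $!\n");
}
return %index;
}
</FONT></PRE>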
<P>Now you can put this all together in the main program. The program will accept
multiple URLs as starting points. You'll also specify a maximum depth of 20 recursive
calls. Listing 9.8 shows the code for specifying these criteria.
<H3 ALIGN="CENTER"><A NAME="Heading15"></A><FONT COLOR="#000077">Listing 9.8. Specifying
the starting points and stopping points.</FONT></H3>
<PRE><FONT COLOR="#0066FF">#!/public/bin/perl5
use URI::URL;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parse;
use HTML::Element;
my($ua) = new LWP::UserAgent;
if (defined($ENV{'HTTP_PROXY'})) {
$ua->proxy('http',$ENV{'HTTP_PROXY'});
}
$MAX_DEPTH=20;
$CRLF="\n";
$URL_LOG="/usr/httpd/index/crawl.index";
my(@visitedAlready)=();
foreach $url (@ARGV) {
&crawlIt($ua,$url,$URL_LOG,\@visitedAlready,0);
}
</FONT></PRE>
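<P>Assuming the program has been saved in a file called crawlit.pl, it could be started
with one or more seed URLs on the command line. The script name and the URLs below are
only placeholders:</P>
<PRE><FONT COLOR="#0066FF">perl crawlit.pl http://www.sample.com/ http://www.sample.com/docs/index.html
</FONT></PRE>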
<H3 ALIGN="CENTER">
<HR WIDTH="84%">
<FONT COLOR="#0066FF"><BR>
</FONT><FONT COLOR="#000077">NOTE:</FONT></H3>
<BLOCKQUOTE>
<P>There is another module available called RobotRules that makes it easier for
you to abide by the Standard for Robot Exclusion. This module parses a file called
robots.txt on the remote server to find out whether robots are allowed at the site.
For more information on the Standard for Robot Exclusion, refer to</P>
<PRE><A HREF="javascript:if(confirm('http://www.webcrawler.com/mak/projects/robots/norobots.html \n\nThis file was not retrieved by Teleport Pro, because it is addressed on a domain or path outside the boundaries set for its Starting Address. \n\nDo you want to open it from the server?'))window.location='http://www.webcrawler.com/mak/projects/robots/norobots.html'" tppabs="http://www.webcrawler.com/mak/projects/robots/norobots.html"><FONT
COLOR="#0066FF">http://www.webcrawler.com/mak/projects/robots/norobots.html</FONT></A><FONT
COLOR="#0066FF"></FONT></PRE>
</BLOCKQUOTE>
<H3 ALIGN="CENTER">
<HR WIDTH="83%">
<FONT COLOR="#0066FF"><BR>
<BR>
<A NAME="Heading17"></A></FONT><FONT COLOR="#000077">Mirroring Remote Sites</FONT></H3>
<P>One task a Webmaster might want to automate is the mirroring of a site across
multiple servers. Mirroring is essentially copying all of the files associated with
a Web site and making them available at another Web site. This is done to prevent
major downtime caused by a hardware or software failure on the primary server. It
is also done to provide identical sites in different parts of the world, so that a
person in Beijing doesn't need to reach a physical machine in New York but can
instead use a machine in Hong Kong that mirrors the New York site.</P>
<P>Mirroring can be accomplished by starting at the home page of a server and recursively
traversing through all of its local links to determine the files that need to be
copied. Using this approach and much of the code in the previous examples, you can
fairly easily automate the process of mirroring a Web site.</P>
<P>We will make the assumption that any link reference that is a relative URL rather
than an absolute one should be considered local and thus needs to be mirrored. All
absolute URLs will be considered documents owned by other servers, which we can ignore.
This means that the following types of links will be ignored:</P>
<PRE><FONT COLOR="#0066FF">&lt;A HREF=http://www.netscape.com&gt;
&lt;A HREF=ftp://ftp.netscape.com/Software/ns201b2.exe&gt;
&lt;A HREF=http://www.apple.com/cgi-bin/doit.pl&gt;
</FONT></PRE>
<P>However, these links will be considered local and will be mirrored:</P>
<PRE><FONT COLOR="#0066FF">&lt;A HREF=images/home.gif&gt;
&lt;A HREF=pdfs/layout.pdf&gt;
&lt;A HREF=information.html&gt;
&lt;IMG SRC=images/animage.gif&gt;
</FONT></PRE>
<P>The LWP::UserAgent module contains a method called mirror(), which retrieves a Web
document from a server and stores it in a local file, using the file's modification date
and the content length to determine whether the document actually needs to be transferred
again.</P>
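<P>As a rough illustration of the call itself (the URL and local path are only
placeholders), mirroring a single document looks like this:</P>
<PRE><FONT COLOR="#0066FF">use LWP::UserAgent;
my $ua = new LWP::UserAgent;
# mirror() sends an If-Modified-Since header based on the local file's timestamp
# (when the file already exists), so the file is rewritten only when the copy on
# the server is newer.
my $response = $ua->mirror('http://www.sample.com/index.html', '/usr/httpd/mirror/index.html');
print $response->code(), "\n";   # 304 means the local copy was already current
</FONT></PRE>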
<P>The changes you would need to make to the sample above are fairly minimal. For
example, getAbsoluteURL() would be changed to return an absolute URL only for URLs
local to the server you are mirroring, as shown in Listing 9.9.
<H3 ALIGN="CENTER"><A NAME="Heading18"></A><FONT COLOR="#000077">Listing 9.9. Modified
function to convert relative URLs to absolute URLs.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub getAbsoluteURL {
my($parent,$current)=@_;
my($absURL)="";
$pURL = new URI::URL $parent;
$cURL = new URI::URL $current;
if ($cURL->scheme() eq 'http') {
if ($cURL->host() eq "") {
$absURL=$cURL->abs($pURL);
}
}
return $absURL;
}
</FONT></PRE>
<P>The other change would be in crawlIt(), shown earlier in Listing 9.5. Instead
of writing the URL and title to the log, follow Listing 9.10 to call a subroutine
called mirrorFile(), which uses the LWP::UserAgent mirror() method. You should
also search for other file references, such as images referenced by the &lt;IMG&gt; tag.
<H3 ALIGN="CENTER"><A NAME="Heading19"></A><FONT COLOR="#000077">Listing 9.10. Modified
crawlIt() function for mirroring a site.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub crawlIt {
my($ua,$urlStr,$urlLog,$visitedAlready,$depth)=@_;
return if ($depth > $MAX_DEPTH);   # stop recursing past the configured depth limit
$request = new HTTP::Request 'GET', $urlStr;
$response = $ua->request($request);
if ($response->is_success) {
my($urlData)=$response->content();
my($html) = parse_html($urlData);
$title="";
$html->traverse(\&searchForTitle,1);
&mirrorFile($ua,$urlStr);
foreach (@{$html->extract_links(qw(a img))}) {
($link,$linkelement)=@$_;
if ($linkelement->tag() eq 'a') {
my($url)=&getAbsoluteURL($urlStr,$link);
if ($url ne "") {
$escapedURL=$url;
$escapedURL=~s/\//\\\//g;
$escapedURL=~s/\?/\\\?/g;
$escapedURL=~s/\+/\\\+/g;
if (eval "grep(/$escapedURL/,\@\$visitedAlready)" == 0) {
push(@$visitedAlready,$url);
&crawlIt($ua,$url,$urlLog,$visitedAlready,$depth+1);
}
}
} elsif ($linkelement->tag() eq 'img') {
my($url)=&getAbsoluteURL($urlStr,$link);
if ($url ne "") {
&mirrorFile($ua,$url);
}
}
}
}
}
sub searchForTitle {
my($node,$startflag,$depth)=@_;
$lwr_tag=$node->tag;
$lwr_tag=~tr/A-Z/a-z/;
if ($lwr_tag eq 'title') {
foreach (@{$node->content()}) {
$title .= $_;
}
return 0;
}
return 1;
}
sub mirrorFile {
my($ua,$urlStr)=@_;
my($url)=new URI::URL $urlStr;
my($localpath)=$MIRROR_ROOT;
$localpath .= $url->path();
$ua->mirror($urlStr,$localpath);
}
</FONT></PRE>
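<P>Note that mirrorFile() relies on a global $MIRROR_ROOT variable naming the local
directory that will hold the copy. A short sketch of the driver code, with the directory
path chosen purely as an example, might look like this:</P>
<PRE><FONT COLOR="#0066FF"># Local directory that receives the mirrored files (an example path).
$MIRROR_ROOT = "/usr/httpd/mirror";
my(@visitedAlready)=();
foreach $url (@ARGV) {
&crawlIt($ua,$url,$URL_LOG,\@visitedAlready,0);
}
</FONT></PRE>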
<P>This example of mirroring remote sites might be useful for simple sites containing only
HTML files. If you need a more sophisticated remote mirroring system, it would be best
to use a UNIX-based replication tool such as rdist. If you are running a Windows NT
server, replication tools are available for those systems as well.
<H3 ALIGN="CENTER"><A NAME="Heading20"></A><FONT COLOR="#000077">Summary</FONT></H3>
<P>As you have seen in this chapter, writing user agents that automate operations against
Web servers can be greatly simplified with the LWP::UserAgent module.
It is important to note, however, that the examples you have seen here work only
with HTML documents. As Web content grows richer and comes to include non-text-based
document formats (such as PDF), it will become more important to add more
advanced indexing capabilities by leveraging work that has already been done in
Perl 5.<BR>
<H2 ALIGN="CENTER"><A HREF="ch08.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/ch08.htm"><IMG SRC="blanprev.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blanprev.gif" WIDTH="37" HEIGHT="37"
ALIGN="BOTTOM" BORDER="2"></A><A HREF="index-1.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/index-1.htm"><IMG SRC="blantoc.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blantoc.gif" WIDTH="42"
HEIGHT="37" ALIGN="BOTTOM" BORDER="2"></A><A HREF="ch10.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/ch10.htm"><IMG SRC="blannext.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blannext.gif"
WIDTH="45" HEIGHT="37" ALIGN="BOTTOM" BORDER="2"></A>
</BODY>
</HTML>