<H3 ALIGN="CENTER"><A NAME="Heading13"></A><FONT COLOR="#000077">Listing 9.6. Converting
a relative URL to an absolute URL.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub getAbsoluteURL {
my($parent,$current)=@_;
my($absURL)="";
$pURL = new URI::URL $parent;
$cURL = new URI::URL $current;
if ($cURL->scheme() eq 'http') {
if ($cURL->host() eq "") {
$absURL=$cURL->abs($pURL);
} else {
$absURL=$current;
}
}
return $absURL;
}
</FONT></PRE>
<P>The only remaining function besides the main program is writeToLog(). This is
a very straightforward function. All you need to do is open the log file and write
out the title and the URL. For simplicity, write each on its own line, which avoids
having to parse anything during lookup. All titles will be on odd-numbered lines,
and all URLs will be on the even-numbered lines immediately following their titles. If
a document has no title, a blank line will appear where the title would have been.
Listing 9.7 shows the writeToLog() function.
<H3 ALIGN="CENTER"><A NAME="Heading14"></A><FONT COLOR="#000077">Listing 9.7. Writing
the title and URL to the log file.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub writeToLog {
my($logFile,$url,$title)=@_;
if (open(OUT,">> $logFile")) {
print OUT "$title\n";
print OUT "$url\n";
close(OUT);
} else {
warn("Could not open $logFile for append! $!\n");
}
}
</FONT></PRE>
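<P>Because titles and URLs simply alternate line by line, reading the index back later
requires no parsing at all. The following sketch shows one way the log could be loaded
into a hash keyed by URL; the readLog() name and the %index hash are illustrative and
are not part of the chapter's listings.</P>
<PRE><FONT COLOR="#0066FF">sub readLog {
my($logFile)=@_;
my(%index)=();
if (open(IN,"&lt; $logFile")) {
while (defined($title=&lt;IN&gt;)) {
$url=&lt;IN&gt;;               # the URL always follows on the next line
last unless defined($url);
chomp($title);
chomp($url);
$index{$url}=$title;       # an empty title means the page had none
}
close(IN);
} else {
warn("Could not open $logFile for reading! $!\n");
}
return %index;
}
</FONT></PRE>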
<P>Now you can put this all together in the main program. The program will accept
multiple URLs as starting points. You'll also specify a maximum depth of 20 recursive
calls. Listing 9.8 shows the code for specifying these criteria.
<H3 ALIGN="CENTER"><A NAME="Heading15"></A><FONT COLOR="#000077">Listing 9.8. Specifying
the starting points and stopping points.</FONT></H3>
<PRE><FONT COLOR="#0066FF">#!/public/bin/perl5
use URI::URL;
use LWP::UserAgent;
use HTTP::Request;
use HTML::Parse;
use HTML::Element;
my($ua) = new LWP::UserAgent;
if (defined($ENV{'HTTP_PROXY'})) {
$ua->proxy('http',$ENV{'HTTP_PROXY'});
}
$MAX_DEPTH=20;
$CRLF="\n";
$URL_LOG="/usr/httpd/index/crawl.index";
my(@visitedAlready)=();
foreach $url (@ARGV) {
&crawlIt($ua,$url,$URL_LOG,\@visitedAlready,0);
}
</FONT></PRE>
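<P>Assuming the program has been saved in a file called crawlit.pl, it could be started
with one or more seed URLs on the command line. The script name and the URLs below are
only placeholders:</P>
<PRE><FONT COLOR="#0066FF">perl crawlit.pl http://www.sample.com/ http://www.sample.com/docs/index.html
</FONT></PRE>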
<H3 ALIGN="CENTER">
<HR WIDTH="84%">
<FONT COLOR="#0066FF"><BR>
</FONT><FONT COLOR="#000077">NOTE:</FONT></H3>
<BLOCKQUOTE>
<P>There is another module available called RobotRules that makes it easier for
you to abide by the Standard for Robot Exclusion. This module parses a file called
robots.txt on the remote server to find out whether robots are allowed at the site.
For more information on the Standard for Robot Exclusion, refer to</P>
<PRE><A HREF="javascript:if(confirm('http://www.webcrawler.com/mak/projects/robots/norobots.html \n\nThis file was not retrieved by Teleport Pro, because it is addressed on a domain or path outside the boundaries set for its Starting Address. \n\nDo you want to open it from the server?'))window.location='http://www.webcrawler.com/mak/projects/robots/norobots.html'" tppabs="http://www.webcrawler.com/mak/projects/robots/norobots.html"><FONT
COLOR="#0066FF">http://www.webcrawler.com/mak/projects/robots/norobots.html</FONT></A><FONT
COLOR="#0066FF"></FONT></PRE>
</BLOCKQUOTE>
<H3 ALIGN="CENTER">
<HR WIDTH="83%">
<FONT COLOR="#0066FF"><BR>
<BR>
<A NAME="Heading17"></A></FONT><FONT COLOR="#000077">Mirroring Remote Sites</FONT></H3>
<P>One task a Webmaster might want to automate is the mirroring of a site across
multiple servers. Mirroring is essentially copying all of the files associated with
a Web site and making them available at another Web site. This is done to prevent
major downtime caused by a hardware or software failure on the primary server. It
is also done to provide identical sites in different parts of the world, so that a
person in Beijing doesn't need to reach a physical machine in New York but can
instead use a machine in Hong Kong that mirrors the New York site.</P>
<P>Mirroring can be accomplished by starting at the home page of a server and recursively
traversing through all of its local links to determine the files that need to be
copied. Using this approach and much of the code in the previous examples, you can
fairly easily automate the process of mirroring a Web site.</P>
<P>We will make the assumption that any link reference that is a relative URL rather
than an absolute one should be considered local and thus needs to be mirrored. All
absolute URLs will be considered documents owned by other servers, which we can ignore.
This means that the following types of links will be ignored:</P>
<PRE><FONT COLOR="#0066FF">&lt;A HREF=http://www.netscape.com&gt;
&lt;A HREF=ftp://ftp.netscape.com/Software/ns201b2.exe&gt;
&lt;A HREF=http://www.apple.com/cgi-bin/doit.pl&gt;
</FONT></PRE>
<P>However, these links will be considered local and will be mirrored:</P>
<PRE><FONT COLOR="#0066FF">&lt;A HREF=images/home.gif&gt;
&lt;A HREF=pdfs/layout.pdf&gt;
&lt;A HREF=information.html&gt;
&lt;IMG SRC=images/animage.gif&gt;
</FONT></PRE>
<P>The LWP::UserAgent module contains a method called mirror(), which retrieves a Web
document from a server and stores it in a local file, using the file's modification date
and the content length to determine whether the document actually needs to be transferred
again.</P>
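<P>As a rough illustration of the call itself (the URL and local path are only
placeholders), mirroring a single document looks like this:</P>
<PRE><FONT COLOR="#0066FF">use LWP::UserAgent;
my $ua = new LWP::UserAgent;
# mirror() sends an If-Modified-Since header based on the local file's timestamp
# (when the file already exists), so the file is rewritten only when the copy on
# the server is newer.
my $response = $ua->mirror('http://www.sample.com/index.html', '/usr/httpd/mirror/index.html');
print $response->code(), "\n";   # 304 means the local copy was already current
</FONT></PRE>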
<P>The changes you would need to make to the sample above are fairly minimal. For
example, getAbsoluteURL() would be changed to return an absolute URL only for URLs
local to the server you are mirroring, as shown in Listing 9.9.
<H3 ALIGN="CENTER"><A NAME="Heading18"></A><FONT COLOR="#000077">Listing 9.9. Modified
function to convert relative URLs to absolute URLs.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub getAbsoluteURL {
my($parent,$current)=@_;
my($absURL)="";
$pURL = new URI::URL $parent;
$cURL = new URI::URL $current;
if ($cURL->scheme() eq 'http') {
if ($cURL->host() eq "") {
$absURL=$cURL->abs($pURL);
}
}
return $absURL;
}
</FONT></PRE>
<P>The other change would be in crawlIt(), shown earlier in Listing 9.5. Instead
of writing the URL and title to the log, follow Listing 9.10 to call a subroutine
called mirrorFile(), which uses the LWP::UserAgent mirror() method. You should
also search for other file references, such as images referenced by the &lt;IMG&gt; tag.
<H3 ALIGN="CENTER"><A NAME="Heading19"></A><FONT COLOR="#000077">Listing 9.10. Modified
crawlIt() function for mirroring a site.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub crawlIt {
my($ua,$urlStr,$urlLog,$visitedAlready,$depth)=@_;
return if ($depth > $MAX_DEPTH);   # stop recursing past the configured depth limit
$request = new HTTP::Request 'GET', $urlStr;
$response = $ua->request($request);
if ($response->is_success) {
my($urlData)=$response->content();
my($html) = parse_html($urlData);
$title="";
$html->traverse(\&searchForTitle,1);
&mirrorFile($ua,$urlStr);
foreach (@{$html->extract_links(qw(a img))}) {
($link,$linkelement)=@$_;
if ($linkelement->tag() eq 'a') {
my($url)=&getAbsoluteURL($urlStr,$link);
if ($url ne "") {
$escapedURL=$url;
$escapedURL=~s/\//\\\//g;
$escapedURL=~s/\?/\\\?/g;
$escapedURL=~s/\+/\\\+/g;
if (eval "grep(/$escapedURL/,\@\$visitedAlready)" == 0) {
push(@$visitedAlready,$url);
&crawlIt($ua,$url,$urlLog,$visitedAlready,$depth+1);
}
}
} elsif ($linkelement->tag() eq 'img') {
my($url)=&getAbsoluteURL($urlStr,$link);
if ($url ne "") {
&mirrorFile($ua,$url);
}
}
}
}
}
sub searchForTitle {
my($node,$startflag,$depth)=@_;
$lwr_tag=$node->tag;
$lwr_tag=~tr/A-Z/a-z/;
if ($lwr_tag eq 'title') {
foreach (@{$node->content()}) {
$title .= $_;
}
return 0;
}
return 1;
}
sub mirrorFile {
my($ua,$urlStr)=@_;
my($url)=new URI::URL $urlStr;
my($localpath)=$MIRROR_ROOT;
$localpath .= $url->path();
$ua->mirror($urlStr,$localpath);
}
</FONT></PRE>
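<P>Note that mirrorFile() relies on a global $MIRROR_ROOT variable naming the local
directory that will hold the copy. A short sketch of the driver code, with the directory
path chosen purely as an example, might look like this:</P>
<PRE><FONT COLOR="#0066FF"># Local directory that receives the mirrored files (an example path).
$MIRROR_ROOT = "/usr/httpd/mirror";
my(@visitedAlready)=();
foreach $url (@ARGV) {
&crawlIt($ua,$url,$URL_LOG,\@visitedAlready,0);
}
</FONT></PRE>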
<P>This example of mirroring remote sites might be useful for simple sites containing only
HTML files. If you need a more sophisticated remote mirroring system, it would be best
to use a UNIX-based replication tool such as rdist. If you are running a Windows NT
server, replication tools are available for those systems as well.
<H3 ALIGN="CENTER"><A NAME="Heading20"></A><FONT COLOR="#000077">Summary</FONT></H3>
<P>As you have seen in this chapter, writing user agents that automate operations against
Web servers can be greatly simplified with the LWP::UserAgent module.
It is important to note, however, that the examples you have seen here work only
with HTML documents. As Web content grows richer and comes to include non-text-based
document formats (such as PDF), it will become more important to add more
advanced indexing capabilities by leveraging work that has already been done in
Perl 5.<BR>
<H2 ALIGN="CENTER"><A HREF="ch08.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/ch08.htm"><IMG SRC="blanprev.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blanprev.gif" WIDTH="37" HEIGHT="37"
ALIGN="BOTTOM" BORDER="2"></A><A HREF="index-1.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/index-1.htm"><IMG SRC="blantoc.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blantoc.gif" WIDTH="42"
HEIGHT="37" ALIGN="BOTTOM" BORDER="2"></A><A HREF="ch10.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/ch10.htm"><IMG SRC="blannext.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blannext.gif"
WIDTH="45" HEIGHT="37" ALIGN="BOTTOM" BORDER="2"></A>
</BODY>
</HTML>