</CENTER>



<P>Parsing an HTML document involves several algorithms. First, you must be able to
recognize, and possibly act on, each element of the HTML specification as it appears in
the input stream, on the fly. Usually, you'll want to find the URLs or anchors in a
document, but even this turns out to be non-trivial when you attempt to match a URL with
a single regular expression. Even the newer Perl5 regular expression extensions don't
completely solve the problem, partly because whether a URL is valid depends on whether it
is complete, partial, or relative. Fortunately, there are Perl5 modules devoted
specifically to parsing HTML and URLs.</P>
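<P>To see why a single pattern falls short, consider this minimal sketch. The pattern
<TT>$naive</TT> and the helper <TT>looks_absolute()</TT> are hypothetical names invented
for illustration, not part of any module, and the pattern itself is only one plausible
first attempt.</P>

<PRE><FONT COLOR="#0066FF">use strict;
use warnings;

# A naive pattern (an assumption for illustration):
# scheme, &quot;://&quot;, host, and an optional path
my $naive = qr{^[a-z][a-z0-9+.-]*://[^\s/]+(?:/\S*)?$}i;

# returns 1 if the string matches the naive pattern, 0 otherwise
sub looks_absolute {
    my($candidate) = @_;
    return $candidate =~ $naive ? 1 : 0;
}

foreach my $candidate (&quot;http://www.best.com/about.html&quot;,
                       &quot;mailto:info\@best.com&quot;,
                       &quot;images/faq.gif&quot;,
                       &quot;../index.html&quot;) {
    printf &quot;%-32s %s\n&quot;, $candidate,
        looks_absolute($candidate) ? &quot;matched&quot; : &quot;rejected&quot;;
}
</FONT></PRE>

<P>Only the first string matches. The <TT>mailto:</TT> URL has no <TT>//</TT> after its
scheme, and the last two are relative, so they only make sense once they are resolved
against the URL of the document that contains them.</P>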



<P>As it turns out, the best way to parse a URL and determine its validity (but not
necessarily its retrievability or existence) is with a method chosen dynamically from a
table or set, based on the URL's protocol specifier (its scheme). This sort of runtime
decision-making is exactly how the <TT>URI::URL</TT> module works. Using it saves you a
lot of guesswork, testing, and debugging, and spares you from having to create potentially
mind-boggling regular expressions to match the various types of URLs that exist.</P>
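<P>As a minimal sketch of that behavior, assuming the <TT>URI::URL</TT> module is
installed, the following resolves several kinds of links against a base URL and reports
the scheme of each result. The helper <TT>absolutize()</TT>, the <TT>example.com</TT>
host, and the sample links are all hypothetical; only <TT>url()</TT>, <TT>abs</TT>,
<TT>scheme</TT>, and <TT>as_string</TT> come from the module.</P>

<PRE><FONT COLOR="#0066FF">use strict;
use warnings;
use URI::URL;

# hypothetical helper: resolve a link against a base URL and
# return the absolute form as a string
sub absolutize {
    my($link,$base) = @_;
    return url($link,$base)-&gt;abs-&gt;as_string;
}

my $base = &quot;http://www.example.com/docs/index.html&quot;;   # hypothetical base

foreach my $link (&quot;images/logo.gif&quot;, &quot;../about.html&quot;,
                  &quot;mailto:info\@example.com&quot;, &quot;ftp://ftp.example.com/pub/&quot;) {
    my $abs = url($link,$base)-&gt;abs;
    printf &quot;%-28s %-7s %s\n&quot;, $link, $abs-&gt;scheme, $abs-&gt;as_string;
}
</FONT></PRE>

<P>The relative links pick up the scheme and host of the base, while the
<TT>mailto:</TT> and <TT>ftp:</TT> links already carry their own scheme and pass through
unchanged. No hand-written pattern is involved at any point.</P>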



<P>When parsing HTML to find the embedded URLs, you'll also need the
<TT>HTML::TreeBuilder</TT> module. It takes care of the gory details of parsing the other
elements of an HTML document and builds an internal tree object that represents all the
HTML elements within the file. These modules are part of the Web toolkit that you've been
using throughout this book, called libwww. The complete suite of libwww modules includes
the URI, HTML, HTTP, WWW, and Font classes. Libwww is written and maintained by Gisle Aas
of Norway. The latest version is always available from his CPAN directory:</P>



<PRE><FONT COLOR="#0066FF">~authors/id/GAAS/



</FONT></PRE>



<P>Listing 14.1 demonstrates how to use these modules to extract simple URLs from



an HTML file.



<CENTER>



<H3><A NAME="Heading9"></A><FONT COLOR="#000077">Listing 14.1. simpleparse.</FONT></H3>



</CENTER>



<PRE><FONT COLOR="#0066FF">use URI::URL;
use HTML::TreeBuilder;

my($h,$pair,$base,$url);

$base = &quot;test.html&quot;;
$h = HTML::TreeBuilder-&gt;new;
$h-&gt;parse_file($base);

# extract_links returns a reference to a list of [link, element] pairs
foreach $pair (@{$h-&gt;extract_links(qw&lt;a img&gt;)}) {
    my($link,$elem) = @$pair;
    # resolve the link against the base, then print its absolute form
    $url = url($link,$base);
    print $url-&gt;abs,&quot;\n&quot;;
}
</FONT></PRE>



<P>This short script prints out all the links in the file <TT>test.html</TT> that appear
as attributes of <TT>A</TT> or <TT>IMG</TT> tags. If you want to parse a document returned
directly from a server, use the <TT>parse</TT> method instead of the <TT>parse_file</TT>
method. You'll also need to add the capability to slurp an HTML file directly from the
server, with the declaration</P>



<PRE><FONT COLOR="#0066FF">use LWP::Simple qw(get);



</FONT></PRE>



<P>Now the script looks like that in Listing 14.2.



<CENTER>



<H3><A NAME="Heading10"></A><FONT COLOR="#000077">Listing 14.2. simpleparse-net.</FONT></H3>



</CENTER>



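<PRE><FONT COLOR="#0066FF">use URI::URL;
use HTML::TreeBuilder;
use LWP::Simple qw(get);

my($h,$pair,$base,$url,$html);

$base = &quot;http://www.best.com/&quot;;

# get returns undef if the document can't be retrieved
$html = get($base);
die &quot;Couldn't retrieve $base\n&quot; unless defined($html);

$h  = HTML::TreeBuilder-&gt;new;
$h-&gt;parse($html);
$h-&gt;eof;    # tell the parser that the document is complete

foreach $pair (@{$h-&gt;extract_links(qw&lt;a img&gt;)}) {
    my($link,$elem) = @$pair;
    $url = url($link,$base);
    print $url-&gt;abs,&quot;\n&quot;;
}
</FONT></PRE>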



<P>Running Listing 14.2 with the libwww modules properly installed retrieves the page at
the following URL, which belongs to my current ISP:</P>

<PRE><FONT COLOR="#0066FF">http://www.best.com/index.html
</FONT></PRE>

<P>The script then prints all of the <TT>A</TT> and <TT>IMG</TT> links within that page:</P>



<PRE><FONT COLOR="#0066FF">http://webx.best.com/cgi-bin/imagemap/mainpl.map
http://www.best.com/images/mainpnl3.gif
http://www.best.com/about.html
http://www.best.com/images/persoff.gif
http://www.best.com/corp.html
http://www.best.com/images/corpserv.gif
http://www.best.com/policy.html
http://www.best.com/images/ourpol.gif
http://www.best.com/support.html
http://www.best.com/images/faq.gif
http://www.best.com/prices.html
http://www.best.com/images/pricepol.gif
http://www.best.com/pop.html
http://www.best.com/images/lan.gif
http://www.best.com/corpppp.html
http://www.best.com/images/webpd.gif
http://www.best.com/client.html
http://www.best.com/images/hosted.gif
http://www.best.com/images/announce.gif
http://www.onlive.com/
http://crystal.onlive.com/beta/index.htm
http://www.best.com/best_resort/entrance.sds
mailto:info@best.com
http://www.best.com/images/best4.gif
mailto:www@best.com
</FONT></PRE>



<P>Note that this output may vary, depending on your location and whether there have been
changes to <TT>index.html</TT> since this chapter was written. Now that you've seen how to
use the <TT>LWP</TT> modules to do some very simple parsing, let's take a look at how to
use them for some useful tasks.</P>
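<P>Each pair returned by <TT>extract_links</TT> also carries the element that contained
the link, which the listings so far ignore. As a sketch of how that element might be put
to use, the following groups the absolute URLs by tag name. It parses an in-memory string
so that it runs without a network connection; the helper <TT>links_by_tag()</TT>, the
sample HTML, and the <TT>example.com</TT> base are hypothetical.</P>

<PRE><FONT COLOR="#0066FF">use strict;
use warnings;
use HTML::TreeBuilder;
use URI::URL;

# hypothetical helper: return a reference to a hash that maps each
# tag name to a list of the absolute URLs found in that kind of tag
sub links_by_tag {
    my($html,$base) = @_;
    my $tree = HTML::TreeBuilder-&gt;new;
    $tree-&gt;parse($html);
    $tree-&gt;eof;                      # the document is complete

    my %by_tag;
    foreach my $pair (@{$tree-&gt;extract_links(qw&lt;a img&gt;)}) {
        my($link,$elem) = @$pair;
        push @{$by_tag{$elem-&gt;tag}}, url($link,$base)-&gt;abs-&gt;as_string;
    }
    $tree-&gt;delete;                   # release the tree's memory
    return \%by_tag;
}

my $html = '&lt;a href=&quot;about.html&quot;&gt;About&lt;/a&gt;&lt;img src=&quot;images/logo.gif&quot;&gt;'
         . '&lt;a href=&quot;mailto:info@example.com&quot;&gt;Mail&lt;/a&gt;';
my $links = links_by_tag($html, &quot;http://www.example.com/&quot;);

foreach my $tag (sort keys %$links) {
    print &quot;$tag: $_\n&quot; foreach @{$links-&gt;{$tag}};
}
</FONT></PRE>

<P>Grouping by tag makes it easy to treat images and anchors differently, for example to
verify that every image exists while only reporting the anchors.</P>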



<CENTER>



<H4><A NAME="Heading11"></A><FONT COLOR="#000077">Editing and Verifying HTML</FONT></H4>



</CENTER>



<P>You can use Perl in a number of ways to change, verify, and validate HTML. There are
modules that handle the parsing and substitutions, as well as several complete tools that
check the syntax of the HTML and the validity of its internal anchors to other locations
and documents. The following examples demonstrate how to use these tools to perform tasks
that may confront you as a Webmaster from time to time.</P>

<P><B><TT>Converting from Absolute to Relative URLs</TT></B></P>

<P>Suppose that at some point, while still coming up to speed on the HTML specifications,
a Webmaster creates documents that use the complete form of the URL in all links, giving
the scheme, host, and path. Later, as understanding grows, the Webmaster wishes to go back
and change all the links that correspond to local resources to the relative form. This way,
if another site mirrors the Webmaster's site, requests for local documents from the mirror
copy will be served from the mirror site instead of the master site.</P>



<P>In order to accomplish this task, you'll need to start with the script that parses



URLs generally, shown in Listing 14.2. Then you'll add the capability (see Listing



14.3) to print out the new HTML file with the links changed to relative form when



they refer to local resources.



<CENTER>



<H3><A NAME="Heading12"></A><FONT COLOR="#000077">Listing 14.3. relativize.</FONT></H3>



</CENTER>



<PRE><FONT COLOR="#0066FF">#!/usr/bin/perl

# relativize - parse html documents and change complete urls
# to relative or partial urls, for the sake of brevity, and to assure
# that connections to mirror sites grab the mirror's copy of the files
#
# Usage: relativize hostname file newfile basepath
# hostname is the local host
# file is the html file you wish to parse
# newfile is the new file to create from the original
# basepath is the full path to the file, from the http root
#
# Example:
# relativize www.metronet.com perl5.html newperl5.html /perlinfo/perl5
#
# Note: does not attempt to do parent-relative directory substitutions

use HTML::TreeBuilder;
use URI::URL;
require 5.002;
use strict;

my($h,$usage,$localhost,$filename,$newfile,$base_path);

$usage =&quot;usage: $0 hostname htmlfile newhtmlfile BasePath\n&quot;;
$localhost= shift;
$filename = shift;
$newfile= shift;
$base_path = shift;

die $usage unless defined($localhost) and defined($filename)
     and defined($base_path) and defined($newfile);

$h = HTML::TreeBuilder-&gt;new;
$h-&gt;parse_file($filename);

open(NEW,&quot;&gt;$newfile&quot;) or die(&quot;Can't create $newfile: $!\n&quot;);

$h-&gt;traverse(\&amp;relativize_urls);

close(NEW) or die(&quot;Can't close $newfile: $!\n&quot;);

sub relativize_urls {
    my($e, $start,$depth) = @_;

    # if we've got an element
    if(ref $e){
        my $tag = $e-&gt;tag;
        if($start){

            # if the tag is an &quot;A&quot; tag with an href attribute
            # (named anchors have no href)
            if($tag eq &quot;a&quot; and defined($e-&gt;attr(&quot;href&quot;))){
                my $url = URI::URL-&gt;new( $e-&gt;attr(&quot;href&quot;) );
                my $scheme = $url-&gt;scheme;

                # if the scheme of the url is http (a url that is
                # already relative has no scheme at all)
                if(defined($scheme) and $scheme eq &quot;http&quot;){

                    # if the host is the local host, modify the
                    # href attribute to have the relative url only.
                    if($url-&gt;host eq $localhost){

                        # if the path is relative to the base path
                        # of this file (specified on command line);
                        # \Q...\E keeps regex metacharacters literal
                        my $path = $url-&gt;path;
                        if($path =~ s/^\Q$base_path\E\/?//){
                            # a filetest could be added here for assurance
                            $e-&gt;attr(&quot;href&quot;,$path);
                        }
                    }
                }
            }
            print NEW  $e-&gt;starttag;
        }
        elsif((not ($HTML::Element::emptyElement{$tag} or
                $HTML::Element::optionalEndTag{$tag}))){
            print NEW $e-&gt;endtag,&quot;\n&quot;;
        }

    # else text stuff, just print it out
    } else {
        print NEW $e;
    }

    # a true return value tells traverse to descend into child elements
    return 1;
}
</FONT></PRE>
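<P>The header of Listing 14.3 notes that it does not attempt parent-relative directory
substitutions. One possible way to get them, sketched here as an alternative rather than
as part of the listing, is to let the <TT>rel</TT> method of <TT>URI::URL</TT> compute the
relative form against the URL of the document itself. The helper <TT>relativize()</TT>
and the document paths below are hypothetical.</P>

<PRE><FONT COLOR="#0066FF">use strict;
use warnings;
use URI::URL;

# hypothetical helper: return $link in relative form if it points at
# $localhost over http; otherwise return it unchanged
sub relativize {
    my($link,$base,$localhost) = @_;
    my $url = url($link,$base)-&gt;abs;
    my $scheme = $url-&gt;scheme;
    return $link unless defined($scheme) and $scheme eq &quot;http&quot;
        and $url-&gt;host eq $localhost;
    return $url-&gt;rel($base)-&gt;as_string;
}

# hypothetical URL of the document being rewritten
my $base = &quot;http://www.metronet.com/perlinfo/perl5/index.html&quot;;

foreach my $link (&quot;http://www.metronet.com/perlinfo/perl5/manual/perl.html&quot;,
                  &quot;http://www.metronet.com/perlinfo/faq.html&quot;,
                  &quot;http://www.perl.com/&quot;) {
    print relativize($link,$base,&quot;www.metronet.com&quot;), &quot;\n&quot;;
}
</FONT></PRE>

<P>The first link becomes <TT>manual/perl.html</TT>, the second becomes
<TT>../faq.html</TT> because it lives one directory above the document, and the third is
left alone because it points at a different host.</P>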



<P>In the subroutine <TT>relativize_urls()</TT>, I've borrowed the algorithm from



the <TT>HTML::Element</TT> module's method, called <TT>as_HTML()</TT>, to print everything



from within the HTML file by default. A reference to the <TT>relativize_urls()</TT>