</CENTER>
<P>Parsing an HTML document involves several algorithms. First, you must be able
to recognize, and possibly act on, each of the elements in the HTML specification
as it appears in the input stream, on the fly. Usually, you'll wish to
find the URLs or anchors in a document, but even this turns out to be non-trivial
when you're attempting to match a URL with a single regular expression. Even the
newer Perl5 Regular Expression Extensions don't completely solve the problem, partly
because a URL's validity depends on whether it is complete,
partial, or relative. Fortunately, there is a Perl5 module devoted specifically to
parsing HTML and URLs.</P>
<P>As it turns out, the best way to parse a given URL and determine its validity
(but not necessarily its retrievability or existence) is via a method chosen dynamically
from a table or set based on the URL's protocol specifier. This sort of runtime decision-making
is exactly how the <TT>URI::URL</TT> module works, and using it saves you a lot
of guesswork, testing, and debugging, and spares you from having to create potentially
mind-boggling regular expressions to match the various types of URLs that exist.</P>
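<P>To make the idea of a table keyed on the protocol specifier concrete, here is
a minimal sketch in plain Perl. It is not the code inside <TT>URI::URL</TT>; the
<TT>%check_for</TT> table, the <TT>looks_valid()</TT> routine, and the deliberately
simple patterns are all assumptions made up for this illustration.</P>
<PRE><FONT COLOR="#0066FF">#!/usr/bin/perl
# Hypothetical sketch: pick a validity check from a table keyed on scheme.
# The patterns are simplified assumptions, not the rules URI::URL applies.
use strict;

my %check_for = (
    http   => sub { $_[0] =~ m{^//[^/\s]+(/\S*)?$} },    # //host, optional path
    ftp    => sub { $_[0] =~ m{^//[^/\s]+(/\S*)?$} },
    mailto => sub { $_[0] =~ m{^[^\@\s]+\@[^\@\s]+$} },  # user@host
);

sub looks_valid {
    my($url) = @_;
    # split "scheme:rest"; a URL with no scheme is treated as relative
    my($scheme, $rest) = $url =~ /^([A-Za-z][A-Za-z0-9+.\-]*):(.*)$/;
    return "relative" unless defined $scheme;
    # look up the checker for this scheme at run time
    my $check = $check_for{lc $scheme};
    return "unknown scheme" unless $check;
    return $check->($rest) ? "ok" : "malformed";
}

foreach my $u ('http://www.best.com/about.html', 'mailto:info@best.com',
               'mailto:nobody', 'images/faq.gif', 'gopher://host/') {
    print $u, ": ", looks_valid($u), "\n";
}
</FONT></PRE>
<P>Supporting another scheme means adding one entry to the table rather than
complicating a single all-purpose regular expression, which is the appeal of
this design.</P>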
<P>When parsing HTML to find the embedded URLs, you'll also need to use the module
called <TT>HTML::TreeBuilder</TT>. This module takes care of the gory details in
parsing the other internal elements from an HTML document and builds an internal
tree object to represent all the HTML elements within the file. These modules are
part of the Web toolkit that you've been using throughout this book, called libwww.
The complete suite of libwww modules includes the URI, HTML, HTTP, WWW, and Font
classes. Libwww is written and maintained by Mr. Gisle Aas of Norway. The latest
version is always available from his CPAN<TT> </TT>directory:</P>
<PRE><FONT COLOR="#0066FF">~authors/id/GAAS/
</FONT></PRE>
<P>Listing 14.1 demonstrates how to use these modules to extract simple URLs from
an HTML file.
<CENTER>
<H3><A NAME="Heading9"></A><FONT COLOR="#000077">Listing 14.1. simpleparse.</FONT></H3>
</CENTER>
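<PRE><FONT COLOR="#0066FF">use URI::URL;
use HTML::TreeBuilder;
my($h,$pair,$base,$url);
$base = "test.html";
# build the parse tree from a local file
$h = HTML::TreeBuilder->new;
$h->parse_file($base);
# extract_links returns a reference to a list of [link, element] pairs
foreach $pair (@{$h->extract_links(qw(a img))}) {
    my($link,$elem) = @$pair;
    # resolve the link against the base before printing it
    $url = url($link,$base);
    print $url->abs,"\n";
}
</FONT></PRE>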
<P>This short script prints all the links in the file <TT>test.html</TT> that appear
in <TT>A</TT> or <TT>IMG</TT> elements. If you want to parse a
document returned directly from the server, you would use the <TT>parse</TT> method
instead of the <TT>parse_file</TT> method. You'll also need to add the capability
to slurp an HTML file directly from the server with the declaration</P>
<PRE><FONT COLOR="#0066FF">use LWP::Simple qw(get);
</FONT></PRE>
<P>Now the script looks like that in Listing 14.2.
<CENTER>
<H3><A NAME="Heading10"></A><FONT COLOR="#000077">Listing 14.2. simpleparse-net.</FONT></H3>
</CENTER>
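<PRE><FONT COLOR="#0066FF">use URI::URL;
use HTML::TreeBuilder;
use LWP::Simple qw(get);
my($h,$pair,$base,$url,$html);
$base = "http://www.best.com/";
# get() returns undef if the document can't be fetched
$html = get($base);
die "can't retrieve $base\n" unless defined $html;
$h = HTML::TreeBuilder->new;
$h->parse($html);
# tell the parser that there is no more input
$h->eof;
foreach $pair (@{$h->extract_links(qw(a img))}) {
    my($link,$elem) = @$pair;
    $url = url($link,$base);
    print $url->abs,"\n";
}
</FONT></PRE>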
<P>Running Listing 14.2 with the libwww modules properly installed retrieves the
page at the URL</P>
<PRE><FONT COLOR="#0066FF">http://www.best.com/index.html
</FONT></PRE>
<P>which belongs to my current ISP, and prints all of the <TT>A</TT> and <TT>IMG</TT> links within that page:</P>
<PRE><FONT COLOR="#0066FF">http://webx.best.com/cgi-bin/imagemap/mainpl.map
http://www.best.com/images/mainpnl3.gif
http://www.best.com/about.html
http://www.best.com/images/persoff.gif
http://www.best.com/corp.html
http://www.best.com/images/corpserv.gif
http://www.best.com/policy.html
http://www.best.com/images/ourpol.gif
http://www.best.com/support.html
http://www.best.com/images/faq.gif
http://www.best.com/prices.html
http://www.best.com/images/pricepol.gif
http://www.best.com/pop.html
http://www.best.com/images/lan.gif
http://www.best.com/corpppp.html
http://www.best.com/images/webpd.gif
http://www.best.com/client.html
http://www.best.com/images/hosted.gif
http://www.best.com/images/announce.gif
http://www.onlive.com/
http://crystal.onlive.com/beta/index.htm
http://www.best.com/best_resort/entrance.sds
mailto:info@best.com
http://www.best.com/images/best4.gif
mailto:www@best.com
</FONT></PRE>
<P>Note that this listing may vary, depending on your location and whether there
have been changes to <TT>index.html</TT> since this chapter was written. Now that
you've seen how to use the <TT>LWP</TT> modules to do some very simple parsing, let's
take a look at how to use them for some useful tasks.
<CENTER>
<H4><A NAME="Heading11"></A><FONT COLOR="#000077">Editing and Verifying HTML</FONT></H4>
</CENTER>
<P>You can use Perl in a number of ways to edit, verify, and validate HTML.
There are modules that handle the parsing and substitutions,
as well as several complete tools to check the syntax of the HTML and the validity
of the internal anchors to other locations and documents. The following examples
demonstrate how to use these tools to perform tasks that may confront you as a Webmaster
from time to time.</P>
<P><B><TT>Converting from Absolute to Relative URLs</TT></B></P>
<P>Suppose that a Webmaster, while still coming up to speed on the HTML specifications,
creates documents that use the complete form of the URL in all links,
giving the scheme, host, and path. Later, as understanding grows, the Webmaster wishes
to go back and change all the links that correspond to local
resources to the relative form. This way, if another site mirrors the Webmaster's site,
requests for local documents from the mirror copy will be served from the mirror
site instead of the master site.</P>
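<P>The heart of this conversion is a small string transformation, sketched below
in plain Perl with no modules. The <TT>relativize()</TT> helper and its pattern
for picking apart an <TT>http</TT> URL are assumptions made for illustration only;
a real script should leave URL parsing to <TT>URI::URL</TT>.</P>
<PRE><FONT COLOR="#0066FF">#!/usr/bin/perl
use strict;

# Hypothetical helper: return the relative form of $href when it names a
# document under $base_path on $localhost; otherwise return it unchanged.
sub relativize {
    my($href, $localhost, $base_path) = @_;
    # pull the host and path out of a complete http URL
    my($host, $path) = $href =~ m{^http://([^/:]+)(?::\d+)?(/.*)?$}i;
    return $href unless defined $host and lc $host eq lc $localhost;
    $path = "/" unless defined $path;
    # \Q...\E keeps characters such as "." or "+" in the base path literal
    return $href unless $path =~ s{^\Q$base_path\E/}{};
    # an empty result means the link named the base directory itself
    return length $path ? $path : "./";
}

print relativize("http://www.metronet.com/perlinfo/perl5/manual.html",
                 "www.metronet.com", "/perlinfo/perl5"), "\n";  # manual.html
print relativize("http://www.best.com/about.html",
                 "www.metronet.com", "/perlinfo/perl5"), "\n";  # unchanged
</FONT></PRE>
<P>Requiring the slash after the base path keeps a sibling directory such as
<TT>/perlinfo/perl5x</TT> from being mistaken for a local match.</P>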
<P>In order to accomplish this task, you'll need to start with the script that parses
URLs generally, shown in Listing 14.2. Then you'll add the capability (see Listing
14.3) to print out the new HTML file with the links changed to relative form when
they refer to local resources.
<CENTER>
<H3><A NAME="Heading12"></A><FONT COLOR="#000077">Listing 14.3. relativize.</FONT></H3>
</CENTER>
<PRE><FONT COLOR="#0066FF">#!/usr/bin/perl
# relativize - parse html documents and change complete urls
# to relative or partial urls, for the sake of brevity, and to assure
# that connections to mirror sites grab the mirror's copy of the files
#
# Usage: relativize hostname file newfile basepath
#   hostname is the local host
#   file is the html file you wish to parse
#   newfile is the new file to create from the original
#   basepath is the full path to the file, from the http root
#
# Example:
#   relativize www.metronet.com perl5.html newperl5.html /perlinfo/perl5
#
# Note: does not attempt to do parent-relative directory substitutions
use HTML::TreeBuilder;
use URI::URL;
require 5.002;
use strict;
my($h,$usage,$localhost,$filename,$newfile,$base_path);
$usage = "usage: $0 hostname htmlfile newhtmlfile BasePath\n";
$localhost = shift;
$filename = shift;
$newfile = shift;
$base_path = shift;
die $usage unless defined($localhost) and defined($filename)
    and defined($base_path) and defined($newfile);
$h = HTML::TreeBuilder->new;
$h->parse_file($filename);
open(NEW,">$newfile") or die("can't create $newfile: $!\n");
$h->traverse(\&relativize_urls);
close(NEW) or die("can't close $newfile: $!\n");
sub relativize_urls {
    my($e,$start,$depth) = @_;
    # if we've got an element
    if(ref $e){
        my $tag = $e->tag;
        if($start){
            # if the tag is an "A" tag that carries an HREF attribute
            if($tag eq "a" and defined $e->attr("href")){
                my $url = URI::URL->new( $e->attr("href") );
                # if the scheme of the url is http
                if(($url->scheme || "") eq "http"){
                    # if the host is the local host, modify the
                    # href attribute to have the relative url only.
                    if($url->host eq $localhost){
                        # if the path is relative to the base path
                        # of this file (specified on command line)
                        my $path = $url->path;
                        # \Q...\E keeps regex metacharacters in the path literal
                        if($path =~ s/^\Q$base_path\E\/?//){
                            # a filetest could be added here for assurance
                            $e->attr("href",$path);
                        }
                    }
                }
            }
            print NEW $e->starttag;
        }
        elsif(not ($HTML::Element::emptyElement{$tag} or
                   $HTML::Element::optionalEndTag{$tag})){
            print NEW $e->endtag,"\n";
        }
    # else text stuff, just print it out
    } else {
        print NEW $e;
    }
    # a true return value tells traverse() to descend into the children
    1;
}
</FONT></PRE>
<P>In the subroutine <TT>relativize_urls()</TT>, I've borrowed the algorithm from
the <TT>HTML::Element</TT> module's method, called <TT>as_HTML()</TT>, to print everything
from within the HTML file by default. A reference to the <TT>relativize_urls()</TT>