subroutine is passed into the <TT>traverse()</TT> method, which the <TT>HTML::TreeBuilder</TT>
class inherits from the <TT>HTML::Element</TT> class. The desired changes to the complete
URLs that refer to local files are made after verifying that the path component
begins with the specified base path and that the host is the localhost. The output goes to the
new HTML file specified on the command line.</P>
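<P>If you want to experiment with the same idea separately, a minimal sketch (the base URL,
file arguments, and attribute names below are assumptions made for illustration, not the
chapter's script) might look like this:</P>
<PRE><FONT COLOR="#0066FF"># A rough sketch; $base and the attribute list are assumed for illustration.
use strict;
use HTML::TreeBuilder;

my ($infile, $outfile) = @ARGV;
my $base = 'http://localhost/mydocs';       # assumed local base URL to strip

my $tree = HTML::TreeBuilder->new;
$tree->parse_file($infile) || die "can't parse $infile: $!";

$tree->traverse(sub {
    my ($node, $start, $depth) = @_;
    return 1 unless $start && ref $node;    # skip text segments and end-tags
    foreach my $attr (qw(href src)) {
        my $url = $node->attr($attr) or next;
        # rewrite only complete URLs that point at localhost under the base path
        $node->attr($attr, substr($url, length($base) + 1))
            if index($url, "$base/") == 0;
    }
    return 1;                               # keep descending into children
});

open(OUT, "> $outfile") || die "can't write $outfile: $!";
print OUT $tree->as_HTML;
close(OUT);
$tree->delete;                              # free the parse tree when done
</FONT></PRE>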
<P>There are plenty of other uses for the <TT>HTML::TreeBuilder</TT> class and its
parent classes, <TT>HTML::Element</TT> and <TT>HTML::Parser</TT>. See the POD documentation
for the libwww modules for more details.</P>
<P><B><TT>Moving an Entire Archive</TT></B>
Copying an external archive may give rise to the need to change the file or directory
names associated with the external site, and then to correct the URLs in the HTML
files. There may be several reasons for this: The copying site may wish to use a
different layout of the archive; or, as mentioned previously, it may be using a DOS
file system or following an ISO9660 naming policy, which requires a change of file or
directory names if they're not ISO9660-compliant. Placing an archive's contents on
a CD-ROM may also require renaming or re-organizing the original structure. Whatever
the reason, this task can be quite intimidating to perform.</P>
<P>The algorithm itself involves six steps and three complete passes over the archive,
using <TT>File::Find</TT>, or something similar, in order to get everything right.
Let's consider the case where you need to migrate an archive from a UNIX file system,
which allows long filenames, to a DOS file system, which doesn't. I'm not providing
any code here; I'm simply outlining the algorithm, based on a former consulting job
where I performed this service for a client. (A rough, illustrative sketch of the first
two steps appears after the table.)
<TABLE BORDER="0">
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 1: List Directories</TD>
<TD ALIGN="LEFT">The first pass over the archive should create a listing file of all the directories,
in the full path form, within the archive. Each entry of the list should have three
components: the original name; then if the current directory's name is too long,
the list entry should have the original name with any parent directories' new names;
followed by the new name, which is shortened to eight alpha-numeric characters and
doesn't collide with any other names in the current directory, prepended with all
of the parent directories' new names.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 2: Rename Directories</TD>
<TD ALIGN="LEFT">Directories should be renamed during this step, based on the list created during
pass one. The list has to be sorted hierarchically--from the top level to the lowest
level--for the renaming operations to work. The original name of the current directory,
with its parent directories' new names as a full path, should be the first argument
to <TT>rename()</TT>, followed by the new short name, with any new parents in the
path. These should be the second and third elements of the list created during pass
one.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 3: List Files</TD>
<TD ALIGN="LEFT">The third step makes another pass over the archive, creating another list. This list
will have the current (possibly renamed) directory and original filename of each
file, as a full path, followed by the current directory and the new filename. The
new filename will be shortened to the 8.3 format and with verification, again, that
there are no namespace collisions in the current directory.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 4: Rename Files</TD>
<TD ALIGN="LEFT">The fourth step should rename files, based on the list created in pass three.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 5: Create HTML Fixup List</TD>
<TD ALIGN="LEFT">The fifth step in the algorithm takes both lists created previously and creates one
final list, with the original filename or directory name for each file or directory,
followed by the current name. Again, both of these should be specified as a full
path. This list will then be used to correct any anchors or links that have been
affected by this massive change and that live in your HTML files.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 6: Fix HTML Files</TD>
<TD ALIGN="LEFT">The final step in the algorithm reads in the list created in Step 5, and opens each
file for fixing the internal links that still have the original names and paths of
the files. It should refer to the list created in Step 5 to decide whether to change
a given URL during the parsing process and overwrite the current HTML file. Line
termination characters should be changed to the appropriate one for the new architecture
at this time, too.</TD>
</TR>
</TABLE>
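<P>As promised, no production code, but a rough, hypothetical sketch of the first two
steps (list the directories, then rename them) might be structured like this; the
8-character shortening rule and all of the names here are assumptions for illustration:</P>
<PRE><FONT COLOR="#0066FF"># Hypothetical sketch of Steps 1 and 2 only; the shortening rule is an assumption.
use strict;
use File::Find;

my $top = shift || '.';
my %newname;    # original full path => full path built from the new names
my %taken;      # new parent path    => { short names already used there }
my @list;       # entries: [original, original with new parents, new]

# Step 1: list directories top-down, proposing a unique short name for each.
find({ wanted => \&list_dir, no_chdir => 1 }, $top);

sub list_dir {
    return unless -d $File::Find::name && $File::Find::name ne $top;
    my ($parent, $orig) = $File::Find::name =~ m{^(.*)/([^/]+)$} or return;
    my $newparent = $newname{$parent} || $parent;    # parents were seen first

    (my $short = uc $orig) =~ s/[^A-Z0-9]//g;        # strip non-alphanumerics
    $short = substr($short, 0, 8) || 'DIR';
    my $n = 0;                                       # resolve name collisions
    $short = sprintf('%.6s%02d', $short, $n++) while $taken{$newparent}{$short}++;

    $newname{$File::Find::name} = "$newparent/$short";
    push @list, [$File::Find::name, "$newparent/$orig", "$newparent/$short"];
}

# Step 2: rename, relying on the top-down order so parents are renamed first.
foreach my $entry (@list) {
    my ($current, $new) = @$entry[1, 2];
    rename($current, $new) || warn "rename $current -> $new failed: $!\n";
}
</FONT></PRE>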
<P>It's a rather complicated process, to be sure. Of course, if you design your archive
from the original planning stages to account for the possibility of this sort of
task (by using ISO9660 names), then you'll never have to suffer the pain and time
consumption of this process.</P>
<P><B><TT>Verification of HTML Elements</TT></B> The process
of verifying the links that point to local documents within your HTML should be performed
on a regular basis. Occasionally, and especially if you're not using a form of revision
control as discussed previously, you may make a change to the structure of your archive
that will render a link useless until it is changed to reflect the new name or location
of the resource to which it points.</P>
<P>Broken links are also a problem that you will confront when you're using links
to external sites' HTML pages or to other remote resources. The external site may
change its layout or structure or, more drastically, its hostname, due to a move
or other issues. In these cases, you might be notified of the pending change, or shortly
after it happens--if the remote site is "aware" that you're linking to its resources.
(This is one reason to notify an external site when you create links to its resources.)
Then, at the appropriate time, you'll be able to make the change to your local HTML
files that include these links.</P>
<P>Several scripts and tools are available that implement this task for you. Tom
Christiansen has written a simple one called <TT>churl</TT>. This simple script does
limited verification of URLs in an HTML file retrieved from a server. It verifies
the existence and retrievability of HTTP, ftp, and file URLs. It could be modified
to suit your needs, and optionally verify relative (local) URLs or partial URLs.
It's available at the CPAN in his <TT>authors</TT> directory:</P>
<PRE><FONT COLOR="#0066FF">~/authors/id/TOMC/scripts.
</FONT></PRE>
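<P>Tom's script isn't reproduced here, but a minimal sketch of the same kind of check,
using the <TT>HTML::LinkExtor</TT> and <TT>LWP::Simple</TT> modules from libwww-perl
(the scheme list and output format are assumptions), could look like this:</P>
<PRE><FONT COLOR="#0066FF">#!/usr/bin/perl -w
# Sketch only, not churl itself: fetch one page and HEAD every link found in it.
use strict;
use LWP::Simple qw(get head);
use HTML::LinkExtor;
use URI;

my $page = shift or die "usage: $0 url\n";
my $html = get($page) or die "can't retrieve $page\n";

my @links;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attrs) = @_;
    push @links, values %attrs;             # collect href/src/etc. values
});
$parser->parse($html);
$parser->eof;

foreach my $link (@links) {
    my $abs = URI->new_abs($link, $page);   # resolve relative URLs
    next unless $abs->scheme =~ /^(http|ftp|file)$/;
    print head($abs) ? "ok      $abs\n" : "BROKEN  $abs\n";
}
</FONT></PRE>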
<P>He has also created a number of other useful scripts and tools for use in Web
maintenance and security, which also can be retrieved from his directory at any CPAN
site.</P>
<P>The other tool we'll mention here, called <TT>weblint</TT>, is written by Neil
Bowers and is probably the most comprehensive package available for verification
of HTML files. In addition to checking for the existence of local anchor targets,
it also thoroughly checks the other elements in your HTML file.</P>
<P>The <TT>weblint</TT> tool is available at any CPAN archive, under Neil Bowers's
<TT>authors</TT> directory:</P>
<PRE><FONT COLOR="#0066FF">~/authors/id/NEILB/weblint-*.tar.gz.
</FONT></PRE>
<P>It's widely used and highly recommended. Combining this tool with something such
as Tom Christiansen's <TT>churl</TT> script will give you a complete verification
package for your HTML files. See the <TT>README</TT> file with <TT>weblint</TT> for
a complete description of all the features.
<CENTER>
<H4><A NAME="Heading13"></A><FONT COLOR="#000077">Parsing HTTP Logfiles</FONT></H4>
</CENTER>
<P>As Webmaster, you may be called upon, from time to time, to provide a report of
the usage of your Web pages. There may be several reasons for this, not the least
of which may be to justify your existence <TT>:-)</TT>. More likely, though, the
need will be to get a feel for the general usage model of your Web site or what types
of errors are occurring.</P>
<P>Most of the available httpd servers provide you with an access log by default,
along with some sort of an error log. Each of these logs has a separate format for
its records, but there are a number of common fields, which lends itself naturally to an
object-oriented model for parsing them and producing reports.</P>
<P>We'll be looking at the <TT>Logfile</TT> module, written by Ulrich Pfeifer, in
this section. It provides you with the ability to subclass the base record object
and has subclass modules available for a number of servers' log files, including
NCSA httpd, Apache httpd, CERN httpd, WUFTP, and others. If there isn't a subclass
for your particular server, it's pretty easy to write one.</P>
<P><B><TT>General Issues</TT></B>
An HTTP server implements its logging according to configuration settings, usually
within the <TT>httpd.conf</TT> file. The data you have to analyze depends on which
log files you enable in the configuration file or, in the case of the Apache server, which
logging modules you compile into the server's source. Several logs can be enabled in the
configuration, including the access log, error log, referer log, and agent log. Each of
these has information that you may need to summarize or analyze.</P>
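<P>For reference, on an Apache-style server the relevant <TT>httpd.conf</TT> directives
might look something like the following fragment (the names and paths here are assumptions
for illustration; the referer and agent logs also require their optional logging modules to
be compiled into the server):</P>
<PRE><FONT COLOR="#0066FF"># Assumed Apache 1.x-style httpd.conf fragment; adjust names and paths to taste.
# The error log receives STDERR output from the server and from CGI programs.
ErrorLog    logs/error_log
# The access (transfer) log records one entry per request served.
TransferLog logs/access_log
# These two require the optional referer and agent logging modules.
RefererLog  logs/referer_log
AgentLog    logs/agent_log
</FONT></PRE>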
<P><B><TT>Logging Connections</TT></B>
<BLOCKQUOTE>
<P>There are some security and privacy issues related to logging too much information.
Be sure to keep the appropriate permissions on your logfiles to prevent arbitrary
snooping or parsing, and truncate them when you've completed the data gathering.
See Chapter 3 for more details.
</BLOCKQUOTE>
<P>In general, the httpd log file is a text file with records as lines terminated
with the appropriate line terminator for the architecture under which the server
is running. The individual records have fields, usually separated by blank space, that
are strings representing dates, file paths, hostnames or IP numbers, and other items.
Ordinarily, there is one line or record per connection, but some types of
transactions generate multiple lines in the log file(s). This should be considered
when designing the algorithm and code that parses the log.</P>
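<P>For instance, a single access-log record in the widely used common log format can be
broken into its fields with a pattern like this one (a sketch; it assumes the common
format rather than any particular server's variation):</P>
<PRE><FONT COLOR="#0066FF"># Sketch: split common-log-format records, read from STDIN, into their fields.
# A typical record looks like:
#   host.example.com - - [01/Jan/1998:12:34:56 -0500] "GET /index.html HTTP/1.0" 200 1534
while (my $line = &lt;STDIN&gt;) {
    my ($host, $ident, $user, $date, $request, $status, $bytes) =
        $line =~ m/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)/
        or next;                  # skip records that don't match the format
    print "$host requested \"$request\" (status $status, $bytes bytes)\n";
}
</FONT></PRE>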
<P>The access log gives you general information regarding what site is connecting
to your server and what files are being retrieved. The error log receives and records
the output from the <TT>STDERR</TT> filehandle from all connections. Both of these,
and especially the error log, may need to be parsed every now and then to see what's
happening with your server's connections.</P>
<P><B><TT>Parsing</TT></B> With the <TT>Logfile</TT>
module, each discrete transaction record is parsed and abstracted into a Perl object,
keyed on some parameter of the request. During the process of parsing
the log file, the instance variables that are created by the <TT>new()</TT> method
depend on which type of log is being parsed and which fields (Hostname, Date, Path,
and so on) from the log file you're interested in summarizing. When parsing is complete,
the return value, a reference blessed into the <TT>Logfile</TT> class, holds a hash whose
key/value pairs map each parameter you want to gather statistics on to the number of
times it was counted. In the simplest case, you write these lines:</P>
<PRE><FONT COLOR="#0066FF">use Logfile::Apache; # to parse the popular Apache server log
$l = new Logfile::Apache File => `/usr/local/etc/httpd/logs/access_log',
Group => [qw(Host Domain File)];
</FONT></PRE>
<P>This parses your access log and returns the blessed reference.</P>
<P><B><TT>Reporting and Summaries</TT></B> After you've invoked the <TT>new()</TT> method for the <TT>Logfile</TT>
class and passed in your log file to be parsed, you can invoke the <TT>report()</TT>
method on the returned object.</P>
<PRE><FONT COLOR="#0066FF">$l->report(Group => File, Sort => Records, Top => 10);
</FONT></PRE>
<P>The preceding line produces a report detailing the access counts of each of the
top ten files retrieved from your archive and their percentages of the total number
of retrievals. For the sample Apache access log file included with
the <TT>Logfile</TT> distribution, the results from the <TT>report()</TT> method look like
this:</P>
<PRE><FONT COLOR="#0066FF">File Records
=======================================
/mall/os 5 35.71%
/mall/web 3 21.43%
/~watkins 3 21.43%
/cgi-bin/mall 1 7.14%
/graphics/bos-area-map 1 7.14%
/~rsalz 1 7.14%
</FONT></PRE>
<P>You can generate many other reports with the <TT>Logfile</TT> module, including
multiple-variable reports, to suit your needs and interests. See the <TT>Logfile</TT>
documentation, embedded as POD in <TT>Logfile.pm</TT>, for additional information.
You can get the <TT>Logfile</TT> module from the CPAN, in Ulrich Pfeifer's <TT>authors</TT>
directory:</P>
<PRE><FONT COLOR="#0066FF">~/authors/id/ULPFR/
</FONT></PRE>
<P>The latest release, as of the writing of this chapter, was 0.113. Have a look,
and don't forget to give feedback to the author when you can.</P>
<P><B><TT>Generating Graphical
Data</TT></B> After you've gotten your reports back from <TT>Logfile</TT>, you've
pretty much exhausted the functionality of the module. In order to produce an image
that illustrates the data, you'll need to resort to other means. Because the report
gives essentially two-dimensional data, it'll be easy to produce a representative
image using the <TT>GD</TT> module, which was previously introduced in Chapter 12,
"Multimedia."</P>
<P>This example provides you with a module that uses the <TT>GD</TT> class and provides
one method to which you should pass a <TT>Logfile</TT> object, along with some other
parameters to specify which field from the log file you wish to graph, the resultant
image size, and the font. This method actually would be better placed into the <TT>Logfile::Base</TT>
class, because that's where each of the <TT>Logfile</TT> subclasses, including the