subroutine is passed into the <TT>traverse()</TT> method, which the <TT>HTML::TreeBuilder</TT>
class inherits from the <TT>HTML::Element</TT> class. The desired changes to the complete
URLs that refer to local files are made after verifying that the path component
begins with the specified base path and that the host is the localhost. The output goes to the
new HTML file specified on the command line.</P>
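<P>If you want to experiment with the same idea separately, a minimal sketch (the base URL,
file arguments, and attribute names below are assumptions made for illustration, not the
chapter's script) might look like this:</P>
<PRE><FONT COLOR="#0066FF"># A rough sketch; $base and the attribute list are assumed for illustration.
use strict;
use HTML::TreeBuilder;

my ($infile, $outfile) = @ARGV;
my $base = 'http://localhost/mydocs';       # assumed local base URL to strip

my $tree = HTML::TreeBuilder->new;
$tree->parse_file($infile) || die "can't parse $infile: $!";

$tree->traverse(sub {
    my ($node, $start, $depth) = @_;
    return 1 unless $start && ref $node;    # skip text segments and end-tags
    foreach my $attr (qw(href src)) {
        my $url = $node->attr($attr) or next;
        # rewrite only complete URLs that point at localhost under the base path
        $node->attr($attr, substr($url, length($base) + 1))
            if index($url, "$base/") == 0;
    }
    return 1;                               # keep descending into children
});

open(OUT, "> $outfile") || die "can't write $outfile: $!";
print OUT $tree->as_HTML;
close(OUT);
$tree->delete;                              # free the parse tree when done
</FONT></PRE>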
<P>There are plenty of other uses for the <TT>HTML::TreeBuilder</TT> class and its
parent classes, <TT>HTML::Element</TT> and <TT>HTML::Parser</TT>. See the POD documentation
for the libwww modules for more details.</P>
<P><B><TT>Moving an Entire Archive</TT></B>
Copying an external archive may give rise to the need to change the file or directory
names associated with the external site, and then to correct the URLs in the HTML
files. There may be several reasons for this: The copying site may wish to use a
different layout of the archive; or, as mentioned previously, it may be using a DOS
file system or following an ISO9660 naming policy, which requires a change of file or
directory names if they're not ISO9660-compliant. Placing an archive's contents on
a CD-ROM may also require renaming or re-organizing the original structure. Whatever
the reason, this task can be quite intimidating to perform.</P>
<P>The algorithm itself involves six steps and three complete passes over the archive,
using <TT>File::Find</TT>, or something similar, in order to get everything right.
Let's consider the case where you need to migrate an archive from a UNIX file system,
which allows long filenames, to a DOS file system, which doesn't. I'm not providing
any code here; I'm simply outlining the algorithm, based on a former consulting job
where I performed this service for a client. (A rough, illustrative sketch of the first
two steps appears after the table.)
<TABLE BORDER="0">
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 1: List Directories</TD>
<TD ALIGN="LEFT">The first pass over the archive should create a listing file of all the directories,
in the full path form, within the archive. Each entry of the list should have three
components: the original name; then if the current directory's name is too long,
the list entry should have the original name with any parent directories' new names;
followed by the new name, which is shortened to eight alpha-numeric characters and
doesn't collide with any other names in the current directory, prepended with all
of the parent directories' new names.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 2: Rename Directories</TD>
<TD ALIGN="LEFT">Directories should be renamed during this step, based on the list created during
pass one. The list has to be sorted hierarchically--from the top level to the lowest
level--for the renaming operations to work. The original name of the current directory,
with its parent directories' new names as a full path, should be the first argument
to <TT>rename()</TT>, followed by the new short name, with any new parents in the
path. These should be the second and third elements of the list created during pass
one.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 3: List Files</TD>
<TD ALIGN="LEFT">The third step makes another pass over the archive, creating another list. This list
will have the current (possibly renamed) directory and original filename of each
file, as a full path, followed by the current directory and the new filename. The
new filename will be shortened to the 8.3 format and with verification, again, that
there are no namespace collisions in the current directory.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 4: Rename Files</TD>
<TD ALIGN="LEFT">The fourth step should rename files, based on the list created in pass three.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 5: Create HTML Fixup List</TD>
<TD ALIGN="LEFT">The fifth step in the algorithm takes both lists created previously and creates one
final list, with the original filename or directory name for each file or directory,
followed by the current name. Again, both of these should be specified as a full
path. This list will then be used to correct any anchors or links that have been
affected by this massive change and that live in your HTML files.</TD>
</TR>
<TR ALIGN="LEFT" rowspan="1">
<TD ALIGN="LEFT" VALIGN="TOP">Step 6: Fix HTML Files</TD>
<TD ALIGN="LEFT">The final step in the algorithm reads in the list created in Step 5, and opens each
file for fixing the internal links that still have the original names and paths of
the files. It should refer to the list created in Step 5 to decide whether to change
a given URL during the parsing process and overwrite the current HTML file. Line
termination characters should be changed to the appropriate one for the new architecture
at this time, too.</TD>
</TR>
</TABLE>
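<P>As promised, no production code, but a rough, hypothetical sketch of the first two
steps (list the directories, then rename them) might be structured like this; the
8-character shortening rule and all of the names here are assumptions for illustration:</P>
<PRE><FONT COLOR="#0066FF"># Hypothetical sketch of Steps 1 and 2 only; the shortening rule is an assumption.
use strict;
use File::Find;

my $top = shift || '.';
my %newname;    # original full path => full path built from the new names
my %taken;      # new parent path    => { short names already used there }
my @list;       # entries: [original, original with new parents, new]

# Step 1: list directories top-down, proposing a unique short name for each.
find({ wanted => \&list_dir, no_chdir => 1 }, $top);

sub list_dir {
    return unless -d $File::Find::name && $File::Find::name ne $top;
    my ($parent, $orig) = $File::Find::name =~ m{^(.*)/([^/]+)$} or return;
    my $newparent = $newname{$parent} || $parent;    # parents were seen first

    (my $short = uc $orig) =~ s/[^A-Z0-9]//g;        # strip non-alphanumerics
    $short = substr($short, 0, 8) || 'DIR';
    my $n = 0;                                       # resolve name collisions
    $short = sprintf('%.6s%02d', $short, $n++) while $taken{$newparent}{$short}++;

    $newname{$File::Find::name} = "$newparent/$short";
    push @list, [$File::Find::name, "$newparent/$orig", "$newparent/$short"];
}

# Step 2: rename, relying on the top-down order so parents are renamed first.
foreach my $entry (@list) {
    my ($current, $new) = @$entry[1, 2];
    rename($current, $new) || warn "rename $current -> $new failed: $!\n";
}
</FONT></PRE>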
<P>It's a rather complicated process, to be sure. Of course, if you design your archive
from the original planning stages to account for the possibility of this sort of
task (by using ISO9660 names), then you'll never have to suffer the pain and time
consumption of this process.</P>
<P><B><TT>Verification of HTML Elements</TT></B> The process
of verifying the links that point to local documents within your HTML should be performed
on a regular basis. Occasionally, and especially if you're not using a form of revision
control as discussed previously, you may make a change to the structure of your archive
that will render a link useless until it is changed to reflect the new name or location
of the resource to which it points.</P>
<P>Broken links are also a problem that you will confront when you're using links
to external sites' HTML pages or to other remote resources. The external site may
change its layout or structure or, more drastically, its hostname, due to a move
or other issues. In these cases, you might be notified of the pending change, or shortly
after it happens--if the remote site is "aware" that you're linking to its resources.
(This is one reason to notify an external site when you create links to its resources.)
Then, at the appropriate time, you'll be able to make the change to your local HTML
files that include these links.</P>
<P>Several scripts and tools are available that implement this task for you. Tom
Christiansen has written a simple one called <TT>churl</TT>. This simple script does
limited verification of URLs in an HTML file retrieved from a server. It verifies
the existence and retrievability of HTTP, ftp, and file URLs. It could be modified
to suit your needs, and optionally verify relative (local) URLs or partial URLs.
It's available at the CPAN in his <TT>authors</TT> directory:</P>
<PRE><FONT COLOR="#0066FF">~/authors/id/TOMC/scripts.
</FONT></PRE>
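<P>Tom's script isn't reproduced here, but a minimal sketch of the same kind of check,
using the <TT>HTML::LinkExtor</TT> and <TT>LWP::Simple</TT> modules from libwww-perl
(the scheme list and output format are assumptions), could look like this:</P>
<PRE><FONT COLOR="#0066FF">#!/usr/bin/perl -w
# Sketch only, not churl itself: fetch one page and HEAD every link found in it.
use strict;
use LWP::Simple qw(get head);
use HTML::LinkExtor;
use URI;

my $page = shift or die "usage: $0 url\n";
my $html = get($page) or die "can't retrieve $page\n";

my @links;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attrs) = @_;
    push @links, values %attrs;             # collect href/src/etc. values
});
$parser->parse($html);
$parser->eof;

foreach my $link (@links) {
    my $abs = URI->new_abs($link, $page);   # resolve relative URLs
    next unless $abs->scheme =~ /^(http|ftp|file)$/;
    print head($abs) ? "ok      $abs\n" : "BROKEN  $abs\n";
}
</FONT></PRE>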
<P>He has also created a number of other useful scripts and tools for use in Web
maintenance and security, which also can be retrieved from his directory at any CPAN
site.</P>
<P>The other tool we'll mention here, called <TT>weblint</TT>, is written by Neil
Bowers and is probably the most comprehensive package available for verification
of HTML files. In addition to checking for the existence of local anchor targets,
it also thoroughly checks the other elements in your HTML file.</P>
<P>The <TT>weblint</TT> tool is available at any CPAN archive, under Neil Bowers's
<TT>authors</TT> directory:</P>
<PRE><FONT COLOR="#0066FF">~/authors/id/NEILB/weblint-*.tar.gz.
</FONT></PRE>
<P>It's widely used and highly recommended. Combining this tool with something such
as Tom Christiansen's <TT>churl</TT> script will give you a complete verification
package for your HTML files. See the <TT>README</TT> file with <TT>weblint</TT> for
a complete description of all the features.
<CENTER>
<H4><A NAME="Heading13"></A><FONT COLOR="#000077">Parsing HTTP Logfiles</FONT></H4>
</CENTER>
<P>As Webmaster, you may be called upon, from time to time, to provide a report of
the usage of your Web pages. There may be several reasons for this, not the least
of which may be to justify your existence <TT>:-)</TT>. More likely, though, the
need will be to get a feel for the general usage model of your Web site or what types
of errors are occurring.</P>
<P>Most of the available httpd servers provide you with an access log by default,
along with some sort of an error log. Each of these logs has a separate format for
its records, but there are a number of common fields, which lends itself naturally to an
object-oriented model for parsing them and producing reports.</P>
<P>We'll be looking at the <TT>Logfile</TT> module, written by Ulrich Pfeifer, in
this section. It provides you with the ability to subclass the base record object
and has subclass modules available for a number of servers' log files, including
NCSA httpd, Apache httpd, CERN httpd, WUFTP, and others. If there isn't a subclass
for your particular server, it's pretty easy to write one.</P>
<P><B><TT>General Issues</TT></B>
An HTTP server implements its logging according to configuration settings, usually
within the <TT>httpd.conf</TT> file. The data you have to analyze depends on which
log files you enable in the configuration file or, in the case of the Apache server, which
logging modules you compile into the server's source. Several logs can be enabled in the
configuration, including the access log, error log, referer log, and agent log. Each of
these has information that you may need to summarize or analyze.</P>
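<P>For reference, on an Apache-style server the relevant <TT>httpd.conf</TT> directives
might look something like the following fragment (the names and paths here are assumptions
for illustration; the referer and agent logs also require their optional logging modules to
be compiled into the server):</P>
<PRE><FONT COLOR="#0066FF"># Assumed Apache 1.x-style httpd.conf fragment; adjust names and paths to taste.
# The error log receives STDERR output from the server and from CGI programs.
ErrorLog    logs/error_log
# The access (transfer) log records one entry per request served.
TransferLog logs/access_log
# These two require the optional referer and agent logging modules.
RefererLog  logs/referer_log
AgentLog    logs/agent_log
</FONT></PRE>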
<P><B><TT>Logging Connections</TT></B>
<BLOCKQUOTE>
<P>There are some security and privacy issues related to logging too much information.
Be sure to keep the appropriate permissions on your logfiles to prevent arbitrary
snooping or parsing, and truncate them when you've completed the data gathering.
See Chapter 3 for more details.
</BLOCKQUOTE>
<P>In general, the httpd log file is a text file with records as lines terminated
with the appropriate line terminator for the architecture under which the server
is running. The individual records have fields, usually separated by blank space, that
are strings representing dates, file paths, hostnames or IP numbers, and other items.
Ordinarily, there is one line or record per connection, but some types of
transactions generate multiple lines in the log file(s). This should be considered
when designing the algorithm and code that parses the log.</P>
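<P>For instance, a single access-log record in the widely used common log format can be
broken into its fields with a pattern like this one (a sketch; it assumes the common
format rather than any particular server's variation):</P>
<PRE><FONT COLOR="#0066FF"># Sketch: split common-log-format records, read from STDIN, into their fields.
# A typical record looks like:
#   host.example.com - - [01/Jan/1998:12:34:56 -0500] "GET /index.html HTTP/1.0" 200 1534
while (my $line = &lt;STDIN&gt;) {
    my ($host, $ident, $user, $date, $request, $status, $bytes) =
        $line =~ m/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)/
        or next;                  # skip records that don't match the format
    print "$host requested \"$request\" (status $status, $bytes bytes)\n";
}
</FONT></PRE>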
<P>The access log gives you general information regarding what site is connecting
to your server and what files are being retrieved. The error log receives and records
the output from the <TT>STDERR</TT> filehandle from all connections. Both of these,
and especially the error log, may need to be parsed every now and then to see what's
happening with your server's connections.</P>
<P><B><TT>Parsing</TT></B> With the <TT>Logfile</TT>
module, each discrete transaction record is parsed and abstracted into a Perl object,
keyed on some parameter of the request. During the process of parsing
the log file, the instance variables that are created by the <TT>new()</TT> method
depend on which type of log is being parsed and which fields (Hostname, Date, Path,
and so on) from the log file you're interested in summarizing. When parsing is complete,
the return value, a reference blessed into the <TT>Logfile</TT> class, holds a hash whose
key/value pairs map each parameter you want to gather statistics on to the number of
times it was counted. In the simplest case, you write these lines:</P>
<PRE><FONT COLOR="#0066FF">use Logfile::Apache; # to parse the popular Apache server log
$l = new Logfile::Apache File => `/usr/local/etc/httpd/logs/access_log',
Group => [qw(Host Domain File)];
</FONT></PRE>
<P>This parses your access log and returns the blessed reference.</P>
<P><B><TT>Reporting and Summaries</TT></B> After you've invoked the <TT>new()</TT> method for the <TT>Logfile</TT>
class and passed in your log file to be parsed, you can invoke the <TT>report()</TT>
method on the returned object.</P>
<PRE><FONT COLOR="#0066FF">$l->report(Group => File, Sort => Records, Top => 10);
</FONT></PRE>
<P>The preceding line produces a report detailing the access counts of each of the
top ten files retrieved from your archive and their percentages of the total number
of retrievals. For the sample Apache access log file included with
the <TT>Logfile</TT> distribution, the results from the <TT>report()</TT> method look like
this:</P>
<PRE><FONT COLOR="#0066FF">File Records
=======================================
/mall/os 5 35.71%
/mall/web 3 21.43%
/~watkins 3 21.43%
/cgi-bin/mall 1 7.14%
/graphics/bos-area-map 1 7.14%
/~rsalz 1 7.14%
</FONT></PRE>
<P>You can generate many other reports with the <TT>Logfile</TT> module, including
multiple-variable reports, to suit your needs and interests. See the <TT>Logfile</TT>
documentation, embedded as POD in <TT>Logfile.pm</TT>, for additional information.
You can get the <TT>Logfile</TT> module from the CPAN, in Ulrich Pfeifer's <TT>authors</TT>
directory:</P>
<PRE><FONT COLOR="#0066FF">~/authors/id/ULPFR/
</FONT></PRE>
<P>The latest release, as of the writing of this chapter, was 0.113. Have a look,
and don't forget to give feedback to the author when you can.</P>
<P><B><TT>Generating Graphical
Data</TT></B> After you've gotten your reports back from <TT>Logfile</TT>, you've
pretty much exhausted the functionality of the module. In order to produce an image
that illustrates the data, you'll need to resort to other means. Because the report
gives essentially two-dimensional data, it'll be easy to produce a representative
image using the <TT>GD</TT> module, which was previously introduced in Chapter 12,
"Multimedia."</P>
<P>This example provides you with a module that uses the <TT>GD</TT> class and provides
one method to which you should pass a <TT>Logfile</TT> object, along with some other
parameters to specify which field from the log file you wish to graph, the resultant
image size, and the font. This method actually would be better placed into the <TT>Logfile::Base</TT>
class, because that's where each of the <TT>Logfile</TT> subclasses, including the