<HTML>
<HEAD>
<TITLE>Chapter 21 -- Using Perl with Web Servers</TITLE>
<META>
</HEAD>
<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000EE" VLINK="#551A8B" ALINK="#CE2910">
<H1><FONT SIZE=6 COLOR=#FF0000>Chapter 21</FONT></H1>
<H1><FONT SIZE=6 COLOR=#FF0000>Using Perl with Web Servers</FONT>
</H1>
<HR>
<P>
<CENTER><B><FONT SIZE=5>CONTENTS</FONT></B></CENTER>
<UL>
<LI><A HREF="#ServerLogFiles">
Server Log Files</A>
<UL>
<LI><A HREF="#ExampleReadingaLogFile">
Example: Reading a Log File</A>
<LI><A HREF="#ExampleListingAccessbyDocument">
Example: Listing Access by Document</A>
<LI><A HREF="#ExampleLookingattheStatusCode">
Example: Looking at the Status Code</A>
<LI><A HREF="#ExampleConvertingtheReporttoaWebPage">
Example: Converting the Report to a Web Page</A>
<LI><A HREF="#ExistingLogFileAnalyzingPrograms">
Existing Log File Analyzing Programs</A>
<LI><A HREF="#CreatingYourOwnCGILogFile">
Creating Your Own CGI Log File</A>
</UL>
<LI><A HREF="#CommunicatingwithUsers">
Communicating with Users</A>
<UL>
<LI><A HREF="#ExampleGeneratingaWhatsNewPage">
Example: Generating a What's New Page</A>
<LI><A HREF="#ExampleGettingUserFeedback">
Example: Getting User Feedback</A>
</UL>
<LI><A HREF="#Summary">
Summary</A>
<LI><A HREF="#ReviewExercises">
Review Exercises</A>
</UL>
<HR>
<P>
Web servers frequently need some type of maintenance in order
to operate at peak efficiency. This chapter will look at some
maintenance tasks that can be performed by Perl programs. You
will see some ways that your server keeps track of who visits
and what Web pages are accessed on your site. You will also see
some ways to automatically generate a site index, a what's new
document, and user feedback about a Web page.
<H2><A NAME="ServerLogFiles"><FONT SIZE=5 COLOR=#FF0000>
Server Log Files</FONT></A></H2>
<P>
The most useful tool to assist in understanding how and when your
Web site pages and applications are being accessed is the log
file generated by your Web server. This log file contains, among
other things, which pages are being accessed, by whom, and when.
<P>
Each Web server will provide some form of log file that records
who and what accesses a specific HTML page or graphic. A terrific
site to get an overall comparison of the major Web servers can
be found at <B>http://www.webcompare.com/</B>. From this site
one can see which Web servers follow the CERN/NCSA common log
format that is detailed below. You can also find out which
servers can customize their log files or write to multiple
log files. You might also be surprised at the number of Web servers
there are on the market.
<P>
Understanding the contents of the server log files is a worthwhile
endeavor. And in this section, you'll see several ways that the
information in the log files can be manipulated. However, if you're
like most people, you'll use one of the log file analyzers that
you'll read about in the section "Existing Log File Analyzing
Programs" to do most of your work. After all, you don't want
to create a program that others are giving away for free.<BR>
<p>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Note </B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
This section about server log files is one that you can read when the need arises. If you are not actively running a Web server now, you won't be able to get full value from the examples. The CD-ROM that accompanies this book has a sample log file for you
to experiment on, but it is very limited in size and scope.</BLOCKQUOTE>
</TD></TR>
</TABLE>
</CENTER>
<P>
<P>
Nearly all of the major Web servers use a common format for their
log files. These log files contain information such as the IP
address of the remote host, the document that was requested, and
a timestamp. The syntax for each line of a log file is:
<BLOCKQUOTE>
<PRE>
site logName fullName [date:time GMToffset] "req file proto" status length
</PRE>
</BLOCKQUOTE>
<P>
Because that line of syntax is relatively meaningless, here is
a line from a real log file:
<BLOCKQUOTE>
<PRE>
204.31.113.138 - - [03/Jul/1996:06:56:12 -0800]
"GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593
</PRE>
</BLOCKQUOTE>
<P>
Even though I have split the line into two, you need to remember
that inside the log file it really is only one line.
<P>
Each of the eleven items listed in the above syntax and example
is described in the following list.
<UL>
<LI><B>site</B>-either an IP address or the symbolic name of the
site making the HTTP request. In the example line the site
is <TT>204.31.113.138</TT>.
<LI><B>logName</B>-login name of the user who owns the account
that is making the HTTP request. Most remote sites don't give
out this information for security reasons. If this field is disabled
by the host, you see a dash (<TT>-</TT>)
instead of the login name.
<LI><B>fullName</B>-full name of the user who owns the account
that is making the HTTP request. Most remote sites don't give
out this information for security reasons. If this field is disabled
by the host, you see a dash (<TT>-</TT>)
instead of the full name. If your server requires a user id in
order to fulfill an HTTP request, the user id will be placed in
this field.
<LI><B>date</B>-date of the HTTP request. In the example line
the date is <TT>03/Jul/1996</TT>.
<LI><B>time</B>-time of the HTTP request. The time will be presented
in 24-hour format. In the example line the time is <TT>06:56:12</TT>.
<LI><B>GMToffset</B>-signed offset from Greenwich Mean Time. GMT
is the international time reference. In the example line the offset
is <TT>-0800</TT>, eight hours earlier than GMT.
<LI><B>req</B>-HTTP command. For ordinary WWW page requests, this field
is usually the GET command, although other commands such as POST and
HEAD also appear. In the example line the request is <TT>GET</TT>.
<LI><B>file</B>-path and filename of the requested file. In the
example line the file is <TT>/PowerBuilder/Compny3.htm</TT>.
There are three types of path/filename combinations:
</UL>
<BLOCKQUOTE>
<B>Implied Path and Filename</B>-accesses a file in a user's home
directory. For example, <TT>/~foo/</TT>
could be expanded into <TT>/user/foo/homepage.html</TT>.
The <TT>/user/foo</TT> directory is
the home directory for the user <TT>foo</TT>.
And <TT>homepage.html</TT> is the
default file name for any user's home page. Implied paths are
hard to analyze because you need to know how the server is set
up and because the server's set up may change.
</BLOCKQUOTE>
<BLOCKQUOTE>
<B>Relative Path and Filename</B>-accesses a file in a directory
that is specified relative to a user's home directory. For example,
<TT>/~foo/cooking.html</TT> will be
expanded into <TT>/user/foo/cooking.html</TT>.
</BLOCKQUOTE>
<BLOCKQUOTE>
<B>Full Path and Filename</B>-accesses a file by explicitly stating
the full directory and filename. For example, <TT>/user/foo/biking/mountain/index.html</TT>.
</BLOCKQUOTE>
<UL>
<LI><B>proto</B>-type of protocol used for the request. In the
example line, the protocol is <TT>HTTP/1.0</TT>.
<LI><B>status</B>-status code generated by the request. In the
example line the status is <TT>200</TT>.
See section "Example: Looking at the Status Code" later
in the chapter for more information.
<LI><B>length</B>-length of the requested document in bytes. In the example
line the length is <TT>5593</TT>.
</UL>
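<P>
The GMToffset field is easier to work with once it is turned into
a number. The following is only a minimal sketch: the subroutine name
<TT>offset_to_seconds</TT> is made up for this illustration, and it
assumes the offset always has the form of a sign followed by two
hour digits and two minute digits, as in the example line.
<BLOCKQUOTE>
<PRE>use strict;
use warnings;

# Convert an offset such as "-0800" into a signed number of seconds.
# offset_to_seconds is a hypothetical helper, not part of any library.
sub offset_to_seconds {
    my ($offset) = @_;
    my ($sign, $hours, $minutes) = $offset =~ /^([+-])(\d\d)(\d\d)$/
        or die("Unexpected offset: $offset");
    my $seconds = ($hours * 60 + $minutes) * 60;
    return $sign eq '-' ? -$seconds : $seconds;
}

print offset_to_seconds("-0800"), "\n";   # prints -28800
</PRE>
</BLOCKQUOTE>
<P>
Subtracting this value from the logged time gives the time in GMT. For
the example line, 06:56:12 with an offset of -0800 corresponds to
14:56:12 GMT.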
<P>
Web servers can have many different types of log files. For example,
you might see a proxy access log or an error log. In this chapter,
we'll focus on the access log, where the Web server tracks every
access to your Web site.
<H3><A NAME="ExampleReadingaLogFile">
Example: Reading a Log File</A></H3>
<P>
In this section you see a Perl script that can open a log file
and iterate over the lines of the log file. It is usually unwise
to read entire log files into memory because they can get quite
large. A friend of mine has a log file that is over 113 Megabytes!
<P>
Regardless of the way that you'd like to process the data, you
must open a log file and read it. You can read the entry into
one variable for processing, or you can split the entry into its
components. To read each line into a single variable, use the
following code sample:
<BLOCKQUOTE>
<PRE>$LOGFILE = "access.log";
open(LOGFILE, "&lt;", $LOGFILE) or die("Could not open log file: $!");
while ($line = &lt;LOGFILE&gt;) {    # read one line at a time.
    chomp($line);    # remove the newline from $line.
    # do line-by-line processing.
}
close(LOGFILE);
</PRE>
</BLOCKQUOTE>
<p>
<CENTER>
<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Note</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
If you don't have your own server logs, you can use the file <TT>server.log</TT> that is included on the CD-ROM that accompanies this book.
</BLOCKQUOTE>
</TD></TR>
</TABLE>
</CENTER>
<P>
<P>
The code snippet will open the log file for reading and will access
the file one line at a time, loading the line into the <TT>$line</TT>
variable. This type of processing is pretty limiting because you
need to deal with the entire log entry at once.
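<P>
Whole-line processing is still enough for simple questions, such as
how many entries mention a particular piece of text. The sketch below
is one hedged illustration: the subroutine name <TT>count_matching</TT>
is made up, and the two sample entries are held in an array (the second
one is reconstructed from values shown in this chapter) so that the
code runs without a log file.
<BLOCKQUOTE>
<PRE>use strict;
use warnings;

# Count the entries that contain $text anywhere in the line.
# count_matching is a hypothetical helper used only for this sketch.
sub count_matching {
    my ($text, @lines) = @_;
    return scalar grep { index($_, $text) >= 0 } @lines;
}

my @sample = (
    '204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] "GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593',
    'ros.algonet.se - - [09/Aug/1996:08:30:52 -0500] "GET /~jltinche/songs/rib_supp.gif HTTP/1.0" 200 1543',
);

print count_matching('.gif', @sample), "\n";   # prints 1
</PRE>
</BLOCKQUOTE>
<P>
A plain text search like this cannot tell which field the text was
found in, which is the limitation that splitting the entry addresses.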
<P>
A more popular way to read the log file is to split the contents
of the entry into different variables. For example, Listing 21.1
uses the <TT>split()</TT> function
and some processing to assign values to 11 variables:
<P>
<IMG SRC="pseudo.gif" tppabs="http://cheminf.nankai.edu.cn/~eb~/Perl%205%20By%20Example/pseudo.gif" BORDER=1 ALIGN=RIGHT><p>
<BLOCKQUOTE>
<I>Turn on the </I><TT><I>warning</I></TT><I>
option.<BR>
</I>Initialize <TT>$LOGFILE</TT> with
the full path and name of the access log.<BR>
Open the log file.<BR>
Iterate over the lines of the log file. Each line gets placed,
in turn, into <TT>$line</TT>.<BR>
Split <TT>$line</TT> using the space
character as the delimiter.<BR>
Get the time value from the <TT>$date</TT>
variable.<BR>
Extract the date value from the <TT>$date</TT>
variable, leaving out the time value and the '[' character.<BR>
Remove the '"' character from the beginning of the request
value.<BR>
Remove the end square bracket from the <TT>gmt</TT>
offset value.<BR>
Remove the end quote from the protocol value.<BR>
Close the log file.
</BLOCKQUOTE>
<HR>
<P>
<B>Listing 21.1 21LST01.PL-Read the Access Log and
Parse Each Entry<BR>
</B>
<BLOCKQUOTE>
<PRE>#!/usr/bin/perl -w
$LOGFILE = "access.log";
open(LOGFILE, "&lt;", $LOGFILE) or die("Could not open log file: $!");
while ($line = &lt;LOGFILE&gt;) {
    ($site, $logName, $fullName, $date, $gmt,
     $req, $file, $proto, $status, $length) = split(' ', $line);
    $time = substr($date, 13);       # time follows "[dd/Mon/yyyy:"
    $date = substr($date, 1, 11);    # skip the '[' and keep dd/Mon/yyyy
    $req  = substr($req, 1);         # drop the leading double quote
    chop($gmt);                      # drop the trailing ']'
    chop($proto);                    # drop the trailing double quote
    # do line-by-line processing.
}
close(LOGFILE);
</PRE>
</BLOCKQUOTE>
<HR>
<P>
If you print out the variables, you might get a display like this:
<BR>
<BLOCKQUOTE>
<PRE>$site = ros.algonet.se
$logName = -
$fullName = -
$date = 09/Aug/1996
$time = 08:30:52
$gmt = -0500
$req = GET
$file = /~jltinche/songs/rib_supp.gif
$proto = HTTP/1.0
$status = 200
$length = 1543
</PRE>
</BLOCKQUOTE>
<P>
You can see that after the split is done, further manipulation
is needed in order to "clean up" the values inside the
variables. At the very least, the square brackets and the double quotes
need to be removed.
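<P>
The split and the cleanup steps can be gathered into one subroutine
that returns the fields as a hash. This is only a sketch of the
split-based approach: the name <TT>parse_entry</TT> and the hash keys
are chosen for this illustration, and the code assumes that every
entry has exactly the ten space-separated pieces shown in the syntax
line.
<BLOCKQUOTE>
<PRE>use strict;
use warnings;

# Split one access log entry on whitespace and clean up each field.
# parse_entry is a hypothetical helper used only for this sketch.
sub parse_entry {
    my ($line) = @_;
    my ($site, $logName, $fullName, $date, $gmt,
        $req, $file, $proto, $status, $length) = split(' ', $line);
    return (
        site     => $site,
        logName  => $logName,
        fullName => $fullName,
        date     => substr($date, 1, 11),   # skip '[' and keep dd/Mon/yyyy
        time     => substr($date, 13),      # everything after the ':'
        gmt      => substr($gmt, 0, -1),    # drop the trailing ']'
        req      => substr($req, 1),        # drop the leading quote
        file     => $file,
        proto    => substr($proto, 0, -1),  # drop the trailing quote
        status   => $status,
        length   => $length,
    );
}

my %entry = parse_entry('204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] "GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593');
print "$entry{date} $entry{time} $entry{file}\n";
# prints 03/Jul/1996 06:56:12 /PowerBuilder/Compny3.htm
</PRE>
</BLOCKQUOTE>
<P>
Because the entry is split on spaces, a requested file name that
itself contains a space would shift the later fields, so the result
should be checked before it is trusted.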
<P>
I prefer to use a regular expression to extract the information
from the log file entries. If you are comfortable with regular
expressions, I feel that this approach is more straightforward than the others.
Listing 21.2 shows a program that uses a regular expression to
determine the 11 items in the log entries.
<P>
<IMG SRC="pseudo.gif" tppabs="http://cheminf.nankai.edu.cn/~eb~/Perl%205%20By%20Example/pseudo.gif" BORDER=1 ALIGN=RIGHT><p>
<BLOCKQUOTE>
<I>Turn on the </I><TT><I>warning</I></TT><I>