<HTML>

<HEAD>

<TITLE>Chapter 21  -- Using Perl with Web Servers</TITLE>



<META>

</HEAD>

<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000EE" VLINK="#551A8B" ALINK="#CE2910">

<H1><FONT SIZE=6 COLOR=#FF0000>Chapter&nbsp;21</FONT></H1>

<H1><FONT SIZE=6 COLOR=#FF0000>Using Perl with Web Servers</FONT>

</H1>

<HR>

<P>

<CENTER><B><FONT SIZE=5>CONTENTS</FONT></B></CENTER>

<UL>

<LI><A HREF="#ServerLogFiles">

Server Log Files</A>

<UL>

<LI><A HREF="#ExampleReadingaLogFile">

Example: Reading a Log File</A>

<LI><A HREF="#ExampleListingAccessbyDocument">

Example: Listing Access by Document</A>

<LI><A HREF="#ExampleLookingattheStatusCode">

Example: Looking at the Status Code</A>

<LI><A HREF="#ExampleConvertingtheReporttoaWebPage">

Example: Converting the Report to a Web Page</A>

<LI><A HREF="#ExistingLogFileAnalyzingPrograms">

Existing Log File Analyzing Programs</A>

<LI><A HREF="#CreatingYourOwnCGILogFile">

Creating Your Own CGI Log File</A>

</UL>

<LI><A HREF="#CommunicatingwithUsers">

Communicating with Users</A>

<UL>

<LI><A HREF="#ExampleGeneratingaWhatsNewPage">

Example: Generating a What's New Page</A>

<LI><A HREF="#ExampleGettingUserFeedback">

Example: Getting User Feedback</A>

</UL>

<LI><A HREF="#Summary">

Summary</A>

<LI><A HREF="#ReviewExercises">

Review Exercises</A>

</UL>



<HR>

<P>

Web servers frequently need some type of maintenance in order

to operate at peak efficiency. This chapter will look at some

maintenance tasks that can be performed by Perl programs. You

will see some ways that your server keeps track of who visits

and what Web pages are accessed on your site. You will also see

some ways to automatically generate a site index, a what's new

document, and user feedback about a Web page.

<H2><A NAME="ServerLogFiles"><FONT SIZE=5 COLOR=#FF0000>

Server Log Files</FONT></A></H2>

<P>

The most useful tool to assist in understanding how and when your

Web site pages and applications are being accessed is the log

file generated by your Web server. This log file contains, among

other things, which pages are being accessed, by whom, and when.

<P>

Each Web server provides some form of log file that records

who requested each HTML page or graphic, and when. For an overall

comparison of the major Web servers, see

<B>http://www.webcompare.com/</B>. That site shows

which Web servers follow the CERN/NCSA common log

format that is detailed below. You can also find

out which servers can customize log files or write to multiple

log files. You might also be surprised at the number of Web servers

there are on the market.

<P>

Understanding the contents of the server log files is a worthwhile

endeavor. And in this section, you'll see several ways that the

information in the log files can be manipulated. However, if you're

like most people, you'll use one of the log file analyzers that

you'll read about in the section &quot;Existing Log File Analyzing

Programs&quot; to do most of your work. After all, you don't want

to create a program that others are giving away for free.<BR>

<p>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR><TD><B>Note </B></TD></TR>

<TR><TD>

<BLOCKQUOTE>

This section about server log files is one that you can read when the need arises. If you are not actively running a Web server now, you won't be able to get full value from the examples. The CD-ROM that accompanies this book has a sample log file for you to experiment on, but it is very limited in size and scope.</BLOCKQUOTE>



</TD></TR>

</TABLE>

</CENTER>

<P>

<P>

Nearly all of the major Web servers use a common format for their

log files. These log files contain information such as the IP

address of the remote host, the document that was requested, and

a timestamp. The syntax for each line of a log file is:

<BLOCKQUOTE>

<PRE>

site logName fullName [date:time GMToffset] &quot;req file proto&quot; status length

</PRE>

</BLOCKQUOTE>

<P>

Because that line of syntax is relatively meaningless, here is

a line from a real log file:

<BLOCKQUOTE>

<PRE>

204.31.113.138 - - [03/Jul/1996:06:56:12 -0800]

    &quot;GET /PowerBuilder/Compny3.htm HTTP/1.0&quot; 200 5593

</PRE>

</BLOCKQUOTE>



<P>

Even though I have split the line into two, you need to remember

that inside the log file it really is only one line.

<P>

Each of the eleven items listed in the above syntax and example

is described in the following list.

<UL>

<LI><B>site</B>-either an IP address or the symbolic name of the

site making the HTTP request. In the example line the site

is <TT>204.31.113.138</TT>.

<LI><B>logName</B>-login name of the user who owns the account

that is making the HTTP request. Most remote sites don't give

out this information for security reasons. If this field is disabled

by the host, you see a dash (<TT>-</TT>)

instead of the login name.

<LI><B>fullName</B>-full name of the user who owns the account

that is making the HTTP request. Most remote sites don't give

out this information for security reasons. If this field is disabled

by the host, you see a dash (<TT>-</TT>)

instead of the full name. If your server requires a user id in

order to fulfill an HTTP request, the user id will be placed in

this field.

<LI><B>date</B>-date of the HTTP request. In the example line

the date is <TT>03/Jul/1996</TT>.

<LI><B>time</B>-time of the HTTP request. The time will be presented

in 24-hour format. In the example line the time is <TT>06:56:12</TT>.

<LI><B>GMToffset</B>-signed offset from Greenwich Mean Time. GMT

is the international time reference. In the example line the offset

is <TT>-0800</TT>, eight hours earlier than GMT.

<LI><B>req</B>-HTTP command. For ordinary page requests this field

is usually <TT>GET</TT>, although you may also see other commands

such as <TT>HEAD</TT> and <TT>POST</TT>. In the example line the

request is <TT>GET</TT>.

<LI><B>file</B>-path and filename of the requested file. In the

example line the file is <TT>/PowerBuilder/Compny3.htm</TT>.

There are three types of path/filename combinations:

</UL>

<BLOCKQUOTE>

<B>Implied Path and Filename</B>-accesses a file in a user's home

directory. For example, <TT>/~foo/</TT>

could be expanded into <TT>/user/foo/homepage.html</TT>.

The <TT>/user/foo</TT> directory is

the home directory for the user <TT>foo</TT>,

and <TT>homepage.html</TT> is the

default file name for any user's home page. Implied paths are

hard to analyze because you need to know how the server is set

up and because the server's setup may change.

</BLOCKQUOTE>

<BLOCKQUOTE>

<B>Relative Path and Filename</B>-accesses a file in a directory

that is specified relative to a user's home directory. For example,

<TT>/~foo/cooking.html</TT> will be

expanded into <TT>/user/foo/cooking.html</TT>.

</BLOCKQUOTE>

<BLOCKQUOTE>

<B>Full Path and Filename</B>-accesses a file by explicitly stating

the full directory and filename. For example, <TT>/user/foo/biking/mountain/index.html</TT>.

</BLOCKQUOTE>

<UL>

<LI><B>proto</B>-type of protocol used for the request. In the

example line the protocol is <TT>HTTP/1.0</TT>.

<LI><B>status</B>-status code generated by the request. In the

example line the status is <TT>200</TT>.

See section &quot;Example: Looking at the Status Code&quot; later

in the chapter for more information.

<LI><B>length</B>-length of the requested document in bytes. In the example

line the length is <TT>5593</TT>.

</UL>
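<P>
The date, time, and GMToffset items together identify a single instant. As a minimal sketch of how they combine, the following converts the timestamp from the example line into GMT. The <TT>log_time_to_epoch</TT> helper is a hypothetical name used only for illustration, and the sketch assumes that the standard <TT>Time::Local</TT> module is installed.
<BLOCKQUOTE>
<PRE>
#!/usr/bin/perl -w
use strict;
use Time::Local qw(timegm);

# Map month abbreviations to the 0-based numbers that timegm() expects.
my %month;
@month{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Hypothetical helper: turn 'dd/Mon/yyyy:hh:mm:ss +hhmm' into epoch seconds.
# Returns nothing if the text does not have that shape.
sub log_time_to_epoch {
    my ($stamp) = @_;
    my ($day, $mon, $year, $hour, $min, $sec, $sign, $offHour, $offMin) =
        $stamp =~ m{^(\d\d)/(\w{3})/(\d{4}):(\d\d):(\d\d):(\d\d) ([+-])(\d\d)(\d\d)$}
        or return;
    return unless exists $month{$mon};

    # Treat the logged wall-clock time as if it were GMT ...
    my $asGmt = timegm($sec, $min, $hour, $day, $month{$mon}, $year);

    # ... then subtract the signed offset to get the real GMT instant.
    my $offset = ($offHour * 60 + $offMin) * 60;
    $offset = -$offset if $sign eq '-';
    return $asGmt - $offset;
}

my $epoch = log_time_to_epoch('03/Jul/1996:06:56:12 -0800');
my ($sec, $min, $hour) = gmtime($epoch);
printf('%02d:%02d:%02d GMT', $hour, $min, $sec);
print &quot;\n&quot;;
</PRE>
</BLOCKQUOTE>
<P>
For the example entry this prints <TT>14:56:12 GMT</TT>, because a local time of <TT>06:56:12</TT> with an offset of <TT>-0800</TT> is eight hours behind GMT. Converting every entry to one time reference like this is only needed if you compare entries that were logged with different offsets.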

<P>

Web servers can have many different types of log files. For example,

you might see a proxy access log, or an error log. In this chapter,

we'll focus on the access log-where the Web server tracks every

access to your Web site.

<H3><A NAME="ExampleReadingaLogFile">

Example: Reading a Log File</A></H3>

<P>

In this section you see a Perl script that can open a log file

and iterate over the lines of the log file. It is usually unwise

to read entire log files into memory because they can get quite

large. A friend of mine has a log file that is over 113 Megabytes!

<P>

Regardless of the way that you'd like to process the data, you

must open a log file and read it. You can read the entry into

one variable for processing, or you can split the entry into its

components. To read each line into a single variable, use the

following code sample:

<BLOCKQUOTE>

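<PRE>$LOGFILE = &quot;access.log&quot;;
open(LOGFILE, &quot;&lt;&quot;, $LOGFILE) or die(&quot;Could not open $LOGFILE: $!&quot;);

# Read one line per iteration so the whole file is never held in memory.
while (defined($line = &lt;LOGFILE&gt;)) {
    chomp($line);              # remove the newline from $line.
    # do line-by-line processing.
}
close(LOGFILE);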

</PRE>

</BLOCKQUOTE>

<p>

<CENTER>

<TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>

<TR><TD><B>Note</B></TD></TR>

<TR><TD>

<BLOCKQUOTE>

If you don't have your own server logs, you can use the file <TT>server.log</TT> that is included on the CD-ROM that accompanies this book.

</BLOCKQUOTE>



</TD></TR>

</TABLE>

</CENTER>

<P>

<P>

The code snippet opens the log file for reading and accesses

the file one line at a time, loading each line into the <TT>$line</TT>

variable. This type of processing is pretty limiting because you

need to deal with the entire log entry at once.
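<P>
Whole-line processing is still enough for simple tallies. The sketch below counts how many entries ask for an image by matching one pattern against each unsplit line. The <TT>count_images</TT> helper, the pattern, and two of the three sample entries are made up for illustration. Only the first entry comes from this chapter. The sketch reads from a string through an in-memory filehandle so that it runs without a real log file, which assumes Perl 5.8 or later.
<BLOCKQUOTE>
<PRE>
#!/usr/bin/perl -w
use strict;

# The first entry is the example line from this chapter; the other two
# are made-up entries (hypothetical hosts and files) for illustration.
my $sample = join(&quot;\n&quot;,
    '204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] &quot;GET /PowerBuilder/Compny3.htm HTTP/1.0&quot; 200 5593',
    'host.example.com - - [03/Jul/1996:06:57:40 -0800] &quot;GET /images/logo.gif HTTP/1.0&quot; 200 2048',
    'host.example.com - - [03/Jul/1996:06:58:02 -0800] &quot;GET /missing.htm HTTP/1.0&quot; 404 0',
) . &quot;\n&quot;;

# Hypothetical helper: count all entries and the entries that ask for an image.
sub count_images {
    my ($text) = @_;
    my ($total, $images) = (0, 0);

    # Open the string as if it were a file (in-memory filehandle).
    open(my $log, '&lt;', \$text) or die(&quot;Could not open in-memory log: $!&quot;);
    while (defined(my $line = &lt;$log&gt;)) {
        chomp($line);
        $total++;
        # One pattern match against the whole, unsplit entry.
        $images++ if $line =~ m{\.(?:gif|jpg) HTTP/}i;
    }
    close($log);
    return ($total, $images);
}

my ($total, $images) = count_images($sample);
print &quot;$images of $total requests were for images\n&quot;;
</PRE>
</BLOCKQUOTE>
<P>
With the three sample entries this prints <TT>1 of 3 requests were for images</TT>. Anything that depends on a single item, such as the status code, is awkward to get at this way, which is why the entry is usually split into separate variables.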

<P>

A more popular way to read the log file is to split the contents

of the entry into different variables. For example, Listing 21.1

uses the <TT>split()</TT> function

and some extra processing to assign values to 11 variables:

<P>

<IMG SRC="pseudo.gif" tppabs="http://cheminf.nankai.edu.cn/~eb~/Perl%205%20By%20Example/pseudo.gif" BORDER=1 ALIGN=RIGHT><p>

<BLOCKQUOTE>

<I>Turn on the </I><TT><I>warning</I></TT><I>

option.<BR>

</I>Initialize <TT>$LOGFILE</TT> with

the full path and name of the access log.<BR>

Open the log file.<BR>

Iterate over the lines of the log file. Each line gets placed,

in turn, into <TT>$line</TT>.<BR>

Split <TT>$line</TT> using the space

character as the delimiter.<BR>

Get the time value from the <TT>$date</TT>

variable.<BR>

Extract the date value from the <TT>$date</TT>

variable, leaving out the time

value and the '[' character.<BR>

Remove the '&quot;' character from the beginning of the request

value.<BR>

Remove the end square bracket from the <TT>gmt</TT>

offset value.<BR>

Remove the end quote from the protocol value.<BR>

Close the log file.

</BLOCKQUOTE>

<HR>

<P>

<B>Listing 21.1&nbsp;&nbsp;21LST01.PL-Read the Access Log and

Parse Each Entry<BR>

</B>

<BLOCKQUOTE>

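<PRE>#!/usr/bin/perl -w

$LOGFILE = &quot;access.log&quot;;
open(LOGFILE, &quot;&lt;&quot;, $LOGFILE) or die(&quot;Could not open $LOGFILE: $!&quot;);

# Read one entry at a time instead of loading the whole log into memory.
while (defined($line = &lt;LOGFILE&gt;)) {
    # Splitting on ' ' breaks the entry at each run of whitespace.
    ($site, $logName, $fullName, $date, $gmt,
         $req, $file, $proto, $status, $length) = split(' ', $line);
    $time = substr($date, 13);      # the part after &quot;[dd/Mon/yyyy:&quot;
    $date = substr($date, 1, 11);   # skip the '[' and keep dd/Mon/yyyy
    $req  = substr($req, 1);        # drop the leading double quote
    chop($gmt);                     # drop the trailing ']'
    chop($proto);                   # drop the trailing double quote
    # do line-by-line processing.
}
close(LOGFILE);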

</PRE>

</BLOCKQUOTE>

<HR>

<P>

If you print out the variables, you might get a display like this:

<BR>

<BLOCKQUOTE>

<PRE>$site     = ros.algonet.se

$logName  = -

$fullName = -

$date     = 09/Aug/1996

$time     = 08:30:52

$gmt      = -0500

$req      = GET

$file     = /~jltinche/songs/rib_supp.gif

$proto    = HTTP/1.0

$status   = 200

$length   = 1543

</PRE>

</BLOCKQUOTE>

<P>

You can see that after the split is done, further manipulation

is needed in order to &quot;clean up&quot; the values inside the

variables. At the very least, the square brackets and the double-quotes

need to be removed.
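<P>
To see exactly what needs cleaning up, it can help to print the raw pieces that <TT>split()</TT> returns for one entry. The sketch below does that for the example line shown earlier in this chapter. The <TT>split_entry</TT> helper is a hypothetical name, and its check for exactly 10 pieces is an added safeguard rather than part of Listing 21.1: an entry with an unexpected number of space-separated pieces would otherwise shift the later values into the wrong variables.
<BLOCKQUOTE>
<PRE>
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: split one entry on whitespace and reject entries
# that do not have exactly the 10 pieces Listing 21.1 expects.
sub split_entry {
    my ($line) = @_;
    my @parts = split(' ', $line);
    return unless @parts == 10;
    return @parts;
}

my $entry = '204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] '
          . '&quot;GET /PowerBuilder/Compny3.htm HTTP/1.0&quot; 200 5593';

# Print each raw piece inside parentheses so stray characters stand out.
my @parts = split_entry($entry);
print &quot;($_)\n&quot; foreach @parts;
</PRE>
</BLOCKQUOTE>
<P>
In the output, the fourth piece still begins with <TT>[</TT>, the fifth still ends with <TT>]</TT>, and the sixth and eighth still carry the double-quotes from the request. Those are the characters that the <TT>substr()</TT> and <TT>chop()</TT> calls remove.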

<P>

I prefer to use a regular expression to extract the information

from the log file entries. I feel that this approach is more straightforward

than the others, assuming that you are comfortable with regular expressions.

Listing 21.2 shows a program that uses a regular expression to

determine the 11 items in the log entries.

<P>

<IMG SRC="pseudo.gif" tppabs="http://cheminf.nankai.edu.cn/~eb~/Perl%205%20By%20Example/pseudo.gif" BORDER=1 ALIGN=RIGHT><p>

<BLOCKQUOTE>

<I>Turn on the </I><TT><I>warning</I></TT><I>