<HTML>

<HEAD>
   <TITLE>Chapter 21 -- Tracking users</TITLE>
</HEAD>
<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000EE" VLINK="#551A8B" ALINK="#CE2910">
<H1><FONT COLOR=#FF0000>Chapter 21</FONT></H1>
<H1><B><FONT SIZE=5 COLOR=#FF0000>Tracking Users</FONT></B>
</H1>
<P>
<HR WIDTH="100%"></P>
<P>
<H3 ALIGN=CENTER><FONT COLOR="#000000"><FONT SIZE=+2>CONTENTS<A NAME="CONTENTS"></A>
</FONT></FONT></H3>


<UL>
<LI><A HREF="#WhyDoWeNeedtoTrackUsers" >Why Do We Need to Track Users?</A>
<LI><A HREF="#TheEssenceofWebMarketing" >The Essence of Web Marketing</A>
<LI><A HREF="#ParsingAccessLogs" >Parsing Access Logs</A>
<UL>
<LI><A HREF="#WhatIsanAccessLog" >What Is an Access Log?</A>
</UL>
<LI><A HREF="#EnvironmentVariables" >Environment Variables</A>
<LI><A HREF="#CreatingaPseudoAccessLogFile" >Creating a Pseudo Access Log File</A>
<LI><A HREF="#LoggingAccesses" >Logging Accesses</A>
<LI><A HREF="#HowtoImplementTrackingCGIs" >How to Implement Tracking CGIs</A>
<UL>
<LI><A HREF="#indexcgi" >index.cgi</A>
<LI><A HREF="#indexshtml" >index.shtml</A>
<LI><A HREF="#IncludingCGIsinImages" >Including CGIs in Images</A>
</UL>
<LI><A HREF="#ASimpleWebCounter" >A Simple Web Counter</A>
<LI><A HREF="#Callingcountercgi" >Calling counter.cgi</A>
<LI><A HREF="#LocatingUsersGeographically" >Locating Users Geographically</A>
<UL>
<LI><A HREF="#DiscussionofFeasibility" >Discussion of Feasibility</A>
<LI><A HREF="#IntroductiontoNSLOOKUPandWHOIS" >Introduction to NSLOOKUP and WHOIS</A>
<LI><A HREF="#LimitationsofTrackingUsersThroughIP" >Limitations of Tracking Users Through IP Addresses</A>
</UL>
<LI><A HREF="#Cookies" >Cookies</A>
<LI><A HREF="#OtherMethodsofTrackingUsers" >Other Methods of Tracking Users</A>
<UL>
<LI><A HREF="#FingeringDialUpServers" >Fingering Dial-Up Servers</A>
</UL>
<LI><A HREF="#TheEthicsofTrackingUsers" >The Ethics of Tracking Users</A>
<LI><A HREF="#AccessingThisChapterOnline" >Accessing This Chapter Online</A>
<LI><A HREF="#Summary" >Summary</A>
</UL>
<HR>
<P>
There are several different methods you can use to track users.
They are
<UL>
<LI><FONT COLOR=#000000>Parsing Access Logs: How to get at the
information your Web server may already have.</FONT>
<LI><FONT COLOR=#000000>Environment Variables: The information
your browser is sending, without your knowledge.</FONT>
<LI><FONT COLOR=#000000>Web Counters: The odometers you may have
seen on some sites and how to make your own.</FONT>
<LI><FONT COLOR=#000000>Logging Accesses: A more sophisticated
means of counting users.</FONT>
<LI><FONT COLOR=#000000>Locating Users Geographically: Where exactly
is your audience located?</FONT>
<LI><FONT COLOR=#000000>Cookies: A client/server method of definite
user verification.</FONT>
</UL>
<H2><A NAME="WhyDoWeNeedtoTrackUsers"><FONT SIZE=5 COLOR=#FF0000>Why
Do We Need to Track Users?</FONT></A></H2>
<P>
It's easy enough to set up a World Wide Web site for yourself
or your organization and gauge its success or failure solely on
the amount of response you get via e-mail, phone, or fax. But
if you rely on so simplistic a tracking mechanism, you won't get
anywhere near the whole picture. Perhaps your site is attracting
many visitors, but your response form is hard to find, so very
few of them are getting in touch with you. Perhaps many people
find your Web site via unrelated searches on Internet search engines
and promptly leave. Or perhaps you've optimized your site for
Netscape, but the people most interested in your content are using
NCSA Mosaic and can't view any of your in-line images! In any
of these cases, you could spend a long time waiting for user responses
while being totally in the dark about why you weren't getting
any responses.
<P>
This illustrates why it's so important to track user information
on a constant basis. You can gain valuable insights not only into
who is accessing your site, but also how they're finding it and
where they might have heard of you. Plus, there's the all-important
question of the total number of users visiting your site.
<P>
<CENTER><TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>How Search Engines Work</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
Search engines such as Alta Vista, WebCrawler, InfoSeek, Lycos, and Excite possess vast databases of information, cataloging much of the content on the World Wide Web. Not only is the creation of such a huge database a task more difficult than any group of 
people could manually accomplish, it's also necessary to update all of the information on an increasingly frequent basis. Thus, the creators of these services designed automatic &quot;robots&quot; that roam the Web and retrieve Web site information for 
inclusion in the database. While this deals with the speed problem quite nicely, there is a serious problem introduced by this automatic approach: Machines, even ones with so-called artificial intelligence software, are still nowhere near as good as humans 
at categorizing information (well, at least not into categories that make sense to humans!). When a search engine's robot visits a site, it incorporates all of the text on that site into its database for reference in subsequent user searches. This means 
that a word inadvertently placed in the text of your Web site can cause people to find your site via searches on that word, thinking that your site might have something to do with that word! Suppose that you've set up a Web site about gardening, and in it 
you include a personal anecdote about how much your dog loves being outdoors with you. Thousands of dog-lovers might find your site because of that reference to your dog, be surprised that the site is about gardening and not dogs, and promptly leave! There 
are many other problems associated with the way automatic search engines work, which you'll no doubt discover when your site is added to them.</BLOCKQUOTE>

</TD></TR>
</TABLE></CENTER>
<P>
<H2><A NAME="TheEssenceofWebMarketing"><FONT SIZE=5 COLOR=#FF0000>The
Essence of Web Marketing</FONT></A></H2>
<P>
With the incredible corporate interest in the World Wide Web in
the past few years, tracking users helps us get closer to an answer
to the most crucial question for most organizations getting on
the Web: Does the Web really work? In other words, does their
Web site attract visitors, and if so, do those visitors turn into
customers? In other media, hard numbers are available as answers
to these questions. Newspapers have circulation figures, radio
has broadcast ranges, and television has Nielsen ratings. It's
surprising how many Web sites leave their access levels unmonitored,
given that more precise visitor information can be gathered on the
Internet than through any other medium.
<P>
There is one key advantage these other media have over the Web,
however: access to demographic information. The reason that accurate
demographics (for example, the makeup of the audience by age,
sex, income, and so on) are much more readily available for these
traditional media is because the level of market penetration is
such that a representative sampling of the general population
in that area can be extrapolated meaningfully to apply to your
whole audience. With the Web, you have several problems in doing
this:
<UL>
<LI><FONT COLOR=#000000>Because people self-select their visit
to your site, you can reach a </FONT><I>very</I> specialized audience,
and a sampling of the general population would be completely inaccurate.
<LI><FONT COLOR=#000000>The international reach of the Web means
that you could be attracting visitors from all over the world,
making it much harder to do a survey.</FONT>
</UL>
<P>
Both of these problems mean that the only way you could get accurate
demographics would be while people are actually visiting your
Web site. This can come across as somewhat obtrusive, and people
accustomed to browsing through Web sites at high speed with little
or no thought involved have to be given a very good incentive
to spend the time to fill out a survey form for your benefit.
<P>
This means that it's all the more crucial to identify whatever
hard numbers you can automatically, and this is where the idea
of tracking users comes in.
<H2><A NAME="ParsingAccessLogs"><FONT SIZE=5 COLOR=#FF0000>Parsing
Access Logs</FONT></A></H2>
<P>
This section deals with one of the fundamental methods of collecting
demographic information about visitors to your Web site: the access
log.
<H3><A NAME="WhatIsanAccessLog">What Is an Access Log?</A></H3>
<P>
So where do we begin when trying to find out information about
visitors to our site? How about on our Web server itself! It's
mentioned earlier in the book that <I>HTTP</I>, the <I>HyperText
Transfer Protocol</I>, enables communication between your browser
and the Web server via a series of discrete connections that fetch
the text of the Web page being retrieved, and then each one of the
graphics on that page in sequence. Did you know that every single
time one of these requests is made, a record of that request is
written to a log file? Here is a sample of the contents of an access
log, from the file access-log, produced by NCSA httpd.
<BLOCKQUOTE>
<TT><FONT FACE="Courier">&nbsp;&nbsp;&nbsp;&nbsp;ts17-15.slip.uwo.ca
- - [09/Jul/1996:01:53:53 -0500] <BR>
&quot;POST /cgiunleashed/shopping/cart.cgi HTTP/1.0&quot; 200
1519<BR>
&nbsp;&nbsp;&nbsp;&nbsp;ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:22
-0500] <BR>
&quot;POST /cgiunleashed/shopping/cart.cgi HTTP/1.0&quot; 200
1954<BR>
&nbsp;&nbsp;&nbsp;&nbsp;ts17-15.slip.uwo.ca - - [09/Jul/1996:01:54:43
-0500] <BR>
&quot;POST /cgiunleashed/shopping/cart.cgi HTTP/1.0&quot; 200
1678<BR>
&nbsp;&nbsp;&nbsp;&nbsp;pm107.spots.ab.ca - - [09/Jul/1996:01:59:28
-0500] &quot;GET /pics/asd.gif HTTP/1.0&quot; &Acirc;304 0<BR>
&nbsp;&nbsp;&nbsp;&nbsp;b61022.dial.tip.net - - [09/Jul/1996:02:03:36
-0500] &quot;GET /pics/asd.gif HTTP/&Acirc;1.0&quot; 200 4117
<BR>
slip11.docker.com - - [09/Jul/1996:02:03:49 -0500] &quot;GET /rcr/
HTTP/1.0&quot; 200 8751<BR>
&nbsp;&nbsp;&nbsp;&nbsp;slip11.docker.com - - [09/Jul/1996:02:04:17
-0500] &quot;GET /rcr/guest.html HTTP/&Acirc;1.0&quot; 200 2984
<BR>
&nbsp;&nbsp;&nbsp;&nbsp;slip11.docker.com - - [09/Jul/1996:02:05:01
-0500] &quot;GET /rcr/store.html HTTP/&Acirc;1.0&quot; 200 34717
<BR>
&nbsp;&nbsp;&nbsp;&nbsp;port52.annex1.net.ubc.ca - - [09/Jul/1996:02:05:09
-0500] &quot;GET /pics/asd.gif &Acirc;HTTP/1.0&quot; 200 4117
<BR>
&nbsp;&nbsp;&nbsp;&nbsp;slip11.docker.com - - [09/Jul/1996:02:06:01
-0500] &quot;GET /rcr/regint.html HTTP/&Acirc;1.0&quot; 200 19452</FONT></TT>
</BLOCKQUOTE>
<P>
NCSA, CERN, and Apache httpd all produce access logs in very similar
formats, and collectively they have the vast majority of Web server
market share, so this section will deal with extracting information
from those servers. Other Web servers may store information in
a different format, and you should consult the documentation that
comes with yours to learn how to read it.
<P>
<CENTER><TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Note</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
You may have heard of the HTTP keep-alive protocol, which allows for a continuous connection to be maintained between the Web server and the Web browser. This doesn't contradict the nature of the discrete connections in HTTP; there are still multiple 
fetches made from the Web server. The difference is that the connection isn't terminated and restarted between each one while retrieving information on the same Web page.</BLOCKQUOTE>

</TD></TR>
</TABLE></CENTER>
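<P>
For illustration only (a simplified sketch, not a complete HTTP
exchange), the difference is visible in a single request header. A
browser that supports keep-alive sends something like this:
<PRE>
GET /rcr/ HTTP/1.0
Connection: Keep-Alive
</PRE>
<P>
If the server agrees, it replies with a <TT>Connection: Keep-Alive</TT>
header of its own and includes a <TT>Content-Length</TT> header, so
the browser knows where the response ends and can reuse the same
connection for its next request (for example, for the page's in-line
images). Each fetch is still logged separately.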
<P>
Now, let's take a look at some of the information that is provided
in the access log. The lines all take on a standard format, and,
in fact, the entire access log consists of nothing but lines like
these. The format of the lines is as follows:
<BLOCKQUOTE>
<TT><FONT FACE="Courier">host rfc931 authuser [DD/Mon/YYYY:hh:mm:ss]
&quot;request&quot; ddd bbbb &quot;opt_referer&quot; &Acirc;&quot;opt_agent&quot;</FONT></TT>
</BLOCKQUOTE>
<P>
Here's a breakdown of the elements included in the lines:
<UL>
<LI><TT><FONT FACE="Courier">host</FONT></TT>: Either the DNS
name or the IP number of the remote client.
<LI><TT><FONT FACE="Courier">rfc931</FONT></TT>: Any information
returned by <TT><FONT FACE="Courier">identd</FONT></TT> for this
person, or a dash (<TT><FONT FACE="Courier">-</FONT></TT>) otherwise.
<LI><TT><FONT FACE="Courier">authuser</FONT></TT>: If user sent
a userid for authentication, the username, or a dash otherwise.
<LI><TT><FONT FACE="Courier">DD</FONT></TT>: Day.
<LI><TT><FONT FACE="Courier">Mon</FONT></TT>: Month (calendar
name).
<LI><TT><FONT FACE="Courier">YYYY</FONT></TT>: Year.
<LI><TT><FONT FACE="Courier">hh</FONT></TT>: Hour (24-hour format,
the machine's timezone).
<LI><TT><FONT FACE="Courier">mm</FONT></TT>: Minutes.
<LI><TT><FONT FACE="Courier">ss</FONT></TT>: Seconds.
<LI><TT><FONT FACE="Courier">request</FONT></TT>: The first line
of the HTTP request as sent by the client.
<LI><TT><FONT FACE="Courier">ddd</FONT></TT>: The status code
returned by the server, or a dash if not available.
<LI><TT><FONT FACE="Courier">bbbb</FONT></TT>: The total number
of bytes sent, not including the HTTP/1.0 header, or a dash if
not available.
<LI><TT><FONT FACE="Courier">opt_referer</FONT></TT>: The referer
field if given and if <TT><FONT FACE="Courier">LogOptions</FONT></TT>
is <TT><FONT FACE="Courier">Combined</FONT></TT>.
<LI><TT><FONT FACE="Courier">opt_agent</FONT></TT>: The user agent
field if given and if <TT><FONT FACE="Courier">LogOptions</FONT></TT>
is <TT><FONT FACE="Courier">Combined</FONT></TT>
</UL>
<P>
Note that the last two fields are not enabled on most systems, and
thus our sample program won't process them. It's easy enough to
modify it so that it does, however.
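<P>
To make this format concrete, here is a minimal C++ sketch (separate
from the chapter's own listings, and an illustrative assumption rather
than code taken from any actual server or log tool) that splits one
common-format log line into the fields described above:
<PRE>
// parseline.cpp -- minimal sketch: split one common-format access
// log line into its fields.  Assumes the optional referer and agent
// fields are absent, as on most systems (see the note above).
#include &lt;iostream&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;

int main()
{
    // A sample line, as it might appear in access-log.
    std::string line = "slip11.docker.com - - [09/Jul/1996:02:03:49 -0500] "
                       "\"GET /rcr/ HTTP/1.0\" 200 8751";

    std::istringstream in(line);

    std::string host, rfc931, authuser;
    in &gt;&gt; host &gt;&gt; rfc931 &gt;&gt; authuser;  // three whitespace-delimited fields

    std::string stamp;
    std::getline(in, stamp, '[');          // skip ahead to the opening bracket
    std::getline(in, stamp, ']');          // DD/Mon/YYYY:hh:mm:ss and timezone

    std::string request;
    std::getline(in, request, '"');        // skip ahead to the opening quote
    std::getline(in, request, '"');        // first line of the HTTP request

    std::string status, bytes;             // ddd and bbbb (either may be "-")
    in &gt;&gt; status &gt;&gt; bytes;

    std::cout &lt;&lt; "host:    " &lt;&lt; host    &lt;&lt; "\n"
              &lt;&lt; "stamp:   " &lt;&lt; stamp   &lt;&lt; "\n"
              &lt;&lt; "request: " &lt;&lt; request &lt;&lt; "\n"
              &lt;&lt; "status:  " &lt;&lt; status  &lt;&lt; "\n"
              &lt;&lt; "bytes:   " &lt;&lt; bytes   &lt;&lt; "\n";
    return 0;
}
</PRE>
<P>
The same field-splitting idea underlies the summary program presented
later in this section, which reads such lines in bulk rather than one
at a time.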
<P>
With a line not only for each Web page access, but in fact for
each graphic on each Web page as well, you might be able to imagine
why access log files can grow to become several megabytes in size
very quickly. If your Web server has a limited amount of hard
drive space, the access log's growth might even risk crashing
it!
<P>
One solution to this problem is to delete the access log on a
regular basis, after creating a summary of the information in
it. So how exactly do you create a summary? Good question! This
is where we get into our first program for this chapter, an httpd
access log parser. The individual lines in the access log file,
while providing a fairly detailed amount of information, aren't
terribly useful when viewed in their raw form. However, they can
be used as the basis for all kinds of reports you can create with
software that summarizes the information into various categories.
An example of such a program, the Access Log Summary program, is
included in Listing 21.1; its output is shown in Figure 21.1.
This program reads in the server access log file and generates
an HTML document as output. The document summarizes all of the
raw information presented in the access log into useful categories.
<P>
<A HREF="f21-1.gif" ><B>Figure 21.1: </B><I>The output from the access log summary program.</I></A>
<BR>
<HR>
<BLOCKQUOTE>
<B>Listing 21.1. Source code for the Access Log Summary program.
<BR>
</B>
</BLOCKQUOTE>
<BLOCKQUOTE>
<TT><FONT FACE="Courier">// accsum.cpp -- AccESS LOG SUMMARY PROGRAM
<BR>
// Available on-line at http://www.anadas.com/cgiunleashed/trackuser/
<BR>
//<BR>
// This program reads in the server access log file and generates
an HTML<BR>
// document as output.&nbsp;&nbsp;The document summarizes all
of the raw information<BR>
// presented in the access log into useful categories<BR>
//<BR>
// By Shuman Ghosemajumder, Anadas Software Development<BR>
//<BR>
// The categories it summarizes information for:<BR>
//<BR>
// * # of hits by domain<BR>
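<P>
The listing's header gives the URL where its complete source is
available. As a rough, standalone illustration of the first summary
category it names (# of hits by domain), here is a minimal C++ sketch;
it is an assumption about one way to tally the host field, not code
taken from Listing 21.1:
<PRE>
// tally.cpp -- minimal sketch (not Listing 21.1): count hits by
// top-level domain.  Reads host names, one per line, on standard
// input and writes an HTML fragment, in the spirit of the summary
// program's HTML output.
#include &lt;iostream&gt;
#include &lt;map&gt;
#include &lt;string&gt;

int main()
{
    std::map&lt;std::string, long&gt; hits;   // domain -&gt; hit count
    std::string host;

    while (std::cin &gt;&gt; host) {
        // Keep everything after the last dot: "ca", "com", "net", ...
        std::string::size_type dot = host.rfind('.');
        std::string domain =
            (dot == std::string::npos) ? host : host.substr(dot + 1);
        ++hits[domain];
    }

    std::cout &lt;&lt; "&lt;UL&gt;\n";
    for (std::map&lt;std::string, long&gt;::const_iterator it = hits.begin();
         it != hits.end(); ++it)
        std::cout &lt;&lt; "&lt;LI&gt;" &lt;&lt; it-&gt;first &lt;&lt; ": "
                  &lt;&lt; it-&gt;second &lt;&lt; " hits\n";
    std::cout &lt;&lt; "&lt;/UL&gt;\n";
    return 0;
}
</PRE>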
