<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<!-- This document was created from RTF source by rtftohtml version 3.0.1 -->
<META NAME="GENERATOR" Content="Symantec Visual Page 1.0">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;CHARSET=iso-8859-1">
<TITLE>Without a title - Title</TITLE>
</HEAD>
<BODY BACKGROUND="r2harch.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/r2harch.gif" TEXT="#000000" BGCOLOR="#FFFFFF">
<H2 ALIGN="CENTER"><A HREF="ch09.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/ch09.htm"><IMG SRC="blanprev.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blanprev.gif" WIDTH="37" HEIGHT="37"
ALIGN="BOTTOM" BORDER="2"></A><A HREF="index-1.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/index-1.htm"><IMG SRC="blantoc.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blantoc.gif" WIDTH="42"
HEIGHT="37" ALIGN="BOTTOM" BORDER="2"></A><A HREF="ch11.htm" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/ch11.htm"><IMG SRC="blannext.gif" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/blannext.gif"
WIDTH="45" HEIGHT="37" ALIGN="BOTTOM" BORDER="2"></A><BR>
<BR>
<FONT COLOR="#0000AA">10</FONT><BR>
<A NAME="Heading1"></A><FONT COLOR="#000077">Search Engines<BR>
</FONT>
<HR>
</H2>
<UL>
<LI><A HREF="#Heading1">Search Engines</A>
<UL>
<LI><A HREF="#Heading2">On-Site Searching with Glimpse</A>
<UL>
<LI><A HREF="#Heading4">Glimpse Indexes</A>
<LI><A HREF="#Heading5">GlimpseHTTP</A>
</UL>
<LI><A HREF="#Heading6">Search the Web with WWW::Search</A>
<LI><A HREF="#Heading7">Listing 10.1. search.PL</A>
<LI><A HREF="#Heading8">; real world example of WWW::Search.</A>
<LI><A HREF="#Heading9">Summary</A>
</UL>
</UL>
<P>
<HR>
</P>
<UL>
<LI>On-Site Searching with Glimpse
<P>
<LI>Search the Web with WWW::Search
</UL>
<P>There are two basic ways we access information on the Web: browsing and searching.
The Web's popularity and power are based on its vast number of hyperlinked documents.
You can browse from one page to another, clicking on whichever links interest you,
or you can focus on what you are looking for. Starting from a single home page, or a page
such as Yahoo!, you can click your way to anywhere else on the Web.</P>
<P>However, as more and more information becomes available on the Web, even the best
indexes can't provide links to all of the information. With tens of millions of Web
pages currently on servers all over the world, it is simply impossible and impractical
to "browse" through an index of these documents to find the information
you are looking for.</P>
<P>So, as the Web has expanded, we have seen the birth of search engines. At first,
these search engines could be found on the more prominent index sites such as Yahoo!.
The search engine could locate a list of Web sites that matched the given search criteria.
Today, Web sites such as Digital Equipment Corporation's AltaVista allow searching
of the entire Web, using giant supercomputers with gigabytes of memory.</P>
<P>When implementing a search engine on your site, consider how it works
from a user's standpoint. Many of the search functions I find on the Web today are
totally useless because of the way their interfaces were designed. The typical user
does not want to take the time to learn the syntax of a complicated "valid"
search query and is easily annoyed by the "black box" nature of some
search mechanisms. This is especially true if the search mechanism fails to return
an appropriate response (or any response at all) to the user.</P>
<P>In this chapter, I will show you how Perl 5 can be used to access information
locally on your site and globally on any site on the Web. If implemented properly,
these tools will allow even the most poorly constructed, or even misspelled, search
query to return appropriate information to those searching your site.</P>
<H3 ALIGN="CENTER"><A NAME="Heading2"></A><FONT COLOR="#000077">On-Site Searching
with Glimpse</FONT></H3>
<P>Glimpse is a set of UNIX tools that provides an excellent foundation for
a search engine on any UNIX-based Web server. Glimpse (GLobal IMPlicit SEarch) is
a powerful "indexing and query system" that allows you to search through
large numbers of files on your server very quickly. Glimpse is used in the same way
as the popular UNIX command grep, except that it can search entire filesystems. For
example, if you are looking for the word "help" in some file located anywhere
on your server, all you have to do is type "glimpse help", and all lines
containing "help" will appear, each preceded by the name of the file in which it was found.</P>
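<P>As a minimal sketch of how a Perl script might consume this output, the following
assumes each match arrives as one line of the form "filename: matching text".
The exact layout can vary with the options given to Glimpse, and the sample lines and
the names parse_glimpse_line and group_matches are illustrative only.</P>

```perl
use strict;
use warnings;

# Split one line of glimpse-style output into (file, text).
# Assumes the "filename: matched line" layout described above;
# returns an empty list for lines that do not fit that shape.
sub parse_glimpse_line {
    my ($line) = @_;
    chomp $line;
    return $line =~ /^([^:]+):\s?(.*)$/ ? ($1, $2) : ();
}

# Group the matching lines by the file they were found in.
sub group_matches {
    my (@lines) = @_;
    my %by_file;
    for my $line (@lines) {
        my ($file, $text) = parse_glimpse_line($line) or next;
        push @{ $by_file{$file} }, $text;
    }
    return \%by_file;
}

# Hypothetical sample output, not captured from a real run.
my @sample = (
    "/public_html/faq.html: If you need help, mail the webmaster.\n",
    "/public_html/faq.html: See the help page for details.\n",
    "/public_html/index.html: <A HREF=\"help.html\">Help</A>\n",
);

my $matches = group_matches(@sample);
for my $file (sort keys %$matches) {
    printf "%s (%d)\n", $file, scalar @{ $matches->{$file} };
}
```

<P>Grouping by file name makes it easy to present one entry per document, with the
matching lines listed beneath it, rather than one entry per matching line.</P>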
<P>Glimpse was developed by Udi Manber and Burra Gopal, at the University of Arizona,
and Sun Wu, at the National Chung-Cheng University in Taiwan. At the time of this
writing, Glimpse is at version 4.0. Source and precompiled binaries of Glimpse can
be found at</P>
<PRE><A HREF="http://glimpse.cs.arizona.edu/" tppabs="http://glimpse.cs.arizona.edu/"><FONT COLOR="#0066FF">http://glimpse.cs.arizona.edu/
</FONT></A><FONT COLOR="#0066FF"></FONT></PRE>
<P>The Glimpse package contains the programs agrep, glimpse, glimpseindex, and glimpseserver.
To use Glimpse from the command line, you must first "index" your files
with glimpseindex. The glimpseindex program creates an optimized index file which
contains a "hash" of all the data in your files. Because Glimpse searches
this index file rather than the actual data, it is important that the index be kept up to date.
Running glimpseindex nightly from cron, a utility which executes tasks
on a regular schedule, is typically a good idea. Using glimpseindex to create an index
is very simple. To index all files in the directory tree rooted
at /public_html, type (or place in your crontab) the following:</P>
<PRE><FONT COLOR="#0066FF">glimpseindex /public_html
</FONT></PRE>
<P>Afterwards, Glimpse can quickly and efficiently search through all of the documents
indexed in the /public_html directory.</P>
<H3 ALIGN="CENTER">
<HR WIDTH="83%">
<BR>
<FONT COLOR="#000077">TIP:</FONT></H3>
<BLOCKQUOTE>
<P>Pay close attention to what you are indexing. If you want to index all of the
Web pages on your server, your Glimpse index need only cover the files under the
public HTML directory. Images are often stored in the public HTML document area as well,
but they need not be indexed, so place them in a directory that Glimpse does not index.<BR>
<HR>
</BLOCKQUOTE>
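<P>One way to act on this tip is to check which image files currently sit inside the
tree you plan to index. The sketch below walks a directory with the standard File::Find
module and lists files whose names end in a common image extension. The find_images name
and the extension list are assumptions made for illustration; adjust them for your site.</P>

```perl
use strict;
use warnings;
use File::Find;

# Return the paths under $root whose extension suggests an image.
# The extension list is an assumption, not something Glimpse defines.
sub find_images {
    my ($root) = @_;
    my @images;
    find(
        {
            no_chdir => 1,    # keep $_ as the full path of each entry
            wanted   => sub {
                push @images, $File::Find::name
                    if -f $_ && /\.(?:gif|jpe?g|png|xbm)$/i;
            },
        },
        $root
    );
    my @sorted = sort @images;
    return @sorted;
}

# /public_html is the example tree used in this chapter.
my $root = @ARGV ? $ARGV[0] : '/public_html';
print "$_\n" for find_images($root);
```

<P>Any files this reports are candidates for moving to a directory outside the indexed tree.</P>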
<H4 ALIGN="CENTER"><A NAME="Heading4"></A><FONT COLOR="#000077">Glimpse Indexes</FONT></H4>
<P>Glimpse indexes are highly optimized files containing representations of the actual
data on your system. By searching for patterns in these index files, Glimpse can
quickly query large amounts of data. Glimpse supports three types of indexes: a tiny
one (2 to 3 percent of the size of all files indexed), a small one (7 to 9 percent),
and a medium one (20 to 30 percent). You specify the relative size of the index
when you build it with glimpseindex. The larger the
index, the faster the search. The size of the index you use should be based
on the resources available on your server. If you have a fast server (say, a Silicon
Graphics WebForce server) with limited disk resources, you would probably want to
use a smaller index. If you have lots of disk space and a slower Intel-based
server, you might consider using a bigger index.</P>
<P>Glimpse supports "approximate
matching" (finding misspelled words), Boolean queries, and limited forms of
regular expressions. Details can be found in the Glimpse man pages or on its Web
site.</P>
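<P>To get a feel for the disk space each index type might need, the percentages quoted
above can be turned into a rough estimate. The following sketch simply applies those
ranges to a total data size; the estimate_index_bytes name and the 500 MB figure are
illustrative assumptions, and real index sizes will depend on your data.</P>

```perl
use strict;
use warnings;

# Index size as a fraction of the indexed data, using the ranges
# quoted in the text for the tiny, small, and medium index types.
my %INDEX_RATIO = (
    tiny   => [ 0.02, 0.03 ],
    small  => [ 0.07, 0.09 ],
    medium => [ 0.20, 0.30 ],
);

# Return (low, high) estimated index size in bytes for a given
# index type and total size of the files to be indexed.
sub estimate_index_bytes {
    my ($type, $data_bytes) = @_;
    my $range = $INDEX_RATIO{$type}
        or die "unknown index type: $type\n";
    return map { $_ * $data_bytes } @$range;
}

# Example: 500 MB of documents (an arbitrary illustrative figure).
my $data_bytes = 500 * 1024 * 1024;
for my $type (qw(tiny small medium)) {
    my ($low, $high) = estimate_index_bytes($type, $data_bytes);
    printf "%-6s %6.1f to %6.1f MB\n",
        $type, $low / 2**20, $high / 2**20;
}
```

<P>For 500 MB of documents, this works out to roughly 10 to 15 MB for a tiny index and
100 to 150 MB for a medium one, which is the trade-off to weigh against search speed.</P>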
<H4 ALIGN="CENTER"><A NAME="Heading5"></A><FONT COLOR="#000077">GlimpseHTTP</FONT></H4>
<P>Now you are probably asking how all of this talk about Glimpse relates to Perl.
GlimpseHTTP is a collection of Perl scripts which takes advantage of the power of
Glimpse from within Perl. GlimpseHTTP outputs search results in nicely formatted
HTML based on a template page (ghtemplate.html) which is easily modified to customize
your output.</P>
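<P>To illustrate the general idea of driving Glimpse from Perl, here is a minimal sketch.
It is not GlimpseHTTP's own code. It assumes the glimpse options -i (ignore case) and
-H (directory holding the index), which should be checked against the man page for your
installed version; the function names and the index location are hypothetical.</P>

```perl
use strict;
use warnings;

# Build the argument list for a glimpse search.
# Assumption: -i ignores case and -H names the directory that holds
# the index; verify both against your installed Glimpse man page.
sub glimpse_command {
    my ($pattern, $index_dir) = @_;
    die "empty search pattern\n"
        unless defined $pattern && length $pattern;
    die "pattern may not begin with '-'\n" if $pattern =~ /^-/;
    return ('glimpse', '-i', '-H', $index_dir, $pattern);
}

# Run glimpse and return its output lines. The list form of open()
# bypasses the shell, so the user's query is never treated as shell syntax.
sub run_glimpse {
    my @cmd = glimpse_command(@_);
    open(my $out, '-|', @cmd) or die "cannot run $cmd[0]: $!\n";
    my @lines = <$out>;
    close $out;
    return @lines;
}

# Show the command that would be run for a sample query.
# The index directory here is a placeholder.
my @cmd = glimpse_command('help', '/public_html');
print join(' ', @cmd), "\n";
```

<P>Passing the query as a separate list element, rather than interpolating it into a
command string, matters in a CGI setting because the pattern comes directly from the user.</P>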
<P>GlimpseHTTP was written by Michael Smith, Udi Manber, and Paul Klark. As of this
writing, the most current version is 2.0 and is available from:</P>
<PRE><A HREF="ftp://ftp.cs.arizona.edu/glimpse/glimpseHTTP.2.0.src.tar.Z" tppabs="ftp://ftp.cs.arizona.edu/glimpse/glimpseHTTP.2.0.src.tar.Z"><FONT COLOR="#0066FF">ftp://ftp.cs.arizona.edu/glimpse/glimpseHTTP.2.0.src.tar.Z</FONT></A><FONT
COLOR="#0066FF">
</FONT></PRE>
<P>Installation of GlimpseHTTP is very straightforward. A step-by-step installation
guide can be found at:</P>
<PRE><A HREF="http://glimpse.cs.arizona.edu/ghttp/install.html" tppabs="http://glimpse.cs.arizona.edu/ghttp/install.html"><FONT COLOR="#0066FF">http://glimpse.cs.arizona.edu/ghttp/install.html
</FONT></A><FONT COLOR="#0066FF"></FONT></PRE>
<P>After GlimpseHTTP is installed, the first thing you need to do is make an "archive"
using the included makegharc command. Like Glimpse, GlimpseHTTP requires a few additional
files to be created before it can function properly. The makegharc program creates some configuration
files, along with the ghindex.html files which contain the search forms. When makegharc
is run, it will prompt you for the location of the archive. As we discussed earlier,
the location needs to be at the root of the public_html tree on your server, and it
should not contain images or any other files you do not intend to make searchable.
<BR>
<BR>
<A HREF="11wpp01.jpg" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/11wpp01.jpg"><TT><B>Figure 10.1.</B></TT></A> GlimpseHTTP in action.