one or more keywords into an HTML form; the search engine gathers
the URLs of pages that match the keywords; the results are returned
to the user weighted by some sort of scoring mechanism.
<H3><A NAME="Indexers">Indexers</A></H3>
<P>
The first step in putting a searchable index of information on
the Web is generating that index. A number of freely available
packages exist on the Internet to do just that, including Wais,
Swish, Ice, and Glimpse.
<H4>Wais</H4>
<P>
Probably the most common indexer in use on the Web today is Wais
(or its descendants freeWais and freeWais-sf), which predates the
Web itself. Wais was originally developed by Wais, Inc. (now owned
by America Online). More recent development has branched off
into a freely redistributable version called freeWais and an enhanced
version called freeWais-sf. Information about Wais in general
(and freeWais-sf in particular) is available at
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://l6-www.informatik.uni-dortmund.de/freeWAIS-sf/">http://l6-www.informatik.uni-dortmund.de/freeWAIS-sf/</A></FONT></TT>
</BLOCKQUOTE>
<P>
Source code is available from
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="ftp://ftp.germany.eu.net/pub/infosystems/wais/Unido-LS6/">ftp://ftp.germany.eu.net/pub/infosystems/wais/Unido-LS6/</A></FONT></TT>
</BLOCKQUOTE>
<P>
Wais was designed as an all-purpose text indexer, but it also works
well for indexing HTML and other Web-related documents.
<P>
Installing freeWAIS-sf creates several programs, including waisserver,
waissearch, waisq, and waisindex. Waisserver is a daemon that
accepts requests from any machine on the Internet, processes queries,
and returns information on the requested documents with weighted
scoring information. Waissearch is a client used to connect to
waisservers across the Internet. Waisq is a client for use on
a local server. Waisindex is the actual indexing program. It takes
a list of files and generates a database containing all the words
in the files, sorted and weighted by a number of criteria. At
the current time, indexes generated by waisindex are about twice
the size of the original documents.
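<P>
The idea of turning a set of files into a weighted word database can
be sketched in a few lines of Perl. This is only an illustration and
not how Wais actually stores its data: the subroutine names
<TT><FONT FACE="Courier">build_index</FONT></TT> and
<TT><FONT FACE="Courier">search_index</FONT></TT>, the in-memory hash,
and the use of a simple occurrence count as the score are all
assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical sketch: build an in-memory inverted index from a list of
# document name => text pairs. Real indexers write a database to disk.
sub build_index {
    my (%docs) = @_;
    my %index;    # word => { document name => number of occurrences }
    for my $name (keys %docs) {
        for my $word (split /\W+/, lc $docs{$name}) {
            next unless length $word;
            $index{$word}{$name}++;
        }
    }
    return \%index;
}

# Return the documents containing $word, highest occurrence count first.
# Call this in list context.
sub search_index {
    my ($index, $word) = @_;
    my $hits = $index->{lc $word} || {};
    return sort { $hits->{$b} <=> $hits->{$a} || $a cmp $b } keys %$hits;
}

my $index = build_index(
    'a.html' => 'Perl and CGI: using Perl on the Web',
    'b.html' => 'Searching the Web with CGI',
);
print join(', ', search_index($index, 'web')), "\n";
```

<P>
A production indexer would also skip very common words and store the
index in a file so that it does not have to be rebuilt for every query.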
<H4>Swish</H4>
<P>
Swish was developed by Kevin Hughes of EIT. It is available at
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://www.eit.com/goodies/software/swish/">http://www.eit.com/goodies/software/swish/</A></FONT></TT>
</BLOCKQUOTE>
<P>
Swish was designed from the ground up as an HTML indexer. It is
not (nor does it claim to be) as complex or full-featured as Wais,
but it is much smaller, simpler to install, and easier to maintain.
Both the indexer and the search engine are in the same program.
Also, because it was designed for the Web, Swish is able to take
into account HTML tags, ignoring the tags themselves and giving
higher precedence to text within certain tags (like headers).
One of the most noticeable drawbacks of Swish is that it does
all of its indexing in RAM, so the total size of the files you
wish to index cannot exceed your available RAM (Wais, by contrast,
offers a maximum-RAM switch with its indexer). However, unless you have
a very large site (say, over 30 MB of files on a 32 MB machine), this
should not be a problem.
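<P>
Giving extra weight to words that appear inside certain tags might
look something like the following sketch. It assumes the markup has
already been split into tag and text pairs, and the weights (3 for
headers and titles, 1 for everything else) and the name
<TT><FONT FACE="Courier">score_words</FONT></TT> are invented for
illustration. Swish's real scoring rules may differ.

```perl
use strict;
use warnings;

# Hypothetical weights: a word in a header or title counts three times
# as much as a word in ordinary body text.
my %TAG_WEIGHT = (H1 => 3, H2 => 3, H3 => 3, H4 => 3, TITLE => 3);
my $DEFAULT_WEIGHT = 1;

# Takes a list of [tag, text] pairs (the markup is assumed to be parsed
# already) and returns a hash reference of word => accumulated weight.
sub score_words {
    my (@chunks) = @_;
    my %score;
    for my $chunk (@chunks) {
        my ($tag, $text) = @$chunk;
        my $weight = $TAG_WEIGHT{uc $tag} || $DEFAULT_WEIGHT;
        for my $word (grep { length } split /\W+/, lc $text) {
            $score{$word} += $weight;
        }
    }
    return \%score;
}

my $score = score_words(
    [ H3 => 'Indexers' ],
    [ P  => 'Indexers build an index.' ],
);
printf "%s=%d\n", $_, $score->{$_} for sort keys %$score;
```

<P>
Here the word "indexers" ends up with a higher score than "index"
because one of its two occurrences is in a header.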
<H4>Ice</H4>
<P>
Ice is a Web indexing program written entirely in Perl. It uses
a very simple indexing format that becomes slow with large numbers
of documents but is very fast and efficient for sites with up
to a couple of thousand files. Ice was created and is maintained
by Christian Neuss and is available at
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html">http://www.informatik.th-darmstadt.de/~neuss/ice/ice.html</A></FONT></TT>
</BLOCKQUOTE>
<P>
Ice also supports a thesaurus file, which allows for synonyms
and abbreviations while searching.
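<P>
The effect of a thesaurus on a query can be approximated as follows.
This is a sketch only: the
<TT><FONT FACE="Courier">%THESAURUS</FONT></TT> hash and the
<TT><FONT FACE="Courier">expand_query</FONT></TT> subroutine are
hypothetical, and Ice's own thesaurus file format is not reproduced
here.

```perl
use strict;
use warnings;

# Hypothetical thesaurus: each key maps to words that should also match.
my %THESAURUS = (
    www  => ['web'],
    perl => ['perl5'],
);

# Expand a list of keywords with their synonyms, without duplicates.
sub expand_query {
    my (@keywords) = @_;
    my (%seen, @expanded);
    for my $word (map { lc $_ } @keywords) {
        for my $term ($word, @{ $THESAURUS{$word} || [] }) {
            push @expanded, $term unless $seen{$term}++;
        }
    }
    return @expanded;
}

print join(' ', expand_query('WWW', 'search')), "\n";
```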
<H4>Glimpse</H4>
<P>
Glimpse is a fairly new entry in the indexer wars, having only
recently gained widespread attention as the default search engine of
the Harvest system. Glimpse is similar to Wais in that it builds
as several executables and offers many options when searching.
Glimpse also appears to be highly intuitive, with most of its
advanced searching options accessible through a simple command-line
switch. It is being developed at the University of Arizona and
is available at
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://glimpse.cs.arizona.edu:1994/">http://glimpse.cs.arizona.edu:1994/</A></FONT></TT>
</BLOCKQUOTE>
<H3><A NAME="SearchEngines">Search Engines</A></H3>
<P>
Once the index of files exists on your server, the next step is
providing a way for users to access this from the Web. This is
where CGI comes in. A CGI program must take a set of keywords
(or some other sort of query) from a form, pass it to the search
engine, and then interpret the results. Because all the work is
done by the indexer/search engine, this front end can be fairly
simple. Not coincidentally, there are dozens of them available
on the Net, and it is not a major task to customize one for your
own use.
<H4>Wais Front Ends</H4>
<P>
Due to the popularity of Wais, interfaces between it and the Web
are very common. A Perl interface (<TT><FONT FACE="Courier">WAIS.pm</FONT></TT>)
is standard with certain releases of freeWAIS-sf. Another Perl
front end (<TT><FONT FACE="Courier">wais.pl</FONT></TT>) comes
with NCSA httpd. A list of other interfaces between Wais and the
Web can be found on Yahoo at
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://www.yahoo.com/Computers_and_Internet/Internet/Searching_the_Net/WAIS/">http://www.yahoo.com/Computers_and_Internet/Internet/Searching_the_Net/WAIS/</A></FONT></TT>
</BLOCKQUOTE>
<H4>Other Front Ends</H4>
<P>
Several front ends exist for the other search engines, as well.
Ice comes with its own CGI program (<TT><FONT FACE="Courier">ice_form.pl</FONT></TT>).
WWWWAIS is a program by the maker of Swish that serves as a front
end to both Wais and Swish indexes. It is available at
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://www.eit.com/goodies/software/wwwwais/">http://www.eit.com/goodies/software/wwwwais/</A></FONT></TT>
</BLOCKQUOTE>
<P>
Harvest is an ambitious set of tools developed at the University of Colorado,
Boulder, which aims to provide a central package to "gather,
extract, search, cache, and replicate" information across
the Internet. Harvest uses Glimpse as its default search engine.
It is available at
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://harvest.cs.colorado.edu/harvest/">http://harvest.cs.colorado.edu/harvest/</A></FONT></TT>
</BLOCKQUOTE>
<H4>Rolling Your Own</H4>
<P>
With a little thought and effort, it is not hard to create your
own custom front end for an existing search engine. A few things
must be considered:
<UL>
<LI><FONT COLOR=#000000>Getting the information from the form: This
is the easy part. As mentioned earlier in the chapter, there are
packages for all CGI languages that serve to retrieve information
from a form and store it in variables of some kind. In this case,
the information will be a list of keywords and perhaps some constraints
such as a Boolean </FONT><TT><FONT FACE="Courier">AND</FONT></TT>
or <TT><FONT FACE="Courier">OR </FONT></TT>search, a maximum number
of results to return, or a specific index to search.
<LI><FONT COLOR=#000000>Parsing the information: Before passing
it to the search engine, the data must be put in the right form
(usually as command-line arguments). Simple error detection can
also be performed at this stage: check that the user entered all
of the necessary data in the form.</FONT>
</UL>
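<P>
The first two steps can be sketched as follows. The field names
(<TT><FONT FACE="Courier">keywords</FONT></TT>,
<TT><FONT FACE="Courier">bool</FONT></TT>, and
<TT><FONT FACE="Courier">max</FONT></TT>), the default of 20 results,
and both subroutine names are assumptions made for this example. In
practice, one of the form-handling packages mentioned earlier would
normally do the decoding.

```perl
use strict;
use warnings;

# Decode an application/x-www-form-urlencoded string into a hash.
sub parse_form {
    my ($query) = @_;
    my %form;
    for my $pair (split /[&;]/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX encodes one byte
        }
        $form{$name} = $value;
    }
    return %form;
}

# Check the (hypothetical) fields and fall back to defaults.
sub search_options {
    my (%form) = @_;
    my $raw = defined $form{keywords} ? $form{keywords} : '';
    my @keywords = split ' ', $raw;
    die "No keywords entered\n" unless @keywords;
    my $bool = uc($form{bool} || 'AND');
    $bool = 'AND' unless $bool eq 'OR';
    my $max = ($form{max} || '') =~ /^(\d{1,3})$/ ? $1 : 20;
    return (keywords => \@keywords, bool => $bool, max => $max);
}

my %opts = search_options(parse_form('keywords=perl+cgi&bool=or&max=10'));
print "@{ $opts{keywords} } $opts{bool} $opts{max}\n";
```

<P>
Anything that does not look like a small positive number is replaced
by the default rather than passed along, so a malformed
<TT><FONT FACE="Courier">max</FONT></TT> field cannot reach the search
engine.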
<P>
<CENTER><TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Caution</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
At this point, the program should also check to make sure that the user is not trying to pull a fast one. In the next step, an external program is called, so care must be taken to prevent the infamous <TT><FONT FACE="Courier">keyword; rm -rf /</FONT></TT>
trick. In almost every shell, a semicolon is a command separator, so a would-be attacker could insert one into a query, followed by one or more malicious commands. Other shell metacharacters, such as the pipe, the ampersand, and backquotes, can be abused in the same way.
</BLOCKQUOTE>
<BLOCKQUOTE>
Don't fall into this trap.</BLOCKQUOTE>
</TD></TR>
</TABLE></CENTER>
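<P>
Two common defenses are sketched below: accept only keywords made of
known-harmless characters, and pass the command to Perl as a list so
that no shell ever interprets the arguments. The allowed character set
and the name <TT><FONT FACE="Courier">safe_keywords</FONT></TT> are
assumptions for the example, and the running Perl interpreter stands in
for the real search program.

```perl
use strict;
use warnings;

$| = 1;    # flush our output before the child process writes its own

# Keep only keywords made of letters, digits, underscores, and hyphens.
# A leading hyphen is rejected so a keyword cannot pose as an option.
sub safe_keywords {
    my (@words) = @_;
    return grep { /\A[A-Za-z0-9_][A-Za-z0-9_-]*\z/ } @words;
}

my @words = safe_keywords('keyword;', 'rm', '-rf', '/', 'perl');
print "@words\n";

# With the list form of system(), the arguments go straight to the
# program, so a semicolon is just another character. $^X (the running
# perl) is a stand-in for the search engine here.
system($^X, '-e', 'print "$ARGV[0]\n"', 'keyword; echo injected') == 0
    or die "command failed: $?\n";
```

<P>
Rejecting everything that is not explicitly allowed is generally safer
than trying to list every dangerous character.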
<P>
<UL>
<LI><FONT COLOR=#000000>Calling the search engine: Now that all
of the information has been verified as safe and is in the correct
format, it must be passed to the search engine. In languages with UNIX
roots (such as C/C++ and Perl), this is most effectively accomplished
by using process pipes. Consider the following snippet of Perl
code that takes the prepared information, passes it to the Wais
search engine, and then reads the output</FONT>:
</UL>
<P>
<P>
<BLOCKQUOTE>
<TT><FONT FACE="Courier">pipe(P0R,P0W); # Creates one read/write
pipe<BR>
pipe(P1R,P1W); # Creates another read/write pipe<BR>
<BR>
if ($pid = fork) { # This created a new process,<BR>
# This is the parent process<BR>
close(P0R); # Close the read end of the
first pipe<BR>
close(P1W); # and the write end of the
other one<BR>
&read_from_wais(P1R); # This calls
a subroutine which is fed input<BR>
# into P1R. It then interprets it into
search results.<BR>
} elsif (defined $pid) {<BR>
# This is the child<BR>
close(P0W); # Close the write end of the
first pipe<BR>
close(P1R); # Close the read end of the
second pipe<BR>
open(STDIN, "<&P0R");
# Duplicate P0R as the standard input<BR>
open(STDOUT, ">&P1W");
# Duplicate P1W as the standard out<BR>
# Now the standard output will travel
through P1W into P1R which<BR>
# is being held by the parent who sends
it off to the subroutine.<BR>
exec(@argline) || die; # @argline holds
the command to execute the<BR>
# Wais search engine<BR>
# At this point the child dies<BR>
} else { die("Can't fork!");
} # This is only reached if fork()<BR>
# fails<BR>
# Parent now continues with any information
retrieved from the<BR>
# search engine.</FONT></TT>
<P>
Manipulating pipes and forks can be tricky at first, but it greatly
increases the power of interprocess communication, which is necessary
to interact with an external search engine.
</BLOCKQUOTE>
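<P>
When only the output of the search engine is needed, the same result
can be had with less code. The sketch below assumes Perl 5.8 or later
on a UNIX system, where <TT><FONT FACE="Courier">open</FONT></TT>
accepts a command in list form and performs the pipe, fork, and exec
itself. The name <TT><FONT FACE="Courier">run_search</FONT></TT> is
hypothetical, and the running Perl interpreter again stands in for the
search engine.

```perl
use strict;
use warnings;

# Run a command without a shell and return its output lines.
# Note: if @argline holds a single string, Perl may still hand it to a
# shell, so always pass the program and its arguments separately.
sub run_search {
    my (@argline) = @_;
    open(my $out, '-|', @argline) or die "Can't run $argline[0]: $!\n";
    my @lines = readline($out);
    close($out) or warn "$argline[0] exited with status $?\n";
    chomp @lines;
    return @lines;
}

# $^X (the running perl) stands in for the search engine in this demo.
my @results = run_search($^X, '-e', 'print "doc$_.html\n" for 1 .. 3');
print "$_\n" for @results;
```

<P>
The explicit pipe and fork approach is still the one to use when the
parent must also write to the child's standard input.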
<H2><A NAME="LargeScaleDatabases"><FONT SIZE=5 COLOR=#FF0000>Large
Scale Databases</FONT></A></H2>
<P>
At some point, you may encounter a project that is simply too
big for a text-based database and is not suited for a text-indexing
system. Fear not; others have been down this road and fortunately
have left a lot of software behind to help integrate large database
servers with the Web. A "large-scale" database need
not be large, per se. It is simply any database that is not a
flat ASCII file. Popular commercial databases such as dBASE,
Paradox, and Access qualify (although they can all read ASCII
files, that is not their preferred method of storing information).
Also fitting this category are database servers such as Sybase,
Oracle, and mSQL.
<P>
When dealing with a large-scale database, the trick is not in
storing or manipulating the data as it is with the text database.
The database server does all that work for you. The trick is communicating
with the database server. There are almost as many database communication
protocols as there are databases, despite the existence of some
very complete standards (such as SQL). Programs that interface with
the Web exist for practically every database that has communications
capabilities. A list of some of these programs follows:
<P>
<CENTER><TABLE BORDERCOLOR=#000000 BORDER=1 WIDTH=80%>
<TR><TD><B>Tip</B></TD></TR>
<TR><TD>
<BLOCKQUOTE>
Much of the information that follows can be found online (in, no doubt, an updated form) at Jeff Rowe's excellent page</BLOCKQUOTE>
<BLOCKQUOTE>
<TT><FONT FACE="Courier"><A HREF="http://cscsun1.larc.nasa.gov/~beowulf/db/all_products.html">http://cscsun1.larc.nasa.gov/~beowulf/db/all_products.html</A></FONT></TT>
</BLOCKQUOTE>
</TD></TR>
</TABLE></CENTER>
<P>
<UL>
<LI><FONT COLOR=#000000>4D<BR>
</FONT>NetLink/4D (<TT><FONT FACE="Courier"><A HREF="http://www.fsti.com/productinfo/netlink.html">http://www.fsti.com/productinfo/netlink.html</A></FONT></TT>)-This
is a commercial product for Macintosh computers running the WebStar
server. It allows users to directly manipulate 4D databases.
<LI><FONT COLOR=#000000>Microsoft Access<BR>
</FONT>4W Publisher (<TT><FONT FACE="Courier"><A HREF="http://www.4w.com/4wpublisher/">http://www.4w.com/4wpublisher/</A></FONT></TT>)-This
is a commercial product that generates static HTML pages from
an Access database. A CGI version is due out soon that will allow
dynamic access to the database.<BR>
A-XOrion (<TT><FONT FACE="Courier"><A HREF="http://www.clark.net/infouser/endidc.htm">http://www.clark.net/infouser/endidc.htm</A></FONT></TT>)-This
is a custom commercial database server for the Windows platform.
It allows real-time access to major-brand PC databases (Paradox,
dBASE, FoxPro, Access), but it requires Access to run.<BR>
dbWeb (<TT><FONT FACE="Courier"><A HREF="http://www.axone.ch/dbWeb/">http://www.axone.ch/dbWeb/</A></FONT></TT>)-This
is a freeware tool to maintain large hypertexts using an SQL interface
to Access. It has the capability to export large documents in
a variety of formats including HTML pages, Microsoft Viewer, and
tagged text.
<LI><FONT COLOR=#000000>DB2<BR>
</FONT>DB2WWW (<TT><FONT FACE="Courier"><A HREF="http://www.software.ibm.com/data/db2/db2wfac2.html">http://www.software.ibm.com/data/db2/db2wfac2.html</A></FONT></TT>)-This