<?label 12.3. Inverted Index Search?><html><head><title>Inverted Index Search (CGI Programming with Perl)</title><link href="../style/style1.css" type="text/css" rel="stylesheet" /><meta name="DC.Creator" content="Scott Guelich, Gunther Birznieks and Shishir Gundavaram" /><meta scheme="MIME" content="text/xml" name="DC.Format" /><meta content="en-US" name="DC.Language" /><meta content="O'Reilly & Associates, Inc." name="DC.Publisher" /><meta scheme="ISBN" name="DC.Source" content="1565924193L" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="CGI Programming with Perl" /><meta content="Text.Monograph" name="DC.Type" /></head><body bgcolor="#ffffff"><img src="gifs/smbanner.gif" alt="Book Home" usemap="#banner-map" border="0" /><map name="banner-map"><area alt="CGI Programming with Perl" href="index.htm" coords="0,0,466,65" shape="rect" /><area alt="Search this book" href="jobjects/fsearch.htm" coords="467,0,514,18" shape="rect" /></map><div class="navbar"><table border="0" width="515"><tr><td width="172" valign="top" align="left"><a href="ch12_02.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0" /></a></td><td width="171" valign="top" align="center"><a href="index.htm">CGI Programming with Perl</a></td><td width="172" valign="top" align="right"><a href="ch13_01.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0" /></a></td></tr></table></div><hr align="left" width="515" /><h2 class="sect1">12.3. Inverted Index Search</h2><p>The <a name="INDEX-2395" /> <a name="INDEX-2,396" /> <a name="INDEX-2,397" /> <a name="INDEX-2,398" />applications that we've looked at so far search through every file in the specified directory, looking for particular words or phrases. This is not only time-consuming, but also places a heavy load on the server. 
We clearly need a different approach to searching.</p><p>A more efficient approach is to create an index (like the one you can find at the back of this and other books) containing all the words from specific documents and the name of the document in which they appear.</p><p>In this section, we will discuss an application that creates an inverted index. The index is <em class="firstterm">inverted</em><a name="INDEX-2399" /> in the sense that a particular word is used to find the file(s) in which it appears, rather than the other way around. In the following section, we will look at the CGI script that searches this index and presents the results in a nice format.</p><p><a href="ch12_03.htm#ch12-65973">Example 12-3</a> shows the indexer.</p><a name="ch12-65973" /><div class="example"><h4 class="objtitle">Example 12-3. indexer.pl </h4><blockquote><pre class="code">#!/usr/bin/perl -w
# This is not a CGI, so taint mode not required

use strict;

use File::Find;
use DB_File;
use Getopt::Long;
require "stem.pl";

use constant DB_CACHE      =&gt; 0;
use constant DEFAULT_INDEX =&gt; "/usr/local/apache/data/index.db";

my( %opts, %index, @files, $stop_words );

GetOptions( \%opts, "dir=s",
                    "cache=s",
                    "index=s",
                    "ignore",
                    "stop=s",
                    "numbers",
                    "stem" );

die usage(  ) unless $opts{dir} &amp;&amp; -d $opts{dir};

$opts{'index'}        ||= DEFAULT_INDEX;
$DB_BTREE-&gt;{cachesize}  = $opts{cache} || DB_CACHE;

tie %index, "DB_File", $opts{'index'}, O_RDWR|O_CREAT, 0640, $DB_BTREE
    or die "Cannot tie database: $!\n";

# Record the options only after tying, so that they reach the database
$index{"!OPTION:stem"}   = 1 if $opts{'stem'};
$index{"!OPTION:ignore"} = 1 if $opts{'ignore'};

find( sub { push @files, $File::Find::name }, $opts{dir} );
$stop_words = load_stopwords( $opts{stop} ) if $opts{stop};

process_files( \%index, \@files, \%opts, $stop_words );

untie %index;


sub load_stopwords {
    my $file = shift;
    my $words = {};
    local( *INFO, $_ );
    
    die "Cannot find stop file: $file\n" unless -e $file;
    
    open INFO, $file or die "$!\n";
    while ( &lt;INFO&gt; ) {
        next if /^#/;                        # skip comment lines
        $words-&gt;{lc $1} = 1 if /(\S+)/;      # first word on the line
    }
    
    close INFO;    
    return $words;
}

sub process_files {
    my( $index, $files, $opts, $stop_words ) = @_;
    local( *FILE, $_ );
    local $/ = "\n\n";                       # read a paragraph at a time
    
    for ( my $file_id = 0; $file_id &lt; @$files; $file_id++ ) {
        my $file = $files-&gt;[$file_id];
        my %seen_in_file;
        
        next unless -T $file;                # text files only
        
        print STDERR "Indexing $file\n";
        $index-&gt;{"!FILE_NAME:$file_id"} = $file;
        
        open FILE, $file or die "Cannot open file $file: $!\n";
        
        while ( &lt;FILE&gt; ) {
            
            tr/A-Z/a-z/ if $opts-&gt;{ignore};
            s/&lt;.+?&gt;//gs; # Woa! what about &lt; or &gt; in comments or js??
            
            while ( /([a-z\d]{2,})\b/gi ) {
                my $word = $1;
                next if $stop_words-&gt;{lc $word};
                next if $word =~ /^\d+$/ &amp;&amp; not $opts-&gt;{numbers};
                
                ( $word ) = stem( $word ) if $opts-&gt;{stem};
                
                # Append this file's ID to the word's colon-separated list,
                # but only once per file
                $index-&gt;{$word} = ( exists $index-&gt;{$word} ? 
                    "$index-&gt;{$word}:" : "" ) . "$file_id" unless 
                    $seen_in_file{$word}++;
            }
        }
    }
}

sub usage {
    my $usage = &lt;&lt;End_of_Usage;

Usage: $0 -dir directory [options]

The options are:
  -cache         DB_File cache size (in bytes)
  -index         Path to index, default: /usr/local/apache/data/index.db
  -ignore        Case-insensitive index
  -stop          Path to stopwords file
  -numbers       Include numbers in index
  -stem          Stem words

End_of_Usage
    return $usage;
}</pre></blockquote></div><p>We will use <a name="INDEX-2400" /><a name="INDEX-2401" />File::Find to get a list of all the files in the specified directory, as well as files in any subdirectories. 
The <a name="INDEX-2402" /><a name="INDEX-2403" />File::Basename module provides us with functions to extract the filename, given the full path. You might be wondering why we need this feature, considering that we can use a simple regular expression to get at the filename. If we use a regex, we have to determine which platform the application is running on and extract the filename accordingly. This module takes care of that for us.</p><p>We use <a name="INDEX-2404" />DB_File to create and store the index. Note that we could also store the index in an RDBMS, although a DBM file is certainly adequate for many sites. The method for creating indexes is the same no matter what type of format we use for storage. <a name="INDEX-2405" /> <a name="INDEX-2,406" /><a name="INDEX-2407" />Getopt::Long helps us handle command-line options, and <em class="filename">stem.pl</em><a name="INDEX-2408" /><a name="INDEX-2409" /><a name="INDEX-2410" /><a name="INDEX-2411" />, a Perl 4 library, has algorithms to automatically "stem" (or remove) word suffixes.</p><p>We use the <tt class="literal">DB_CACHE</tt><a name="INDEX-2412" /> <a name="INDEX-2,413" /> constant to hold the size of the DB_File <a name="INDEX-2414" />memory cache. Increasing the size of this cache (up to a certain point) improves the insertion rate at the expense of memory. In other words, it increases the rate at which we store the words in the index. A cache size of <tt class="literal">0</tt> is used as the default.</p><p><tt class="literal">DEFAULT_INDEX</tt><a name="INDEX-2415" /> contains the default path to the file that will hold our data. The user can specify a different file by using the <em class="emphasis">-index</em><a name="INDEX-2416" /> option, as you will see shortly.</p><p>The <em class="emphasis">GetOptions</em><a name="INDEX-2417" /> function (part of the Getopt::Long module) allows us to extract any command-line options and store them in a <a name="INDEX-2418" />hash. 
We pass a reference to a hash and a list of options to <em class="emphasis">GetOptions</em>. The options that take arguments end in "=s" to indicate that they each take a <a name="INDEX-2419" />string.</p><p>This application allows you to pass several options that will affect the indexing process. The <em class="emphasis">-dir</em><a name="INDEX-2420" /><a name="INDEX-2421" /><a name="INDEX-2422" /> option is the only one that is required, as it is used to specify the directory that contains the files to be indexed.</p><p>You can use the <em class="emphasis">-cache</em><a name="INDEX-2423" /><a name="INDEX-2424" /> option to specify the cache size and <em class="emphasis">-index</em><a name="INDEX-2425" /><a name="INDEX-2426" /> to specify the path to the index. The <em class="emphasis">-ignore</em><a name="INDEX-2427" /><a name="INDEX-2428" /> <a name="INDEX-2,429" /> <a name="INDEX-2,430" /> option creates an index where all the words are turned into lowercase (case-insensitive). This will increase the rate at which the index is created, as well as decrease the size of the index. If you want numbers in documents to be included in the index, you can specify the <em class="emphasis">-numbers</em><a name="INDEX-2431" /> option.</p><p>You can use the <em class="emphasis">-stop</em><a name="INDEX-2432" /> option to specify a file that contains "stop" words -- words that are generally found in most of your documents. Typical stop words include "a", "an", "to", "it", and "the", but you can also include words that are more specific to your documents.</p><p>Finally, the <em class="emphasis">-stem</em> option stems <a name="INDEX-2433" /><a name="INDEX-2434" /><a name="INDEX-2435" />word suffixes before storing them in the index. This will help us find words in documents much more easily. For example, if a user searches for "tomatoes", our search application will return documents that contain "tomato" as well as "tomatoes". 
An important note here is that stemming will also create a case-insensitive index.</p><p>Here's an example of how you would use these various options:</p><blockquote><pre class="code">$ ./indexer.pl -dir    /usr/local/apache/htdocs/sports \
               -cache  16000000 \
               -index  /usr/local/apache/data/sports.db \
               -stop   my_stop_words.txt \
               -stem</pre></blockquote><p><tt class="literal">%index</tt> is the hash that will hold the index. We use the <tt class="function">tie</tt><a name="INDEX-2436" /><a name="INDEX-2437" /> function to bind the hash to the file specified by <tt class="literal">$opts{index}</tt>. This allows us to transparently store a hash in a file, which we can later retrieve and modify. In this example, we are using DB_File, as it is faster and more efficient than other DBM implementations.</p><p>If the <em class="emphasis">-stem</em> option was used, we record this in our index so that our CGI script knows whether to apply stemming to the query as well; the <em class="emphasis">-ignore</em> option is recorded the same way. We could have stored this information in another database file, but that would require opening two files for each search. Instead, we prefix these keys with an exclamation point so that they can't collide with any of the words we're indexing.</p><p>We use the <em class="emphasis">find</em><a name="INDEX-2438" /> function (part of the File::Find module) to get a list of all the files in the specified directory. <em class="emphasis">find</em> expects the first argument to be a code reference, which can either be a reference to a subroutine or an inlined anonymous subroutine, as is the case above. As <em class="emphasis">find</em> traverses the directory (as well as all subdirectories), it executes the code specified by the first argument, setting the <tt class="literal">$File::Find::name</tt> variable to the path of the file. 
This builds an array of the paths to all the files under the original directory.</p><p>If a stop file was specified, we call the <tt class="function">load_stopwords</tt> function to read through the file and return a reference to a hash; the function dies if the file does not exist.</p><p>The most important function in this application is <em class="emphasis">process_files</em>, which iterates through all the files and stores the words in <tt class="literal">$index</tt>. Finally, we close the binding between the hash and the file and exit. At this point, we will have a file containing the index.</p><p>Let's look at the functions now. The <tt class="function">load_stopwords</tt> function opens the stop words file, ignores all comments (lines starting with "#"), and extracts the first word found on each line <tt class="literal">(\S+)</tt>.</p><p>The word is converted to lowercase by the <tt class="function">lc</tt> function and stored as a key in the hash referenced by <tt class="literal">$words</tt>. Since we are going to find words with mixed case in our files, it is much easier and quicker to compare them to this list if all our stop words are either completely uppercase or completely lowercase.</p><p>Before we discuss the <tt class="function">process_files</tt> function, let's look at the arguments it expects. The first argument, <tt class="literal">$index</tt>, is a reference to the index hash, which will eventually contain the words from all the files as well as pointers to the documents where they are found. <tt class="literal">$files</tt> is a reference to a list of all the files to parse. <tt class="literal">$opts</tt> is a reference to the hash of our <a name="INDEX-2439" />command-line options. The final argument, <tt class="literal">$stop_words</tt>, is a reference to a hash containing our stop words.</p><p>If the user chose to ignore case, we convert all words into lowercase, thus creating a <a name="INDEX-2440" />case-insensitive index.</p>
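<p>To make the layout of the index concrete, here is a minimal sketch of the same bookkeeping using a plain in-memory hash instead of a tied DB_File database. It is an illustration only, not the search script from the following section: the <tt class="literal">@docs</tt> array and the <tt class="function">lookup</tt> function are hypothetical names introduced for this example, and stop words, stemming, and HTML stripping are left out. Each word maps to a colon-separated list of file IDs, and each <tt class="literal">!FILE_NAME:</tt> key maps an ID back to its path.</p><blockquote><pre class="code">#!/usr/bin/perl -w
use strict;

# Hypothetical in-memory documents standing in for files on disk
my @docs = (
    [ "a.html", "Tomato soup and tomato salad" ],
    [ "b.html", "Potato soup" ],
);

my %index;    # a plain hash here; indexer.pl ties this to a DB_File database

for my $file_id ( 0 .. $#docs ) {
    my( $name, $text ) = @{ $docs[$file_id] };
    my %seen_in_file;

    $index{"!FILE_NAME:$file_id"} = $name;

    while ( $text =~ /([a-z\d]{2,})\b/gi ) {
        my $word = lc $1;                  # behaves like the -ignore option
        next if $seen_in_file{$word}++;    # list each file once per word
        $index{$word} = exists $index{$word} ? "$index{$word}:$file_id"
                                             : $file_id;
    }
}

# Resolve a word to the names of the files that contain it
sub lookup {
    my $word = lc shift;
    return ( ) unless exists $index{$word};
    return map { $index{"!FILE_NAME:$_"} } split /:/, $index{$word};
}

print join( ", ", lookup( "soup" ) ), "\n";      # prints: a.html, b.html
print join( ", ", lookup( "tomato" ) ), "\n";    # prints: a.html</pre></blockquote><p>Because the keys that carry metadata begin with an exclamation point, and indexed words consist only of letters and digits, a lookup for an ordinary word can never return one of the bookkeeping entries by accident.</p>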