<p>We set <a name="INDEX-2441" /><a name="INDEX-2442" />Perl's default input record separator, <tt class="literal">$/</tt>, to paragraph mode. In other words, one read on a file handle returns a whole paragraph rather than a single line. This allows us to index the files at a faster rate.</p>
<p>We iterate through the <tt class="literal">@$files</tt> array with a <tt class="function">for</tt> loop, storing the key in <tt class="literal">$file_id</tt> and the value of the current file in <tt class="literal">$file</tt>. Since this application creates a human-searchable index, we deal only with text files. We use the <em class="emphasis">-T</em><a name="INDEX-2443" /> <a name="INDEX-2444" /><a name="INDEX-2445" /> operator to skip any non-text files.</p>
<p>The first entry into the <tt class="literal">%$index</tt> hash is a "unique" key that associates a number with the full path to the file. Since this hash will also hold all the words that we find, we use the "!FILE_NAME" string to keep our number-to-file mappings separate from the words.</p>
<p>We start our indexing process by iterating through the file a paragraph at a time; the <tt class="literal">$_</tt> variable holds the contents. If the <em class="emphasis">-case</em> option was specified by the user, we convert the paragraph that we have just read to lowercase.</p>
<p>We also strip all <a name="INDEX-2446" /><a name="INDEX-2447" />HTML tags from the paragraph, since we don't want them to be indexed. The <a name="INDEX-2448" />regexp looks for a string starting with "&lt;", followed by one or more characters (including newlines), up to the first "&gt;".</p>
<p>We iterate through the paragraph using a <a name="INDEX-2449" />regex that extracts words at least two characters long, matching letters as well as digits (\d matches "0-9"). 
The matched word is stored in <tt class="literal">$1</tt>.</p>
<p>Before we check whether the word we extracted is a stop word, we need to convert it to lowercase, since we converted all the stop words to lowercase earlier in this script. If the word is, indeed, a stop word, we skip it and continue. We also skip numbers if the <em class="emphasis">-numbers</em><a name="INDEX-2450" /><a name="INDEX-2451" /> option is not specified.</p>
<p>If the <em class="emphasis">-stem</em> option is specified, we call the <em class="emphasis">stem</em> function (part of the <em class="emphasis">stem.pl</em> library) to remove all <a name="INDEX-2452" /><a name="INDEX-2453" />suffixes from the word and convert it to lowercase.</p>
<p>Finally, we are ready to store the word in the index, where the value represents the file that we are currently parsing. Unfortunately, this isn't that simple. The last command is a little long and complicated. It helps to read it backwards. First, we check whether we have seen the <a name="INDEX-2454" /><a name="INDEX-2455" />word in this file previously by using the <tt class="literal">%seen_in_file</tt> hash. The first time through, there is no entry in the hash, so the lookup evaluates to false (and thus passes the <tt class="function">unless</tt> check); thereafter, the entry contains the number of times we have seen the word in the file and evaluates to true (and thus fails the <tt class="function">unless</tt> check). So the first time we see the word in the file, we add it to our index. If the word was previously indexed for another file, then we join the <tt class="literal">$file_id</tt> of this file to the previous entry with a colon. 
Otherwise, we just add <tt class="literal">$file_id</tt> as this word's only value thus far.</p>
<p>When this function finishes, the <tt class="literal">%$index</tt> hash will look something like this:</p>
<blockquote><pre class="code">$index = {
    "!FILE_NAME:1"  =&gt;
        "/usr/local/apache/htdocs/sports/sprint.html",
    "!FILE_NAME:2"  =&gt;
        "/usr/local/apache/htdocs/sports/olympics.html",
    "!FILE_NAME:3"  =&gt;
        "/usr/local/apache/htdocs/sports/celtics.html",
    browser         =&gt; "1:2",
    code            =&gt; "3",
    color           =&gt; "2:3",
    comment         =&gt; "2",
    content         =&gt; "1",
    cool            =&gt; "2:3",
    copyright       =&gt; "1:2:3"
};</pre></blockquote>
<p>Now, we are ready to implement the CGI application that will search this index.</p>
<a name="ch12-2-fm2xml" /><div class="sect2"><h3 class="sect2">12.3.1. Search Application</h3>
<p>The<a name="INDEX-2456" /> <a name="INDEX-2457" /> indexer application makes our life easier when it comes time to write the CGI application that performs the actual search. The CGI application should parse the form input, open the DBM file created by the indexer, search for possible matches, and then return HTML output.</p>
<p><a href="ch12_03.htm#ch12-73591">Example 12-4</a> contains the program.</p>
<a name="ch12-73591" /><div class="example"><h4 class="objtitle">Example 12-4. 
indexed_search.cgi </h4><a name="INDEX-2458" /><blockquote><pre class="code">#!/usr/bin/perl -wT

use DB_File;
use CGI;
use CGIBook::Error;
use File::Basename;
require "stem.pl";

use strict;
use constant INDEX_DB =&gt; "/usr/local/apache/data/index.db";

my( %index, $paths, $path );

my $q     = new CGI;
my $query = $q-&gt;param("query");
# Split on commas and/or whitespace. The pattern has no capturing group,
# so split does not return the delimiters as extra list elements.
my @words = split /[\s,]+/, $query;

tie %index, "DB_File", INDEX_DB, O_RDONLY, 0640
    or error( $q, "Cannot open database" );

$paths = search( \%index, \@words );

# Escape the user-supplied query before echoing it back in the page
print $q-&gt;header,
      $q-&gt;start_html( "Inverted Index Search" ),
      $q-&gt;h1( "Search for: " . $q-&gt;escapeHTML( $query ) );

unless ( @$paths ) {
    print $q-&gt;h2( $q-&gt;font( { -color =&gt; "#FF0000" },
                             "No Matches Found" ) );
}

foreach $path ( @$paths ) {
    my $file = basename( $path );
    next unless $path =~ s/^\Q$ENV{DOCUMENT_ROOT}\E//o;
    $path = to_uri_path( $path );
    print $q-&gt;a( { -href =&gt; "$path" }, "$path" ), $q-&gt;br;
}

print $q-&gt;end_html;
untie %index;

sub search {
    my( $index, $words ) = @_;
    my $do_stemming = exists $index-&gt;{"!OPTION:stem"} ? 1 : 0;
    my $ignore_case = exists $index-&gt;{"!OPTION:ignore"} ? 1 : 0;
    my( %matches, $word, $file_index );

    foreach $word ( @$words ) {
        my $match;

        if ( $do_stemming ) {
            my( $stem )  = stem( $word );
            $match = $index-&gt;{$stem};
        }
        elsif ( $ignore_case ) {
            $match = $index-&gt;{lc $word};
        }
        else {
            $match = $index-&gt;{$word};
        }

        next unless $match;

        # Count, per file, how many of the query words it contains
        foreach $file_index ( split /:/, $match ) {
            my $filename = $index-&gt;{"!FILE_NAME:$file_index"};
            $matches{$filename}++;
        }
    }

    # Ascending by match count, then by age in days (-M). Swap $a and $b
    # in the first comparison to list the files with the most matches first.
    my @files = map  { $_-&gt;[0] }
                sort { $matches{$a-&gt;[0]} &lt;=&gt; $matches{$b-&gt;[0]} ||
                       $a-&gt;[1] &lt;=&gt; $b-&gt;[1] }
                map  { [ $_, -M $_ ] }
                keys %matches;

    return \@files;
}

sub to_uri_path {
    my $path = shift;
    my( $name, @elements );

    do {
        ( $name, $path ) = fileparse( $path );
        unshift @elements, $name;
        chop $path;
    } while $path;

    # The leading empty element yields an absolute path ("/a/b.html")
    return join '/', '', @elements;
}</pre></blockquote></div>
<p>The modules should be familiar to you by now. The<a name="INDEX-2459" /> <tt class="literal">INDEX_DB</tt> constant contains the<a name="INDEX-2460" /> path of the index created by the indexer application.</p>
<p>Since a query can include multiple words, we split it on any whitespace or commas and store the resulting words in the <tt class="literal">@words</tt> array. We use <em class="emphasis">tie</em><a name="INDEX-2461" /> to open the index DBM file in read-only mode. In other words, we bind the<a name="INDEX-2462" /> <a name="INDEX-2463" /><a name="INDEX-2464" /> index file to the <tt class="literal">%index</tt> hash. 
If we cannot open the file, we call our <em class="emphasis">error</em> function to return an error to the browser.</p>
<p>The real searching is done, appropriately enough, in the <tt class="function">search</tt> function, which takes a reference to the index hash and a reference to the list of words we are searching for. The first thing we do is peek into the index to see whether the stem and ignore-case options were set when the index was built. We then iterate through the <tt class="literal">@$words</tt> array, searching for possible matches. If stemming was enabled, we stem the word and look that up. Otherwise, we check whether the particular word exists in the index as-is, or as a lowercase word if the index is not case-sensitive. If any of these lookups succeeds, we have a match. Otherwise, we ignore the word and continue.</p>
<p>If there is a match, we split the <a name="INDEX-2465" /><a name="INDEX-2466" />colon-separated list of file IDs where that particular word is found. Since we don't want duplicate entries in our final list, we store the full path of each matching file as a key in the <tt class="literal">%matches</tt> hash, incrementing its value once for every query word found in that file.</p>
<p>After the loop has finished executing, we are left with the matching files in <tt class="literal">%matches</tt>. We would like to add some order to our results and display them according to the number of words matching and then by the<a name="INDEX-2467" /> <a name="INDEX-2468" /> file's modification time. 
So, we sort the keys according to the number of matches and then by the age returned by the <em class="emphasis">-M</em><a name="INDEX-2469" /><a name="INDEX-2470" /> operator, so that among files with the same number of matches the recently modified ones come first, and store the result in the <tt class="literal">@files</tt> array.</p>
<p>We could calculate the modification time of the files during each comparison like this:</p>
<blockquote><pre class="code">my @files = sort { $matches{$a} &lt;=&gt; $matches{$b} ||
                   -M $a &lt;=&gt; -M $b }
            keys %matches;</pre></blockquote>
<p>However, this is inefficient because we might calculate the modification time for each file multiple times. A more efficient algorithm involves precalculating the modification times, as we have done in the program.</p>
<p>This strategy has become known as the <a name="INDEX-2471" /> <a name="INDEX-2472" />Schwartzian Transform, made famous by Randal Schwartz. It's beyond the scope of this book to explain this, but if you're interested, see Joseph Hall's explanation of the Transform, located at: <a href="http://www.5sigma.com/perl/schwtr.html">http://www.5sigma.com/perl/schwtr.html</a>. Ours is a slight variation because we perform a two-part sort.</p>
<p>We output the HTTP and HTML document headers and then check whether we have any matches. If not, we return a simple message. Otherwise, we iterate through the <tt class="literal">@$paths</tt> array (the list returned by <tt class="function">search</tt>), setting <tt class="literal">$path</tt> to the current element each time through the loop. We strip off the part of the path that matches the server's root directory. That should give us the <a name="INDEX-2473" /><a name="INDEX-2474" />path that corresponds to a URL. However, on non-Unix filesystems, we won't have <a name="INDEX-2475" /> <a name="INDEX-2476" /> <a name="INDEX-2477" />forward slashes ("/") separating directories. 
So we call the <tt class="function">to_uri_path</tt><a name="INDEX-2478" /> <a name="INDEX-2479" /> function, which uses the File::Basename module to strip off successive elements of the path and then rebuild it with forward slashes. Note that this will work on many operating systems, such as Win32 and MacOS, but it will not work on systems that do not use a single character to delimit parts of the path, such as VMS (although the chances that you're actually doing CGI development on a VMS machine are pretty slim).</p>
<p>We build proper links with this newly formatted path, print the remainder of our results, <a name="INDEX-2480" />close the binding between the <a name="INDEX-2481" /><a name="INDEX-2482" />database <a name="INDEX-2483" /><a name="INDEX-2484" /><a name="INDEX-2485" />and the hash, <a name="INDEX-2486" /><a name="INDEX-2487" /><a name="INDEX-2488" /><a name="INDEX-2489" />and exit.</p></div>
<hr align="left" width="515" />
<p><font size="-1"><a href="copyrght.htm">Copyright &copy; 2001</a> O'Reilly &amp; Associates. All rights reserved.</font></p></body></html>