seen in previous chapters, Perl is a perfect language for text manipulation and searching.
It is very efficient in processing files, which, combined with its powerful regular
expression capability, makes it a perfect language for this type of work.
<H4 ALIGN="CENTER"><A NAME="Heading31"></A><FONT COLOR="#000077">Introduction</FONT></H4>
<P>This example shows you how to provide a search engine into your own Web site.
The front end is a simple form with a text field, a Submit button, and a Reset button.
The back end recurses through your Web site's directories, scanning the HTML files
for the existence of the specified string. The resulting page will contain either
a message that no items have been found, or it will display a list of navigable links
to those pages that match the search criteria.
<H4 ALIGN="CENTER"><A NAME="Heading32"></A><FONT COLOR="#000077">Defining the Search
Scope</FONT></H4>
<P>The form for this example is a simple one. Listing 7.8 contains the code, which
uses <TT>CGI::Form</TT>.
<H3 ALIGN="CENTER"><A NAME="Heading33"></A><FONT COLOR="#000077">Listing 7.8. Subroutine
to return a search form.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub searchForm {
    my($q)=@_;
    print $q->header;
    print $q->start_html("Search My Site");
    print "<H1>Search My Site</H1>\n<HR>\n";
    print "<P>Please enter one or more words to search for";
    print " and click 'Search'<BR>\n";
    print $q->start_multipart_form();
    # The field name must match the one read back with param().
    print $q->textfield(-name=>'SearchString',-maxlength=>100,-size=>40);
    print "<BR><BR><BR>";
    print $q->submit(-name=>'Action',-value=>'Search');
    print " ";
    print $q->reset();
    print $q->endform();
    print $q->end_html();
}
</FONT></PRE>
<P>This form appears in your browser as shown in Figure 7.5. <BR>
<BR>
<A HREF="08wpp05.jpg" tppabs="http://210.32.137.15/ebook/Web%20Programming%20with%20Perl%205/08wpp05.jpg"><TT><B>Figure 7.5.</B></TT></A><TT> </TT>The search
form as it appears in your browser. <BR>
<BR>
When the user clicks Search, the real work begins. In this example, you will search
the entire site, but depending on the size of your site, you might want to limit
the search scope by adding another field to your form. This can be accomplished by
using a pull-down menu or a group of radio buttons.
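<P>As a rough illustration of limiting the scope, the sketch below maps the value of a scope field to a directory under the server root. The field values, the <TT>%scopeDirs</TT> table, and the <TT>scopeToDir()</TT> helper are all hypothetical names introduced here; only the server root path comes from this example.</P>

```perl
use strict;
use warnings;

# Hypothetical mapping from the value of a scope field (for example a
# pull-down menu) to a directory below the server root.
my %scopeDirs = (
    all      => '',
    products => '/products',
    support  => '/support',
);

# Return the directory to search for a submitted scope value.
# Unknown or missing values fall back to the whole site, so a
# hand-edited request cannot point the search outside the root.
sub scopeToDir {
    my ($serverRoot, $scope) = @_;
    $scope = 'all' unless defined $scope && exists $scopeDirs{$scope};
    return $serverRoot . $scopeDirs{$scope};
}

print scopeToDir('/user/bdeng/Web/docs', 'support'), "\n";
```

<P>Looking the value up in a fixed table, rather than appending it to the path directly, keeps the set of searchable directories under your control.</P>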
<H4 ALIGN="CENTER"><A NAME="Heading34"></A><FONT COLOR="#000077">The Power of Perl
in Text File Processing</FONT></H4>
<P>Now that you have the front end, it's time to write the search engine itself.
Use the <TT>File::Find</TT> library, available in the Perl distribution. This library
does all of the directory scanning for you, leaving you to simply implement the scanning
algorithm. For each search word, the algorithm counts the lines that contain it and
adds that number to a running total for the page. When it comes time to display the
search results, you can list the pages in descending order of hit count, which puts
the page the user most likely wants at the top. This approach should look familiar
if you have used one of the popular search sites on the Web.</P>
<P>Assuming you have extracted the list of words to search for, you'll simply write
a function that accepts a word list as an argument, along with the file to scan.
Let's leave it up to the <TT>File::Find</TT> module to pass you the files, as shown
in Listing 7.9.
<H3 ALIGN="CENTER"><A NAME="Heading35"></A><FONT COLOR="#000077">Listing 7.9. Subroutine
to search for a list of words.</FONT></H3>
<PRE><FONT COLOR="#0066FF">sub wanted {
    # Skip Unix-style hidden files and directories.
    return if $File::Find::name=~/\/\./;
    # Only look at HTML files.
    if ($File::Find::name=~/^.*\.html$/) {
        if (!open(IN, "<", $File::Find::name)) {
            # This error message will appear in your error_log file.
            warn "Cannot open file: $File::Find::name...$!\n";
            return;
        }
        my(@lines)=<IN>;
        close(IN);
        my($count)=0;
        foreach (@words) {
            # Match the word literally and case-insensitively.
            # quotemeta() escapes regular expression metacharacters.
            $word="(?i)".quotemeta($_);
            $count+=grep(/$word/,@lines);
        }
        if ($count>0) {
            # Add this page to the list of found items.
            push(@foundList,"$File::Find::name");
            # Store the hit count in an associative array
            # with the page as the key.
            $hitCounts{"$File::Find::name"}=$count;
        }
    }
}
</FONT></PRE>
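<P>To see the counting step in isolation, here is a minimal sketch that works on an in-memory list of lines instead of a file. The <TT>countHits()</TT> name and the sample data are made up for illustration, and the use of <TT>quotemeta()</TT> assumes that search words should be matched literally rather than as regular expressions.</P>

```perl
use strict;
use warnings;

# Count the lines that match each search word and sum the results.
# Takes array references so that it can be called without touching
# package globals.
sub countHits {
    my ($words, $lines) = @_;
    my $count = 0;
    foreach my $w (@$words) {
        # quotemeta() keeps characters such as "+" or "(" literal.
        my $pattern = quotemeta($w);
        # On the right-hand side of += grep() is in scalar context
        # and returns the number of matching lines.
        $count += grep { /$pattern/i } @$lines;
    }
    return $count;
}

my @lines = ("Perl and CGI\n", "More perl\n", "Nothing here\n");
print countHits(['perl', 'cgi'], \@lines), "\n";
```

<P>Note that a line containing the same word twice still counts once, because <TT>grep()</TT> counts matching lines, not individual occurrences.</P>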
<H3 ALIGN="CENTER">
<HR WIDTH="82%">
<FONT COLOR="#0066FF"><BR>
</FONT><FONT COLOR="#000077">NOTE:</FONT></H3>
<BLOCKQUOTE>
<P>If you are running on a UNIX system where the <TT>egrep</TT> command is available,
you should consider replacing the majority of this Perl code with a call to <TT>egrep</TT>,
as follows:</P>
<PRE><FONT COLOR="#0066FF">@hitList=`egrep -ci '(word1|word2|word3)' $File::Find::name`;</FONT></PRE>
</BLOCKQUOTE>
<PRE><FONT COLOR="#0066FF"></FONT></PRE>
<BLOCKQUOTE>
<P>This would be more efficient in terms of memory requirements and processor use.<BR>
<HR>
</BLOCKQUOTE>
<P><TT>File::Find</TT> contains a function called <TT>finddepth()</TT>, which takes
at least two arguments: a filter function and one or more directory names to recurse.
The filter function you are using is the one above called <TT>wanted()</TT>. <TT>finddepth()</TT> calls
<TT>wanted()</TT> for each file that it comes across. The filename is contained in
the variable <TT>$_</TT>. The file path is contained in the variable <TT>$File::Find::dir</TT>.
You have used the variable <TT>$File::Find::name</TT>, which is the combination of
the other two variables, with a path separator stuck in between. By using the functionality
provided by <TT>File::Find</TT>, all you need to do is add in your search filter
and not worry about recursion and figuring out what's a file and what's a directory.</P>
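<P>The relationship between the three variables is easiest to see by printing them. The following sketch builds a small throwaway directory tree (the file names and the <TT>showEntry()</TT> callback are invented for this illustration) and lets <TT>finddepth()</TT> walk it.</P>

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Build a small throwaway tree so the example does not depend on a real site.
my $root = tempdir(CLEANUP => 1);
make_path("$root/docs/sub");
foreach my $f ("$root/docs/index.html", "$root/docs/sub/page.html") {
    open(my $fh, '>', $f) or die "Cannot create $f: $!";
    print $fh "<HTML></HTML>\n";
    close($fh);
}

my @seen;
sub showEntry {
    # $_ is the bare name, $File::Find::dir the directory being visited,
    # and $File::Find::name the two joined with a "/".
    return unless -f $_;
    push(@seen, $File::Find::name);
    print "$File::Find::dir + $_ = $File::Find::name\n";
}

finddepth(\&showEntry, $root);
```

<P>The <TT>-f $_</TT> test works here because <TT>File::Find</TT> changes into each directory before calling the callback, which is its default behavior.</P>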
<P>The code used to initiate the search looks like this:</P>
<PRE><FONT COLOR="#0066FF">@words=split(/ /,$q->param('SearchString'));
if (@words>0) {
    finddepth(\&wanted,"/user/bdeng/Web/docs");
}
</FONT></PRE>
<P>It's a good idea to check that the <TT>@words</TT> array contains at least one
value; there is no need to make <TT>finddepth()</TT> do all that work if you
have nothing to search for. When the array is empty, you might emit some HTML that
politely reminds the user to specify something to search for.
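<P>One way to make that check reliable is to normalize the input before counting the words. The sketch below uses a hypothetical <TT>parseWords()</TT> helper and a hard-coded string in place of the form value; it is one possible approach, not the chapter's own code.</P>

```perl
use strict;
use warnings;

# Split the raw field value into search words. split(' ', ...) skips
# leading whitespace and treats any run of whitespace as one separator,
# so stray spaces never produce empty words.
sub parseWords {
    my ($raw) = @_;
    return () unless defined $raw;
    return split(' ', $raw);
}

my @words = parseWords("  perl   cgi ");
if (@words > 0) {
    print "Searching for: @words\n";
} else {
    print "Please enter at least one word to search for.\n";
}
```

<P>Filtering out empty words matters because an empty pattern matches every line, which would make every page look like a hit.</P>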
<H4 ALIGN="CENTER"><A NAME="Heading37"></A><FONT COLOR="#000077">Displaying the Results</FONT></H4>
<P>All you need to do now is display the results in a meaningful format. What you're
aiming for is an ordered list of likely candidates for what the user is trying to
find. You have an array of pages and an associative array of hit counts. What you
need first is a sort routine to rearrange the array in the correct order. The following
sort routine should work just fine:</P>
<PRE><FONT COLOR="#0066FF">@foundList = sort sortByHitCount @foundList;
sub sortByHitCount {
    return $hitCounts{$b}- $hitCounts{$a};
}
</FONT></PRE>
<P>The first line in this code is the call to <TT>sort()</TT>, using the subroutine
<TT>sortByHitCount()</TT>. The <TT>$a</TT> and <TT>$b</TT> variables are package
global variables that <TT>sort()</TT> uses to tell the sorting routine which items
to compare. The items that you're comparing in this case are filenames that are keys
into the <TT>hitCounts</TT> associative array. Returning a negative value indicates
that <TT>$a</TT> is less than <TT>$b</TT>, and returning a positive value indicates
<TT>$a</TT> is greater than <TT>$b</TT>. Returning <TT>0</TT> indicates that the
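two values are equal. What you are actually comparing in <TT>sortByHitCount()</TT>
is the hit count of each page; subtracting the count for <TT>$a</TT> from the count
for <TT>$b</TT> sorts the pages in descending order.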
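<P>The same comparison can be written with Perl's numeric comparison operator, <TT>&lt;=&gt;</TT>, which returns -1, 0, or 1. The sketch below uses made-up page names and counts, and adds a name comparison as a tie-breaker, which is an extra not present in the routine above.</P>

```perl
use strict;
use warnings;

# Sample data standing in for the results of a search.
my %hitCounts = ('/a.html' => 2, '/b.html' => 7, '/c.html' => 2);

# $a and $b are set by sort(). Comparing $b's count against $a's puts
# the highest counts first; the name comparison breaks ties so the
# order is the same from run to run.
my @foundList = sort {
    $hitCounts{$b} <=> $hitCounts{$a} or $a cmp $b
} keys %hitCounts;

print join("\n", @foundList), "\n";
```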
<H3 ALIGN="CENTER">
<HR WIDTH="83%">
<BR>
<FONT COLOR="#000077">NOTE:</FONT></H3>
<BLOCKQUOTE>
<P>Remember that in the previous example, the <TT>%hitCounts</TT> associative array
must be within the scope of the <TT>sortByHitCount</TT> function. It would be a very
difficult problem to debug if you decided to move <TT>sortByHitCount</TT> into
a different package scope one day.<BR>
<HR>
</BLOCKQUOTE>
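<P>If you want to avoid that dependency altogether, one option is to pass the counts to the sorting code explicitly. This is a sketch under that assumption; <TT>sortPagesByHits()</TT> and the sample data are hypothetical.</P>

```perl
use strict;
use warnings;

# Sort page names by descending hit count. The counts arrive as a hash
# reference, so the comparison does not depend on a package global
# being visible from wherever this subroutine lives.
sub sortPagesByHits {
    my ($hits, @pages) = @_;
    return sort { $hits->{$b} <=> $hits->{$a} } @pages;
}

my %counts = ('/x.html' => 1, '/y.html' => 5);
my @ordered = sortPagesByHits(\%counts, keys %counts);
print "@ordered\n";
```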
<P>Now you have a sorted list of files that need converting to URLs. To do this,
you simply chop off the first n characters, where n is the length of the <TT>$serverRoot</TT>
variable. This can be done with the following line:</P>
<PRE><FONT COLOR="#0066FF">$url=substr($file,length($serverRoot));
</FONT></PRE>
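<P>Chopping off a fixed number of characters assumes that every file really does live under the server root. A slightly more defensive version, sketched here with a hypothetical <TT>fileToUrl()</TT> helper, checks the prefix first.</P>

```perl
use strict;
use warnings;

# Convert a file path below $serverRoot to a site-relative URL.
# Returns undef when the path is not under the root, instead of
# silently chopping characters off an unrelated path.
sub fileToUrl {
    my ($file, $serverRoot) = @_;
    return undef unless index($file, "$serverRoot/") == 0;
    return substr($file, length($serverRoot));
}

print fileToUrl('/user/bdeng/Web/docs/perl/index.html', '/user/bdeng/Web/docs'), "\n";
```

<P>Appending the slash before the <TT>index()</TT> test keeps a sibling directory whose name merely starts with the same characters from being mistaken for part of the site.</P>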
<P>You can now format the string as a link by adding the <TT>&lt;A&gt;</TT> tag around
the <TT>$url</TT>. The final main code appears in Listing 7.10.
<H3 ALIGN="CENTER"><A NAME="Heading39"></A><FONT COLOR="#000077">Listing 7.10. A
simple CGI searching program.</FONT></H3>
<PRE><FONT COLOR="#0066FF">#!/public/bin/perl5
use CGI::Form;
use File::Find;
# Variables for storing the search criteria/results.
@words;
@foundList;
%hitCounts;
$q = new CGI::Form;
$serverRoot="/user/bdeng/Web/docs";
if ($q->cgi->var('REQUEST_METHOD') eq 'GET') {
    &searchForm($q);
} else {
    @words=split(/ /,$q->param('SearchString'));
    print $q->header;
    print $q->start_html("Search Results");
    print "<H1>Search Results</H1>\n<HR>\n";
    if (@words>0) {
        finddepth(\&wanted,$serverRoot);
        @foundList = sort sortByHitCount @foundList;
        if (@foundList>0) {
            foreach $file (@foundList) {