http://www.cs.washington.edu/research/projects/softbots/papers/metacrawler/www4/html/overview.html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!-- Converted with LaTeX2HTML 95.1 (Fri Jan 20 1995) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HEAD><TITLE>Multi-Service Search and Comparison Using the MetaCrawler</TITLE></HEAD>
<BODY>
<H1>Multi-Service Search and Comparison Using the MetaCrawler</H1>
Erik Selberg<br>
Oren Etzioni
<P>
<H3>Abstract:</H3>
<EM>Standard Web search services, though useful, are far from ideal. There are over a dozen different search services currently in existence, each with a unique interface and a database covering a different portion of the Web. As a result, users are forced to repeatedly try and retry their queries across different services. Furthermore, the services return many responses that are irrelevant, outdated, or unavailable, forcing the user to manually sift through the responses searching for useful information.
<P>
This paper presents the <A NAME=tex2html2 HREF="http://www.cs.washington.edu/research/metacrawler">MetaCrawler</A>, a fielded Web service that represents the next level up in the information ``food chain.'' The MetaCrawler provides a single, central interface for Web document searching. Upon receiving a query, the MetaCrawler posts the query to multiple search services in parallel, collates the returned references, and loads those references to verify their existence and to ensure that they contain relevant information. The MetaCrawler is sufficiently lightweight to reside on a user's machine, which facilitates customization, privacy, sophisticated filtering of references, and more.
<P>
The MetaCrawler also serves as a tool for comparison of diverse search services.
Using the MetaCrawler's data, we present a ``Consumer Reports'' evaluation of six Web search services: Galaxy[<A HREF="#wwwgalaxy">5</A>], InfoSeek[<A HREF="#wwwinfoseek">1</A>], Lycos[<A HREF="#wwwlycos">15</A>], Open Text[<A HREF="#wwwopentext">20</A>], WebCrawler[<A HREF="#wwwwebcrawler">22</A>], and Yahoo[<A HREF="#wwwyahoo">9</A>]. In addition, we report on the most commonly submitted queries to the MetaCrawler.</EM>
<P>
<H3>Keywords:</H3>
MetaCrawler, WWW, World Wide Web, search, multi-service, multi-threaded, parallel, comparison
<P>
<H2><A NAME=SECTION00010000000000000000>Introduction</A></H2>
<P>
Web search services such as Lycos and WebCrawler have proven both useful and popular. As the Web grows, the number and variety of search services are increasing as well. Examples include: the Yahoo ``net directory''; the Harvest home page search service[<A HREF="#wwwharvesthomepage">7</A>]; the Query By Image Content service[<A HREF="#wwwqbic">12</A>]; the Virtual Tourist[<A HREF="#wwwvirtourist">24</A>], a directory organized by geographic regions; and more. Since each service provides an incomplete snapshot of the Web, users are forced to try and retry their queries across different indices until they find appropriate responses. The process of querying multiple services is quite tedious, as each service has its own idiosyncratic interface which the user is forced to learn. Further, the services return many responses that are irrelevant, outdated, or unavailable, forcing the user to manually sift through the responses searching for useful information.
<P>
This paper presents the MetaCrawler, a search service that attempts to address the problems outlined above. The premises underlying the MetaCrawler are the following:
<UL>
<LI> No single search service is sufficient.
Table <A HREF="#followedtable">2</A> shows that no single service is able to return more than 45% of the references followed by users.
<P>
<LI> Many references returned by services are irrelevant and can be removed if the user is better able to express the query. Table <A HREF="#deathbytable">3</A> shows that up to 75% of the references returned can be removed if the user supplies a more expressive query.
<P>
<LI> Low-quality references can be detected and removed fairly quickly. Table <A HREF="#benchtime">4</A> shows that an average of about 100 references can be verified in well under 2.5 minutes, while simple collation and ranking takes under 30 seconds.
<P>
<LI> These features will be used by the Web's population. The MetaCrawler is receiving over 7,000 queries per week, and that number is growing, as shown in Figure <A HREF="#usegraph">1</A>.
<P>
<LI> The MetaCrawler log facilitates an objective evaluation and comparison of the underlying search services. Tables <A HREF="#cctotalhits">5</A>-<A HREF="#ccperformance">8</A> detail trade-offs between the services. For example, Lycos returns over 5% more followed references than any other service, yet WebCrawler is the fastest, taking an average of 9.64 seconds to return answers to queries.
<P>
</UL>
<P>
The MetaCrawler logs also reveal that people search for a wide variety of information, from ``A. H. Robins'' to ``zyxmusic.'' While the most common queries are related to sex and pornography, these account for under 4% of the total queries submitted to the MetaCrawler, as shown in Table <A HREF="#toptenqueries">1</A>.
Nearly half of all queries submitted are unique.
<P>
The remainder of this paper is organized as follows: the design and implementation of the MetaCrawler are described in Section <A HREF="#metacrawler">2</A>. Experiments to validate the above premises are described in Section <A HREF="#evaluation">3</A>. We discuss related work in Section <A HREF="#relatedwork">4</A>, and our ideas for future work and potential impact appear in Section <A HREF="#futurework">5</A>. We conclude with Section <A HREF="#conclusions">6</A>.
<P>
<A NAME=metacrawler></A>
<H2><A NAME=SECTION00020000000000000000>The MetaCrawler</A></H2>
The <A NAME=tex2html3 HREF="http://metacrawler.cs.washington.edu:8080">MetaCrawler</A> is a free search service used for locating information available on the World Wide Web. The MetaCrawler has an interface similar to WebCrawler and Open Text in that it allows users to enter a search string, or <em>query</em>, and returns a page with click-able references, or <em>hits</em>, to pages available on the Web. However, the internal architecture of the MetaCrawler is radically different from that of the other search services.
<P>
Standard Web searching consists of three activities:
<UL>
<LI> <em>Indexing</em> the Web for new and updated pages, a process that demands substantial CPU and network resources.
<P>
<LI> <em>Storage</em> of the Web pages retrieved into an index, which typically requires a large amount of disk space.
<P>
<LI> <em>Retrieval</em> of pages matching user queries. For most services, this amounts to returning a ranked list of page references from the stored index.
</UL>
<P>
Standard search services create and store an index of the Web as well as retrieve information from that index.
Unlike these services, the MetaCrawler is a <em>meta-service</em> which uses no internal database of its own; instead, it relies on external search services to provide the information necessary to fulfill user queries. The insight here is that by separating the retrieval of pages from indexing and storing them, a lightweight application such as the MetaCrawler can access multiple databases and thus provide a larger number of potentially higher-quality references than any search service tied to a single database.
<P>
Another advantage of the MetaCrawler is that it does not depend upon the implementation or existence of any one search service. Some indexing mechanism is necessary for the Web. Typically, this is done using automated robots or spiders, which may not necessarily be the best choice[<A HREF="#kosterrobots">13</A>]. However, the underlying architecture of the search services used by the MetaCrawler is unimportant. As long as there is no central, complete search service and several partial search services exist, the MetaCrawler can provide the benefit of accessing them simultaneously and collating the results.
<P>
The MetaCrawler prototype has been publicly accessible since July 7, 1995. It has been steadily growing in popularity, logging upwards of 7,000 queries per week and increasing. The MetaCrawler currently accesses six services: Galaxy, InfoSeek, Lycos, Open Text, WebCrawler, and Yahoo. It works as follows: given a query, the MetaCrawler will submit that query to every search service it knows in parallel. These services then return a list of references to WWW pages, or hits. Upon receiving the hits from every service, the MetaCrawler <em>collates</em> the results by merging all hits returned. Duplicate hits are listed only once, but each service that returned a hit is acknowledged. Expert user-supplied sorting options are applied at this time. Optionally, the MetaCrawler will <em>verify</em> the information's existence by loading the reference.
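<P>
The parallel fan-out and collation just described can be sketched as follows. This is an illustrative reconstruction, not the prototype's code (which was written in C++): the per-service fetch functions are hypothetical stand-ins for real service back-ends that would return <tt>(url, score)</tt> hits.

```python
from concurrent.futures import ThreadPoolExecutor

def collate(query, services):
    """Query every service in parallel and merge the returned hits.

    `services` maps a service name to a function taking a query string
    and returning a list of (url, score) hits. Duplicate URLs are
    listed once, but every service that returned them is acknowledged.
    """
    with ThreadPoolExecutor(max_workers=len(services)) as pool:
        # Submit the query to all services at once.
        futures = {name: pool.submit(fetch, query)
                   for name, fetch in services.items()}
        merged = {}
        for name, future in futures.items():
            for url, score in future.result():
                entry = merged.setdefault(url, {"score": 0.0, "sources": []})
                entry["score"] = max(entry["score"], score)
                entry["sources"].append(name)  # acknowledge this service
    # Present hits sorted by confidence score, best first.
    return sorted(merged.items(), key=lambda kv: -kv[1]["score"])
```

For example, with two toy back-ends that both return <tt>http://x.example</tt>, the merged entry for that URL carries one line but acknowledges both services.
<P>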
When the MetaCrawler has loaded a reference, it is then able to re-score the page using supplementary query syntax supplied by the user.
<P>
When the MetaCrawler has finished processing all of the hits, the user is presented with a page consisting of a sorted list of references. Each reference contains a click-able hypertext link to the reference, followed by local page context (if available), a confidence score, verified keywords, and the actual URL of the reference. Each word in the search query is automatically boldfaced. So that we may determine which references are followed, each click-able link returned to the user points not to the reference, but to a script which logs that the reference was followed and then refers the user's browser to the correct URL.
<P>
Querying many services and simply collating results will return more results than any one service, but at the cost of presenting the user with more irrelevant references. The MetaCrawler is designed to increase <em>both</em> the number of hits and the relevance of the hits returned. The MetaCrawler yields a higher proportion of <em>relevant</em> hits by using both a powerful query syntax and expert options, so that users can more easily instruct the MetaCrawler how to determine the quality of the returned references. The query syntax specifies required and non-desired words, as well as words that should appear as a phrase. The expert options allow users to rank hits by physical locality, such as the user's country, as well as logical locality, such as their Internet domain.
<P>
<A NAME=details></A>
<H3><A NAME=SECTION00021000000000000000>User Interface</A></H3>
<P>
While giving the user a Web form with added expressive power was easy, presenting the user with a form that would facilitate actually using the novel features of the MetaCrawler proved to be a challenge.
We strove for a balance between a simple search form and an expressive one, keeping in mind interface issues mentioned by service providers[<A HREF="#webcrawlerwww2">23</A>].
<P>
In our early designs, we focused on syntax for queries with several additional options for improving the result. This syntax was similar to InfoSeek's query syntax: parentheses were used to define phrases, a plus sign designated a required word, and a minus sign designated a non-desired word. For example, to search for ``John Cleese,'' naturally requiring that both ``John'' and ``Cleese'' appear together, the syntax required was the unwieldy <tt>(+John +Cleese)</tt>. Not surprisingly, we discovered that while most users attempted to use the syntax, they often introduced subtle syntactic errors, causing the resulting search to produce an entirely irrelevant set of hits.
<P>
In our current design, we have reduced the need for extra syntax, and instead ask the user to select the type of search. The three options are:
<UL>
<LI> <em>Search for words as a phrase:</em><BR> Treat the query text as a single phrase, and attempt to match the phrase in pages retrieved, e.g. ``Four score and seven years ago.''
<LI> <em>Search for all words:</em><BR> Attempt to find each word of the query text somewhere in the retrieved pages. This is the equivalent of logical ``and.''
<LI> <em>Search for any words:</em><BR> Attempt to find any word of the query text in the retrieved pages. This is the equivalent of logical ``or.''
</UL>
The older syntax is still supported, although it is not advertised prominently on the main search page, save for the minus sign, which was the most widely used element of the query syntax. Since we changed the search page to this new design, the number of malformed requests has dropped significantly.
<P>
In addition to the query entry box, we maintain various expert options which can be activated via menus. The MetaCrawler currently uses two menus to provide extra expressiveness.
The first menu describes coarse-grained locality, with options for the user's continent, country, and Internet domain, as well as options to select a specific continent. The second menu describes the sundry Internet domain types, e.g. <tt>.edu</tt>, <tt>.com</tt>, etc. These options allow users to better describe what they are looking for in terms of where they believe the relevant information will be.
<P>
<A NAME=clientserverdesign></A>
<H3><A NAME=SECTION00022000000000000000>Client-Server Design</A></H3>
<P>
Current search services amortize the cost of indexing and storing pages over hundreds of thousands of retrievals per day. In order to field the maximal number of retrievals, services devote minimal effort to responding to each individual query. Increases in server capacity are quickly gobbled up by increases in pages indexed and queries per day. As a result, there is little time for more sophisticated analysis, filtering, and post-processing of responses to queries.
<P>
By decoupling the retrieval of pages from indexing and storage, the MetaCrawler is able to spend time performing sophisticated analysis on pages returned. The MetaCrawler just retrieves data, spending no time indexing or storing it. Thus, the MetaCrawler is relatively lightweight: the prototype, written in C++, comes to only 3,985 lines of code including comments. It does not need the massive disk storage to maintain an index, nor does it need the massive CPU and network resources that other services require. Consequently, a MetaCrawler client could reside comfortably on an individual user's machine.
<P>
An individualized MetaCrawler <i>client</i> that accesses multiple Web search services has a number of advantages. First, the user's machine bears the load of the post-processing and analysis of the returned references. Given extra time, post-processing can be quite sophisticated.
For example, the MetaCrawler could divide references into clusters based on their similarity to each other, or it could engage in <i>secondary search</i> by following references to related pages to determine potential interest. Second, the processing can be customized to the user's taste and needs. For example, the user may choose to filter advertisements, or parents may try to block X-rated pages. Third, the MetaCrawler could support scheduled queries (e.g., what's new today about the Seattle Mariners?). By storing the results of previous queries on the user's machine, the MetaCrawler can focus its output on new or updated pages. Finally, for pay-per-query services, the MetaCrawler can be programmed with selective query policies (e.g., ``go to the cheapest service first'' or even ``compute the optimal service querying sequence'').
<P>
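A client-side post-processing hook of the kind described above could be sketched as follows. This is a minimal illustration of the idea, not the prototype's actual code; the function name, the blocked-domain list, and the required-word filter are all hypothetical.

```python
def filter_hits(hits, blocked_domains=(), required_words=()):
    """Client-side post-processing of collated hits.

    Drops hits whose URL falls in a blocked domain (e.g. advertisers)
    and keeps only pages whose local context mentions every required
    word, mimicking user-customized filtering on the client machine.
    """
    kept = []
    for url, context in hits:
        if any(domain in url for domain in blocked_domains):
            continue  # user chose to filter this domain entirely
        text = context.lower()
        if all(word.lower() in text for word in required_words):
            kept.append((url, context))
    return kept
```

Because the filter runs on the user's machine rather than the server, it can be arbitrarily personal (ad lists, parental blocks, scheduled-query deltas) without adding load to any search service.
<P>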