http:^^www.cs.washington.edu^research^projects^softbots^papers^metacrawler^www4^html^overview.html

Organizations may choose to have an institutional MetaCrawler with
enhanced caching capabilities, on the presumption that people within
an organization will want to examine many of the same pages.  The
cache could also facilitate local annotations, creating a
collaborative filtering and information exchange environment of the
sort described elsewhere [<A HREF="#wwwhomr">17</A>].
<P>
Finally, while our prototype MetaCrawler depends on the good will of
the underlying services, a MetaCrawler client would not.  In the
future, an underlying service may choose to block MetaCrawler
requests, which are easily identified.  However, it would be
nearly impossible to distinguish queries issued by a MetaCrawler
client from queries made directly by a person.
<P>
<H3><A NAME=SECTION00023000000000000000> Common Usage</A></H3>
<P>
One of the frequently asked questions regarding search on the
Internet is "What are people searching for?" Table
<A HREF="#toptenqueries">1</A> summarizes the top ten repeated queries out of a
total of 50,878 queries made from July 7 through September 30. Each
query in the top ten is related to sex. However, the top ten
queries combined account for only 1,716 of the 50,878 total queries, or
3.37%.
Further, 24,253 queries, or 46.67%, were not repeated.
<P>
<hr>
<BR><STRONG>Table 1:</STRONG> <A NAME=toptenqueries></A>Top Ten Queries Issued to the MetaCrawler<BR>
<A NAME=137></A>
<table>
<tr><th align=center>No.</th><th align=left>Query</th><th align=center>Times Issued</th></tr>
<tr><td align=center>1</td><td align=left>sex</td><td align=center>533</td></tr>
<tr><td align=center>2</td><td align=left>erotica</td><td align=center>219</td></tr>
<tr><td align=center>3</td><td align=left>nude</td><td align=center>217</td></tr>
<tr><td align=center>4</td><td align=left>porn</td><td align=center>158</td></tr>
<tr><td align=center>5</td><td align=left>penthouse</td><td align=center>127</td></tr>
<tr><td align=center>6</td><td align=left>pornography</td><td align=center>112</td></tr>
<tr><td align=center>7</td><td align=left>erotic</td><td align=center>105</td></tr>
<tr><td align=center>8</td><td align=left>porno</td><td align=center>89</td></tr>
<tr><td align=center>9</td><td align=left>adult</td><td align=center>89</td></tr>
<tr><td align=center>10</td><td align=left>playboy</td><td align=center>67</td></tr>
</table>
<blockquote>
"Times Issued" lists the number of times the corresponding query was
issued from July 7 through Sept. 30, 1995. Note that while each query
is sexually related, the combined total amounts to less than 4% of
the total queries processed by the MetaCrawler. Also, 46% of the
queries issued were unique.
</blockquote>
<hr>
<P>
<A NAME=evaluation></A><H2><A NAME=SECTION00030000000000000000> Evaluation</A></H2>
<P>
The MetaCrawler was released publicly on July 7, 1995. Averages and percentages
presented in this paper are based on the 50,878 completed queries,
starting on July 7 and ending September 30, except those in reference to
Open Text, which are based on 19,951 completed queries starting
September 8, when Open Text was added to the MetaCrawler's
repertoire.
The log results from seven days were omitted because a service changed
its output format, which caused that service to return no references to
the MetaCrawler even though the service was available. The MetaCrawler
is currently running on a DEC Alpha 3000/400 under OSF 3.2.
<P>
The first hypothesis we confirmed after we deployed the MetaCrawler
was that sending queries in parallel and collating the results was
useful. As our metric, we assumed that references users followed from
the page of hits returned by the MetaCrawler contained relevant
information. We calculated the percentage
of references followed by users for each of the search services.
Table <A HREF="#followedtable">2</A> demonstrates the need for using multiple services;
while Lycos did return the plurality of the hits that were followed, with
a 35.43% share (42.17% in the last month recorded), slightly under
65% of the followed references came from the other five
services. Skeptical readers may argue that service providers
could invest in more resources and provide more comprehensive
indices to the Web. However, recent studies indicate that the rate of Web
expansion and change makes a complete index virtually
impossible [<A HREF="#lycossignidrv">16</A>].
<p>
<hr>
<BR><STRONG>Table 2:</STRONG> <A NAME=followedtable></A>Market Shares of Followed References<BR>
<A NAME=138></A>
<table>
<tr><th></th><th align=center>% followed Jul. 7 - Sept. 30</th><th align=center>% followed Sept.
8 - 30</th></tr>
<tr><th align=left>Lycos</th><td align=center>35.43</td><td align=center>42.17</td></tr>
<tr><th align=left>WebCrawler</th><td align=center>30.76</td><td align=center>25.74</td></tr>
<tr><th align=left>InfoSeek</th><td align=center>18.55</td><td align=center>15.70</td></tr>
<tr><th align=left>Galaxy</th><td align=center>17.10</td><td align=center>15.60</td></tr>
<tr><th align=left>Open Text</th><td align=center>n/a</td><td align=center>14.70</td></tr>
<tr><th align=left>Yahoo</th><td align=center>10.67</td><td align=center>6.59</td></tr>
</table>
<blockquote>
This table shows the percentage each service has of the total followed
references. References returned by two or more services are included
under each service, which is why the columns sum to over 100%.  The
table demonstrates that a user who restricts his or her queries to a
single service will miss most of the relevant references.
</blockquote>
<hr>
<p>
We then analyzed the data to determine which, if any, of the added
features of the MetaCrawler were helping users. The metric we used was
the number of references pruned. Table <A HREF="#deathbytable">3</A>
shows the average percentage of references removed for each advanced
option.
<p>
<hr>
<BR><STRONG>Table 3:</STRONG> <A NAME=deathbytable></A>Effect of Features in Removing
Irrelevant Hits<BR>
<A NAME=139></A>
<table>
<tr><th align=left>Feature</th><th align=center>% of Hits Removed</th></tr>
<tr><td align=left>Syntax</td><td align=center>39.79</td></tr>
<tr><td align=left>Dead</td><td align=center>14.88</td></tr>
<tr><td align=left>Expert</td><td align=center>21.49</td></tr>
</table>
<blockquote>
This table shows the percentage of hits removed when a particular
feature was used.
"Syntax" refers to hits that were removed
due to sophisticated query syntax (e.g., minus for undesired words);
"Dead" refers to unavailable or inaccessible pages, and "Expert"
refers to hits removed due to restriction on the references' origins.
</blockquote>
<hr>
<p>
Using syntax for required or non-desired
words typically reduces the number of returned results by 40%. Detecting dead
pages allowed the removal of another 15%. Finally, the expert options
were very successful in removing unwanted
references. When all of these features are used in conjunction,
up to 75% of the returned references can be removed.
<P>
<H3><A NAME=SECTION00031000000000000000> MetaCrawler Benchmarks</A></H3>
<P>
We have shown that the MetaCrawler improves the quality of results
returned to the user. But what is the performance cost? Table
<A HREF="#benchtime">4</A> shows the average times per query, differentiating
between having the MetaCrawler simply collate the results and having it
verify them as well.
<p>
<hr>
<BR><STRONG>Table 4:</STRONG> <A NAME=benchtime></A>Average Time for MetaCrawler to Return Results<BR>
<A NAME=140></A>
<table>
<tr><th align=left></th><th align=center>Wall Clock Time</th><th align=center>User Time</th><th align=center>System Time</th><th align=center>Lag Time</th></tr>
<tr><th align=left>Collated</th><td align=center>25.70</td><td align=center>0.32</td><td align=center>1.87</td><td align=center>23.51</td></tr>
<tr><th align=left>Verified</th><td align=center>139.30</td><td align=center>22.72</td><td align=center>4.50</td><td align=center>112.08</td></tr>
</table>
<blockquote>
All times are measured in seconds. "Wall Clock Time" is the total time
taken for an average
query, and is broken down into User, System, and Lag Time.
"User Time" is the time taken by the MetaCrawler program, "System Time"
the time taken by the operating system, and "Lag Time" the time taken
for pages to be downloaded off the network.
</blockquote>
<hr>
<p>
Table <A HREF="#benchtime">4</A> shows that the MetaCrawler finished relatively
quickly. The average time to return collated results is a little over 5 seconds
longer than the slowest service as shown by Table
<A HREF="#ccperformance">8</A>. This is to be expected given the percentage of
the time a service times out, which causes the MetaCrawler to wait for
a full minute before returning all the results.
We are pleased with the times reported for verification. Our
initial prototype typically took five minutes to perform
verification. We recently began caching retrieved pages for three
hours, and have found that caching reduces the average
verification time by nearly one-half. We are confident that this time can
be further reduced by more aggressive caching as well as improvements
in the thread management used by the MetaCrawler.
<P>
Since the MetaCrawler was publicly announced, the daily access count
has been growing at a linear rate. We are also pleased with the
increased use of the user options. Figure <A HREF="#usegraph">1</A> plots
the data points for the weeks from July 7 through September
30. "Feature Use by Week" shows the number of queries where any of
the MetaCrawler's advanced features, such as verification, were used.
<P>
<A NAME=545></A><IMG  ALIGN=BOTTOM ALT="" SRC="http://www.cs.washington.edu/research/projects/softbots/papers/metacrawler/www4/html/fig1trans.gif">
<BR><STRONG>Figure 1:</STRONG> <A NAME=usegraph></A>Queries per week from July 7 - Sept.
30<BR>
<P>
<A NAME=servicecomparison></A><H3><A NAME=SECTION00032000000000000000> Search Service Comparison</A></H3>
<P>
In addition to validating our claims, the MetaCrawler's logs also
allow us to present a "Consumer Reports" style comparison of the search
services. We evaluate each service using three metrics:
<UL>
<LI> <em> Coverage:</em> How many hits will be returned on average?
<LI> <em> Relevance:</em> Are hits returned actually followed by users?
<LI> <em> Performance:</em> How long does each service take, and how often does it time out?
</UL>
<H4><A NAME=SECTION00032100000000000000> Coverage</A></H4>
<P>
Given a pre-set maximum on the number of hits returned by each
service, we measured both the percentage of that maximum actually
returned and the percentage of the returned references that were unique
to the service that returned them. Thus,
75% returned with 70% unique shows that on average a service
returns 75% of its maximum allowed, with 70% of those hits being
unique.
<P>
Tables <A HREF="#cctotalhits">5</A> and <A HREF="#ccuniquehits">6</A> detail our findings
in terms of raw content. They show that with default parameters, Open Text returns
80% of the maximum hits allowed, with nearly 89% of those hits being
unique. Lycos and WebCrawler follow, also returning over 70%,
with slightly over 90% of those hits being unique. Yahoo has
particularly poor performance on the total hits metric, but this was not
surprising to us. We included Yahoo on the hypothesis that people search
for subjects, such as "Mariners Baseball," at which Yahoo
excels. However, it turned out this hypothesis was incorrect, as
people tended to use the MetaCrawler to search for nuggets of
information, such as "Ken
Griffey hand injury." Yahoo does not index this type of information, and thus
shows poor content. Presumably, most topic searches go to Yahoo directly.
<P>
Although each service returns mostly unique references, it is not
clear whether those references are useful.
Further, unique references are not necessarily unique to a
service's database, as another service could return that reference
given a different query string.
<p>
<hr>
<BR><STRONG>Table 5:</STRONG> <A NAME=cctotalhits></A>Returned References by Service<BR>
<A NAME=546></A>
<table>
<tr><th align=left></th><th align=center>% of Max Hits Returned</th><th align=center>Ave. Hits Returned / Maximum Allowed</th></tr>
<tr><th align=left>Open Text</th><td align=center>80.0</td><td align=center>8.0 / 10</td></tr>
<tr><th align=left>Lycos</th><td align=center>76.3</td><td align=center>19.1 / 25</td></tr>
