http:^^www.cs.washington.edu^research^projects^softbots^papers^metacrawler^www4^html^overview.html

<tr><th align=left>WebCrawler</th><td align=center>70.2</td><td align=center>17.6 / 25</td></tr>
<tr><th align=left>Galaxy</th><td align=center>56.9</td><td align=center>28.4 / 50</td></tr>
<tr><th align=left>InfoSeek</th><td align=center>43.5</td><td align=center>4.4 / 10</td></tr>
<tr><th align=left>Yahoo</th><td align=center>11.1</td><td align=center>11.1 / 100</td></tr>
</table>
<blockquote>The first column shows the percentage of the maximum hits allowed that each service returned. Each percentage was calculated by dividing the average hits returned by the maximum allowed for that service, as shown in the second column. This percentage is a measure of how many hits a service will provide given a pre-set maximum. The MetaCrawler used different maximum values for services, as some had internal maximum values, and others would either accept only certain maximums or none at all.</blockquote>
<hr><hr><BR>
<STRONG>Table 6:</STRONG> <A NAME=ccuniquehits></A>Unique References by Service<BR>
<A NAME=547></A><table>
<tr><th align=left></th><th align=center>% of Unique Hits Returned</th></tr>
<tr><th align=left>Galaxy</th><td align=center>99.6</td></tr>
<tr><th align=left>Yahoo</th><td align=center>92.8</td></tr>
<tr><th align=left>Lycos</th><td align=center>90.6</td></tr>
<tr><th align=left>WebCrawler</th><td align=center>90.3</td></tr>
<tr><th align=left>Open Text</th><td align=center>88.8</td></tr>
<tr><th align=left>InfoSeek</th><td align=center>79.5</td></tr>
</table>
<blockquote>This table shows the percentage of references each service returned that were unique to that service.</blockquote>
<hr><p>
<H4><A NAME=SECTION00032200000000000000> Relevance</A></H4>
To measure relevance, two metrics are used.
The first is which service returns the most references that people follow. This is shown by Table <A HREF="#followedtable">2</A>. The second metric is the percentage of the references returned by each service that people follow. Table <A HREF="#ccrelevance">7</A> summarizes these calculations. It shows that nearly 6% of all references returned by InfoSeek were followed. Lycos, Open Text, and WebCrawler follow, each having about 2.5% of their hits followed.
<P>This data has two caveats. The first is that the relevant information for people may be the list of references itself. For example, people who wish to see how many links there are to their home page may search on their own name just to calculate this number. The second caveat is that these numbers may be skewed by the number of hits returned by each service. Thus, while InfoSeek has nearly 6% of its results followed, only a total of 13,045 references returned by InfoSeek were followed, compared with the 24,913 references followed that were contributed by Lycos, which on average had only 2.5% of its 19 hits examined.
<p><hr><BR>
<STRONG>Table 7:</STRONG> <A NAME=ccrelevance></A>Relevance of Returned Hits by Service<BR>
<A NAME=548></A><table>
<tr><th align=left></th><th align=center>% of Hits Returned that are Followed</th><th align=center>Total Hits Followed</th></tr>
<tr><th align=left>InfoSeek</th><td align=center>5.89</td><td align=center>13,045</td></tr>
<tr><th align=left>Lycos</th><td align=center>2.56</td><td align=center>24,913</td></tr>
<tr><th align=left>Open Text</th><td align=center>2.51</td><td align=center>4,025</td></tr>
<tr><th align=left>WebCrawler</th><td align=center>2.42</td><td align=center>21,631</td></tr>
<tr><th align=left>Yahoo</th><td align=center>1.33</td><td align=center>7,503</td></tr>
<tr><th
align=left>Galaxy</th><td align=center>0.83</td><td align=center>12,022</td></tr>
</table>
<blockquote>This table shows the percentage of followed hits for each service. References returned by multiple services are counted multiple times. Column 2 shows the actual number of references followed for each service. These numbers are out of 50,878 queries, except Open Text which is out of 19,951 queries.</blockquote>
<hr><p>
<H4><A NAME=SECTION00032300000000000000> Performance</A></H4>
Finally, we measure each service's response time. Table <A HREF="#ccperformance">8</A> summarizes our findings. It is not surprising, although disappointing, to find that average times vary from just under 10 seconds to just under 20. The percent of time the services timed out is under 5% for all services except Open Text, which is the newest and presumably still going through some growing pains. One explanation for the length of times taken by these services is that the majority of requests are during peak hours. Thus, results are naturally skewed towards the times when the services are most loaded. Times during non-peak hours are much lower.
<p><hr><BR>
<STRONG>Table 8:</STRONG> <A NAME=ccperformance></A>Performance of Services<BR>
<P><A NAME=549></A><table>
<tr><th align=left></th><th align=center>Ave.
Time (sec)</th><th align=center>% Timed Out</th></tr>
<tr><th align=left>WebCrawler</th><td align=center>9.64</td><td align=center>2.30</td></tr>
<tr><th align=left>InfoSeek</th><td align=center>12.30</td><td align=center>3.01</td></tr>
<tr><th align=left>Open Text</th><td align=center>16.26</td><td align=center>14.13</td></tr>
<tr><th align=left>Yahoo</th><td align=center>18.32</td><td align=center>2.28</td></tr>
<tr><th align=left>Lycos</th><td align=center>18.99</td><td align=center>4.87</td></tr>
<tr><th align=left>Galaxy</th><td align=center>19.52</td><td align=center>3.10</td></tr>
</table>
<blockquote>This table shows the average time in seconds taken by each service to fulfill a query. The second column gives the percent of time that the service would time out, or fail to return any hits under one minute.</blockquote>
<hr><p>
<A NAME=relatedwork></A><H2><A NAME=SECTION00040000000000000000> Related Work</A></H2>
<P>Unifying several databases under one interface is far from novel. Many companies, such as PLS[<A HREF="#wwwpls">21</A>], Lexis-Nexis[<A HREF="#wwwlexisnexis">14</A>], and Verity[<A HREF="#wwwverity">30</A>] have invested several years and substantial capital creating systems that can handle and integrate heterogeneous databases. Likewise, with the emergence of many Internet search services, there have been many different efforts to create single interfaces to the sundry databases. Perhaps the most widely distributed is CUSI[<A HREF="#wwwcusi">18</A>], the Configurable Unified Search Index, which is a large form which allows users to select one service at a time and query that service.
There are also several other services much like CUSI, such as the All-in-One Search Page[<A HREF="#wwwallinone">2</A>], or W3 Search Engine list[<A HREF="#wwww3enginelist">19</A>]. Unfortunately, while the user has many services on these lists to choose from, there is no parallelism or collation. The user must submit queries to each service individually, although this task is made easier by having form interfaces to the various services on one page.
<P>The Harvest system[<A HREF="#harvesttr">6</A>] has many similarities to the MetaCrawler; however, rather than using existing databases as they are and post-processing the information returned, Harvest uses "Gatherers" to index information and "Brokers" to provide different interfaces to extract this information. However, while Harvest may have many different interfaces to many different internal services, it is still a search service like Lycos and WebCrawler, instead of a meta-service like MetaCrawler.
<P>There are also other parallel Web search services. Sun Microsystems supports a very primitive service[<A HREF="#wwwsunmultithreaded">27</A>], and IBM has recently announced infoMarket[<A HREF="#wwwinfoMarket">11</A>] which, rather than integrating similar services with different coverage, integrates quite different services, such as DejaNews[<A HREF="#wwwdejanews">26</A>], a USENET news search service, McKinley[<A HREF="#wwwmckinley">29</A>], a clone of Yahoo with some editorial ratings on various pages, in addition to Open Text and Yahoo.
<P>The closest work to the MetaCrawler is SavvySearch[<A HREF="#wwwsavvy">3</A>], an independently created multi-threaded search service released in May 1995.
SavvySearch has a larger repertoire of search services, although some are not WWW resource services, such as Roget's Thesaurus. SavvySearch's main focus is categorizing users' queries, and sending them to the most appropriate subset of its known services[<A HREF="#dreilingersavvy">4</A>].
<P>Like the MetaCrawler, the Internet Softbot[<A HREF="#etzioniuicacm">8</A>] is a meta-service that leverages existing services and collates their results. The Softbot enables a human user to state what he or she wants accomplished. The Softbot attempts to disambiguate the request and to dynamically determine how and where to satisfy it, utilizing a wide range of Internet services. Unlike the MetaCrawler, which focuses on indices and keyword queries, the Softbot accesses structured services such as Netfind and databases such as Inspec. The Softbot explicitly represents the semantics of the services, enabling it to chain together multiple services in response to a user request. The Softbot utilizes automatic planning technology to dynamically generate the appropriate sequence of service accesses. While the MetaCrawler and the Softbot rely on radically different technologies, the vision driving both systems is the same. Both seek to provide an expressive and integrated interface to Internet services.
<P><A NAME=futurework></A><H2><A NAME=SECTION00050000000000000000> Future Work</A></H2>
<P>We are investigating how the MetaCrawler will scale to use new services. Of particular importance is how to collate results returned from different types of Internet indices, such as USENET news and Web pages. Also important is determining useful methods for interacting with specialized databases, such as the Internet Movie Database[<A HREF="#wwwmoviedatabase">28</A>]. If the information requested is obviously located on some special purpose databases, then it does not make sense to query each and every service.
We are investigating methods that will enable the MetaCrawler to determine which services will return relevant data based solely on the query text and other data provided by the user.
<P><A NAME=futuredesign></A><H3><A NAME=SECTION00051000000000000000> Future Design</A></H3>
<P>The existing MetaCrawler prototype can cause a substantial network load when it attempts to verify a large number of pages. While one query by itself is no problem, multiple queries occurring at the same time can cause the system and network to bog down. However, with the emergence of universally portable Internet-friendly languages, such as Java[<A HREF="#javawhitepaper">10</A>] or MagicCap[<A HREF="#genmagicguidetocodewarrior">25</A>], load problems can be lessened by having users' machines take on the workload needed to perform their individual query, as discussed in Section <A HREF="#clientserverdesign">2.2</A>. The JavaCrawler, a prototype next generation MetaCrawler written in Java, supports most of the features already present in the MetaCrawler. However, instead of users running queries on one central service, each user has a local copy of the JavaCrawler and uses that copy to directly send queries to services. The load caused by verification will be taken by the user's machine, rather than the central server. This has the added benefit of inserting downloaded pages into the local cache, making retrieval of those pages nearly instantaneous. The JavaCrawler is loaded automatically from the MetaCrawler home page when visited with a Java-compatible browser.
<P><H3><A NAME=SECTION00052000000000000000> Impact on Search Service Providers</A></H3>
<P>We anticipate that a wide range of meta-services like the MetaCrawler will emerge over the next few years. However, it is uncertain what the relationship between these meta-services and search service providers will be.
We envision that this relationship will hinge on what form the "information economy" used by service providers takes. We discuss two different models.
<P><H4><A NAME=SECTION00052100000000000000> Charge-per-Access</A></H4>
<P>In the charge-per-access model, service providers benefit from any access to their database. InfoSeek has already taken this model with their commercial service. InfoSeek is financially rewarded regardless of who or what sends a query to their commercial database. Many other databases, both on and off the Web, also use this model.
<P>The MetaCrawler fits in well with this model. Since service providers benefit from any access, the added exposure generated by the MetaCrawler is to their advantage. Further, this model creates an implicit sanity check on the claims this paper makes on the use of its features. In order for the MetaCrawler, or any meta-service, to survive in such an economy, it must charge more per transaction than the underlying services, as the MetaCrawler will in turn have to pay each service for its information. Thus, users must be willing to pay the premium for the service. By voting with their pocketbook, they can determine if those features are truly desirable.
<P><H4><A NAME=SECTION00052200000000000000> Advertising</A></H4>
<P>In the advertising model, service providers benefit from sponsors who in turn gain benefit from exposure provided by the service. Nearly all major search services that do not charge users directly have adopted this model, as have many other unrelated services which are heavily accessed.
<P>Under this model, the providers' relationship with the MetaCrawler can become problematic as the MetaCrawler filters away superfluous information such as advertisements. One promising method to ensure profitable co-existence
