Harvest User's Manual

Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Kang-Jin Lee

2002-10-29

Harvest User's Manual was edited by Kang-Jin Lee and covers Harvest
version 1.8.  It was originally written by Darren R. Hardy, Michael F.
Schwartz and Duane Wessels for Harvest 1.4.pl2 on 1996-01-31.
______________________________________________________________________

Table of Contents

1. Introduction to Harvest
   1.1 Copyright
   1.2 Online Harvest Resources
2. Subsystem Overview
   2.1 Distributing the Gathering and Brokering Processes
3. Installing the Harvest Software
   3.1 Requirements for Harvest Servers
      3.1.1 Hardware
      3.1.2 Platforms
      3.1.3 Software
   3.2 Requirements for Harvest Users
   3.3 Retrieving and Installing the Harvest Software
      3.3.1 Distribution types
      3.3.2 Harvest components
      3.3.3 User-contributed software
   3.4 Building the Source Distribution
   3.5 Additional installation for the Harvest Broker
      3.5.1 Checking the installation for HTTP access
      3.5.2 Required modifications to your HTTP server
      3.5.3 Apache httpd
      3.5.4 Other HTTP servers
   3.6 Upgrading versions of the Harvest software
      3.6.1 Upgrading from version 1.6 to version 1.8
      3.6.2 Upgrading from version 1.5 to version 1.6
      3.6.3 Upgrading from version 1.4 to version 1.5
      3.6.4 Upgrading from version 1.3 to version 1.4
      3.6.5 Upgrading from version 1.2 to version 1.3
      3.6.6 Upgrading from version 1.1 to version 1.2
      3.6.7 Upgrading to version 1.1 from version 1.0 or older
   3.7 Starting up the system: RunHarvest and related commands
   3.8 Harvest team contact information
4. The Gatherer
   4.1 Overview
   4.2 Basic setup
      4.2.1 Gathering News URLs with NNTP
      4.2.2 Cleaning out a Gatherer
   4.3 RootNode specifications
      4.3.1 RootNode filters
      4.3.2 Generic Enumeration program description
      4.3.3 Example RootNode configuration
      4.3.4 Gatherer enumeration vs. candidate selection
   4.4 Generating LeafNode/RootNode URLs from a program
   4.5 Extracting data for indexing: The Essence summarizing subsystem
      4.5.1 Default actions of ``stock'' summarizers
      4.5.2 Summarizing SGML data
         4.5.2.1 Location of support files
         4.5.2.2 The SGML to SOIF table
         4.5.2.3 Errors and warnings from the SGML Parser
         4.5.2.4 Creating a summarizer for a new SGML-tagged data type
         4.5.2.5 The SGML-based HTML summarizer
         4.5.2.6 Adding META data to your HTML
         4.5.2.7 Other examples
      4.5.3 Customizing the type recognition, candidate selection,
            presentation unnesting, and summarizing steps
         4.5.3.1 Customizing the type recognition step
         4.5.3.2 Customizing the candidate selection step
         4.5.3.3 Customizing the presentation unnesting step
         4.5.3.4 Customizing the summarizing step
   4.6 Post-Summarizing: Rule-based tuning of object summaries
      4.6.1 The Rules file
      4.6.2 Rewriting URLs
   4.7 Gatherer administration
      4.7.1 Setting variables in the Gatherer configuration file
      4.7.2 Local file system gathering for reduced CPU load
      4.7.3 Gathering from password-protected servers
      4.7.4 Controlling access to the Gatherer's database
      4.7.5 Periodic gathering and realtime updates
      4.7.6 The local disk cache
      4.7.7 Incorporating manually generated information into a Gatherer
   4.8 Troubleshooting
5. The Broker
   5.1 Overview
   5.2 Basic setup
   5.3 Querying a Broker
      5.3.1 Example queries
      5.3.2 Regular expressions
      5.3.3 Query options selected by menus or buttons
      5.3.4 Filtering query results
      5.3.5 Result set presentation
   5.4 Customizing the Broker's Query Result Set
      5.4.1 The search.cf configuration file
         5.4.1.1 Defined Variables
         5.4.1.2 List of Definitions
      5.4.2 Example search.cf customization file
      5.4.3 Integrating your customized configuration file
      5.4.4 Displaying SOIF attributes in results
   5.5 World Wide Web interface description
      5.5.1 HTML files for graphical user interface
      5.5.2 CGI programs
      5.5.3 Help files for the user
   5.6 Administrating a Broker
      5.6.1 Deleting unwanted Broker objects
      5.6.2 Command-line Administration
   5.7 Tuning Glimpse indexing in the Broker
      5.7.1 The glimpseserver program
   5.8 Using different index/search engines with the Broker
      5.8.1 Using Swish as an indexer
      5.8.2 Using WAIS as an indexer
   5.9 Collector interface description: Collection.conf
   5.10 Troubleshooting
6. Programs and layout of the installed Harvest software
   6.1 $HARVEST_HOME
   6.2 $HARVEST_HOME/bin
   6.3 $HARVEST_HOME/brokers
   6.4 $HARVEST_HOME/cgi-bin
   6.5 $HARVEST_HOME/gatherers
   6.6 $HARVEST_HOME/lib
   6.7 $HARVEST_HOME/lib/broker
   6.8 $HARVEST_HOME/lib/gatherer
   6.9 $HARVEST_HOME/tmp
7. The Summary Object Interchange Format (SOIF)
   7.1 Formal description of SOIF
   7.2 List of common SOIF attribute names
8. Gatherer Examples
   8.1 Example 1 - A simple Gatherer
   8.2 Example 2 - Incorporating manually generated information
   8.3 Example 3 - Customizing type recognition and candidate selection
   8.4 Example 4 - Customizing type recognition and summarizing
      8.4.1 Using regular expressions to summarize a format
      8.4.2 Using programs to summarize a format
      8.4.3 Running the example
   8.5 Example 5 - Using RootNode filters
9. History of Harvest
   9.1 History of Harvest
   9.2 History of Harvest User's Manual
______________________________________________________________________

1.  Introduction to Harvest

Harvest is an integrated set of tools to gather, extract, organize,
and search information across the Internet.  With modest effort users
can tailor Harvest to digest information in many different formats,
and offer custom search services on the Internet.

A key goal of Harvest is to provide a flexible system that can be
configured in various ways to create many types of indexes.

Harvest also allows users to extract structured (attribute-value pair)
information from many different information formats and build indexes
that allow these attributes to be referenced during queries (e.g.,
searching for all documents with a certain regular expression in the
title field).

An important advantage of Harvest is that it allows users to build
indexes using either manually constructed templates (for maximum
control over index content), automatically constructed templates built
from extracted data (for easy coverage of large collections), or a
hybrid of the two methods.

Harvest is designed to make it easy to distribute the search system on
a pool of networked machines to handle higher load.

1.1.  Copyright

The core of Harvest is licensed under the GPL <../../COPYING>.
Additional components distributed with Harvest are also under the GPL
or similar licenses.
Glimpse, the current default fulltext indexer, has a different
license.  A clarification of Glimpse's copyright status
<../glimpse-license-status> was kindly posted by Golda Velez
<mailto:gvelez@tucson.com> to comp.infosystems.harvest
<news:comp.infosystems.harvest>.

1.2.  Online Harvest Resources

This manual is available at
harvest.sourceforge.net/harvest/doc/html/manual.html.

More information about Harvest is available at
harvest.sourceforge.net.

2.  Subsystem Overview

Harvest consists of several subsystems.  The Gatherer subsystem
collects indexing information (such as keywords, author names, and
titles) from the resources available at Provider sites (such as FTP
and HTTP servers).  The Broker subsystem retrieves indexing
information from one or more Gatherers, suppresses duplicate
information, incrementally indexes the collected information, and
provides a WWW query interface to it.

                     Harvest Software Components

You should start using Harvest simply, by installing a single
``stock'' (i.e., not customized) Gatherer and Broker on one machine to
index some of the FTP, World Wide Web, and NetNews data at your site.
After you get the system working in this basic configuration, you can
invest additional effort as warranted.  First, as you scale up to
index larger volumes of information, you can reduce the CPU and
network load of indexing your data by distributing the gathering
process.  Second, you can customize how Harvest extracts, indexes, and
searches your information, to better match the types of data you have
and the ways your users would like to interact with the data.

We discuss how to distribute the gathering process in the next
subsection.  We cover various forms of customization in Section
``Customizing the type recognition, candidate selection, presentation
unnesting, and summarizing steps'' and in several parts of Section
``The Broker''.
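The division of labor between Gatherers and Brokers can be sketched in
miniature.  The following is an illustrative toy in Python, not
Harvest code (the real Gatherer and Broker are separate network
services that exchange SOIF objects; the function and attribute names
here are invented for the sketch):

```python
import re

# Toy "Gatherer": produces attribute-value summaries for a site's
# documents, keyed by URL.  Harvest exchanges these as SOIF objects.
def gather(site_docs):
    return [{"url": url, "title": title, "keywords": body.split()}
            for url, (title, body) in site_docs.items()]

# Toy "Broker": merges summaries from several Gatherers, suppresses
# duplicates by URL, and answers regex queries against an attribute.
class Broker:
    def __init__(self):
        self.index = {}          # url -> summary

    def collect(self, gatherer_output):
        for summary in gatherer_output:
            # First copy of a URL wins; later duplicates are suppressed.
            self.index.setdefault(summary["url"], summary)

    def query(self, attribute, pattern):
        rx = re.compile(pattern)
        return [s["url"] for s in self.index.values()
                if rx.search(str(s.get(attribute, "")))]

site_a = {"ftp://a/readme": ("Harvest intro", "gather extract organize")}
site_b = {"http://b/doc":   ("Broker notes",  "query index duplicate"),
          "ftp://a/readme": ("Harvest intro", "gather extract organize")}

broker = Broker()
broker.collect(gather(site_a))
broker.collect(gather(site_b))  # duplicate of ftp://a/readme suppressed
print(broker.query("title", "Harvest"))   # -> ['ftp://a/readme']
```

The same shape scales in the ways the text describes: many Gatherers
can feed one Broker, and because a Broker answers queries, another
Broker can `collect` from it to cascade a filtered view.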
2.1.  Distributing the Gathering and Brokering Processes

Harvest Gatherers and Brokers can be configured in various ways.
Running a Gatherer remotely from a Provider site allows Harvest to
interoperate with sites that are not running Harvest Gatherers, by
using standard object retrieval protocols like FTP, Gopher, HTTP, and
NNTP.  However, as suggested by the bold lines in the left side of
Figure ``2'', this arrangement results in excess server and network
load.  Running a Gatherer locally is much more efficient, as shown in
the right side of Figure ``2''.  Nonetheless, running a Gatherer
remotely is still better than having many sites independently collect
indexing information, since many Brokers or other search services can
share the indexing information that the Gatherer collects.

If you have a number of FTP/HTTP/Gopher/NNTP servers at your site, it
is most efficient to run a Gatherer on each machine where these
servers run.  On the other hand, you can reduce installation effort by
running a Gatherer on just one machine at your site and letting it
retrieve data from across the network.

                    Harvest Configuration Options

Figure ``2'' also illustrates that a Broker can collect information
from many Gatherers (to build an index of widely distributed
information).  Brokers can also retrieve information from other
Brokers, in effect cascading indexed views from one another.  Brokers
retrieve this information using the query interface, allowing them to
filter or refine the information from one Broker to the next.

3.  Installing the Harvest Software

3.1.  Requirements for Harvest Servers
