     o  The Broker has a new log format for the admin/LOG file which is
        incompatible with version 1.1.

3.6.7.  Upgrading to version 1.1 from version 1.0 or older

If you already have an older version of Harvest installed, and want to
upgrade, you can not unpack the new distribution on top of the old
one.  For example, the change from version 1.0 to version 1.1 included
some reorganization of the executables, and hence simply installing
version 1.1 on top of version 1.0 would cause you to use old
executables in some cases.

On the other hand, you may not want to start over from scratch with a
new software version, as that would not take advantage of the data you
have already gathered and indexed.  Instead, to upgrade from Harvest
version 1.0 to 1.1, do the following:

1. Move your old installation to a temporary location.

2. Install the new version as directed by the release notes.

3. Then, for each Gatherer and Broker that you were running under the
   old installation, migrate the server into the new installation.

   Gatherers:
      you need to move the Gatherer's directory into
      $HARVEST_HOME/gatherers.  Section ``RootNode specifications''
      describes the new Gatherer workload specifications which were
      introduced in version 1.1; you may modify your Gatherer's
      configuration file to employ this new functionality.

   Brokers:
      you need to move the Broker's directory into
      $HARVEST_HOME/brokers.  You may want, however, to rebuild your
      broker by using CreateBroker so that you can use the updated
      query.html and related files.

3.7.  Starting up the system: RunHarvest and related commands

The simplest way to start the Harvest system is to use the RunHarvest
command.  RunHarvest prompts the user with a short list of questions
about what data to index, etc., and then creates and runs a Gatherer
and Broker with a ``stock'' (non-customized) set of content extraction
and indexing mechanisms.  Some more primitive commands are also
available, for starting individual Gatherers and Brokers (e.g., if you
want to distribute the gathering process).  The Harvest startup
commands are:

   RunHarvest
      Checks that the Harvest software is installed correctly, prompts
      the user for basic configuration information, and then creates
      and runs a Gatherer and a Broker.  If you have $HARVEST_HOME
      set, then it will use it; otherwise, it tries to determine
      $HARVEST_HOME automatically.  Found in the $HARVEST_HOME
      directory.

   RunBroker
      Runs a Broker.  Found in the Broker's directory.

   RunGatherer
      Runs a Gatherer.  Found in the Gatherer's directory.

   CreateBroker
      Creates a single Broker which will collect its information from
      other existing Brokers or Gatherers.  Used by RunHarvest, or can
      be run by a user to create a new Broker.  Uses $HARVEST_HOME,
      and defaults to /usr/local/harvest.  Found in the
      $HARVEST_HOME/bin directory.
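For example, assuming the default installation prefix of
/usr/local/harvest and csh-style syntax, a first-time startup might
look like this sketch:

          % setenv HARVEST_HOME /usr/local/harvest
          % $HARVEST_HOME/RunHarvest

Later, you can restart the servers by hand.  The ``sample'' directory
names below are hypothetical; RunHarvest chooses the actual Gatherer
and Broker directory names from your answers to its questions:

          # "sample" is a placeholder for your Gatherer/Broker name
          % cd $HARVEST_HOME/gatherers/sample; ./RunGatherer
          % cd $HARVEST_HOME/brokers/sample; ./RunBroker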
There is no CreateGatherer command, but the RunHarvest command can
create a Gatherer, or you can create a Gatherer manually (see Section
``Customizing the type recognition, candidate selection, presentation
unnesting, and summarizing steps'' or Section ``Gatherer Examples'').
The layout of the installed Harvest directories and programs is
discussed in Section ``Programs and layout of the installed Harvest
software''.

Among other things, the RunHarvest command asks the user what port
numbers to use when running the Gatherer and the Broker.  By default,
the Gatherer will use port 8500 and the Broker will use the Gatherer
port plus 1.  The choice of port numbers depends on your particular
machine -- you need to choose ports that are not in use by other
servers on your machine.  You might look at your /etc/services file to
see what ports are in use (although this file only lists some servers;
other servers use ports without registering that information
anywhere).  Usually the above port numbers will not be in use by other
processes.  Probably the easiest thing is simply to try the default
port numbers, and see if they work.

The remainder of this manual provides information for users who wish
to customize or otherwise make more sophisticated use of Harvest than
what happens when you install the system and run RunHarvest.

3.8.  Harvest team contact information

If you have questions about the Harvest system or problems with the
software, post a note to the USENET newsgroup comp.infosystems.harvest
<news:comp.infosystems.harvest>.  Please note your machine type,
operating system type, and Harvest version number in your
correspondence.

If you have bug fixes, ports to new platforms or other software
improvements, please email them to the Harvest maintainer lee@arco.de
<mailto:lee@arco.de>.

4.  The Gatherer

4.1.  Overview

The Gatherer retrieves information resources using a variety of
standard access methods (FTP, Gopher, HTTP, NNTP, and local files),
and then summarizes those resources in various type-specific ways to
generate structured indexing information.  For example, a Gatherer can
retrieve a technical report from an FTP archive, and then extract the
author, title, and abstract from the paper to summarize the technical
report.  Harvest Brokers or other search services can then retrieve
the indexing information from the Gatherer to use in a searchable
index available via a WWW interface.

The Gatherer consists of a number of separate components.  The
Gatherer program reads a Gatherer configuration file and controls the
overall process of enumerating and summarizing data objects.

The structured indexing information that the Gatherer collects is
represented as a list of attribute-value pairs using the Summary
Object Interchange Format (SOIF, see Section ``The Summary Object
Interchange Format (SOIF)'').  The gatherd daemon serves the Gatherer
database to Brokers.  It hangs around, in the background, after a
gathering session is complete.  A stand-alone gather program is a
client for the gatherd server.  It can be used from the command line
for testing, and is used by the Broker.  The Gatherer uses a local
disk cache to store objects it has retrieved.  The disk cache is
described in Section ``The local disk cache''.
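As a rough illustration of the attribute-value pairs mentioned above,
here is a hand-written sketch of a SOIF summary object (not real
Gatherer output; the attribute set depends on the summarizer, and the
number in braces gives the byte count of the value -- see Section
``The Summary Object Interchange Format (SOIF)'' for the full
definition):

          @FILE { http://www.xfree86.org/
          Title{7}:        XFree86
          Type{4}:         HTML
          Time-to-Live{7}: 2592000
          }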
Even though the gatherd daemon remains in the background, a Gatherer
does not automatically update or refresh its summary objects.  Each
object in a Gatherer has a Time-to-Live value.  Objects remain in the
database until they expire.  See Section ``Periodic gathering and
realtime updates'' for more information on keeping Gatherer objects up
to date.

Several example Gatherers are provided with the Harvest software
distribution (see Section ``Gatherer Examples'').

4.2.  Basic setup

To run a basic Gatherer, you need only list the Uniform Resource
Locators (URLs, see RFC1630 and RFC1738) from which it will gather
indexing information.  This list is specified in the Gatherer
configuration file, along with other optional information such as the
Gatherer's name and the directory in which it resides (see Section
``Setting variables in the Gatherer configuration file'' for details
on the optional information).  Below is an example Gatherer
configuration file:

          #
          #  sample.cf - Sample Gatherer Configuration File
          #
          Gatherer-Name:    My Sample Harvest Gatherer
          Gatherer-Port:    8500
          Top-Directory:    /usr/local/harvest/gatherers/sample

          <RootNodes>
          # Enter URLs for RootNodes here
          http://www.mozilla.org/
          http://www.xfree86.org/
          </RootNodes>

          <LeafNodes>
          # Enter URLs for LeafNodes here
          http://www.arco.de/~kj/index.html
          </LeafNodes>

As shown in the example configuration file, you may classify a URL as
a RootNode or a LeafNode.  For a LeafNode URL, the Gatherer simply
retrieves the URL and processes it.  LeafNode URLs are typically files
like PostScript papers or compressed ``tar'' distributions.  For a
RootNode URL, the Gatherer will expand it into zero or more LeafNode
URLs by recursively enumerating it in an access method-specific way.
For FTP or Gopher, the Gatherer will perform a recursive directory
listing on the FTP or Gopher server to expand the RootNode (typically
a directory name).  For HTTP, a RootNode URL is expanded by following
the embedded HTML links to other URLs.  For News, the enumeration
returns all the messages in the specified USENET newsgroup.

PLEASE BE CAREFUL when specifying RootNodes, as it is possible to
specify an enormous amount of work with a single RootNode URL.  To
help prevent a misconfigured Gatherer from abusing servers or running
wildly, by default the Gatherer will only expand a RootNode into 250
LeafNodes, and will only include HTML links that point to documents
that reside on the same server as the original RootNode URL.  There
are several options that allow you to change these limits and
otherwise enhance the Gatherer specification.  See Section ``RootNode
specifications'' for details.

The Gatherer is a ``robot'' and collects URLs starting from the URLs
specified in RootNodes.  It obeys the robots.txt convention and the
robots META tag.  It is also HTTP Version 1.1 compliant and sends the
User-Agent and From request fields to HTTP servers for accountability.

After you have written the Gatherer configuration file, create a
directory for the Gatherer and copy the configuration file there.
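With the sample configuration above, that might look like the
following sketch (the path is the Top-Directory value from sample.cf):

          # directory taken from Top-Directory in sample.cf
          % mkdir -p /usr/local/harvest/gatherers/sample
          % cp sample.cf /usr/local/harvest/gatherers/sample
          % cd /usr/local/harvest/gatherers/sample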
Then, run the Gatherer program with the configuration file as the only
command-line argument, as shown below:

          % Gatherer GathName.cf

The Gatherer will generate a database of the content summaries, a log
file (log.gatherer), and an error log file (log.errors).  It will also
start the gatherd daemon which exports the indexing information
automatically to Brokers and other clients.  To view the exported
indexing information, you can use the gather client program, as shown
below:

          % gather localhost 8500 | more

The gather program's usage is:

          gather [-info] [-nocompress] host port [timestamp]

The -info option causes the Gatherer to respond only with the Gatherer
summary information, which consists of the attributes available in the
specified Gatherer's database, the Gatherer's host and name, the range
of object update times, and the number of objects.  Compression is the
default, but can be disabled with the -nocompress option.  The
optional timestamp tells the Gatherer to send only the objects that
have changed since the specified timestamp (in seconds since the UNIX
``epoch'' of January 1, 1970).

4.2.1.  Gathering News URLs with NNTP

News URLs are somewhat different from those of the other access
protocols because the URL generally does not contain a hostname.  The
Gatherer retrieves News URLs from an NNTP server.  The name of this
server must be placed in the environment variable $NNTPSERVER.  It is
probably a good idea to add this to your RunGatherer script.  If the
environment variable is not set, the Gatherer attempts to connect to a
host named news at your site.

4.2.2.  Cleaning out a Gatherer

Remember that the Gatherer databases persist between runs.  Objects
remain in the databases until they expire.  When experimenting with
the Gatherer, it is always a good idea to ``clean out'' the databases
between runs.  This is most easily accomplished by executing this
command from the Gatherer directory:

          % rm -rf data tmp log.*

4.3.  RootNode specifications

The RootNode specification facility described in Section ``Basic
setup'' provides a basic set of default enumeration actions for
RootNodes.  Often it is useful to enumerate beyond the default limits,
for example, to increase the enumeration limit beyond 250 URLs, or to
allow site boundaries to be crossed when enumerating HTML links.  It
is possible to specify these and other aspects of enumeration, using
the following syntax:

          <RootNodes>
          URL EnumSpec
          URL EnumSpec
          ...
          </RootNodes>

where EnumSpec is on a single line (using ``\'' to escape linefeeds),
with the following syntax:

          URL=URL-Max[,URL-Filter-filename]    \
          Host=Host-Max[,Host-Filter-filename] \
          Access=TypeList                      \
          Delay=Seconds                        \
          Depth=Number                         \
          Enumeration=Enumeration-Program

The EnumSpec modifiers are all optional, and have the following
meanings (a small worked example appears after the descriptions
below):

   URL-Max
      The number specified on the right hand side of the ``URL=''
      expression lists the maximum number of LeafNode URLs to generate
      at all levels of depth, from the current URL.  Note that URL-Max
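As the small worked example promised above, a RootNode entry using
some of these modifiers might look like the following (the URL and the
limit values are hypothetical, chosen only to exercise the syntax):

          <RootNodes>
          # hypothetical limits: at most 1000 LeafNode URLs,
          # 2 levels of enumeration, 60 seconds between accesses
          http://www.mozilla.org/ URL=1000 Depth=2 Delay=60
          </RootNodes>

Assuming the modifiers behave as described in this section, this would
raise the enumeration limit from the default 250 to 1000 LeafNode
URLs, bound the enumeration depth at 2, and pause 60 seconds between
accesses to the server.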
