📄 manual.txt
  o  The Broker has a new log format for the admin/LOG file which is
     incompatible with version 1.1.

3.6.7. Upgrading to version 1.1 from version 1.0 or older

If you already have an older version of Harvest installed and want to upgrade, you can not unpack the new distribution on top of the old one. For example, the change from version 1.0 to version 1.1 included some reorganization of the executables, and hence simply installing version 1.1 on top of version 1.0 would cause you to use old executables in some cases. On the other hand, you may not want to start over from scratch with a new software version, as that would not take advantage of the data you have already gathered and indexed. Instead, to upgrade from Harvest version 1.0 to 1.1, do the following:

  1. Move your old installation to a temporary location.

  2. Install the new version as directed by the release notes.

  3. For each Gatherer and Broker that you were running under the old
     installation, migrate the server into the new installation.

Gatherers: you need to move the Gatherer's directory into $HARVEST_HOME/gatherers. Section ``RootNode specifications'' describes the new Gatherer workload specifications introduced in version 1.1; you may modify your Gatherer's configuration file to employ this new functionality.

Brokers: you need to move the Broker's directory into $HARVEST_HOME/brokers. You may want, however, to rebuild your Broker by using CreateBroker so that you can use the updated query.html and related files.

3.7. Starting up the system: RunHarvest and related commands

The simplest way to start the Harvest system is to use the RunHarvest command.
RunHarvest prompts the user with a short list of questions about what data to index, etc., and then creates and runs a Gatherer and Broker with a ``stock'' (non-customized) set of content extraction and indexing mechanisms. Some more primitive commands are also available for starting individual Gatherers and Brokers (e.g., if you want to distribute the gathering process). The Harvest startup commands are:

RunHarvest
     Checks that the Harvest software is installed correctly, prompts
     the user for basic configuration information, and then creates and
     runs a Gatherer and a Broker. If you have $HARVEST_HOME set, then
     it will use it; otherwise, it tries to determine $HARVEST_HOME
     automatically. Found in the $HARVEST_HOME directory.

RunBroker
     Runs a Broker. Found in the Broker's directory.

RunGatherer
     Runs a Gatherer. Found in the Gatherer's directory.

CreateBroker
     Creates a single Broker which will collect its information from
     other existing Brokers or Gatherers. Used by RunHarvest, or can be
     run by a user to create a new Broker. Uses $HARVEST_HOME, and
     defaults to /usr/local/harvest. Found in the $HARVEST_HOME/bin
     directory.

There is no CreateGatherer command, but the RunHarvest command can create a Gatherer, or you can create a Gatherer manually (see Section ``Customizing the type recognition, candidate selection, presentation unnesting, and summarizing steps'' or Section ``Gatherer Examples''). The layout of the installed Harvest directories and programs is discussed in Section ``Programs and layout of the installed Harvest software''.

Among other things, the RunHarvest command asks the user what port numbers to use when running the Gatherer and the Broker. By default, the Gatherer will use port 8500 and the Broker will use the Gatherer port plus 1.
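The default port choice can be sketched as follows. The /etc/services check is only a rough availability test, since that file lists just the registered services; the port numbers are the defaults mentioned above:

```shell
# Default ports: Gatherer 8500, Broker = Gatherer port + 1 (as RunHarvest does)
GATHERER_PORT=8500
BROKER_PORT=$((GATHERER_PORT + 1))

# Rough check: /etc/services lists only registered services, so a port
# absent here may still be in use by an unregistered server.
for port in $GATHERER_PORT $BROKER_PORT; do
    if grep -qE "[[:space:]]$port/(tcp|udp)" /etc/services 2>/dev/null; then
        echo "port $port appears in /etc/services"
    else
        echo "port $port not registered in /etc/services"
    fi
done
```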
The choice of port numbers depends on your particular machine -- you need to choose ports that are not in use by other servers on your machine. You might look at your /etc/services file to see what ports are in use (although this file only lists some servers; other servers use ports without registering that information anywhere). Usually the above port numbers will not be in use by other processes, so the easiest approach is simply to try the default port numbers and see if they work.

The remainder of this manual provides information for users who wish to customize or otherwise make more sophisticated use of Harvest than what happens when you install the system and run RunHarvest.

3.8. Harvest team contact information

If you have questions about the Harvest system or problems with the software, post a note to the USENET newsgroup comp.infosystems.harvest <news:comp.infosystems.harvest>. Please note your machine type, operating system type, and Harvest version number in your correspondence. If you have bug fixes, ports to new platforms, or other software improvements, please email them to the Harvest maintainer lee@arco.de <mailto:lee@arco.de>.

4. The Gatherer

4.1. Overview

The Gatherer retrieves information resources using a variety of standard access methods (FTP, Gopher, HTTP, NNTP, and local files), and then summarizes those resources in various type-specific ways to generate structured indexing information. For example, a Gatherer can retrieve a technical report from an FTP archive, and then extract the author, title, and abstract from the paper to summarize the technical report. Harvest Brokers or other search services can then retrieve the indexing information from the Gatherer to use in a searchable index available via a WWW interface.

The Gatherer consists of a number of separate components.
The Gatherer program reads a Gatherer configuration file and controls the overall process of enumerating and summarizing data objects. The structured indexing information that the Gatherer collects is represented as a list of attribute-value pairs using the Summary Object Interchange Format (SOIF; see Section ``The Summary Object Interchange Format (SOIF)''). The gatherd daemon serves the Gatherer database to Brokers, and remains in the background after a gathering session is complete. The stand-alone gather program is a client for the gatherd server; it can be used from the command line for testing, and is used by the Broker. The Gatherer uses a local disk cache to store objects it has retrieved. The disk cache is described in Section ``The local disk cache''.

Even though the gatherd daemon remains in the background, a Gatherer does not automatically update or refresh its summary objects. Each object in a Gatherer has a Time-to-Live value, and objects remain in the database until they expire. See Section ``Periodic gathering and realtime updates'' for more information on keeping Gatherer objects up to date. Several example Gatherers are provided with the Harvest software distribution (see Section ``Gatherer Examples'').

4.2. Basic setup

To run a basic Gatherer, you need only list the Uniform Resource Locators (URLs; see RFC 1630 and RFC 1738) from which it will gather indexing information. This list is specified in the Gatherer configuration file, along with other optional information such as the Gatherer's name and the directory in which it resides (see Section ``Setting variables in the Gatherer configuration file'' for details on the optional information).
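The indexing information such a Gatherer collects is exported as one SOIF record per URL. The record below is a purely illustrative sketch (the URL, attribute names, and values are invented); the `{n}` counts give each value's size in bytes, following the SOIF layout described in Section ``The Summary Object Interchange Format (SOIF)'':

```
@FILE { http://www.mozilla.org/index.html
Title{17}:	Mozilla Home Page
Author{9}:	Webmaster
File-Size{4}:	2048
}
```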
Below is an example Gatherer configuration file:

     #
     #  sample.cf - Sample Gatherer Configuration File
     #
     Gatherer-Name:  My Sample Harvest Gatherer
     Gatherer-Port:  8500
     Top-Directory:  /usr/local/harvest/gatherers/sample

     <RootNodes>
     # Enter URLs for RootNodes here
     http://www.mozilla.org/
     http://www.xfree86.org/
     </RootNodes>

     <LeafNodes>
     # Enter URLs for LeafNodes here
     http://www.arco.de/~kj/index.html
     </LeafNodes>

As shown in the example configuration file, you may classify a URL as a RootNode or a LeafNode. For a LeafNode URL, the Gatherer simply retrieves the URL and processes it. LeafNode URLs are typically files like PostScript papers or compressed ``tar'' distributions. For a RootNode URL, the Gatherer will expand it into zero or more LeafNode URLs by recursively enumerating it in an access method-specific way. For FTP or Gopher, the Gatherer will perform a recursive directory listing on the FTP or Gopher server to expand the RootNode (typically a directory name). For HTTP, a RootNode URL is expanded by following the embedded HTML links to other URLs. For News, the enumeration returns all the messages in the specified USENET newsgroup.

PLEASE BE CAREFUL when specifying RootNodes, as it is possible to specify an enormous amount of work with a single RootNode URL. To help prevent a misconfigured Gatherer from abusing servers or running wildly, by default the Gatherer will only expand a RootNode into 250 LeafNodes, and will only include HTML links that point to documents that reside on the same server as the original RootNode URL. There are several options that allow you to change these limits and otherwise enhance the Gatherer specification. See Section ``RootNode specifications'' for details.

The Gatherer is a ``robot'' and collects URLs starting from the URLs specified in RootNodes. It obeys the robots.txt convention and the robots META tag.
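For instance, a server administrator could steer the Gatherer (and any other conforming robot) away from part of a site with a robots.txt entry like the following; the path shown is illustrative:

```
User-agent: *
Disallow: /private/
```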
It is also HTTP Version 1.1 compliant, and sends the User-Agent and From request fields to HTTP servers for accountability.

After you have written the Gatherer configuration file, create a directory for the Gatherer and copy the configuration file there. Then, run the Gatherer program with the configuration file as the only command-line argument, as shown below:

     % Gatherer GathName.cf

The Gatherer will generate a database of the content summaries, a log file (log.gatherer), and an error log file (log.errors). It will also start the gatherd daemon, which exports the indexing information automatically to Brokers and other clients. To view the exported indexing information, you can use the gather client program, as shown below:

     % gather localhost 8500 | more

The -info option causes the Gatherer to respond only with the Gatherer summary information, which consists of the attributes available in the specified Gatherer's database, the Gatherer's host and name, the range of object update times, and the number of objects. Compression is the default, but can be disabled with the -nocompress option. The optional timestamp tells the Gatherer to send only the objects that have changed since the specified timestamp (in seconds since the UNIX ``epoch'' of January 1, 1970).

4.2.1. Gathering News URLs with NNTP

News URLs are somewhat different from the other access protocols because the URL generally does not contain a hostname. The Gatherer retrieves News URLs from an NNTP server. The name of this server must be placed in the environment variable $NNTPSERVER; it is probably a good idea to add this to your RunGatherer script. If the environment variable is not set, the Gatherer attempts to connect to a host named news at your site.

4.2.2. Cleaning out a Gatherer

Remember that the Gatherer databases persist between runs.
Objects remain in the databases until they expire. When experimenting with the Gatherer, it is always a good idea to ``clean out'' the databases between runs. This is most easily accomplished by executing this command from the Gatherer directory:

     % rm -rf data tmp log.*

4.3. RootNode specifications

The RootNode specification facility described in Section ``Basic setup'' provides a basic set of default enumeration actions for RootNodes. Often it is useful to enumerate beyond the default limits, for example, to increase the enumeration limit beyond 250 URLs, or to allow site boundaries to be crossed when enumerating HTML links. It is possible to specify these and other aspects of enumeration, using the following syntax:

     <RootNodes>
     URL EnumSpec
     URL EnumSpec
     ...
     </RootNodes>

where EnumSpec is on a single line (using ``\'' to escape linefeeds), with the following syntax:

     URL=URL-Max[,URL-Filter-filename] \
          Host=Host-Max[,Host-Filter-filename] \
          Access=TypeList \
          Delay=Seconds \
          Depth=Number \
          Enumeration=Enumeration-Program

The EnumSpec modifiers are all optional, and have the following meanings:

URL-Max
     The number specified on the right-hand side of the ``URL=''
     expression lists the maximum number of LeafNode URLs to generate at
     all levels of depth, from the current URL. Note that URL-Max