Harvest User's Manual
Darren R. Hardy, Michael F. Schwartz, Duane Wessels, Kang-Jin Lee
2002-10-29

Harvest User's Manual was edited by Kang-Jin Lee and covers Harvest
version 1.8. It was originally written by Darren R. Hardy, Michael F.
Schwartz and Duane Wessels for Harvest 1.4.pl2 on 1996-01-31.
______________________________________________________________________

Table of Contents

1. Introduction to Harvest
   1.1 Copyright
   1.2 Online Harvest Resources

2. Subsystem Overview
   2.1 Distributing the Gathering and Brokering Processes

3. Installing the Harvest Software
   3.1 Requirements for Harvest Servers
      3.1.1 Hardware
      3.1.2 Platforms
      3.1.3 Software
   3.2 Requirements for Harvest Users
   3.3 Retrieving and Installing the Harvest Software
      3.3.1 Distribution types
      3.3.2 Harvest components
      3.3.3 User-contributed software
   3.4 Building the Source Distribution
   3.5 Additional installation for the Harvest Broker
      3.5.1 Checking the installation for HTTP access
      3.5.2 Required modifications to your HTTP server
      3.5.3 Apache httpd
      3.5.4 Other HTTP servers
   3.6 Upgrading versions of the Harvest software
      3.6.1 Upgrading from version 1.6 to version 1.8
      3.6.2 Upgrading from version 1.5 to version 1.6
      3.6.3 Upgrading from version 1.4 to version 1.5
      3.6.4 Upgrading from version 1.3 to version 1.4
      3.6.5 Upgrading from version 1.2 to version 1.3
      3.6.6 Upgrading from version 1.1 to version 1.2
      3.6.7 Upgrading to version 1.1 from version 1.0 or older
   3.7 Starting up the system: RunHarvest and related commands
   3.8 Harvest team contact information

4. The Gatherer
   4.1 Overview
   4.2 Basic setup
      4.2.1 Gathering News URLs with NNTP
      4.2.2 Cleaning out a Gatherer
   4.3 RootNode specifications
      4.3.1 RootNode filters
      4.3.2 Generic Enumeration program description
      4.3.3 Example RootNode configuration
      4.3.4 Gatherer enumeration vs. candidate selection
   4.4 Generating LeafNode/RootNode URLs from a program
   4.5 Extracting data for indexing: The Essence summarizing subsystem
      4.5.1 Default actions of ``stock'' summarizers
      4.5.2 Summarizing SGML data
         4.5.2.1 Location of support files
         4.5.2.2 The SGML to SOIF table
         4.5.2.3 Errors and warnings from the SGML Parser
         4.5.2.4 Creating a summarizer for a new SGML-tagged data type
         4.5.2.5 The SGML-based HTML summarizer
         4.5.2.6 Adding META data to your HTML
         4.5.2.7 Other examples
      4.5.3 Customizing the type recognition, candidate selection,
            presentation unnesting, and summarizing steps
         4.5.3.1 Customizing the type recognition step
         4.5.3.2 Customizing the candidate selection step
         4.5.3.3 Customizing the presentation unnesting step
         4.5.3.4 Customizing the summarizing step
   4.6 Post-Summarizing: Rule-based tuning of object summaries
      4.6.1 The Rules file
      4.6.2 Rewriting URLs
   4.7 Gatherer administration
      4.7.1 Setting variables in the Gatherer configuration file
      4.7.2 Local file system gathering for reduced CPU load
      4.7.3 Gathering from password-protected servers
      4.7.4 Controlling access to the Gatherer's database
      4.7.5 Periodic gathering and realtime updates
      4.7.6 The local disk cache
      4.7.7 Incorporating manually generated information into a Gatherer
   4.8 Troubleshooting

5. The Broker
   5.1 Overview
   5.2 Basic setup
   5.3 Querying a Broker
      5.3.1 Example queries
      5.3.2 Regular expressions
      5.3.3 Query options selected by menus or buttons
      5.3.4 Filtering query results
      5.3.5 Result set presentation
   5.4 Customizing the Broker's Query Result Set
      5.4.1 The search.cf configuration file
         5.4.1.1 Defined Variables
         5.4.1.2 List of Definitions
      5.4.2 Example search.cf customization file
      5.4.3 Integrating your customized configuration file
      5.4.4 Displaying SOIF attributes in results
   5.5 World Wide Web interface description
      5.5.1 HTML files for graphical user interface
      5.5.2 CGI programs
      5.5.3 Help files for the user
   5.6 Administrating a Broker
      5.6.1 Deleting unwanted Broker objects
      5.6.2 Command-line Administration
   5.7 Tuning Glimpse indexing in the Broker
      5.7.1 The glimpseserver program
   5.8 Using different index/search engines with the Broker
      5.8.1 Using Swish as an indexer
      5.8.2 Using WAIS as an indexer
   5.9 Collector interface description: Collection.conf
   5.10 Troubleshooting

6. Programs and layout of the installed Harvest software
   6.1 $HARVEST_HOME
   6.2 $HARVEST_HOME/bin
   6.3 $HARVEST_HOME/brokers
   6.4 $HARVEST_HOME/cgi-bin
   6.5 $HARVEST_HOME/gatherers
   6.6 $HARVEST_HOME/lib
   6.7 $HARVEST_HOME/lib/broker
   6.8 $HARVEST_HOME/lib/gatherer
   6.9 $HARVEST_HOME/tmp

7. The Summary Object Interchange Format (SOIF)
   7.1 Formal description of SOIF
   7.2 List of common SOIF attribute names

8. Gatherer Examples
   8.1 Example 1 - A simple Gatherer
   8.2 Example 2 - Incorporating manually generated information
   8.3 Example 3 - Customizing type recognition and candidate selection
   8.4 Example 4 - Customizing type recognition and summarizing
      8.4.1 Using regular expressions to summarize a format
      8.4.2 Using programs to summarize a format
      8.4.3 Running the example
   8.5 Example 5 - Using RootNode filters

9. History of Harvest
   9.1 History of Harvest
   9.2 History of Harvest User's Manual
______________________________________________________________________

1. Introduction to Harvest

HARVEST is an integrated set of tools to gather, extract, organize, and
search information across the Internet. With modest effort, users can
tailor Harvest to digest information in many different formats and to
offer custom search services on the Internet.

A key goal of Harvest is to provide a flexible system that can be
configured in various ways to create many types of indexes. Harvest
also allows users to extract structured (attribute-value pair)
information from many different information formats and build indexes
that allow these attributes to be referenced during queries (e.g.,
searching for all documents with a certain regular expression in the
title field). An important advantage of Harvest is that it allows users
to build indexes using either manually constructed templates (for
maximum control over index content), templates constructed from
automatically extracted data (for easy coverage of large collections),
or a hybrid of the two methods.

Harvest is designed to make it easy to distribute the search system
across a pool of networked machines to handle higher load.

1.1. Copyright

The core of Harvest is licensed under the GPL <../../COPYING>.
Additional components distributed with Harvest are also under the GPL
or similar licenses.

Glimpse, the current default fulltext indexer, has a different license.
Here is a clarification of Glimpse's copyright status
<../glimpse-license-status>, kindly posted by Golda Velez
<mailto:gvelez@tucson.com> to comp.infosystems.harvest
<news:comp.infosystems.harvest>.

1.2. Online Harvest Resources

This manual is available at
harvest.sourceforge.net/harvest/doc/html/manual.html.

More information about Harvest is available at harvest.sourceforge.net.

2. Subsystem Overview

Harvest consists of several subsystems.
The Gatherer subsystem collects indexing information (such as keywords,
author names, and titles) from the resources available at Provider
sites (such as FTP and HTTP servers). The Broker subsystem retrieves
indexing information from one or more Gatherers, suppresses duplicate
information, incrementally indexes the collected information, and
provides a WWW query interface to it.

Figure 1: Harvest Software Components

You should start using Harvest simply, by installing a single ``stock''
(i.e., not customized) Gatherer and Broker on one machine to index some
of the FTP, World Wide Web, and NetNews data at your site.

After you get the system working in this basic configuration, you can
invest additional effort as warranted. First, as you scale up to index
larger volumes of information, you can reduce the CPU and network load
of indexing your data by distributing the gathering process. Second,
you can customize how Harvest extracts, indexes, and searches your
information, to better match the types of data you have and the ways
your users would like to interact with the data.

We discuss how to distribute the gathering process in the next
subsection. We cover various forms of customization in Section
``Customizing the type recognition, candidate selection, presentation
unnesting, and summarizing steps'' and in several parts of Section
``The Broker''.

2.1. Distributing the Gathering and Brokering Processes

Harvest Gatherers and Brokers can be configured in various ways.
Running a Gatherer remotely from a Provider site allows Harvest to
interoperate with sites that are not running Harvest Gatherers, by
using standard object retrieval protocols like FTP, Gopher, HTTP, and
NNTP. However, as suggested by the bold lines on the left side of
Figure ``2'', this arrangement results in excess server and network
load. Running a Gatherer locally is much more efficient, as shown on
the right side of Figure ``2''.
Nonetheless, running a Gatherer remotely is still better than having
many sites independently collect indexing information, since many
Brokers or other search services can share the indexing information
that the Gatherer collects.

If you have a number of FTP/HTTP/Gopher/NNTP servers at your site, it
is most efficient to run a Gatherer on each machine where these servers
run. On the other hand, you can reduce installation effort by running a
Gatherer on just one machine at your site and letting it retrieve data
from across the network.

Figure 2: Harvest Configuration Options

Figure ``2'' also illustrates that a Broker can collect information
from many Gatherers (to build an index of widely distributed
information). Brokers can also retrieve information from other Brokers,
in effect cascading indexed views from one another. Brokers retrieve
this information using the query interface, allowing them to filter or
refine the information from one Broker to the next.

3. Installing the Harvest Software

3.1. Requirements for Harvest Servers