is the maximum number of URLs that are generated during the
enumeration, and not a limit on how many URLs can pass through the
candidate selection phase (see Section ``Customizing the candidate
selection step'').

URL-Filter-filename
     This is the name of a file containing a set of regular expression
     filters (see Section ``RootNode filters'') to allow or deny
     particular LeafNodes in the enumeration.  The default filter is
     $HARVEST_HOME/lib/gatherer/URL-filter-default, which excludes many
     image and sound files.

Host-Max
     The number specified on the right hand side of the ``Host=''
     expression is the maximum number of hosts that will be touched
     during the RootNode enumeration.  The enumeration counts hosts by
     IP address, so that aliased hosts are enumerated correctly.  Note
     that this does not work correctly for multi-homed hosts, or for
     hosts with rotating DNS entries (used by some sites to balance the
     load on heavily accessed servers).

     Note: Prior to Harvest Version 1.2 the ``Host=...'' line was
     called ``Site=...''.  We changed the name to ``Host='' because it
     is more intuitively meaningful (being a host count limit, not a
     site count limit).  For backwards compatibility with older
     Gatherer configuration files, ``Site='' is still treated as an
     alias for ``Host=''.

Host-Filter-filename
     This is the name of a file containing a set of regular expression
     filters to allow or deny particular hosts in the enumeration.
     Each expression can specify both a host name (or IP address) and
     a port number (in case you have multiple servers running on
     different ports of the same host and want to index only one).
     The syntax is ``hostname:port''.

Access
     If the RootNode is an HTTP URL, then you can specify which access
     methods the enumeration is allowed to follow.  Valid access
     method types are: FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS.
     Use a ``|'' character between type names to allow multiple access
     methods.  For example, ``Access=HTTP|FTP|Gopher'' will follow
     HTTP, FTP, and Gopher URLs while enumerating an HTTP RootNode
     URL.

     Note: We do not support cross-method enumeration from Gopher,
     because of the difficulty of ensuring that Gopher pointers do not
     cross site boundaries.  For example, the Gopher URL
     gopher://powell.cs.colorado.edu:7005/1ftp3aftp.cs.washington.edu40pub/
     would retrieve an FTP directory listing of
     ftp.cs.washington.edu:/pub, even though the host part of the URL
     is powell.cs.colorado.edu.

Delay
     This is the number of seconds to wait between server contacts.
     It defaults to one second when not specified otherwise.  Delay=3
     will make the Gatherer sleep 3 seconds between server contacts.

Depth
     This is the maximum number of levels of enumeration that will be
     followed during gathering.  Depth=0 means that there is no limit
     to the depth of the enumeration.  Depth=1 means the specified URL
     will be retrieved, and all the URLs referenced by the specified
     URL will be retrieved; and so on for higher Depth values.  In
     other words, the enumeration will follow links up to Depth steps
     away from the specified URL.

Enumeration-Program
     This modifier adds a very flexible way to control a Gatherer.
     The Enumeration-Program is a filter which reads URLs on its input
     and writes new enumeration parameters on its output.  See Section
     ``Generic Enumeration program description'' for specific details.
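For instance, a single RootNode entry might combine several of these
modifiers.  The host name and filter file name below are hypothetical
placeholders:

     <RootNodes>
     http://www.example.edu/ URL=500,MyFilter Host=10 \
          Access=HTTP|FTP Delay=5 Depth=3
     </RootNodes>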
By default, URL-Max is 250, URL-Filter imposes no limit, Host-Max is
1, Host-Filter imposes no limit, Access is HTTP only, Delay is 1
second, and Depth is zero.  There is no way to specify an unlimited
value for URL-Max or Host-Max.

4.3.1.  RootNode filters

Filter files use the standard UNIX regular expression syntax (as
defined by the POSIX standard), not the csh ``globbing'' syntax.  For
example, you would use ``.*abc'' to indicate any string ending with
``abc'', not ``*abc''.  A filter file has the following syntax:

     Deny regex
     Allow regex

The URL-Filter regular expressions are matched only against the
URL-path portion of each URL (the scheme, hostname, and port are
excluded).  For example, the following URL-Filter file would allow all
URLs except those containing the regular expression ``/gatherers/'':

     Deny /gatherers/
     Allow .

Another common use of URL filters is to prevent the Gatherer from
travelling ``up'' a directory.  Automatically generated HTML pages for
HTTP and FTP directories often contain a link to the parent directory
``..''.  To keep the Gatherer below a specific directory, use a
URL-filter file such as:

     Allow ^/my/cool/stuff/
     Deny .

The Host-Filter regular expressions are matched against the
``hostname:port'' portion of each URL.  Because the port is included,
you cannot use ``$'' to anchor the end of a hostname.  Beginning with
version 1.3, IP addresses may be specified in place of hostnames.  A
class B address such as 128.138.0.0 would be written as
``^128\.138\..*'' in regular expression syntax.  For example:

     Deny bcn.boulder.co.us:8080
     Deny bvsd.k12.co.us
     Allow ^128\.138\..*
     Deny .

The order of the Allow and Deny entries is important, since the
filters are applied sequentially from first to last.  For example, if
you list ``Allow .*'' first, no subsequent Deny expressions will be
used, since this Allow filter matches every entry.

4.3.2.  Generic Enumeration program description

Flexible enumeration can be achieved by giving an
Enumeration=Enumeration-Program modifier to a RootNode URL.  The
Enumeration-Program is a filter which takes URLs on its standard input
and writes new RootNode URLs to its standard output.  The output
format differs from that of a RootNode URL in a Gatherer configuration
file.  Each output line must have nine fields separated by spaces.
These fields are:

     URL URL-Max URL-Filter-filename Host-Max Host-Filter-filename Access Delay Depth Enumeration-Program

These are the same fields as described in Section ``RootNode
specifications''.  A value must be given for each field.  Use
/dev/null to disable the URL-Filter-filename and Host-Filter-filename.
Use /bin/false to disable the Enumeration-Program.
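As an illustration, below is a minimal sketch of such a filter,
written as a Bourne shell script.  The field values are placeholders
chosen to match the defaults described above, not recommendations:

     #!/bin/sh
     # For each URL read on stdin, emit one nine-field RootNode line:
     # URL URL-Max URL-Filter Host-Max Host-Filter Access Delay Depth
     # Enumeration-Program.  /dev/null disables the two filter files;
     # /bin/false disables any further enumeration program.
     while read url; do
         echo "$url 250 /dev/null 1 /dev/null HTTP 1 1 /bin/false"
     done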
4.3.3.  Example RootNode configuration

Below is an example RootNode configuration:

     <RootNodes>
     (1) http://harvest.cs.colorado.edu/ URL=100,MyFilter
     (2) http://www.cs.colorado.edu/ Host=50 Delay=60
     (3) gopher://gopher.colorado.edu/ Depth=1
     (4) file://powell.cs.colorado.edu/home/hardy/ Depth=2
     (5) ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1
     (6) http://harvest.cs.colorado.edu/~hardy/hotlist.html \
              Depth=1 Delay=60
     (7) http://harvest.cs.colorado.edu/~hardy/ \
              Depth=2 Access=HTTP|FTP
     </RootNodes>

Each of the above RootNodes follows a different enumeration
configuration, as follows:

  1. This RootNode will gather up to 100 documents that pass the URL
     name filters contained in the file MyFilter.

  2. This RootNode will gather documents from up to the first 50 hosts
     it encounters while enumerating the specified URL, with no limit
     on the Depth of link enumeration.  It will also wait 60 seconds
     between retrievals.

  3. This RootNode will gather only the documents from the top-level
     menu of the Gopher server at gopher.colorado.edu.

  4. This RootNode will gather all documents that are in the
     /home/hardy directory, or in any subdirectory of /home/hardy.

  5. This RootNode will gather only the documents that are in the
     /pub/cs/techreports directory which, in this case, contains some
     bibliographic files rather than the technical reports themselves.

  6. This RootNode will gather all documents that are within 1 step of
     the specified RootNode URL, waiting 60 seconds between
     retrievals.  This is a good way to index your hotlist: specify an
     HTML file containing your ``hotlist'' pointers as this RootNode,
     and the enumeration will gather the top-level pages of all of
     your hotlist pointers.

  7. This RootNode will gather all documents that are at most 2 steps
     away from the specified RootNode URL.  Furthermore, it will
     follow and enumerate any HTTP or FTP URLs that it encounters
     during enumeration.

4.3.4.  Gatherer enumeration vs. candidate selection

In addition to using the URL-Filter and Host-Filter files of the
RootNode specification mechanism described in Section ``RootNode
specifications'', you can prevent documents from being indexed by
customizing the stoplist.cf file, described in Section ``Customizing
the type recognition, candidate selection, presentation unnesting, and
summarizing steps''.  Since these mechanisms are invoked at different
times, they have different effects.  The URL-Filter and Host-Filter
mechanisms are invoked by the Gatherer's ``RootNode'' enumeration
programs.  Using these filters as stop lists can prevent unwanted
objects from being retrieved across the network, which can
dramatically reduce gathering time and network traffic.  The
stoplist.cf file is used by the Essence content extraction system
(described in Section ``Extracting data for indexing: The Essence
summarizing subsystem'') after the objects are retrieved, to select
which objects should be content extracted and indexed.

This can be useful because Essence provides a more powerful means of
rejecting indexing candidates: you can customize based not only on
file naming conventions but also on file contents (e.g., looking at
strings at the beginning of a file or at UNIX ``magic'' numbers), and
on more sophisticated file-grouping schemes (e.g., deciding not to
extract contents from object code files for which source code is
available).

As an example of combining these mechanisms, suppose you want to index
the ``.ps'' files linked into your WWW site.  You could do this by
having a stoplist.cf file that contains ``HTML'', and a RootNode
URL-Filter that contains:

     Allow \.html
     Allow \.ps
     Deny .*

As a final note, independent of these customizations the Gatherer
attempts to avoid retrieving objects where possible, by using a local
disk cache of objects and by using the HTTP ``If-Modified-Since''
request header.  The local disk cache is described in Section ``The
local disk cache''.

4.4.  Generating LeafNode/RootNode URLs from a program

It is possible to generate RootNode or LeafNode URLs automatically
from program output.  This can be useful when gathering a large number
of Usenet newsgroups, for example.  The program is specified inside
the RootNode or LeafNode section, preceded by a pipe symbol:

     <LeafNodes>
     |generate-news-urls.sh
     </LeafNodes>

The script must output valid URLs, such as:

     news:comp.unix.voodoo
     news:rec.pets.birds
     http://www.nlanr.net/
     ...

In the case of RootNode URLs, enumeration parameters can be given
after the program:

     <RootNodes>
     |my-fave-sites.pl Depth=1 URL=5000,url-filter
     </RootNodes>
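As a sketch, a script like the generate-news-urls.sh named above could
be written as follows.  The path to the news server's active file is
an assumption; adjust it for your site:

     #!/bin/sh
     # Hypothetical sketch: print one news: URL per newsgroup listed
     # in an NNTP server's active file (the newsgroup name is the
     # first whitespace-separated field of each line).
     awk '{ print "news:" $1 }' /usr/lib/news/active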
4.5.  Extracting data for indexing: The Essence summarizing subsystem

After the Gatherer retrieves a document, it passes the document
through a subsystem called Essence to extract indexing information.
Essence allows the Gatherer to collect indexing information easily
from a wide variety of information, using different techniques
depending on the type of data and the needs of the particular corpus
being indexed.  In a nutshell, Essence can determine the type of data
pointed to by a URL (e.g., PostScript vs. HTML), ``unravel''
presentation nesting formats (such as compressed ``tar'' files),
select which types of data to index (e.g., don't index Audio files),