URL-Max
   ...is the maximum number of URLs that are generated during the
   enumeration, and not a limit on how many URLs can pass through
   the candidate selection phase (see Section ``Customizing the
   candidate selection step'').

URL-Filter-filename
   This is the name of a file containing a set of regular expression
   filters (see Section ``RootNode filters'') to allow or deny
   particular LeafNodes in the enumeration.  The default filter is
   $HARVEST_HOME/lib/gatherer/URL-filter-default, which excludes
   many image and sound files.

Host-Max
   The number specified on the right hand side of the ``Host=''
   expression is the maximum number of hosts that will be touched
   during the RootNode enumeration.  The enumeration counts hosts by
   IP address so that aliased hosts are properly counted.  Note that
   this does not work correctly for multi-homed hosts, or for hosts
   with rotating DNS entries (used by some sites to balance load
   across heavily accessed servers).

   Note: Prior to Harvest Version 1.2 the ``Host=...'' line was
   called ``Site=...''.  We changed the name to ``Host='' because it
   is more intuitively meaningful (being a host count limit, not a
   site count limit).  For backwards compatibility with older
   Gatherer configuration files, we will continue to treat
   ``Site='' as an alias for ``Host=''.

Host-Filter-filename
   This is the name of a file containing a set of regular expression
   filters to allow or deny particular hosts in the enumeration.
   Each expression can specify both a host name (or IP address) and
   a port number (useful when multiple servers run on different
   ports of the same host and you want to index only one).  The
   syntax is ``hostname:port''.

Access
   If the RootNode is an HTTP URL, then you can specify which access
   methods to enumerate across.  Valid access method types are:
   FILE, FTP, Gopher, HTTP, News, Telnet, or WAIS.  Use a ``|''
   character between type names to allow multiple access methods.
   For example, ``Access=HTTP|FTP|Gopher'' will follow HTTP, FTP,
   and Gopher URLs while enumerating an HTTP RootNode URL.

   Note: We do not support cross-method enumeration from Gopher,
   because of the difficulty of ensuring that Gopher pointers do not
   cross site boundaries.  For example, the Gopher URL
   gopher://powell.cs.colorado.edu:7005/1ftp3aftp.cs.washington.edu40pub/
   would retrieve an FTP directory listing of
   ftp.cs.washington.edu:/pub, even though the host part of the URL
   is powell.cs.colorado.edu.

Delay
   This is the number of seconds to wait between server contacts.
   It defaults to one second when not specified otherwise.
   ``Delay=3'' will make the Gatherer sleep 3 seconds between server
   contacts.

Depth
   This is the maximum number of levels of enumeration that will be
   followed during gathering.  ``Depth=0'' means that there is no
   limit to the depth of the enumeration.  ``Depth=1'' means the
   specified URL will be retrieved, and all the URLs referenced by
   the specified URL will be retrieved; and so on for higher Depth
   values.  In other words, the enumeration will follow links up to
   Depth steps away from the specified URL.

Enumeration-Program
   This modifier adds a very flexible way to control a Gatherer.
   The Enumeration-Program is a filter which reads URLs as input and
   writes new enumeration parameters on output.  See Section
   ``Generic Enumeration program description'' for specific details.

By default, URL-Max is 250, URL-Filter imposes no limit, Host-Max is
1, Host-Filter imposes no limit, Access is HTTP only, Delay is 1
second, and Depth is zero.  There is no way to specify an unlimited
value for URL-Max or Host-Max.
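To illustrate how these modifiers combine on one RootNode entry,
consider the hypothetical example below (the host name, the limits,
and the filter file name ``myfilter'' are placeholders, not values
from this manual):

     <RootNodes>
     http://example.com/  URL=500,myfilter Host=10 Delay=5 Depth=2
     </RootNodes>

This entry would gather at most 500 URLs that pass the filters in
myfilter, touch at most 10 hosts, pause 5 seconds between server
contacts, and follow links at most 2 steps from the starting URL.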
4.3.1.  RootNode filters

Filter files use the standard UNIX regular expression syntax (as
defined by the POSIX standard), not the csh ``globbing'' syntax.
For example, you would use ``.*abc'' to indicate any string ending
with ``abc'', not ``*abc''.  A filter file has the following syntax:

     Deny  regex
     Allow regex

The URL-Filter regular expressions are matched only against the
URL-path portion of each URL (the scheme, hostname, and port are
excluded).  For example, the following URL-Filter file would allow
all URLs except those containing the regular expression
``/gatherers/'':

     Deny  /gatherers/
     Allow .

Another common use of URL-Filters is to prevent the Gatherer from
travelling ``up'' a directory.  Automatically generated HTML pages
for HTTP and FTP directories often contain a link ``..'' to the
parent directory.  To keep the Gatherer below a specific directory,
use a URL-Filter file such as:

     Allow ^/my/cool/stuff/
     Deny  .

The Host-Filter regular expressions are matched against the
``hostname:port'' portion of each URL.  Because the port is
included, you cannot use ``$'' to anchor the end of a hostname.
Beginning with version 1.3, IP addresses may be specified in place
of hostnames.  A class B address such as 128.138.0.0 would be
written as ``^128\.138\..*'' in regular expression syntax.  For
example:

     Deny   bcn.boulder.co.us:8080
     Deny   bvsd.k12.co.us
     Allow  ^128\.138\..*
     Deny   .

The order of the Allow and Deny entries is important, since the
filters are applied sequentially from first to last.  So, for
example, if you list ``Allow .*'' first, no subsequent Deny
expressions will be used, since this Allow filter will allow all
entries.

4.3.2.  Generic Enumeration program description

Flexible enumeration can be achieved by giving an
Enumeration=Enumeration-Program modifier to a RootNode URL.  The
Enumeration-Program is a filter which takes URLs on standard input
and writes new RootNode URLs on standard output.

The output format is different from the RootNode URL format used in
a Gatherer configuration file.  Each output line must have nine
fields separated by spaces.  These fields are:

     URL
     URL-Max
     URL-Filter-filename
     Host-Max
     Host-Filter-filename
     Access
     Delay
     Depth
     Enumeration-Program

These are the same fields as described in Section ``RootNode
specifications''.  Values must be given for each field.  Use
/dev/null to disable the URL-Filter-filename and the
Host-Filter-filename.  Use /bin/false to disable the
Enumeration-Program.
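Because every output line must supply all nine fields, even a
trivial Enumeration-Program is mostly boilerplate.  The following
minimal sketch (the script name is hypothetical; the field values
are simply the documented defaults, with Depth raised to 1) reads
URLs on standard input and re-emits each one as a RootNode to be
enumerated one level deep:

     #!/bin/sh
     # enum-one-level.sh (sketch of an Enumeration-Program).
     # For each URL on stdin, write one nine-field RootNode line:
     #   URL URL-Max URL-Filter-filename Host-Max Host-Filter-filename
     #   Access Delay Depth Enumeration-Program
     # /dev/null disables both filter files; /bin/false disables any
     # further enumeration program.
     while read -r url; do
         echo "$url 250 /dev/null 1 /dev/null HTTP 1 1 /bin/false"
     done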
4.3.3.  Example RootNode configuration

Below is an example RootNode configuration:

     <RootNodes>
(1)  http://harvest.cs.colorado.edu/               URL=100,MyFilter
(2)  http://www.cs.colorado.edu/                   Host=50 Delay=60
(3)  gopher://gopher.colorado.edu/                 Depth=1
(4)  file://powell.cs.colorado.edu/home/hardy/     Depth=2
(5)  ftp://ftp.cs.colorado.edu/pub/cs/techreports/ Depth=1
(6)  http://harvest.cs.colorado.edu/~hardy/hotlist.html \
             Depth=1 Delay=60
(7)  http://harvest.cs.colorado.edu/~hardy/ \
             Depth=2 Access=HTTP|FTP
     </RootNodes>

Each of the above RootNodes follows a different enumeration
configuration, as follows:

1. This RootNode will gather up to 100 documents that pass the URL
   name filters contained in the file MyFilter (a hypothetical
   version of such a filter file is sketched after this list).

2. This RootNode will gather documents from up to the first 50 hosts
   it encounters while enumerating the specified URL, with no limit
   on the Depth of link enumeration.  It will also wait 60 seconds
   between retrievals.

3. This RootNode will gather only the documents on the top-level
   menu of the Gopher server at gopher.colorado.edu.

4. This RootNode will gather all documents that are in the
   /home/hardy directory or in any of its subdirectories.

5. This RootNode will gather only the documents in the
   /pub/techreports directory, which in this case holds
   bibliographic files rather than the technical reports themselves.

6. This RootNode will gather all documents that are within 1 step of
   the specified RootNode URL, waiting 60 seconds between
   retrievals.  This is a good way to index your hotlist: by using
   an HTML file of ``hotlist'' pointers as this RootNode, the
   enumeration will gather the top-level pages of all your hotlist
   pointers.

7. This RootNode will gather all documents that are at most 2 steps
   away from the specified RootNode URL.  Furthermore, it will
   follow and enumerate any HTTP or FTP URLs that it encounters
   during enumeration.
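The contents of MyFilter are not shown in this manual.  A
hypothetical version, using the filter syntax from Section
``RootNode filters'', that restricts entry (1) to HTML pages might
be:

     Allow \.html
     Deny  .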
4.3.4.  Gatherer enumeration vs. candidate selection

In addition to using the URL-Filter and Host-Filter files of the
RootNode specification mechanism described in Section ``RootNode
specifications'', you can prevent documents from being indexed by
customizing the stoplist.cf file, described in Section ``Customizing
the type recognition, candidate selection, presentation unnesting,
and summarizing steps''.  Since these mechanisms are invoked at
different times, they have different effects.  The URL-Filter and
Host-Filter mechanisms are invoked by the Gatherer's ``RootNode''
enumeration programs.  Using these filters as stop lists can prevent
unwanted objects from being retrieved across the network, which can
dramatically reduce gathering time and network traffic.

The stoplist.cf file is used by the Essence content extraction
system (described in Section ``Extracting data for indexing: The
Essence summarizing subsystem'') after the objects are retrieved, to
select which objects should be content-extracted and indexed.  This
can be useful because Essence provides a more powerful means of
rejecting indexing candidates: you can customize based not only on
file naming conventions but also on file contents (e.g., looking at
strings at the beginning of a file or at UNIX ``magic'' numbers),
and on more sophisticated file-grouping schemes (e.g., deciding not
to extract contents from object code files for which source code is
available).

As an example of combining these mechanisms, suppose you want to
index the ``.ps'' files linked into your WWW site.  You could do
this by having a stoplist.cf file that contains ``HTML'', and a
RootNode URL-Filter that contains:

     Allow \.html
     Allow \.ps
     Deny  .*

As a final note, independent of these customizations the Gatherer
attempts to avoid retrieving objects where possible, by using a
local disk cache of objects and by using the HTTP
``If-Modified-Since'' request header.  The local disk cache is
described in Section ``The local disk cache''.

4.4.  Generating LeafNode/RootNode URLs from a program

It is possible to generate RootNode or LeafNode URLs automatically
from program output.  This might be useful when gathering a large
number of Usenet newsgroups, for example.  The program is specified
inside the RootNode or LeafNode section, preceded by a pipe symbol:

     <LeafNodes>
     |generate-news-urls.sh
     </LeafNodes>

The script must output valid URLs, such as:

     news:comp.unix.voodoo
     news:rec.pets.birds
     http://www.nlanr.net/
     ...

In the case of RootNode URLs, enumeration parameters can be given
after the program:

     <RootNodes>
     |my-fave-sites.pl Depth=1 URL=5000,url-filter
     </RootNodes>
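The manual does not show generate-news-urls.sh itself.  A minimal
sketch of such a script, assuming the news server's active-groups
file lives at /var/lib/news/active (a path that varies between news
installations), might be:

     #!/bin/sh
     # generate-news-urls.sh (sketch) -- emit one news: URL per
     # newsgroup named in the news server's active file.  The first
     # field of each active-file line is the group name; the file
     # path is an assumption, so adjust it for your server.
     awk '{ print "news:" $1 }' /var/lib/news/active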
4.5.  Extracting data for indexing: The Essence summarizing
subsystem

After the Gatherer retrieves a document, it passes the document
through a subsystem called Essence to extract indexing information.
Essence allows the Gatherer to collect indexing information easily
from a wide variety of information sources, using different
techniques depending on the type of data and the needs of the
particular corpus being indexed.  In a nutshell, Essence can
determine the type of data pointed to by a URL (e.g., PostScript
vs. HTML), ``unravel'' presentation nesting formats (such as
compressed ``tar'' files), select which types of data to index
(e.g., don't index Audio files),