You might also want to check the start scripts which start Harvest daemons during system boot, and remove cron jobs necessary for running Harvest.

2.2.  Where can I get bison and flex?

Bison and flex are available at the GNU FTP site <ftp://ftp.gnu.org/> and its mirrors.

2.3.  How can I install Harvest in "/my/directory/harvest" instead of "/usr/local/harvest"?

Do

      # ./configure --prefix=/my/directory/harvest
      # make
      # make install

2.4.  How can I avoid the "syntax error before `regoff_t'" error message when compiling Harvest?

On some systems, building Harvest may fail with the following message:

      Making all in util
      gcc  -I../include -I./../include -c buffer.c
      In file included from ../include/config.h:350,
                       from ../include/util.h:112,
                       from buffer.c:86:
      /usr/include/regex.h:46: syntax error before `regoff_t'
      /usr/include/regex.h:46: warning: data definition has no type or storage class
      /usr/include/regex.h:56: syntax error before `regoff_t'
      *** Error code 1

If you get this error, edit src/common/include/autoconf.h and add "#define USE_GNU_REGEX 1" before typing make to build Harvest.

2.5.  Where can I get more information about building Harvest on FreeBSD?

See the FreshPorts Harvest page http://www.freshports.org/www/harvest/ for more information about building Harvest on FreeBSD.

3.  Gatherer

3.1.  Does the Gatherer support cookies?

No, Harvest's Gatherer doesn't support cookies.

3.2.  Why doesn't Local-Mapping work?

In Harvest 1.7.7, the default HTML enumerator was switched from httpenum-depth to httpenum-breadth. The breadth first enumerator had a bug in Local-Mapping, which was fixed in Harvest 1.7.19. To make Local-Mapping work, use the depth first enumerator or update to Harvest 1.7.19 or later.

Local mapping will also fail if the file is not readable by the gatherer process, the file is not a regular file, the file has execute bits set, or the filename contains characters that have to be escaped (like tilde, space, curly brace, quote, etc.). So, for directories, symbolic links and cgi scripts, the gatherer will always contact the server instead of using the local file.

3.3.  Does the Gatherer gather the Root- and LeafNode-URLs periodically?

No, the Gatherer gathers Root- and LeafNode URLs only once. To check the URLs periodically, you have to use cron (see "man 8 cron") to run $HARVEST_HOME/gatherers/YOUR_GATHERER/RunGatherer.
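For example, run "crontab -e" as the user that owns the Harvest installation and add a line like the following. This is only a sketch: the schedule and the gatherer directory name are placeholders, and the path assumes the default /usr/local/harvest prefix.

      0 2 * * * /usr/local/harvest/gatherers/YOUR_GATHERER/RunGatherer

Cron does not know about $HARVEST_HOME, so give the full path to RunGatherer. The example above re-runs the gatherer every night at 2:00.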
3.4.  Can Harvest gather https URLs?

No, https is not supported by Harvest. To gather https URLs, use Harvest-ng from Simon Wilkinson. It is available at the Harvest-ng homepage http://webharvest.sourceforge.net/ng/.

3.5.  When will Harvest be able to gather https URLs?

This is not on top of my to-do list and may take some time.

3.6.  Does Harvest support client based scripting/plugins like Javascript, Flash?

No, Harvest's gatherer does not support Javascript, Flash, etc., and there are no plans to add support for them.

3.7.  Why does the gatherer stop after gathering a few pages?

Harvest's gatherer doesn't support Javascript, Flash, etc. Check the site you want to gather and make sure that the site is browsable without any plugins, Javascript, etc.

3.8.  How can I index local newsgroups? How can I put a hostname into a News URL?

You will find a News URL hostname patch by Collin Smith in the contrib directory.

NOTE: Even though most web browsers support this, it violates RFC-1738.

3.9.  What do the gatherer options "Search=Breadth" and "Search=Depth" do, and which keywords are available for the "Search=" option?

The Search option selects an enumerator for http and gopher URLs. Harvest comes with a breadth first (Search=Breadth) and a depth first (Search=Depth) enumerator for http and gopher. They follow different strategies when walking the URLs to build a list of candidates for processing. The breadth first enumerator processes all links on one level before descending to the next level; if you limit the number of URLs to gather from a site, it will give you a more representative overview of the site. The depth first enumerator descends to the next level as soon as possible, and when there are no links left in the current branch, it processes the next branch. The depth first enumerator doesn't use as much memory as the breadth first enumerator. If you don't have compelling reasons to switch from one enumerator to the other, the default value should be a reasonable choice.

3.10.  How can I index html pages generated by cgi scripts? How can I index URLs which have a "?" (question mark) in them?

Remove HTTP-Query from $HARVEST_HOME/lib/gatherer/stoplist.cf and $HARVEST_HOME/gatherers/YOUR_GATHERER/lib/stoplist.cf. For versions earlier than 1.7.5, you also have to create a (symbolic) link from $HARVEST_HOME/lib/gatherer/HTML.sum to $HARVEST_HOME/lib/gatherer/HTTP-Query.sum. To do this, type:

      # cd $HARVEST_HOME/lib/gatherer
      # ln -s HTML.sum HTTP-Query.sum
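If you prefer, the HTTP-Query entry can also be removed with a quick shell edit. This is only a sketch, assuming each stoplist.cf lists one type name per line; check the file first and keep a backup:

      # cd $HARVEST_HOME/lib/gatherer
      # cp stoplist.cf stoplist.cf.bak
      # grep -v '^HTTP-Query' stoplist.cf.bak > stoplist.cf

Repeat the same edit for $HARVEST_HOME/gatherers/YOUR_GATHERER/lib/stoplist.cf.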
3.11.  Why is the gatherer so slow? How can I make it faster?

The gatherer's default setting is to sleep one second after retrieving a URL. This is to avoid overloading the webserver. If you gather from webservers under your control and know that they can handle the additional load caused by the gatherer, add "Delay=0" to your root node specification to disable the sleep. The lines should look like:

      <RootNodes>
      http://www.SOMESERVER.com/ Search=Breadth Delay=0
      </RootNodes>

Alternatively, you can set the delay value for all root nodes by adding Access-Delay: 0 to your configuration file. It should look like:

      Gatherer-Name:  YOUR Gatherer
      Gatherer-Port:  8500
      Top-Directory:  /HARVEST_DIR/work1/gatherers/testgather
      Access-Delay:   0

      <RootNodes>
      http://www.MYSITE.com/ Search=Breadth
      </RootNodes>

3.12.  Why is the gatherer still so slow?

Harvest's gatherer is designed to handle many types of documents and many types of protocols. To achieve this flexibility it uses external programs to handle the different document types and protocols. For example, when gathering HTML documents via HTTP, each document is parsed twice: first to get a list of candidates to gather, and then to get a summary of the document. The summarizer is started each time a document arrives, quits after summarizing that document, and has to be restarted for the next document. Compared to more HTTP/HTML oriented approaches, this causes a significant overhead when gathering HTTP/HTML only.

Harvest also retrieves one document at a time, which causes a slowdown if you encounter a slow site. Due to its implementation, the gathering process is quite heavyweight and uses up to 25 MB of RAM per Gatherer. For this reason, there have been no attempts to spawn more gatherers to optimize the bandwidth usage.

3.13.  How do I request "304 Not Modified" answers from HTTP servers?

To send "If-Modified-Since" request headers and get "304 Not Modified" answers from HTTP servers, add the following line to the gatherer's configuration file:

      HTTP-If-Modified-Since: Yes

If a document hasn't changed since the last gathering, the gatherer will use the data from its database instead of retrieving the document again. This saves bandwidth and speeds up gathering significantly.

3.14.  Why does Harvest gather different URLs between gatherings?

When HTTP-If-Modified-Since is enabled, the candidate selection scheme of the http enumerators changes for successful database lookups: for unchanged URLs, the enumerators behave more like a depth first gatherer. The result of the gatherings should be the same if you are gathering all URLs of a site, but if you gather only part of a site by using URL=n with n smaller than the number of URLs of the site, you will get a different subset of the site between gatherings.
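To illustrate the partial gathering case above, a RootNode specification that stops after the first 100 URLs of a site might look like the following sketch (the hostname and the limit of 100 are placeholders):

      <RootNodes>
      http://www.SOMESERVER.com/ URL=100 Search=Breadth
      </RootNodes>

With Search=Breadth the 100 URLs are taken level by level, giving a more representative cross-section of the site; with Search=Depth the gatherer would follow one branch as far as it can before starting the next.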
3.15.  Why has the Gatherer's database vanished after gathering?

The Gatherer uses GDBM databases to store its data on disk. The Gatherer's database files can grow very large depending on how much data you gather. On some systems (e.g. i386 based Linux) the maximum file size is 2GB. If the amount of data surpasses this limit, the GDBM database file will be wiped from the disk.

3.16.  How can I avoid GDBM files growing very big during gathering?

The Gatherer's temporary GDBM database file WORKING.gdbm grows very rapidly when gathering nested objects like tar, tar.gz, zip etc. archives. GDBM databases keep growing when tuples are inserted into and deleted from them, because GDBM reuses only fractions of the empty filespace. To get rid of unused space, the GDBM database has to be reorganized. The reorganization, however, is slow and will slow down the gathering, so the default is not to reorganize the gatherer's temporary database. This should work well for small to medium sized Gatherers, but for large Gatherers it may be necessary to reorganize the temporary database during gathering to keep the size of the database at a manageable level. To reorganize WORKING.gdbm every 100 deletions, add the following line to your gatherer configuration file:

      Essence-Options: --max-deletions 100

Don't set this value too low, since reorganization will consume a significant share of CPU time and disk I/O. Reorganizing every 10 to 100 deletions seems to be a reasonable value.

3.17.  Can I use Htdig as Gatherer? Can the Broker import data from Htdig?

The perl module Metadata from Dave Beckett can dump data from a Htdig database into a SOIF stream. Metadata only supports GDBM databases, so this only works with versions earlier than Htdig 3.1, because newer versions of Htdig switched from GDBM to Sleepycat's Berkeley DB.

3.18.  How can I control access to the Gatherer's database?

Edit $HARVEST_HOME/gatherers/YOUR_GATHERER/data/gatherd.cf to allow or deny access. A line that begins with Allow is followed by any number of domain or host names that are allowed to connect to the Gatherer. If the word all is used, then all hosts are matched. Deny is the opposite of Allow. The following example will only allow hosts in the cs.colorado.edu or usc.edu domains to access the Gatherer's database:

      Allow  cs.colorado.edu usc.edu
      Deny   all

3.19.  Does Harvest's Gatherer support WAP/WML, Gnutella, Napster?

No. Harvest's Gatherer doesn't support WAP. Peer to peer services like Gnutella, Napster, etc. are also unsupported.

3.20.  How do I gather ftp URLs from wu-ftp daemons?

Changes in wu-ftpd 2.6.x broke ftpget. There is a replacement for it in the contrib directory which wraps any ftp client to behave like ftpget.

3.21.  Why don't file URLs in LeafNodes work as expected?

File URLs pointing to directories, like file://misc/documents/ in LeafNodes, are considered nested objects which will be unnested.

3.22.  Why does gathering from a site fail completely or for parts of the site?

This may be caused by the site's robots.txt. You can check this by typing "http://www.SOME.SITE.com/robots.txt" into your favourite web browser.

4.  Summarizer
