  4.1.  Why doesn't Post-Summarizing work?

  The most common error is that the instructions are indented by spaces
  instead of a tab. The Post-Summarizing rule file uses a Makefile-like
  syntax: conditions begin in the first column and instructions are
  indented by a tab. Check the rule file and make sure that every
  instruction line starts with a tab.

  4.2.  How can I summarize meta tags in HTML documents?

  In Harvest 1.5.20.kj-0.3, the default summarizer for HTML data was
  switched to HTML-lax.sum, which does not handle meta tags. Edit
  $HARVEST_HOME/lib/gatherer/HTML.sum and uncomment the SGML or Perl
  based summarizer.

  4.3.  Why are there raw HTML tags in some query results?

  If you see raw HTML tags in query results, the HTML summarizer was not
  able to parse the page correctly. Harvest comes with three different
  summarizers for HTML. If the default summarizer fails, try the other
  two. To do this, edit $HARVEST_HOME/lib/gatherer/HTML.sum and
  uncomment one of the summarizers.

  4.4.  How can I summarize DVI files?

  Use a Harvest version older than 1.5.20-kj-0.8 or newer than 1.7.2.
  The versions in between have a bug that prevents DVI files from being
  summarized.

  4.5.  How can I summarize PDF files?

  You need xpdf to summarize PDF files. Harvest uses pdftotext from xpdf
  for this.

  Alternatively, you can use acroread to convert PDF files to PostScript
  and pass the result to the PostScript summarizer. To do this, edit
  $HARVEST_HOME/lib/gatherer/Pdf.sum accordingly.

  4.6.  Where can I get pdftotext?

  pdftotext is part of xpdf. It is available at the Xpdf homepage,
  http://www.foolabs.com/xpdf/.

  4.7.  How can I improve the summarizer for Microsoft Word files?

  Harvest uses catdoc to summarize Microsoft Word files. If you get bad
  summaries for Microsoft Word files, you might want to try wvHtml,
  which is part of wvWare, instead of catdoc.

  4.8.  Where can I get wvWare?

  wvWare is available at the wvWare homepage, http://www.wvware.com/.

  4.9.  How can I add support for a new file type?

  Give the new file type a name and teach Harvest how to recognize it by
  modifying byname.cf (to determine the file type by its name), byurl.cf
  (to determine the file type by the URL), or magic and bycontent.cf (to
  determine the file type by looking at the content of the file). You
  will find bycontent.cf, byname.cf, byurl.cf and magic in your
  $HARVEST_HOME/lib/gatherer/ directory.

  Create a summarizer (a program or script) which takes the filename as
  its first argument and prints a SOIF stream "Attributename{length of
  data}:<tab>your data" to stdout. For file type "Xyz", you have to
  create a summarizer called Xyz.sum in the $HARVEST_HOME/lib/gatherer/
  directory.

  In most cases it is probably easiest to convert file type "Xyz" to a
  supported file type like HTML, PostScript, etc. and use an existing
  summarizer on the converted file.

  4.10.  How can I use nsgmls instead of sgmls to summarize documents?

  Edit $HARVEST_HOME/lib/gatherer/SGML.sum and set $sgmls_cmd =
  "/usr/local/bin/nsgmls", or wherever you have installed nsgmls.

  5.  Broker

  5.1.  How can I start a Broker at boot time?

  Some user-contributed startup scripts are located in the contrib/etc/
  directory of the Harvest source distribution.
  Modify the appropriate files and copy them to your startup script
  directory.

  5.2.  How can I start a Broker without starting a collection?

  When a Broker starts, it starts collecting data, which can take some
  time. To avoid this, use the -nocol option when invoking RunBroker.
  If you have installed Harvest in /usr/local/harvest/, put the
  following line into your startup file, e.g. /etc/rc.local:

               /usr/local/harvest/brokers/YOUR_BROKER/RunBroker -nocol

  Replace /usr/local/harvest/ with the directory where you have
  installed Harvest.

  5.3.  Why don't the documents which I have just gathered show up in
  the Broker?

  The Broker imports data from the Gatherer once every 24 hours. If you
  want to import the data immediately after gathering, restart the
  Broker or signal the Broker to import data.

  You can signal the Broker with the command line client brkclient,
  located in $HARVEST_HOME/lib/broker/, by typing:

               # brkclient localhost 8501 '#ADMIN #Password secret #collection'

  Replace hostname, port and password if necessary.

  An easier method is to use the WWW based admin interface at:
  "http://www.YOUR_SERVER.com/Harvest/brokers/YOUR_BROKER/admin/admin.html".

  5.4.  Why do I get error messages when I try to access
  "http://some.host/Harvest/brokers/your-broker-path/" after running
  $HARVEST_HOME/RunHarvest?

  Check the error log of your http daemon. The http daemon must be able
  to follow symbolic links. For Apache httpd you can do this by adding:

               <Location /Harvest/brokers/your-broker-path/>
                       Options FollowSymLinks
               </Location>

  to your httpd.conf. Note that the Apache documentation states that
  FollowSymLinks is honored only in <Directory> sections and .htaccess
  files, so if the <Location> block has no effect, set the option in a
  <Directory> block for the broker's directory instead.

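  To find out which entries under a broker directory are symbolic links,
  and therefore depend on the web server following them, a small script
  can help. The following is a minimal sketch in Python and not part of
  Harvest; the function name find_symlinks is an assumption made for
  illustration, and the script assumes a POSIX file system.

```python
import os


def find_symlinks(root: str) -> list[tuple[str, str]]:
    """Return (link path, link target) pairs for every symlink below root.

    os.walk() does not descend into symlinked directories by default,
    so links to directories are reported but not traversed.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append((path, os.readlink(path)))
    return sorted(found)
```

  Calling find_symlinks() with the path of your broker directory (for
  example a directory below $HARVEST_HOME/brokers/) lists the links the
  http daemon has to be allowed to follow.
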
  If you don't want symbolic links, delete the symbolic link and copy
  the file to the new name.

  5.5.  Why are NEWS URLs broken? Where are the hostnames in NEWS URLs?
  How can I follow NEWS URLs?

  Harvest's Gatherer doesn't put hostnames into NEWS URLs. If your web
  browser complains about a missing news server, configure your web
  browser to use the news server of your provider, company or
  organization as your default news server.

  For more information on why Harvest doesn't put hostnames into NEWS
  URLs, see RFC 1738, sections 3.6 and 3.7.

  5.6.  Why don't I get any results if I use a long or complex query
  string?

  The length of a query string is limited to 30 characters when using
  regular expressions (wildcards), excluding the escape characters.

  5.7.  Can I use wildcards in attribute values for structured queries?

  No, regular expressions for attribute names and attribute values in
  structured queries aren't supported. So, queries like "Author: Smi.*"
  or "Auth.*: Smith" won't do what you might expect.

  5.8.  Are the attribute names case sensitive?

  No, the attribute names are not case sensitive. So, "Time-To-Live" is
  the same as "Time-to-Live", "Time-to-live", "time-to-live", etc.

  5.9.  Why doesn't collecting from a broker work?

  This is due to a bug introduced in Harvest 1.5.18. The bug was fixed
  in 1.7.8. To make it work again, update to 1.7.8 or higher.

  5.10.  How can I customize the Harvest user interface?

  The query pages are located in
  $HARVEST_HOME/brokers/YOUR_BROKER/query-*.  Most likely, you don't
  want to make all the variables visible to users who want to query your
  broker.
  Edit query-* and use the hidden type to set suitable defaults for
  variables you want to hide.

  The result set presentation can be customized by choosing or modifying
  the configuration files located in the $HARVEST_HOME/cgi-bin/lib/
  directory. The configuration files Sample.cf, classic.cf, modern.cf
  and some LANGUAGE.cf are already installed there. You can either
  create a new configuration file or modify one of the existing
  configuration files to get the result set presentation you want. See
  the Harvest User's Manual for information about available options for
  the configuration file.

  If you want to customize the result presentation even further, edit
  $HARVEST_HOME/cgi-bin/search.cgi.

  5.11.  How do I localize/translate the user interface?

  To localize the user interface, do the following:

  1. Create
     src/broker/example/brokers/skeleton/query-glimpse-modern.html.xx.in,
     where xx is a two letter abbreviation for your language/country,
     by translating either query-glimpse-modern.html.in or another
     query-glimpse-modern.html.yy.in. This is the localized query page.

  2. Create components/broker/standard/WWW/language.cf by translating
     modern.cf or another translated configuration file like
     spanish.cf, german.cf, etc. This will localize the result pages
     and error messages.

  3. Create src/broker/example/brokers/skeleton/query-glimpse.html.xx.in
     by translating query-glimpse.html.in or query-glimpse.html.yy.in.
     This is the advanced query page.

  4. Translate src/broker/example/brokers/*.html to get localized
     additional help pages.

  5.12.  How can I replace the bundled Glimpse with another version of
  Glimpse?

  Edit $HARVEST_HOME/brokers/YOUR_BROKER/admin/broker.conf to let
  Harvest know the location of your glimpse, glimpseindex, and
  glimpseserver.

  6.  Terms

  6.1.  What is a Gatherer?

  A Gatherer is a system that retrieves documents from various sources
  (Web, News and FTP servers, local files) for processing. In an
  HTML/HTTP context, it is also often called a crawler, robot, or
  spider.

  6.2.  What is Local-Mapping?

  To reduce the CPU load and speed up gathering, Harvest can map local
  files to URLs. The Gatherer can bypass the server and read the local
  file, while pretending to the rest of the Harvest system that the
  objects were gathered as usual.

  6.3.  What is a Summarizer?

  A Summarizer transforms a document into a form which is more suitable
  for fulltext searching.

  The HTML summarizer, for example, extracts the title of a document,
  removes all HTML tags, generates a wordlist, etc.

  6.4.  What is a Broker?

  A Broker processes search requests received from a user via a CGI
  script and presents the search results.

  7.  Miscellaneous

  7.1.  Who are the maintainers of Harvest?

  Kang-Jin Lee lee@arco.de and Harald Weinreich harald@weinreichs.de are
  maintaining Harvest.

  7.2.  I have found a bug. What should I do?

  Post a bug report to the newsgroup comp.infosystems.harvest or mail it
  to Kang-Jin Lee lee@arco.de and Harald Weinreich harald@weinreichs.de.

  7.3.  Is there a mailing list for Harvest? What about a newsgroup?

  There is a Harvest developers' mailing list,
  http://lists.sourceforge.net/lists/listinfo/harvest-devel/, for
  Harvest users and developers. There is also a Harvest newsgroup,
  news:comp.infosystems.harvest.
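
  As an illustration of the summarizer contract described in question
  4.9 (filename as first argument, attribute lines of the form
  "Attributename{length of data}:<tab>your data" on stdout), here is a
  minimal sketch in Python for the hypothetical file type "Xyz". It is
  not taken from the Harvest distribution: the attribute names Title and
  Body, the helper names, and the choice to count the length in bytes of
  the UTF-8 encoding are assumptions. Check the Harvest User's Manual
  for the full SOIF syntax before relying on it.

```python
#!/usr/bin/env python3
"""Sketch of a summarizer for a hypothetical file type "Xyz" (Xyz.sum)."""
import sys


def soif_attribute(name: str, data: str) -> str:
    """Format one attribute as 'Name{length}:<tab>data'.

    Assumption: the length is the number of bytes in the UTF-8 encoding.
    """
    size = len(data.encode("utf-8"))
    return f"{name}{{{size}}}:\t{data}"


def summarize(text: str) -> list[str]:
    """Return attribute lines for a plain-text document.

    "Title" and "Body" are illustrative attribute names, not a fixed set.
    """
    lines = text.splitlines()
    title = lines[0].strip() if lines else ""
    return [soif_attribute("Title", title), soif_attribute("Body", text)]


def main(argv: list[str]) -> int:
    """Read the file named by the first argument and print its attributes."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} FILE", file=sys.stderr)
        return 1
    with open(argv[1], encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    sys.stdout.write("\n".join(summarize(text)) + "\n")
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

  Saved as Xyz.sum in $HARVEST_HOME/lib/gatherer/ and made executable,
  such a script would be called with the file name as its only argument.
  In many cases converting "Xyz" to an already supported file type, as
  suggested in question 4.9, is less work than writing a new summarizer.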
