  Harvest FAQ
  Kang-Jin Lee lee@arco.de
  2003-11-08

  Harvest frequently asked questions (FAQ) with answers
  ______________________________________________________________________

  Table of Contents

  1. Harvest
     1.1 What is Harvest?
     1.2 Where can I get more information about Harvest?
     1.3 Where can I download Harvest?
     1.4 Are there any information about Harvest in Russian?
     1.5 What is Harvest-ng?
     1.6 What is the copyright status of Harvest?
     1.7 Which Operating System do I need to run Harvest?
     1.8 Does Harvest run under Windows NT/2000/XP?
     1.9 What Hardware do I need to use Harvest?
     1.10 Which version of Harvest should I use?
     1.11 What are "harvest-modified-by-RL-Stajsic", "harvest-MathNet", and "harvest-1.5.20-kj"?
     1.12 What are the limits of Harvest?
     1.13 Do I need root access to install and run Harvest?
     1.14 How do I block Harvest from my site? How do I identify Harvest?
     1.15 What can I do to help?

  2. Building Harvest
     2.1 How do I uninstall Harvest?
     2.2 Where can I get bison and flex?
     2.3 How can I install Harvest in "/my/directory/harvest" instead of "/usr/local/harvest"?
     2.4 How can I avoid "syntax error before `regoff_t'" error message when compiling Harvest?
     2.5 Where can I get more information for building Harvest on FreeBSD?

  3. Gatherer
     3.1 Does the Gatherer support cookies?
     3.2 Why doesn't Local-Mapping work?
     3.3 Does the Gatherer gather the Root- and LeafNode-URLs periodically?
     3.4 Can Harvest gather https URLs?
     3.5 When will Harvest be able to gather https URLs?
     3.6 Does Harvest support client based scripting/plugin like Javascript, Flash?
     3.7 Why does the gatherer stop after gathering few pages?
     3.8 How can I index local newsgroups? How can I put hostname into News URL?
     3.9 What do the gatherer options "Search=Breadth" and "Search=Depth" do and which keywords are available for "Search=" option?
     3.10 How can I index html pages generated by cgi scripts? How can I index URLs which has a "?" (question mark) in it?
     3.11 Why is the gatherer so slow? How can I make it faster?
     3.12 Why is the gatherer still so slow?
     3.13 How do I request "304 Not Modified" answers from HTTP servers?
     3.14 Why does Harvest gather different URLs between gatherings?
     3.15 Why has the Gatherer's database vanished after gathering?
     3.16 How can I avoid GDBM files growing very big during Gathering?
     3.17 Can I use Htdig as Gatherer? Can the Broker import data from Htdig?
     3.18 How can I control access to Gatherer's database?
     3.19 Does Harvest's Gatherer support WAP/WML, Gnutella, Napster?
     3.20 How do I gather ftp URLs from wu-ftp daemons?
     3.21 Why doesn't file URLs in LeafNodes work as expected?
     3.22 Why does gathering from a site fail completely or for parts of the site?

  4. Summarizer
     4.1 Why doesn't Post-Summarizing work?
     4.2 How can I summarize meta tags in HTML documents?
     4.3 Why are raw HTML tags in some query results?
     4.4 How can I summarize DVI files?
     4.5 How can I summarize Pdf files?
     4.6 Where can I get pdftotext?
     4.7 How can I improve summarizer for Microsoft Word files?
     4.8 Where can I get wvWare?
     4.9 How can I add support for new file type?
     4.10 How can I use nsgmls instead of sgmls to summarize documents?

  5. Broker
     5.1 How can I start a Broker at boot time?
     5.2 How can I start a Broker without starting a collection?
     5.3 Why don't the documents which I have gathered right now show up in the Broker?
     5.4 Why do I get error messages when I try to access "http://some.host/Harvest/brokers/your-broker-path/" after running $HARVEST_HOME/RunHarvest?
     5.5 Why are NEWS URLs broken? Where are the hostnames in NEWS URLs? How can I follow NEWS URLs?
     5.6 Why don't I get any results if I use a long or complex query string?
     5.7 Can I use wildcards in attribute value for structured queries?
     5.8 Are the attribute names case sensitive?
     5.9 Why doesn't collecting from broker work?
     5.10 How can I customize the Harvest user interface?
     5.11 How do I localize/translate user interface?
     5.12 How can I replace the bundled Glimpse with an other version of Glimpse?

  6. Terms
     6.1 What is a Gatherer?
     6.2 What is Local-Mapping?
     6.3 What is a Summarizer?
     6.4 What is a Broker?

  7. Miscellaneous
     7.1 Who are the maintainers of Harvest?
     7.2 I have found a bug. What should I do?
     7.3 Is there a mailinglist for Harvest? What about a newsgroup?

  ______________________________________________________________________

  1.  Harvest

  1.1.  What is Harvest?

  Harvest is a system for collecting information and making it
  searchable through a web interface. Harvest can collect information
  on the Internet and on intranets using http, ftp and nntp, as well as
  from local files such as data on hard disks, CD-ROMs and file servers.
  The current list of supported formats in addition to HTML includes
  TeX, DVI, PS, full text, mail, man pages, news, troff, WordPerfect,
  RTF, Microsoft Word/Excel, SGML, C sources and many more. Stubs for
  PDF support are included in Harvest and will use Xpdf or Acroread to
  process PDF files. Adding support for a new format is easy due to
  Harvest's modular design.

  1.2.  Where can I get more information about Harvest?

  See the Harvest homepage http://harvest.sourceforge.net/ for
  information about Harvest.

  1.3.  Where can I download Harvest?

  Harvest is available for download at the Harvest download page
  http://prdownloads.sourceforge.net/harvest/.

  1.4.  Are there any information about Harvest in Russian?

  Andrei Malashevich has translated the Harvest User's Manual into
  Russian.
  It is available at his Harvest User's Manual page at
  http://baby.chg.ru/manual_harvest/.

  1.5.  What is Harvest-ng?

  Harvest-ng is a reimplementation of Harvest's gatherer by Simon
  Wilkinson. You can get more information about Harvest-ng at the
  Harvest-ng homepage http://webharvest.sourceforge.net/ng/.

  1.6.  What is the copyright status of Harvest?

  The core of Harvest, located in the src directory, is under the GPL.
  Additional components, located in the components directory, are under
  the GPL or similar copyright terms.

  1.7.  Which Operating System do I need to run Harvest?

  Harvest should run on any *nix-like platform, including FreeBSD, Linux
  and Solaris.

  1.8.  Does Harvest run under Windows NT/2000/XP?

  Michael Schlenker has ported Harvest to Windows platforms using Cygwin
  http://sources.redhat.com/cygwin/.

  1.9.  What Hardware do I need to use Harvest?

  A 120 MHz Pentium with 64 MB RAM should achieve reasonable performance
  for around 350 MB of full-text data in about 20,000 objects. A 650 MHz
  Pentium with 256 MB RAM should be able to handle around 1.5 GB of
  full-text data in about 100,000 objects.

  1.10.  Which version of Harvest should I use?

  o  If you want to help develop Harvest, use the most recent version
     of Harvest.

  o  If you are cautious, a version more than a week old should be
     reasonably safe to use.

  o  If you don't want to use development versions of Harvest, use the
     last version marked as stable.

  1.11.  What are "harvest-modified-by-RL-Stajsic", "harvest-MathNet",
  and "harvest-1.5.20-kj"?

  After the original authors ceased working on Harvest, there were
  periods during which Harvest was unmaintained.
  During this time the following forked versions of Harvest appeared:

  o  "harvest-modified-by-RL-Stajsic" was released by R.L. Stajsic and
     Tim Samshuijzen with some bugfixes.

  o  "harvest-MathNet" is a modified version of Harvest-1.5.20 that
     improves the handling of German special characters ("Umlaute",
     "scharfes S").

  o  The "harvest-1.5.20-kj" series was released by me with bugfixes to
     Harvest 1.5.20.

  All these forked trees were merged into Harvest 1.6.

  1.12.  What are the limits of Harvest?

  o  Harvest's Gatherer uses a GDBM database to store the summarized
     data. On some architectures and operating systems, the maximum
     file size is 2 GB, so you can't have a database larger than 2 GB
     per Gatherer on those systems. To collect more data, you have to
     set up multiple Gatherers.

  o  The Broker stores the data as individual files. On most operating
     systems, performance degrades noticeably as the number of files in
     a directory increases. Since the Broker uses a finite number of
     directories, defined in src/broker/stor_man.c, to store the files,
     the Broker will slow down as the number of files increases.

  1.13.  Do I need root access to install and run Harvest?

  For the initial setup, you must be able to modify the webserver
  configuration and to schedule cron jobs. After the initial setup, it
  is recommended to run Harvest as a different user for security
  reasons.

  1.14.  How do I block Harvest from my site? How do I identify Harvest?

  Put lines like these in your robots.txt:

               User-agent: Harvest
               Disallow: /

  1.15.  What can I do to help?

  There are many ways to help, depending on your skills and the time you
  want to contribute to improving Harvest:

  o  Use Harvest and let others know that you are using Harvest.

  o  Use Harvest and let me know why you are using Harvest.
  o  Submit ideas, feature requests and bug reports.

  o  Contribute localization.

  o  Contribute documentation.

  o  Contribute code.

  2.  Building Harvest

  2.1.  How do I uninstall Harvest?

  Harvest keeps all of its files in /usr/local/harvest or whichever
  prefix you assigned during configure. To uninstall Harvest, simply
  delete the Harvest directory.

  If you did the following when installing Harvest:

               # ./configure --prefix=/home/data/harvest

  then this should uninstall Harvest:

               # rm -fr /home/data/harvest
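
  Because the uninstall step is a plain recursive delete, a small guard
  around it can help catch a mistyped prefix. The following is only a
  sketch, not part of Harvest: uninstall_harvest is a hypothetical
  helper name, and it assumes that RunHarvest sits directly under the
  installation prefix (as in $HARVEST_HOME/RunHarvest). Adjust the
  marker file if your installation is laid out differently.

```shell
#!/bin/sh
# Sketch only: remove a Harvest installation after two sanity checks.
# uninstall_harvest is a hypothetical helper, not shipped with Harvest.
# Assumption: RunHarvest lives directly under the installation prefix.

uninstall_harvest() {
    prefix="$1"

    # Refuse an empty path or the filesystem root, so a typo or an
    # unset variable cannot turn into "rm -fr /".
    case "$prefix" in
        ""|"/")
            echo "refusing to remove '$prefix'" >&2
            return 1
            ;;
    esac

    # Refuse directories that do not look like a Harvest prefix.
    if [ ! -e "$prefix/RunHarvest" ]; then
        echo "no RunHarvest under '$prefix'; not removing" >&2
        return 1
    fi

    # Same command as in the answer above, applied to the checked prefix.
    rm -fr -- "$prefix"
}
```

  With the function loaded into the shell, the example above would
  become "uninstall_harvest /home/data/harvest".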
