HARVESTMan Webcrawler - README
==============================

Introduction
============

HARVESTMan is an offline internet crawler (robot) program written in
Python. It helps you grab pages from the internet and store them in a
local directory for offline browsing.

This is version 1.3.9-1.

Author: Anand B Pillai.
Copyright: All software in this distribution is (C) Anand B Pillai.
License: See the file LICENSE.TXT.

Getting started
===============

Unzip the file to a directory of your choice. To install, run the
'setup.py' script, which installs the files to your Python installation
directory.

How it works
============

The program has two basic modes of working. In one, it reads its
settings from a configuration file, which is a plain text file
containing key value pairs separated by tabs/spaces. In the other, the
program reads its options from the command line. The configuration
filename by default is 'config.txt'.

There are two other modes of working, one based on HarvestMan project
files and the other where HarvestMan accepts a new configuration file
as a command line option. These are discussed below.

HarvestMan writes a project file before it starts crawling websites.
You can read this file back to restart the project later on. For this,
use the '--projectfile' option. This file has the extension '.hbp' and
is written in the base directory of the project.

HarvestMan can also accept another configuration filename from the
command line. For this, run the program with the '--configfile' option.

For more information, refer to the main documentation.

For information on the command line options, run the program with the
--help or -h option.

The Config file
===============

The config file provides the program with its settings. It contains
name value pairs separated by spaces/tabs. For example,

    URL http://www.yahoo.com
    BASEDIR d:/websites
    PROJECT myproject

You can add a comment by starting a line with the hash ('#')
character. A comment can also be added at the end of any configuration
line.
    # Url to download
    URL http://www.yahoo.com
    BASEDIR d:/websites # This is the base directory

From version 1.2, a new form of config file is the default. It uses a
dotted string notation that classifies config variables into sections.
The above config file in the new format would look like this:

    ;; url to download
    project.url http://www.yahoo.com
    project.basedir d:/websites

The new version of the config file separates config variables into
about 10 different sections, as described below.

    Section         Description
    -------         -----------
    1.  project     All project related variables
    2.  network     All network related variables, like proxy,
                    proxy username/password etc.
    3.  url         Any username/password for the url
    4.  download    All download related variables (html/image/
                    stylesheets/cookies etc)
    5.  control     All download control variables (filters/
                    maximum limits/timeouts/depths/robots.txt)
    6.  system      Any system related variable (fastmode/thread
                    status/thread timeouts/thread pool size etc)
    7.  indexer     All indexer related variables (localize etc)
    8.  files       All harvestman file settings (config/message
                    log/error log/url list file etc)
    9.  parser      Any parser related setting (fast/slow parser)
    10. display     Display (GUI/browser) related setting

From version 1.2 (all minor versions), this kind of config file is the
default generated by the config file generation script.

HarvestMan accepts about 40 different configuration options. For a
detailed discussion of the options, refer to the HarvestMan
documentation files in the 'doc' sub-directory.

A sample config file
--------------------

Part of a sample HarvestMan config file (version 2.0) is shown below.

    ;;HarvestMan Configuration File version 2.0
    ;;project related variables
    project.url www.python.org/doc/current/tut/tut.html
    project.name pytut
    project.basedir d:/websites
    project.verbosity 2

The actual config file is much larger than this and contains various
config sections. A script, genconfig.py, is provided to generate a
config file based on inputs from the user.
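
To make the file format concrete, here is a minimal sketch in Python of
how such a file could be read. It is not HarvestMan's own parser. The
function names parse_config and group_by_section are hypothetical, and
the sketch assumes that a trailing comment is a '#' preceded by
whitespace and that whole-line comments start with '#' or ';;'.

```python
import re

# The sample config lines shown in this README.
SAMPLE = """\
;;HarvestMan Configuration File version 2.0
;;project related variables
project.url www.python.org/doc/current/tut/tut.html
project.name pytut
project.basedir d:/websites
project.verbosity 2
"""


def parse_config(text):
    """Return a dict of name -> value (hypothetical helper)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and whole-line comments ('#' or ';;').
        if not line or line.startswith("#") or line.startswith(";;"):
            continue
        # Name and value are separated by spaces or tabs.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # a name without a value is ignored in this sketch
        name, value = parts
        # Assumption: a trailing comment is whitespace followed by '#'.
        value = re.split(r"\s+#", value, maxsplit=1)[0].strip()
        settings[name] = value
    return settings


def group_by_section(settings):
    """Group dotted names such as 'project.url' by their section."""
    sections = {}
    for name, value in settings.items():
        section, _, option = name.partition(".")
        if not option:
            # Old-style name without a section, e.g. 'URL'.
            section, option = "", name
        sections.setdefault(section, {})[option] = value
    return sections


config = group_by_section(parse_config(SAMPLE))
print(config["project"]["name"])  # prints: pytut
```

Old-style names such as URL or BASEDIR have no section prefix, so this
sketch files them under an empty section name.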

Running the program
-------------------

Create the config file by editing it by hand or by using the config
file generation script provided (genconfig.py). Set your PATH so that
it picks up the HarvestMan main program file. (If you ran the setup
script, this will be in the directory named HarvestMan in the
site-packages directory under your Python installation directory.)
Then, at the command prompt, run

    % harvestman.py                (On Windows)

On Linux/Unix, you might need to invoke python explicitly on the
command line:

    $ python harvestman.py         (Linux/Unix)

This starts the program using the settings in the config file.

Upon completion, the program creates an html file for browsing
projects and opens it in the user's default web browser. This file is
named 'index.html' and is created in the base directory of the program.
You can click on the project link to browse directly to the saved
files. New project information is automatically appended to this file.

Platforms
---------

HarvestMan requires Python 2.2 or later. It has been tested with
Python 2.2.1, 2.2.3 and 2.3.* on Windows NT/2000, Mandrake Linux 9.1,
Redhat Linux 9.1 and Fedora Core 1. HarvestMan should work on all
platforms where Python is supported.

Fast HTML Parser - Configuration
--------------------------------

From this version onwards, HarvestMan no longer packages sgmlop.
Support for it is still in the code, but users are discouraged from
using it: HarvestMan supports HTML Tidy from this version, which makes
the fast sgmlop parser almost unnecessary. The settings for tidy are
described below.

If you still want to use sgmlop, you can get the latest version from
the website http://www.effbot.org/downloads.

HTML Tidy - Configuration
-------------------------

From version 1.3.9 onwards (this version), HarvestMan supports cleaning
web pages before they are parsed. This avoids parse errors, which helps
to download more web pages.
HarvestMan uses uTidyLib, the Python wrapper for HTML Tidy. The website
for the project is http://utidylib.berlios.de/. This version of
HarvestMan comes with the latest version of uTidy in the package. The
tidy source code is located in the 'tidy' sub-directory inside the
'HarvestMan' directory, so HarvestMan works transparently with tidy.

The tidy option can be enabled or disabled using the config variable
'control.tidyhtml'. It is enabled by default in this version.

MORE DOCUMENTATION
==================

Please read the HarvestMan documentation in the 'doc' sub-directory for
more information.

CHANGES & FIXES
===============

See the file Changes.txt.

VERSION CHANGE LOG
==================

See the file ChangeLog.txt.

BUG REPORT/FEATURE SUGGESTIONS
==============================

Bug reports and feature suggestions can be submitted at the website
http://harvestman.freezope.org .

=======================================================================

Bangalore,
June 14 2004.