HARVESTMan Webcrawler - README
==============================

Introduction
============

HARVESTMan is an offline internet crawler (robot) program written in
Python. It helps you grab pages from the internet and store them in a
local directory for offline browsing.

This is version 1.3.9-1.

Author: Anand B Pillai.

Copyright: All software in this distribution is (C) Anand B Pillai.

License: See the file LICENSE.TXT.

Getting started
===============

Unzip the file to a directory of your choice. To install, run the
'setup.py' script, which installs the files to your Python installation
directory.

How it works
============

The program has two modes of working. In one, it reads its settings
from a configuration file, which is a plain text file containing key
value pairs separated by tabs/spaces. In the other mode, the program
reads its options from the command line. The configuration filename is
'config.txt' by default.

There are two other modes of working, one based on HarvestMan project
files and the other where HarvestMan accepts a new configuration file
as a command line option. These are discussed below.

HarvestMan writes a project file before it starts crawling websites.
You can read this file back to restart the project later on. For this,
use the '--projectfile' option. This file has the extension '.hbp' and
is written in the base directory of the project.

HarvestMan can also accept another configuration filename from the
command line. For this, run the program with the '--configfile' option.

For more information, refer to the main documentation.

For information on the command line options, run the program with the
--help or -h option.

The Config file
===============

    The config file provides the program with its settings. It
    contains name value pairs separated by spaces/tabs. For example,

    URL http://www.yahoo.com
    BASEDIR d:/websites
    PROJECT myproject

    You can add comments by starting a line with the hash ('#')
    character.
    A comment can also be added at the end of any configuration line.

    # Url to download
    URL http://www.yahoo.com
    BASEDIR d:/websites     # This is the base directory

    From version 1.2, a new form of config file is the default. It uses
    a dotted string notation, classifying config variables into sections.
    The above config file in the new format would look like this:

    ;; url to download
    project.url     http://www.yahoo.com
    project.basedir d:/websites

    The new version of the config file separates config variables into
    about 10 different sections, as described below.

    Section         Description
    1.  project     All project related variables
    2.  network     All network related variables, like proxy,
                    proxy username/password etc.
    3.  url         Any username/password for the url
    4.  download    All download related variables (html/image/
                    stylesheets/cookies etc.)
    5.  control     All download control variables (filters/
                    maximum limits/timeouts/depths/robots.txt)
    6.  system      Any system related variable (fastmode/thread status/
                    thread timeouts/thread pool size etc.)
    7.  indexer     All indexer related variables (localize etc.)
    8.  files       All HarvestMan file settings (config/message log/
                    error log/url list file etc.)
    9.  parser      Any parser related setting (fast/slow parser)
    10. display     Display (GUI/browser) related setting

    From version 1.2 (all minor versions), this kind of config file is
    the default generated by the config file generation script.

    HarvestMan accepts about 40 different configuration options.
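
    To make the two config formats above concrete, here is a minimal
    sketch of a reader for them. It is not HarvestMan's own parser: the
    function name parse_config is hypothetical, it is written for
    current Python rather than Python 2.2, and it assumes that '#' and
    ';;' both start whole-line comments and that a trailing comment
    begins at a '#' preceded by whitespace.

```python
def parse_config(text):
    """Parse 'name value' lines into {section: {option: value}}.

    Illustrative only; HarvestMan's real parser may behave differently.
    """
    sections = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and whole-line comments ('#' or ';;').
        if not line or line.startswith("#") or line.startswith(";;"):
            continue
        # Name and value are separated by the first run of spaces/tabs.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # assumption: ignore a name that has no value
        name, value = parts
        # Assumption: a trailing comment starts at a '#' that follows
        # whitespace, so a '#' inside a URL fragment is left alone.
        for marker in (" #", "\t#"):
            idx = value.find(marker)
            if idx != -1:
                value = value[:idx]
        value = value.strip()
        # 'project.url' -> section 'project', option 'url'.
        # Old-format names such as 'URL' land in section '' (our choice).
        section, _, option = name.rpartition(".")
        sections.setdefault(section, {})[option] = value
    return sections


sample = """\
;; url to download
project.url     http://www.yahoo.com
project.basedir d:/websites     # This is the base directory
"""
config = parse_config(sample)
print(config["project"]["url"])      # http://www.yahoo.com
print(config["project"]["basedir"])  # d:/websites
```

    Grouping by the text before the last dot mirrors the section table
    above; whether HarvestMan allows trailing '#' comments in the
    dotted format is an assumption of this sketch.
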
    For a detailed discussion of the options, refer to the HarvestMan
    documentation files in the 'doc' sub-directory.

    A sample config file
    --------------------

    A part of a sample HarvestMan config file (version 2.0) is shown
    below.

    ;;HarvestMan Configuration File version 2.0
    ;;project related variables
    project.url                        www.python.org/doc/current/tut/tut.html
    project.name                       pytut
    project.basedir                    d:/websites
    project.verbosity                  2

    The actual config file is much larger than this and contains various
    config sections.

    A script, genconfig.py, is provided to generate a config file
    based on inputs from the user.

    Running the program
    -------------------

    Create the config file by editing it by hand or by using the config
    file generation script provided (genconfig.py).

    Set your PATH so that it picks up the HarvestMan main program file.
    (This will be in the directory named HarvestMan in your site-packages
    directory under your Python installation directory, if you ran the
    setup script.)

    Then, at the command prompt,

    % harvestman.py  (On Windows)

    On Linux/Unix, you might need to invoke python explicitly on the
    command line,

    $ python harvestman.py (Linux/Unix)

    This will start the program using the settings in the
    config file.

    Upon completion, the program creates an html file for browsing
    projects and opens it in the user's default web browser. This
    file is named 'index.html' and is created in the base directory
    of the program. You can click on the project link to browse directly
    to the saved files. New project information is automatically appended
    to this file.

    Platforms
    ---------

    HarvestMan requires Python 2.2 or later. It has been tested with
    Python 2.2.1, 2.2.3 and 2.3.* versions on Windows NT/2000,
    Mandrake Linux 9.1, Redhat Linux 9.1 and Fedora Core 1.
    HarvestMan should work on all platforms where Python is supported.

Fast HTML Parser - Configuration
--------------------------------

HarvestMan no longer packages sgmlop from this version onwards.
Support for it remains in the code, but users are discouraged from
using it, since HarvestMan supports HTML Tidy from this version, which
makes the fast sgmlop parser almost unnecessary. The settings for tidy
are described below.

If you still want to use sgmlop, you can get the latest version from
the website, http://www.effbot.org/downloads.

HTML Tidy - Configuration
-------------------------

From version 1.3.9 onwards (this version), HarvestMan supports cleaning
web pages before they are parsed. This avoids parse errors, which helps
to download more web pages. HarvestMan uses the Python wrapper of HTML
Tidy called uTidyLib. The website for the project is
http://utidylib.berlios.de/.

This version of HarvestMan comes with the latest version of uTidy in
the package. The Tidy source code is located in the sub-directory 'tidy'
inside the 'HarvestMan' directory. Hence HarvestMan will work
transparently with tidy.

The tidy option can be enabled or disabled by using the config variable
'control.tidyhtml'. This is enabled by default in this version.

MORE DOCUMENTATION
==================

Please read the HarvestMan documentation in the 'doc' sub-directory for
more information.

CHANGES & FIXES
===============

See the file Changes.txt.

VERSION CHANGE LOG
==================

See the file ChangeLog.txt.

BUG REPORT/FEATURE SUGGESTIONS
==============================

Bug reports and feature suggestions can be submitted on the website
http://harvestman.freezope.org .

=======================================================================
Bangalore,
June 14 2004.