""" HarvestManConfig.py - Module that holds the configuration options
    for the HarvestMan program and its related modules. This software is
    part of the HarvestMan(R) program.

    Author: Anand B Pillai (anandpillai at letterboxes dot org).

    For licensing information see the file LICENSE.txt that
    is included in this distribution.

    Jan 2 2004        Anand   1.3.1 bug fix version.
    Feb 12 2004       Anand   1.3.2 version
                              development started.
    May 28 2004       Anand   1.4 version development. Derived
                              HarvestManStateObject from dict type.
                              (Note that this limits the program to
                              Python 2.2 and later versions.)
    Jun 14 2004       Anand   1.3.9 release.

"""

PROG_HELP = """\

%(appname)s [options]

%(appname)s is a command-line offline browser. It can be used to download
websites/files from the internet and save them to the disk for offline browsing/
viewing.

This is Version %(version)s.

Authors: Anand B Pillai.
Copyright (C) 2003-2004 Anand B Pillai.

WWW:       http://harvestman.freezope.org

By default %(appname)s reads its options from a config file named 'config.txt'
in the current directory. If this file is not present, then %(appname)s reads
the options from the command line.

You can use the option '--configfile' to specify a different config filename
on the command line, or the option '--projectfile' to read an existing project
file.

Options:


1. Help Options:

    -h/--help\t\t\tShow this message
    -v/--version\t\tPrint version information and exit

2. Necessary Options:

    -U/--url\t\t\tStart downloading from this url.
    -P/--project\t\tName of this %(appname)s project.
    -B/--basedir\t\tThe base directory to save downloaded files to.

3. Basic Options:

    -C/--configfile\t\tUse this config filename ('config.txt').
    --PF/--projectfile\t\tLoad this %(appname)s project file.
    -V/--verbosity\t\tThe verbosity level (0-5, 2).
    -H/--html\t\t\tWhether to download html files (yes). *
    -I/--images\t\t\tWhether to download image links of a page (yes). *
    -S/--getcss\t\t\tDownload stylesheets of a page always (yes). *
    -i/--getimages\t\tDownload images of a page always (yes). *
    -F/--fastmode\t\tRun in fastmode (yes).

    --rn/--renamefiles\t\tWhether to try renaming dynamically generated files (no). *
    --fl/--fetchlevel\t\tThe fetchlevel setting (1).

    -r/--retryfailed\t\tNumber of retry attempts on failed links (1). *
    -l/--localise\t\tWhether to localise hyper text links after download (2). *
    -b/--browsepage\t\tTells the program to create a project browser index page (yes). *

4. Advanced Options:

    -j/--jitlocalise\t\tWhether to localise each page immediately after download (no).
    -t/--usethreads\t\tWhether to download non-html files in multiple threads (yes). *
    -s/--threadpoolsize\t\tSize of the thread pool (20).
    -o/--timeout\t\tTimeout value in seconds for a thread in the thread pool (600 sec).
    -po/--prjtimeout\t\tTimeout value in seconds for the project (600 sec).

    -E/--errorfile\t\tThe error file name ('errors.log').
    -L/--logfile\t\tThe name of the message log file ('harvestman.log').
    --UL/--urllistfile\t\tThe filename to dump the list of urls crawled.

    -p/--proxy\t\t\tSet this to the name/ip of the proxy server in your LAN/WAN, if you are behind one.
    -u/--puser\t\t\tSet this to the username for the proxy server, if the proxy needs authentication.
    -w/--ppasswd\t\tSet this to the password for the proxy server, if the proxy needs authentication.
    --pp/--pport\t\tSet this to the port number where the proxy accepts connections (80).
    --in/--intranet\t\tWhether the url is in intranet (no). *

    -n/--username\t\tLogin to the site with this username. *
    -d/--userpasswd\t\tLogin to the site with this password. *

    -c/--checkfiles\t\tAttempt to verify the integrity of files (yes).
    --hp/--htmlparser\t\tThe html parser to use for parsing html files (0). *

5. Download options:

    -k/--cookies\t\tSupport for cookies  (yes). *
    --nc/--connections\t\tEnable so many network connections at a given time (5). *
    --pc/--cachepages\t\tSupport for caching/update of webpages (yes). *
    --ep/--epagelinks\t\tDownload links on pages outside start urls directory (yes). *
    --es/--eserverlinks\t\tDownload links from external servers (no). *

    --d1/--depth\t\tDepth of fetching on the starting server (10).
    --d2/--extdepth\t\tDepth of fetching on external servers (no limit).

    --M1/--maxdirs\t\tLimit on external directories from which links are fetched (no limit).
    --M2/--maxservers\t\tLimit on external servers from which links are fetched (no limit).
    --M/--maxfiles\t\tMaximum number of files to download (3000).
    --MT/--maxthreads\t\tLimit on number of url trackers(threads) to run (10).

    --R/--rep\t\t\tWhether to support Robot Exclusion Protocol (yes). *
    --F1/--urlfilter\t\tThe regular expression for filtering urls.
    --F2/--serverfilter\t\tThe regular expression for filtering outside servers.

    --js/--javascript\t\tDownload javascript (.js) files (yes). *
    --ja/--java\t\t\tDownload java class (.class) files (yes). *

    NOTE: Options that need a [yes/no] argument are marked with an asterisk '*'.
        You can also use the arguments [on/off] or [1/0] in place of [yes/no].
        These are case-insensitive. The value inside the parentheses at the end
        shows the default value for each option, if any.


"""

import os, sys
import getopt
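
# Illustrative sketch only, not part of the original HarvestMan code shown
# in this listing: one way a few of the options documented in PROG_HELP
# could be parsed with getopt. The helper names (parse_yes_no,
# parse_basic_options) and the plain dictionary they return are assumptions
# made for this example; the program's actual option handling does not
# appear on this page.

_TRUE_WORDS = ('yes', 'on', '1')
_FALSE_WORDS = ('no', 'off', '0')

def parse_yes_no(value):
    """ Convert a [yes/no], [on/off] or [1/0] argument (any case)
    to 1 or 0, following the NOTE at the end of PROG_HELP """

    word = str(value).strip().lower()
    if word in _TRUE_WORDS:
        return 1
    if word in _FALSE_WORDS:
        return 0
    raise ValueError('expected yes/no, on/off or 1/0, got %r' % (value,))

def parse_basic_options(argv):
    """ Parse a handful of the documented options from an argument
    list (without the program name) into a plain dictionary """

    import getopt  # local import keeps this sketch self-contained

    opts, args = getopt.getopt(argv, 'U:P:B:H:V:',
                               ['url=', 'project=', 'basedir=',
                                'html=', 'verbosity='])
    result = {}
    for flag, value in opts:
        if flag in ('-U', '--url'):
            result['url'] = value
        elif flag in ('-P', '--project'):
            result['project'] = value
        elif flag in ('-B', '--basedir'):
            result['basedir'] = value
        elif flag in ('-H', '--html'):
            # Options marked with '*' take a yes/no style argument
            result['html'] = parse_yes_no(value)
        elif flag in ('-V', '--verbosity'):
            result['verbosity'] = int(value)
    return result

# Hypothetical usage of the sketch above:
#   parse_basic_options(['-U', 'http://www.example.com', '-H', 'no'])
#   returns {'url': 'http://www.example.com', 'html': 0}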

class HarvestManStateObject(dict):
    """ Internal config class for the program """

    def __init__(self):
        """ Initialize dictionary with the most common
        settings and their values """

        self.version='1.4'
        self.appname='HarvestMan'
        self.progname='HarvestMan 1.4'
        self.url=''
        self.project=''
        self.basedir=''
        self.configfile = 'config.txt' # new entry  (Jul 15 03)
        self.projectfile = ''          # new entry  (Jul 15 03)
        self.proxy=''
        self.puser=''
        self.ppasswd=''
        self.siteusername=''     # new entry (Aug 24 2003)
        self.sitepasswd=''       # new entry (Aug 24 2003)
        self.proxyport=0
        self.errorfile='errors.log'
        self.logfile='harvestman.log'
        self.urlslistfile= ''
        self.localise=2
        self.jitlocalise=0
        self.images=1
        self.depth=10
        self.html=1
        self.robots=1
        self.eserverlinks=0
        self.epagelinks=1
        self.fastmode=1
        self.usethreads=1
        self.maxfiles=3000
        self.maxextservers=0
        self.maxextdirs=0
        self.retryfailed=1
        self.extdepth=0
        self.intranet=0
        self.urlfilter=''
        self.wordfilter=''
        self.inclfilter=[]
        self.exclfilter=[]
        self.allfilters=[]
        self.serverfilter=''
        self.serverinclfilter=[]
        self.serverexclfilter=[]
        self.allserverfilters=[]
        self.urlpriority = ''
        self.serverpriority = ''
        self.urlprioritydict = {}
        self.serverprioritydict = {}
        self.maxtrackers=10
        self.verbosity=2
        self.timeout=200
        self.getimagelinks=1
        self.getstylesheets=1
        self.threadpoolsize=10
        self.renamefiles=0
        self.fetchlevel=0
        self.browsepage=1 # New variable added
        self.htmlparser=0
        self.checkfiles=1
        # Two variables added for 1.2 alpha release
        self.cookies=1
        self.pagecache=1
        self._error=''
        self.starttime=0
        self.endtime=0
        # For 1.2 final release, fetch javascript?
        self.javascript = True
        # Fetch java class files?
        self.javaapplet = True
        # For 1.2 final release, to control connections
        self.connections=5
        # Internal config variable
        self.reusethreads=True
        # Internal config variable
        self.cachefileformat='pickled' # Values => 'pickled' or 'xml'
        # Internal config variables for controlling project
        # 1. Testing the code (no browse page)
        self.testing = False
        # 2. Testing the browse page (no crawl)
        self.testnocrawl = False
        # 3. Ignore keyboard interrupts
        self.ignorekbinterrupt = False
        # For considering subdomains as different from
        # main domains
        self.subdomain = False
        # For skipping query forms
        self.skipqueryforms = True
        # For controlling requests/sec per server (Suggested by Sheila King)
        self.requests = 20
        # For controlling bytes/sec per server (Suggested by Sheila King)
        # This is in kbytes/sec
        self.bytes = 20.00
        # Config variable (Project timeout in seconds)
        self.projtimeout = 300 # five minutes
        # Download time, not a user variable
        self.downloadtime = 0.0
        # Tidy html?
        self.tidyhtml = True
        # self.tidyhtml = False

        # create the dictionary of mappings containing
        # config options to dictionary keys and their
        # types

        self.create_dictionary()

    def create_dictionary(self):
        """ Create the dictionary containing the mapping
        of config options to dictionary keys and their types """

        self.__options = {
                            'project.url' : ('url', 'str'),
                            'project.name' : ('project', 'str'),
                            'project.basedir' : ('basedir', 'str'),
                            'display.browsepage' : ('browsepage', 'int'),   # new variable addition (Jul 31 03)
                            'files.configfile' : ('configfile', 'str'),   # new variable addition (Jul 15 03)
                            'files.projectfile' : ('projectfile', 'str'), # new variable addition (Jul 15 03)
                            'network.proxyserver' : ('proxy', 'str'),
                            'network.proxyuser' : ('puser', 'str'),
                            'network.proxypasswd' : ('ppasswd', 'str'),
                            'network.proxyport' : ('proxyport', 'int'),     # new variable addition (Jul 13 03)
                            'url.username' : ('siteusername', 'str'), # new variable addition (Aug 24 03)
                            'url.password' : ('sitepasswd', 'str'),   # new variable addition (Aug 24 03)
                            'files.urlslistfile' : ('urlslistfile', 'str'),
                            'files.errorfile' : ('errorfile', 'str'),
                            'files.msgfile' : ('logfile', 'str'),
                            'network.intranet' : ('intranet', 'int'),
                            'download.checkfiles' : ('checkfiles', 'int'),
                            'parser.htmlparser' : ('htmlparser', 'int'),
                            'indexer.jitlocalise' : ('jitlocalise', 'int'),
                            'system.maxtrackers' : ('maxtrackers', 'int'),
                            'system.threadtimeout' : ('timeout', 'float'),
                            'project.verbosity' : ('verbosity', 'int'),
                            'system.usethreads' : ('usethreads', 'int'),
                            'download.rename' : ('renamefiles', 'int'),
                            'system.threadpoolsize' : ('threadpoolsize', 'int'),
