*==========================================================*
|             -ChangeLog file for HarvestMan-              |
|                                                          |
| URL: http://harvestman.freezope.org/files/Changelog.txt  |
*==========================================================*
Version: 1.3.9-1 (minor bug fixes)
Release Date: June 24 2004
Changes in version 1.3.9-1 from 1.3.9
=====================================
1. Fixed a bug in the cache algorithm. The key 'checksum'
should not be checked if the cache entry is an old one.
2. Fixed a bug in connector.py. Added a check for a valid
url object at line 622.
3. Fixed a bug in urlparser. Anchor type urls should use
the url's file name as the base url, not the original
url's filename.
4. Fixed a bug in the url tracker. Anchor type urls
should not be skipped.
Version: 1.3.9 (features/bug fixes)
Release Date: June 14 2004
Changes in version 1.3.9 from 1.3.4
===================================
New Features
------------
1. Url priorities: Every url is assigned a priority, which
determines the order in which it is downloaded. Urls with higher
priority are downloaded first. Priorities are determined by 3 factors:
a. The generation of the url
b. Whether the url is a webpage
c. User-defined priorities
Urls in a lower generation are given higher priority than urls
in a higher generation. This makes sure that urls which were
created at the beginning of a project get downloaded first.
Webpage urls are given a higher priority than other urls.
Apart from this, the user can define priorities in the config file
in the range of (-5,5) based on file extensions.
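A minimal sketch of how these three factors could combine into a
single score (the function and names here are illustrative, not
HarvestMan's actual API):

    # Lower score => downloaded earlier from the priority queue.
    def url_priority(generation, is_webpage, ext_priority=0):
        score = generation        # earlier generation => smaller score
        if is_webpage:
            score -= 1            # webpages jump ahead of other urls
        score -= ext_priority     # user value in the range (-5,5)
        return score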
2. Website priorities: These are like url priorities, but are
specified per server by the user in the config file.
Sample usage:
control.serverpriority www.foo.com+3,www.bar.com-3
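A short sketch of how such a value could be parsed (this parsing
code is illustrative, not the actual config reader):

    import re

    def parse_server_priorities(value):
        # 'www.foo.com+3,www.bar.com-3' -> {'www.foo.com': 3, 'www.bar.com': -3}
        priorities = {}
        for token in value.split(','):
            m = re.match(r'(.+?)([+-]\d+)$', token.strip())
            if m:
                priorities[m.group(1)] = int(m.group(2))
        return priorities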
3. Thread groups for downloads: The download threads are now
pre-launched in a group, similar to tracker threads. Download
jobs are submitted to the thread pool, which in turn delegates
them to the threads. The thread pool is implemented as a queue
for this purpose. This reduces thread latency, since we no
longer spawn new threads during the life cycle of the program.
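A minimal sketch of the idea, using a queue feeding pre-launched
threads (structure only; the actual thread pool classes differ):

    import threading, queue

    def download(url):
        print('fetching', url)    # stand-in for the real download job

    job_queue = queue.Queue()

    def worker():
        while True:
            url = job_queue.get()
            if url is None:       # sentinel tells the thread to exit
                break
            download(url)
            job_queue.task_done()

    # Threads are launched once, up front; jobs are only queued,
    # so no new thread is spawned during the crawl.
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()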
4. Allow urls with spaces: HarvestMan can now download urls which
contain spaces, like 'http://www.foo.com/bar/this url.html'.
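The idea is to percent-encode the unsafe characters before making
the request; a one-line sketch with the standard library:

    from urllib.parse import quote

    url = 'http://www.foo.com/bar/this url.html'
    print(quote(url, safe=':/'))
    # http://www.foo.com/bar/this%20url.html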
5. Changed the way to distinguish between directory like and file
like urls. Earlier, when we parsed a url, a connection was made to
it assuming it was directory like. If the reply was an HTTP 404
error, it was (correctly) assumed to be a file like url instead.
This has been changed in the new version. We now assume all urls
are file like. For example, a url like http://www.foo.com/bar/file
could resolve to a directory (http://www.foo.com/bar/file/index.html)
or to a file (http://www.foo.com/bar/file); we assume it is a file
initially and try to download it. The geturl() method of the
file-like object returned by opening the url tells us whether it is
file like or directory like. This information is used to modify the
local (disk) file name of the url at that point. This decouples the
urlparser and connector modules to a large extent and improves
performance with such urls.
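A sketch of the check (the url is hypothetical, and the real code
also renames the planned disk file at this point):

    from urllib.request import urlopen

    url = 'http://www.foo.com/bar/file'
    resp = urlopen(url)
    # If the server redirected to 'http://www.foo.com/bar/file/',
    # geturl() exposes that, so the url was directory like.
    is_directory = resp.geturl().endswith('/')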
6. Added functionality to tidy html pages before parsing them, by
using 'uTidy', the python port of html tidy. This helps to crawl
sites whose pages made previous versions of HarvestMan exit with
parsing errors.
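A sketch of the tidying step, assuming uTidy's parseString
interface (option names follow html tidy's configuration keys):

    import tidy

    raw = '<html><body><p>unclosed paragraph'
    doc = tidy.parseString(raw, output_xhtml=1, tidy_mark=0)
    clean_html = str(doc)   # well-formed markup, safe to parse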
7. Intranet downloads no longer need a specific flag
(download.intranet). Instead, HarvestMan can figure out whether the
server is on an intranet by resolving its name, and takes the
appropriate action. This allows intranet and internet downloads to
be mixed in the same project.
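A sketch of such a test using only the standard library (the real
resolution logic may differ):

    import socket, ipaddress

    def is_intranet(hostname):
        try:
            addr = socket.gethostbyname(hostname)
        except socket.gaierror:
            return False     # unresolvable; treat as external
        return ipaddress.ip_address(addr).is_private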
8. Modified the way url information is cached. The 'last-modified'
field in the url's headers is used, if available. If it is not
there, a checksum based on the content of the url (the previous
algorithm) is used as a fallback.
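A sketch of that test (the field names are illustrative; 'content'
is the raw bytes of the downloaded url):

    from hashlib import md5

    def is_url_unchanged(cache_entry, headers, content):
        last_modified = headers.get('last-modified')
        if last_modified:
            return cache_entry.get('last-modified') == last_modified
        # Fallback: the pre-1.3.9 checksum comparison.
        return cache_entry.get('checksum') == md5(content).hexdigest()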
Other Changes
-------------
1. Regular expressions for filters are pre-compiled (see the
sketch after this list).
2. Derived HarvestManStateObject (config class) from 'dict' type.
3. The main thread 'joins' each tracker thread with a zero timeout
instead of killing them at the end of a project.
4. Optimization fix: Links are stored for localising only if their
download is successful.
5. Assigned a 2:1 ratio of fetchers to crawlers instead of the
earlier 1:1 ratio.
6. Renamed all modules.
7. Used 'weakref' wherever possible to reduce extra references to
objects and avoid reference loops. This is mostly used in the
'GetObject' method and in the urlparser module.
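A sketch of the pre-compilation mentioned in item 1 (the patterns
here are illustrative only):

    import re

    # Compile the filter patterns once at startup...
    FILTERS = [re.compile(p) for p in (r'\.(jpg|gif)$', r'/cgi-bin/')]

    # ...then reuse the compiled objects on every url.
    def is_filtered(url):
        return any(f.search(url) for f in FILTERS)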
Bug fixes
---------
1. Fixed a bug with http://localhost downloads. Bug ID # B1083256752.28.
2. Fixed bug in url filter for images.
3. Fixed a bug with timezone printing. Bug ID # B1083253695.02.
4. Closed the file like object returned by opening urls
after reading the data.
5. Fixed a bug in localising links. Directory like urls
need to be skipped.
6. Fixed a bug in finding the common domain for servers that
have fewer than three 'dots' in their name string. (This is
the same bug as # B1083256752.28.)
7. Fixed a bug in setting up the network for clients behind a
proxy/firewall.