*==========================================================*
|            -Changes.txt file for HarvestMan-             |
|                                                          |
|           URL: http://harvestman.freezope.org            |
*==========================================================*
Version: 1.3.9-1 (minor bug fixes)
Release Date: June 24 2004
Changes in version 1.3.9-1 from 1.3.9
=====================================
1. Fixed a bug in the cache algorithm. The key 'checksum'
should not be checked if the cache is old.
2. Fixed a bug in connector.py. Check for valid
url object in line 622.
3. Fixed a bug in urlparser. Anchor type urls
should have the url file name as base url, not
original url filename.
4. Fixed a bug in url tracker. Anchor type urls
should not be skipped.
Version: 1.3.9 (features/bug fixes)
Release Date: June 14 2004
Changes in version 1.3.9 from 1.3.4
==================================
New Features
------------
1. Url priorities: Every url is assigned a priority according
to which it is downloaded. Urls with higher priority are downloaded
first. Priorities are determined by 3 factors.
a. The generation of the url
b. Whether the url is a webpage
c. User defined priorities
Urls in a lower generation are given higher priority than
urls in a higher generation. This makes sure that urls which
were created at the beginning of a project get downloaded first.
Webpage urls are given a higher priority than other urls.
Apart from this, the user can define priorities in the config file
in the range of (-5,5), based on file extensions.
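The three factors above can be sketched as a single score, assuming
(hypothetically) that a smaller number means an earlier download; the
names generation, is_webpage and user_priority are illustrative, not
HarvestMan's actual attributes:

```python
def url_priority(generation, is_webpage, user_priority=0):
    """Combine the three priority factors into one score;
    a smaller score means the url is downloaded earlier."""
    score = generation                       # lower generation wins
    if is_webpage:
        score -= 1                           # webpages jump ahead
    score -= max(-5, min(5, user_priority))  # clamp to the (-5,5) range
    return score
```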
2. Website priorities: These are like url priorities, but
can be specified by the user in the config file.
Sample usage:
control.serverpriority www.foo.com+3,www.bar.com-3
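A value in that form could be parsed along these lines; this is an
illustrative sketch, not HarvestMan's actual option parser:

```python
import re

def parse_server_priorities(value):
    """Parse a 'control.serverpriority' value such as
    'www.foo.com+3,www.bar.com-3' into a {server: priority} dict."""
    result = {}
    for entry in value.split(','):
        # server name followed by a signed integer priority
        m = re.match(r'^(.*?)([+-]\d+)$', entry.strip())
        if m:
            result[m.group(1)] = int(m.group(2))
    return result
```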
3. Thread groups for downloads: The download threads are now
pre-launched in a group similar to tracker threads. The download
jobs are submitted to the thread pool, which in turn delegates
them to the threads. The thread pool has been made into a
queue for this.
This reduces thread latency, since we no longer spawn
new threads during the life cycle of the program.
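The thread-group idea can be sketched with the standard library's queue
and threading modules; the names below are illustrative and this is a
minimal model of the design, not HarvestMan's own classes:

```python
import queue
import threading

def start_pool(num_threads, worker):
    """Pre-launch a fixed group of download threads that pull jobs
    from a shared queue, so no threads are spawned mid-run."""
    jobs = queue.Queue()

    def loop():
        while True:
            job = jobs.get()
            if job is None:        # sentinel: shut this thread down
                jobs.task_done()
                break
            worker(job)
            jobs.task_done()

    threads = [threading.Thread(target=loop, daemon=True)
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    return jobs, threads
```

Jobs are submitted with jobs.put(item), and jobs.join() blocks until
every submitted job has been processed.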
4. Allow urls with spaces: HarvestMan can now download urls which
contain spaces like 'http://www.foo.com/bar/this url.html'.
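Handling such urls amounts to percent-encoding the unsafe characters in
the path before fetching; a simplified sketch of the idea using the
standard library (not HarvestMan's own code):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url(url):
    """Percent-encode spaces (and other unsafe characters)
    in the path component of a url so it can be fetched."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))
```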
5. Changed the way directory-like and file-like urls are
distinguished. Earlier, when a url was parsed, a connection was made
to it, assuming it was directory-like. If the reply was an HTTP 404
error, it was correctly assumed to be a file-like url.
This has been changed in the new version: we now assume all urls are
file-like. For example, a url like http://www.foo.com/bar/file could
be either a directory (http://www.foo.com/bar/file/index.html) or a
file (http://www.foo.com/bar/file); we assume it is a file initially
and try to download it. The geturl() method of the file-like object
returned by opening the url tells whether it is file-like or
directory-like, and this information is used to modify the local
(disk) file name of the url at that point. This decouples the
urlparser and connector modules to a large extent and improves
performance with such urls.
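The geturl() check boils down to comparing the url that was requested
with the url the server actually served; a minimal sketch of that
comparison, with illustrative names:

```python
def is_directory_like(requested_url, final_url):
    """After a successful fetch, decide whether the url turned out
    to be directory-like: the server redirected 'http://host/x' to
    'http://host/x/' (or .../x/index.html), so final_url -- as
    reported by geturl() -- differs from the requested url."""
    if final_url == requested_url:
        return False
    return final_url.rstrip('/').startswith(requested_url)
```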
6. Added functionality to tidy html pages before parsing them,
using 'uTidy', the python port of HTML Tidy. This helps to crawl
sites that made previous versions of HarvestMan exit due to
parsing errors.
7. Intranet downloads no longer need a specific flag
(download.intranet) to be set. Instead, HarvestMan can figure out
whether the server is on the intranet by resolving its name, and
take appropriate action. This allows intranet/internet downloads
to be mixed in the same project.
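The detection idea can be sketched with the standard library: resolve
the server name and check whether the address falls in a private
range. This is a sketch of the concept, not HarvestMan's code:

```python
import ipaddress
import socket

def is_intranet(host):
    """Guess whether a server is on the intranet by resolving its
    name and checking for a private (RFC 1918 or loopback) address."""
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror:
        return False
    return ipaddress.ip_address(addr).is_private
```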
8. Modified the way url information is cached. The 'last-modified'
field in a url's headers is used if it is available. If it is not,
a checksum based on the content of the url (the previous algorithm)
is used as a fallback.
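The fallback logic can be sketched as follows; the dictionary keys
'last-modified' and 'checksum' mirror the cache fields described
above, but the function itself is illustrative:

```python
import hashlib

def is_url_fresh(cached, last_modified, content):
    """Return True if the cached copy of a url is still valid.
    Prefer the 'last-modified' header; fall back to a content
    checksum when the server does not send one."""
    if last_modified and cached.get('last-modified'):
        return cached['last-modified'] == last_modified
    checksum = hashlib.md5(content).hexdigest()
    return cached.get('checksum') == checksum
```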
Other Changes
=============
1. Regular expressions for filters are pre-compiled.
2. Derived HarvestManStateObject (config class) from 'dict' type.
3. Main thread 'joins' each tracker thread with zero timeout instead
of killing them at the end of project.
4. Optimization fix: Links are stored for localising, only if their
download is successful.
5. Assigned a 2:1 ratio of fetchers to crawlers instead of the
earlier 1:1 ratio.
6. Renamed all modules.
7. Used 'weakref' wherever possible to reduce extra references to
objects and avoid reference loops. This is mostly used in the
'GetObject' method and in the urlparser module.
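The weakref idea in one tiny example: a weak reference lets a registry
point at a shared object without keeping it alive, so no reference
loop forms. Names are illustrative, not HarvestMan's:

```python
import weakref

class Crawler:
    pass

crawler = Crawler()
ref = weakref.ref(crawler)   # registry holds a weak reference only
assert ref() is crawler      # dereference while the object lives
del crawler                  # drop the only strong reference
assert ref() is None         # the weak reference did not keep it alive
```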
Bug fixes
=========
1. Fixed a bug with http://localhost downloads. Bug ID # B1083256752.28 .
2. Fixed bug in url filter for images.
3. Fixed a bug with timezone printing. Bug ID # B1083253695.02.
4. Close file like object returned by opening urls
after reading data.
5. Fixed a bug in localising links. Directory like urls
need to be skipped.
6. Fixed a bug in finding the common domain for servers that
have fewer than three 'dots' in their name string. (This is
the same bug as # B1083256752.28.)
7. Fixed a bug in setting up network for clients behind a proxy/
firewall.
Version: 1.3.3 (bug fixes)
Release Date: Feb 24 2004
Changes in Version 1.3.3 from 1.3.2
===================================
1. Fixed bug with parsing of FTP links. Bug # B1077613467.85.
2. Fixed another bug with external server links.
3. Fixed bug with request control. Request dictionary
key is server name, not ip.
Version: 1.3.2 (minor feature enhancements)
Release Date: Feb 13 2004
Changes in Version 1.3.2 from 1.3.1
===================================
There is one minor feature in this release.
1. This release adds the ability to limit downloads by
controlling the number of simultaneous requests to the
same server. This option is controlled by the config
variable named 'control.requests'.
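A per-server request cap like this is commonly built on a semaphore
per server; a minimal sketch of the idea behind 'control.requests',
not HarvestMan's implementation:

```python
import threading
from collections import defaultdict

class RequestLimiter:
    """Cap the number of simultaneous requests to any one server."""

    def __init__(self, max_requests):
        self._semaphores = defaultdict(
            lambda: threading.BoundedSemaphore(max_requests))
        self._lock = threading.Lock()

    def acquire(self, server):
        """Try to start a request; False if the server is saturated."""
        with self._lock:               # defaultdict is not thread-safe
            sem = self._semaphores[server]
        return sem.acquire(blocking=False)

    def release(self, server):
        """Mark one request to 'server' as finished."""
        self._semaphores[server].release()
```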
2. Apart from that I have re-structured the package,
and added a distutils setup.py script which copies the
package to your PYTHON installation folder.
Version: 1.3.1 (bug fix)
Release Date: Feb 10 2004
Changes in Version 1.3.1 from 1.3
=================================
This is a bug-fix release that fixes most of the
critical and annoying HarvestMan bugs.
These bugs can be located in the bugs database
at http://harvestman.freezope.org/Discussons .
1. Fixed bug with query forms. The program no longer
tries to download server side query form links.
Bug #B1073291938.97.
2. Fixed bug with handling frame redirects. Bug #B1076402199.0.
3. Fixed bug with robots.txt url. Bug #B1072436188.35.
4. Fixed bug in finding out external server links.
Bug #B1076402348.52.
5. Fixed bug in external links with respect to subdomains.
Bug #B1076409910.45.
6. Fixed a bug with following non-existent links in a
directory listing. Bug #B1073028403.71.
7. Fixed problem in printing harvestman url in welcome
message.
8. Fixed some problems in config file parsing.
9. Fixed problem with printing version string (-v and
--version options).
10. Other miscellaneous fixes and corrections thanks to
Vivian, Sascha and some others.
Version: 1.3 (final)
Release Date: Dec 15 2003
Changes in Version 1.3 (from 1.3 a1)
=========================================
1. This version adds one feature, that of searching
a webpage for keywords. You can create complex
boolean regular expressions and supply them to
HarvestMan. HarvestMan will parse the regular
expressions and download only those web pages that
match the regular expression.
In simpler words, this means a keyword(s) search. :-)
For example, you need to download only those webpages
that contain the terms 'Saddam' and 'WMD'. You create
the following regular expression and pass it on to
HarvestMan as the option 'control.wordfilter'.
;; config file for harvestman
control.wordfilter (Saddam & WMD)
You use the boolean '&' and '|' to create the regular
expressions.
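A much-simplified sketch of how such a filter could be evaluated
against a page's text; it handles a single '&' or '|' group, whereas
the real recipe parses arbitrarily nested boolean expressions:

```python
import re

def match_wordfilter(expression, text):
    """Evaluate a flat boolean word filter such as '(Saddam & WMD)':
    '&' requires all words to occur in the text, '|' requires any."""
    expr = expression.strip()
    if expr.startswith('(') and expr.endswith(')'):
        expr = expr[1:-1]
    if '&' in expr:
        words = [w.strip() for w in expr.split('&')]
        return all(re.search(re.escape(w), text) for w in words)
    if '|' in expr:
        words = [w.strip() for w in expr.split('|')]
        return any(re.search(re.escape(w), text) for w in words)
    return re.search(re.escape(expr), text) is not None
```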
I have added this as a recipe in the ASPN Python Cookbook.
For more information on how it works, point to the URL,
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252526.
Changes in Version 1.3 a1 (from 1.2 final)
=========================================
1. This version features the new threading model which was
started in the last release. This model is now completely
written to prevent thread deadlocking incidents. A description
of the model can be found in the HarvestMan webpage at
http://harvestman.freezope.org.
This model will be developed further and will be the default
for all future releases of HarvestMan.
2. The other major changes are complete re-writes of many modules.
Classes have been renamed wherever suitable and some function
names changed. The HarvestMan module has been trimmed up
considerably.
3. This version has added an extra module HarvestManUtils which has
some utility classes for reading/writing project & cache files and for
creating the browse page. The code for these were earlier in the
HarvestMan, HarvestManDataManager and HarvestManConfig modules.
4. The cache and project file information is compressed before writing
to files.
Changes in Version 1.2 final (from 1.2 rc2)
===========================================
1. Added support for javascript and java applet tag parsing.
HarvestMan can now fetch javascript (.js) files and
java applets (.class) files from webpages.
The code for parsing this sits in the new HTMLParser
customized for HarvestMan.
2. Split url trackers into two flavors - Fetchers and Getters.
Fetchers are responsible for crawling webpages and fetching links,
and Getters get the non-html files fetched by Fetchers. Images
are still fetched by the Fetchers in their threads.
This should help in the growth of this program and make future
development easier. Also this might help in preventing the thread
locking incidents.
3. Fixed bugs in localizing anchor type links. Rewrote HarvestManPageParser,
HarvestManUrlPathParser and HarvestManDataManager classes to take care
of this. Anchor links in webpages are localized correctly now.
4. Due to javascript/javaapplet parsing code in the new html parser,
many webpages which failed to work before (due to mostly javascript
tags which the parser could not understand) will work correctly now.
5. Other routine bug fixes.
a) Fixed a problem in creating the project browse page.
We need to provide the absolute path of the project start url file.
b) Fixed a problem in getRelativeFilename() in HarvestManUrlPathParser
class.
c) A few more...
Changes in Version 1.2 rc2 (from 1.2 rc1)
=========================================
Release Date: Sep 27 2003
1. Rewrote the algorithm for fetching urls with no filename
extension. We assume that the url is directory-like
(of the form dir/index.html) and try to fetch it at
url path resolving time (in the urlPathParser class).
If this fails, a 404 error is returned. The url is cached
for later lookup in the datamanager, in an invalid-urls cache,
and we re-resolve the url, assuming it now to be a file-like url
(of the form /file), and fetch it.
If it does not fail, the url is cached for later lookup
in the datamanager, in a valid-urls cache. The connector object
is also cached in a connector dictionary of the datamanager so
that we don't need to re-create the connection later.
This fixes the long-standing bug with urls that have no filename
extension.
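The try-directory-then-file flow above can be sketched as follows;
'fetch' stands in for the real connector and is a hypothetical
callable returning an HTTP status code:

```python
def resolve_url(url, fetch):
    """Resolve a url with no filename extension: first try it as a
    directory (dir/index.html); on a 404, fall back to treating it
    as a plain file."""
    directory_form = url.rstrip('/') + '/index.html'
    if fetch(directory_form) == 200:
        return directory_form       # directory-like: keep index.html
    return url                      # file-like: use the url as-is
```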
2. Rewrote the algorithm for localising links. Instead of re-parsing
html files and localising the links, a dictionary of html files
and their links is kept in the datamanager object. This dictionary
is used to localise the links once the downloads are complete.