*==========================================================*
|            -Changes.txt file for HarvestMan-             |
|                                                          |
|           URL: http://harvestman.freezope.org            |
*==========================================================*
Version: 1.3.9-1 (minor bug fixes)
Release Date: June 24 2004
Changes in version 1.3.9-1 from 1.3.9
=====================================
1. Fixed a bug in the cache algorithm. The key 'checksum'
should not be checked if the cache is old.
2. Fixed a bug in connector.py. Check for valid
url object in line 622.
3. Fixed a bug in urlparser. Anchor type urls
should have the url file name as base url, not
original url filename.
4. Fixed a bug in url tracker. Anchor type urls
should not be skipped.
Version: 1.3.9 (features/bug fixes)
Release Date: June 14 2004
Changes in version 1.3.9 from 1.3.4
==================================
New Features
------------
1. Url priorities: Every url is assigned a priority according
to which it is downloaded. Urls with higher priority are downloaded
first. Priorities are determined by 3 factors.
a. The generation of the url
b. Whether the url is a webpage
c. User defined priorities
Urls in a lower generation are given higher priority than
urls in a higher generation. This makes sure that urls which
were created at the beginning of a project get downloaded first.
Webpage urls are given a higher priority than other urls.
Apart from this, the user can define priorities in the config file
in the range of (-5,5), based on file extensions.
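The three factors above can be sketched as a single score, assuming
(hypothetically) that a smaller number means an earlier download; the
names generation, is_webpage and user_priority are illustrative, not
HarvestMan's actual attributes:

```python
def url_priority(generation, is_webpage, user_priority=0):
    """Combine the three priority factors into one score;
    a smaller score means the url is downloaded earlier."""
    score = generation                       # lower generation wins
    if is_webpage:
        score -= 1                           # webpages jump ahead
    score -= max(-5, min(5, user_priority))  # clamp to the (-5,5) range
    return score
```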
2. Website priorities: These are like url priorities, but
can be specified by the user in the config file.
Sample usage:
control.serverpriority www.foo.com+3,www.bar.com-3
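A value in that form could be parsed along these lines; this is an
illustrative sketch, not HarvestMan's actual option parser:

```python
import re

def parse_server_priorities(value):
    """Parse a 'control.serverpriority' value such as
    'www.foo.com+3,www.bar.com-3' into a {server: priority} dict."""
    result = {}
    for entry in value.split(','):
        # server name followed by a signed integer priority
        m = re.match(r'^(.*?)([+-]\d+)$', entry.strip())
        if m:
            result[m.group(1)] = int(m.group(2))
    return result
```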
3. Thread groups for downloads: The download threads are now
pre-launched in a group similar to tracker threads. The download
jobs are submitted to the thread pool, which in turn delegates
them to the threads. The thread pool has been made into a
queue for this.
This reduces thread latency, since we no longer spawn
new threads during the life cycle of the program.
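The thread-group idea can be sketched with the standard library's queue
and threading modules; the names below are illustrative and this is a
minimal model of the design, not HarvestMan's own classes:

```python
import queue
import threading

def start_pool(num_threads, worker):
    """Pre-launch a fixed group of download threads that pull jobs
    from a shared queue, so no threads are spawned mid-run."""
    jobs = queue.Queue()

    def loop():
        while True:
            job = jobs.get()
            if job is None:        # sentinel: shut this thread down
                jobs.task_done()
                break
            worker(job)
            jobs.task_done()

    threads = [threading.Thread(target=loop, daemon=True)
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    return jobs, threads
```

Jobs are submitted with jobs.put(item), and jobs.join() blocks until
every submitted job has been processed.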
4. Allow urls with spaces: HarvestMan can now download urls which
contain spaces like 'http://www.foo.com/bar/this url.html'.
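Handling such urls amounts to percent-encoding the unsafe characters in
the path before fetching; a simplified sketch of the idea using the
standard library (not HarvestMan's own code):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url(url):
    """Percent-encode spaces (and other unsafe characters)
    in the path component of a url so it can be fetched."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))
```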
5. Changed the way directory-like and file-like urls are
distinguished. Earlier, when a url was parsed, a connection was made
to it, assuming it was directory-like. If the reply was an HTTP 404
error, it was correctly assumed to be a file-like url.
This has been changed in the new version: we now assume all urls are
file-like. For example, a url like http://www.foo.com/bar/file could
be either a directory (http://www.foo.com/bar/file/index.html) or a
file (http://www.foo.com/bar/file); we assume it is a file initially
and try to download it. The geturl() method of the file-like object
returned by opening the url tells whether it is file-like or
directory-like, and this information is used to modify the local
(disk) file name of the url at that point. This decouples the
urlparser and connector modules to a large extent and improves
performance with such urls.
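The geturl() check boils down to comparing the url that was requested
with the url the server actually served; a minimal sketch of that
comparison, with illustrative names:

```python
def is_directory_like(requested_url, final_url):
    """After a successful fetch, decide whether the url turned out
    to be directory-like: the server redirected 'http://host/x' to
    'http://host/x/' (or .../x/index.html), so final_url -- as
    reported by geturl() -- differs from the requested url."""
    if final_url == requested_url:
        return False
    return final_url.rstrip('/').startswith(requested_url)
```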
6. Added functionality to tidy html pages before parsing them,
using 'uTidy', the python port of HTML Tidy. This helps to crawl
sites that made previous versions of HarvestMan exit due to
parsing errors.
7. Intranet downloads no longer need a specific flag
(download.intranet) to be set. Instead, HarvestMan can figure out
whether the server is on the intranet by resolving its name, and
take appropriate action. This allows intranet/internet downloads
to be mixed in the same project.
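The detection idea can be sketched with the standard library: resolve
the server name and check whether the address falls in a private
range. This is a sketch of the concept, not HarvestMan's code:

```python
import ipaddress
import socket

def is_intranet(host):
    """Guess whether a server is on the intranet by resolving its
    name and checking for a private (RFC 1918 or loopback) address."""
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror:
        return False
    return ipaddress.ip_address(addr).is_private
```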
8. Modified the way url information is cached. The 'last-modified'
field in a url's headers is used if it is available. If it is not,
a checksum based on the content of the url (the previous algorithm)
is used as a fallback.
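The fallback logic can be sketched as follows; the dictionary keys
'last-modified' and 'checksum' mirror the cache fields described
above, but the function itself is illustrative:

```python
import hashlib

def is_url_fresh(cached, last_modified, content):
    """Return True if the cached copy of a url is still valid.
    Prefer the 'last-modified' header; fall back to a content
    checksum when the server does not send one."""
    if last_modified and cached.get('last-modified'):
        return cached['last-modified'] == last_modified
    checksum = hashlib.md5(content).hexdigest()
    return cached.get('checksum') == checksum
```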
Other Changes
=============
1. Regular expressions for filters are pre-compiled.
2. Derived HarvestManStateObject (config class) from 'dict' type.
3. Main thread 'joins' each tracker thread with zero timeout instead
of killing them at the end of project.
4. Optimization fix: Links are stored for localising, only if their
download is successful.
5. Assigned a 2:1 ratio of fetchers to crawlers instead of the
earlier 1:1 ratio.
6. Renamed all modules.
7. Used 'weakref' wherever possible to reduce extra references to
objects and avoid reference loops. This is mostly used in the
'GetObject' method and in the urlparser module.
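The weakref idea in one tiny example: a weak reference lets a registry
point at a shared object without keeping it alive, so no reference
loop forms. Names are illustrative, not HarvestMan's:

```python
import weakref

class Crawler:
    pass

crawler = Crawler()
ref = weakref.ref(crawler)   # registry holds a weak reference only
assert ref() is crawler      # dereference while the object lives
del crawler                  # drop the only strong reference
assert ref() is None         # the weak reference did not keep it alive
```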
Bug fixes
=========
1. Fixed a bug with http://localhost downloads. Bug ID # B1083256752.28 .
2. Fixed bug in url filter for images.
3. Fixed a bug with timezone printing. Bug ID # B1083253695.02.
4. Close file like object returned by opening urls
after reading data.
5. Fixed a bug in localising links. Directory like urls
need to be skipped.
6. Fixed a bug in finding the common domain for servers that
have fewer than three 'dots' in their name string. (This is
the same bug as # B1083256752.28.)
7. Fixed a bug in setting up network for clients behind a proxy/
firewall.
Version: 1.3.3 (bug fixes)
Release Date: Feb 24 2004
Changes in Version 1.3.3 from 1.3.2
===================================
1. Fixed bug with parsing of FTP links. Bug # B1077613467.85.
2. Fixed another bug with external server links.
3. Fixed bug with request control. Request dictionary
key is server name, not ip.
Version: 1.3.2 (minor feature enhancements)
Release Date: Feb 13 2004
Changes in Version 1.3.2 from 1.3.1
===================================
There is one minor feature in this release.
1. This release adds the ability to limit downloads by
controlling the number of simultaneous requests to the
same server. This option is controlled by the config
variable named 'control.requests'.
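A per-server request cap like this is commonly built on a semaphore
per server; a minimal sketch of the idea behind 'control.requests',
not HarvestMan's implementation:

```python
import threading
from collections import defaultdict

class RequestLimiter:
    """Cap the number of simultaneous requests to any one server."""

    def __init__(self, max_requests):
        self._semaphores = defaultdict(
            lambda: threading.BoundedSemaphore(max_requests))
        self._lock = threading.Lock()

    def acquire(self, server):
        """Try to start a request; False if the server is saturated."""
        with self._lock:               # defaultdict is not thread-safe
            sem = self._semaphores[server]
        return sem.acquire(blocking=False)

    def release(self, server):
        """Mark one request to 'server' as finished."""
        self._semaphores[server].release()
```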
2. Apart from that I have re-structured the package,
and added a distutils setup.py script which copies the
package to your PYTHON installation folder.
Version: 1.3.1 (bug fix)
Release Date: Feb 10 2004
Changes in Version 1.3.1 from 1.3
=================================
This is a bug-fix release that fixes most of the
critical and annoying HarvestMan bugs.
These bugs can be located in the bugs database
at http://harvestman.freezope.org/Discussons .
1. Fixed bug with query forms. The program no longer
tries to download server side query form links.
Bug #B1073291938.97.
2. Fixed bug with handling frame redirects. Bug #B1076402199.0.
3. Fixed bug with robots.txt url. Bug #B1072436188.35.
4. Fixed bug in finding out external server links.
Bug #B1076402348.52.
5. Fixed bug in external links with respect to subdomains.
Bug #B1076409910.45.
6. Fixed a bug with following non-existent links in a
directory listing. Bug #B1073028403.71.
7. Fixed problem in printing harvestman url in welcome
message.
8. Fixed some problems in config file parsing.
9. Fixed problem with printing version string (-v and
--version options).
10. Other miscellaneous fixes and corrections thanks to
Vivian, Sascha and some others.
Version: 1.3 (final)
Release Date: Dec 15 2003
Changes in Version 1.3 (from 1.3 a1)
=========================================
1. This version adds one feature, that of searching
a webpage for keywords. You can create complex
boolean regular expressions and supply them to
HarvestMan. HarvestMan will parse the regular
expressions and download only those web pages that
match the regular expression.
In simpler words, this means a keyword(s) search. :-)
For example, you need to download only those webpages
that contain the terms 'Saddam' and 'WMD'. You create
the following regular expression and pass it on to
HarvestMan as the option 'control.wordfilter'.
;; config file for harvestman
control.wordfilter (Saddam & WMD)
You use the boolean '&' and '|' to create the regular
expressions.
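A much-simplified sketch of how such a filter could be evaluated
against a page's text; it handles a single '&' or '|' group, whereas
the real recipe parses arbitrarily nested boolean expressions:

```python
import re

def match_wordfilter(expression, text):
    """Evaluate a flat boolean word filter such as '(Saddam & WMD)':
    '&' requires all words to occur in the text, '|' requires any."""
    expr = expression.strip()
    if expr.startswith('(') and expr.endswith(')'):
        expr = expr[1:-1]
    if '&' in expr:
        words = [w.strip() for w in expr.split('&')]
        return all(re.search(re.escape(w), text) for w in words)
    if '|' in expr:
        words = [w.strip() for w in expr.split('|')]
        return any(re.search(re.escape(w), text) for w in words)
    return re.search(re.escape(expr), text) is not None
```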
I have added this as a recipe in the ASPN Python Cookbook.
For more information on how it works, point to the URL,
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/252526.
Changes in Version 1.3 a1 (from 1.2 final)
=========================================
1. This version features the new threading model which was
started in the last release. This model is now completely
written to prevent thread deadlocking incidents. A description
of the model can be found in the HarvestMan webpage at
http://harvestman.freezope.org.
This model will be developed further and will be the default
for all future releases of HarvestMan.
2. The other major changes are complete re-writes of many modules.
Classes have been renamed wherever suitable and some function
names changed. The HarvestMan module has been trimmed up
considerably.
3. This version has added an extra module HarvestManUtils which has
some utility classes for reading/writing project & cache files and for
creating the browse page. The code for these were earlier in the
HarvestMan, HarvestManDataManager and HarvestManConfig modules.
4. The cache and project file information is compressed before writing
to files.
Changes in Version 1.2 final (from 1.2 rc2)
===========================================
1. Added support for javascript and java applet tag parsing.
HarvestMan can now fetch javascript (.js) files and
java applets (.class) files from webpages.
The code for parsing this sits in the new HTMLParser
customized for HarvestMan.
2. Split url trackers into two flavors - Fetchers and Getters.
Fetchers are responsible for crawling webpages and fetching links,
and Getters get the non-html files fetched by Fetchers. Images
are still fetched by the Fetchers in their threads.
This should help in the growth of this program and make future
development easier. Also this might help in preventing the thread
locking incidents.
3. Fixed bugs in localizing anchor type links. Rewrote HarvestManPageParser,
HarvestManUrlPathParser and HarvestManDataManager classes to take care
of this. Anchor links in webpages are localized correctly now.
4. Due to javascript/javaapplet parsing code in the new html parser,
many webpages which failed to work before (due to mostly javascript
tags which the parser could not understand) will work correctly now.
5. Other routine bug fixes.
a) Fixed a problem in creating the project browse page.
We need to provide the absolute path of the project start url file.
b) Fixed a problem in getRelativeFilename() in HarvestManUrlPathParser
class.
c) A few more...
Changes in Version 1.2 rc2 (from 1.2 rc1)
=========================================
Release Date: Sep 27 2003
1. Rewrote the algorithm for fetching urls with no filename
extension. We assume that the url is directory-like
(of the form dir/index.html) and try to fetch it at
url path resolving time (in the urlPathParser class).
If this fails, a 404 error is returned. The url is cached
for later lookup in the datamanager, in an invalid-urls cache,
and we re-resolve the url, assuming it now to be a file-like url
(of the form /file), and fetch it.
If it does not fail, the url is cached for later lookup
in the datamanager, in a valid-urls cache. The connector object
is also cached in a connector dictionary of the datamanager so
that we don't need to re-create the connection later.
This fixes the long-standing bug with urls that have no filename
extension.
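The try-directory-then-file flow above can be sketched as follows;
'fetch' stands in for the real connector and is a hypothetical
callable returning an HTTP status code:

```python
def resolve_url(url, fetch):
    """Resolve a url with no filename extension: first try it as a
    directory (dir/index.html); on a 404, fall back to treating it
    as a plain file."""
    directory_form = url.rstrip('/') + '/index.html'
    if fetch(directory_form) == 200:
        return directory_form       # directory-like: keep index.html
    return url                      # file-like: use the url as-is
```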
2. Rewrote the algorithm for localising links. Instead of re-parsing
html files and localising the links, a dictionary of html files
and their links is kept in the datamanager object. This dictionary
is used to localise the links once the downloads are complete.