is updated at crawl time with the url objects for each html
file. This dictionary is used at the end for localizing.
This speeds up localization by as much as 500%.
3. Fixed a bug in calculating project time. (Time for localization
should not be included).
4. Modified the printing of error messages. Error messages are printed
only at verbosity levels of 3 and up. OS and IO exceptions are
printed only at verbosity level 4 (debug).
To see url error messages (connection errors), you now need to set
the verbosity to 3.
At the default verbosity level (2), no error messages are shown.
5. Modified the checking of hanging threads. This check was not done
properly before. It is now done in the loop that checks for the exit
condition. Also reduced the default timeout for hanging threads from
600 seconds (10 minutes) to 120 seconds (2 minutes).
Added a socket timeout for sockets, equal to the thread timeout above.
(This works for users running Python 2.3.)
This should fix the problem of hanging threads in a big way.
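For illustration, the socket timeout can be set globally with Python 2.3's
socket.setdefaulttimeout (a minimal sketch; the value mirrors the 120-second
thread timeout above):

    import socket

    # Apply a default timeout (in seconds) to every socket created after
    # this call; a stalled connection then raises socket.timeout instead
    # of hanging its thread forever. Available from Python 2.3 onwards.
    socket.setdefaulttimeout(120)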
Changes in Version 1.2 rc1 (from 1.2 alpha)
===========================================
Release Date: Sep 24 2003
1. Removed the earlier global download lock. Earlier, the url
connector instances shared a common lock which they had to acquire
before downloading. This meant only a single download was possible at a
given moment.
This has been changed to allow multiple simultaneous downloads, the
number of which can be specified in the configuration file.
2. Any number of connections can now be specified in the config file.
The program makes sure that only that many connections are
running at a given instant. This takes the place of the previous
global download lock. Since many simultaneous downloads are now possible
(apart from many threads), the program is much faster than before.
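A bounded semaphore is one way to enforce such a connection cap (an
illustrative sketch, not HarvestMan's actual code; MAX_CONNECTIONS and
fetch are assumed names):

    import threading

    MAX_CONNECTIONS = 5   # illustrative; the real value comes from the config file
    connection_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

    def download(url):
        # At most MAX_CONNECTIONS threads get past this point at a time,
        # unlike the old global lock, which allowed exactly one.
        with connection_slots:
            return fetch(url)   # 'fetch' is a hypothetical download routine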
3. Added an option for writing pickled cache files, which is
the default in this release. XML cache files take a long
time to read if they are big.
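The benefit is that a pickled cache is read back with a single load call,
with no parsing step (a sketch; 'cache' stands for the url-to-file dictionary):

    import pickle

    cache = {}   # url -> file mapping, illustrative

    # Writing the cache: a single dump call, no XML serialization.
    with open('project.cache', 'wb') as f:
        pickle.dump(cache, f)

    # Reading it back: a single load call, no XML parsing.
    with open('project.cache', 'rb') as f:
        cache = pickle.load(f)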
4. Integrated the genconfig.py script with the harvestManConfig class.
This makes future development of this script easier. Added an abort
condition to the script, which can be invoked by pressing the <space>
key.
5. Fixes for handling error conditions in the url connector class.
Arbitrary error numbers are no longer used; instead, we try to
get the error number by parsing the error strings.
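For instance, exception messages often embed the errno, which can be pulled
out along these lines (an assumed sketch, not the actual connector code):

    import re

    ERRNO_RE = re.compile(r'\[Errno (\d+)\]')

    def errno_from_message(message):
        # Messages often look like "[Errno 110] Connection timed out";
        # extract the number instead of assigning an arbitrary one.
        match = ERRNO_RE.search(str(message))
        return int(match.group(1)) if match else None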
6. Redownloading of failed links now happens only for links that failed
with non-fatal errors. This speeds up projects.
7. Modified the regular expression behaviour. Regular expressions are
now compiled once to speed up searching, as sketched below.
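Compiling once at module level avoids recompiling on every call (the pattern
below is only illustrative; HarvestMan's actual expressions differ):

    import re

    # Compiled once at import time and reused on every call.
    HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)

    def extract_links(html):
        return HREF_RE.findall(html)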
8. Moved code around from HarvestMan.py module to reduce its size.
Parsing of config file is now done in the HarvestManConfig module.
9. Removed usage of the 'string' module everywhere and replaced it
with methods on string objects.
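For example (an illustrative snippet; the old call is shown as a comment):

    s = '  HarvestMan  '
    # Old style, via the string module (Python 2): string.strip(s)
    # New style, via a method on the string object:
    clean = s.strip()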
10. Added a timeout option for the project. Sometimes the last thread
in the program does not complete, hanging an otherwise well-downloaded
project. This option looks at the last data operation on the url queue
and times it. If the time since the last operation (get/put) is more
than a prescribed limit, the project times out.
We also now wait for the download sub-threads to complete their work
before exiting. This fixes premature project exit conditions.
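The idea can be sketched like this (assumed names and timeout value, not the
actual queue code):

    import time

    PROJECT_TIMEOUT = 300.0   # illustrative limit, in seconds
    last_activity = time.time()

    def note_queue_activity():
        # Called on every get/put operation on the url queue.
        global last_activity
        last_activity = time.time()

    def project_timed_out():
        # True once the queue has been idle for longer than the limit.
        return time.time() - last_activity > PROJECT_TIMEOUT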
11. Change in writing project files. We now write pickled project files
instead of XML project files. This will be the default from this
release.
12. Bug fixes in the urlpathparser module for relative filename
computation errors.
13. Bug fixes in rules module. Rewrote some methods in this module.
14. Fixes in creating the project browse page. The project browse
page entry is now created correctly for every new project.
15. Many other routine bug fixes to speed up downloads and reduce
threading bugs.
Changes in Version 1.2 alpha (From version 1.1.2)
================================================
1. This version introduces limited support for cookies.
This is experimental code, written from scratch
following RFC 2109. The cookie support is pretty
basic, with only domain cookies supported. Netscape-style
cookies may not work.
2. Support for webpage caching is available. A cache
file (xml) is created in the project directory the
first time a project is run. The cache file associates
urls with files on the disk. We compare files using
an md5 checksum of the file content. On any
further run of the project, only the out-of-date
files are re-fetched.
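A checksum helper along these lines would do the comparison (a sketch using
the modern hashlib module; Python 2.3-era code would have used the old md5
module instead):

    import hashlib

    def file_checksum(path):
        # If the checksum of a url's file matches the cached value, the
        # url is up to date and need not be re-fetched.
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()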
3. Many bug fixes and better error checking.
4. Bugs in genconfig script fixed.
5. Documentation changes: We now provide an RTF version of the
documentation file. (Requested by John J Lee of
ClientCookie fame.)
Changes in Version 1.1.2 (From version 1.1.1)
=============================================
1. Added a fast html parser based on the sgmlop module by F. Lundh.
This can be selected by setting the variable HTMLPARSER in the
config file to 1. The default parser is still the standard
Python parser.
2. Added an option to localise links relatively. This is now the
default. That is, we don't replace filenames with their
absolute pathnames but only their relative pathnames, so that users
can browse the downloaded pages on another filesystem as well.
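In modern Python the relative computation might look like this (a sketch;
os.path.relpath postdates the Python versions of this era):

    import os

    def relative_link(referring_page, target_file):
        # Compute the link from the directory of the referring page to
        # the target file, so the downloaded tree can be moved to another
        # filesystem and still browse correctly.
        return os.path.relpath(target_file, start=os.path.dirname(referring_page))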
3. Added an option for the user to control md5 checksumming of files.
This option is controlled by the variable CHECKFILES in the
config file.
4. Comments are now supported at the end of an option line in the config file.
(E.g. <URL http://www.python.org # This is the url> is valid now.
It would have thrown an error before.)
5. We do not localise form links. This makes sure that a cgi
query goes directly to the webserver.
6. An option for JIT (Just In Time) localization of url links.
If this option is selected, then urls in html files are localized
immediately after they are downloaded, instead of at the end.
Changes In Architecture (Version 1.1)
=====================================
1. Global Object Register/Lookup
-----------------------------
One of the major changes in this version is the architecture of the HarvestMan program.
It uses a modified object-oriented approach of looking up objects whenever the services
of an object are needed by other objects. The classes no longer maintain pointers to
other class instances inside them.
All HarvestMan program objects register themselves with a global registry/look-up object
when they are created. (It is up to the programmer to do this.) The registry object is
a Borg singleton, ensuring that the state of the objects is maintained. The objects are
stored in the dictionary of the registry object using strings as the keys.
When an object needs the services of another, it performs a simple 'query' or 'lookup'
of the registry using the key of that particular object. (This should be known; right now
we don't support a publish/subscribe mechanism, but it will be added later.) The registry
object sits in the HarvestMan globals module, so it is available to objects in all modules
which import this module. An example is given below.
# Create and register the object.
obj1 = HarvestManObject1()
HarvestManGlobals.SetObject('object1', obj1)
# Object2 wants services of obj1
obj1instance = HarvestManGlobals.GetObject('object1')
# Use its services
obj1instance.func1(...)
This makes adding new modules to HarvestMan easy, provided you make sure
to register their objects in the globals module.
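For illustration, a Borg-style registry could be implemented along these
lines (a sketch; the actual HarvestManGlobals code may differ):

    class Registry(object):
        # Borg pattern: every instance shares the same __dict__, so any
        # module can construct a Registry and see the same object store.
        _shared_state = {'objects': {}}

        def __init__(self):
            self.__dict__ = self._shared_state

    def SetObject(key, obj):
        Registry().objects[key] = obj

    def GetObject(key):
        return Registry().objects.get(key)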
2. Threading Model
---------------
HarvestMan versions up to 1.0 used a model where url tracker threads were stored in a
queue. A url tracker object containing the data for a url was pushed into the queue and
later popped by a monitor object so that downloads could be controlled. This gave rise
to problems in controlling threads, and to overhead in the form of new thread contexts,
since we were not reusing threads.
HarvestMan version 1.1 pre-launches and reuses threads. Also, only thread
data is managed in the queue, not the threads themselves. The number of threads
(as per the config file or command line user input) is launched at the beginning of
the program. The threads run in a loop looking for url data, which is managed by a url
data queue. Threads post their url data to this queue. This ensures that we always have
a given number of threads running. It also reduces overhead and latency.
HarvestMan sub-threads in the HarvestManUrlThread module still use a launch-per-request
mechanism (a new thread per request). This might be changed in future releases.
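A minimal sketch of the pre-launched worker model (in modern Python; the
queue name and the process routine are assumptions):

    import queue
    import threading

    NUM_THREADS = 10          # would come from the config file or command line
    url_queue = queue.Queue()

    def tracker_loop():
        # Each pre-launched thread loops forever, pulling url data off the
        # shared queue; threads are reused rather than launched per url.
        while True:
            url_data = url_queue.get()
            try:
                process(url_data)   # 'process' is a hypothetical crawl routine
            finally:
                url_queue.task_done()

    for _ in range(NUM_THREADS):
        worker = threading.Thread(target=tracker_loop)
        worker.daemon = True
        worker.start()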
3. Code Reorganization
-------------------
The new version features some extra modules which were created by moving code
out of existing modules and rewriting it. The aim was to split the crawler code from
the data management code, in which we succeeded quite well. There is a new Data Manager
module which takes care of scheduling download requests, indexing files, keeping
file statistics and localizing links. A Rules module checks the HarvestMan download
rules (this was earlier done by the previous "WebUrlTrackerMonitor" class).
A synchronization lock has been added in the Connector module. This might
slow down downloads a bit, but should ensure that threads don't corrupt the data.
Interested users can experiment with the lock, removing or modifying it, and
see how it works. Please report any performance improvements you see to the
authors.
4. Other Changes
-------------
For other changes continue reading.
HISTORY
=======
+-----------------------------------------+
|Changes in Version 1.1 (from Version 1.0)|
+-----------------------------------------+
1. A project file is created for every project in the harvestman directory
in the subdirectory 'projects'.
2. Always download css files related to a web page, even if
they are outside the domain or directory. The same applies to images. Config options
for both were added in the config file.
3. Added a config file option to rename dynamically generated images.
Works right now for jpeg/gif images.
4. Modified the urlfilter algorithm to check the order of filter
strings in case of a collision in filter results.
5. Added a new option FETCHLEVEL to the program to allow very
basic control of download. For details see Readme.txt/HarvestMan.doc
file.
6. Get background images of webpages.
7. Better error/message logging. Error files are created in each project's
download directory. All messages are logged to a file in the harvestman
installation directory, named 'harvestman.log' by default. The user
can change this by editing the config file. This file is created fresh
for every project.
8. Added support for getting files from ftp servers.
9. Write a project file based on HarvestMan.dtd before starting to crawl.
This file is written to the base directory.
10. The stats file is no longer written to the current directory under "projects". Instead
it is written to the project directory of the particular project.
11. Added command line support.
12. Modified the proxy setting. Removed the port number from the proxy string; the port
number needs to be specified as a separate config entry.
13. Modified writing of stats. Stats are written to a file named 'projectname.hst' (where
projectname is the name of the current project) to the project directory. The file
extension 'hst' stands for 'HarvestMan Stats File'.
14. Write a binary project file also.
15. Modified the localise links function to take care of localising anchor-type links also.
This was an undetected bug in version 1.0.
16. HarvestMan can now load projects from saved project files. This can be done for
both the xml and binary project files. Added encryption for proxy related data.
17. Fixed some bugs in genconfig script. The script now encrypts any proxy related data
(except port number) before writing it to the config file.
18. Added code in WebUrlConnector to request authentication information from the user
for a proxy-authenticated firewall. If the project file does not contain this information,
it will be requested from the user interactively.
19. WebRobotParser module uses the services of WebUrlConnector now, instead of having
its own internet connection code.
20. Added a mechanism to log errors made in the config file, and inform user about it
at the end. The mechanism uses a list of strings in the Global module (HarvestManGlobals).
21. Updated HarvestMan.dtd to add the new config entries. (CONFIGFILE/PROJECTFILE).
22. Modified FETCHLEVEL handling. Levels 0-1 no longer fetch external server links;
they fetch only local links. Level 2 fetches local plus first-level external links, and
level 3 fetches any link.
23. Tried different approaches to running the thread queue. Ideally the runTrackers() method
should be called when you start the project, and it should run separately from the
push() method. But this led to blocking of the last download thread in many tests, since
the CPU seems to run the runTrackers() method at higher priority than the last download
thread. So I reverted to the existing method of running trackers, where the push method
makes a call to runTrackers(). (I know that it is not good thread programming, but it works.)
24. Modified the webUrlConnector class; this class now accepts a urlPathParser object
instead of a url directly. This makes handling of urls easier, and we can pass more
information around. Made corresponding changes to the Monitor/Tracker/Thread classes.
25. Fixes for slowmode. Rewrote some code.
+-----------------------------------------+
|Changes in Version 1.0 (from Version 0.8)|
+-----------------------------------------+
1. Fully multithreaded. Multithreaded mode is the default.
2. Depth fetching for starting server and external servers in config file.
3. Browser page for projects similar to HTTrack.
4. Added re-fetching of failed urls.
5. Support for intranet servers.
6. Verbosity option added in config file.
7. Lots of configurable options added in the config file.
The list of options (apart from the basic ones) is now about 30.
8. A signal handler for keyboard interrupts automatically does clean-up jobs.