is updated at crawl time with the url objects for each html
file. This dictionary is used at the end for localizing.
This speeds up localization by as much as 500%.
3. Fixed a bug in calculating project time. (Time for localization
should not be included).
4. Modified the printing of error messages. Error messages are printed
only at verbosity levels of 3 and up. OS and IO exceptions are
printed only at verbosity level 4 (debug).
To see url error messages (connection errors), you now need to set
the verbosity to 3.
At the default verbosity level (2), no error messages are shown.
5. Modified the checking of hanging threads. This check was not done
properly before. It is now done in the loop that checks for the exit
condition. Also reduced the default timeout for hanging threads from
600 seconds (10 minutes) to 120 seconds (2 minutes).
Added a socket timeout for sockets, equal to the thread timeout above.
(This works for users running Python 2.3.)
This should fix the problem of hanging threads in a big way.
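For illustration, the socket timeout can be set globally with Python 2.3's
socket.setdefaulttimeout (a minimal sketch; the value mirrors the 120-second
thread timeout above):

    import socket

    # Apply a default timeout (in seconds) to every socket created after
    # this call; a stalled connection then raises socket.timeout instead
    # of hanging its thread forever. Available from Python 2.3 onwards.
    socket.setdefaulttimeout(120)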
Changes in Version 1.2 rc1 (from 1.2 alpha)
===========================================
Release Date: Sep 24 2003
1. Removed the earlier global download lock. Earlier, the url
connector instances shared a common lock which they had to acquire
before downloading. This meant only a single download was possible at a
given moment.
This has been changed to allow multiple simultaneous downloads, the
number of which can be specified in the configuration file.
2. Any number of connections can now be specified in the config file.
The program makes sure that only that many connections are
running at a given instant. This takes the place of the previous
global download lock. Since many simultaneous downloads are now possible
(apart from many threads), the program is much faster than before.
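A bounded semaphore is one way to enforce such a connection cap (an
illustrative sketch, not HarvestMan's actual code; MAX_CONNECTIONS and
fetch are assumed names):

    import threading

    MAX_CONNECTIONS = 5   # illustrative; the real value comes from the config file
    connection_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

    def download(url):
        # At most MAX_CONNECTIONS threads get past this point at a time,
        # unlike the old global lock, which allowed exactly one.
        with connection_slots:
            return fetch(url)   # 'fetch' is a hypothetical download routine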
3. Added an option for writing pickled cache files, which is
the default in this release. XML cache files take a long
time to read if they are big.
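The benefit is that a pickled cache is read back with a single load call,
with no parsing step (a sketch; 'cache' stands for the url-to-file dictionary):

    import pickle

    cache = {}   # url -> file mapping, illustrative

    # Writing the cache: a single dump call, no XML serialization.
    with open('project.cache', 'wb') as f:
        pickle.dump(cache, f)

    # Reading it back: a single load call, no XML parsing.
    with open('project.cache', 'rb') as f:
        cache = pickle.load(f)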
4. Integrated the genconfig.py script with the harvestManConfig class.
This makes future development of this script easier. Added an abort
condition to the script, which can be invoked by pressing the <space>
key.
5. Fixes for handling error conditions in the url connector class.
Arbitrary error numbers are no longer used; instead, we try to
get the error number by parsing the error strings.
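For instance, exception messages often embed the errno, which can be pulled
out along these lines (an assumed sketch, not the actual connector code):

    import re

    ERRNO_RE = re.compile(r'\[Errno (\d+)\]')

    def errno_from_message(message):
        # Messages often look like "[Errno 110] Connection timed out";
        # extract the number instead of assigning an arbitrary one.
        match = ERRNO_RE.search(str(message))
        return int(match.group(1)) if match else None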
6. Redownloading of failed links now happens only for links that failed
with non-fatal errors. This speeds up projects.
7. Modified the regular expression behaviour. Regular expressions are
now compiled once to speed up searching, as sketched below.
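Compiling once at module level avoids recompiling on every call (the pattern
below is only illustrative; HarvestMan's actual expressions differ):

    import re

    # Compiled once at import time and reused on every call.
    HREF_RE = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)

    def extract_links(html):
        return HREF_RE.findall(html)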
8. Moved code around from HarvestMan.py module to reduce its size.
Parsing of config file is now done in the HarvestManConfig module.
9. Removed usage of the 'string' module everywhere and replaced it
with methods on string objects.
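For example (an illustrative snippet; the old call is shown as a comment):

    s = '  HarvestMan  '
    # Old style, via the string module (Python 2): string.strip(s)
    # New style, via a method on the string object:
    clean = s.strip()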
10. Added a timeout option for the project. Sometimes the last thread
in the program does not complete, hanging an otherwise well-downloaded
project. This option looks at the last data operation on the url queue
and times it. If the time since the last operation (get/put) is more
than a prescribed limit, the project times out.
We also now wait for the download sub-threads to complete their work
before exiting. This fixes premature project exit conditions.
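The idea can be sketched like this (assumed names and timeout value, not the
actual queue code):

    import time

    PROJECT_TIMEOUT = 300.0   # illustrative limit, in seconds
    last_activity = time.time()

    def note_queue_activity():
        # Called on every get/put operation on the url queue.
        global last_activity
        last_activity = time.time()

    def project_timed_out():
        # True once the queue has been idle for longer than the limit.
        return time.time() - last_activity > PROJECT_TIMEOUT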
11. Change in writing project files. We now write pickled project files
instead of XML project files. This will be the default from this
release.
12. Bug fixes in the urlpathparser module for relative filename
computation errors.
13. Bug fixes in rules module. Rewrote some methods in this module.
14. Fixes in creating the project browse page. The project browse
page entry is now created correctly for every new project.
15. Many other routine bug fixes to speed up downloads and reduce
threading bugs.
Changes in Version 1.2 alpha (From version 1.1.2)
================================================
1. This version introduces limited support for cookies.
This is experimental code, written from scratch
following RFC 2109. The cookie support is pretty
basic, with only domain cookies supported. Netscape-style
cookies may not work.
2. Support for webpage caching is available. A cache
file (xml) is created in the project directory the
first time a project is run. The cache file associates
urls with files on the disk. We compare files using
an md5 checksum of the file content. On any
further run of the project, only the out-of-date
files are re-fetched.
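A checksum helper along these lines would do the comparison (a sketch using
the modern hashlib module; Python 2.3-era code would have used the old md5
module instead):

    import hashlib

    def file_checksum(path):
        # If the checksum of a url's file matches the cached value, the
        # url is up to date and need not be re-fetched.
        with open(path, 'rb') as f:
            return hashlib.md5(f.read()).hexdigest()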
3. Many bug fixes and better error checking.
4. Bugs in genconfig script fixed.
5. Documentation changes: We now provide an RTF version of the
documentation file. (Requested by John J Lee of
ClientCookie fame.)
Changes in Version 1.1.2 (From version 1.1.1)
=============================================
1. Added a fast html parser based on the sgmlop module by F. Lundh.
This can be selected by setting the variable HTMLPARSER in the
config file to 1. The default parser is still the standard
Python parser.
2. Added an option to localise links relatively. This is now the
default. That is, we don't replace filenames with their
absolute pathnames but only their relative pathnames, so that users
can browse the downloaded pages on another filesystem as well.
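In modern Python the relative computation might look like this (a sketch;
os.path.relpath postdates the Python versions of this era):

    import os

    def relative_link(referring_page, target_file):
        # Compute the link from the directory of the referring page to
        # the target file, so the downloaded tree can be moved to another
        # filesystem and still browse correctly.
        return os.path.relpath(target_file, start=os.path.dirname(referring_page))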
3. Added an option for the user to control md5 checksumming of files.
This option is controlled by the variable CHECKFILES in the
config file.
4. Comments are now supported at the end of an option line in the config file.
(E.g. <URL http://www.python.org # This is the url> is valid now.
It would have thrown an error before.)
5. We do not localise form links. This makes sure that a cgi
query goes directly to the webserver.
6. An option for JIT (Just In Time) localization of url links.
If this option is selected, then urls in html files are localized
immediately after they are downloaded, instead of at the end.
Changes In Architecture (Version 1.1)
=====================================
1. Global Object Register/Lookup
-----------------------------
One of the major changes in this version is the architecture of the HarvestMan program.
It uses a modified object-oriented approach of looking up objects whenever the services
of an object are needed by other objects. The classes no longer maintain pointers to
other class instances inside them.
All HarvestMan program objects register themselves with a global registry/look-up object
when they are created. (It is up to the programmer to do this.) The registry object is
a Borg singleton, ensuring that the state of the objects is maintained. The objects are
stored in the dictionary of the registry object using strings as the keys.
When an object needs the services of another, it performs a simple 'query' or 'lookup'
of the registry using the key of that particular object. (This should be known; right now
we don't support a publish/subscribe mechanism, but it will be added later.) The registry
object sits in the HarvestMan globals module, so it is available to objects in all modules
which import this module. An example is given below.
# Create and register the object.
obj1 = HarvestManObject1()
HarvestManGlobals.SetObject('object1', obj1)
# Object2 wants services of obj1
obj1instance = HarvestManGlobals.GetObject('object1')
# Use its services
obj1instance.func1(...)
This makes adding new modules to HarvestMan easy, provided you make sure
to register their objects in the globals module.
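For illustration, a Borg-style registry could be implemented along these
lines (a sketch; the actual HarvestManGlobals code may differ):

    class Registry(object):
        # Borg pattern: every instance shares the same __dict__, so any
        # module can construct a Registry and see the same object store.
        _shared_state = {'objects': {}}

        def __init__(self):
            self.__dict__ = self._shared_state

    def SetObject(key, obj):
        Registry().objects[key] = obj

    def GetObject(key):
        return Registry().objects.get(key)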
2. Threading Model
---------------
HarvestMan versions up to 1.0 used a model where url tracker threads were stored in a
queue. A url tracker object containing the data for a url was pushed into the queue and
later popped by a monitor object so that downloads could be controlled. This gave rise
to problems in controlling threads, and to overhead in the form of new thread contexts,
since we were not reusing threads.
HarvestMan version 1.1 pre-launches and reuses threads. Also, only thread
data is managed in the queue, not the threads themselves. The number of threads
(as per the config file or command line user input) is launched at the beginning of
the program. The threads run in a loop looking for url data, which is managed by a url
data queue. Threads post their url data to this queue. This ensures that we always have
a given number of threads running. It also reduces overhead and latency.
HarvestMan sub-threads in the HarvestManUrlThread module still use a launch-per-request
mechanism (a new thread per request). This might be changed in future releases.
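A minimal sketch of the pre-launched worker model (in modern Python; the
queue name and the process routine are assumptions):

    import queue
    import threading

    NUM_THREADS = 10          # would come from the config file or command line
    url_queue = queue.Queue()

    def tracker_loop():
        # Each pre-launched thread loops forever, pulling url data off the
        # shared queue; threads are reused rather than launched per url.
        while True:
            url_data = url_queue.get()
            try:
                process(url_data)   # 'process' is a hypothetical crawl routine
            finally:
                url_queue.task_done()

    for _ in range(NUM_THREADS):
        worker = threading.Thread(target=tracker_loop)
        worker.daemon = True
        worker.start()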
3. Code Reorganization
-------------------
The new version features some extra modules which were created by moving code
out of existing modules and rewriting it. The aim was to split the crawler code from
the data management code, in which we succeeded quite well. There is a new Data Manager
module which takes care of scheduling download requests, indexing files, keeping
file statistics and localizing links. A Rules module checks the HarvestMan download
rules (this was earlier done by the previous "WebUrlTrackerMonitor" class).
A synchronization lock has been added in the Connector module. This might
slow down downloads a bit, but should ensure that threads don't corrupt the data.
Interested users can experiment with the lock, removing or modifying it, and
see how it works. Please report any performance improvements you see to the
authors.
4. Other Changes
-------------
For other changes continue reading.
HISTORY
=======
+-----------------------------------------+
|Changes in Version 1.1 (from Version 1.0)|
+-----------------------------------------+
1. A project file is created for every project in the harvestman directory
in the subdirectory 'projects'.
2. Always download css files related to a web page, even if
they are outside the domain or directory. The same applies to images. Config options
for both were added in the config file.
3. Added a config file option to rename dynamically generated images.
Works right now for jpeg/gif images.
4. Modified the urlfilter algorithm to check the order of filter
strings in case of a collision in filter results.
5. Added a new option FETCHLEVEL to the program to allow very
basic control of download. For details see Readme.txt/HarvestMan.doc
file.
6. Get background images of webpages.
7. Better error/message logging. Error files are created in each project's
download directory. All messages are logged to a file in the harvestman
installation directory, named 'harvestman.log' by default. The user
can change this by editing the config file. This file is created fresh
for every project.
8. Added support for getting files from ftp servers.
9. Write a project file based on HarvestMan.dtd before starting to crawl.
This file is written to the base directory.
10. The stats file is no longer written to the current directory under "projects". Instead
it is written to the project directory of the particular project.
11. Added command line support.
12. Modified the proxy setting. Removed the port number from the proxy string; the port
number needs to be specified as a separate config entry.
13. Modified writing of stats. Stats are written to a file named 'projectname.hst' (where
projectname is the name of the current project) to the project directory. The file
extension 'hst' stands for 'HarvestMan Stats File'.
14. Write a binary project file also.
15. Modified the localise links function to take care of localising anchor-type links also.
This was an undetected bug in version 1.0.
16. HarvestMan can now load projects from saved project files. This can be done for
both the xml and binary project files. Added encryption for proxy related data.
17. Fixed some bugs in genconfig script. The script now encrypts any proxy related data
(except port number) before writing it to the config file.
18. Added code in WebUrlConnector to request authentication information from the user
for a proxy-authenticated firewall. If the project file does not contain this information,
it will be requested from the user interactively.
19. WebRobotParser module uses the services of WebUrlConnector now, instead of having
its own internet connection code.
20. Added a mechanism to log errors made in the config file, and inform user about it
at the end. The mechanism uses a list of strings in the Global module (HarvestManGlobals).
21. Updated HarvestMan.dtd to add the new config entries. (CONFIGFILE/PROJECTFILE).
22. Modified FETCHLEVEL handling. Levels 0-1 no longer fetch external server links;
they fetch only local links. Level 2 fetches local plus first-level external links, and
level 3 fetches any link.
23. Tried different approaches to running the thread queue. Ideally the runTrackers() method
should be called when you start the project, and it should run separately from the
push() method. But this led to blocking of the last download thread in many tests, since
the CPU seems to run the runTrackers() method at higher priority than the last download
thread. So I reverted to the existing method of running trackers, where the push method
makes a call to runTrackers(). (I know that it is not good thread programming, but it works.)
24. Modified the webUrlConnector class; this class now accepts a urlPathParser object
instead of a url directly. This makes handling of urls easier, and we can pass more
information around. Made corresponding changes to the Monitor/Tracker/Thread classes.
25. Fixes for slowmode. Rewrote some code.
+-----------------------------------------+
|Changes in Version 1.0 (from Version 0.8)|
+-----------------------------------------+
1. Fully multithreaded. Multithreaded mode is the default.
2. Depth fetching for starting server and external servers in config file.
3. Browser page for projects similar to HTTrack.
4. Added re-fetching of failed urls.
5. Support for intranet servers.
6. Verbosity option added in config file.
7. Lots of configurable options added in the config file.
The list of options (apart from the basic ones) is now about 30.
8. A signal handler for keyboard interrupts automatically does clean-up jobs.