datafilter.py

From "Harvestman (latest version)" · Python code · 86 lines


# -- coding: utf-8
""" Data filter plugin example based on the
simulator plugin for HarvestMan. This
plugin changes the behaviour of HarvestMan
to only simulate crawling without actually
downloading anything. In addition, it shows
how to get access to the data downloaded by the crawler,
to implement various kinds of data filters.

Author: Anand B Pillai <abpillai at gmail dot com>

Created Feb 7 2007  Anand B Pillai <abpillai at gmail dot com>
Modified Nov 2 2007 by: Nils Ulltveit-Moe <nils at u-moe dot no>

Copyright (C) 2007 Anand B Pillai
"""

__version__ = '2.0 b1'
__author__ = 'Anand B Pillai'

from harvestman.lib import hooks
from harvestman.lib.common.common import *
from HTMLParser import HTMLParser


class MyHTMLParser(HTMLParser):
    # Example on a HTML parser, to filter img tags
    def handle_starttag(self, tag, attrs):
        # This just prints the image tag and its attributes
        if tag == "img":
            print tag, attrs


def process_url(self, data):
    """ Post process url callback test """

    # This shows how to get access to the
    # downloaded HTML document that is being processed.
    # Data is either HTML document or None
    if data:
        p = MyHTMLParser()
        p.feed(data)
    return data


def save_url(self, urlobj):

    # For simulation, we need to modify the behaviour
    # of save_url function in HarvestManUrlConnector class.
    # This is achieved by injecting this function as a plugin
    # Note that the signatures of both functions have to
    # be the same.

    url = urlobj.get_full_url()
    self.connect(urlobj, True, self._cfg.retryfailed)

    return 6


def apply_plugin():
    """ All plugin modules need to define this method """

    # This method is expected to perform the following steps.
    # 1. Register the required hook function
    # 2. Get the config object and set/override any required settings
    # 3. Print any informational messages.

    # The first step is required, the last two are of course optional
    # depending upon the required application of the plugin.

    cfg = objects.config
    cfg.simulate = True
    cfg.localise = 0

    # Dummy function that does not really write the mirrored files.
    hooks.register_plugin_function('connector:save_url_plugin', save_url)

    # Hook to get access to the downloaded data after process_url has been called.
    hooks.register_post_callback_method('crawler:fetcher_process_url_callback',
                                        process_url)

    # Turn off caching, since no files are saved
    cfg.pagecache = 0
    # Turn off header dumping, since no files are saved
    cfg.urlheaders = 0

    logconsole('Simulation mode turned on. Crawl will be simulated and no files will be saved.')
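
The MyHTMLParser class above only prints each img tag as it is parsed. A data filter will usually want to keep what it finds, so the following is a minimal standalone sketch of that idea, with no dependency on HarvestMan. The names ImgCollector and collect_images are hypothetical and are not part of the plugin or of HarvestMan; the import fallback is an assumption added so the sketch runs on both Python 2 and Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2, as used by the plugin
except ImportError:
    from html.parser import HTMLParser  # Python 3 location of the same class


class ImgCollector(HTMLParser):
    """Hypothetical variant of MyHTMLParser that stores img attributes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep one dict per img tag
        if tag == "img":
            self.images.append(dict(attrs))


def collect_images(data):
    """Return a list of attribute dicts, one per img tag found in data."""
    # Mirrors process_url: data is either an HTML document or None
    if not data:
        return []
    parser = ImgCollector()
    parser.feed(data)
    parser.close()
    return parser.images
```

Inside a callback such as process_url, calling collect_images(data) would give the filter a list it can inspect or log, while the callback still returns data unchanged.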
