taganalyzer.py

From "Harvestman (latest version)" · Python code · 67 lines


#!/usr/bin/env python

"""
taganalyzer.py - Demonstrating custom crawler writing by
subscribing to events. This is a crawler which allows
you to subscribe to HTML tag parsing events and to
take actions.

Created by Anand B Pillai <abpillai at gmail dot com>

Copyright (C) 2008 Anand B Pillai
"""

import sys
import __init__
from harvestman.apps.spider import HarvestMan
from harvestman.lib.common.common import CaselessDict

class TagAnalyzingCrawler(HarvestMan):
    """ A crawler which can perform custom tag analysis """

    def __init__(self):
        # Dictionary for storing information
        self.d = {'images_no_alt': [], 'csslinks': []}
        super(TagAnalyzingCrawler, self).__init__()

    def write_this_url(self, event, *args, **kwargs):
        # Since we are doing only tag analysis, don't write anything..
        return False

    def analyze_this_tag(self, event, *args, **kwargs):
        tag = kwargs.get('tag','')
        attrs = kwargs.get('attrs',None)

        # This performs a check on images not having the 'alt' attribute...
        if tag.lower() == 'img':
            d = CaselessDict(attrs)
            if not 'alt' in d:
                imgurl = d['src'] or d['href']
                self.d['images_no_alt'].append(imgurl)

    def finish_event_cb(self, event, *args, **kwargs):
        print self.d
        info = open('tagsinfo.txt','w')

        if len(self.d['images_no_alt']):
            info.write('Image URLs without "alt" attribute\n')
            for url in self.d['images_no_alt']:
                info.write(url + '\n')

        info.close()

if __name__ == "__main__":
    spider=TagAnalyzingCrawler()
    spider.initialize()
    config = spider.get_config()
    # Disable caching
    config.pagecache = 0
    spider.bind_event('writeurl', spider.write_this_url)
    spider.bind_event('beforetag', spider.analyze_this_tag)
    spider.bind_event('beforefinish', spider.finish_event_cb)
    spider.main()
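The "image without alt" check performed in analyze_this_tag can be tried without a HarvestMan installation. The following is a minimal sketch, assuming Python 3 and only the standard library's html.parser module. It does not use HarvestMan's event system, and the names MissingAltFinder and find_images_without_alt are hypothetical, introduced here purely for illustration.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the URLs of <img> tags that have no 'alt' attribute.

    Hypothetical helper, not part of HarvestMan.
    """

    def __init__(self):
        super().__init__()
        self.images_no_alt = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call,
        # so no case-insensitive dictionary is needed here.
        if tag != "img":
            return
        attr_map = dict(attrs)
        if "alt" not in attr_map:
            # Prefer 'src', fall back to 'href', then to an empty string,
            # so a missing attribute never raises KeyError.
            url = attr_map.get("src") or attr_map.get("href") or ""
            self.images_no_alt.append(url)


def find_images_without_alt(html):
    """Return the list of image URLs in `html` lacking an 'alt' attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.images_no_alt


if __name__ == "__main__":
    sample = '<p><IMG SRC="a.png"><img src="b.png" alt="B"></p>'
    for url in find_images_without_alt(sample):
        print(url)
```

Using dict.get with fallbacks is a deliberate choice in this sketch: an img tag that has neither src nor href is recorded as an empty string instead of aborting the scan.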

