htmlcrawler.py

Harvestman - latest version

#!/usr/bin/env python
"""
htmlcrawler.py - Demonstrating custom crawler writing by
subscribing to events. This is a crawler which fetches
only web pages from the web.

Created by Anand B Pillai <abpillai at gmail dot com>

Copyright (C) 2008 Anand B Pillai
"""

import sys
import __init__
from harvestman.apps.spider import HarvestMan


class HtmlCrawler(HarvestMan):
    """ A crawler which fetches only HTML (webpage) pages """

    def include_this_link(self, event, *args, **kwargs):

        url = event.url
        if url.is_webpage():
            # Allow for further processing by rules...
            # otherwise we will end up crawling the entire
            # web, since no other rules will apply if we
            # return True here.
            return None
        else:
            return False


if __name__ == "__main__":
    spider = HtmlCrawler()
    spider.initialize()
    spider.bind_event('includelinks', spider.include_this_link)
    spider.main()
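
The comment in `include_this_link` describes a three-way return value: `False` rejects the link, `None` defers to the crawler's other rules, and `True` would accept the link without consulting them. The following is a minimal, self-contained sketch of that pattern. It does not use HarvestMan and is not its implementation: `MiniCrawler`, `default_rules`, `should_include`, `is_webpage`, `html_only`, and the extension list are all hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse

# Assumed list of extensions treated as "web pages" (illustrative only).
WEBPAGE_EXTENSIONS = ("", ".html", ".htm", ".php", ".asp")


def is_webpage(url):
    """Guess from the path's extension whether a URL points to a web page."""
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    if "." in last_segment:
        ext = "." + last_segment.rsplit(".", 1)[-1].lower()
    else:
        ext = ""
    return ext in WEBPAGE_EXTENSIONS


class MiniCrawler:
    """Hypothetical stand-in for a crawler with bindable events."""

    def __init__(self, allowed_host):
        self.allowed_host = allowed_host
        self.handlers = {}

    def bind_event(self, name, handler):
        # Register one handler per event name.
        self.handlers[name] = handler

    def default_rules(self, url):
        # Stand-in for the crawler's built-in rules: stay on one host.
        return urlparse(url).netloc == self.allowed_host

    def should_include(self, url):
        handler = self.handlers.get("includelinks")
        verdict = handler(url) if handler else None
        if verdict is None:
            # The handler abstained, so the built-in rules decide.
            return self.default_rules(url)
        # An explicit True or False overrides the built-in rules.
        return bool(verdict)


def html_only(url):
    """Reject non-web-page links; abstain (None) on everything else."""
    return None if is_webpage(url) else False


crawler = MiniCrawler("example.com")
crawler.bind_event("includelinks", html_only)
```

Under these assumptions, a handler that returned `True` for every web page would bypass `default_rules`, so the host restriction would never apply. Returning `None` keeps that restriction in force, which is the concern raised in the original comment.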
