📄 htmlcrawler.py

#!/usr/bin/env python

"""
htmlcrawler.py - Demonstrating custom crawler writing by
subscribing to events. This is a crawler which fetches
only web pages from the web.

Created by Anand B Pillai <abpillai at gmail dot com>

Copyright (C) 2008 Anand B Pillai
"""

import sys
import __init__
from harvestman.apps.spider import HarvestMan

class HtmlCrawler(HarvestMan):
    """ A crawler which fetches only HTML (webpage) pages """

    def include_this_link(self, event, *args, **kwargs):

        url = event.url
        if url.is_webpage():
            # Allow for further processing by rules...
            # otherwise we will end up crawling the entire
            # web, since no other rules will apply if we
            # return True here.
            return None
        else:
            return False

if __name__ == "__main__":
    spider=HtmlCrawler()
    spider.initialize()
    spider.bind_event('includelinks', spider.include_this_link)
    spider.main()
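
The comment in include_this_link relies on a three-way return value: False rejects a link outright, while None hands the decision back to the crawler's remaining rules instead of forcing acceptance. The sketch below illustrates that pattern in isolation. It does not use or reproduce HarvestMan's internals; the names is_webpage, same_domain_rule, should_crawl, the extension list, and the example.com domain are all hypothetical stand-ins chosen for illustration.

```python
# Illustrative sketch only: a standalone model of a tri-state link filter.
# None of these helpers are part of HarvestMan; they are assumptions.
from urllib.parse import urlparse

# Assumed set of extensions treated as "web pages" ("" covers directory URLs).
WEBPAGE_EXTENSIONS = ("", ".html", ".htm", ".php", ".asp")


def is_webpage(url):
    """Guess from the path extension whether a URL points to an HTML page."""
    last = urlparse(url).path.rsplit("/", 1)[-1]
    ext = "." + last.rsplit(".", 1)[-1].lower() if "." in last else ""
    return ext in WEBPAGE_EXTENSIONS


def include_this_link(url):
    """Tri-state callback: False rejects, None defers to the other rules."""
    return None if is_webpage(url) else False


def same_domain_rule(url, domain="example.com"):
    """Hypothetical default rule: stay on a single domain."""
    return urlparse(url).netloc == domain


def should_crawl(url, callback=include_this_link, rule=same_domain_rule):
    """Apply the callback first; fall back to the rule when it returns None."""
    verdict = callback(url)
    if verdict is None:
        return rule(url)
    return bool(verdict)


if __name__ == "__main__":
    for link in ("http://example.com/index.html",
                 "http://example.com/logo.png",
                 "http://other.org/index.html"):
        print(link, should_crawl(link))
```

Under these assumptions, returning True from the callback for every web page would skip same_domain_rule entirely, which is the "crawling the entire web" outcome the original comment warns about.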
