📄 htmlcrawler.py
#!/usr/bin/env python
"""
htmlcrawler.py - Demonstrating custom crawler writing by
subscribing to events. This is a crawler which fetches
only web pages from the web.

Created by Anand B Pillai <abpillai at gmail dot com>

Copyright (C) 2008 Anand B Pillai
"""

import sys
import __init__
from harvestman.apps.spider import HarvestMan


class HtmlCrawler(HarvestMan):
    """ A crawler which fetches only HTML (webpage) pages """

    def include_this_link(self, event, *args, **kwargs):
        url = event.url
        if url.is_webpage():
            # Allow for further processing by rules...
            # otherwise we will end up crawling the entire
            # web, since no other rules will apply if we
            # return True here.
            return None
        else:
            return False


if __name__ == "__main__":
    spider = HtmlCrawler()
    spider.initialize()
    spider.bind_event('includelinks', spider.include_this_link)
    spider.main()