lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>designated areas of a Web site. One of the standard's most important uses is to keep robots away from the temporary HTML documents of query-driven sites, because such documents may exist today and be gone tomorrow. Many Web sites, query sites in particular, generate HTML documents on the fly to fulfil a user's request, return them, and then delete the documents that are no longer needed. Clearly, including such pages in a Web index benefits no one. Another use of the standard is to keep robots out of pages that are still under construction, or to declare those pages off limits.<o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:2.0gd;text-align:left;
text-indent:21.0pt;mso-char-indent-count:2.0;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span lang=EN-US style='mso-bidi-font-size:
10.5pt;mso-font-kerning:0pt'>Using the standard to restrict the areas a robot may visit is simple: create a file on the Web server that is accessible over HTTP and make it available at the local URL /robots.txt. The file consists of a set of directive lines that state explicitly which areas robots must not access.<o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>2.3 Introduction to the BOT Development Kit<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:17.95pt;mso-char-indent-count:1.71;
mso-char-indent-size:10.45pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The BOT package is a development kit for accessing the network over HTTP, built on sockets and written by Jeff Heaton, an IEEE member in the United States. It provides a large set of class libraries for programming Spiders, Bots and Aggregators. The package can be obtained from <a
href="http://www.jeffheaton.com/java/bot/updates.shtml">http://www.jeffheaton.com/java/bot/updates.shtml</a>. The latest version the author currently provides is 1.4, which is also the version used to develop this program.<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:10.5pt;mso-char-indent-count:1.0;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'><span
style="mso-spacerun: yes"> </span><span style="mso-spacerun:
yes"> </span>This paper mainly uses the Log class, the BotExclusion class, the Spider class, the ISpiderReportable interface, the SpiderInternalWorkload class, the SpiderWorker class and the SpiderDone class.<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:10.5pt;mso-char-indent-count:1.0;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'><span
style="mso-spacerun: yes"> </span><span style="mso-spacerun:
yes"> </span>The Log class handles all logging for the BOT package. Because a crawler usually runs unattended, diagnosing problems often depends on its operation log. In the BOT package Log is a static class, and log messages fall into five categories.<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:10.5pt;mso-char-indent-count:1.0;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'><span
style="mso-spacerun: yes"> </span><span style="mso-spacerun:
yes"> </span>The BotExclusion class loads and interprets robots.txt. The load method does the loading, and the isExcluded method determines whether a given URL is off limits.<o:p></o:p></span></p>
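<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>To make the load-then-check idea concrete, the following is a minimal sketch of how robots.txt rules might be parsed and queried. It is not the source of the BOT package's BotExclusion class: the class name RobotsRules is hypothetical, its two methods merely borrow the names load and isExcluded from the description above, and it only honours Disallow lines in groups addressed to "User-agent: *".<o:p></o:p></span></p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; NOT the BOT package's BotExclusion class.
// RobotsRules is a hypothetical name introduced for this example.
class RobotsRules {
    // Path prefixes that robots addressed by "User-agent: *" must not visit.
    private final List<String> disallowed = new ArrayList<>();

    // Parse the text of a robots.txt file and remember its Disallow prefixes.
    void load(String robotsTxt) {
        disallowed.clear();
        boolean applies = false;      // does the current group address "*"?
        boolean lastWasAgent = false; // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    applies = false; // a new group starts here
                }
                applies |= value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    // Return true if the URL path starts with any disallowed prefix.
    boolean isExcluded(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>A crawler built this way would call load once per host with the fetched file contents and then call isExcluded on each candidate path before requesting it. Simple prefix matching is a deliberate simplification; wildcard patterns and Allow lines are not handled.<o:p></o:p></span></p>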
<p class=MsoNormal style='text-indent:10.5pt;mso-char-indent-count:1.0;
mso-char-indent-size:10.5pt;tab-stops:45.0pt'><span lang=EN-US
style='mso-bidi-font-size:10.5pt'><span style="mso-spacerun: yes"> </span><span
style="mso-spacerun: yes"> </span>The Spider class is the core class for building a crawler. It contains many methods that serve as its interface, and the Spider is controlled through them.<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:23.1pt;mso-char-indent-count:2.2;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The ISpiderReportable interface: a Spider by itself only traverses pages, so any actual processing of a page must be done through an implementation of this interface.<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:23.1pt;mso-char-indent-count:2.2;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The SpiderInternalWorkload class is a built-in workload manager that keeps its workload in the computer's memory.<o:p></o:p></span></p>
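<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>As a rough illustration of what an in-memory workload manager involves, the sketch below keeps a queue of URLs waiting to be fetched and a set of URLs already seen, so that no page is queued twice. The class name InMemoryWorkload and all of its methods are assumptions made for this example and are not taken from the BOT package.<o:p></o:p></span></p>

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative sketch only; names are hypothetical, not the BOT package API.
class InMemoryWorkload {
    private final Queue<String> waiting = new ArrayDeque<>(); // URLs not yet fetched
    private final Set<String> seen = new HashSet<>();         // every URL ever added
    private final Set<String> completed = new HashSet<>();    // URLs already processed

    // Queue a URL unless it has been seen before; returns true if it was queued.
    synchronized boolean add(String url) {
        if (!seen.add(url)) {
            return false;
        }
        waiting.add(url);
        return true;
    }

    // Hand the next waiting URL to a worker, or null if nothing is waiting.
    synchronized String assign() {
        return waiting.poll();
    }

    // Record that a worker has finished with a URL.
    synchronized void complete(String url) {
        completed.add(url);
    }

    synchronized int waitingCount() {
        return waiting.size();
    }

    synchronized int completedCount() {
        return completed.size();
    }
}
```

<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>Everything lives in memory, so the state is lost when the program exits; that trade-off is acceptable for a crawl small enough to fit in RAM.<o:p></o:p></span></p>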
<p class=MsoNormal style='text-indent:23.1pt;mso-char-indent-count:2.2;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The SpiderWorker class performs the actual page downloads and is scheduled by the Spider object.<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:21.0pt;mso-char-indent-count:2.0;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The SpiderDone class keeps an exact count of the concurrently running threads and provides an efficient way to wait for that count to reach zero.<o:p></o:p></span></p>
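<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The behaviour described for SpiderDone, counting active threads and blocking until the count reaches zero, can be sketched with Java's built-in monitor methods. The class below is an illustration under that assumption, not the BOT package's implementation; the name ActiveWorkerCounter and its method names are hypothetical.<o:p></o:p></span></p>

```java
// Illustrative sketch only; NOT the BOT package's SpiderDone class.
class ActiveWorkerCounter {
    private int active = 0; // number of worker threads currently running

    // Call before a worker thread starts its task.
    synchronized void workerBegin() {
        active++;
    }

    // Call when a worker thread finishes; wakes waiters once none remain.
    synchronized void workerEnd() {
        if (active > 0) {
            active--;
        }
        if (active == 0) {
            notifyAll();
        }
    }

    // Block the caller until no worker is running.
    synchronized void waitDone() throws InterruptedException {
        while (active > 0) { // loop guards against spurious wakeups
            wait();
        }
    }

    synchronized int activeCount() {
        return active;
    }
}
```

<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>Incrementing the count before the thread is started, rather than inside it, avoids a race in which the waiting thread sees a count of zero before any worker has had a chance to register.<o:p></o:p></span></p>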
<p class=MsoNormal align=center style='text-align:center'><span
lang=EN-US style='font-size:14.0pt'>II. Program Design and Implementation<o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>2.1 Characteristics of an Intranet<o:p></o:p></span></p>
<p class=MsoNormal style='margin-left:17.95pt;text-indent:-17.95pt;mso-char-indent-count:
-1.71;mso-char-indent-size:10.45pt;tab-stops:45.0pt 54.0pt 63.0pt'><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><span style="mso-spacerun:
yes"> </span><span style="mso-spacerun:
yes"> </span>An intranet is built with Internet technology, but because of its particular characteristics it calls for a crawler design model different from the one used on the Internet, so that searching is more efficient, coverage is more complete, and certain intranet-specific requirements are met. Its characteristics are mainly the following:<o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:1.71gd;text-align:left;
text-indent:21.0pt;mso-char-indent-count:2.0;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span lang=EN-US style='mso-bidi-font-size:
10.5pt;mso-font-kerning:0pt'>First, almost every computer has a relatively fixed IP address; although some use a DHCP server to assign IP addresses, </span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";