color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='text-align:left;mso-layout-grid-align:
none;text-autospace:none'><span style='mso-bidi-font-size:10.5pt;color:black;
mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>2.3.3</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'> Complying with the Law</span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:1.71gd;text-align:left;
text-indent:23.1pt;mso-char-indent-count:2.2;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";color:black;
mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>A law-abiding, well-behaved crawler should support the Robots Exclusion Standard. Doing so not only shows respect for the </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>Web</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> site but, in some cases, also benefits the crawler itself. Therefore, the first time a crawler visits a </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>Web</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> site, it should begin by loading the </span><span style='mso-bidi-font-size:10.5pt;
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>robots.txt</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'> file to learn which areas are off limits to robots. When the robot later finds hyperlinks into those areas, it should skip them immediately and never attempt unauthorized access, so that it remains a law-abiding visitor.</span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'><o:p></o:p></span></p>
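<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The robots.txt check described above can be sketched roughly as follows. This is a minimal illustration, not the BotExclusion class the program actually uses: the class name RobotsRules and its methods are hypothetical, only groups addressed to every robot (User-agent: *) are honored, and Disallow values are treated as plain path prefixes.</span></p>

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from the original program): Disallow prefixes for "User-agent: *". */
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Parses the text of a robots.txt file, keeping only rules addressed to all robots. */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean applies = false;      // does the current group apply to every robot?
        boolean inAgentLines = false; // are we inside a run of consecutive User-agent lines?
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');           // strip trailing comments
            if (hash >= 0) line = line.substring(0, hash).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;                // blank or malformed line
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("user-agent")) {
                if (!inAgentLines) applies = false; // a new group starts here
                applies |= value.equals("*");
                inAgentLines = true;
            } else {
                inAgentLines = false;
                if (field.equalsIgnoreCase("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false when the path falls under any recorded Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>A crawler would call isAllowed on the path of each discovered link and drop the link when the result is false.</span></p>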
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>2.4</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> Algorithm Description</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>2.4.1</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> Main Data Structures</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='margin-left:1.71gd;text-indent:23.1pt;mso-char-indent-count:
2.2;mso-char-indent-size:10.5pt'><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:
0pt;mso-ansi-language:ZH-CN'>The scan queue stores the list of all in-network </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>IP</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> addresses that the user has specified for this crawl. The active queue stores the hosts that are up and providing </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>Web</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> service, again as a list of </span><span style='mso-bidi-font-size:10.5pt;
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>IP</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'> addresses. The waiting queue holds each </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>URL</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> that is waiting to be processed by the crawler; every newly discovered </span><span style='mso-bidi-font-size:
10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>URL</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'> is also added to this queue. The running queue holds a </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>URL</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> while the crawler is processing it. The error queue is where a </span><span style='mso-bidi-font-size:
10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>URL</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'> is placed when parsing its page fails; a </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>URL</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> in this queue cannot be moved to any other queue. The complete queue is where a </span><span style='mso-bidi-font-size:
10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>URL</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'> is sent once its page has been parsed successfully; a </span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>URL</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'> in this queue likewise cannot be moved to any other queue.</span><span lang=EN-US style='mso-bidi-font-size:
10.5pt'><o:p></o:p></span></p>
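<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>The four URL queues and the rule that the error and complete queues are terminal might be modeled as in the sketch below. The class name CrawlWorkload and its methods are hypothetical and are not the SpiderInternalWorkload class used by the program; the sketch only assumes that a URL known to any queue is never queued again.</span></p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory workload: waiting -> running -> (complete | error). */
class CrawlWorkload {
    private final Deque<String> waiting = new ArrayDeque<>();
    private final Set<String> running = new HashSet<>();
    private final Set<String> error = new HashSet<>();
    private final Set<String> complete = new HashSet<>();

    /** Adds a newly discovered URL to the waiting queue unless some queue already knows it. */
    synchronized boolean add(String url) {
        if (waiting.contains(url) || running.contains(url)
                || error.contains(url) || complete.contains(url)) {
            return false;
        }
        waiting.addLast(url);
        return true;
    }

    /** Moves the next URL from waiting to running; returns null when nothing is waiting. */
    synchronized String next() {
        String url = waiting.pollFirst();
        if (url != null) running.add(url);
        return url;
    }

    /** Moves a running URL to complete or error; both are terminal states. */
    synchronized void finish(String url, boolean parsedOk) {
        if (!running.remove(url)) {
            throw new IllegalStateException("not in the running queue: " + url);
        }
        (parsedOk ? complete : error).add(url);
    }

    /** True once no URL is waiting or being processed. */
    synchronized boolean isDone() {
        return waiting.isEmpty() && running.isEmpty();
    }
}
```

<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>Because add refuses any URL already present in the error or complete set, a URL that reaches either of those queues can never re-enter the waiting queue.</span></p>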
<p class=MsoNormal><span style='mso-bidi-font-size:10.5pt;color:black;
mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>2.4.2</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'> Overall Approach</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='margin-left:1.71gd;text-indent:23.1pt;mso-char-indent-count:
2.2;mso-char-indent-size:10.5pt'><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman"'>Starting from a user-supplied list of </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>IP</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> addresses, the program scans them to obtain the list of </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>IP</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> addresses that are currently up. From this active </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>IP</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> list it takes one </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>IP</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> address, places it in the running queue, and visits the home page of that </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>Web</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> site. It parses that page to extract all internal links and adds them to the waiting queue, then moves one </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>URL</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> from the waiting queue into the running queue for parsing. If an error occurs during parsing, that </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>URL</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> is placed in the error queue; otherwise it is placed in the complete queue. These steps repeat until the active </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>IP</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> list is empty.</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
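<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>For a single site, the loop described above could look roughly like the following single-threaded sketch. The names CrawlLoop, crawlSite and fetchLinks are hypothetical; fetchLinks stands in for downloading and parsing a page, and is assumed to return the page's internal links or to throw a runtime exception when parsing fails. The outer loop over the active IP list would simply call crawlSite once per site.</span></p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of the per-site crawl loop; not the program's actual mySpider class. */
class CrawlLoop {
    /**
     * Crawls one site starting from its home page.
     * Returns each visited URL mapped to true (complete queue) or false (error queue).
     */
    static Map<String, Boolean> crawlSite(String homePage,
                                          Function<String, List<String>> fetchLinks) {
        Deque<String> waiting = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        Map<String, Boolean> finished = new LinkedHashMap<>();
        waiting.add(homePage);
        seen.add(homePage);
        while (!waiting.isEmpty()) {
            String url = waiting.poll();           // the URL is now "running"
            try {
                for (String link : fetchLinks.apply(url)) {
                    if (seen.add(link)) {          // queue each internal link only once
                        waiting.add(link);
                    }
                }
                finished.put(url, true);           // parsed successfully: complete queue
            } catch (RuntimeException e) {
                finished.put(url, false);          // parsing failed: error queue
            }
        }
        return finished;
    }
}
```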
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>2.5</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> Implementation</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='margin-left:15.75pt;text-indent:-15.75pt;mso-char-indent-count:
-1.5;mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:
10.5pt'><span style="mso-spacerun: yes"> </span><span
style="mso-spacerun: yes"> </span></span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>The program provides a </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>config.txt</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> file that holds the list of </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>IP</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> addresses the user wants scanned. At run time the program loads this list and starts a thread group named </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>searchThreadGroup</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> to fill the active queue. At the same time it starts the </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>ScanRun</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> thread, which takes one </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>IP</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> address from the active queue, sets the path where the corresponding scan log is stored, and starts the </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>scanThreadGroup</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> thread group to scan all </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>Web</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> resources on that site. The code that actually does the work is the </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>mySpider</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> class. This class is the core of the program, so it is described in detail below.</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'><span
style="mso-spacerun: yes"> </span><span
style="mso-spacerun: yes"> </span></span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>This class implements two interfaces, </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>ISpiderReportable</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> and </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>Runnable</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>, and is responsible for interacting with the </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>Spider</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'> class.</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:36.75pt;mso-char-indent-count:3.5;
mso-char-indent-size:10.5pt'><span style='mso-bidi-font-size:10.5pt;font-family:
宋体;mso-ascii-font-family:"Times New Roman"'>The core code of the constructor is:</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:17.95pt;mso-char-indent-count:1.71;
mso-char-indent-size:10.45pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>wl
= new SpiderInternalWorkload();//</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman"'> obtain an in-memory workload (job queue) manager</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:15.75pt;mso-char-indent-count:1.5;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>url
= u;//</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:
"Times New Roman"'> initialize the starting </span><span lang=EN-US style='mso-bidi-font-size:
10.5pt'>URL<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:15.75pt;mso-char-indent-count:1.5;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>_exclude
= new BotExclusion();<o:p></o:p></span></p>
<p class=MsoNormal style='text-indent:15.75pt;mso-char-indent-count:1.5;
mso-char-indent-size:10.5pt'><span lang=EN-US style='mso-bidi-font-size:10.5pt'>_exclude.load(new
HTTPSocket(),url);//</span><span style='mso-bidi-font-size:10.5pt;font-family:
宋体;mso-ascii-font-family:"Times New Roman"'> load the robot exclusion file</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='margin-left:1.5gd;text-indent:20.15pt;mso-char-indent-count:
1.92;mso-char-indent-size:10.45pt'><span lang=EN-US style='mso-bidi-font-size:
10.5pt'>The foundInternalLink</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman"'> function mainly handles the case where </span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>Spid