📄 crawler.html
字号:
mso-font-kerning:0pt'>但一旦第一次分配也就相对固定了。</span><span lang=EN-US style='mso-bidi-font-size:
10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='text-align:left;text-indent:42.0pt;
mso-char-indent-count:4.0;mso-char-indent-size:10.5pt;mso-layout-grid-align:
none;text-autospace:none'><span style='mso-bidi-font-size:10.5pt;font-family:
宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>其次,</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>IP</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>地址数量较少、范围已知,而且其中主机主要是</span><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Web </span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>和</span><span lang=EN-US style='mso-bidi-font-size:10.5pt;
mso-font-kerning:0pt'>FTP </span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>服务器。</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:2.0gd;text-align:left;
text-indent:21.0pt;mso-char-indent-count:2.0;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:
0pt'>第三,并不是每一台主机都有域名。</span><span lang=EN-US style='mso-bidi-font-size:10.5pt;
mso-font-kerning:0pt'>Intranet </span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>内部可能会设置一个或多个</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>DNS ,</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>在一般情况下是不会给所有的主机设置域名的。</span><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:2.0gd;text-align:left;
text-indent:21.0pt;mso-char-indent-count:2.0;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:
0pt'>第四,主机不一定连续运行。很多计算机在不工作时会关机</span><span lang=EN-US style='mso-bidi-font-size:
10.5pt;mso-font-kerning:0pt'>,</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>从而</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Web</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>服务器也就停止了工作。</span><span lang=EN-US style='mso-bidi-font-size:
10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:2.0gd;text-align:left;
text-indent:21.0pt;mso-char-indent-count:2.0;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:
0pt'>第五,</span><span lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:
0pt'>Web</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>服务器之间超链接较少。很多部门级</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Web</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>服务器不与其他服务器之间建立超链接。</span><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>2.2</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>程序的设计目标</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'><span
style="mso-spacerun: yes"> </span><span style="mso-spacerun:
yes"> </span></span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>首先,我们要保证搜索的完整性,即所有提供</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>Web</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>服务的主机都要搜索到。</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'><span
style="mso-spacerun: yes"> </span><span style="mso-spacerun:
yes"> </span></span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>其次,在保证搜索完整性基础上尽可能提高搜索的效率。</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal style='margin-left:17.95pt;text-indent:-17.95pt;mso-char-indent-count:
-1.71;mso-char-indent-size:10.45pt'><span lang=EN-US style='mso-bidi-font-size:
10.5pt'><span style="mso-spacerun: yes"> </span><span
style="mso-spacerun: yes"> </span></span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>第三,应充分尊重各</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'>Web</span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>站点管理员的意愿,对管理员禁止爬虫程序访问的资源应略过,做一个守法的爬虫程序。</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal><span lang=EN-US style='mso-bidi-font-size:10.5pt'>2.3</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman"'>关键技术及难点分析</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:1.71gd;text-align:left;
text-indent:21.0pt;mso-char-indent-count:2.0;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:
0pt'>由于爬虫程序必须自动地从一个网页移动到另一个网页</span><span lang=EN-US style='mso-bidi-font-size:
10.5pt;mso-font-kerning:0pt'>,</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>从而达到在网上不停地自动搜索目的。因此,程序首先必须能够解析</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Web</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>文档获得超链接</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-ansi-language:ZH-CN'>,然后通过递归和非递归两种结构来实现</span><span
style='mso-bidi-font-size:10.5pt;mso-ansi-language:ZH-CN'>Spider</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-ansi-language:ZH-CN'>程序。</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>同时,由于各</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Web</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>文档中极可能存在指向同一个资源的超链接,所以保证程序不陷入在一组相同的</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>URL</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>之间循环往复也是程序需要解决的难点之一。</span><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='text-align:left;mso-layout-grid-align:
none;text-autospace:none'><span lang=EN-US style='mso-bidi-font-size:10.5pt;
mso-font-kerning:0pt'>2.3.1</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>如何解析</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Html</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>文档</span><span lang=EN-US style='mso-bidi-font-size:10.5pt;
mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:1.71gd;text-align:left;
text-indent:23.1pt;mso-char-indent-count:2.2;mso-char-indent-size:10.5pt;
mso-layout-grid-align:none;text-autospace:none'><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Html</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>文档主要包含五类数据:除标签和脚本程序以外的文本数据;程序员所留下的,对用户不可见的注释性数据;由单个表示的简单标签,比如</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><br></span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>;用于控制内容显示所包含的起始标签和结束标签。</span><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:-.85gd;text-align:left;
text-indent:-27.0pt;mso-char-indent-count:-2.57;mso-char-indent-size:10.5pt;
tab-stops:45.0pt;mso-layout-grid-align:none;text-autospace:none'><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><span
style="mso-spacerun: yes"> </span><span style="mso-spacerun:
yes"> </span></span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:
0pt'>爬虫程序主要关心用于</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";mso-ansi-language:ZH-CN'>迁移到新页面</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>链接标签。这类标签通常分为三类,一类是指向同一台服务器的内部链接,另一类是指向不同服务器的外部连接,以及一写类似邮件地址的其他链接。其中,在内部链接里又存在着绝对和相对</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>URL</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>,绝对</span><span lang=EN-US style='mso-bidi-font-size:
10.5pt;mso-font-kerning:0pt'>URL</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>是一个准确的、无歧义的</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>Internet</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>资源的位置,它包含主机名和文件名,相对</span><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'>URL</span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
mso-font-kerning:0pt'>仅是一个文件名,没有主机名,因此解析时需要添加相应的主机地址。</span><span lang=EN-US
style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='text-align:left;mso-layout-grid-align:
none;text-autospace:none'><span lang=EN-US style='mso-bidi-font-size:10.5pt;
mso-font-kerning:0pt'>2.3.2</span><span style='mso-bidi-font-size:10.5pt;
font-family:宋体;mso-ascii-font-family:"Times New Roman";mso-font-kerning:0pt'>爬虫程序结构选取</span><span
lang=EN-US style='mso-bidi-font-size:10.5pt;mso-font-kerning:0pt'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:.71gd;text-align:left;
text-indent:-10.5pt;mso-char-indent-count:-1.0;mso-char-indent-size:10.5pt;
tab-stops:45.0pt;mso-layout-grid-align:none;text-autospace:none'><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'><span style="mso-spacerun: yes"> </span></span><span
style='mso-bidi-font-size:10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";
color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>递归是在一个方法中调用它本身的程序设计技术,爬虫程序在访问相当少的网页时可以考虑递归。因为递归程序运行时要把每次递归压入堆栈,所以递归程序在运行多次时就可能把堆栈内存耗尽并终止程序运行。此外,递归与多线程技术不能兼容,因为在这一过程中每一个线程都有一个自己的堆栈。因此,递归结构不适合本程序。</span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'><o:p></o:p></span></p>
<p class=MsoNormal align=left style='margin-left:17.95pt;text-align:left;
text-indent:-17.95pt;mso-char-indent-count:-1.71;mso-char-indent-size:10.45pt;
mso-layout-grid-align:none;text-autospace:none'><span style='mso-bidi-font-size:
10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:ZH-CN'><span
style="mso-spacerun: yes"> </span><span
style="mso-spacerun: yes"> </span></span><span style='mso-bidi-font-size:
10.5pt;font-family:宋体;mso-ascii-font-family:"Times New Roman";color:black;
mso-font-kerning:0pt;mso-ansi-language:ZH-CN'>构造爬虫程序的另一种结构的途径就是非递归。当爬虫程序发现每个新网页时,它将不会调用自身的方法,而是放入一个队列当中。当解析当前的网页后就查看队列是否为空,空则返回,否则取一个元素继续解析。为了保证程序不陷入死循环,解析过的</span><span
style='mso-bidi-font-size:10.5pt;color:black;mso-font-kerning:0pt;mso-ansi-language:
ZH-CN'>URL</span><span style='mso-bidi-font-size:10.5pt;font-family:宋体;
mso-ascii-font-family:"Times New Roman";color:black;mso-font-kerning:0pt;
mso-ansi-language:ZH-CN'>应该保存。</span><span style='mso-bidi-font-size:10.5pt;
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -