<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0061)http://blog.csdn.net/oscar999/archive/2007/05/17/1613325.aspx -->
<HTML xmlns="http://www.w3.org/1999/xhtml"><HEAD><TITLE>Web Crawler: Fetching Web Pages with HttpClient + Jericho HTML Parser - oscar999's Column - CSDNBlog</TITLE>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
<META content=oscar999,commons,httpclient,getmethod,jericho, name=keywords>
<META content="Fetching web pages with HttpClient + Jericho HTML Parser" name=description>
<SCRIPT
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/tabber.js"
type=text/javascript></SCRIPT>
<SCRIPT
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/feedBackToolTips.js"
type=text/javascript></SCRIPT>
<SCRIPT language=javascript
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/JSUtils.js"
type=text/javascript></SCRIPT>
<LINK media=screen
href="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/tabber.css"
type=text/css rel=stylesheet><LINK
href="http://profile.csdn.net/oscar999/picture/1.ico" rel="Shortcut Icon"><LINK
media=all
href="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/style.css"
type=text/css rel=stylesheet><LINK media=print
href="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/print.htm"
type=text/css rel=stylesheet><LINK title=RSS
href="http://blog.csdn.net/oscar999/rss.aspx" type=application/rss+xml
rel=alternate>
<META content="MSHTML 6.00.2900.2963" name=GENERATOR></HEAD>
<BODY>
<FORM language=javascript id=Form1 name=Form1
onsubmit="javascript:return WebForm_OnSubmit();" action=1613325.aspx
method=post><INPUT id=__EVENTTARGET type=hidden name=__EVENTTARGET> <INPUT
id=__EVENTARGUMENT type=hidden name=__EVENTARGUMENT> <INPUT
id=" __VIEWSTATE" type=hidden name=__VIEWSTATE>
<SCRIPT type=text/javascript>
<!--
var theForm = document.forms['Form1'];
if (!theForm) {
theForm = document.Form1;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
// -->
</SCRIPT>
<SCRIPT
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/WebResource.axd"
type=text/javascript></SCRIPT>
<SCRIPT type=text/javascript>
//<![CDATA[
var Anthem_FormID = "Form1";
//]]>
</SCRIPT>
<SCRIPT
src="C:\Documents and Settings\nya\桌面\JMF\网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files\WebResource(1).axd"
type=text/javascript></SCRIPT>
<SCRIPT
src="C:\Documents and Settings\nya\桌面\JMF\网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files\WebResource(2).axd"
type=text/javascript></SCRIPT>
<SCRIPT type=text/javascript>
<!--
function WebForm_OnSubmit() {
if (typeof(ValidatorOnSubmit) == "function" && ValidatorOnSubmit() == false) return false;
return true;
}
// -->
</SCRIPT>
<!--done-->
<DIV id=main>
<DIV id=banner>
<DIV id=bnr_pic><!--done-->
<DIV class=header>
<DIV class=headerText><A class=headermaintitle id=Header1_HeaderTitle
href="http://blog.csdn.net/oscar999/">oscar999's Column</A> </DIV>
<DIV class=headerDis><SPAN id=TopicAuthor
style="DISPLAY: none">oscar999</SPAN></DIV></DIV></DIV>
<DIV id=mylinks>
<DIV id=mystats><!--done-->
<DIV class=blogStats>Original - 44, Translated - 0, Reposted - 16, Views - 15124, Comments - 6, Trackbacks - 0
</DIV></DIV><!--done--><A class=mainmenu id=MyLinks1_csdnhome
href="http://www.csdn.net/">CSDN Home</A> <A class=mainmenu
id=MyLinks1_csdndev href="http://dev.csdn.net/">CSDN Tech Center</A>
<A class=mainmenu id=MyLinks1_HomeLink title="Go to the aggregation site"
href="http://blog.csdn.net/">Blog Home</A> <A class=mainmenu
id=MyLinks1_PersonalHome title="Visit oscar999's Column"
href="http://blog.csdn.net/oscar999/">My Home</A> <A
class=mainmenu id=MyLinks1_MyArticles title="View all articles in oscar999's Column"
href="http://blog.csdn.net/oscar999/MyArticles.aspx"
target=_blank>My Articles</A> <A class=mainmenu id=MyLinks1_MySpace
title="View oscar999's personal space" href="http://hi.csdn.net/oscar999/profile"
target=_blank><FONT color=red>My Space</FONT></A> <A
class=mainmenu id=MyLinks1_ContactLink
href="http://blog.csdn.net/oscar999/contact.aspx">Contact the Author</A> <A
class=mainmenu id=MyLinks1_HyperLink1
href="http://search.csdn.net/search_blog.asp"
target=_blank>Search</A> <A class=mainmenu id=MyLinks1_Admin
href="http://writeblog.csdn.net/">Write an Article</A> </DIV></DIV>
<DIV id=wrap>
<DIV id=left><!-- left starts -->
<DIV id=left_content>
<DIV id=topics><SPAN class=PreAndNext id=viewpost.ascx_PreviousAndNextEntriesUp>
<DIV align=center><A
href="http://blog.csdn.net/oscar999/archive/2007/06/08/1643516.aspx">Previous: Java Report Development with JasperReport + iReport</A> | <A
href="http://blog.csdn.net/oscar999/archive/2006/12/11/1438694.aspx">Next: Java
Media Framework Basic Tutorial</A></DIV></SPAN><BR>
<SCRIPT>function StorePage(){d=document;t=d.selection?(d.selection.type!='None'?d.selection.createRange().text:''):(d.getSelection?d.getSelection():'');void(keyit=window.open('http://www.365key.com/storeit.aspx?t='+escape(d.title)+'&u='+escape(d.location.href)+'&c='+escape(t),'keyit','scrollbars=no,width=475,height=575,left=75,top=20,status=no,resizable=yes'));keyit.focus();}</SCRIPT>
<DIV class=post>
<DIV class=postTitle>
<SCRIPT
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/vote.js"></SCRIPT>
<A href="http://blog.csdn.net/oscar999/archive/2007/05/17/1613325.aspx"><IMG
height=13
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/authorship.gif"
width=15 border=0> Web Crawler: Fetching Web Pages with HttpClient + Jericho HTML
Parser</A>
<SCRIPT
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/count.htm"></SCRIPT>
</DIV>
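<DIV class=postText><SPAN class=style7> Jericho HTML Parser is a simple yet powerful Java
library for parsing HTML. It can analyze and manipulate parts of an HTML document, including some common server-side tags, while reproducing any unrecognized or invalid HTML. It also provides a useful HTML form analyzer.<BR>
Download: http://sourceforge.net/project/showfiles.php?group_id=101067<BR><BR></SPAN><SPAN
class=style7>
HttpClient is the HTTP client component used to communicate with the server; jdom is used alongside it to parse XML data.<BR></SPAN>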
<UL>
  <LI>HttpClient can be downloaded from <A
  href="http://jakarta.apache.org/commons/httpclient/downloads.html">http://jakarta.apache.org/commons/httpclient/downloads.html</A>.
  <LI>HttpClient depends on logging, a subproject of Apache Jakarta Commons. You can download Commons Logging from <A
  href="http://jakarta.apache.org/site/downloads/downloads_commons-logging.cgi">http://jakarta.apache.org/site/downloads/downloads_commons-logging.cgi</A>;
  extract commons-logging.jar from the downloaded archive and add it to the CLASSPATH.
  <LI>HttpClient depends on codec, a subproject of Apache Jakarta Commons. You can download the latest Commons Codec from <A
  href="http://jakarta.apache.org/site/downloads/downloads_commons-codec.cgi">http://jakarta.apache.org/site/downloads/downloads_commons-codec.cgi</A>;
  extract commons-codec-1.x.jar from the downloaded archive and add it to the
  CLASSPATH.</LI></UL><BR><SPAN class=style7>When fetching web page content, </SPAN><A name=N10095><SPAN
class=smalltitle>the GET method is what you will mainly use.</SPAN></A>
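<P>Using HttpClient takes the following 6 steps:</P>
<P>1. Create an instance of HttpClient.</P>
<P>2. Create an instance of a connection method, here GetMethod. Pass the address to connect to into the GetMethod constructor.</P>
<P>3. Call the executeMethod method of the instance created in step 1 to execute the method instance created in step 2.</P>
<P>4. Read the response.</P>
<P>5. Release the connection. The connection must be released whether or not the method executed successfully.</P>
<P>6. Process the content that was retrieved.</P>Create a project named snatch in
Eclipse.<BR>Import the four jar files downloaded above into the project's build path.<BR>The environment is now set up.<BR><BR>First, a look at how to use HttpClient.<BR>Create a test package in the project directory, and in that package create the Httpclient
Test class.<BR><BR>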
<DIV
style="BORDER-RIGHT: windowtext 0.5pt solid; PADDING-RIGHT: 5.4pt; BORDER-TOP: windowtext 0.5pt solid; PADDING-LEFT: 5.4pt; BACKGROUND: rgb(230,230,230) 0% 50%; PADDING-BOTTOM: 4px; BORDER-LEFT: windowtext 0.5pt solid; WIDTH: 95%; PADDING-TOP: 4px; BORDER-BOTTOM: windowtext 0.5pt solid; moz-background-clip: -moz-initial; moz-background-origin: -moz-initial; moz-background-inline-policy: -moz-initial">
<DIV><IMG alt=""
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/None.gif"
align=top><SPAN style="COLOR: rgb(0,0,255)">package</SPAN><SPAN
style="COLOR: rgb(0,0,0)"> test;<BR><IMG alt=""
src="网页爬虫,HttpClient+Jericho HTML Parser 实现网页的抓取 - oscar999的专栏 - CSDNBlog.files/None.gif"