<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>Heritrix - Home Page</title><style type="text/css" media="all">          @import url("./style/maven-base.css");          			    @import url("./style/maven-theme.css");@import url("./style/project.css");</style><link rel="stylesheet" href="./style/print.css" type="text/css" media="print"></link><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta><meta name="author" content="St.Ack"></meta><meta name="email" content="stack at archive dot org"></meta></head><body class="composite"><div id="banner"><a href="http://www.archive.org/" id="organizationLogo"><img alt="Internet Archive" src="http://www.archive.org/images/logo.jpg"></img></a><a href="http://crawler.archive.org" id="projectLogo"><img alt="Heritrix" src="./images/logo.gif"></img></a><div class="clear"><hr></hr></div></div><div id="breadcrumbs"><div class="xleft">                	Last published: 06 May 2007                  | Doc for 1.12.1</div><div class="xright"></div><div class="clear"><hr></hr></div></div><div id="leftColumn"><div id="navcolumn"><div id="menuOverview"><h5>Overview</h5><ul><li class="none"><a href="license.html">License</a></li><li class="none"><a href="requirements.html">System Requirements</a></li><li class="none"><a href="downloads.html">Downloads</a></li><li class="none"><a href="articles/user_manual/index.html">User Manual</a></li><li class="none"><a href="articles/developer_manual/index.html">Developer Manual</a></li><li class="none"><a href="apidocs/index.html">Javadocs</a></li><li class="none"><a href="faq.html">FAQ</a></li><li class="none"><a href="http://webteam.archive.org/confluence/display/Heritrix/Home" class="externalLink" title="External Link">Wiki</a></li><li class="none"><a href="http://sourceforge.net/tracker/?group_id=73833&amp;atid=539099" class="externalLink" title="External Link">Browse/Submit a 
Bug</a></li><li class="expanded"><a href="">Related Projects</a><ul><li class="none"><a href="http://archive-access.sourceforge.net/" class="externalLink" title="External Link">Archive Access</a></li><li class="none"><a href="http://crawler.sourceforge.net/hcc" class="externalLink" title="External Link">Heritrix Cluster Controller (hcc)</a></li><li class="none"><a href="http://crawler.sourceforge.net/cmdline-jmxclient" class="externalLink" title="External Link">cmdline-jmxclient</a></li><li class="none"><a href="http://deduplicator.sourceforge.net" class="externalLink" title="External Link">Deduplicator</a></li><li class="none"><a href="http://www.zvents.com/labs/heritrix_hadoop" class="externalLink" title="External Link">Hadoop DFS Writer Processor</a></li></ul></li></ul></div><div id="menuProject_Documentation"><h5>Project Documentation</h5><ul><li class="none"><strong><a href="index.html">About Heritrix</a></strong></li><li class="collapsed"><a href="project-info.html">Project Info</a></li><li class="collapsed"><a href="maven-reports.html">Project Reports</a></li><li class="none"><a href="http://maven.apache.org/development-process.html" class="externalLink" title="External Link">Development Process</a></li></ul></div><a href="http://maven.apache.org/" title="Built by Maven" id="poweredBy"><img alt="Built by Maven" src="./images/logos/maven-button-1.png"></img></a></div></div><div id="bodyColumn"><div class="contentBox"><div class="section"><a name="Introduction"></a><h2>Introduction</h2><p>Heritrix is the Internet Archive's open-source, extensible,         	web-scale, archival-quality web crawler project.</p><p><em>Heritrix</em> (sometimes spelled <em>heretrix</em>, or         misspelled or missaid as <em>heratrix</em>/<em>heritix</em>/        <em>heretix</em>/<em>heratix</em>) is an         archaic word for <em>heiress</em> (woman who inherits). 
Since our         crawler seeks to collect and <i>preserve</i> the digital        artifacts of our culture for the benefit of future researchers and        generations, this name seemed apt.</p></div><div class="section"><a name="Webmasters_"></a><h2>Webmasters!</h2><p>Heritrix is designed to respect the        <a href="http://www.robotstxt.org/wc/robots.html" class="externalLink" title="External Link">robots.txt</a>        exclusion directives and         <a href="http://www.robotstxt.org/wc/exclusion.html#meta" class="externalLink" title="External Link">META robots        tags</a>, and collect material at a measured, adaptive pace unlikely        to disrupt normal website activity.        </p><p>If you notice our crawler behaving poorly -- The Internet Archive        uses <b>archive.org_bot</b> as User Agent when crawling --        please send us email at:        </p><p><img src="/images/crawler-problem-report-email.gif" alt="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. net" title="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. net"></img>        </p></div><div class="section"><a name="Getting_Started"></a><h2>Getting Started</h2><p>See the <a href="articles/user_manual/index.html">User Manual</a>.</p></div><div class="section"><a name="News_and_Status"></a><h2>News and Status</h2><div class="subsection"><a name="Release_1_12_1_05_06_2007"></a><h3>Release 1.12.1 05/06/2007</h3><p>Release 1.12.1 is a bug fix release.      See the <a href="articles/releasenotes/1_12_1.html">Release Notes</a> and      <a href="http://webteam.archive.org/confluence/display/Heritrix/Issues+with+%27Fix+Version%27+1.12.1" class="externalLink" title="External Link">      list of fixed issues</a> for details.    
</p><p>Additional notes about 1.12.1 which may be updated with information     post-release are available     <a href="http://webteam.archive.org/confluence/display/Heritrix/Release+Notes+-+1.12.1" class="externalLink" title="External Link">    on the Heritrix wiki</a>.     </p></div><div class="subsection"><a name="Release_1_12_0_03_16_2007"></a><h3>Release 1.12.0 03/16/2007</h3><p>Release 1.12.0 is the first of several planned releases enhancing      Heritrix with "smart crawler" functionality. In this release, the theme      has been offering new options to reduce the amount of duplicate content      crawled and stored when recrawling sites at regular intervals. A number      of other enhancements and bug fixes are also included.      See the <a href="articles/releasenotes/1_12_0.html">Release Notes</a> for      details.    </p></div><div class="subsection"><a name="HDFS_Writer_Processor_01_25_2007"></a><h3>HDFS Writer Processor 01/25/2007</h3><p>      The HDFS Writer Processor extension enables storing crawled content      directly into HDFS, the       <a href="http://lucene.apache.org/hadoop" class="externalLink" title="External Link">Hadoop</a> Distributed      FileSystem.  For details, see the      <a href="http://www.zvents.com/hdfs/README.txt" class="externalLink" title="External Link">README.txt</a>.      To download, see <a href="http://www.zvents.com/labs/hdfs_writer_processor" class="externalLink" title="External Link">HDFS Writer      Processor</a>      </p></div><div class="subsection"><a name="Release_1_10_2_01_15_2007"></a><h3>Release 1.10.2 01/15/2007</h3><p>This is primarily a bug-fix release, with a couple of new      features, provided before a number of significant changes to       the Heritrix project that will require developer and crawl       operator adjustments. Post-1.10.2, Heritrix source code control, issue       tracking, and build process will migrate to new systems. 
Also, updates       to core classes, especially with regard to the settings architecture,       will noticeably break backward compatibility with 1.10.2 and prior       crawler settings files and formats.       See <a href="articles/releasenotes/1_10_2.html">Release Notes</a> for      details.    </p></div><div class="subsection"><a name="Release_1_10_1_09_27_2006"></a><h3>Release 1.10.1 09/27/2006</h3><p>Bug fix release. See     <a href="articles/releasenotes/1_10_1.html">Release Notes</a> for    details.    </p></div><div class="subsection"><a name="Deduplicator__add-on_for_Heritrix__0_2_0_release_________09_14_2006"></a><h3>Deduplicator (add-on for Heritrix) 0.2.0 release         09/14/2006</h3><p>      The Deduplicator is an add-on module for Heritrix that allows sequential      snapshot crawls to leverage information about previous iterations to avoid      storing (or even downloading) duplicate data.  See the       <a href="http://tech.groups.yahoo.com/group/archive-crawler/message/3282" class="externalLink" title="External Link">      mailing list announcement</a> for details.      </p></div><div class="subsection"><a name="Release_1_10_0_09_11_2006"></a><h3>Release 1.10.0 09/11/2006</h3><p>Release 1.10.0 adds new configuration options, experimental       new protocol and format support, and lots of fixes (43 tracked bugs       have been fixed and 35 feature requests added). Requires JDK 1.5.x.      See <a href="articles/releasenotes/1_10_0.html">Release Notes</a> for      details.      </p></div><div class="subsection"><a name="Release_1_8_0_05_05_2006"></a><h3>Release 1.8.0 05/05/2006</h3><p>Release 1.8.0 offers a number of improvements, including 13 requested       enhancements and fixes for 18 reported bugs. 
See      <a href="articles/releasenotes/1_8_0.html">Heritrix Release Notes</a>       for detail and      <a href="articles/releasenotes/1_8_0.html#1_8_0_limitations">Known      Limitations</a>.</p></div><div class="subsection"><a name="Release_1_6_0_12_01_2005"></a><h3>Release 1.6.0 12/01/2005</h3><p>Release 1.6.0 offers improved remote control and monitoring via      JMX, a crawl-checkpointing facility, and experimental support for bloom      filter already-included testing, partitioning a crawl across multiple       independent crawlers, and per-host/domain/queue-grouping collection 
