<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>Heritrix - Home Page</title><style type="text/css" media="all"> @import url("./style/maven-base.css"); @import url("./style/maven-theme.css");@import url("./style/project.css");</style><link rel="stylesheet" href="./style/print.css" type="text/css" media="print"></link><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta><meta name="author" content="St.Ack"></meta><meta name="email" content="stack at archive dot org"></meta></head><body class="composite"><div id="banner"><a href="http://www.archive.org/" id="organizationLogo"><img alt="Internet Archive" src="http://www.archive.org/images/logo.jpg"></img></a><a href="http://crawler.archive.org" id="projectLogo"><img alt="Heritrix" src="./images/logo.gif"></img></a><div class="clear"><hr></hr></div></div><div id="breadcrumbs"><div class="xleft"> Last published: 06 May 2007 | Doc for 1.12.1</div><div class="xright"></div><div class="clear"><hr></hr></div></div><div id="leftColumn"><div id="navcolumn"><div id="menuOverview"><h5>Overview</h5><ul><li class="none"><a href="license.html">License</a></li><li class="none"><a href="requirements.html">System Requirements</a></li><li class="none"><a href="downloads.html">Downloads</a></li><li class="none"><a href="articles/user_manual/index.html">User Manual</a></li><li class="none"><a href="articles/developer_manual/index.html">Developer Manual</a></li><li class="none"><a href="apidocs/index.html">Javadocs</a></li><li class="none"><a href="faq.html">FAQ</a></li><li class="none"><a href="http://webteam.archive.org/confluence/display/Heritrix/Home" class="externalLink" title="External Link">Wiki</a></li><li class="none"><a href="http://sourceforge.net/tracker/?group_id=73833&amp;atid=539099" class="externalLink" title="External Link">Browse/Submit a Bug</a></li><li class="expanded"><a href="">Related Projects</a><ul><li 
class="none"><a href="http://archive-access.sourceforge.net/" class="externalLink" title="External Link">Archive Access</a></li><li class="none"><a href="http://crawler.sourceforge.net/hcc" class="externalLink" title="External Link">Heritrix Cluster Controller (hcc)</a></li><li class="none"><a href="http://crawler.sourceforge.net/cmdline-jmxclient" class="externalLink" title="External Link">cmdline-jmxclient</a></li><li class="none"><a href="http://deduplicator.sourceforge.net" class="externalLink" title="External Link">Deduplicator</a></li><li class="none"><a href="http://www.zvents.com/labs/heritrix_hadoop" class="externalLink" title="External Link">Hadoop DFS Writer Processor</a></li></ul></li></ul></div><div id="menuProject_Documentation"><h5>Project Documentation</h5><ul><li class="none"><strong><a href="index.html">About Heritrix</a></strong></li><li class="collapsed"><a href="project-info.html">Project Info</a></li><li class="collapsed"><a href="maven-reports.html">Project Reports</a></li><li class="none"><a href="http://maven.apache.org/development-process.html" class="externalLink" title="External Link">Development Process</a></li></ul></div><a href="http://maven.apache.org/" title="Built by Maven" id="poweredBy"><img alt="Built by Maven" src="./images/logos/maven-button-1.png"></img></a></div></div><div id="bodyColumn"><div class="contentBox"><div class="section"><a name="Introduction"></a><h2>Introduction</h2><p>Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.</p><p><em>Heritrix</em> (sometimes spelled <em>heretrix</em>, or misspelled or missaid as <em>heratrix</em>/<em>heritix</em>/ <em>heretix</em>/<em>heratix</em>) is an archaic word for <em>heiress</em> (woman who inherits). 
Since our crawler seeks to collect and <i>preserve</i> the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.</p></div><div class="section"><a name="Webmasters_"></a><h2>Webmasters!</h2><p>Heritrix is designed to respect the <a href="http://www.robotstxt.org/wc/robots.html" class="externalLink" title="External Link">robots.txt</a> exclusion directives and <a href="http://www.robotstxt.org/wc/exclusion.html#meta" class="externalLink" title="External Link">META robots tags</a>, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity. </p><p>If you notice our crawler behaving poorly -- The Internet Archive uses <b>archive.org_bot</b> as User Agent when crawling -- please send us email at: </p><p><img src="/images/crawler-problem-report-email.gif" alt="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. net" title="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. net"></img> </p></div><div class="section"><a name="Getting_Started"></a><h2>Getting Started</h2><p>See the <a href="articles/user_manual/index.html">User Manual</a>.</p></div><div class="section"><a name="News_and_Status"></a><h2>News and Status</h2><div class="subsection"><a name="Release_1_12_1_05_06_2007"></a><h3>Release 1.12.1 05/06/2007</h3><p>Release 1.12.1 is a bug fix release. See the <a href="articles/releasenotes/1_12_1.html">Release Notes</a> and <a href="http://webteam.archive.org/confluence/display/Heritrix/Issues+with+%27Fix+Version%27+1.12.1" class="externalLink" title="External Link"> list of fixed issues</a> for details. </p><p>Additional notes about 1.12.1 which may be updated with information post-release are available <a href="http://webteam.archive.org/confluence/display/Heritrix/Release+Notes+-+1.12.1" class="externalLink" title="External Link"> on the Heritrix wiki</a>. 
</p></div><div class="subsection"><a name="Release_1_12_0_03_16_2007"></a><h3>Release 1.12.0 03/16/2007</h3><p>Release 1.12.0 is the first of several planned releases enhancing Heritrix with "smart crawler" functionality. In this release, the theme has been offering new options to reduce the amount of duplicate content crawled and stored when recrawling sites at regular intervals. A number of other enhancements and bug fixes are also included. See the <a href="articles/releasenotes/1_12_0.html">Release Notes</a> for details. </p></div><div class="subsection"><a name="HDFS_Writer_Processor_01_25_2007"></a><h3>HDFS Writer Processor 01/25/2007</h3><p> The HDFS Writer Processor extension enables storing crawled content directly into HDFS, the <a href="http://lucene.apache.org/hadoop" class="externalLink" title="External Link">Hadoop</a> Distributed FileSystem. For details, see the <a href="http://www.zvents.com/hdfs/README.txt" class="externalLink" title="External Link">README.txt</a>. To download, see <a href="http://www.zvents.com/labs/hdfs_writer_processor" class="externalLink" title="External Link">HDFS Writer Processor</a> </p></div><div class="subsection"><a name="Release_1_10_2_01_15_2007"></a><h3>Release 1.10.2 01/15/2007</h3><p>This is primarily a bug-fix release, with a couple of new features, provided before a number of significant changes to the Heritrix project that will require developer and crawl operator adjustments. Post-1.10.2, Heritrix source code control, issue tracking, and build process will migrate to new systems. Also, updates to core classes, especially with regard to the settings architecture, will noticeably break backward compatibility with 1.10.2 and prior crawler settings files and formats. See <a href="articles/releasenotes/1_10_2.html">Release Notes</a> for details. </p></div><div class="subsection"><a name="Release_1_10_1_09_27_2006"></a><h3>Release 1.10.1 09/27/2006</h3><p>Bug fix release. 
See <a href="articles/releasenotes/1_10_1.html">Release Notes</a> for details. </p></div><div class="subsection"><a name="Deduplicator__add-on_for_Heritrix__0_2_0_release_________09_14_2006"></a><h3>Deduplicator (add-on for Heritrix) 0.2.0 release 09/14/2006</h3><p> The Deduplicator is an add-on module for Heritrix that allows sequential snapshot crawls to leverage information about previous iterations to avoid storing (or even downloading) duplicate data. See the <a href="http://tech.groups.yahoo.com/group/archive-crawler/message/3282" class="externalLink" title="External Link"> mailing list announcement</a> for details. </p></div><div class="subsection"><a name="Release_1_10_0_09_11_2006"></a><h3>Release 1.10.0 09/11/2006</h3><p>Release 1.10.0 adds new configuration options, experimental new protocol and format support, and lots of fixes (43 tracked bugs have been fixed and 35 feature requests added). Requires JDK 1.5.x. See <a href="articles/releasenotes/1_10_0.html">Release Notes</a> for details. </p></div><div class="subsection"><a name="Release_1_8_0_05_05_2006"></a><h3>Release 1.8.0 05/05/2006</h3><p>Release 1.8.0 offers a number of improvements, including 13 requested enhancements and fixes for 18 reported bugs. See <a href="articles/releasenotes/1_8_0.html">Heritrix Release Notes</a> for details and <a href="articles/releasenotes/1_8_0.html#1_8_0_limitations">Known Limitations</a>.</p></div><div class="subsection"><a name="Release_1_6_0_12_01_2005"></a><h3>Release 1.6.0 12/01/2005</h3><p>Release 1.6.0 offers improved remote control and monitoring via JMX, a crawl-checkpointing facility, and experimental support for bloom filter already-included testing, partitioning a crawl across multiple independent crawlers, and per-host/domain/queue-grouping collection