<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>Heritrix - Home Page</title><style type="text/css" media="all"> @import url("./style/maven-base.css"); @import url("./style/maven-theme.css");@import url("./style/project.css");</style><link rel="stylesheet" href="./style/print.css" type="text/css" media="print"></link><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta><meta name="author" content="St.Ack"></meta><meta name="email" content="stack at archive dot org"></meta></head><body class="composite"><div id="banner"><a href="http://www.archive.org/" id="organizationLogo"><img alt="Internet Archive" src="http://www.archive.org/images/logo.jpg"></img></a><a href="http://crawler.archive.org" id="projectLogo"><img alt="Heritrix" src="./images/logo.gif"></img></a><div class="clear"><hr></hr></div></div><div id="breadcrumbs"><div class="xleft"> Last published: 27 September 2006 | Doc for 1.10.1</div><div class="xright"></div><div class="clear"><hr></hr></div></div><div id="leftColumn"><div id="navcolumn"><div id="menuOverview"><h5>Overview</h5><ul><li class="none"><a href="license.html">License</a></li><li class="none"><a href="requirements.html">System Requirements</a></li><li class="none"><a href="downloads.html">Downloads</a></li><li class="none"><a href="articles/user_manual/index.html">User Manual</a></li><li class="none"><a href="articles/developer_manual/index.html">Developer Manual</a></li><li class="none"><a href="apidocs/index.html">Javadocs</a></li><li class="none"><a href="faq.html">FAQ</a></li><li class="none"><a href="http://crawler.archive.org/cgi-bin/wiki.pl?HomePage" class="externalLink" title="External Link">Wiki</a></li><li class="none"><a href="http://sourceforge.net/tracker/?group_id=73833&atid=539099" class="externalLink" title="External Link">Browse/Submit a Bug</a></li></ul></div><div id="menuProject_Documentation"><h5>Project 
Documentation</h5><ul><li class="none"><strong><a href="index.html">About Heritrix</a></strong></li><li class="collapsed"><a href="project-info.html">Project Info</a></li><li class="collapsed"><a href="maven-reports.html">Project Reports</a></li><li class="none"><a href="http://maven.apache.org/development-process.html" class="externalLink" title="External Link">Development Process</a></li></ul></div><a href="http://maven.apache.org/" title="Built by Maven" id="poweredBy"><img alt="Built by Maven" src="./images/logos/maven-button-1.png"></img></a></div></div><div id="bodyColumn"><div class="contentBox"><div class="section"><a name="Introduction"></a><h2>Introduction</h2><p>Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.</p><p><em>Heritrix</em> (sometimes spelled <em>heretrix</em>, or misspelled or missaid as <em>heratrix</em>/<em>heritix</em>/ <em>heretix</em>/<em>heratix</em>) is an archaic word for <em>heiress</em> (woman who inherits). Since our crawler seeks to collect and <i>preserve</i> the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed apt.</p></div><div class="section"><a name="Webmasters_"></a><h2>Webmasters!</h2><p>Heritrix is designed to respect the <a href="http://www.robotstxt.org/wc/robots.html" class="externalLink" title="External Link">robots.txt</a> exclusion directives and <a href="http://www.robotstxt.org/wc/exclusion.html#meta" class="externalLink" title="External Link">META robots tags</a>, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity. </p><p>If you notice our crawler behaving poorly -- The Internet Archive uses <b>archive.org_bot</b> as User Agent when crawling -- please send us email at: </p><p><img src="/images/crawler-problem-report-email.gif" alt="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. 
net" title="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. net"></img> </p></div><div class="section"><a name="Getting_Started"></a><h2>Getting Started</h2><p>See the <a href="articles/user_manual/index.html">User Manual</a>.</p></div><div class="section"><a name="News_and_Status"></a><h2>News and Status</h2><div class="subsection"><a name="Release_1_10_0_09_11_2006"></a><h3>Release 1.10.0 09/11/2006</h3><p>Release 1.10.0 adds new configuration options, experimental new protocol and format support, and lots of fixes (43 tracked bugs have been fixed and 35 feature requests added). Requires JDK 1.5.x. See <a href="articles/releasenotes/1_10_0.html">Release Notes</a> for detail. </p></div><div class="subsection"><a name="Release_1_8_0_05_05_2006"></a><h3>Release 1.8.0 05/05/2006</h3><p>Release 1.8.0 offers a number of improvements, including 13 requested enhancements and fixes for 18 reported bugs. See <a href="articles/releasenotes/1_8_0.html">Heritrix Release Notes</a> for detail and <a href="articles/releasenotes/1_8_0.html#1_8_0_limitations">Known Limitations</a>.</p></div><div class="subsection"><a name="Release_1_6_0_12_01_2005"></a><h3>Release 1.6.0 12/01/2005</h3><p>Release 1.6.0 offers improved remote control and monitoring via JMX, a crawl-checkpointing facility, and experimental support for Bloom filter already-included testing, partitioning a crawl across multiple independent crawlers, and per-host/domain/queue-grouping collection quotas. Performance and stability in large crawls are also improved. Among tracked issues, it includes 39 requested enhancements and fixes 96 reported bugs. See <a href="articles/releasenotes/1_6_0.html">Heritrix Release Notes</a> for detail and <a href="articles/releasenotes/1_6_0.html#1_6_0_limitations">Known Limitations</a>: e.g. 
again you will need to <a href="articles/releasenotes/1_6_0.html#postselector">tweak your old order files</a> to make them work with the new release.</p></div><div class="subsection"><a name="Release_1_4_0_04_28_2005"></a><h3>Release 1.4.0 04/28/2005</h3><p>Much improved memory usage, new experimental scoping/filter model, and a new revisiting frontier. Over 90 bugs fixed. See <a href="articles/releasenotes/1_4_0.html">Heritrix Release Notes</a> for detail and <a href="articles/releasenotes/1_4_0.html#1_4_0_limitations">Known Limitations</a>: e.g. you cannot use your old order files with the new release.</p></div><div class="subsection"><a name="Release_1_2_0_11_16_2004"></a><h3>Release 1.2.0 11/16/2004</h3><p>Added IP-based politeness, configurable URI-canonicalization, and mid-fetch abort. Lots of bug fixes. See <a href="articles/releasenotes/1_2_0.html">Heritrix Release Notes</a> for detail and Known Limitations (in particular, https fetching requires the Sun JDK and the UI throws OOME if jobs run in series). </p></div><div class="subsection"><a name="Release_1_0_4_09_22_2004"></a><h3>Release 1.0.4 09/22/2004</h3><p>Bug fix. Crawl.log and ARC metadata lines could have whitespace in URIs and mimetype fields. See <a href="articles/releasenotes/1_0_4.html">Heritrix Release Notes</a> for detail and Known Limitations. </p></div><div class="subsection"><a name="Release_1_0_2_09_14_2004"></a><h3>Release 1.0.2 09/14/2004</h3><p>Bug fixes. See <a href="articles/releasenotes/1_0_2.html">Heritrix Release Notes</a> for detail and known limitations. </p></div><div class="subsection"><a name="Release_1_0_0_08_06_2004"></a><h3>Release 1.0.0 08/06/2004</h3><p>Added new prefix ('SURT') scope and filter, compression of recovery log, mass adding of URIs to running crawler, crawling via a http proxy, adding of headers to request, improved out-of-the-box defaults, hash of content to crawl log and to arcreader output, and many bug fixes. 
See <a href="articles/releasenotes/1_0_0.html">Heritrix Release Notes</a> for detail and known limitations. </p></div><div class="subsection"><a name="1_0_0_first_release_candidate__0_10_0_06_04_2004"></a><h3>1.0.0 first release candidate, 0.10.0 06/04/2004</h3><p>Release for second heritrix workshop, Copenhagen 06/2004 (1.0.0 first release candidate). Added site-first prioritization, fixed link extraction of multibyte URIs, added metadata to arcs as xml, changed arc naming template, new user and developer manuals, added basic/digest auth and http post/get login facility, and added help to UI. Bug fixes. See <a href="articles/releasenotes/0_10_0.html">Heritrix Release Notes</a> for detail and known limitations. </p></div><div class="subsection"><a name="Release_0_8_1_05_28_2004"></a><h3>Release 0.8.1 05/28/2004</h3><p>Fixes to build with maven rc2+. </p></div><div class="subsection"><a name="Release_0_8_0_05_24_2004"></a><h3>Release 0.8.0 05/24/2004</h3><p>Release (and branch heritrix-0_8 made at the heritrix-0_7_1 tag) because of ConcurrentModificationExceptions if tens of seeds supplied and to fix domain-scope leakage. Also, made continuous build publicly available, incorporated integration selftest into build, made it a maven-build only (ant-build no longer supported), added day/night configurations (refinements), ameliorated too-many-open files, added exploit of http-header content-type charset creating character streams, and heritrix now crawls SSL sites. UI improvements include red start by bad configuration, precompilation, and delineation of advanced settings. See <a href="articles/releasenotes/0_8_0.html">Heritrix Release Notes</a> for detail. </p></div><div class="subsection"><a name="Release_0_6_0_03_25_2004"></a><h3>Release 0.6.0 03/25/2004</h3><p>Release made in advance of radical frontier changes. Added bandwidth throttle, operator 'diary', settable robots expiration, crawler cookie pre-population, and changing of certain options mid-crawl. 
Many UI improvements including UI display of critical exceptions, UI description of job-order options, and improved reporting. Optimizations. Updated httpclient lib to 2.0 release and jmx libs to 1.2.1. See <a href="articles/releasenotes/0_6_0.html">Heritrix Release Notes</a> for detail. </p></div><div class="subsection"><a name="Point_Release_0_4_1_02_12_2004"></a><h3>Point Release 0.4.1 02/12/2004</h3><p>Released <a href="http://sourceforge.net/project/showfiles.php?group_id=73833&package_id=73980" class="externalLink" title="External Link">heritrix-0.4.1</a> to fix <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=895955&group_id=73833&atid=539099" class="externalLink" title="External Link">URIRegExpFilter retains memory</a>.</p></div><div class="subsection"><a name="Release_0_4_0_02_10_2004"></a><h3>Release 0.4.0 02/10/2004</h3><p>Release made for heritrix workshop, San Francisco, 02/2004. New MBEAN-based configuration, extensive UI revamp, first unit tests and integration selftest framework added, pooling of ARCWriters, new cmd-line start scripts, httpclient lib update (2.0RC3) and bugfixes. See <a href="articles/releasenotes/0_4_0.html">Heritrix Release Notes</a> for detail. </p></div><div class="subsection"><a name="First_Release_01_05_2004"></a><h3>First Release 01/05/2004</h3><p>Today we made our first 'official' heritrix release, <a href="http://sourceforge.net/project/showfiles.php?group_id=73833&package_id=73980" class="externalLink" title="External Link">heritrix-0.2.0</a>.</p></div></div></div></div><div class="clear"><hr></hr></div><div id="footer"><div class="xleft"><a href="http://sourceforge.net/projects/archive-crawler/" class="externalLink" title="External Link"> <img src="http://sourceforge.net/sflogo.php?group_id=archive-crawler&type=1" border="0" alt="sf logo"></img></a></div><div class="xright">© 2003-2006, Internet Archive</div><div class="clear"><hr></hr></div></div></body></html>