📄 index.html

📁 Written in Java; left over from lab experiments. I was going to delete it, but I am uploading it so everyone can share it.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>Heritrix - Home Page</title><style type="text/css" media="all">          @import url("./style/maven-base.css");          			    @import url("./style/maven-theme.css");@import url("./style/project.css");</style><link rel="stylesheet" href="./style/print.css" type="text/css" media="print"></link><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta><meta name="author" content="St.Ack"></meta><meta name="email" content="stack at archive dot org"></meta></head><body class="composite"><div id="banner"><a href="http://www.archive.org/" id="organizationLogo"><img alt="Internet Archive" src="http://www.archive.org/images/logo.jpg"></img></a><a href="http://crawler.archive.org" id="projectLogo"><img alt="Heritrix" src="./images/logo.gif"></img></a><div class="clear"><hr></hr></div></div><div id="breadcrumbs"><div class="xleft">                	Last published: 27 September 2006                  | Doc for 1.10.1</div><div class="xright"></div><div class="clear"><hr></hr></div></div><div id="leftColumn"><div id="navcolumn"><div id="menuOverview"><h5>Overview</h5><ul><li class="none"><a href="license.html">License</a></li><li class="none"><a href="requirements.html">System Requirements</a></li><li class="none"><a href="downloads.html">Downloads</a></li><li class="none"><a href="articles/user_manual/index.html">User Manual</a></li><li class="none"><a href="articles/developer_manual/index.html">Developer Manual</a></li><li class="none"><a href="apidocs/index.html">Javadocs</a></li><li class="none"><a href="faq.html">FAQ</a></li><li class="none"><a href="http://crawler.archive.org/cgi-bin/wiki.pl?HomePage" class="externalLink" title="External Link">Wiki</a></li><li class="none"><a href="http://sourceforge.net/tracker/?group_id=73833&amp;atid=539099" class="externalLink" title="External Link">Browse/Submit a 
Bug</a></li></ul></div><div id="menuProject_Documentation"><h5>Project Documentation</h5><ul><li class="none"><strong><a href="index.html">About Heritrix</a></strong></li><li class="collapsed"><a href="project-info.html">Project Info</a></li><li class="collapsed"><a href="maven-reports.html">Project Reports</a></li><li class="none"><a href="http://maven.apache.org/development-process.html" class="externalLink" title="External Link">Development Process</a></li></ul></div><a href="http://maven.apache.org/" title="Built by Maven" id="poweredBy"><img alt="Built by Maven" src="./images/logos/maven-button-1.png"></img></a></div></div><div id="bodyColumn"><div class="contentBox"><div class="section"><a name="Introduction"></a><h2>Introduction</h2><p>Heritrix is the Internet Archive's open-source, extensible,         	web-scale, archival-quality web crawler project.</p><p><em>Heritrix</em> (sometimes spelled <em>heretrix</em>, or         misspelled or missaid as <em>heratrix</em>/<em>heritix</em>/        <em>heretix</em>/<em>heratix</em>) is an         archaic word for <em>heiress</em> (woman who inherits). Since our         crawler seeks to collect and <i>preserve</i> the digital        artifacts of our culture for the benefit of future researchers and        generations, this name seemed apt.</p></div><div class="section"><a name="Webmasters_"></a><h2>Webmasters!</h2><p>Heritrix is designed to respect the        <a href="http://www.robotstxt.org/wc/robots.html" class="externalLink" title="External Link">robots.txt</a>        exclusion directives and         <a href="http://www.robotstxt.org/wc/exclusion.html#meta" class="externalLink" title="External Link">META robots        tags</a>, and collect material at a measured, adaptive pace unlikely        to disrupt normal website activity.        
</p><p>If you notice our crawler behaving poorly -- The Internet Archive        uses <b>archive.org_bot</b> as User Agent when crawling --        please send us email at:        </p><p><img src="/images/crawler-problem-report-email.gif" alt="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. net" title="archive -dash- crawler -dash- agent, @at@ lists .dot. sourceforge .dot. net"></img>        </p></div><div class="section"><a name="Getting_Started"></a><h2>Getting Started</h2><p>See the <a href="articles/user_manual/index.html">User Manual</a>.</p></div><div class="section"><a name="News_and_Status"></a><h2>News and Status</h2><div class="subsection"><a name="Release_1_10_0_09_11_2006"></a><h3>Release 1.10.0 09/11/2006</h3><p>Release 1.10.0 adds new configuration options, experimental       new protocol and format support, and lots of fixes (43 tracked bugs       have been fixed and 35 feature requests added). Requires JDK 1.5.x.      See <a href="articles/releasenotes/1_10_0.html">Release Notes</a> for      detail.      </p></div><div class="subsection"><a name="Release_1_8_0_05_05_2006"></a><h3>Release 1.8.0 05/05/2006</h3><p>Release 1.8.0 offers a number of improvements, including 13 requested       enhancements and fixes for 18 reported bugs. See      <a href="articles/releasenotes/1_8_0.html">Heritrix Release Notes</a>       for detail and      <a href="articles/releasenotes/1_8_0.html#1_8_0_limitations">Known      Limitations</a>.</p></div><div class="subsection"><a name="Release_1_6_0_12_01_2005"></a><h3>Release 1.6.0 12/01/2005</h3><p>Release 1.6.0 offers improved remote control and monitoring via      JMX, a crawl-checkpointing facility, and experimental support for bloom      filter already-included testing, partitioning a crawl across multiple       independent crawlers, and per-host/domain/queue-grouping collection       quotas. Performance and stability in large crawls is also improved.      
Among tracked issues, it includes 39 requested enhancements and fixes 96       reported bugs. See      <a href="articles/releasenotes/1_6_0.html">Heritrix Release Notes</a>       for detail and      <a href="articles/releasenotes/1_6_0.html#1_6_0_limitations">Known      Limitations</a>: e.g. Again you will need to       <a href="articles/releasenotes/1_6_0.html#postselector">tweak your old order      files</a> to make them work with the new release.</p></div><div class="subsection"><a name="Release_1_4_0_04_28_2005"></a><h3>Release 1.4.0 04/28/2005</h3><p>Much improved memory usage, new experimental scoping/filter model,      and a new revisiting frontier.  Over 90 bugs fixed. See      <a href="articles/releasenotes/1_4_0.html">Heritrix Release Notes</a>       for detail and      <a href="articles/releasenotes/1_4_0.html#1_4_0_limitations">Known      Limitations</a>: e.g. You cannot use your old order files with the new      release.</p></div><div class="subsection"><a name="Release_1_2_0_11_16_2004"></a><h3>Release 1.2.0 11/16/2004</h3><p>Added IP-based politeness, configurable URI-canonicalization,        and mid-fetch abort.  Lots of Bug fixes.  See        <a href="articles/releasenotes/1_2_0.html">Heritrix Release Notes</a>         for detail and Known Limitations (In particular, https fetching        requires SUN JDK and UI throws OOME if jobs run in series).        </p></div><div class="subsection"><a name="Release_1_0_4_09_22_2004"></a><h3>Release 1.0.4 09/22/2004</h3><p>Bug fix. Crawl.log and ARC metadata lines could have whitespace        in URIs and mimetype fields.  See        <a href="articles/releasenotes/1_0_4.html">Heritrix Release Notes</a>         for detail and Known Limitations.        </p></div><div class="subsection"><a name="Release_1_0_2_09_14_2004"></a><h3>Release 1.0.2 09/14/2004</h3><p>Bug fixes.         See <a href="articles/releasenotes/1_0_2.html">Heritrix Release        Notes</a> for detail and known limitations.        
</p></div><div class="subsection"><a name="Release_1_0_0_08_06_2004"></a><h3>Release 1.0.0 08/06/2004</h3><p>Added new prefix ('SURT') scope and filter,        compression of recovery log,        mass adding of URIs to running crawler,        crawling via a http proxy,         adding of headers to request,        improved out-of-the-box defaults,        hash of content to crawl log and to arcreader output,        and many bug fixes.        See <a href="articles/releasenotes/1_0_0.html">Heritrix Release        Notes</a> for detail and known limitations.        </p></div><div class="subsection"><a name="1_0_0_first_release_candidate__0_10_0_06_04_2004"></a><h3>1.0.0 first release candidate, 0.10.0 06/04/2004</h3><p>Release for second heritrix workshop, Copenhagen 06/2004        (1.0.0 first release candidate). Added site-first prioritization,        fixed link extraction of multibyte URIs, added metadata to arcs as xml,        changed arc naming template, new user and developer manuals,        added basic/digest auth and http post/get login facility, and added        help to UI. Bug fixes.         See <a href="articles/releasenotes/0_10_0.html">Heritrix Release        Notes</a> for detail and known limitations.        </p></div><div class="subsection"><a name="Release_0_8_1_05_28_2004"></a><h3>Release 0.8.1 05/28/2004</h3><p>Fixes to build with maven rc2+.        </p></div><div class="subsection"><a name="Release_0_8_0_05_24_2004"></a><h3>Release 0.8.0 05/24/2004</h3><p>Release (and branch heritrix-0_8 made at the heritrix-0_7_1 tag)        because of concurrentmodificationexceptions if tens of seeds supplied        and to fix domain-scope leakage. 
Also, made continuous build        publicly available, incorporated integration selftest into build,        made it a maven-build only (ant-build no longer supported), added        day/night configurations (refinements), ameliorated too-many-open        files, added exploit of http-header content-type charset creating        character streams, and heritrix now crawls ssl sites. UI improvements        include red start by bad configuration, precompilation, and        delineation of advanced settings.         See <a href="articles/releasenotes/0_8_0.html">Heritrix Release        Notes</a> for detail.        </p></div><div class="subsection"><a name="Release_0_6_0_03_25_2004"></a><h3>Release 0.6.0 03/25/2004</h3><p>Release made in advance of radical frontier changes.        Added bandwidth throttle, operator 'diary', settable robots expiration,        crawler cookie pre-population, and changing of certain options        mid-crawl. Many UI improvements including UI display of critical        exceptions, UI description of job-order options, and improved        reporting.  Optimizations.  Updated httpclient lib to 2.0 release and        jmx libs to 1.2.1.        See <a href="articles/releasenotes/0_6_0.html">Heritrix Release        Notes</a> for detail.        </p></div><div class="subsection"><a name="Point_Release_0_4_1_02_12_2004"></a><h3>Point Release 0.4.1 02/12/2004</h3><p>Released <a href="http://sourceforge.net/project/showfiles.php?group_id=73833&amp;package_id=73980" class="externalLink" title="External Link">heritrix-0.4.1</a> to fix         <a href="http://sourceforge.net/tracker/index.php?func=detail&amp;aid=895955&amp;group_id=73833&amp;atid=539099" class="externalLink" title="External Link">URIRegExpFilter retains memory</a>.</p></div><div class="subsection"><a name="Release_0_4_0_02_10_2004"></a><h3>Release 0.4.0 02/10/2004</h3><p>Release made for heritrix workshop, San Francisco, 02/2004.        
New MBEAN-based configuration, extensive UI revamp, first unit        tests and integration selftest framework added, pooling of        ARCWriters, new cmd-line start scripts, httpclient lib update (2.0RC3)        and bugfixes.        See <a href="articles/releasenotes/0_4_0.html">Heritrix Release        Notes</a> for detail.        </p></div><div class="subsection"><a name="First_Release_01_05_2004"></a><h3>First Release 01/05/2004</h3><p>Today we made our first 'official' heritrix release,         <a href="http://sourceforge.net/project/showfiles.php?group_id=73833&amp;package_id=73980" class="externalLink" title="External Link">heritrix-0.2.0</a>.</p></div></div></div></div><div class="clear"><hr></hr></div><div id="footer"><div class="xleft"><a href="http://sourceforge.net/projects/archive-crawler/" class="externalLink" title="External Link">            <img src="http://sourceforge.net/sflogo.php?group_id=archive-crawler&amp;type=1" border="0" alt="sf logo"></img></a></div><div class="xright">© 2003-2006, Internet Archive</div><div class="clear"><hr></hr></div></div></body></html>
