1_12_0.html

<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>2.&nbsp;Release 1.12.0 - 3/16/2007</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix Release Notes"><link rel="up" href="index.html" title="Heritrix Release Notes"><link rel="prev" href="1_12_1.html" title="1.&nbsp;Release 1.12.1 - 5/6/2007"><link rel="next" href="1_10_2.html" title="3.&nbsp;Release 1.10.2 - 01/15/2007"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">2.&nbsp;Release 1.12.0 - 3/16/2007</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="1_12_1.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="1_10_2.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="1_12_0"></a>2.&nbsp;Release 1.12.0 - 3/16/2007</h2></div></div></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Release 1.12.0 is the first of several planned releases enhancing      Heritrix with "smart crawler" functionality. In this release, the theme      has been offering new options to reduce the amount of duplicate content      crawled and stored when recrawling sites at regular intervals. 
A number      of other enhancements and bug fixes are also included.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_contributors"></a>2.1.&nbsp;Contributors</h3></div></div></div><p>Aside from the <a href="http://crawler.archive.org/team-list.html" target="_top">usual suspects</a>,      the following contributed to this release:<div class="itemizedlist"><ul type="disc"><li><p>Oskar Grenholm</p></li><li><p>Doug Judd</p></li></ul></div></p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_notes"></a>2.2.&nbsp;Notes</h3></div></div></div><p>With this release, Heritrix project issue-tracking has moved from      Sourceforge to a JIRA-based system at <a href="http://webteam.archive.org/jira/browse/HER" target="_top">      http://webteam.archive.org/jira/browse/HER </a>.</p><p>Those using Heritrix in a Hadoop environment may be interested in      Doug Judd's <a href="http://www.zvents.com/labs/hdfs_writer_processor" target="_top">HDFSWriterProcessor</a>,      for storing crawled content directly into HDFS, the <a href="http://lucene.apache.org/hadoop/" target="_top">Hadoop Distributed      FileSystem</a>. 
</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_limitations"></a>2.3.&nbsp;Known Limitations/Issues</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_12_0_limitations_bdb_nfs"></a>2.3.1.&nbsp;java.io.IOException: No locks available</h4></div></div></div><p>See <a href="1_8_0.html#bdb_nfs" title="6.1.1.&nbsp;java.io.IOException: No locks available">Section&nbsp;6.1.1, &ldquo;java.io.IOException: No locks available&rdquo;</a> in 1.8.0 Release Notes.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_12_0_limitations_checkpoints"></a>2.3.2.&nbsp;Older Checkpoints</h4></div></div></div><p>Checkpoints created by earlier versions generally cannot be        resumed in later versions.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1006B"></a>2.3.3.&nbsp;Older configurations (order.xml, etc.)</h4></div></div></div><p>Crawler configuration files from jobs in previous versions may        work in 1.12.0, though missing new settings will be set to their        default values, and obsolete old settings will generate log warnings.        Re-creating configurations from defaults or hand-editing to match        newer files is recommended. 
</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_changes"></a>2.4.&nbsp;Changes</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10074"></a>2.4.1.&nbsp;Duplication reduction features</h4></div></div></div><p>A collection of Processors, including the FetchHistoryProcessor,        PersistProcessor, and its subclasses, may be used together with new        options on the FetchHTTP and writer processors to carry information        forward between crawls and collect less duplicate content on later        recrawls. The project wiki features <a href="http://webteam.archive.org/confluence/display/Heritrix/Feature+Notes+-+1.12.0" target="_top">notes        on using the new duplication-reduction functionality</a>. </p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_12_0_deciderules"></a>2.4.2.&nbsp;DecideRules have replaced Filters on Processors</h4></div></div></div><p>All Processors which used internal Filters for differentially        acting on URIs now use DecideRules instead. In those cases where a        DecideRule replacement for a Filter is not yet available, a legacy        Filter can be wrapped in a FilterDecideRule to preserve prior        functionality. In a future release, all Filters will be removed in        favor of equivalent DecideRules. </p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10083"></a>2.4.3.&nbsp;WARC</h4></div></div></div><p>ExperimentalWARCWriter has been updated to match proposed WARC        version <a href="http://archive-access.svn.sourceforge.net/viewvc/*checkout*/archive-access/branches/gjm_warc_0_12/warc/warc_file_format.html" target="_top">"WARC/0.12"        (revision H1.12-RC1)</a>. The implementation as of Heritrix 1.10.x        remains for reference as org.archive.io.warc.v10.ExperimentalV10WARCWriterProcessor.   
     The WARC format remains under discussion. </p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1008C"></a>2.4.4.&nbsp;Kw3WriterProcessor</h4></div></div></div><p>Oskar Grenholm of the Swedish National Library has contributed a        module that writes the results of successful fetches to files on disk.        These files are MIME-files of the type used by the Swedish National        Library's <a href="http://www.kb.se/kw3/" target="_top">Kulturarw3 web        harvesting</a>.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10095"></a>2.4.5.&nbsp;All tracked changes</h4></div></div></div><p>A dynamic list of all tracked changes marked as fixed in 1.12.0        is available at: <a href="http://webteam.archive.org/confluence/display/Heritrix/Issues+with+%27Fix+Version%27+1.12.0" target="_top">Issues        with 'Fix Version' 1.12.0.</a></p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="1_12_1.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="1_10_2.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">1.&nbsp;Release 1.12.1 - 5/6/2007&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;3.&nbsp;Release 1.10.2 - 01/15/2007</td></tr></table></div></body></html>
