<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>2. Release 1.12.0 - 3/16/2007</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix Release Notes"><link rel="up" href="index.html" title="Heritrix Release Notes"><link rel="prev" href="1_12_1.html" title="1. Release 1.12.1 - 5/6/2007"><link rel="next" href="1_10_2.html" title="3. Release 1.10.2 - 01/15/2007"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">2. Release 1.12.0 - 3/16/2007</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="1_12_1.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="1_10_2.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="1_12_0"></a>2. Release 1.12.0 - 3/16/2007</h2></div></div></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Release 1.12.0 is the first of several planned releases enhancing Heritrix with "smart crawler" functionality. In this release, the theme has been offering new options to reduce the amount of duplicate content crawled and stored when recrawling sites at regular intervals. A number of other enhancements and bug fixes are also included.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_contributors"></a>2.1. 
Contributors</h3></div></div></div><p>Aside from the <a href="http://crawler.archive.org/team-list.html" target="_top">usual suspects</a>, the following contributed to this release:<div class="itemizedlist"><ul type="disc"><li><p>Oskar Grenholm</p></li><li><p>Doug Judd</p></li></ul></div></p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_notes"></a>2.2. Notes</h3></div></div></div><p>With this release, Heritrix project issue-tracking has moved from Sourceforge to a JIRA-based system at <a href="http://webteam.archive.org/jira/browse/HER" target="_top"> http://webteam.archive.org/jira/browse/HER </a>.</p><p>Those using Heritrix in a Hadoop environment may be interested in Doug Judd's <a href="http://www.zvents.com/labs/hdfs_writer_processor" target="_top">HDFSWriterProcessor</a>, for storing crawled content directly into HDFS, the <a href="http://lucene.apache.org/hadoop/" target="_top">Hadoop Distributed FileSystem</a>. </p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_limitations"></a>2.3. Known Limitations/Issues</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_12_0_limitations_bdb_nfs"></a>2.3.1. java.io.IOException: No locks available</h4></div></div></div><p>See <a href="1_8_0.html#bdb_nfs" title="6.1.1. java.io.IOException: No locks available">Section 6.1.1, “java.io.IOException: No locks available”</a> in 1.8.0 Release Notes.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_12_0_limitations_checkpoints"></a>2.3.2. Older Checkpoints</h4></div></div></div><p>Checkpoints from earlier versions are generally not supported for resume in later versions.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1006B"></a>2.3.3. 
Older configurations (order.xml, etc.)</h4></div></div></div><p>Crawler configuration files from jobs in previous versions may work in 1.12.0, though missing new settings will be set to their default values, and obsolete old settings will generate log warnings. Re-creating configurations from defaults or hand-editing to match newer files is recommended. </p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_12_0_changes"></a>2.4. Changes</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10074"></a>2.4.1. Duplication reduction features</h4></div></div></div><p>A collection of Processors, including the FetchHistoryProcessor, PersistProcessor, and its subclasses, may be used together with new options on the FetchHTTP and writer processors to carry information forward between crawls and collect less duplicate content on later recrawls. The project wiki features <a href="http://webteam.archive.org/confluence/display/Heritrix/Feature+Notes+-+1.12.0" target="_top">notes on using the new duplication-reduction functionality</a>. </p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_12_0_deciderules"></a>2.4.2. DecideRules have replaced Filters on Processors</h4></div></div></div><p>All Processors which used internal Filters for differentially acting on URIs now use DecideRules instead. In those cases where a DecideRule replacement for a Filter is not yet available, a legacy Filter can be wrapped in a FilterDecideRule to preserve prior functionality. In a future release, all Filters will be removed in favor of equivalent DecideRules. </p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10083"></a>2.4.3. 
WARC</h4></div></div></div><p>ExperimentalWARCWriter has been updated to match proposed WARC version <a href="http://archive-access.svn.sourceforge.net/viewvc/*checkout*/archive-access/branches/gjm_warc_0_12/warc/warc_file_format.html" target="_top">"WARC/0.12" (revision H1.12-RC1)</a>. The implementation as of Heritrix 1.10.x remains for reference as org.archive.io.warc.v10.ExperimentalV10WARCWriterProcessor. The WARC format remains under discussion. </p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1008C"></a>2.4.4. Kw3WriterProcessor</h4></div></div></div><p>Oskar Grenholm of the Swedish National Library has contributed a module that writes the results of successful fetches to files on disk. These files are MIME-files of the type used by the Swedish National Library's <a href="http://www.kb.se/kw3/" target="_top">Kulturarw3 web harvesting</a>.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10095"></a>2.4.5. All tracked changes</h4></div></div></div><p>A dynamic list of all tracked changes marked as fixed in 1.12.0 is available at: <a href="http://webteam.archive.org/confluence/display/Heritrix/Issues+with+%27Fix+Version%27+1.12.0" target="_top">Issues with 'Fix Version' 1.12.0.</a></p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="1_12_1.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="1_10_2.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">1. Release 1.12.1 - 5/6/2007 </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 3. Release 1.10.2 - 01/15/2007</td></tr></table></div></body></html>