1_10_0.html

<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>5.&nbsp;Release 1.10.0 - 09/11/2006</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix Release Notes"><link rel="up" href="index.html" title="Heritrix Release Notes"><link rel="prev" href="1_10_1.html" title="4.&nbsp;Release 1.10.1 - 09/27/2006"><link rel="next" href="1_8_0.html" title="6.&nbsp;Release 1.8.0 - 05/05/2006"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">5.&nbsp;Release 1.10.0 - 09/11/2006</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="1_10_1.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="1_8_0.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="1_10_0"></a>5.&nbsp;Release 1.10.0 - 09/11/2006</h2></div></div></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Release 1.10.0 adds new configuration options, experimental new      protocol and format support, and lots of fixes. 
43 tracked bugs have      been fixed and 35 feature requests added.</p><p>Release 1.10.0 requires the Java facilities of JDK 1.5.x      ("Java 5").</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_10_0_contributors"></a>5.1.&nbsp;Contributors</h3></div></div></div><p>Aside from the <a href="http://crawler.archive.org/team-list.html" target="_top">usual suspects</a>,      the following contributed to this release: <div class="itemizedlist"><ul type="disc"><li><p>Eric C Jensen</p></li><li><p>Olaf Freyer</p></li><li><p>Karl Wright (of MetaCarta)</p></li><li><p>Frank McCown (of Old Dominion University)</p></li><li><p>Max Sch&ouml;fmann</p></li><li><p>S&oslash;ren Vejrup Carlsen (of Royal Library, Denmark)</p></li></ul></div></p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_10_0_limitations"></a>5.2.&nbsp;Known Limitations/Issues</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_10_0_limitations_bdb_nfs"></a>5.2.1.&nbsp;java.io.IOException: No locks available</h4></div></div></div><p>See <a href="1_8_0.html#bdb_nfs" title="6.1.1.&nbsp;java.io.IOException: No locks available">Section&nbsp;6.1.1, &ldquo;java.io.IOException: No locks available&rdquo;</a> in the 1.8.0 Release Notes.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="old_checkpoints_and_old_order_files"></a>5.3.&nbsp;Pre-1.10.0 checkpoints</h3></div></div></div><p>Checkpoints made with 1.8.0 will definitely not be recoverable with      1.10.0.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_10_0_changes"></a>5.4.&nbsp;Changes</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="admindefaults"></a>5.4.1.&nbsp;No default login/password for web UI and JMX</h4></div></div></div><p>The old default login of 'admin' 
and password of 'letmein' for        access to the crawler web UI (and JMX agent control) have been        eliminated. It is now necessary to specify an access username and        password to start Heritrix. This may be done with the -a or --admin        command-line argument or via the system property        'heritrix.cmdline.admin'. (These each take a colon-separated username        and password, like 'username:password'.)</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="localhostonly"></a>5.4.2.&nbsp;Web UI binds to localhost only by default</h4></div></div></div><p>Previously, the Jetty web server that runs the Heritrix web UI        listened on all available network interfaces. In 1.10.0, Jetty will        only bind to localhost by default. The -b or --bind command-line        argument can be used to specify a different interface or list of        interfaces to bind to instead. You may specify "-b /" to get the old        behavior -- binding on all interfaces -- but only take this step after        reading section 2.3 of the User Manual, "Security        Considerations".</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="quotaretire"></a>5.4.3.&nbsp;QuotaEnforcer 'force-retire' option</h4></div></div></div><p>The optional QuotaEnforcer processor has a new setting,        'force-retire', which is by default 'true', and changes the default        behavior of QuotaEnforcer. Previously, when a URI was noted as being        over-quota, it would be marked with a special over-quota failure code        which caused it to complete processing as an error. As a result, all        over-quota URIs would quickly be finished as errors and appear in the        crawl.log, but there would be no opportunity to raise the quota and        continue crawling.</p><p>The new default behavior instead marks the URI with a directive        requesting its frontier queue be retired. 
If the frontier supports        this directive, the URI will be returned to its queue as if never        tried, and the whole queue retired from active crawling. This offers        the opportunity to raise the quota and continue crawling the URI and        others in its queue. (All settings changes cause all retired queues to        be reevaluated.) However, the over-quota URIs will not appear as        errors in the crawl.log.</p><p>If the old behavior is preferred, set 'force-retire' to        'false'.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1_10_0canonicalization"></a>5.4.4.&nbsp;URL canonicalization changes</h4></div></div></div><p>In 1.10.0, URL canonicalization has changed in two ways. First,        the stripping of sessionids has improved [see <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1550797&group_id=73833&atid=539099" target="_top">Stripping        sessionid can leave behind doubled ampersands</a>]. Previously, if        the sessionid was in the middle of a query string, bookended by other        query parameters, canonicalization would leave behind the encasing        ampersands. For example, if the URL        <code class="literal">http://a.com/?a=1&amp;sid=00000000000000000000000000000000&amp;b=1</code>        was passed through canonicalization, the result would be        <code class="literal">http://a.com/?a=1&amp;&amp;b=1</code>. 
This has been fixed        so that the result will now be:        <code class="literal">http://a.com/?a=1&amp;b=1</code>.</p><p>The second change, <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1550805&group_id=73833&atid=539102" target="_top">[1550805]        Add stripping of coldfusion sessionids</a>, adds the new        ColdFusion sessionid stripper to the list of default canonicalization        rules.</p><p>We draw your attention to these seemingly minor changes because,        for those of you running regular crawls, with both of the above        changes in place there should be, depending on the type of crawl, a        reduction in the overall number of (duplicate) pages crawled.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="warc"></a>5.4.5.&nbsp;WARC</h4></div></div></div><p>This release includes experimental WARC readers and writers. Be        warned that neither the code nor the specification is final, so        both are subject to change with no guarantee of backward compatibility:        i.e. newer readers may not be able to read WARCs written with older        writers. See the <a href="/apidocs/org/archive/io/warc/package-summary.html" target="_top">org.archive.io.warc</a>        package documentation for more on the current state of the code, including        documentation of the initial versions of the <code class="code">Arc2Warc</code> and        <code class="code">Warc2Arc</code> tools.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N102A0"></a>5.4.6.&nbsp;FTP</h4></div></div></div><p>This release also includes experimental support for FTP. This        support is disabled by the default Heritrix configuration. 
See the        User Guide for information on how to enable FTP.</p></div><p>        <div class="table"><a name="N102A7"></a><p class="title"><b>Table&nbsp;3.&nbsp;All Tracked Changes</b></p><table summary="All Tracked Changes" border="1"><colgroup><col><col><col><col><col><col></colgroup><thead><tr><th>ID</th><th>Type</th><th>Summary</th><th>Open Date</th><th>By</th><th>Filer</th></tr></thead><tbody><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1545462&group_id=73833&atid=539102" target="_top">1545462</a>                </td><td>Add</td><td>Experimental WARC Readers and Writers</td><td>2006-08-23</td><td>stack-sf</td><td>stack-sf</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1494491&group_id=73833&atid=539102" target="_top">1494491</a>                </td><td>Add</td><td>path/role-sensitive robots (eg ignore for inline                images/css)</td><td>2006-05-24</td><td>karl-ia</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1550849&group_id=73833&atid=539102" target="_top">1550849</a>                </td><td>Add</td><td>'Implied' URI extractor (eg, YouTube)</td><td>2006-09-01</td><td>karl-ia</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1549665&group_id=73833&atid=539102" target="_top">1549665</a>                </td><td>Add</td><td>Add experimental Warc2Arc and Arc2Warc scripts</td><td>2006-08-30</td><td>stack-sf</td><td>stack-sf</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1546829&group_id=73833&atid=539102" target="_top">1546829</a>                </td><td>Add</td><td>Secure admin UI: Bind cmd-line argument</td><td>2006-08-25</td><td>karl-ia</td><td>stack-sf</td></tr><tr><td>                  <a 
href="http://sourceforge.net/tracker/index.php?func=detail&aid=1545600&group_id=73833&atid=539102" target="_top">1545600</a>                </td><td>Add</td><td>remove default admin username/password</td><td>2006-08-23</td><td>karl-ia</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1536441&group_id=73833&atid=539102" target="_top">1536441</a>                </td><td>Add</td><td>hash-based CrawlMapper</td><td>2006-08-08</td><td>karl-ia</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1535744&group_id=73833&atid=539102" target="_top">1535744</a>                </td><td>Add</td><td>force reread of disk settings (for out-of-JVM/bulk                changes)</td><td>2006-08-06</td><td>karl-ia</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1534280&group_id=73833&atid=539102" target="_top">1534280</a>                </td><td>Add</td><td>scriptable (beanshell) Processor, DecideRule                options</td><td>2006-08-03</td><td>gojomo</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1522112&group_id=73833&atid=539102" target="_top">1522112</a>                </td><td>Add</td><td>CrawlMapper skip mapping 'E'mbeds (etc)</td><td>2006-07-13</td><td>karl-ia</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1520269&group_id=73833&atid=539102" target="_top">1520269</a>                </td><td>Add</td><td>keep over-limit (-500X) URIs in queues (don't                'finish/log)</td><td>2006-07-10</td><td>karl-ia</td><td>gojomo</td></tr><tr><td>                  <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1387423&group_id=73833&atid=539102" target="_top">1387423</a>
