<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>9.&nbsp;Release 1.0.0 - 2004-08-06</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix Release Notes"><link rel="up" href="index.html" title="Heritrix Release Notes"><link rel="prev" href="1_0_2.html" title="8.&nbsp;Release 1.0.2 - 2004-09-14"><link rel="next" href="0_10_0.html" title="10.&nbsp;Release 0.10.0 - 2004-06-046"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">9.&nbsp;Release 1.0.0 - 2004-08-06</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="1_0_2.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="0_10_0.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="1_0_0"></a>9.&nbsp;Release 1.0.0 - 2004-08-06</h2></div></div></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Added new prefix ('SURT') scope and filter, compression of      recovery log, mass adding of URIs to running crawler, crawling via a      http proxy, adding of headers to request, improved out-of-the-box      defaults, hash of content to crawl log and to arcreader output, and many      bug fixes.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_0_0_limitations"></a>9.1.&nbsp;Known Limitations</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="upper_bounds"></a>9.1.1.&nbsp;Crawl Size Upper Bounds</h4></div></div></div><p>Heritrix 1.0.0 uses disk-based queues to hold any number of        pending URIs bounded only by 
available disk space, but still relies on in-memory structures to efficiently track all discovered hosts and previously-scheduled URIs. Crawls whose total scheduled URIs or discovered hosts exhaust all available memory will trigger out-of-memory errors, which freeze a crawl at the point of the error.</p><p>With the default settings, and an assignment of a 256MB Java heap to the Heritrix process, crawling which discovers up to 10 000 hosts, and schedules over 6 000 000 URIs, should be possible. Discovery of higher numbers of URIs/hosts will likely trigger out-of-memory problems unless a larger Java heap is assigned at startup.</p><p>Broad crawls -- those using the BroadScope or ranging over domains with many subdomains -- can easily and quickly exceed these parameters. Thus broad crawls in Heritrix 1.0.0 are not recommended, except for experimental purposes.</p><p>Narrower crawls, restricted to specific hosts or to domains with a limited number of subdomains, can run for a week or more, collecting millions of resources. Larger heaps can allow crawls to run into the tens of millions of collected URIs, and tens of thousands of discovered hosts.</p><p>An experimental alternate Frontier, the DiskIncludedFrontier, is also available via the 'Modules' crawl configuration tab. It uses a capped amount of memory plus disk storage to remember any number of scheduled URIs, but its performance is poor and it has not received the same testing as our default Frontier. 
The memory cost of additional discovered hosts continues to rise without limit when using a DiskIncludedFrontier.</p><p>Future versions of Heritrix will include other frontier implementations allowing larger and unbounded crawls with minimal performance penalties.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="958055"></a>9.1.2.&nbsp;<a href="https://sourceforge.net/tracker/?func=detail&aid=958055&group_id=73833&atid=539099" target="_top">[ 958055 ] Seed ConcurrentModificationException</a></h4></div></div></div><p>It's possible to get a ConcurrentModificationException when editing options on a running crawl.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N11C13"></a>9.1.2.1.&nbsp;Workaround</h5></div></div></div><p>Pause the crawl when making changes to crawl options.</p></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="984390"></a>9.1.3.&nbsp;<a href="https://sourceforge.net/tracker/?func=detail&aid=984390&group_id=73833&atid=539099" target="_top">[ 984390 ] Build fails: "rws" mode and Mac OS X interact badly</a></h4></div></div></div><p>On Macintoshes and on Linux kernel version 2.6, Heritrix fails to build (unit tests fail).</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N11C20"></a>9.1.3.1.&nbsp;Workaround</h5></div></div></div><p>See issue <a href="https://sourceforge.net/tracker/?func=detail&aid=984390&group_id=73833&atid=539099" target="_top">[ 984390 ] Build fails: "rws" mode and Mac OS X interact badly</a> for a source code edit that works around the problem.</p></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="955975"></a>9.1.4.&nbsp;<a href="https://sourceforge.net/tracker/index.php?func=detail&aid=955975&group_id=73833&atid=539099" target="_top">[ 955975 ] 
Build fails: JVM and kernel 2.6+ (Was 2 tests fail...)</a></h4></div></div></div><p>Heritrix fails to build on Linux kernel 2.6.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N11C31"></a>9.1.4.1.&nbsp;Workaround</h5></div></div></div><p>The build fails unless you use a JDK newer than 1.5 beta 2 (it works with jdk1.5.0-rc). See <a href="https://sourceforge.net/tracker/index.php?func=detail&aid=955975&group_id=73833&atid=539099" target="_top">[ 955975 ] Build fails: JVM and kernel 2.6+ (Was 2 tests fail...)</a> and <a href="#984390" target="_top">above</a>.</p></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N11C3E"></a>9.2.&nbsp;Changes</h3></div></div></div><p><div class="table"><a name="N11C42"></a><p class="title"><b>Table&nbsp;9.&nbsp;Changes</b></p><table summary="Changes" border="1"><colgroup><col><col><col></colgroup><thead><tr><th>ID</th><th>Type</th><th>Summary</th></tr></thead><tbody><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=939679" target="_top">939679</a></td><td>Add</td><td>Mass-add URIs to running crawl and force reconsideration</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=986977" target="_top">986977</a></td><td>Add</td><td>SurtPrefix scope (and filter)</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=989816" target="_top">989816</a></td><td>Add</td><td>Specification of default CharSequence charset</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=983001" target="_top">983001</a></td><td>Add</td><td>crawl.log entries all on one line</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=869584" 
target="_top">869584</a></td><td>Add</td><td>Hash content-bodies, show in logs (and future                ARCs)</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=964581" target="_top">964581</a></td><td>Add</td><td>option to preference (quick-get) embeds</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=964493" target="_top">964493</a></td><td>Add</td><td>Compress recover.log</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=988106" target="_top">988106</a></td><td>Add</td><td>[UURI] 'http:///...' converted to 'http://...'</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=926143" target="_top">926143</a></td><td>Add</td><td>enable use through HTTP proxy</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=945922" target="_top">945922</a></td><td>Add</td><td>Allow adding (subtracting?) 
http headers</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=983109" target="_top">983109</a></td><td>Add</td><td>Improved out-of-the-box defaults</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=982909" target="_top">982909</a></td><td>Add</td><td>ARCWriter makes FAT gzip header</td></tr><tr><td><a href="http://sourceforge.net/tracker/index.php?func=detail&group_id=73833&atid=539102&aid=925734" target="_top">925734</a></td><td>Add</td><td>exponential backoff URI/host retries</td></tr><tr><td>-</td><td>Fix</td><td>Total data "written" isn't necessarily written                (wording)</td></tr><tr><td>-</td><td>Fix</td><td>embeds within scope problem</td></tr><tr><td>-</td><td>Fix</td><td>NPE clearing alerts</td></tr><tr><td>-</td><td>Fix</td><td>arcmetadata repeated once for every domain                config</td></tr><tr><td>-</td><td>Fix</td><td>CCE deserializing diskqueue [Was:                IllegalArgumentExcepti...]</td></tr><tr><td>-</td><td>Fix</td><td>no docs for recovery-journal feature</td></tr><tr><td>-</td><td>Fix</td><td>Pause/Terminate ignored on 2.6 kernel 1.5 JVM</td></tr><tr><td>-</td><td>Fix</td><td>Investigate "Relative URI but no base"</td></tr><tr><td>-</td><td>Fix</td><td>User-Agent should be able to mimic Mozilla (as does                Google)</td></tr><tr><td>-</td><td>Fix</td><td>referral URL should be stored in recover.log</td></tr><tr><td>-</td><td>Fix</td><td>ToeThreads hung in FetchDNS after Pause</td></tr><tr><td>-</td><td>Fix</td><td>robots.txt lookup for different ports on same                host</td></tr><tr><td>-</td><td>Fix</td><td>Empty log percentages displayed as NaN%</td></tr><tr><td>-</td><td>Fix</td><td>UURI doubly-encodes %XX sequences</td></tr><tr><td>-</td><td>Fix</td><td>Single settings change causes two versions to be                created</td></tr><tr><td>-</td><td>Fix</td><td>New IA debian 
image is 2.6 (Was: Build fails: JVM and                ...)</td></tr><tr><td>-</td><td>Fix</td><td>NPE in PathDepthFilter</td></tr><tr><td>-</td><td>Fix</td><td>[investigate &amp; rule out] Thread report deadlock                risks</td></tr><tr><td>-</td><td>Fix</td><td>jetty susceptible to DoS attack</td></tr><tr><td>-</td><td>Fix</td><td>'ignore' robots does not ignore meta nofollow</td></tr><tr><td>-</td><td>Fix</td><td>URI Syntax Errors stop page parsing.</td></tr><tr><td>-</td><td>Fix</td><td>NPE in ExtractorHTML/TextUtils.getMatcher()</td></tr><tr><td>-</td><td>Fix</td><td>ARCReader: Failed to find GZIP MAGIC</td></tr><tr><td>-</td><td>Fix</td><td>javascript embedded URLs</td></tr><tr><td>-</td><td>Fix</td><td>NoClassDefFoundError when starting a job</td></tr><tr><td>-</td><td>Fix</td><td>Max number of deferrals hard-coded to 10.</td></tr><tr><td>-</td><td>Fix</td><td>Frontier report thread safety problems?</td></tr><tr><td>-</td><td>Fix</td><td>ARCReader hanging</td></tr><tr><td>-</td><td>Fix</td><td>log-browsing by regexp outofmemoryerror</td></tr><tr><td>-</td><td>Fix</td><td>Deferred URLs due the DNS problem --                Heritrix(-50)-Deferred</td></tr><tr><td>-</td><td>Fix</td><td>Assertion failures shouldn't be more fatal than Runtime                Exc.</td></tr><tr><td>-</td><td>Fix</td><td>min-interval is superfluous; remove</td></tr><tr><td>-</td><td>Fix</td><td>crawl doesn't end when using valence &gt; 1</td></tr><tr><td>-</td><td>Fix</td><td>Giant (in # of files) state directory                problematic</td></tr><tr><td>-</td><td>Fix</td><td>robots-expiration units, default wrong</td></tr><tr><td>-</td><td>Fix</td><td>NoSuchElementException in URI queues halts                crawling</td></tr><tr><td>-</td><td>Fix</td><td>#anchor links not trimmed, and thus recrawled</td></tr><tr><td>-</td><td>Fix</td><td>arc's filedesc file name includes .gz</td></tr><tr><td>-</td><td>Fix</td><td>[denmark-workshop] Cookie 
mangling</td></tr><tr><td>-</td><td>Fix</td><td>HttpException: Unable to parse header</td></tr><tr><td>-</td><td>Fix</td><td>bogus ARC-header when no Content-type</td></tr><tr><td>-</td><td>Fix</td><td>paths when crawling without UI</td></tr><tr><td>-</td><td>Fix</td><td>domain scope leakage</td></tr></tbody></table></div></p></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="1_0_2.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="0_10_0.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">8.&nbsp;Release 1.0.2 - 2004-09-14&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;10.&nbsp;Release 0.10.0 - 2004-06-046</td></tr></table></div></body></html>
