<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>8. Release 1.4.0 - 04/28/2005</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix Release Notes"><link rel="up" href="index.html" title="Heritrix Release Notes"><link rel="prev" href="1_6_0.html" title="7. Release 1.6.0 - 12/01/2005"><link rel="next" href="1_2_0.html" title="9. Release 1.2.0 - 11/16/2004"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">8. Release 1.4.0 - 04/28/2005</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="1_6_0.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="1_2_0.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="1_4_0"></a>8. Release 1.4.0 - 04/28/2005</h2></div></div></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Much improved memory usage, new scoping/filter model, and a new revisiting frontier. Over 90 bugs fixed.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_4_0_limitations"></a>8.1. Known Limitations/Issues</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="glibc2_3_2"></a>8.1.1. Glibc 2.3.2 and NPTL</h4></div></div></div><p>NPTL is the 'new' Linux threading model. It replaces <span class="emphasis"><em>linuxthreads</em></span>, the 'old' model. You can tell you're running NPTL if your java process shows as one process only in the process listing. With linuxthreads, all java threads show as distinct Linux processes.
Linux threading is integral to glibc.</p><p>On rare occasions we've seen the crawler hang without obvious explanation when running with NPTL threading on Linux. In a thread dump of one hung crawler, threads were waiting to obtain a lock that apparently no one held. Our reading is that these rare, crawl-killing hangs are a problem in glibc 2.3.2 when running with NPTL (NPTL 0.60). (We used to hang frequently, but workarounds seem to have made lockups extremely rare.) An upgrade to glibc 2.3.3+ seems to do away with these hangs. Glibc 2.3.3 has NPTL 0.61; Fedora 3 has glibc 2.3.4. If an upgrade is not possible -- for example, the new glibc is not currently available for Debian -- you can disable NPTL and run with the old threads by setting the environment variable <code class="literal">LD_ASSUME_KERNEL=2.4.1</code> (this can be set on a per-process basis).</p><p>NPTL is usually the default threading model on Linux and is usually what you want -- threads are more lightweight, and java throughput seems to be slightly higher with NPTL enabled. There are various ways to see which threading model you are using. Run ldd on the java executable to see which shared libraries it uses, and note the location of the glibc shared library. Executing that library, usually <code class="literal">/lib/libc.so.6</code>, will list details on glibc; look in the listing for either 'nptl' or 'linuxthreads'. On Debian systems, libc.so.6 is not executable, but you can make it so.
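As an illustrative sketch (not from the original release notes), the check can be scripted: the helper below classifies the threading-model string that <code class="literal">getconf GNU_LIBPTHREAD_VERSION</code> reports, and the comments show the per-process NPTL opt-out via <code class="literal">LD_ASSUME_KERNEL</code>. The function name is our own invention.

```shell
# Classify a GNU_LIBPTHREAD_VERSION string as nptl, linuxthreads, or unknown.
classify_threading() {
  case "$1" in
    NPTL*)         echo "nptl" ;;
    linuxthreads*) echo "linuxthreads" ;;
    *)             echo "unknown" ;;
  esac
}

# On a live system, feed it the real value:
#   classify_threading "$(getconf GNU_LIBPTHREAD_VERSION)"
# To force the old linuxthreads model for a single process:
#   LD_ASSUME_KERNEL=2.4.1 java -jar heritrix.jar ...
classify_threading "NPTL 0.60"   # prints: nptl
```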
You can also do the following to determine library versions and which threading model you are using: <code class="literal">% getconf GNU_LIBC_VERSION</code> and <code class="literal">% getconf GNU_LIBPTHREAD_VERSION</code>.</p><p>See <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1086554&group_id=73833&atid=539099" target="_top">[ 1086554 ] glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...)</a> for more on the issue.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1093962"></a>8.1.2. <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1093962&group_id=73833&atid=539099" target="_top">[1093962] SSL handshake fails when server requests switch to SSL V3</a></h4></div></div></div><p>When connecting to a secure server, if the server wants to switch from SSL V2 to SSL V3 while the client is using a Sun JVM, the connection fails. See issue 1093962 for more.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="old_jobs_profiles"></a>8.1.3. Using old jobs or profiles with 1.4</h4></div></div></div><p>You'll need to make one change to get your old order.xml files and profiles to run with Heritrix 1.4.x. Below is a diff showing the change that needs to be made (the type of <code class="literal">path</code> changed from <code class="literal">string</code> to <code class="literal">stringList</code>): <pre class="programlisting">+++ order.xml 2005-02-01 13:12:34.000000000 -0800
@@ -162,7 +162,9 @@
 <string name="prefix">BT</string>
 <string name="suffix"></string>
 <integer name="max-size-bytes">100000000</integer>
- <string name="path">arcs</string>
+ <stringList name="path">
+   <string>arcs</string>
+ </stringList>
 <integer name="pool-max-active">5</integer>
 <integer name="pool-max-wait">300000</integer>
 </newObject></pre></p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="cme_frontier"></a>8.1.4. 
<a href="https://sourceforge.net/tracker/?func=detail&atid=539099&aid=1119644&group_id=73833" target="_top">[ 1119644 ] frontier ConcurrentModificationException</a></h4></div></div></div><p>Sometimes you'll get a ConcurrentModificationException when you go to view or refresh the Frontier's report page. The workaround is to retry; the page should eventually come up.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="arcfile_suffix"></a>8.1.5. New ARC file suffix</h4></div></div></div><p>Pre-release 1.2.0, ARC files currently open and being written to by the crawler were differentiated by an '.open' suffix; when the crawler finished writing, the suffix was removed. A new suffix has been introduced -- '.invalid' -- which the crawler uses to mark ARC files it thinks suspect, usually because an IOException was thrown during the writing of an ARC record. Such ARCs need to be checked for validity. Run <code class="literal">% gzip -t</code> and <code class="literal">% ARCReader --strict</code> against all files with an '.invalid' suffix -- and any unclosed '.open' files present after a crawl has ended -- to check for corruption.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1149470"></a>8.1.6. DNS lookups fail (-6 in crawl.log)</h4></div></div></div><p><a href="https://sourceforge.net/tracker/index.php?func=detail&aid=1149470&group_id=73833&atid=539099" target="_top">[1149470] all DNS attempts fail -6</a> discusses badly formatted DNS records returned on the Windows platform that Heritrix fails to parse, and it includes a pointer to a mailing-list discussion of failed lookups on non-English Windows. The issue includes a description of a workaround.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1178102"></a>8.1.7. 
FatalConfigurationException creating new job based on old</h4></div></div></div><p>Older Sun JVMs -- pre-beta3 versions of the Sun JVM 1.5.0, for instance -- had an issue copying files using NIO. Try upgrading your JVM. See <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1178102&group_id=73833&atid=539099" target="_top">[1178102] FCE on creation of new job based on job w/ overrides</a> for more on this.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="oome142"></a>8.1.8. OutOfMemoryErrors (OOMEs)</h4></div></div></div><p>Unusual pages -- pages of unorthodox structure, or pages that contain thousands upon thousands of links -- will on occasion produce OOMEs.</p><p>There have been improvements regarding memory usage when running multiple jobs in series (<a href="1_2_0.html#oome_pending_jobs" title="9.1.3. Running more than one job in series throws OOME">Section 9.1.3, “Running more than one job in series throws OOME”</a>), but starting up a new job after a long-running job can still prompt OOMEs. The workaround for now is to restart Heritrix between the running of big jobs.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_4_0_changes"></a>8.2. Changes</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="bdbfrontier"></a>8.2.1. Berkeley DB Based Frontier</h4></div></div></div><p>The BdbFrontier -- a frontier that keeps its queues of URIs in <a href="http://www.sleepycat.com/products/je.shtml" target="_top">Berkeley DB Java Edition</a> databases -- has been made the default Frontier. Other core data structures, such as the list of 'already seen' URIs, have also been moved into bdbje databases.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="dns_arc_ip"></a>8.2.2. 
The IP in dns ARC Records</h4></div></div></div><p>DNS entries in ARCs look like this: <pre class="programlisting">dns:www.archive.org 207.241.238.254 20050310233154 text/dns 58
20050310233154
www.archive.org. 1600 IN A 207.241.224.241</pre> The above record is for the lookup of www.archive.org.</p><p>Prior to 1.4.0, the IP used on the ARC record metaline -- the first line of an ARC record entry (207.241.238.254 in the above example) -- was the IP of the host looked up. As of 1.4.0, we write the IP of the DNS server that returned the answer; before this change, the DNS server IP was not recorded at all.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="arf"></a>8.2.3. AdaptiveRevisitFrontier</h4></div></div></div><p>A new, experimental Frontier with a configurable revisiting policy, tools for noticing page change, etc.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="dr"></a>8.2.4. DecidingScope and DecidingFilter</h4></div><div><h5 class="subtitle">A.K.A New Scoping Model</h5></div></div></div><p>A new, experimental scope and filter that allow the user to pick and choose from an assortment of ready-made decision rules and have each rule applied in an orderable sequence. The last non-PASS decision stands as the aggregate decision for the decide rule sequence.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="mem_improvements"></a>8.2.5. Crawl Size Upper Bounds Update</h4></div></div></div><p>Memory usage has been improved in this release. RAM-based data structures that previously grew without bound are now disk-backed, kept in Berkeley DB databases. Where previously (see <a href="1_0_0.html#upper_bounds" title="12.1.1. Crawl Size Upper Bounds">Section 12.1.1, “Crawl Size Upper Bounds”</a>) Heritrix was unsuited for broad crawling, it can now, while still experimental, run with default memory settings -- a heap of 256m -- broad crawls of 5 to 6 days before encountering