<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>8. Release 1.4.0 - 04/28/2005</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix Release Notes"><link rel="up" href="index.html" title="Heritrix Release Notes"><link rel="prev" href="1_6_0.html" title="7. Release 1.6.0 - 12/01/2005"><link rel="next" href="1_2_0.html" title="9. Release 1.2.0 - 11/16/2004"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">8. Release 1.4.0 - 04/28/2005</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="1_6_0.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="1_2_0.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="1_4_0"></a>8. Release 1.4.0 - 04/28/2005</h2></div></div></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Much improved memory usage, new scoping/filter model, and a new revisiting frontier. Over 90 bugs fixed.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_4_0_limitations"></a>8.1. Known Limitations/Issues</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="glibc2_3_2"></a>8.1.1. Glibc 2.3.2 and NPTL</h4></div></div></div><p>NPTL is the 'new' Linux threading model. It replaces <span class="emphasis"><em>linuxthreads</em></span>, the 'old' model. You can tell you're running NPTL if your java process shows as one process only in the process listing. With linuxthreads, all java threads show as distinct Linux processes.
Linux threading is integral to glibc.</p><p>On rare occasions we've seen the crawler hang without obvious explanation when running with NPTL threading on Linux. In a thread dump of one hung crawler, threads were waiting to obtain a lock that apparently no one held. Our reading is that these rare, crawl-killing hangs are a problem in glibc 2.3.2 when running with NPTL (NPTL 0.60). (We used to hang frequently, but workarounds seem to have made lockups extremely rare.) An upgrade to glibc 2.3.3+ seems to do away with these hangs. Glibc 2.3.3 has NPTL 0.61; Fedora 3 has glibc 2.3.4. If an upgrade is not possible -- for example, the new glibc is not currently available for Debian -- you can disable NPTL and run with the old threads by setting the environment variable <code class="literal">LD_ASSUME_KERNEL=2.4.1</code> (this can be set on a per-process basis).</p><p>NPTL is usually the default threading model on Linux and is usually what you want -- threads are more lightweight, and java throughput seems to be slightly higher with NPTL enabled. There are various ways to see which threading model you are using. Run ldd on the java executable to see which shared libraries it uses, and note the location of the glibc shared library. Executing that library, usually <code class="literal">/lib/libc.so.6</code>, will list details on glibc; look in the listing for either 'nptl' or 'linuxthreads'. On Debian systems, libc.so.6 is not executable, but you can make it so.
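As an illustrative sketch (not from the original release notes), the check can be scripted: the helper below classifies the threading-model string that <code class="literal">getconf GNU_LIBPTHREAD_VERSION</code> reports, and the comments show the per-process NPTL opt-out via <code class="literal">LD_ASSUME_KERNEL</code>. The function name is our own invention.

```shell
# Classify a GNU_LIBPTHREAD_VERSION string as nptl, linuxthreads, or unknown.
classify_threading() {
  case "$1" in
    NPTL*)         echo "nptl" ;;
    linuxthreads*) echo "linuxthreads" ;;
    *)             echo "unknown" ;;
  esac
}

# On a live system, feed it the real value:
#   classify_threading "$(getconf GNU_LIBPTHREAD_VERSION)"
# To force the old linuxthreads model for a single process:
#   LD_ASSUME_KERNEL=2.4.1 java -jar heritrix.jar ...
classify_threading "NPTL 0.60"   # prints: nptl
```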
You can also do the following to determine library versions and which threading model you are using: <code class="literal">% getconf GNU_LIBC_VERSION</code> and <code class="literal">% getconf GNU_LIBPTHREAD_VERSION</code>.</p><p>See <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1086554&group_id=73833&atid=539099" target="_top">[ 1086554 ] glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...)</a> for more on the issue.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1093962"></a>8.1.2. <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1093962&group_id=73833&atid=539099" target="_top">[1093962] SSL handshake fails when server requests switch to SSL V3</a></h4></div></div></div><p>When connecting to a secure server, if the server wants to switch from SSL V2 to SSL V3 while the client is using a Sun JVM, the connection fails. See issue 1093962 for more.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="old_jobs_profiles"></a>8.1.3. Using old jobs or profiles with 1.4</h4></div></div></div><p>You'll need to make one change to get your old order.xml files and profiles to run with Heritrix 1.4.x. Below is a diff showing the change that needs to be made (the type of <code class="literal">path</code> changed from <code class="literal">string</code> to <code class="literal">stringList</code>): <pre class="programlisting">+++ order.xml 2005-02-01 13:12:34.000000000 -0800
@@ -162,7 +162,9 @@
 <string name="prefix">BT</string>
 <string name="suffix"></string>
 <integer name="max-size-bytes">100000000</integer>
- <string name="path">arcs</string>
+ <stringList name="path">
+   <string>arcs</string>
+ </stringList>
 <integer name="pool-max-active">5</integer>
 <integer name="pool-max-wait">300000</integer>
 </newObject></pre></p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="cme_frontier"></a>8.1.4. 
<a href="https://sourceforge.net/tracker/?func=detail&atid=539099&aid=1119644&group_id=73833" target="_top">[ 1119644 ] frontier ConcurrentModificationException</a></h4></div></div></div><p>Sometimes you'll get a ConcurrentModificationException when you go to view or refresh the Frontier's report page. The workaround is to retry; the page should eventually come up.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="arcfile_suffix"></a>8.1.5. New ARC file suffix</h4></div></div></div><p>Pre-release 1.2.0, ARC files currently open and being written to by the crawler were differentiated by an '.open' suffix; when the crawler finished writing, the suffix was removed. A new suffix has been introduced -- '.invalid' -- which the crawler uses to mark ARC files it thinks suspect, usually because an IOException was thrown during the writing of an ARC record. Such ARCs need to be checked for validity. Run <code class="literal">% gzip -t</code> and <code class="literal">% ARCReader --strict</code> against all files with an '.invalid' suffix -- and any unclosed '.open' files present after a crawl has ended -- to check for corruption.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1149470"></a>8.1.6. DNS lookups fail (-6 in crawl.log)</h4></div></div></div><p><a href="https://sourceforge.net/tracker/index.php?func=detail&aid=1149470&group_id=73833&atid=539099" target="_top">[1149470] all DNS attempts fail -6</a> discusses badly formatted DNS records returned on the Windows platform that Heritrix fails to parse, and it includes a pointer to a mailing-list discussion of failed lookups on non-English Windows. The issue includes a description of a workaround.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="1178102"></a>8.1.7. 
FatalConfigurationException creating new job based on old</h4></div></div></div><p>Older Sun JVMs -- pre-beta3 versions of the Sun JVM 1.5.0, for instance -- had an issue copying files using NIO. Try upgrading your JVM. See <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1178102&group_id=73833&atid=539099" target="_top">[1178102] FCE on creation of new job based on job w/ overrides</a> for more on this.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="oome142"></a>8.1.8. OutOfMemoryErrors (OOMEs)</h4></div></div></div><p>Unusual pages -- pages of unorthodox structure, or pages that contain thousands upon thousands of links -- will on occasion produce OOMEs.</p><p>There have been improvements regarding memory usage when running multiple jobs in series (<a href="1_2_0.html#oome_pending_jobs" title="9.1.3. Running more than one job in series throws OOME">Section 9.1.3, “Running more than one job in series throws OOME”</a>), but starting up a new job after a long-running job can still prompt OOMEs. The workaround for now is to restart Heritrix between the running of big jobs.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_4_0_changes"></a>8.2. Changes</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="bdbfrontier"></a>8.2.1. Berkeley DB Based Frontier</h4></div></div></div><p>The BdbFrontier -- a frontier that keeps its queues of URIs in <a href="http://www.sleepycat.com/products/je.shtml" target="_top">Berkeley DB Java Edition</a> databases -- has been made the default Frontier. Other core data structures, such as the list of 'already seen' URIs, have also been moved into bdbje databases.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="dns_arc_ip"></a>8.2.2. 
The IP in dns ARC Records</h4></div></div></div><p>DNS entries in ARCs look like this: <pre class="programlisting">dns:www.archive.org 207.241.238.254 20050310233154 text/dns 58
20050310233154
www.archive.org. 1600 IN A 207.241.224.241</pre> The above record is for the lookup of www.archive.org.</p><p>Prior to 1.4.0, the IP used on the ARC record metaline -- the first line of an ARC record entry (207.241.238.254 in the above example) -- was the IP of the host looked up. As of 1.4.0, we write the IP of the DNS server that returned the answer; before this change, the DNS server IP was not recorded at all.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="arf"></a>8.2.3. AdaptiveRevisitFrontier</h4></div></div></div><p>A new, experimental Frontier with a configurable revisiting policy, tools for noticing page change, etc.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="dr"></a>8.2.4. DecidingScope and DecidingFilter</h4></div><div><h5 class="subtitle">A.K.A New Scoping Model</h5></div></div></div><p>A new, experimental scope and filter that allow the user to pick and choose from an assortment of ready-made decision rules and have each rule applied in an orderable sequence. The last non-PASS decision stands as the aggregate decision for the decide rule sequence.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="mem_improvements"></a>8.2.5. Crawl Size Upper Bounds Update</h4></div></div></div><p>Memory usage has been improved in this release. RAM-based data structures that previously grew without bound are now disk-backed, kept in Berkeley DB databases. Where previously (see <a href="1_0_0.html#upper_bounds" title="12.1.1. Crawl Size Upper Bounds">Section 12.1.1, “Crawl Size Upper Bounds”</a>) Heritrix was unsuited for broad crawling, it can now, while still experimental, run with default memory settings -- a heap of 256m -- broad crawls of 5 to 6 days before encountering