<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>9.&nbsp;Release 1.2.0 - 11/16/2004</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix Release Notes"><link rel="up" href="index.html" title="Heritrix Release Notes"><link rel="prev" href="1_4_0.html" title="8.&nbsp;Release 1.4.0 - 04/28/2005"><link rel="next" href="1_0_4.html" title="10.&nbsp;Release 1.0.4 - 2004-09-22"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">9.&nbsp;Release 1.2.0 - 11/16/2004</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="1_4_0.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="1_0_4.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="1_2_0"></a>9.&nbsp;Release 1.2.0 - 11/16/2004</h2></div></div></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Added IP-based politeness, configurable URI-canonicalization, and mid-fetch abort. Many bug fixes.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="1_2_0_limitations"></a>9.1.&nbsp;Known Limitations</h3></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="ibmjvm"></a>9.1.1.&nbsp;IBM JVM</h4></div></div></div><p>The IBM JVM generally performs better than Sun JVMs. It also emits more detailed heap dumps.
That said, new Heritrix 1.2.0 features may not work on the IBM JVM.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="ibmhttps"></a>9.1.1.1.&nbsp;HTTPS</h5></div></div></div><p>Heritrix 1.2.0 uses the new HttpClient 3.0x library, which allows the setting of socket read timeouts. Connections to HTTPS sites fail when running on the IBM JVM.</p><p>The IBM JVM 141 (cxia321411-20030930) throws a NullPointerException when setting TcpNoDelay: <pre class="programlisting">java.lang.NullPointerException
    at com.ibm.jsse.bf.setTcpNoDelay(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:683)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)</pre></p><p>With the IBM JVM 142, the SSL connection is reported as not open when the input streams are requested: <pre class="programlisting">java.net.SocketException: Socket is not connected
    at java.net.Socket.getInputStream(Socket.java:726)
    at com.ibm.jsse.bs.getInputStream(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:715)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1328)</pre></p><p>Newer versions of the HttpClient library may address this (the current version is alpha2).</p></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="lostjobs"></a>9.1.2.&nbsp;Jobs don't show in UI when a bunch are run back-to-back</h4></div></div></div><p>If more than one job is waiting in the queue of pending jobs, the second job often won't show in the UI: the UI says it is running, but no status bar is shown for the running job.
See <a href="https://sourceforge.net/tracker/?func=detail&aid=1024120&group_id=73833&atid=539099" target="_top">[ 1024120 ] Lost crawl job after terminate running job with jobs pending</a>. For now, the workaround is to study the running job by viewing the crawl job logs on disk (oddly, the third queued job will show in the UI again).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="oome_pending_jobs"></a>9.1.3.&nbsp;Running more than one job in series throws OOME</h4></div></div></div><p>OutOfMemoryErrors are frequent when jobs are run in series. See <a href="https://sourceforge.net/tracker/index.php?func=detail&aid=1055592&group_id=73833&atid=539099" target="_top">[ 1055592 ] terminated crawl still hogging memory, causing OOM</a>.
