<code class="literal">.del</code> suffix. Assembling a checkpoint will often require renaming files with the <code class="literal">.del</code> suffix so they have the <code class="literal">.jdb</code> suffix, in accordance with the <code class="literal">bdbj-logs-manifest.txt</code> list (see below for more on this).</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>With this expert mode enabled, the crawler's <code class="literal">crawl-order/state-path</code> "state" directory will grow without bound; a process external to the crawler can be set up to prune the state directory of <code class="literal">.del</code> files referenced only by checkpoints that have since been superseded.</p></div><p>To enable the no-files-copy checkpoint, set the new expert-mode setting <code class="literal">checkpoint-copy-bdbje-logs</code> to <code class="literal">false</code>.</p><p>To recover using a checkpoint that has all but the bdbje log files present, you will need to copy all logs listed in <code class="literal">bdbj-logs-manifest.txt</code> to the <code class="literal">bdbje-logs</code> checkpoint subdirectory. In some cases this will necessitate renaming logs with the <code class="literal">.del</code> ending so they instead have the <code class="literal">.jdb</code> ending, as suggested above. One thing to watch for is copying too many logs into the bdbje-logs subdirectory: the list of logs must exactly match what's in the manifest file, or the recovery will fail (for example, see <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1325961&group_id=73833&atid=539099" target="_top">[1325961] resurrectOneQueueState has keys for items not in allqueues</a>).</p>
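<p>As a sketch of this assembly step, the following shell fragment copies the logs a checkpoint needs and restores their <code class="literal">.jdb</code> names. It is illustrative only: the crawl paths are hypothetical, and it assumes the manifest sits in the checkpoint directory and lists one <code class="literal">.jdb</code> file name per line.</p><pre class="programlisting">#!/bin/sh
# Hypothetical locations; substitute your own crawl's directories.
CHECKPOINT=/crawls/mycrawl/checkpoints/checkpoint-000001
STATE=/crawls/mycrawl/state

mkdir -p "$CHECKPOINT/bdbje-logs"
while read -r log; do
  if [ -f "$STATE/$log" ]; then
    # Log still carries its .jdb name; copy it straight across.
    cp "$STATE/$log" "$CHECKPOINT/bdbje-logs/$log"
  else
    # Log was renamed with the .del suffix; copy it back under
    # the .jdb name the manifest expects.
    cp "$STATE/${log%.jdb}.del" "$CHECKPOINT/bdbje-logs/$log"
  fi
done &lt; "$CHECKPOINT/bdbj-logs-manifest.txt"</pre>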
<p>On checkpoint recovery, Heritrix copies bdbje log files from the referenced checkpoint's <code class="literal">bdb-logs</code> subdirectory to the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory. As noted above, this can take some time. Of note, if a bdbje log file already exists in the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory, checkpoint recovery will not overwrite it. You can exploit this property and save on recovery time by using native unix <code class="literal">cp</code> to manually copy bdbje log files from the checkpoint directory to the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory before launching a recovery (or, at the extreme, though it will trash your checkpoint, set the checkpoint's <code class="literal">bdb-logs</code> subdirectory as the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="automated_chkpt"></a>9.4.2.&nbsp;Automated Checkpointing</h4></div></div></div><p>To have Heritrix run a checkpoint periodically, uncomment (or add) to <code class="literal">heritrix.properties</code> a line like: <pre class="programlisting">org.archive.crawler.framework.Checkpointer.period = 2</pre> This installs a Timer thread that runs a checkpoint on the given interval (units are hours). See <code class="literal">heritrix_out.log</code> for a record of the timer thread's installation and of every checkpoint run (assuming <code class="literal">org.archive.crawler.framework.Checkpointer.level</code> is set to INFO).</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="mon_com"></a>9.5.&nbsp;Remote Monitoring and Control</h3></div></div></div><p>As of release 1.4.0, Heritrix will start up the JVM's JMX Agent if deployed in a SUN 1.5.0 JVM. It password-protects the JMX Agent with whatever was specified as the Heritrix admin password, so to log in you use 'monitorRole' or 'controlRole' as the login name and the Heritrix admin password as the password. By default, the JMX Agent is started on port 8849 (to change any of the JMX settings, set the JMX_OPTS environment variable).</p><p>On startup, Heritrix checks whether a JMX Agent is running in the current context and registers itself with the first JMX Agent found, publishing attributes and operations that can be invoked remotely. If running in a SUN 1.5.0 JVM where the JVM JMX Agent has been started, Heritrix will attach to the JVM JMX Agent (if running inside JBoss, Heritrix will register with the JBoss JMX Agent).</p><p>To see what attributes and operations are available via JMX, use the SUN 1.5.0 JDK jconsole application -- it's in $JAVA_HOME/bin -- or use <a href="outside.html#jmxclient" title="9.2.4.&nbsp;cmdline-jmxclient">Section&nbsp;9.2.4, &ldquo;cmdline-jmxclient&rdquo;</a>.</p><p>To learn more about SUN 1.5.0 JDK JMX management and jconsole, see <a href="http://java.sun.com/j2se/1.5.0/docs/guide/management/agent.html" target="_top">Monitoring and Management Using JMX</a>. This O'Reilly article is also a good place to get started: <a href="http://www.onjava.com/pub/a/onjava/2004/09/29/tigerjmx.html" target="_top">Monitoring Local and Remote Applications Using JMX 1.2 and JConsole</a>.</p>
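<p>As a quick, illustrative sketch of the above (the port number and host name here are hypothetical, and it assumes the Heritrix launch script passes JMX_OPTS through to the JVM as the standard SUN 1.5.0 JMX system properties):</p><pre class="programlisting"># Move the JMX Agent off the default port 8849 before starting Heritrix.
export JMX_OPTS="-Dcom.sun.management.jmxremote.port=8850"

# Inspect the attributes and operations Heritrix publishes, logging in
# as 'monitorRole' (read-only) or 'controlRole' with the Heritrix
# admin password when jconsole prompts for credentials.
$JAVA_HOME/bin/jconsole crawlhost:8850</pre>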
</div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="ftp_support"></a>9.6.&nbsp;Experimental FTP Support</h3></div></div></div><p>As of release 1.10.0, Heritrix has experimental support for crawling FTP servers. To enable FTP support for your crawls, there is a configuration file change you will have to make manually.</p><p>Specifically, edit the <code class="filename">$HERITRIX_HOME/conf/heritrix.properties</code> file: remove <code class="literal">ftp</code> from the <code class="literal">org.archive.net.UURIFactory.ignored-schemes</code> property list, and add <code class="literal">ftp</code> to the <code class="literal">org.archive.net.UURIFactory.schemes</code> property list.</p>
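<p>For example, after the edit the two property lists might read as follows. This is a sketch only: the other scheme values shown are illustrative, not the shipped defaults; keep whatever your existing file lists and move only <code class="literal">ftp</code> between the two properties.</p><pre class="programlisting"># In $HERITRIX_HOME/conf/heritrix.properties
# Before (illustrative values):
#   org.archive.net.UURIFactory.schemes = http,https
#   org.archive.net.UURIFactory.ignored-schemes = mailto,clsid,res,ftp
# After:
org.archive.net.UURIFactory.schemes = http,https,ftp
org.archive.net.UURIFactory.ignored-schemes = mailto,clsid,res</pre>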
<p>After that change, you should be able to add the FetchFTP processor to your crawl using the Web UI: create a new job, click "Modules", and add FetchFTP under "Fetchers."</p><p>Note that FetchFTP is a little unusual in that it works both as a fetcher and as an extractor. If an FTP URI refers to a directory, and if FetchFTP's <code class="literal">extract-from-dirs</code> property is set to true, then FetchFTP will extract one link for every line of the directory listing. Similarly, if the <code class="literal">extract-parent</code> property is true, then FetchFTP will extract the parent directory from every FTP URI it encounters.</p><p>Also, remember that FetchFTP is experimental. As of 1.10, FetchFTP has the following known limitations:</p><div class="orderedlist"><ol type="1"><li>FetchFTP can only store directories if the FTP server supports the <code class="literal">NLST</code> command. Some older systems may not support <code class="literal">NLST</code>.</li><li>Similarly, FetchFTP uses passive-mode transfers in order to work behind firewalls. Not all FTP servers support passive mode, however.</li><li>Heritrix currently has no means of determining the mime-type of a document unless an HTTP server explicitly supplies one. Since FTP has no such concept of metadata, all documents retrieved using FetchFTP have a mime-type of <code class="literal">no-type</code>.</li><li>In the absence of a mime-type, many of the postprocessors will not work. For instance, HTMLExtractor will not extract links from an HTML file fetched with FetchFTP.</li></ol></div><p>Still, FetchFTP can be used to archive an FTP directory of tarballs, for instance. If you discover any additional problems using FetchFTP, please inform the <code class="email">&lt;<a href="mailto:archive-crawler@yahoogroups.com">archive-crawler@yahoogroups.com</a>&gt;</code> mailing list.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10F66"></a>9.7.&nbsp;Duplication Reduction Processors</h3></div></div></div><p>Starting in release 1.12.0, a number of Processors can cooperate to carry forward URI content-history information between crawls, reducing the amount of duplicate material downloaded or stored in later crawls. For more information, see the project wiki's <a href="http://webteam.archive.org/confluence/display/Heritrix/Feature+Notes+-+1.12.0" target="_top">notes on using the new duplication-reduction functionality</a>.</p></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="analysis.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="usecases.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">8.&nbsp;Analysis of jobs&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;A.&nbsp;Common Heritrix Use Cases</td></tr></table></div></body></html>
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?