        more work: to recover from a checkpoint, the bdbje log files need to be manually assembled in the checkpoint <code class="literal">bdb-logs</code> subdirectory. You'll know which bdbje log files make up the checkpoint because Heritrix writes the checkpoint's list of bdbje logs into the checkpoint directory, in a file named <code class="literal">bdbj-logs-manifest.txt</code>. To prevent bdbje from removing log files that might be needed to assemble a checkpoint made at some time in the past, when running expert mode checkpointing we configure bdbje not to delete logs when it is finished with them; instead, bdbje gives logs it is no longer using a <code class="literal">.del</code> suffix. Assembling a checkpoint will often require renaming files with the <code class="literal">.del</code> suffix so they have the <code class="literal">.jdb</code> suffix, in accordance with the <code class="literal">bdbj-logs-manifest.txt</code> list (see below for more on this).
        </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>With this expert mode enabled, the crawler's <code class="literal">crawl-order/state-path</code> "state" directory will grow without bound; a process external to the crawler can be set up to prune the state directory of <code class="literal">.del</code> files referenced by checkpoints that have since been superseded.
        </p></div><p>To enable the no-files-copy checkpoint, set the new expert mode setting <code class="literal">checkpoint-copy-bdbje-logs</code> to <code class="literal">false</code>.
        </p><p>To recover using a checkpoint that has all but the bdbje log files present, you will need to copy all logs listed in <code class="literal">bdbj-logs-manifest.txt</code> to the <code class="literal">bdbje-logs</code> checkpoint subdirectory.
In some cases this will require renaming logs that have the <code class="literal">.del</code> suffix so that they have the <code class="literal">.jdb</code> suffix instead, as described above. One thing to watch for is copying too many logs into the bdbje logs subdirectory. The list of logs must exactly match what is in the manifest file; otherwise, the recovery will fail (for example, see <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1325961&group_id=73833&atid=539099" target="_top">[1325961] resurrectOneQueueState has keys for items not in allqueues</a>).
        </p><p>On checkpoint recovery, Heritrix copies bdbje log files from the referenced checkpoint <code class="literal">bdb-logs</code> subdirectory to the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory. As noted above, this can take some time. Note that if a bdbje log file already exists in the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory, checkpoint recovery will not overwrite the existing bdbje log file.
Exploit this property and save on recovery time by using native unix <code class="literal">cp</code> to manually copy bdbje log files from the checkpoint directory to the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory before launching a recovery (or, at the extreme, though it will trash your checkpoint, set the checkpoint's <code class="literal">bdb-logs</code> subdirectory as the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory).</p></div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="automated_chkpt"></a>9.4.2.&nbsp;Automated Checkpointing</h4></div></div></div><p>To have Heritrix run a checkpoint periodically, uncomment (or add) a line like the following in <code class="literal">heritrix.properties</code>: <pre class="programlisting">org.archive.crawler.framework.Checkpointer.period = 2</pre> This installs a timer thread that runs a checkpoint at the configured interval (units are hours). Check <code class="literal">heritrix_out.log</code> for an entry recording the installation of the timer thread and an entry for every time it runs (assuming <code class="literal">org.archive.crawler.framework.Checkpointer.level</code> is set to INFO).</p></div></div>
<div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="mon_com"></a>9.5.&nbsp;Remote Monitoring and Control</h3></div></div></div><p>As of release 1.4.0, Heritrix will start up the JVM's JMX Agent if deployed in a SUN 1.5.0 JVM. It password-protects the JMX Agent using whatever was specified as the Heritrix admin password, so to log in, use 'monitorRole' or 'controlRole' as the login and the Heritrix admin password as the password.
By default, the JMX Agent is started on port 8849 (to change any of the JMX settings, set the JMX_OPTS environment variable).</p><p>On startup, Heritrix checks whether any JMX Agent is running in the current context and registers itself with the first JMX Agent found, publishing attributes and operations that can be run remotely. If running in a SUN 1.5.0 JVM where the JVM JMX Agent has been started, Heritrix will attach to the JVM JMX Agent (if running inside JBOSS, Heritrix will register with the JBOSS JMX Agent).
      </p><p>To see what attributes and operations are available via JMX, use the SUN 1.5.0 JDK jconsole application (it is in $JAVA_HOME/bin) or use <a href="outside.html#jmxclient" title="9.2.4.&nbsp;cmdline-jmxclient">Section&nbsp;9.2.4, &ldquo;cmdline-jmxclient&rdquo;</a>.</p><p>To learn more about SUN 1.5.0 JDK JMX management and jconsole, see <a href="http://java.sun.com/j2se/1.5.0/docs/guide/management/agent.html" target="_top">Monitoring and Management Using JMX</a>. This O'Reilly article is also a good place to get started: <a href="http://www.onjava.com/pub/a/onjava/2004/09/29/tigerjmx.html" target="_top">Monitoring Local and Remote Applications Using JMX 1.2 and JConsole</a>.</p></div>
<div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="ftp_support"></a>9.6.&nbsp;Experimental FTP Support</h3></div></div></div><p>As of release 1.10.0, Heritrix has experimental support for crawling FTP servers. To enable FTP support for your crawls, there is a configuration file change you must make manually.</p><p>Specifically, you will have to edit the <code class="filename">$HERITRIX_HOME/conf/heritrix.properties</code> file. Remove <code class="literal">ftp</code> from the <code class="literal">org.archive.net.UURIFactory.ignored-schemes</code> property list.
Also, you must add <code class="literal">ftp</code> to the <code class="literal">org.archive.net.UURIFactory.schemes</code> property list.
      </p><p>After that change, you should be able to add the FetchFTP processor to your crawl using the Web UI. Just create a new job, click "Modules", and add FetchFTP under "Fetchers."</p><p>Note that FetchFTP is a little unusual in that it works both as a fetcher and as an extractor. If an FTP URI refers to a directory, and if FetchFTP's <code class="literal">extract-from-dirs</code> property is set to true, then FetchFTP will extract one link for every line of the directory listing. Similarly, if the <code class="literal">extract-parent</code> property is true, then FetchFTP will extract the parent directory from every FTP URI it encounters.</p><p>Also, remember that FetchFTP is experimental. As of 1.10, FetchFTP has the following known limitations:</p><div class="orderedlist"><ol type="1"><li>FetchFTP can only store directories if the FTP server supports the <code class="literal">NLST</code> command. Some older systems may not support <code class="literal">NLST</code>.</li><li>Similarly, FetchFTP uses passive mode transfers so that it can work behind firewalls. Not all FTP servers support passive mode, however.</li><li>Heritrix currently has no means of determining the mime-type of a document unless an HTTP server explicitly specifies one. Since FTP has no concept of metadata, all documents retrieved using FetchFTP have a mime-type of <code class="literal">no-type</code>.
      </li><li>In the absence of a mime-type, many of the postprocessors will not work. For instance, HTMLExtractor will not extract links from an HTML file fetched with FetchFTP.
      </li></ol></div><p>Still, FetchFTP can be used to archive an FTP directory of tarballs, for instance.
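</p><p>As a rough illustration of the <code class="filename">heritrix.properties</code> edit described earlier in this section, the two property lines might end up looking something like the following. The comma-separated layout and every scheme name other than <code class="literal">ftp</code> are assumptions made for the sake of the example; keep whatever entries your own file already lists and change only where <code class="literal">ftp</code> appears.</p>
<pre class="programlisting"># Illustrative sketch only: entries other than ftp are assumed, not copied from a real file.
# ftp added to the list of supported schemes.
org.archive.net.UURIFactory.schemes = http, https, dns, ftp
# ftp no longer present in the list of ignored schemes.
org.archive.net.UURIFactory.ignored-schemes = mailto, javascript</pre>
<p>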
If you discover any additional problems using FetchFTP, please inform the <code class="email">&lt;<a href="mailto:archive-crawler@yahoogroups.com">archive-crawler@yahoogroups.com</a>&gt;</code> mailing list.</p></div></div></body></html>
