more work: to recover from a checkpoint, the bdbje log files need to be manually assembled in the checkpoint <code class="literal">bdb-logs</code> subdirectory. You'll know which bdbje log files make up the checkpoint because Heritrix writes the list of the checkpoint's bdbje logs to a file named <code class="literal">bdbj-logs-manifest.txt</code> in the checkpoint directory. To prevent bdbje from removing log files that might be needed to assemble a checkpoint made at some time in the past, when running expert mode checkpointing, we configure bdbje not to delete logs when it's finished with them; instead, bdbje gives logs it's no longer using a <code class="literal">.del</code> suffix. Assembling a checkpoint will often require renaming files with the <code class="literal">.del</code> suffix so they have the <code class="literal">.jdb</code> suffix in accordance with the <code class="literal">bdbj-logs-manifest.txt</code> list (see below for more on this). </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>With this expert mode enabled, the crawler <code class="literal">crawl-order/state-path</code> "state" directory will grow without bound; a process external to the crawler can be set to prune the state directory of <code class="literal">.del</code> files referenced by checkpoints that have since been superseded. </p></div><p>To enable the no-files copy checkpoint, set the new expert mode setting <code class="literal">checkpoint-copy-bdbje-logs</code> to <code class="literal">false</code>. </p><p>To recover using a checkpoint that has all but the bdbje log files present, you will need to copy all logs listed in <code class="literal">bdbj-logs-manifest.txt</code> to the <code class="literal">bdbje-logs</code> checkpoint subdirectory. In some cases this will require renaming logs that have the <code class="literal">.del</code> ending so that they have the <code class="literal">.jdb</code> ending instead, as described above. 
One thing to watch for is copying too many logs into the bdbje logs subdirectory. The list of logs must match exactly what's in the manifest file. Otherwise, the recovery will fail (for example, see <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=1325961&group_id=73833&atid=539099" target="_top">[1325961] resurrectOneQueueState has keys for items not in allqueues</a>). </p><p>On checkpoint recovery, Heritrix copies bdbje log files from the referenced checkpoint <code class="literal">bdb-logs</code> subdirectory to the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory. As noted above, this can take some time. Note that if a bdbje log file already exists in the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory, checkpoint recovery will not overwrite the existing bdbje log file. Exploit this property and save on recovery time by using the native unix <code class="literal">cp</code> to manually copy bdbje log files from the checkpoint directory to the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory before launching a recovery (or, at the extreme, though it will trash your checkpoint, set the checkpoint's <code class="literal">bdb-logs</code> subdirectory as the new crawl's <code class="literal">crawl-order/state-path</code> "state" directory).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="automated_chkpt"></a>9.4.2. Automated Checkpointing</h4></div></div></div><p>To have Heritrix run a checkpoint periodically, uncomment (or add) in <code class="literal">heritrix.properties</code> a line like: <pre class="programlisting">org.archive.crawler.framework.Checkpointer.period = 2</pre> This will install a timer thread that runs a checkpoint on that interval (units are hours). 
See <code class="literal">heritrix_out.log</code> for a log entry recording the installation of the timer thread that runs the periodic checkpoint, and for a log entry every time it runs (assuming <code class="literal">org.archive.crawler.framework.Checkpointer.level</code> is set to INFO).</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="mon_com"></a>9.5. Remote Monitoring and Control</h3></div></div></div><p>As of release 1.4.0, Heritrix will start up the JVM's JMX Agent if deployed in a SUN 1.5.0 JVM. It password-protects the JMX Agent using whatever was specified as the Heritrix admin password, so to log in, use 'monitorRole' or 'controlRole' as the login and the Heritrix admin password as the password. By default, the JMX Agent is started on port 8849 (to change any of the JMX settings, set the JMX_OPTS environment variable).</p><p>On startup, Heritrix checks whether a JMX Agent is running in the current context and registers itself with the first JMX Agent found, publishing attributes and operations that can be run remotely. If running in a SUN 1.5.0 JVM where the JVM JMX Agent has been started, Heritrix will attach to the JVM JMX Agent (if running inside JBOSS, Heritrix will register with the JBOSS JMX Agent). </p><p>To see what Attributes and Operations are available via JMX, use the SUN 1.5.0 JDK jconsole application (it's in $JAVA_HOME/bin) or use <a href="outside.html#jmxclient" title="9.2.4. cmdline-jmxclient">Section 9.2.4, “cmdline-jmxclient”</a>.</p><p>To learn more about SUN 1.5.0 JDK JMX management and jconsole, see <a href="http://java.sun.com/j2se/1.5.0/docs/guide/management/agent.html" target="_top">Monitoring and Management Using JMX</a>. 
This O'Reilly article is also a good place to get started: <a href="http://www.onjava.com/pub/a/onjava/2004/09/29/tigerjmx.html" target="_top">Monitoring Local and Remote Applications Using JMX 1.2 and JConsole</a>.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="ftp_support"></a>9.6. Experimental FTP Support</h3></div></div></div><p>As of release 1.10.0, Heritrix has experimental support for crawling FTP servers. To enable FTP support for your crawls, you will have to make a configuration file change manually.</p><p>Specifically, you will have to edit the <code class="filename">$HERITRIX_HOME/conf/heritrix.properties</code> file. Remove <code class="literal">ftp</code> from the <code class="literal">org.archive.net.UURIFactory.ignored-schemes</code> property list. Also, you must add <code class="literal">ftp</code> to the <code class="literal">org.archive.net.UURIFactory.schemes</code> property list. </p><p>After that change, you should be able to add the FetchFTP processor to your crawl using the Web UI. Just create a new job, click "Modules", and add FetchFTP under "Fetchers."</p><p>Note that FetchFTP is a little unusual in that it works both as a fetcher and as an extractor. If an FTP URI refers to a directory, and if FetchFTP's <code class="literal">extract-from-dirs</code> property is set to true, then FetchFTP will extract one link for every line of the directory listing. Similarly, if the <code class="literal">extract-parent</code> property is true, then FetchFTP will extract the parent directory from every FTP URI it encounters.</p><p>Also, remember that FetchFTP is experimental. As of 1.10, FetchFTP has the following known limitations:</p><div class="orderedlist"><ol type="1"><li>FetchFTP can only store directories if the FTP server supports the <code class="literal">NLIST</code> command. 
Some older systems may not support <code class="literal">NLIST</code>.</li><li>Similarly, FetchFTP uses passive mode transfers so that it can work behind firewalls. Not all FTP servers support passive mode, however.</li><li>Heritrix currently has no means of determining the mime-type of a document unless an HTTP server explicitly mentions one. Since FTP has no concept of metadata, all documents retrieved using FetchFTP have a mime-type of <code class="literal">no-type</code>. </li><li>In the absence of a mime-type, many of the postprocessors will not work. For instance, HTMLExtractor will not extract links from an HTML file fetched with FetchFTP. </li></ol></div><p>Still, FetchFTP can be used to archive an FTP directory of tarballs, for instance. If you discover any additional problems using FetchFTP, please inform the <code class="email">&lt;<a href="mailto:archive-crawler@yahoogroups.com">archive-crawler@yahoogroups.com</a>&gt;</code> mailing list.</p></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="analysis.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="usecases.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">8. Analysis of jobs </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> A. Common Heritrix Use Cases</td></tr></table></div></body></html>