will be found in their development locations and startup messages will show on the text console (standard out).</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10107"></a>2.2.2.4. heritrix.home</h5></div></div></div><p>Where Heritrix is homed; usually passed by the heritrix launch script. </p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1010C"></a>2.2.2.5. heritrix.out</h5></div></div></div><p>Where stdout/stderr are sent; usually heritrix_out.log, passed by the heritrix launch script. </p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10111"></a>2.2.2.6. heritrix.version</h5></div></div></div><p>Version of Heritrix, written into heritrix.properties by the Heritrix build. </p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10116"></a>2.2.2.7. heritrix.jobsdir</h5></div></div></div><p>Where to drop Heritrix jobs. Usually empty. Default location is <code class="literal">${HERITRIX_HOME}/jobs</code>. </p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="hconf"></a>2.2.2.8. heritrix.conf</h5></div></div></div><p>Specify a configuration directory other than the default <code class="literal">$HERITRIX_HOME/conf</code>.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10129"></a>2.2.2.9. heritrix.cmdline</h5></div></div></div><p>This set of system properties is rarely used. They are for use when Heritrix has NOT been started from the command line (e.g. it has been embedded in another application) and the startup configuration that is usually set by command-line options instead needs to be done via system properties alone. </p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1012E"></a>2.2.2.9.1. 
heritrix.cmdline.admin</h6></div></div></div><p>Value is a colon-delimited String of user name and password for the admin GUI.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10133"></a>2.2.2.9.2. heritrix.cmdline.nowui</h6></div></div></div><p>If set to true, prevents the embedded web server crawler control interface from starting up.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10138"></a>2.2.2.9.3. heritrix.cmdline.order</h6></div></div></div><p>If set to a string file path, the specified crawl order XML file is used.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1013D"></a>2.2.2.9.4. heritrix.cmdline.port</h6></div></div></div><p>Value is the port the GUI is to run on.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10142"></a>2.2.2.9.5. heritrix.cmdline.run</h6></div></div></div><p>If true, the crawler is set into run mode on startup.</p></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10147"></a>2.2.2.10. javax.net.ssl.trustStore</h5></div></div></div><p>Heritrix has its own trust store at <code class="literal">conf/heritrix.cacerts</code> that it uses if the <code class="literal">FetcherHTTP</code> is configured to use a trust level other than <span class="emphasis"><em>open</em></span> (open is the default setting). In the unusual case where you'd like to have Heritrix use an alternate trust store, point at the alternate by supplying the JSSE <code class="literal">javax.net.ssl.trustStore</code> property on the command line: e.g.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1015B"></a>2.2.2.11. 
java.util.logging.config.file</h5></div></div></div><p>The Heritrix <code class="filename"><code class="literal">conf</code></code> directory includes a file named <code class="filename">heritrix.properties</code>. A section of this file specifies the default Heritrix logging configuration. To override these settings, point <code class="literal">java.util.logging.config.file</code> at a properties file with an alternate logging configuration. Below we reproduce the default <code class="filename">heritrix.properties</code> for reference:<pre class="programlisting"># Basic logging setup; to console, all levels
handlers= java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level= ALL
# Default global logging level: only warnings or higher
.level= WARNING
# currently necessary (?) for standard logs to work
crawl.level= INFO
runtime-errors.level= INFO
uri-errors.level= INFO
progress-statistics.level= INFO
recover.level= INFO
# HttpClient is too chatty... only want to hear about severe problems
org.apache.commons.httpclient.level= SEVERE</pre>Here's an example of how you might specify an override:<pre class="programlisting">% JAVA_OPTS="-Djava.util.logging.config.file=heritrix.properties" \
      ./bin/heritrix --no-wui order.xml</pre></p><p>Alternatively, you could edit the default file.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1017B"></a>2.2.2.12. java.io.tmpdir</h5></div></div></div><p>Specify an alternate tmp directory. Default is /tmp. </p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10180"></a>2.2.2.13. com.sun.management.jmxremote.port</h5></div></div></div><p>The port on which to start the JMX agent. Default is 8849. See also the environment variable JMX_PORT. </p></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="security"></a>2.3. 
Security Considerations</h3></div></div></div><p>The crawler is a large and active network application which presents security implications, both local to the machine where it operates, and remotely for machines it contacts.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1018B"></a>2.3.1. Local to the Crawling Machine</h4></div></div></div><p>It is important to recognize that the web UI (discussed in <a href="wui.html" title="3. Web based user interface">Section 3, “Web based user interface”</a>) and JMX agent (discussed in <a href="outside.html#mon_com" title="9.5. Remote Monitoring and Control">Section 9.5, “Remote Monitoring and Control”</a>) allow remote control of the crawler process in ways that might potentially disrupt a crawl, change the crawler's behavior, read or write locally-accessible files, and perform or trigger other actions in the Java VM or local machine. </p><p>The administrative login and password are currently only a very mild protection against unauthorized access, unless you take additional steps to prevent access to the crawler machine. We strongly recommend some combination of the following practices:</p><p><span class="bold"><strong>First,</strong></span> use network configuration tools, like a firewall, to only allow trusted remote hosts to contact the web UI and, if applicable, JMX agent ports. (The default web UI port is 8080; JMX is 8849.)</p><p><span class="bold"><strong>Second,</strong></span> use a strong and unique username/password combination to secure the web UI and JMX agent. However, keep in mind that the default administrative web server uses plain HTTP for access, so these values are susceptible to eavesdropping in transit if network links between your browser and the crawler are compromised. (An upcoming update will change the default to HTTPS.) 
Also, setting the username/password on the command line may result in their values being visible to other users of the crawling machine, and they are additionally printed to the console and heritrix_out.log for operator reference.</p><p><span class="bold"><strong>Third,</strong></span> run the crawler as a user with the minimum privileges necessary for its operation, so that in the event of unauthorized access to the web UI or JMX agent, the potential damage is limited.</p><p>Successful unauthorized access to the web UI or JMX agent could trivially end or corrupt a crawl, or change the crawler's behavior to be a nuisance to other network hosts. By adjusting configuration paths, unauthorized access could potentially delete, corrupt, or replace files accessible to the crawler process, and thus cause more extensive problems on the crawler machine.</p><p>Another potential risk is that some worst-case or maliciously-crafted crawled content might, in combination with crawler bugs, disrupt the crawl, or other files or operations on the local system. For example, in the past, even without malicious intent, some rich-media content has caused runaway memory use in third-party libraries used by the crawler, resulting in a memory-exhaustion condition that can stop or corrupt a crawl in progress. Similarly, atypical input patterns have at times caused runaway CPU use by crawler link-extraction regular expressions, severely slowing crawls. Crawl operators should monitor their crawls closely and stay informed via the project discussion list and bug database for any newly discovered similar bugs. </p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="intro.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="wui.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">1. 
Introduction </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 3. Web based user interface</td></tr></table></div></body></html>