install.html

          you want heritrix to use a properties file other than that found at <code class="literal">conf/heritrix.properties</code>.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N100FD"></a>2.2.2.2.&nbsp;heritrix.context</h5></div></div></div><p>Provide an alternate context for the Heritrix admin UI. Usually the admin webapp is mounted at root, i.e. '/'.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10102"></a>2.2.2.3.&nbsp;heritrix.development</h5></div></div></div><p>Set this property when you want to run the crawler from Eclipse. This property takes no arguments. When this property is set, the <code class="literal">conf</code> and <code class="literal">webapps</code> directories will be found in their development locations, and startup messages will show on the text console (standard out).</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1010F"></a>2.2.2.4.&nbsp;heritrix.home</h5></div></div></div><p>Where heritrix is homed; usually passed by the heritrix launch script.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10114"></a>2.2.2.5.&nbsp;heritrix.out</h5></div></div></div><p>Where stdout/stderr are sent; usually <code class="filename">heritrix_out.log</code>, passed by the heritrix launch script.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10119"></a>2.2.2.6.&nbsp;heritrix.version</h5></div></div></div><p>Version of heritrix, set by the heritrix build into heritrix.properties.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1011E"></a>2.2.2.7.&nbsp;heritrix.jobsdir</h5></div></div></div><p>Where to drop heritrix jobs. Usually left unset.
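</p><p>As with the other system properties in this section, a value can be supplied via JAVA_OPTS when invoking the launch script. A hypothetical sketch (the <code class="literal">/crawl/jobs</code> path is an example, not a default):</p><pre class="programlisting">  % JAVA_OPTS="-Dheritrix.jobsdir=/crawl/jobs" \
      ./bin/heritrix --no-wui order.xml</pre><p>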
Default location is <code class="literal">${HERITRIX_HOME}/jobs</code>.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="hconf"></a>2.2.2.8.&nbsp;heritrix.conf</h5></div></div></div><p>Specify a configuration directory other than the default <code class="literal">${HERITRIX_HOME}/conf</code>.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10131"></a>2.2.2.9.&nbsp;heritrix.cmdline</h5></div></div></div><p>This set of system properties is rarely used. They are for use when Heritrix has NOT been started from the command line -- e.g. it has been embedded in another application -- and the startup configuration usually set by command-line options must instead be supplied via system properties alone.</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10136"></a>2.2.2.9.1.&nbsp;heritrix.cmdline.admin</h6></div></div></div><p>Value is a colon-delimited username and password String for the admin GUI.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1013B"></a>2.2.2.9.2.&nbsp;heritrix.cmdline.nowui</h6></div></div></div><p>If set to true, prevents the embedded web server crawler-control interface from starting up.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10140"></a>2.2.2.9.3.&nbsp;heritrix.cmdline.order</h6></div></div></div><p>If set to a file path, the specified crawl order XML file will be used.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10145"></a>2.2.2.9.4.&nbsp;heritrix.cmdline.port</h6></div></div></div><p>Value is the port to run the GUI on.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a
name="N1014A"></a>2.2.2.9.5.&nbsp;heritrix.cmdline.run</h6></div></div></div><p>If true, the crawler is set into run mode on startup.</p></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1014F"></a>2.2.2.10.&nbsp;javax.net.ssl.trustStore</h5></div></div></div><p>Heritrix has its own trust store at <code class="literal">conf/heritrix.cacerts</code> that it uses if the <code class="literal">FetcherHTTP</code> is configured to use a trust level other than <span class="emphasis"><em>open</em></span> (open is the default setting). In the unusual case where you'd like to have Heritrix use an alternate truststore, point at the alternate by supplying the JSSE <code class="literal">javax.net.ssl.trustStore</code> property on the command line.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10163"></a>2.2.2.11.&nbsp;java.util.logging.config.file</h5></div></div></div><p>The Heritrix <code class="filename">conf</code> directory includes a file named <code class="filename">heritrix.properties</code>. A section of this file specifies the default Heritrix logging configuration. To override these settings, point <code class="literal">java.util.logging.config.file</code> at a properties file with an alternate logging configuration. Below we reproduce the logging section of the default <code class="filename">heritrix.properties</code> for reference:<pre class="programlisting">  # Basic logging setup; to console, all levels
  handlers= java.util.logging.ConsoleHandler
  java.util.logging.ConsoleHandler.level= ALL

  # Default global logging level: only warnings or higher
  .level= WARNING

  # currently necessary (?) for standard logs to work
  crawl.level= INFO
  runtime-errors.level= INFO
  uri-errors.level= INFO
  progress-statistics.level= INFO
  recover.level= INFO

  # HttpClient is too chatty... only want to hear about severe problems
  org.apache.commons.httpclient.level= SEVERE</pre>Here's an example of how you might specify an override:<pre class="programlisting">  % JAVA_OPTS="-Djava.util.logging.config.file=heritrix.properties" \
      ./bin/heritrix --no-wui order.xml</pre></p><p>Alternatively, you could edit the default file.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10183"></a>2.2.2.12.&nbsp;java.io.tmpdir</h5></div></div></div><p>Specify an alternate tmp directory. Default is <code class="literal">/tmp</code>.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10188"></a>2.2.2.13.&nbsp;com.sun.management.jmxremote.port</h5></div></div></div><p>The port to start the JMX agent on. Default is 8849. See also the environment variable JMX_PORT.</p></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="security"></a>2.3.&nbsp;Security Considerations</h3></div></div></div><p>The crawler is a large and active network application that presents security implications, both locally, for the machine where it operates, and remotely, for the machines it contacts.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10193"></a>2.3.1.&nbsp;Local to the Crawling Machine</h4></div></div></div><p>It is important to recognize that the web UI (discussed in <a href="wui.html" title="3.&nbsp;Web based user interface">Section&nbsp;3, &ldquo;Web based user interface&rdquo;</a>) and JMX agent (discussed in <a href="outside.html#mon_com" title="9.5.&nbsp;Remote Monitoring and Control">Section&nbsp;9.5, &ldquo;Remote Monitoring and Control&rdquo;</a>) allow remote control of the crawler process in ways that might
potentially disrupt a crawl, change the crawler's        behavior, read or write locally-accessible files, and perform or        trigger other actions in the Java VM or local machine.</p><p>The administrative login and password are currently only a very        mild protection against unauthorized access, unless you take        additional steps to prevent access to the crawler machine. We strongly        recommend some combination of the following practices:</p><p><span class="bold"><strong>First,</strong></span> use network        configuration tools, like a firewall, to only allow trusted remote        hosts to contact the web UI and, if applicable, JMX agent ports. (The        default web UI port is 8080; JMX is 8849.)</p><p><span class="bold"><strong>Second,</strong></span> use a strong and unique        username/password combination to secure the web UI and JMX agent.        However, keep in mind that the default administrative web server uses        plain HTTP for access, so these values are susceptible to        eavesdropping in transit if network links between your browser and the        crawler are compromised. (An upcoming update will change the default        to HTTPS.) Also, setting the username/password on the command-line may        result in their values being visible to other users of the crawling        machine, and they are additionally printed to the console and        heritrix_out.log for operator reference.</p><p><span class="bold"><strong>Third,</strong></span> run the crawler as a        user with the minimum privileges necessary for its operation, so that        in the event of unauthorized access to the web UI or JMX agent, the        potential damage is limited.</p><p>Successful unauthorized access to the web UI or JMX agent could        trivially end or corrupt a crawl, or change the crawler's behavior to        be a nuisance to other network hosts. 
By adjusting configuration paths, unauthorized access could potentially delete, corrupt, or replace files accessible to the crawler process, and thus cause more extensive problems on the crawler machine.</p><p>Another potential risk is that some worst-case or maliciously crafted crawled content might, in combination with crawler bugs, disrupt the crawl or damage other files and operations of the local system. For example, in the past, even without malicious intent, some rich-media content has caused runaway memory use in third-party libraries used by the crawler, resulting in a memory-exhaustion condition that can stop or corrupt a crawl in progress. Similarly, atypical input patterns have at times caused runaway CPU use by crawler link-extraction regular expressions, severely slowing crawls. Crawl operators should monitor their crawls closely and stay informed, via the project discussion list and bug database, of any newly discovered similar bugs.</p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="intro.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="wui.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">1.&nbsp;Introduction&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;3.&nbsp;Web based user interface</td></tr></table></div></body></html>
