          will be found in their development locations and startup messages will show on the text console (standard out).</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10107"></a>2.2.2.4.&nbsp;heritrix.home</h5></div></div></div><p>The directory where Heritrix is installed; usually passed by the heritrix launch script.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1010C"></a>2.2.2.5.&nbsp;heritrix.out</h5></div></div></div><p>Where stdout and stderr are sent (usually heritrix_out.log); usually passed by the heritrix launch script.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10111"></a>2.2.2.6.&nbsp;heritrix.version</h5></div></div></div><p>The Heritrix version, written into heritrix.properties by the Heritrix build.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10116"></a>2.2.2.7.&nbsp;heritrix.jobsdir</h5></div></div></div><p>Directory where Heritrix jobs are placed. Usually empty. The default location is <code class="literal">${HERITRIX_HOME}/jobs</code>.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="hconf"></a>2.2.2.8.&nbsp;heritrix.conf</h5></div></div></div><p>Specifies a configuration directory other than the default <code class="literal">$HERITRIX_HOME/conf</code>.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10129"></a>2.2.2.9.&nbsp;heritrix.cmdline</h5></div></div></div><p>This set of system properties is rarely used. They are for use when Heritrix has NOT been started from the command line (e.g. it has been embedded in another application) and the startup configuration that is usually set by command-line options instead needs to be done via system properties alone.
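As a rough sketch, and assuming the embedding application is launched with an ordinary <code class="literal">java</code> command, the properties described in the subsections below could be supplied as <code class="literal">-D</code> options. The class name <code class="literal">com.example.EmbeddingApp</code>, the classpath, the credentials and the order file path are hypothetical placeholders, not values defined by Heritrix:<pre class="programlisting">  # Hypothetical launch of an application that embeds Heritrix.
  # Only the heritrix.cmdline.* property names come from this manual.
  % java -Dheritrix.cmdline.admin=admin:CHANGE_ME \
         -Dheritrix.cmdline.nowui=false \
         -Dheritrix.cmdline.port=8080 \
         -Dheritrix.cmdline.order=/path/to/order.xml \
         -Dheritrix.cmdline.run=true \
         -cp /path/to/classes com.example.EmbeddingApp</pre>Setting the same keys programmatically with <code class="literal">System.setProperty</code> before the crawler is started should be equivalent, since both routes populate the JVM's system properties.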
</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1012E"></a>2.2.2.9.1.&nbsp;heritrix.cmdline.admin</h6></div></div></div><p>Value is a colon-delimited string giving the user name and password for the admin GUI.</p></div>
<div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10133"></a>2.2.2.9.2.&nbsp;heritrix.cmdline.nowui</h6></div></div></div><p>If set to true, prevents the embedded web server's crawler control interface from starting up.</p></div>
<div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10138"></a>2.2.2.9.3.&nbsp;heritrix.cmdline.order</h6></div></div></div><p>If set to a file path, the crawler uses the specified crawl order XML file.</p></div>
<div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1013D"></a>2.2.2.9.4.&nbsp;heritrix.cmdline.port</h6></div></div></div><p>The port the GUI is to run on.</p></div>
<div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10142"></a>2.2.2.9.5.&nbsp;heritrix.cmdline.run</h6></div></div></div><p>If true, the crawler is put into run mode on startup.</p></div></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10147"></a>2.2.2.10.&nbsp;javax.net.ssl.trustStore</h5></div></div></div><p>Heritrix has its own trust store at <code class="literal">conf/heritrix.cacerts</code> that it uses if the <code class="literal">FetcherHTTP</code> is configured to use a trust level other than <span class="emphasis"><em>open</em></span> (open is the default setting).
In the unusual case where you'd like Heritrix to use an alternate trust store, point to it by supplying the JSSE <code class="literal">javax.net.ssl.trustStore</code> property on the command line: e.g.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1015B"></a>2.2.2.11.&nbsp;java.util.logging.config.file</h5></div></div></div><p>The Heritrix <code class="filename"><code class="literal">conf</code></code> directory includes a file named <code class="filename">heritrix.properties</code>. A section of this file specifies the default Heritrix logging configuration. To override these settings, point <code class="literal">java.util.logging.config.file</code> at a properties file with an alternate logging configuration. Below we reproduce the default logging settings from <code class="filename">heritrix.properties</code> for reference:<pre class="programlisting">  # Basic logging setup; to console, all levels
  handlers= java.util.logging.ConsoleHandler
  java.util.logging.ConsoleHandler.level= ALL
  # Default global logging level: only warnings or higher
  .level= WARNING
  # currently necessary (?) for standard logs to work
  crawl.level= INFO
  runtime-errors.level= INFO
  uri-errors.level= INFO
  progress-statistics.level= INFO
  recover.level= INFO
  # HttpClient is too chatty... only want to hear about severe problems
  org.apache.commons.httpclient.level= SEVERE</pre>Here's an example of how you might specify an override:<pre class="programlisting">  % JAVA_OPTS="-Djava.util.logging.config.file=heritrix.properties" \
          ./bin/heritrix --no-wui order.xml</pre></p><p>Alternatively, you could edit the default file.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1017B"></a>2.2.2.12.&nbsp;java.io.tmpdir</h5></div></div></div><p>Specify an alternate tmp directory.
Default is /tmp.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10180"></a>2.2.2.13.&nbsp;com.sun.management.jmxremote.port</h5></div></div></div><p>The port on which to start the JMX agent. Default is 8849. See also the environment variable JMX_PORT.</p></div></div></div>
<div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="security"></a>2.3.&nbsp;Security Considerations</h3></div></div></div><p>The crawler is a large and active network application with security implications both locally, for the machine where it operates, and remotely, for the machines it contacts.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1018B"></a>2.3.1.&nbsp;Local to the Crawling Machine</h4></div></div></div><p>It is important to recognize that the web UI (discussed in <a href="wui.html" title="3.&nbsp;Web based user interface">Section&nbsp;3, &ldquo;Web based user interface&rdquo;</a>) and JMX agent (discussed in <a href="outside.html#mon_com" title="9.5.&nbsp;Remote Monitoring and Control">Section&nbsp;9.5, &ldquo;Remote Monitoring and Control&rdquo;</a>) allow remote control of the crawler process in ways that might disrupt a crawl, change the crawler's behavior, read or write locally accessible files, and perform or trigger other actions in the Java VM or on the local machine.</p><p>The administrative login and password are currently only a very mild protection against unauthorized access, unless you take additional steps to prevent access to the crawler machine. We strongly recommend some combination of the following practices:</p><p><span class="bold"><strong>First,</strong></span> use network configuration tools, such as a firewall, to allow only trusted remote hosts to contact the web UI and, if applicable, JMX agent ports.
(The default web UI port is 8080; JMX is 8849.)</p><p><span class="bold"><strong>Second,</strong></span> use a strong and unique username/password combination to secure the web UI and JMX agent. However, keep in mind that the default administrative web server uses plain HTTP for access, so these values are susceptible to eavesdropping in transit if network links between your browser and the crawler are compromised. (An upcoming update will change the default to HTTPS.) Also, setting the username/password on the command line may make their values visible to other users of the crawling machine, and they are additionally printed to the console and heritrix_out.log for operator reference.</p><p><span class="bold"><strong>Third,</strong></span> run the crawler as a user with the minimum privileges necessary for its operation, so that in the event of unauthorized access to the web UI or JMX agent, the potential damage is limited.</p><p>Successful unauthorized access to the web UI or JMX agent could trivially end or corrupt a crawl, or change the crawler's behavior to be a nuisance to other network hosts. By adjusting configuration paths, an unauthorized user could delete, corrupt, or replace files accessible to the crawler process, and thus cause more extensive problems on the crawler machine.</p><p>Another potential risk is that some worst-case or maliciously crafted crawled content might, in combination with crawler bugs, disrupt the crawl or other files or operations of the local system. For example, in the past, even without malicious intent, some rich-media content has caused runaway memory use in third-party libraries used by the crawler, resulting in a memory-exhaustion condition that can stop or corrupt a crawl in progress.
Similarly, atypical input patterns have at times caused runaway CPU use by crawler link-extraction regular expressions, severely slowing crawls. Crawl operators should monitor their crawls closely and follow the project discussion list and bug database for any newly discovered bugs of this kind.</p></div></div></div></body></html>
