<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>2. Installing and running Heritrix</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="intro.html" title="1. Introduction"><link rel="next" href="wui.html" title="3. Web based user interface"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">2. Installing and running Heritrix</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="intro.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="wui.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="install"></a>2. Installing and running Heritrix</h2></div></div></div><p>This chapter will explain how to set up Heritrix. </p><p>Because Heritrix is a pure Java program it can (in theory anyway) be run on any platform that has a Java 5.0 VM. However we are only committed to supporting its operation on Linux and so this chapter only covers setup on that platform. Because of this, what follows assumes basic Linux administration skills. Other chapters in the user manual are platform agnostic.</p><p>This chapter also only covers installing and running the prepackaged binary distributions of Heritrix. For information about downloading and compiling the source see the <a href="http://crawler.archive.org/articles/developer_manual/index.html" target="_top">Developer's Manual</a>.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N1003F"></a>2.1. 
Obtaining and installing Heritrix</h3></div></div></div><p>The packaged binary can be downloaded from the project's <a href="http://sourceforge.net/projects/archive-crawler" target="_top">sourceforge home page</a>. Each release comes in four flavors, packaged as .tar.gz or .zip and including source or not.</p><p>For installation on Linux get the file <code class="filename">heritrix-?.?.?.tar.gz</code> (where ?.?.? is the most recent version number).</p><p>The packaged binary comes largely ready to run. Once downloaded it can be untarred into the desired directory.</p><p><pre class="programlisting"> % tar xfz heritrix-?.?.?.tar.gz</pre></p><p>Once you have downloaded and untarred the correct file you can move on to the next step.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10056"></a>2.1.1. System requirements</h4></div></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10059"></a>2.1.1.1. Java Runtime Environment</h5></div></div></div><p>The Heritrix crawler is implemented purely in Java. This means that the only true requirement for running it is that you have a JRE installed (Building will require a JDK).</p><p>The Heritrix crawler, since release 1.10.0, makes use of Java 5.0 features so your JRE must be at least of a 5.0 (1.5.0+) pedigree.</p><p>We currently include all of the free/open source third-party libraries necessary to run Heritrix in the distribution package. See <a href="http://crawler.archive.org/dependencies.html" target="_top">dependencies</a> for the complete list (Licenses for all of the listed libraries are listed in the dependencies section of the raw project.xml at the root of the <code class="literal">src</code> download or on Sourceforge).</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1006A"></a>2.1.1.1.1. 
Installing Java</h6></div></div></div><p>If you do not have Java installed you can download Java from:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>Sun</strong></span> -- <a href="http://java.sun.com/" target="_top">java.sun.com</a></p></li><li><p><span class="bold"><strong>IBM</strong></span> -- <a href="http://www.ibm.com/java" target="_top">www.ibm.com/java</a></p></li></ul></div></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10082"></a>2.1.1.2. Hardware</h5></div></div></div><p>The default Java heap is 256MB of RAM, which is usually suitable for crawls that range over hundreds of hosts. If you are crawling thousands of hosts or experience Java out-of-memory problems, assign more of your available RAM to the heap; see <a href="install.html#java_opts" title="2.2.1.3. JAVA_OPTS">Section 2.2.1.3, “JAVA_OPTS”</a> for how.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1008A"></a>2.1.1.3. Linux</h5></div></div></div><p>The Heritrix crawler has been built and tested primarily on Linux. It has seen some informal use on Macintosh, Windows 2000 and Windows XP, but it is not tested, packaged, or supported on platforms other than Linux at this time.</p></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N1008F"></a>2.2. 
Running Heritrix</h3></div></div></div><p>To run Heritrix, first do the following: <pre class="programlisting"> % export HERITRIX_HOME=/PATH/TO/BUILT/HERITRIX</pre>...where <code class="literal">$HERITRIX_HOME</code> is the location of your untarred <code class="filename">heritrix-?.?.?.tar.gz</code>.</p><p>Next run:<pre class="programlisting">
% cd $HERITRIX_HOME
% chmod u+x $HERITRIX_HOME/bin/heritrix
% $HERITRIX_HOME/bin/heritrix --help</pre>This should give you usage output like the following:</p><p><pre class="programlisting"><code class="computeroutput">
Usage: heritrix --help
Usage: heritrix --nowui ORDER.XML
Usage: heritrix [--port=#] [--run] [--bind=IP,IP...] --admin=LOGIN:PASSWORD \
    [ORDER.XML]
Usage: heritrix [--port=#] --selftest[=TESTNAME]
Version: @VERSION@
Options:
 -b,--bind       Comma-separated list of IP addresses or hostnames for web
                 server to listen on. Set to / to listen on all available
                 network interfaces. Default is 127.0.0.1.
 -a,--admin      Login and password for web user interface administration.
                 Required (unless passed via the 'heritrix.cmdline.admin'
                 system property). Pass value of the form 'LOGIN:PASSWORD'.
 -h,--help       Prints this message and exits.
 -n,--nowui      Put heritrix into run mode and begin crawl using ORDER.XML.
                 Do not put up web user interface.
 -p,--port       Port to run web user interface on. Default: 8080.
 -r,--run        Put heritrix into run mode. If ORDER.XML begin crawl.
 -s,--selftest   Run the integrated selftests. Pass test name to test it only
                 (Case sensitive: E.g. pass 'Charset' to run charset selftest).
Arguments:
 ORDER.XML       Crawl order to run.</code></pre>Launch the crawler with the UI enabled by doing the following (the usage output above lists <code class="literal">--admin</code> as required):</p><p><pre class="programlisting"> % $HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD</pre>This will start up Heritrix, printing out a startup message that looks like the following:</p><p><pre class="programlisting">
[b116-dyn-60 619] heritrix-0.4.0 > ./bin/heritrix
Tue Feb 10 17:03:01 PST 2004 Starting heritrix...
Tue Feb 10 17:03:05 PST 2004 Heritrix 0.4.0 is running.
Web UI is at: http://b116-dyn-60.archive.org:8080/admin
Login and password: admin/letmein</pre></p><p>See <a href="wui.html" title="3. Web based user interface">Section 3, “Web based user interface”</a> and <a href="tutorial.html" title="4. A quick guide to running your first crawl job">Section 4, “A quick guide to running your first crawl job”</a> to get your first crawl up and running.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N100BE"></a>2.2.1. Environment variables</h4></div></div></div><p>Below are the environment variables that affect Heritrix operation. </p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N100C3"></a>2.2.1.1. HERITRIX_HOME</h5></div></div></div><p>Set this environment variable to point at the Heritrix home directory. For example, if you've unpacked Heritrix in your home directory and Heritrix is sitting in the heritrix-1.0.0 directory, you'd set HERITRIX_HOME as follows (assuming your shell is bash):<pre class="programlisting"> % export HERITRIX_HOME=~/heritrix-1.0.0</pre>If you don't set this environment variable, the Heritrix start script makes a guess at the home for Heritrix. It doesn't always guess correctly.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N100CC"></a>2.2.1.2. JAVA_HOME</h5></div></div></div><p>This environment variable may already exist. It should point to the Java installation on the machine. An example of how this might be set (assuming your shell is bash):</p><p><pre class="programlisting"> % export JAVA_HOME=/usr/local/java/jre/</pre></p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="java_opts"></a>2.2.1.3. JAVA_OPTS</h5></div></div></div><p>Pass options to the Heritrix JVM by setting the JAVA_OPTS environment variable. 
For example, if you want to have Heritrix run with a larger heap, say 512MB, you could do either of the following (assuming your shell is bash):<pre class="programlisting">
% export JAVA_OPTS="-Xmx512m"
% $HERITRIX_HOME/bin/heritrix</pre>Or, you could do it all on one line as follows:<pre class="programlisting"> % JAVA_OPTS="-Xmx512m" $HERITRIX_HOME/bin/heritrix</pre></p></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N100E2"></a>2.2.2. System properties</h4></div></div></div><p>Below we document the system properties passed on the command line that can influence Heritrix's behavior. If you are using the <code class="filename">bin/heritrix</code> script to launch Heritrix, you may have to edit it to set or change these properties, or else pass them as part of JAVA_OPTS. </p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="heritrix.properties"></a>2.2.2.1. heritrix.properties</h5></div></div></div><p>Set this property to point at an alternate heritrix.properties file -- e.g. <code class="literal">-Dheritrix.properties=/tmp/alternate.properties</code> -- when you want Heritrix to use a properties file other than the one found at <code class="literal">conf/heritrix.properties</code>.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N100F5"></a>2.2.2.2. heritrix.context</h5></div></div></div><p>Provide an alternate context for the Heritrix admin UI. Usually the admin webapp is mounted on root, i.e. '/'. </p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N100FA"></a>2.2.2.3. heritrix.development</h5></div></div></div><p>Set this property when you want to run the crawler from Eclipse. This property takes no arguments. When this property is set, the <code class="literal">conf</code> and <code class="literal">webapps</code> directories