<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<html>
<head>
<title>Arale User Manual</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<style>
body {
background-color : #FFFFFF;
font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: x-small;
color: #000000;
}
td, p, li, a {
font-family: Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: x-small;
}
code, pre {
font-family: monospace;
font-size: x-small;
}
</style>
</head>

<body>

<p style="font-size:small"><b>Arale User Manual</b>

<p>
author: Flavio Tordini<br>
email: <a href="mailto:flaviotordini@tiscali.it">flaviotordini@tiscali.it</a><br>
web: <a href="http://web.tiscali.it/_flat">http://web.tiscali.it/_flat</a>

<p>
<a href="#intro">Introduction</a><br>
<a href="#get">Getting Arale</a><br>
<a href="#sys">System Requirements</a><br>
<a href="#install">Installing Arale</a><br>
<a href="#run">Running Arale</a><br>
<a href="#settings">Arale settings</a><br>
<a href="#build">Building Arale</a><br>

<p><a name="intro"></a><b>Introduction</b>
<p>
Arale is a multithreaded web spider written in Java. While many bots are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers.
<p>
With Arale you can download entire web sites or specific resources from the web. Some real-life cases are:
<ul>
<li>You want to download only images, videos, mp3 or zip files from a site.</li>
<li>Manuals, articles or ebooks are split into many files to discourage download.</li>
<li>User-unfriendly sites annoy you with popups, banners and tricky JavaScript before you can download a resource.</li>
</ul>
<p>
<i>Multithreaded</i> means that Arale can download more than one file simultaneously. Arale can easily saturate your bandwidth, so downloads run as fast as your internet connection allows.

<p>
If you're developing dynamic sites using technologies such as JSP, PHP or ASP, you may be interested in rendering dynamic pages to static files.
Arale supports URL renaming: the query string is encoded in the static filename and an .html extension is appended.
For example:
<p>
original URL: <code>mypage.jsp?myparam=myvalue</code><br>
static filename: <code>mypage.jsp!myparam=myvalue.html</code><br>
<p>
Existing links to renamed URLs are replaced with links to the renamed files. This preserves navigation among the static files.
Once a dynamic site is transformed into a set of static files, it can be deployed on a server that does not support dynamic pages. For example, you may deploy a JSP site on free web space.
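<p>
The renaming rule in the example above can be sketched in a few lines of Java. This is only an illustration, not code taken from the Arale sources: the class and method names are hypothetical, and leaving names without a query string unchanged is an assumption.
<pre>
```java
// Hypothetical helper for illustration; not part of the Arale sources.
final class StaticNameSketch {

    /**
     * Maps a dynamic name to a static filename by replacing the first '?'
     * with '!' and appending ".html". Names without a query string are
     * returned unchanged (an assumption made for this sketch).
     */
    static String toStaticName(String name) {
        int q = name.indexOf('?');
        if (q == -1) {
            return name;
        }
        return name.substring(0, q) + "!" + name.substring(q + 1) + ".html";
    }

    public static void main(String[] args) {
        // prints mypage.jsp!myparam=myvalue.html
        System.out.println(toStaticName("mypage.jsp?myparam=myvalue"));
    }
}
```
</pre>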

<p>
Currently Arale is a command-line tool. It would be nice to develop a GUI for it. I'd like to have some feedback from users, so if you think it's worthwhile, send me an email and tell me what you think. ;)


<p><a name="get"></a><b>Getting Arale</b>
<p>
The latest version of Arale can be downloaded from <a href="http://web.tiscali.it/_flat">http://web.tiscali.it/_flat</a>.
The distribution includes the Arale sources along with build scripts (see <a href="#build">Building Arale</a>).

<p><a name="sys"></a><b>System Requirements</b>
<p>
In order to run Arale, you need the Java Development Kit (JDK) or the Java Runtime Environment (JRE) installed on your system.
Arale requires Java 2. The recommended Java version for running Arale is Java 2 version 1.3 or later.
<ul>
<li><a href="http://java.sun.com/j2se/">Java Development Kit</a></li>
<li><a href="http://java.sun.com/j2se/">Java Runtime Environment</a></li>
</ul>

<p><a name="install"></a><b>Installing Arale</b>
<p>
Simply extract the Arale distribution archive to a directory. Make sure the JAVA_HOME environment variable points to the Java Development Kit installation directory.
Optionally, you may set an ARALE_OPTS environment variable. Its value contains command-line arguments that are passed to the Java Virtual Machine when starting Arale. For example, you can define properties or set the maximum Java heap size.

The following sets up the environment on Windows:
<pre>
set JAVA_HOME=c:\jdk1.3.1
set ARALE_HOME=c:\arale
set ARALE_OPTS=-mx32m
</pre>
To complete the Arale installation, run <code>windows/setup.bat</code> in the Arale installation directory. This will create shortcuts to Arale and integrate Arale with Internet Explorer. Cool!
<p>
On Unix (bash):
<pre>
export JAVA_HOME=/usr/local/jdk-1.3.1
export ARALE_HOME=/usr/local/arale
export ARALE_OPTS=-mx32m
</pre>

<p><a name="run"></a><b>Running Arale</b>
<p>
Running Arale is simple once you have installed it as described in the previous section. Just type <code>arale</code> followed by a URL.
<pre>arale http://web.tiscali.it/_flat</pre>
By default Arale reads its settings from the <code>arale.properties</code> file. You can override this behaviour by typing:
<pre>arale http://web.tiscali.it/_flat -settings mysettings.properties</pre>
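<p>
The <code>.properties</code> extension suggests that a settings file uses the standard Java properties format (one <code>key=value</code> pair per line), but that is an assumption. Under that assumption, here is a minimal sketch of how such a file could be read with <code>java.util.Properties</code>. The class and method names are hypothetical, the key names are those listed under <a href="#settings">Arale settings</a>, and the values are made up.
<pre>
```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Arale sources.
final class SettingsSketch {

    /** Parses settings text written in java.util.Properties format. */
    static Properties parse(String text) {
        Properties settings = new Properties();
        try {
            settings.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return settings;
    }

    /** Splits a space-separated token list such as ".html .gif .jpg". */
    static String[] tokens(Properties settings, String key) {
        String value = settings.getProperty(key, "").trim();
        return value.isEmpty() ? new String[0] : value.split("\\s+");
    }

    public static void main(String[] args) {
        // Example values are made up; only the key names come from the manual.
        String text = "output.directory=downloads\n"
                + "download.tokens=.html .gif .jpg .css\n"
                + "thread.count=4\n";
        Properties settings = parse(text);
        System.out.println(settings.getProperty("output.directory"));
        System.out.println(tokens(settings, "download.tokens").length);
        System.out.println(Integer.parseInt(settings.getProperty("thread.count")));
    }
}
```
</pre>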

Command-line option summary:
<pre>
Usage: arale [&lt;URL&gt;] [&lt;options&gt;]
        -settings &lt;file&gt;: Use specified property file
        -output &lt;dir&gt;: Use specified output directory
        -version: Print Arale version and exit
        -help: Print this message and exit
</pre>


<p><a name="settings"></a><b>Arale settings</b>
<ul>
<li><b>URL</b>: start URL</li>

<li><b>output.directory</b>: Arale's output directory. It may be a relative or an absolute path. Arale puts all downloaded files in subdirectories that recreate the directory structure found on the remote server.</li>

<li><b>download.tokens</b>: Arale will download URLs that contain these tokens. Tokens are separated by spaces, like this: <code>.html .gif .jpg .css</code>. What does <i>token</i> mean? A token is a series of characters Arale searches for when scanning files. When Arale finds a token specified by this parameter, it searches for the left and right limits of the hypothetical link around it. Then Arale tries to connect to that URL. If the resource is found, it is immediately downloaded to disk; otherwise Arale just keeps going.</li>

<li><b>scan.tokens</b>: Arale will scan URLs that contain these tokens. Tokens are separated by spaces. URLs containing these tokens should all have a text/html content type. Resources found with these tokens will be scanned for new links. They will not be downloaded if they are not in the download.tokens list.</li>

<li><b>force.html.scanning</b>: Force scanning of resources having a text/html content type, even if they're not listed in scan.tokens.</li>

<li><b>ensure.html.scanning</b>: Ensure that only resources having a text/html content type will be scanned. For example a dynamic resource (.jsp, .asp ...) may return any content type, not only text/html.</li>

<li><b>domain.depth</b>: How many domain levels deep Arale should follow links. 1 means no domain change. Increasing this value will dramatically increase the number of followed links. For example, 2 means Arale will crawl the starting domain plus all domains linked from the starting domain's pages.</li>

<li><b>file.minsize</b>: Minimum downloaded file size. All files smaller than this value will be discarded. -1 means Arale will ignore this setting.</li>

<li><b>file.maxsize</b>: Maximum downloaded file size. All files bigger than this value will be discarded. -1 means Arale will ignore this setting.</li>

<li><b>file.download.unknown.size</b>: Tells Arale whether to download files whose size cannot be determined in advance. Sometimes the web server will not report the file size; in that case Arale uses this setting to decide what to do. This value may be true or false.</li>

<li><b>thread.count</b>: This is the number of threads Arale will allocate. In practice this is the number of simultaneous HTTP connections. Choosing a higher value may increase processing speed, but may also stress your machine and the remote server(s). 1 is the minimum value.</li>

<li><b>pause.milliseconds</b>: The number of milliseconds to pause before starting processing the next URL. Use this setting to make Arale take a breath between URL processing. This may be useful if you're on a LAN and want to avoid creating noticeable bandwidth usage bursts. If you're rendering a dynamic site into static pages, setting this value may increase the process reliability.</li>

<li><b>rename.dynamic.files</b>: Arale will rename dynamic files such as .jsp, .asp, .php to .html. Links to renamed resources are updated as well. Enable this setting if you're rendering a dynamic site into static pages. This is also great when downloading from dynamic sites.</li>

<li><b>url.leftdelimiters</b>: Characters that delimit a URL on the left side. This is used by Arale parsing methods. You probably will not need to change this setting.</li>

<li><b>url.rightdelimiters</b>: Characters that delimit a URL on the right side. This is used by Arale parsing methods. You probably will not need to change this setting.</li>

</ul>
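<p>
The token search described under <b>download.tokens</b>, together with <b>url.leftdelimiters</b> and <b>url.rightdelimiters</b>, can be sketched like this. It is a rough illustration under stated assumptions, not Arale's actual parsing code: the class and method names are hypothetical, and the delimiter characters are guesses chosen for the example.
<pre>
```java
// Hypothetical helper for illustration; not part of the Arale sources.
final class TokenScanSketch {

    /**
     * Expands a token found at index hit into a candidate link by walking
     * left until a left delimiter (or the start of the text) and right
     * until a right delimiter (or the end of the text).
     */
    static String expand(String text, int hit, int tokenLength,
                         String leftDelims, String rightDelims) {
        int start = hit;
        while (start > 0 && leftDelims.indexOf(text.charAt(start - 1)) == -1) {
            start--;
        }
        int end = hit + tokenLength;
        while (text.length() > end && rightDelims.indexOf(text.charAt(end)) == -1) {
            end++;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "href=\"pics/cat.jpg\" src='dog.jpg'";
        String token = ".jpg";
        String leftDelims = "\"'=";   // assumed values, not Arale defaults
        String rightDelims = "\"' ";  // assumed values, not Arale defaults
        int hit = text.indexOf(token);
        while (hit >= 0) {
            // prints pics/cat.jpg, then dog.jpg
            System.out.println(expand(text, hit, token.length(), leftDelims, rightDelims));
            hit = text.indexOf(token, hit + token.length());
        }
    }
}
```
</pre>
<p>
Each candidate found this way would still have to be resolved against the page URL and requested before anything is downloaded.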


<p><a name="build"></a><b>Building Arale</b>
<p>
Arale sources are located in the archive named <code>arale-sources.zip</code>.

Arale uses the Jakarta Ant build tool.

Ant can be downloaded from the Jakarta Project site:
<a href="http://jakarta.apache.org/ant">http://jakarta.apache.org/ant</a>.
Extract the Ant archive and set the ANT_HOME environment
variable to the directory where you installed Ant.
Refer to Ant documentation for further details.

Once you're done with Ant, simply run the build batch file.
The Ant build script for Arale (<code>build.xml</code>) defines a number of targets. They are:<br>
<code>clean</code> - deletes compiled classes, javadocs, and the arale jar<br>
<code>prepare</code> - creates required directories<br>
<code>compile</code> - compiles source files<br>
<code>dist</code> - creates a jar<br>
<code>javadoc</code> - generates javadoc documentation<br>


</body>
</html>
