<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>4.&nbsp;A quick guide to running your first crawl job</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="wui.html" title="3.&nbsp;Web based user interface"><link rel="next" href="creating.html" title="5.&nbsp;Creating jobs and profiles"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">4.&nbsp;A quick guide to running your first crawl job</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="wui.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="creating.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tutorial"></a>4.&nbsp;A quick guide to running your first crawl job</h2></div></div></div><p>Once you've installed Heritrix and logged into the WUI (see above), you are presented with the web Console page. Near the top there is a row of tabs.</p><p><span class="bold"><strong>Step 1.</strong></span> <span class="emphasis"><em>Create a job</em></span></p><p>To create a new job, choose the Jobs tab; this will take you to the Jobs page. Once there, you are presented with three options for creating a new job. Select 'With defaults'. 
This will create a new job based on the default profile (see <a href="creating.html#profile">Section&nbsp;5.2, &ldquo;Profile&rdquo;</a>).</p><p>On the next screen you will be asked to supply a name, a description, and a seed list for the new job.</p><p>For the name, supply a short text with no special characters or spaces (except dash and underscore). You can skip the description if you like. In the seed list, type the URLs of the sites you are interested in harvesting, one URL per line.</p><p>Creating a job is covered in greater detail in <a href="creating.html" title="5.&nbsp;Creating jobs and profiles">Section&nbsp;5, &ldquo;Creating jobs and profiles&rdquo;</a>.</p><p><span class="bold"><strong>Step 2.</strong></span> <span class="emphasis"><em>Configure the job</em></span></p><p>Once you've entered this information, you are ready to move on to the configuration pages. Click the <span class="emphasis"><em>Modules</em></span> button in the row of buttons at the bottom of the page.</p><p>This will take you to the modules configuration page (more details in <a href="config.html#modules" title="6.1.&nbsp;Modules (Scope, Frontier, and Processors)">Section&nbsp;6.1, &ldquo;Modules (Scope, Frontier, and Processors)&rdquo;</a>). For now we are only interested in the second option from the top, named <span class="bold"><strong>Select crawl scope</strong></span>. It allows you to specify the limits of the crawl. By default the crawl is limited to the domains that your seeds span. This may be suitable for your purposes. If not, you can choose a broad scope (not limited to the domains of its seeds) or the more restrictive host scope, which limits the crawl to the hosts that its seeds span. 
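</p><p>As a rough illustration (this is pseudocode, not Heritrix's actual scope classes), the host and domain scopes behave roughly as follows for a seed of http://www.example.com/:</p><pre class="programlisting"># Host scope: only the seed's exact host is in scope
in_host_scope(url)   = host(url) == "www.example.com"

# Domain scope: the seed's domain and all of its subdomains are in scope
in_domain_scope(url) = host(url) == "example.com"
                       or host(url) ends with ".example.com"

http://www.example.com/page.html    -&gt; host scope: yes   domain scope: yes
http://images.example.com/pic.png   -&gt; host scope: no    domain scope: yes
http://www.other.org/               -&gt; host scope: no    domain scope: no</pre><p>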
For more on scopes, refer to <a href="config.html#scopes" title="6.1.1.&nbsp;Crawl Scope">Section&nbsp;6.1.1, &ldquo;Crawl Scope&rdquo;</a>.</p><p>To change the scope, select the new one from the combo box and click the <span class="emphasis"><em>Change</em></span> button.</p><p>Next, turn your attention to the second row of tabs at the top of the page, below the usual tabs. You are currently on the far left tab. Now select the tab called <span class="emphasis"><em>Settings</em></span> near the middle of the row.</p><p>This takes you to the Settings page, which allows you to configure various details of the crawl. Exhaustive coverage of this page can be found in <a href="config.html#settings" title="6.3.&nbsp;Settings">Section&nbsp;6.3, &ldquo;Settings&rdquo;</a>. For now we are only interested in the two settings under <span class="bold"><strong>http-headers</strong></span>. These are the <code class="literal">user-agent</code> and <code class="literal">from</code> fields of the HTTP headers in the crawler's requests. You must set them to valid values before a crawl can be run. The current values mark in upper case the parts that need replacing. If you have trouble with this, refer to <a href="config.html#httpheaders" title="6.3.1.3.&nbsp;HTTP headers">Section&nbsp;6.3.1.3, &ldquo;HTTP headers&rdquo;</a> for what is regarded as a valid value.</p><p>Once you've set the <span class="bold"><strong>http-headers</strong></span> settings to proper values (and made any other desired changes), you can click the <span class="emphasis"><em>Submit job</em></span> tab at the far right of the second row of tabs. 
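</p><p>As a hypothetical example of valid <span class="bold"><strong>http-headers</strong></span> values (the URL and e-mail address here are placeholders; see the section referenced above for the precise requirements), the settings might look like:</p><pre class="programlisting">user-agent: Mozilla/5.0 (compatible; heritrix/1.14 +http://www.example.com/crawl-info)
from: webmaster@example.com</pre><p>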
The crawl job is now configured and ready to run.</p><p>Configuring a job is covered in greater detail in <a href="config.html" title="6.&nbsp;Configuring jobs and profiles">Section&nbsp;6, &ldquo;Configuring jobs and profiles&rdquo;</a>.</p><p><span class="bold"><strong>Step 3.</strong></span> <span class="emphasis"><em>Running the job</em></span></p><p>Newly submitted jobs are placed in a queue of pending jobs. The crawler does not start processing jobs from this queue until it is started; while the crawler is stopped, jobs are simply held.</p><p>To start the crawler, click on the Console tab. Once on the Console page, you will find the <span class="emphasis"><em>Start</em></span> option at the top of the <span class="bold"><strong>Crawler Status</strong></span> box, just to the right of the indicator of current status. Clicking this option will put the crawler into <span class="emphasis"><em>Crawling Jobs</em></span> mode, in which it will begin crawling the next pending job, such as the job you just created and configured.</p><p>The Console will update to display progress information about the ongoing crawl. 
Click the <span class="emphasis"><em>Refresh</em></span> option (or the    top-left Heritrix logo) to update this information.</p><p>For more information about running a job see <a href="running.html" title="7.&nbsp;Running a job">Section&nbsp;7, &ldquo;Running a job&rdquo;</a>.</p><p>Detailed information about evaluating the progress of a job can be    found in <a href="analysis.html" title="8.&nbsp;Analysis of jobs">Section&nbsp;8, &ldquo;Analysis of jobs&rdquo;</a>.</p></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="wui.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="creating.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">3.&nbsp;Web based user interface&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;5.&nbsp;Creating jobs and profiles</td></tr></table></div></body></html>
