<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>5. Creating jobs and profiles</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="tutorial.html" title="4. A quick guide to running your first crawl job"><link rel="next" href="config.html" title="6. Configuring jobs and profiles"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">5. Creating jobs and profiles</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="tutorial.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="config.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="creating"></a>5. Creating jobs and profiles</h2></div></div></div><p>To run a crawl, a configuration that defines it must first be created. In Heritrix such a configuration is called a <span class="bold"><strong>crawl job</strong></span>.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10250"></a>5.1. Crawl job</h3></div></div></div><p>A crawl job encompasses the configurations needed to run a single crawl. It also contains some additional elements, such as file locations and status.</p><p>Once logged into the WUI, new jobs can be created by going to the <span class="emphasis"><em>Jobs</em></span> tab. 
Once the Jobs page loads, users can create jobs by choosing one of the following three options:</p><div class="orderedlist"><ol type="1"><li><p><span class="bold"><strong>Based on existing job</strong></span></p><p>This option allows the user to create a job by basing it on any existing job, regardless of whether it has been crawled or not. This can be useful for repeating crawls or recovering a crawl that had problems (see <a href="outside.html#recover" title="9.3. Recovery of Frontier State and recover.gz">Section 9.3, “Recovery of Frontier State and recover.gz”</a>).</p></li><li><p><span class="bold"><strong>Based on a profile</strong></span></p><p>This option allows the user to create a job by basing it on any existing profile.</p></li><li><p><span class="bold"><strong>With defaults</strong></span></p><p>This option creates a new crawl job based on the default profile.</p></li></ol></div><p>Options 1 and 2 display a list of the available jobs or profiles to choose from. Initially there are two profiles and no existing jobs.</p><p>All crawl jobs are created by basing them on profiles (see <a href="creating.html#profile">Section 5.2, “Profile”</a>) or existing jobs.</p><p>Once the profile or job to base the new job on has been chosen, a simple page will appear asking for the new job's:</p><div class="orderedlist"><ol type="1"><li><p><span class="bold"><strong>Name</strong></span></p><p>The name may only contain letters, numbers, dashes (-) and underscores (_). No other characters are allowed. This name will be used to identify the crawl in the WUI, but it need not be unique. The name cannot be changed later.</p></li><li><p><span class="bold"><strong>Description</strong></span></p><p>A short description of the job. This is a free-text input box and can be edited later.</p></li><li><p><span class="bold"><strong>Seeds</strong></span></p><p>The seed URIs to use for the job. 
This list can be edited later along with the general configurations.</p></li></ol></div><p>Below these input fields there are several buttons. The last one, <span class="emphasis"><em>Submit job</em></span>, will immediately submit the job and (assuming it is properly configured) it will be ready to run (see <a href="running.html" title="7. Running a job">Section 7, “Running a job”</a>). The other buttons will take the user to the relevant configuration pages (those are covered in detail in <a href="config.html" title="6. Configuring jobs and profiles">Section 6, “Configuring jobs and profiles”</a>). Once all desired changes have been made to the configuration, click the '<span class="emphasis"><em>Submit job</em></span>' tab (usually displayed top and bottom right) to submit it to the list of waiting jobs.<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Changes made afterwards to the original jobs or profiles that a new job is based on will <span class="bold"><strong>not</strong></span> in any way affect the newly created job.</p></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Jobs based on the default profile provided with Heritrix are not ready to run <span class="emphasis"><em>as is</em></span>. Their HTTP header information must be set to valid values. See <a href="config.html#httpheaders" title="6.3.1.3. HTTP headers">Section 6.3.1.3, “HTTP headers”</a> for details.</p></div></p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N102B4"></a>5.2. Profile</h3></div></div></div><p>A profile is a template for a crawl job. It contains all the configurations that a crawl job would, but is not considered to be 'crawlable'. That is, Heritrix will not allow you to directly crawl a profile, only jobs based on profiles. 
The reason for this is that a profile may in fact be complete, but it need not be.</p><p>A common example is leaving the HTTP headers (<code class="literal">user-agent</code>, <code class="literal">from</code>) in an illegal state in a profile to force the user to input valid data. This applies to the default (<span class="emphasis"><em>default</em></span>) profile that comes with Heritrix. Other examples would be leaving the seeds list empty or not specifying some processors (such as the writer/indexer).</p><p>In general there is less error checking of profiles.</p><p>To manage profiles, go to the <span class="emphasis"><em>Profiles</em></span> tab in the WUI. That page will display a list of existing profiles. To create a new profile, select the "New profile based on it" option for the existing profile to use as a template. Much like jobs, profiles can only be created based on other profiles. It is not possible to create profiles based on existing jobs.</p><p>The process from there on mirrors the creation of jobs. A page will ask for the new profile's name, description and seeds list. Unlike job names, profile names <span class="emphasis"><em>must be unique</em></span> among profiles (a job and a profile can share the same name); otherwise the same rules apply.</p><p>The user then proceeds to the configuration pages (see <a href="config.html" title="6. 
Configuring jobs and profiles">Section 6, “Configuring jobs and profiles”</a>) to modify the behavior of the new profile from that of the parent profile.<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Even though profiles are based on other profiles, changes made to the original profiles afterwards will <span class="bold"><strong>not</strong></span> affect the new ones.</p></div></p></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="tutorial.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="config.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">4. A quick guide to running your first crawl job </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 6. Configuring jobs and profiles</td></tr></table></div></body></html>