<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>5.&nbsp;Creating jobs and profiles</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="tutorial.html" title="4.&nbsp;A quick guide to running your first crawl job"><link rel="next" href="config.html" title="6.&nbsp;Configuring jobs and profiles"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">5.&nbsp;Creating jobs and profiles</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="tutorial.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="config.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="creating"></a>5.&nbsp;Creating jobs and profiles</h2></div></div></div><p>In order to run a crawl a configuration must be created that defines    it. In Heritrix such a configuration is called a <span class="bold"><strong>crawl job</strong></span>.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10250"></a>5.1.&nbsp;Crawl job</h3></div></div></div><p>A crawl job encompasses the configurations needed to run a single      crawl. It also contains some additional elements such as file locations,      status etc.</p><p>Once logged onto the WUI new jobs can be created by going to the      <span class="emphasis"><em>Jobs</em></span> tab. 
Once the Jobs page loads, users can create jobs by choosing one of the following three options:</p><div class="orderedlist"><ol type="1"><li><p><span class="bold"><strong>Based on existing job</strong></span></p><p>This option allows the user to create a job by basing it on any existing job, regardless of whether it has been crawled or not. This can be useful for repeating crawls or recovering a crawl that had problems. (See <a href="outside.html#recover" title="9.3.&nbsp;Recovery of Frontier State and recover.gz">Section&nbsp;9.3, &ldquo;Recovery of Frontier State and recover.gz&rdquo;</a>.)</p></li><li><p><span class="bold"><strong>Based on a profile</strong></span></p><p>This option allows the user to create a job by basing it on any existing profile.</p></li><li><p><span class="bold"><strong>With defaults</strong></span></p><p>This option creates a new crawl job based on the default profile.</p></li></ol></div><p>Options 1 and 2 will display a list of the available jobs or profiles to choose from. Initially there are two profiles and no existing jobs.</p><p>All crawl jobs are created by basing them on profiles (see <a href="creating.html#profile">Section&nbsp;5.2, &ldquo;Profile&rdquo;</a>) or existing jobs.</p><p>Once the proper profile/job has been chosen to base the new job on, a simple page will appear asking for the new job's:</p><div class="orderedlist"><ol type="1"><li><p><span class="bold"><strong>Name</strong></span></p><p>The name may contain only letters, numbers, dashes (-) and underscores (_). No other characters are allowed. This name will be used to identify the crawl in the WUI, but it need not be unique. The name cannot be changed later.</p></li><li><p><span class="bold"><strong>Description</strong></span></p><p>A short description of the job. 
This is a free-text input box and can be edited later.</p></li><li><p><span class="bold"><strong>Seeds</strong></span></p><p>The seed URIs to use for the job. This list can be edited later along with the general configurations.</p></li></ol></div><p>Below these input fields there are several buttons. The last one, <span class="emphasis"><em>Submit job</em></span>, will immediately submit the job, which (assuming it is properly configured) will then be ready to run (see <a href="running.html" title="7.&nbsp;Running a job">Section&nbsp;7, &ldquo;Running a job&rdquo;</a>). The other buttons will take the user to the relevant configuration pages (those are covered in detail in <a href="config.html" title="6.&nbsp;Configuring jobs and profiles">Section&nbsp;6, &ldquo;Configuring jobs and profiles&rdquo;</a>). Once all desired changes have been made to the configuration, click the '<span class="emphasis"><em>Submit job</em></span>' tab (usually displayed at the top and bottom right) to submit it to the list of waiting jobs.<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Changes made afterwards to the original jobs or profiles that a new job is based on will <span class="bold"><strong>not</strong></span> in any way affect the newly created job.</p></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Jobs based on the default profile provided with Heritrix are not ready to run <span class="emphasis"><em>as is</em></span>. Their HTTP header information must be set to valid values. 
See <a href="config.html#httpheaders" title="6.3.1.3.&nbsp;HTTP headers">Section&nbsp;6.3.1.3, &ldquo;HTTP headers&rdquo;</a> for details.</p></div></p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N102B4"></a>5.2.&nbsp;Profile</h3></div></div></div><p>A profile is a template for a crawl job. It contains all the configurations that a crawl job would, but is not considered to be 'crawlable'. That is, Heritrix will not allow you to crawl a profile directly, only jobs based on profiles. The reason for this is that while a profile may in fact be complete, it need not be.</p><p>A common example is leaving the HTTP headers (<code class="literal">user-agent</code>, <code class="literal">from</code>) in an invalid state in a profile to force the user to input valid data. This applies to the default (<span class="emphasis"><em>default</em></span>) profile that comes with Heritrix. Other examples would be leaving the seeds list empty or not specifying some processors (such as the writer/indexer).</p><p>In general, there is less error checking of profiles than of jobs.</p><p>To manage profiles, go to the <span class="emphasis"><em>Profiles</em></span> tab in the WUI. That page will display a list of existing profiles. To create a new profile, select the "New profile based on it" option for the existing profile that will serve as a template. Like jobs, profiles are always created from a template, but that template must be another profile. It is not possible to create profiles based on existing jobs.</p><p>The process from there on mirrors the creation of jobs. A page will ask for the new profile's name, description and seeds list. 
Unlike job names, profile names <span class="emphasis"><em>must be unique</em></span> among profiles (a job and a profile can share the same name); otherwise the same rules apply.</p><p>The user then proceeds to the configuration pages (see <a href="config.html" title="6.&nbsp;Configuring jobs and profiles">Section&nbsp;6, &ldquo;Configuring jobs and profiles&rdquo;</a>) to change the behavior of the new profile from that of the parent profile.<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Even though profiles are based on other profiles, changes made to the original profiles afterwards will <span class="bold"><strong>not</strong></span> affect the new ones.</p></div></p></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="tutorial.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="config.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">4.&nbsp;A quick guide to running your first crawl job&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;6.&nbsp;Configuring jobs and profiles</td></tr></table></div></body></html>
