<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>6. Configuring jobs and profiles</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="creating.html" title="5. Creating jobs and profiles"><link rel="next" href="running.html" title="7. Running a job"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">6. Configuring jobs and profiles</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="creating.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="running.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="config"></a>6. Configuring jobs and profiles</h2></div></div></div><p>Creating crawl jobs (<a href="creating.html#crawljob">Section 5.1, “Crawl job”</a>) and profiles (<a href="creating.html#profile">Section 5.2, “Profile”</a>) is just the first step. Configuring them is a more complicated process.</p><p>The following section applies equally to configuring crawl jobs and profiles. It does not matter whether new ones are being created or existing ones are being edited. The interface is almost entirely the same; only the <span class="emphasis"><em>Submit job</em></span> / <span class="emphasis"><em>Finished</em></span> button varies.<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Editing options for jobs being crawled are somewhat limited. See <a href="running.html#editrun" title="7.4. 
Editing a running job">Section 7.4, “Editing a running job”</a> for more.</p></div></p><p>Each page in the configuration section of the WUI will have a secondary row of tabs below the general ones. This secondary row is often replicated at the bottom of longer pages.</p><p>This row offers access to different parts of the configuration. While configuring the global level (more on global vs. overrides and refinements in <a href="config.html#overrides" title="6.4. Overrides">Section 6.4, “Overrides”</a> and <a href="config.html#refinements" title="6.5. Refinements">Section 6.5, “Refinements”</a>) the following options are available (left to right):</p><div class="itemizedlist"><ul type="disc"><li><p>Modules (<a href="config.html#modules" title="6.1. Modules (Scope, Frontier, and Processors)">Section 6.1, “Modules (Scope, Frontier, and Processors)”</a>)</p><p>Add/remove/set configurable modules, such as the crawl Scope (<a href="config.html#scopes" title="6.1.1. Crawl Scope">Section 6.1.1, “Crawl Scope”</a>), Frontier (<a href="config.html#frontier" title="6.1.2. Frontier">Section 6.1.2, “Frontier”</a>), or Processors (<a href="config.html#processors" title="6.1.3. Processing Chains">Section 6.1.3, “Processing Chains”</a>).</p></li><li><p>Submodules (<a href="config.html#submodules" title="6.2. Submodules">Section 6.2, “Submodules”</a>)</p><p>Here you can:</p><div class="itemizedlist"><ul type="circle"><li><p>Add/remove/reorder URL canonicalization rules (<a href="config.html#urlcanon" title="6.2.1. URL Canonicalization Rules">Section 6.2.1, “URL Canonicalization Rules”</a>)</p></li><li><p>Add/remove/reorder filters (<a href="config.html#filters" title="6.2.2. Filters">Section 6.2.2, “Filters”</a>)</p></li><li><p>Add/remove login credentials (<a href="config.html#credentials" title="6.2.3. Credentials">Section 6.2.3, “Credentials”</a>)</p></li></ul></div></li><li><p>Settings (<a href="config.html#settings" title="6.3. 
Settings">Section 6.3, “Settings”</a>)</p><p>Configure settings on Heritrix modules</p></li><li><p>Overrides (<a href="config.html#overrides" title="6.4. Overrides">Section 6.4, “Overrides”</a>)</p><p>Override settings on Heritrix modules based on domain</p></li><li><p>Refinements (<a href="config.html#refinements" title="6.5. Refinements">Section 6.5, “Refinements”</a>)</p><p>Refine settings on Heritrix modules based on arbitrary criteria</p></li><li><p>Submit job / Finished</p><p>Clicking this tab will take the user back to the Jobs or Profiles page, saving any changes.</p></li></ul></div><p>The <span class="emphasis"><em>Settings</em></span> tab is probably the most frequently used page as it allows the user to fine-tune the settings of any Heritrix module used in a job or profile.</p><p>It is safe to navigate between these tabs; doing so will not cause new jobs to be submitted to the queue of pending jobs. That only happens once the <span class="emphasis"><em>Submit job</em></span> tab is clicked. Navigating out of the configuration pages using the top-level tabs will cause new jobs to be lost. Any changes made are saved when navigating within the configuration pages. There is no undo function; once made, changes cannot be undone.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="modules"></a>6.1. Modules (Scope, Frontier, and Processors)</h3></div></div></div><p>Heritrix has several types of pluggable modules. Each type has a fixed interface and usually a number of provided implementations. Modules can also be third-party plugins. The "Modules" tab allows the user to set several types of these pluggable modules.</p><p>Once modules have been added to the configuration they can be configured in greater detail on the Settings tab (<a href="config.html#settings" title="6.3. Settings">Section 6.3, “Settings”</a>). 
If a module can contain within it multiple other modules, these can be configured on the Submodules tab.</p><p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Modules are referred to by their Java class names (org.archive.crawler.frontier.BdbFrontier). This is done because these are the only names we can be assured of being unique.</p></div>See <a href="http://crawler.archive.org/articles/developer_manual/index.html" target="_top">Developer's Manual</a> for information about creating and adding custom modules to Heritrix.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="scopes"></a>6.1.1. Crawl Scope</h4></div></div></div><p>A crawl scope is an object that decides for each discovered URI if it is within the scope of the current crawl.</p><p>Several scopes are provided with Heritrix:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>BroadScope</strong></span></p><p>This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled.</p></li><li><p><a name="surtprefixscope"></a><span class="bold"><strong>SurtPrefixScope</strong></span></p><p>A highly flexible and fairly efficient scope which can crawl within defined domains, individual hosts, or path-defined areas of hosts, or any mixture of those, depending on the configuration.</p><p>It considers whether any URI is inside the primary focus of the scope by converting the URI to its <a href="glossary.html#surt">SURT</a> form, and then seeing if that SURT form begins with any of a number of <a href="glossary.html#surtprefix">SURT prefix</a>es. 
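</p><p>As a rough illustration (the URIs below are made up for this example, and the glossary remains the authoritative description of the SURT form), a scope holding the single SURT prefix <code class="literal">http://(org,archive,</code> would judge URIs along these lines:</p><p><pre class="programlisting">URI                              SURT form                           in focus?
http://www.archive.org/movies/   http://(org,archive,www,)/movies/   yes
http://audio.archive.org/        http://(org,archive,audio,)/        yes
http://www.example.com/          http://(com,example,www,)/          no</pre></p><p>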
(See the glossary definitions for detailed information about the SURT form of a URI and SURT prefix comparisons.)</p><p>The operator may establish the set of SURT prefixes used either by letting the SURT prefixes be implied from the supplied seed URIs, specifying an external file with a listing of SURT prefixes, or both.</p><p>This scope also enables a special syntax within the seeds list for adding SURT prefixes separate from seeds. Any line in the seeds list beginning with a '+' will be considered a SURT prefix specification, rather than a seed. Any URL you put after the '+' will only be used to deduce a SURT prefix -- it will not be independently scheduled. You can also put your own literal SURT prefix after the '+'.</p><p>For example, the following SURT prefix directives in the seeds box are all equivalent:</p><p><pre class="programlisting">+http://(org,example,   # literal SURT prefix
+http://example.org     # regular URL implying same SURT prefix
+example.org            # URL fragment with implied 'http' scheme</pre></p><p>When you use this scope, it adds 3 hard-to-find-in-the-UI attributes -- <code class="literal">surts-source-file</code>, <code class="literal">seeds-as-surt-prefixes</code>, and <code class="literal">surts-dump-file</code> -- to the end of the scope section, just after <code class="literal">transitiveFilter</code> but before <code class="literal">http-headers</code>.</p><p>Use the <code class="literal">surts-source-file</code> setting to supply an external file from which to infer SURT prefixes, if desired. Any URLs in this file will be converted to the implied SURT prefix, and any line beginning with a '+' will be interpreted as a literal, precise SURT prefix. Use the <code class="literal">seeds-as-surt-prefixes</code> setting to establish whether SURT prefixes should be deduced from the seeds, in accordance with the rules given at the <a href="glossary.html#surtprefix">SURT prefix</a> glossary entry. 
(The default is 'true', to deduce SURT prefixes from seeds.)</p><p>To see what SURT prefixes were actually used -- perhaps merged from seed-deduced and externally supplied prefixes -- you can specify a file path in the <code class="literal">surts-dump-file</code> setting. The sorted list of actual SURT prefixes used will be written to that file for reference. (Note that redundant entries will be removed from this dump. If you have SURT prefixes &lt;http://(org,&gt; and &lt;http://(org,archive,&gt;, only the former will actually be used, because all SURT form URIs prefixed by the latter are also prefixed by the former.)</p><p>See also the crawler wiki on <a href="http://crawler.archive.org/cgi-bin/wiki.pl?SurtScope" target="_top">SurtScope</a>.</p></li><li><p><span class="bold"><strong>FilterScope</strong></span></p><p>A highly configurable scope. By adding different filters in different combinations this scope can be configured to provide a wide variety of behaviour.</p><p>After selecting this scope, you must then go to the <span class="emphasis"><em>Filters</em></span> tab and add the filters you want to run as part of your scope. Add the filters at the <span class="emphasis"><em>focusFilter</em></span> label and give them a meaningful name. The URIRegexFilter probably makes most sense in this context (the ContentTypeRegexFilter won't work at scope time because we don't know the content type until after we've fetched the document).</p><p>After adding the filter(s), return to the <span class="emphasis"><em>Settings</em></span> tab and fill in any configuration required of the filters. 
For example, if you added the URIRegexFilter and wanted only 'www.archive.org' hosts to be in focus, fill in a regex like the following: <code class="literal">^(?:http://|dns:)www\.archive\.org(?:/.*)?$</code> (Be careful you don't rule out prerequisites such as dns or robots.txt when specifying your scope filter).</p></li></ul></div><p>The following scopes are available, but the same effects can be achieved more efficiently, and in combination, with SurtPrefixScope. When SurtPrefixScope can be more easily understood and configured, these scopes may be removed entirely.</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>DomainScope</strong></span></p><p>This scope limits discovered URIs to the set of domains defined by the provided seeds. That is, any discovered URI belonging to a domain from which one of the seeds came is within scope. As always, it is possible to apply depth restrictions.</p><p>Using the seed 'archive.org', a domain scope will fetch 'audio.archive.org', 'movies.archive.org', etc. It will fetch all discovered URIs from 'archive.org' and from any subdomain of 'archive.org'.</p></li><li><p><span class="bold"><strong>HostScope</strong></span></p><p>This scope limits discovered URIs to the set of hosts defined by the provided seeds.</p><p>If the seed is 'www.archive.org', then we'll only fetch items discovered on this host. The crawler will not go to 'audio.archive.org' or 'movies.archive.org'.</p></li><li><p><span class="bold"><strong>PathScope</strong></span></p><p>This scope goes yet further and limits the discovered URIs to a section of paths on hosts defined by the seeds. Of course, any host that has a seed pointing at its root (e.g. <code class="literal">www.sample.com/index.html</code>) will be included in full, whereas a host whose only seed is