TooManyHopsDecideRule          // but reject if too many hops from seeds
TransclusionDecideRule         // notwithstanding above, accept if within a few transcluding hops (frames/imgs/redirects)
PathologicalPathDecideRule     // but reject if pathological repetitions
TooManyPathSegmentsDecideRule  // ...or if too many path-segments
PrerequisiteAcceptDecideRule   // but always accept a prerequisite of another URI
</pre></p>
<p>In Heritrix 1.10.0, the default profile was changed to use the above set of DecideRules (previously, the operator had to choose the 'deciding-default' profile, which has since been removed).</p>
<p>The naming, behavior, and user interface for DecideRule-based scoping are subject to significant change in future releases, based on feedback and experience.</p>
<p>Enable FINE logging on the class <code class="literal">org.archive.crawler.deciderules.DecideRuleSequence</code> to watch each DecideRule's finding on each processed URI.</p>
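<p>A minimal way to turn this on, assuming Heritrix reads its java.util.logging levels from the standard logging configuration file shipped with the distribution (typically <code class="literal">conf/heritrix.properties</code>; the exact file may differ in your installation), is to add a line such as:</p>
<pre>
# Log every DecideRule's decision on every processed URI (very verbose)
org.archive.crawler.deciderules.DecideRuleSequence.level = FINE
</pre>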
</div></div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="frontier"></a>6.1.2. Frontier</h4></div></div></div>
<p>The Frontier is a pluggable module that maintains the internal state of the crawl: which URIs have been discovered, crawled, and so on. As such, its selection greatly affects, for instance, the order in which discovered URIs are crawled.</p>
<p>There is only one Frontier per crawl job.</p>
<p>Multiple Frontiers are provided with Heritrix, each of a particular character.</p>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="bdbfrontier"></a>6.1.2.1. BdbFrontier</h5></div></div></div>
<p>The default Frontier in Heritrix 1.4.0 and later is the BdbFrontier (previously, the default was the <a href="config.html#hqf" title="6.1.2.2. HostQueuesFrontier">Section 6.1.2.2, “HostQueuesFrontier”</a>). The BdbFrontier visits URIs and sites discovered in a generally breadth-first manner. It offers configuration options controlling how it throttles its activity against particular hosts, and whether it has a bias towards finishing hosts in progress ('site-first' crawling) or cycling among all hosts with pending URIs.</p>
<p>Discovered URIs are only crawled once, except that robots.txt and DNS information can be configured to be refreshed at specified intervals for each host.</p>
<p>The main difference between the BdbFrontier and its precursor, the <a href="config.html#hqf" title="6.1.2.2. HostQueuesFrontier">Section 6.1.2.2, “HostQueuesFrontier”</a>, is that the BdbFrontier uses BerkeleyDB Java Edition to shift more running Frontier state to disk.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="hqf"></a>6.1.2.2. HostQueuesFrontier</h5></div></div></div>
<p>The forerunner of the <a href="config.html#bdbfrontier" title="6.1.2.1. BdbFrontier">Section 6.1.2.1, “BdbFrontier”</a>. It is now deprecated, mostly because its custom disk-based data structures could not move as much Frontier state out of main memory as the BerkeleyDB Java Edition approach. It has the same general characteristics as the <a href="config.html#bdbfrontier" title="6.1.2.1. BdbFrontier">Section 6.1.2.1, “BdbFrontier”</a>.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="dsf"></a>6.1.2.3. DomainSensitiveFrontier</h5></div></div></div>
<p>A subclass of the <a href="config.html#hqf" title="6.1.2.2. HostQueuesFrontier">Section 6.1.2.2, “HostQueuesFrontier”</a> written by Oskar Grenholm. The DSF allows specifying an upper bound on the number of documents downloaded per site. It does this by exploiting <a href="config.html#overrides" title="6.4. Overrides">Section 6.4, “Overrides”</a>, adding a filter that blocks further fetching once the crawler has reached the per-site limit.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="arf"></a>6.1.2.4. AdaptiveRevisitingFrontier</h5></div></div></div>
<p>The AdaptiveRevisitingFrontier -- a.k.a. the <span class="emphasis"><em>AR</em></span> Frontier -- will repeatedly visit all encountered URIs. The wait time between visits is configurable and varies based on wait intervals specified by a WaitEvaluator processor. It was written by Kristinn Sigurdsson.</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>This Frontier is still experimental, in active development, and has not been tested extensively.</p></div>
<p>In addition to the WaitEvaluator (or a similar processor), a crawl using this Frontier also needs the ChangeEvaluator processor: that is, this Frontier requires that ChangeEvaluator and WaitEvaluator, or equivalents, are present in the processing chains.</p>
<p>ChangeEvaluator should be at the very top of the extractor chain.</p>
<p>WaitEvaluator -- or an equivalent -- needs to be in the post-processing chain.</p>
<p>The ChangeEvaluator has no configurable settings. The WaitEvaluator, however, has numerous settings to adjust the revisit policy:</p>
<div class="itemizedlist"><ul type="disc">
<li><p>Initial wait: a waiting period before revisiting for the first time.</p></li>
<li><p>Increase and decrease factors for unchanged and changed documents, respectively. If a document has not changed between visits, its wait time is multiplied by the "unchanged-factor"; if it has changed, the wait time is divided by the "changed-factor". Both values accept real numbers, not just integers (see the sketch at the end of this section).</p></li>
<li><p>Finally, there is a 'default-wait-interval' for URIs where it is not possible to judge changes in content. Currently this applies only to DNS lookups.</p></li>
</ul></div>
<p>If you want to specify different wait times and factors for URIs based on their MIME types, this is possible. Create a Refinement (<a href="config.html#refinements" title="6.5. Refinements">Section 6.5, “Refinements”</a>) using the ContentType criteria, give it a regular expression that matches the desired MIME type as its parameter, and then override the applicable parameters in the refinement.</p>
<p>By setting the 'state' directory to the same location that another AR crawl used, it should resume that crawl (minus some statistics).</p>
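<p>To make the factor arithmetic concrete, here is a minimal sketch of the adjustment rule described above. It is not the actual WaitEvaluator source; the class, method, and field names are illustrative only, and a production policy would typically also bound the result to sensible minimum and maximum waits.</p>
<pre>
// Illustrative sketch only -- not Heritrix's WaitEvaluator implementation.
public class RevisitWaitSketch {
    private final double unchangedFactor; // e.g. 1.5: back off when content is stable
    private final double changedFactor;   // e.g. 1.5: revisit sooner when content changes

    public RevisitWaitSketch(double unchangedFactor, double changedFactor) {
        this.unchangedFactor = unchangedFactor;
        this.changedFactor = changedFactor;
    }

    /** Next wait (in seconds), given the previous wait and whether the content changed. */
    public double nextWaitSeconds(double lastWaitSeconds, boolean contentChanged) {
        return contentChanged
                ? lastWaitSeconds / changedFactor     // changed: divide by the changed-factor
                : lastWaitSeconds * unchangedFactor;  // unchanged: multiply by the unchanged-factor
    }
}
</pre>
<p>For example, with both factors set to 1.5 and an initial wait of one hour, three consecutive unchanged visits push the wait to about 1 × 1.5<sup>3</sup> ≈ 3.4 hours, while a single detected change pulls it back down by a factor of 1.5.</p>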
</div></div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="processors"></a>6.1.3. Processing Chains</h4></div></div></div>
<p>When a URI is crawled, it is in fact passed through a series of processors. This series is split, for convenience, into five chains, and the user can add, remove, and reorder the processors on each of these chains.</p>
<p>Each URI taken off the Frontier queue runs through the <code class="literal">Processing Chains</code> listed in the diagram shown below. URIs are always processed in the order shown in the diagram unless a particular processor throws a fatal error or decides to stop the processing of the current URI for some reason. In that case, processing skips to the end, to the Post-processing chain, for cleanup (a sketch of this flow appears at the end of this section).</p>
<p>Each processing chain is made up of zero or more individual processors. For example, the extractor processing chain might comprise the <code class="literal">ExtractorHTML</code>, <code class="literal">ExtractorJS</code>, and <code class="literal">ExtractorUniversal</code> processors. Within a processing chain, the processors run in the order in which they are listed on the modules page.</p>
<p>Generally, particular processors only make sense within the context of one particular processing chain. For example, it would not make sense to run the <code class="literal">FetchHTTP</code> processor in the Post-processing chain. This is, however, not enforced, so users must take care to construct logical processing chains.</p>
<div><img src="processing_steps.png"></div>
<p>Most of the processors are fairly self-explanatory; however, the processors in the first and last chains merit a bit more attention.</p>
<p>In the <code class="literal">Pre-fetch processing</code> chain, the following two processors should be included (or replacement modules that perform similar operations):</p>
<div class="itemizedlist"><ul type="disc">
<li><p><span class="bold"><strong>Preselector</strong></span></p><p>A last check of whether the URI should indeed be crawled; it can, for example, recheck the scope. This is useful if the scope has been changed after the crawl starts. (This processor is not strictly necessary.)</p></li>
<li><p><span class="bold"><strong>PreconditionEnforcer</strong></span></p><p>Ensures that all preconditions for crawling a URI have been met; these currently include verifying that DNS and robots.txt information has been fetched for the URI. It should always be included.</p></li>
</ul></div>
<p>Similarly, the <code class="literal">Post Processing</code> chain has the following special-purpose processors:</p>
<div class="itemizedlist"><ul type="disc">
<li><p><span class="bold"><strong>CrawlStateUpdater</strong></span></p><p>Updates the per-host information that may have been affected by the fetch; currently this is robots.txt and IP address information. It should always be included.</p></li>
<li><p><span class="bold"><strong>LinksScoper</strong></span></p><p>Checks all links extracted from the current download against the crawl scope. Those that are out of scope are discarded. Logging of discarded URLs can be enabled.</p></li>
<li><p><span class="bold"><strong>FrontierScheduler</strong></span></p><p>'Schedules' with the Frontier, for crawling, any URIs stored as CandidateURIs in the current CrawlURI. It also schedules prerequisites, if any.</p></li>
</ul></div>
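<p>The control flow described above -- run each chain's processors in order, but always fall through to the Post-processing chain when a processor halts the URI -- can be sketched roughly as follows. This is a conceptual illustration, not Heritrix source; the <code class="literal">Processor</code> interface and the method names are made up for this example.</p>
<pre>
// Conceptual sketch of URI flow through the processing chains -- not Heritrix source.
// The Processor interface and method names here are hypothetical.
public class ProcessingChainSketch {

    /** One processor; returning false means "stop processing this URI". */
    interface Processor {
        boolean process(String uri);
    }

    /**
     * Run a URI through the main chains (pre-fetch, fetch, extractor, write) in
     * order. If any processor halts the URI, the remaining main chains are
     * skipped, but the post-processing chain always runs so that cleanup and
     * scheduling of discovered links still happen.
     */
    static void runUri(String uri, Processor[][] mainChains, Processor[] postChain) {
        mainLoop:
        for (Processor[] chain : mainChains) {
            for (Processor p : chain) {
                if (!p.process(uri)) {
                    break mainLoop; // skip ahead to post-processing
                }
            }
        }
        for (Processor p : postChain) {
            p.process(uri);
        }
    }
}
</pre>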
</div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="stattrack"></a>6.1.4. Statistics Tracking</h4></div></div></div>
<p>Any number of statistics tracking modules can be attached to a crawl, though currently only one is provided with Heritrix. The <code class="literal">StatisticsTracker</code> module that comes with Heritrix writes the <code class="literal">progress-statistics.log</code> file and provides the WUI with the data it needs to display progress information about a crawl. It is strongly recommended that any crawl run with the WUI use this module.</p></div></div>
<div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="submodules"></a>6.2. Submodules</h3></div></div></div>
<p>On the Submodules tab, configuration points that take variable-sized listings of components can be configured; components can be added, ordered, and removed. Examples of such components are listings of canonicalization rules to run against each discovered URL, <a href="config.html#filters" title="6.2.2. Filters">Section 6.2.2, “Filters”</a> on processors, and credentials. Once submodules are added under the Submodules tab, they appear in subsequent redrawings of the Settings tab, where the values that control their operation can then be set.</p>