<code class="literal">www.sample2.com/path/index.html</code> will be limited to URIs under <code class="literal">/path/</code>.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Internally, Heritrix treats everything up to the rightmost slash as the <code class="literal">path</code> when applying path scope. For example, the seeds <code class="literal">http://members.aol.com/bigbird</code> and <code class="literal">http://members.aol.com/~bigbird</code> will both treat as in scope any URL that begins with <code class="literal">members.aol.com</code>. If your intent is to include only what lies below the path <code class="literal">bigbird</code>, add a trailing slash, using a form such as <code class="literal">http://members.aol.com/bigbird/</code> or <code class="literal">http://members.aol.com/bigbird/index.html</code> instead.</p></div></li></ul></div><p>Scopes usually allow some flexibility in defining depth and possible transitive includes (that is, fetching items that would usually be out of scope because of a special circumstance, such as being embedded in the display of an included resource). Most notably, every scope can have additional filters applied in two different contexts (some scopes may have only one of these contexts).</p><div class="orderedlist"><ol type="1"><li><p><span class="bold"><strong>Focus</strong></span></p><p>URIs matching these filters will be considered to be within scope.</p></li><li><p><span class="bold"><strong>Exclude</strong></span></p><p>URIs matching these filters will be considered to be out of scope.</p></li></ol></div><p>Custom-made scopes may have different sets of filters. Some scopes also have filters hardcoded into them: you can edit the settings of such filters but not remove or replace them. 
For example, most of the provided scopes have a <code class="literal">Transclusion</code> filter hardcoded into them that handles transitive items (URIs that would normally be out of scope but are included because of special circumstances).</p><p>For more about Filters see <a href="config.html#filters" title="6.2.2. Filters">Section 6.2.2, “Filters”</a>.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="scopeproblems"></a>6.1.1.1. Problems with the current Scopes</h5></div></div></div><p>Our original Scope classes -- PathScope, HostScope, DomainScope, BroadScope -- could all be thought of as fitting a specific pattern: A URI is included if and only if:</p><p><pre class="programlisting">
protected final boolean innerAccepts(Object o) {
    return ((isSeed(o) || focusAccepts(o))
            || additionalFocusAccepts(o)
            || transitiveAccepts(o))
        && !excludeAccepts(o);
}
</pre></p><p>More generally, the <span class="emphasis"><em>focus</em></span> filter was meant to rule things in by prima facie/regexp-pattern analysis; the <span class="emphasis"><em>transitive</em></span> filter to rule extra items in by dynamic path analysis (for example, offsite embedded images); and the <span class="emphasis"><em>exclusion</em></span> filter to rule things out by any number of chained exclusion rules. 
So in a typical crawl, the <span class="emphasis"><em>focus</em></span> filter drew from one of these categories:<div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>broad</strong></span>: accept all</p></li><li><p><span class="bold"><strong>domain</strong></span>: accept if on the same 'domain' (for some definition) as the seeds</p></li><li><p><span class="bold"><strong>host</strong></span>: accept if on exactly the same host as the seeds</p></li><li><p><span class="bold"><strong>path</strong></span>: accept if on the same host as the seeds and sharing a path-prefix with them</p></li></ul></div>The <span class="emphasis"><em>transitive</em></span> filter was configured based on the various link-hops and embed-hops thresholds set by the operator.</p><p>The <span class="emphasis"><em>exclusion</em></span> filter was in fact a compound chain of filters, OR'ed together, such that any one of them could knock a URI out of consideration. However, a number of aspects of this arrangement have caused problems: <div class="orderedlist"><ol type="1"><li><p>To truly understand what happens to a URI, you must understand the nested boolean construct above.</p></li><li><p>Adding mixed focuses -- such as all of this one host, all of this other domain, and then just these paths on this other host -- is not supported by these classes, nor easy to mix into the <span class="emphasis"><em>focus</em></span> filter.</p></li><li><p>Constructing and configuring the multiple filters required many setup steps across several WUI pages.</p></li><li><p>The reverse sense of the <span class="emphasis"><em>exclusion</em></span> filters -- if URIs are accepted by the filter, they are excluded from the crawl -- proved confusing, exacerbated by the fact that 'filter' itself can commonly mean either 'filter in' or 'filter out'.</p></li></ol></div></p><p>As a result of these problems, the SurtPrefixScope was added, and further major changes are planned. 
The first steps are described in the next section, <a href="config.html#decidingscope" title="6.1.1.2. DecidingScope">Section 6.1.1.2, “DecidingScope”</a>. These changes will also affect whether and how filters (see <a href="config.html#filters" title="6.2.2. Filters">Section 6.2.2, “Filters”</a>) are used.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="decidingscope"></a>6.1.1.2. DecidingScope</h5></div></div></div><p>To address the shortcomings above, and generally make alternate scope choices more understandable and flexible, a new mechanism for scoping and filtering has been introduced in Heritrix 1.4. This new approach is somewhat like (and inspired by) HTTrack's 'scan rules'/filters, Alexa's mask/ignore/void syntax for adjusting recurring crawls, or the Nutch 'regex-urlfilter' facility, but may be a bit more general than any of those.</p><p>This new approach is available as a DecidingScope, which is modelled as a series of DecideRules. Each DecideRule, when presented with an Object (most often a URI of some form), may respond with one of three decisions:</p><div class="itemizedlist"><ul type="disc"><li><p>ACCEPT: the object is ruled in</p></li><li><p>REJECT: the object is ruled out</p></li><li><p>PASS: the rule has no opinion; retain whatever previous decision was made</p></li></ul></div><p>To define a Scope, the operator configures an ordered series of DecideRules. A URI under consideration begins with no assumed status. Each rule is applied in turn to the candidate URI. If the rule decides ACCEPT or REJECT, the URI's status is set accordingly. 
After all rules have been applied, if the URI's status is ACCEPT it is "in scope" and scheduled for crawling; if its status is REJECT it is discarded.</p><p>There are no branches, but much of what nested conditionals can achieve is possible, in a form that should be easier to follow than arbitrary expressions.</p><p>The current list of available DecideRules includes:</p><p><pre class="programlisting">
AcceptDecideRule -- ACCEPTs all (establishing an early default)
RejectDecideRule -- REJECTs all (establishing an early default)
TooManyHopsDecideRule(max-hops=N) -- REJECTs all with hopsPath.length()>N, PASSes otherwise
PrerequisiteAcceptDecideRule -- ACCEPTs any with 'P' as last hop, PASSes otherwise (allowing prerequisites of items within other limits to also be included)
MatchesRegExpDecideRule(regexp=pattern) -- ACCEPTs (or REJECTs) all matching a regexp, PASSing otherwise
NotMatchesRegExpDecideRule(regexp=pattern) -- ACCEPTs (or REJECTs) all *not* matching a regexp, PASSing otherwise
PathologicalPathDecideRule(max-reps=N) -- REJECTs all matching problem patterns
TooManyPathSegmentsDecideRule(max-segs=N) -- REJECTs all with too many path-segments ('/'s)
TransclusionDecideRule(extra-hops=N) -- ACCEPTs anything with up to N non-navlink (non-'L') hops at end
SurtPrefixedDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything matched by SURT prefix set generated from supplied seeds/files/etc.
NotSurtPrefixedDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything *not* matched by SURT prefix set generated from supplied seeds/files/etc.
OnHostsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything on same hosts as deduced from supplied seeds/files/etc.
NotOnHostsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything *not* on same hosts as deduced from supplied seeds/files/etc.
OnDomainsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything on same domains as deduced from supplied seeds/files/etc.
NotOnDomainsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything *not* on same domains as deduced from supplied seeds/files/etc.
MatchesFilePatternDecideRule -- ACCEPTs (or REJECTs) URIs matching a chosen predefined convenience regexp pattern (such as common file-extensions)
NotMatchesFilePatternDecideRule -- ACCEPTs (or REJECTs) URIs *not* matching a chosen predefined convenience regexp pattern
</pre></p><p>...covering just about everything our previous focus- and filter-based classes did. By ordering exclude and include actions, combinations that were awkward before -- or even impossible without writing custom code -- become straightforward.</p><p>For example, a previous request that was hard for us to accommodate was the idea: "crawl exactly these X hosts, and get offsite images only if on the same domains." That is, don't wander off the exact hosts to follow navigational links -- only to get offsite resources that share the same domain.</p><p>Our relevant function-of-seeds tests -- host-based and domain-based -- were exclusive of each other (at the 'focus' level) and difficult to mix in with path-based criteria (at the 'transitive' level).</p><p>As a series of DecideRules, the above request can be easily achieved as:</p><p><pre class="programlisting">
RejectDecideRule
OnHostsDecideRule(use-seeds=true)
TransclusionDecideRule(extra-hops=2)
NotOnDomainsDecideRule(REJECT,use-seeds=true);
</pre></p><p>A good default set of DecideRules for many purposes would be...</p><p><pre class="programlisting">
RejectDecideRule // reject by default
SurtPrefixedDecideRule // accept within SURT prefixes established by seeds