            <code class="literal">www.sample2.com/path/index.html</code> will be limited to URIs under <code class="literal">/path/</code>.</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Internally, Heritrix defines everything up to the rightmost slash as the <code class="literal">path</code> when doing path scope, so, for example, the URLs <code class="literal">http://members.aol.com/bigbird</code> and <code class="literal">http://members.aol.com/~bigbird</code> will both cause any URL that begins with <code class="literal">members.aol.com</code> to be treated as in scope. If your intent is to include only what lies below the path <code class="literal">bigbird</code>, add a slash on the end, using a form such as <code class="literal">http://members.aol.com/bigbird/</code> or <code class="literal">http://members.aol.com/bigbird/index.html</code> instead.</p></div></li></ul></div>
<p>Scopes usually allow for some flexibility in defining depth and possible transitive includes (that is, getting items that would usually be out of scope because of special circumstances, such as their being embedded in the display of an included resource). Most notably, every scope can have additional filters applied in two different contexts (some scopes may have only one of these contexts).</p>
<div class="orderedlist"><ol type="1">
<li><p><span class="bold"><strong>Focus</strong></span></p><p>URIs matching these filters will be considered to be within scope.</p></li>
<li><p><span class="bold"><strong>Exclude</strong></span></p><p>URIs matching these filters will be considered to be out of scope.</p></li>
</ol></div>
<p>Custom-made scopes may have different sets of filters. Also, some scopes have filters hardcoded into them.
This allows you to edit their settings but not to remove or replace them. For example, most of the provided scopes have a <code class="literal">Transclusion</code> filter hardcoded into them that handles transitive items (URIs that normally would not be included but are, because of special circumstances).</p>
<p>For more about Filters see <a href="config.html#filters" title="6.2.2.&nbsp;Filters">Section&nbsp;6.2.2, &ldquo;Filters&rdquo;</a>.</p>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="scopeproblems"></a>6.1.1.1.&nbsp;Problems with the current Scopes</h5></div></div></div>
<p>Our original Scope classes -- PathScope, HostScope, DomainScope, BroadScope -- could all be thought of as fitting a specific pattern. A URI is included if and only if:</p>
<p><pre class="programlisting">protected final boolean innerAccepts(Object o) {
    return ((isSeed(o) || focusAccepts(o)) || additionalFocusAccepts(o) ||
            transitiveAccepts(o)) &amp;&amp; !excludeAccepts(o);
}</pre></p>
<p>More generally, the <span class="emphasis"><em>focus</em></span> filter was meant to rule things in by prima facie/regexp-pattern analysis; the <span class="emphasis"><em>transitive</em></span> filter was meant to rule extra items in by dynamic path analysis (for example, off-site embedded images); and the <span class="emphasis"><em>exclusion</em></span> filter was meant to rule things out by any number of chained exclusion rules.
So in a typical crawl, the <span class="emphasis"><em>focus</em></span> filter drew from one of these categories:<div class="itemizedlist"><ul type="disc">
<li><p><span class="bold"><strong>broad</strong></span>: accept all</p></li>
<li><p><span class="bold"><strong>domain</strong></span>: accept if on the same 'domain' (for some definition) as the seeds</p></li>
<li><p><span class="bold"><strong>host</strong></span>: accept if on exactly the same host as the seeds</p></li>
<li><p><span class="bold"><strong>path</strong></span>: accept if on the same host as the seeds and sharing a path prefix with them</p></li>
</ul></div>The <span class="emphasis"><em>transitive</em></span> filter was configured based on the various link-hops and embed-hops thresholds set by the operator.</p>
<p>The <span class="emphasis"><em>exclusion</em></span> filter was in fact a compound chain of filters, OR'ed together, such that any one of them could knock a URI out of consideration.
However, a number of aspects of this arrangement have caused problems: <div class="orderedlist"><ol type="1">
<li><p>To truly understand what happens to a URI, you must understand the above nested boolean construct.</p></li>
<li><p>Adding mixed focuses -- such as all of this one host, all of this other domain, and then just these paths on this other host -- is not supported by these classes, nor is it easy to mix in to the <span class="emphasis"><em>focus</em></span> filter.</p></li>
<li><p>Constructing and configuring the multiple filters required many setup steps across several WUI pages.</p></li>
<li><p>The reverse sense of the <span class="emphasis"><em>exclusion</em></span> filters -- if URIs are accepted by the filter, they are excluded from the crawl -- proved confusing, exacerbated by the fact that 'filter' itself can commonly mean either 'filter in' or 'filter out'.</p></li>
</ol></div></p>
<p>As a result of these problems, the SurtPrefixScope was added, and further major changes are planned. The first steps are described in the next section, <a href="config.html#decidingscope" title="6.1.1.2.&nbsp;DecidingScope">Section&nbsp;6.1.1.2, &ldquo;DecidingScope&rdquo;</a>. These changes will also affect whether and how filters (see <a href="config.html#filters" title="6.2.2.&nbsp;Filters">Section&nbsp;6.2.2, &ldquo;Filters&rdquo;</a>) are used.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="decidingscope"></a>6.1.1.2.&nbsp;DecidingScope</h5></div></div></div>
<p>To address the shortcomings above, and generally to make alternate scope choices more understandable and flexible, a new mechanism for scoping and filtering has been introduced in Heritrix 1.4.
This new approach is somewhat like (and inspired by) HTTrack's          'scan rules'/filters, Alexa's mask/ignore/void syntax for adjusting          recurring crawls, or the Nutch 'regex-urlfilter' facility, but may          be a bit more general than any of those.</p><p>This new approach is available as a DecidingScope, which is          modelled as a series of DecideRules. Each DecideRule, when presented          with an Object (most often a URI of some form), may respond with one          of three decisions:</p><div class="itemizedlist"><ul type="disc"><li><p>ACCEPT: the object is ruled in</p></li><li><p>REJECT: the object is ruled out</p></li><li><p>PASS: the rule has no opinion; retain whatever previous              decision was made</p></li></ul></div><p>To define a Scope, the operator configures an ordered series          of DecideRules. A URI under consideration begins with no assumed          status. Each rule is applied in turn to the candidate URI. If the          rule decides ACCEPT or REJECT, the URI's status is set accordingly.          
After all rules have been applied, if the URI's status is ACCEPT, it is "in scope" and scheduled for crawling; if its status is REJECT, it is discarded.</p>
<p>There are no branches, but much of what nested conditionals can achieve is possible, in a form that should be easier to follow than arbitrary expressions.</p>
<p>The current list of available DecideRules includes:</p>
<p><pre class="programlisting">
    AcceptDecideRule -- ACCEPTs all (establishing an early default)
    RejectDecideRule -- REJECTs all (establishing an early default)
    TooManyHopsDecideRule(max-hops=N) -- REJECTs all with hopsPath.length()&gt;N, PASSes otherwise
    PrerequisiteAcceptDecideRule -- ACCEPTs any with 'P' as last hop, PASSes otherwise (allowing prerequisites of items within other limits to also be included)
    MatchesRegExpDecideRule(regexp=pattern) -- ACCEPTs (or REJECTs) all matching a regexp, PASSing otherwise
    NotMatchesRegExpDecideRule(regexp=pattern) -- ACCEPTs (or REJECTs) all *not* matching a regexp, PASSing otherwise
    PathologicalPathDecideRule(max-reps=N) -- REJECTs all matching problem patterns
    TooManyPathSegmentsDecideRule(max-segs=N) -- REJECTs all with too many path-segments ('/'s)
    TransclusionDecideRule(extra-hops=N) -- ACCEPTs anything with up to N non-navlink (non-'L') hops at end
    SurtPrefixedDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything matched by SURT prefix set generated from supplied seeds/files/etc.
    NotSurtPrefixedDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything *not* matched by SURT prefix set generated from supplied seeds/files/etc.
    OnHostsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything on same hosts as deduced from supplied seeds/files/etc.
    NotOnHostsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything *not* on same hosts as deduced from supplied seeds/files/etc.
    OnDomainsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything on same domains as deduced from supplied seeds/files/etc.
    NotOnDomainsDecideRule(use-seeds=bool;use-file=path) -- ACCEPTs (or REJECTs) anything *not* on same domains as deduced from supplied seeds/files/etc.
    MatchesFilePatternDecideRule -- ACCEPTs (or REJECTs) URIs matching a chosen predefined convenience regexp pattern (such as common file-extensions)
    NotMatchesFilePatternDecideRule -- ACCEPTs (or REJECTs) URIs *not* matching a chosen predefined convenience regexp pattern
</pre></p>
<p>...covering just about everything our previous focus- and filter-based classes did. By ordering exclude and include actions, combinations that were awkward before -- or even impossible without writing custom code -- become straightforward.</p>
<p>For example, a previous request that was hard for us to accommodate was the idea: "crawl exactly these X hosts, and get offsite images only if on the same domains." That is, don't wander off the exact hosts to follow navigational links -- only to get offsite resources that share the same domain.</p>
<p>Our relevant function-of-seeds tests -- host-based and domain-based -- were exclusive of each other (at the 'focus' level) and difficult to mix in with path-based criteria (at the 'transitive' level).</p>
<p>As a series of DecideRules, the above request can be easily achieved as:</p>
<p><pre class="programlisting">
        RejectDecideRule
        OnHostsDecideRule(use-seeds=true)
        TransclusionDecideRule(extra-hops=2)
        NotOnDomainsDecideRule(REJECT,use-seeds=true)
</pre></p>
<p>A good default set of DecideRules for many purposes would be...</p>
<p><pre class="programlisting">
        RejectDecideRule               // reject by default
        SurtPrefixedDecideRule         // accept within SURT prefixes established by seeds