TooManyHopsDecideRule          // but reject if too many hops from seeds
TransclusionDecideRule         // notwithstanding above, accept if within a few transcluding hops (frames/imgs/redirects)
PathologicalPathDecideRule     // but reject if pathological repetitions
TooManyPathSegmentsDecideRule  // ...or if too many path-segments
PrerequisiteAcceptDecideRule   // but always accept a prerequisite of another URI
</pre></p>
<p>In Heritrix 1.10.0, the default profile was changed to use the above set of DecideRules (previously, the operator had to choose the 'deciding-default' profile, which has since been removed).</p>
<p>The naming, behavior, and user interface for DecideRule-based scoping are subject to significant change in future releases, based on feedback and experience.</p>
<p>Enable FINE logging on the class <code class="literal">org.archive.crawler.deciderules.DecideRuleSequence</code> to watch each DecideRule's finding on each processed URI.</p>
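<p>A minimal way to turn this on, assuming Heritrix reads its java.util.logging levels from the standard logging configuration file shipped with the distribution (typically <code class="literal">conf/heritrix.properties</code>; the exact file may differ in your installation), is to add a line such as:</p>
<pre>
# Log every DecideRule's decision on every processed URI (very verbose)
org.archive.crawler.deciderules.DecideRuleSequence.level = FINE
</pre>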
</div></div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="frontier"></a>6.1.2. Frontier</h4></div></div></div>
<p>The Frontier is a pluggable module that maintains the internal state of the crawl: which URIs have been discovered, crawled, and so on. As such, its selection greatly affects, for instance, the order in which discovered URIs are crawled.</p>
<p>There is only one Frontier per crawl job.</p>
<p>Multiple Frontiers are provided with Heritrix, each of a particular character.</p>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="bdbfrontier"></a>6.1.2.1. BdbFrontier</h5></div></div></div>
<p>The default Frontier in Heritrix 1.4.0 and later is the BdbFrontier (previously, the default was the <a href="config.html#hqf" title="6.1.2.2. HostQueuesFrontier">Section 6.1.2.2, “HostQueuesFrontier”</a>). The BdbFrontier visits URIs and sites discovered in a generally breadth-first manner. It offers configuration options controlling how it throttles its activity against particular hosts, and whether it has a bias towards finishing hosts in progress ('site-first' crawling) or cycling among all hosts with pending URIs.</p>
<p>Discovered URIs are only crawled once, except that robots.txt and DNS information can be configured to be refreshed at specified intervals for each host.</p>
<p>The main difference between the BdbFrontier and its precursor, the <a href="config.html#hqf" title="6.1.2.2. HostQueuesFrontier">Section 6.1.2.2, “HostQueuesFrontier”</a>, is that the BdbFrontier uses BerkeleyDB Java Edition to shift more running Frontier state to disk.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="hqf"></a>6.1.2.2. HostQueuesFrontier</h5></div></div></div>
<p>The forerunner of the <a href="config.html#bdbfrontier" title="6.1.2.1. BdbFrontier">Section 6.1.2.1, “BdbFrontier”</a>. It is now deprecated, mostly because its custom disk-based data structures could not move as much Frontier state out of main memory as the BerkeleyDB Java Edition approach. It has the same general characteristics as the <a href="config.html#bdbfrontier" title="6.1.2.1. BdbFrontier">Section 6.1.2.1, “BdbFrontier”</a>.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="dsf"></a>6.1.2.3. DomainSensitiveFrontier</h5></div></div></div>
<p>A subclass of the <a href="config.html#hqf" title="6.1.2.2. HostQueuesFrontier">Section 6.1.2.2, “HostQueuesFrontier”</a> written by Oskar Grenholm. The DSF allows specifying an upper bound on the number of documents downloaded per site. It does this by exploiting <a href="config.html#overrides" title="6.4. Overrides">Section 6.4, “Overrides”</a>, adding a filter that blocks further fetching once the crawler has reached the per-site limit.</p></div>
<div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="arf"></a>6.1.2.4. AdaptiveRevisitingFrontier</h5></div></div></div>
<p>The AdaptiveRevisitingFrontier -- a.k.a. the <span class="emphasis"><em>AR</em></span> Frontier -- will repeatedly visit all encountered URIs. The wait time between visits is configurable and varies based on wait intervals specified by a WaitEvaluator processor. It was written by Kristinn Sigurdsson.</p>
<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>This Frontier is still experimental, in active development, and has not been tested extensively.</p></div>
<p>In addition to the WaitEvaluator (or a similar processor), a crawl using this Frontier also needs the ChangeEvaluator processor: that is, this Frontier requires that ChangeEvaluator and WaitEvaluator, or equivalents, are present in the processing chains.</p>
<p>ChangeEvaluator should be at the very top of the extractor chain.</p>
<p>WaitEvaluator -- or an equivalent -- needs to be in the post-processing chain.</p>
<p>The ChangeEvaluator has no configurable settings. The WaitEvaluator, however, has numerous settings to adjust the revisit policy:</p>
<div class="itemizedlist"><ul type="disc">
<li><p>Initial wait: a waiting period before revisiting for the first time.</p></li>
<li><p>Increase and decrease factors for unchanged and changed documents, respectively. If a document has not changed between visits, its wait time is multiplied by the "unchanged-factor"; if it has changed, the wait time is divided by the "changed-factor". Both values accept real numbers, not just integers (see the sketch at the end of this section).</p></li>
<li><p>Finally, there is a 'default-wait-interval' for URIs where it is not possible to judge changes in content. Currently this applies only to DNS lookups.</p></li>
</ul></div>
<p>If you want to specify different wait times and factors for URIs based on their MIME types, this is possible. Create a Refinement (<a href="config.html#refinements" title="6.5. Refinements">Section 6.5, “Refinements”</a>) using the ContentType criteria, give it a regular expression that matches the desired MIME type as its parameter, and then override the applicable parameters in the refinement.</p>
<p>By setting the 'state' directory to the same location that another AR crawl used, it should resume that crawl (minus some statistics).</p>
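<p>To make the factor arithmetic concrete, here is a minimal sketch of the adjustment rule described above. It is not the actual WaitEvaluator source; the class, method, and field names are illustrative only, and a production policy would typically also bound the result to sensible minimum and maximum waits.</p>
<pre>
// Illustrative sketch only -- not Heritrix's WaitEvaluator implementation.
public class RevisitWaitSketch {
    private final double unchangedFactor; // e.g. 1.5: back off when content is stable
    private final double changedFactor;   // e.g. 1.5: revisit sooner when content changes

    public RevisitWaitSketch(double unchangedFactor, double changedFactor) {
        this.unchangedFactor = unchangedFactor;
        this.changedFactor = changedFactor;
    }

    /** Next wait (in seconds), given the previous wait and whether the content changed. */
    public double nextWaitSeconds(double lastWaitSeconds, boolean contentChanged) {
        return contentChanged
                ? lastWaitSeconds / changedFactor     // changed: divide by the changed-factor
                : lastWaitSeconds * unchangedFactor;  // unchanged: multiply by the unchanged-factor
    }
}
</pre>
<p>For example, with both factors set to 1.5 and an initial wait of one hour, three consecutive unchanged visits push the wait to about 1 × 1.5<sup>3</sup> ≈ 3.4 hours, while a single detected change pulls it back down by a factor of 1.5.</p>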
</div></div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="processors"></a>6.1.3. Processing Chains</h4></div></div></div>
<p>When a URI is crawled, it is in fact passed through a series of processors. This series is split, for convenience, into five chains, and the user can add, remove, and reorder the processors on each of these chains.</p>
<p>Each URI taken off the Frontier queue runs through the <code class="literal">Processing Chains</code> listed in the diagram shown below. URIs are always processed in the order shown in the diagram unless a particular processor throws a fatal error or decides to stop the processing of the current URI for some reason. In that case, processing skips to the end, to the Post-processing chain, for cleanup (a sketch of this flow appears at the end of this section).</p>
<p>Each processing chain is made up of zero or more individual processors. For example, the extractor processing chain might comprise the <code class="literal">ExtractorHTML</code>, <code class="literal">ExtractorJS</code>, and <code class="literal">ExtractorUniversal</code> processors. Within a processing chain, the processors run in the order in which they are listed on the modules page.</p>
<p>Generally, particular processors only make sense within the context of one particular processing chain. For example, it would not make sense to run the <code class="literal">FetchHTTP</code> processor in the Post-processing chain. This is, however, not enforced, so users must take care to construct logical processing chains.</p>
<div><img src="processing_steps.png"></div>
<p>Most of the processors are fairly self-explanatory; however, the processors in the first and last chains merit a bit more attention.</p>
<p>In the <code class="literal">Pre-fetch processing</code> chain, the following two processors should be included (or replacement modules that perform similar operations):</p>
<div class="itemizedlist"><ul type="disc">
<li><p><span class="bold"><strong>Preselector</strong></span></p><p>A last check of whether the URI should indeed be crawled; it can, for example, recheck the scope. This is useful if the scope has been changed after the crawl starts. (This processor is not strictly necessary.)</p></li>
<li><p><span class="bold"><strong>PreconditionEnforcer</strong></span></p><p>Ensures that all preconditions for crawling a URI have been met; these currently include verifying that DNS and robots.txt information has been fetched for the URI. It should always be included.</p></li>
</ul></div>
<p>Similarly, the <code class="literal">Post Processing</code> chain has the following special-purpose processors:</p>
<div class="itemizedlist"><ul type="disc">
<li><p><span class="bold"><strong>CrawlStateUpdater</strong></span></p><p>Updates the per-host information that may have been affected by the fetch; currently this is robots.txt and IP address information. It should always be included.</p></li>
<li><p><span class="bold"><strong>LinksScoper</strong></span></p><p>Checks all links extracted from the current download against the crawl scope. Those that are out of scope are discarded. Logging of discarded URLs can be enabled.</p></li>
<li><p><span class="bold"><strong>FrontierScheduler</strong></span></p><p>'Schedules' with the Frontier, for crawling, any URIs stored as CandidateURIs in the current CrawlURI. It also schedules prerequisites, if any.</p></li>
</ul></div>
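<p>The control flow described above -- run each chain's processors in order, but always fall through to the Post-processing chain when a processor halts the URI -- can be sketched roughly as follows. This is a conceptual illustration, not Heritrix source; the <code class="literal">Processor</code> interface and the method names are made up for this example.</p>
<pre>
// Conceptual sketch of URI flow through the processing chains -- not Heritrix source.
// The Processor interface and method names here are hypothetical.
public class ProcessingChainSketch {

    /** One processor; returning false means "stop processing this URI". */
    interface Processor {
        boolean process(String uri);
    }

    /**
     * Run a URI through the main chains (pre-fetch, fetch, extractor, write) in
     * order. If any processor halts the URI, the remaining main chains are
     * skipped, but the post-processing chain always runs so that cleanup and
     * scheduling of discovered links still happen.
     */
    static void runUri(String uri, Processor[][] mainChains, Processor[] postChain) {
        mainLoop:
        for (Processor[] chain : mainChains) {
            for (Processor p : chain) {
                if (!p.process(uri)) {
                    break mainLoop; // skip ahead to post-processing
                }
            }
        }
        for (Processor p : postChain) {
            p.process(uri);
        }
    }
}
</pre>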
</div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="stattrack"></a>6.1.4. Statistics Tracking</h4></div></div></div>
<p>Any number of statistics tracking modules can be attached to a crawl, though currently only one is provided with Heritrix. The <code class="literal">StatisticsTracker</code> module that comes with Heritrix writes the <code class="literal">progress-statistics.log</code> file and provides the WUI with the data it needs to display progress information about a crawl. It is strongly recommended that any crawl run with the WUI use this module.</p></div></div>
<div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="submodules"></a>6.2. Submodules</h3></div></div></div>
<p>On the Submodules tab, configuration points that take variable-sized listings of components can be configured; components can be added, ordered, and removed. Examples of such components are listings of canonicalization rules to run against each discovered URL, <a href="config.html#filters" title="6.2.2. Filters">Section 6.2.2, “Filters”</a> on processors, and credentials. Once submodules are added under the Submodules tab, they appear in subsequent redrawings of the Settings tab, where the values that control their operation can then be set.</p>