config.html
<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>6.&nbsp;Configuring jobs and profiles</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="creating.html" title="5.&nbsp;Creating jobs and profiles"><link rel="next" href="running.html" title="7.&nbsp;Running a job"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">6.&nbsp;Configuring jobs and profiles</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="creating.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="running.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="config"></a>6.&nbsp;Configuring jobs and profiles</h2></div></div></div><p>Creating crawl jobs (<a href="creating.html#crawljob">Section&nbsp;5.1, &ldquo;Crawl job&rdquo;</a>) and profiles (<a href="creating.html#profile">Section&nbsp;5.2, &ldquo;Profile&rdquo;</a>) is just the first step. Configuring them is a more complicated process.</p><p>The following section applies equally to configuring crawl jobs and profiles. It does not matter if new ones are being created or existing ones are being edited. The interface is almost entirely the same; only the <span class="emphasis"><em>Submit job</em></span> / <span class="emphasis"><em>Finished</em></span> button varies.<div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Editing options for jobs being crawled are somewhat limited. See <a href="running.html#editrun" title="7.4.&nbsp;Editing a running job">Section&nbsp;7.4, &ldquo;Editing a running job&rdquo;</a> for more.</p></div></p><p>Each page in the configuration section of the WUI has a secondary row of tabs below the general ones. This secondary row is often replicated at the bottom of longer pages.</p><p>This row offers access to different parts of the configuration. While configuring the global level (more on global vs.
overrides and refinements in <a href="config.html#overrides" title="6.4.&nbsp;Overrides">Section&nbsp;6.4, &ldquo;Overrides&rdquo;</a> and <a href="config.html#refinements" title="6.5.&nbsp;Refinements">Section&nbsp;6.5, &ldquo;Refinements&rdquo;</a>) the following options are available (left to right):</p><div class="itemizedlist"><ul type="disc"><li><p>Modules (<a href="config.html#modules" title="6.1.&nbsp;Modules (Scope, Frontier, and Processors)">Section&nbsp;6.1, &ldquo;Modules (Scope, Frontier, and Processors)&rdquo;</a>)</p><p>Add/remove/set configurable modules, such as the crawl Scope (<a href="config.html#scopes" title="6.1.1.&nbsp;Crawl Scope">Section&nbsp;6.1.1, &ldquo;Crawl Scope&rdquo;</a>), Frontier (<a href="config.html#frontier" title="6.1.2.&nbsp;Frontier">Section&nbsp;6.1.2, &ldquo;Frontier&rdquo;</a>), or Processors (<a href="config.html#processors" title="6.1.3.&nbsp;Processing Chains">Section&nbsp;6.1.3, &ldquo;Processing Chains&rdquo;</a>).</p></li><li><p>Submodules (<a href="config.html#submodules" title="6.2.&nbsp;Submodules">Section&nbsp;6.2, &ldquo;Submodules&rdquo;</a>)</p><p>Here you can:</p><div class="itemizedlist"><ul type="circle"><li><p>Add/remove/reorder URL canonicalization rules (<a href="config.html#urlcanon" title="6.2.1.&nbsp;URL Canonicalization Rules">Section&nbsp;6.2.1, &ldquo;URL Canonicalization Rules&rdquo;</a>)</p></li><li><p>Add/remove/reorder filters (<a href="config.html#filters" title="6.2.2.&nbsp;Filters">Section&nbsp;6.2.2, &ldquo;Filters&rdquo;</a>)</p></li><li><p>Add/remove login credentials (<a href="config.html#credentials" title="6.2.3.&nbsp;Credentials">Section&nbsp;6.2.3, &ldquo;Credentials&rdquo;</a>)</p></li></ul></div></li><li><p>Settings (<a href="config.html#settings" title="6.3.&nbsp;Settings">Section&nbsp;6.3, &ldquo;Settings&rdquo;</a>)</p><p>Configure settings on Heritrix modules</p></li><li><p>Overrides (<a href="config.html#overrides" title="6.4.&nbsp;Overrides">Section&nbsp;6.4, &ldquo;Overrides&rdquo;</a>)</p><p>Override settings on Heritrix modules based on domain</p></li><li><p>Refinements (<a href="config.html#refinements" title="6.5.&nbsp;Refinements">Section&nbsp;6.5, &ldquo;Refinements&rdquo;</a>)</p><p>Refine settings on Heritrix modules based on arbitrary criteria</p></li><li><p>Submit job / Finished</p><p>Clicking this tab takes the user back to the Jobs or Profiles page, saving any changes.</p></li></ul></div><p>The <span class="emphasis"><em>Settings</em></span> tab is probably the most frequently used page, as it allows the user to fine-tune the settings of any Heritrix module used in a job or profile.</p><p>It is safe to navigate between these tabs; doing so will not cause new jobs to be submitted to the queue of pending jobs. That only happens once the <span class="emphasis"><em>Submit job</em></span> tab is clicked. Navigating out of the configuration pages using the top-level tabs, however, will cause a new job to be lost. Any changes made are saved when navigating within the configuration pages. There is no undo function; once made, changes cannot be undone.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="modules"></a>6.1.&nbsp;Modules (Scope, Frontier, and Processors)</h3></div></div></div><p>Heritrix has several types of pluggable modules. These modules, while having a fixed interface, usually have a number of provided implementations. They can also be third-party plugins. The "Modules" tab allows the user to set several types of these pluggable modules.</p>
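<p>The pattern is easiest to see in miniature. The sketch below is purely illustrative -- the interface and class names are invented, not Heritrix's actual API -- but it shows the idea behind pluggable modules: a fixed interface, one of possibly many implementations, and instantiation by fully qualified class name (which, as the note below explains, is how the WUI refers to modules):</p><pre class="programlisting">
// Illustrative sketch only; not Heritrix's actual API.
interface PluggableModule {                        // hypothetical fixed interface
    String describe();
}

class ExampleFrontier implements PluggableModule { // one hypothetical implementation
    public String describe() { return "example frontier implementation"; }
}

public class ModuleLoaderSketch {
    public static void main(String[] args) throws Exception {
        // A fully qualified class name uniquely identifies one implementation,
        // e.g. org.archive.crawler.frontier.BdbFrontier in a real crawl order.
        String className = "ExampleFrontier";
        PluggableModule m = (PluggableModule)
                Class.forName(className).getDeclaredConstructor().newInstance();
        System.out.println(m.describe());
    }
}
</pre>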
<p>Once modules have been added to the configuration, they can be configured in greater detail on the Settings tab (<a href="config.html#settings" title="6.3.&nbsp;Settings">Section&nbsp;6.3, &ldquo;Settings&rdquo;</a>). If a module can contain multiple other modules within it, these can be configured on the Submodules tab.</p><p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Modules are referred to by their Java class names (org.archive.crawler.frontier.BdbFrontier). This is done because these are the only names we can be assured are unique.</p></div>See the <a href="http://crawler.archive.org/articles/developer_manual/index.html" target="_top">Developer's Manual</a> for information about creating and adding custom modules to Heritrix.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="scopes"></a>6.1.1.&nbsp;Crawl Scope</h4></div></div></div><p>A crawl scope is an object that decides, for each discovered URI, whether it is within the scope of the current crawl.</p><p>Several scopes are provided with Heritrix:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>BroadScope</strong></span></p><p>This scope allows for limiting the depth of a crawl (how many links away Heritrix should crawl) but does not impose any limits on the hosts, domains, or URI paths crawled.</p></li><li><p><a name="surtprefixscope"></a><span class="bold"><strong>SurtPrefixScope</strong></span></p><p>A highly flexible and fairly efficient scope that can crawl within defined domains, individual hosts, or path-defined areas of hosts, or any mixture of those, depending on the configuration.</p><p>It considers whether any URI is inside the primary focus of the scope by converting the URI to its <a href="glossary.html#surt">SURT</a> form, and then seeing if that SURT form begins with any of a number of <a href="glossary.html#surtprefix">SURT prefix</a>es. (See the glossary definitions for detailed information about the SURT form of a URI and SURT prefix comparisons.)</p><p>The operator may establish the set of SURT prefixes used either by letting the SURT prefixes be implied from the supplied seed URIs, by specifying an external file with a listing of SURT prefixes, or both.</p><p>This scope also enables a special syntax within the seeds list for adding SURT prefixes separate from seeds. Any line in the seeds list beginning with a '+' will be considered a SURT prefix specification, rather than a seed. Any URL you put after the '+' will only be used to deduce a SURT prefix -- it will not be independently scheduled. You can also put your own literal SURT prefix after the '+'.</p>
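<p>To make the prefix matching concrete, here is roughly how URIs convert to SURT form (an illustrative sketch; the glossary entries give the precise rules):</p><pre class="programlisting">
URI                                  SURT form
http://www.example.org/dir/p.html    http://(org,example,www,)/dir/p.html
http://example.org/                  http://(org,example,)/

The SURT prefix http://(org,example,  (note the missing closing
parenthesis) begins both SURT forms above, so both URIs -- and any
other URI on example.org or one of its subdomains -- are in scope.
</pre>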
<p>For example, the following SURT prefix directives in the seeds box are all equivalent:</p><p><pre class="programlisting">
+http://(org,example,      # literal SURT prefix
+http://example.org        # regular URL implying same SURT prefix
+example.org               # URL fragment with implied 'http' scheme
</pre></p><p>When you use this scope, it adds three hard-to-find-in-the-UI attributes -- <code class="literal">surts-source-file</code>, <code class="literal">seeds-as-surt-prefixes</code>, and <code class="literal">surts-dump-file</code> -- to the end of the scope section, just after <code class="literal">transitiveFilter</code> but before <code class="literal">http-headers</code>.</p><p>Use the <code class="literal">surts-source-file</code> setting to supply an external file from which to infer SURT prefixes, if desired. Any URLs in this file will be converted to the implied SURT prefix, and any line beginning with a '+' will be interpreted as a literal, precise SURT prefix. Use the <code class="literal">seeds-as-surt-prefixes</code> setting to establish whether SURT prefixes should be deduced from the seeds, in accordance with the rules given at the <a href="glossary.html#surtprefix">SURT prefix</a> glossary entry. (The default is 'true', to deduce SURT prefixes from seeds.)</p><p>To see what SURT prefixes were actually used -- perhaps merged from seed-deduced and externally-supplied -- you can specify a file path in the <code class="literal">surts-dump-file</code> setting. The sorted list of actual SURT prefixes used will be written to that file for reference. (Note that redundant entries will be removed from this dump. If you have SURT prefixes &lt;http://(org,&gt; and &lt;http://(org,archive,&gt;, only the former will actually be used, because all SURT form URIs prefixed by the latter are also prefixed by the former.)</p><p>See also the crawler wiki on <a href="http://crawler.archive.org/cgi-bin/wiki.pl?SurtScope" target="_top">SurtScope</a>.</p></li><li><p><span class="bold"><strong>FilterScope</strong></span></p><p>A highly configurable scope. By adding different filters in different combinations, this scope can be configured to provide a wide variety of behaviour.</p><p>After selecting this scope, you must then go to the <span class="emphasis"><em>Filters</em></span> tab and add the filters you want to run as part of your scope. Add the filters at the <span class="emphasis"><em>focusFilter</em></span> label and give them a meaningful name. The URIRegexFilter probably makes most sense in this context (the ContentTypeRegexFilter won't work at scope time because the content-type is not known until after the document has been fetched).</p><p>After adding the filter(s), return to the <span class="emphasis"><em>Settings</em></span> tab and fill in any configuration required of the filters. For example, say you added the URIRegexFilter and you want only 'www.archive.org' URIs to be in focus; fill in a regex like the following: <code class="literal">^(?:http://|dns:)www\.archive\.org.*</code> (Be careful not to rule out prerequisites such as dns or robots.txt when specifying your scope filter.)</p>
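<p>A quick way to sanity-check such a filter regex before starting a crawl is to run it against a few representative URI strings with java.util.regex. The snippet below is a standalone sketch (the sample URIs are invented for illustration):</p><pre class="programlisting">
import java.util.regex.Pattern;

// Standalone sanity check for a scope-filter regex; illustrative only.
public class ScopeRegexCheck {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^(?:http://|dns:)www\\.archive\\.org.*");
        String[] samples = {
            "http://www.archive.org/movies/index.html", // in focus
            "dns:www.archive.org",                      // prerequisite, still matches
            "http://www.example.com/"                   // out of focus
        };
        for (String uri : samples) {
            System.out.println(uri + " -> " + p.matcher(uri).matches());
        }
    }
}
</pre>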
</li></ul></div><p>The following scopes are also available, but the same effects can be achieved more efficiently, and in combination, with SurtPrefixScope. Once SurtPrefixScope can be more easily understood and configured, these scopes may be removed entirely.</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>DomainScope</strong></span></p><p>This scope limits discovered URIs to the set of domains defined by the provided seeds. That is, any discovered URI belonging to a domain from which one of the seeds came is within scope. As always, it is possible to apply depth restrictions.</p><p>Using the seed 'archive.org', a domain scope will fetch 'audio.archive.org', 'movies.archive.org', etc. It will fetch all discovered URIs from 'archive.org' and from any subdomain of 'archive.org'.</p></li><li><p><span class="bold"><strong>HostScope</strong></span></p><p>This scope limits discovered URIs to the set of hosts defined by the provided seeds.</p><p>If the seed is 'www.archive.org', then we'll only fetch items discovered on this host. The crawler will not go to 'audio.archive.org' or 'movies.archive.org'.</p></li><li><p><span class="bold"><strong>PathScope</strong></span></p><p>This scope goes yet further and limits the discovered URIs to a section of paths on hosts defined by the seeds. Of course, any host that has a seed pointing at its root (e.g. <code class="literal">www.sample.com/index.html</code>) will be included in full, whereas a host whose only seed is a URI deeper in its path hierarchy will be limited to URIs at or below that seed's path.</p></li></ul></div>
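<p>The differences are easiest to see side by side. Given a single seed, the three scopes admit roughly the following URIs (an illustrative summary of the descriptions above, ignoring depth limits and prerequisites):</p><pre class="programlisting">
seed: http://www.archive.org/movies/index.html

DomainScope:  http://archive.org/...  http://www.archive.org/...
              http://audio.archive.org/...      (domain and all subdomains)
HostScope:    http://www.archive.org/...        (this host only)
PathScope:    http://www.archive.org/movies/... (this path section only)
</pre>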
