      are configured under the Settings tab.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="urlcanon"></a>6.2.1.&nbsp;URL Canonicalization Rules</h4></div></div></div><p>Heritrix keeps a list of already seen URLs and, before fetching, looks each URL up in this 'already seen' or 'already included' list to see whether it has already been crawled. Often a URL can be written in multiple ways while the page fetched is the same in each case. For example, the page at <code class="literal">http://www.archive.org/index.html</code> is the same page as the one at <code class="literal">http://WWW.ARCHIVE.ORG/</code> though the URLs differ (here in letter case and in the explicit <code class="literal">index.html</code>). Before going to the 'already included' list, Heritrix makes an effort at equating the likes of <code class="literal">http://www.archive.org/index.html</code> and <code class="literal">http://ARCHIVE.ORG/</code> by running each URL through a set of canonicalization rules. Heritrix uses the result of this canonicalization process when it tests whether a URL has already been seen.</p><p>An example of a canonicalization rule would lowercase all URLs. Another might strip the 'www' prefix from domains.</p><p>The <code class="literal">URL Canonicalization Rules</code> screen allows you to specify canonicalization rules and the order in which they are run. A default set lowercases, strips <code class="literal">www</code> prefixes, removes session IDs, and does other types of fixup such as removal of any userinfo.
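</p><p>How an ordered rule set collapses spelling variants of one URL onto a single 'already seen' key can be sketched outside Heritrix. The Python below is an illustration only, not Heritrix code: the rule functions, their names, and the <code class="literal">jsessionid</code> pattern are assumptions made for the example, and the real default rules may differ in detail.</p>
<pre class="programlisting">
```python
import re

def lowercase(url):
    # Fold case so WWW.ARCHIVE.ORG and www.archive.org compare equal.
    return url.lower()

def strip_userinfo(url):
    # Drop a "user:password@" prefix from the authority part.
    return re.sub(r"^(https?://)[^/@]+@", r"\1", url)

def strip_www(url):
    # Treat "www.example.org" and "example.org" as the same host.
    return re.sub(r"^(https?://)www\.", r"\1", url)

def strip_session_id(url):
    # Remove a ";jsessionid=..." path parameter (hypothetical pattern).
    return re.sub(r";jsessionid=[0-9a-z]+", "", url)

# Order matters: each rule sees the output of the one before it.
RULES = [lowercase, strip_userinfo, strip_www, strip_session_id]

def canonicalize(url, rules=RULES):
    for rule in rules:
        url = rule(url)
    return url

already_seen = set()

def should_fetch(url):
    # Look up the canonical form, not the URL as written.
    key = canonicalize(url)
    if key in already_seen:
        return False
    already_seen.add(key)
    return True

print(canonicalize("http://WWW.ARCHIVE.ORG/"))  # http://archive.org/
```
</pre>
<p>With these stand-in rules, a second call to <code class="literal">should_fetch</code> for a differently spelled variant of a URL that was already fetched returns false, because both spellings reduce to the same key.</p><p>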
The URL page works in the same manner as the <a href="config.html#filters" title="6.2.2.&nbsp;Filters">Section&nbsp;6.2.2, &ldquo;Filters&rdquo;</a> page.</p><p>To watch the canonicalization process, enable <code class="literal">org.archive.crawler.url.Canonicalizer</code> logging in <code class="literal">heritrix.properties</code> (the properties file should already contain a commented-out directive; search for it). Output will show in <code class="literal">heritrix_out.log</code>. Set the logging level to INFO to see just the URL before and after the transform. Set the level to FINE to see the result of each rule's transform.</p><p>Canonicalization rules can be added as an override, so that an added rule only works in the overridden domain.</p><p>Canonicalization rules are NOT run if the URI-to-check is the result of a redirect, for the following reason. Let's say the www canonicalization rule is in place (the rule that equates 'archive.org' and 'www.archive.org'). If the crawler first encounters 'archive.org' but the server at archive.org wants us to come in via 'www.archive.org', it will redirect us to 'www.archive.org'. The alreadyseen database will have been marked with 'archive.org' on the original access of 'archive.org'. If the www canonicalization rule then ran on the redirect target, it would turn 'www.archive.org' into 'archive.org', which has already been seen.
If we always ran canonicalization rules regardless, we wouldn't        ever crawl 'www.archive.org'.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="urlcanonexample"></a>6.2.1.1.&nbsp;URL Canonicalization Use Case: Stripping Site-Particular          Session IDs</h5></div></div></div><p>Say site x.y.z is returning URLs with a session ID key of          <code class="literal">cid</code> as in          <code class="literal">http://x.y.z/index.html?cid=XYZ123112232112229BCDEFFA0000111</code>.          Say the session ID value is always 32 characters. Say also, for          simplicity's sake, that it always appears on the end of the          URL.</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="urlcanonexamplesoln"></a>6.2.1.1.1.&nbsp;Solution</h6></div></div></div><p>Add a RegexRule override for the domain x.y.z. To do this,            pause the crawl, add an override for x.y.z by clicking on the            <code class="literal">overrides</code> tab in the main menu bar and filling            in the domain x.y.z. Once in the override screen, click on the            <code class="literal">URL</code> tab in the override menu bar -- the new bar            that appears below the main bar when in override mode -- and add a            <code class="literal">RegexRule</code> canonicalization rule. Name it            <code class="literal">cidStripper</code>. Adjust where you'd like it to            appear in the running of canonicalization rules (Towards the end            should be fine). Now browse back to the override settings. The new            canonicalization rule <code class="literal">cidStripper</code> should appear            in the settings page list of canonicalization rules. 
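</p><p>A candidate pattern for such a rule can be tried offline before it goes into the override. The sketch below uses Python's <code class="literal">re</code> module as a stand-in for the Java regex engine and assumes the rule is applied as a whole-string match whose first group is kept; <code class="literal">strip_cid</code> and both pattern constants are illustrative names, not part of Heritrix. It shows why the group in front of the session ID should be reluctant (<code class="literal">.+?</code>): a greedy <code class="literal">.+</code> swallows the session ID too, the optional tail then matches nothing, and the URL comes back unchanged.</p>
<pre class="programlisting">
```python
import re

# Reluctant leading group: stops as soon as the optional cid tail can match.
RELUCTANT = re.compile(r"^(.+?)(?:cid=[0-9a-zA-Z]{32})?$")
# Greedy leading group: consumes the cid tail itself, so nothing is stripped.
GREEDY = re.compile(r"^(.+)(?:cid=[0-9a-zA-Z]{32})?$")

def strip_cid(url, pattern=RELUCTANT):
    # Mimic a whole-string match followed by a "${1}" format:
    # keep group 1 on a match, otherwise return the URL untouched.
    match = pattern.fullmatch(url)
    return match.group(1) if match else url

url = "http://x.y.z/index.html?cid=XYZ123112232112229BCDEFFA0000111"
print(strip_cid(url))          # http://x.y.z/index.html?
print(strip_cid(url, GREEDY))  # prints the URL unchanged
```
</pre>
<p>Note that the reluctant version leaves the trailing <code class="literal">?</code> in place; whether that matters depends on the other rules in the chain.</p><p>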
Fill in the RegexRule <code class="literal">matching-regex</code> with something like the following: <code class="literal">^(.+?)(?:cid=[0-9a-zA-Z]{32})?$</code> (match a tail of 'cid=SOME_32_CHAR_STR', grouping all that comes before this tail; note the reluctant <code class="literal">.+?</code>, since a greedy <code class="literal">.+</code> would also consume the tail and leave the URL unchanged). In the <code class="literal">format</code> field enter <code class="literal">${1}</code> (this copies the first group from the regex if the regex matched). To see the rule in operation, set the logging level for <code class="literal">org.archive.crawler.url.Canonicalizer</code> in <code class="literal">heritrix.properties</code> (try uncommenting the line <code class="literal">org.archive.crawler.url.Canonicalizer.level = INFO</code>). Study the output and adjust your regex accordingly.</p><p>See also <a href="http://groups.yahoo.com/group/archive-crawler/message/1611" target="_top">msg1611</a> for another user's experience getting a regex to work.</p></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="filters"></a>6.2.2.&nbsp;Filters</h4></div></div></div><p>Filters are modules that take a <a href="glossary.html#crawluri">CrawlURI</a> and determine whether it matches the criteria of the filter. If it does, the filter returns true; otherwise it returns false.</p><p>Filters are used in a couple of different contexts in Heritrix.</p><p>Their use in scopes has already been discussed in <a href="config.html#scopes" title="6.1.1.&nbsp;Crawl Scope">Section&nbsp;6.1.1, &ldquo;Crawl Scope&rdquo;</a> and the problems with using them there in <a href="config.html#scopeproblems" title="6.1.1.1.&nbsp;Problems with the current Scopes">Section&nbsp;6.1.1.1, &ldquo;Problems with the current Scopes&rdquo;</a>.
</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>A DecidingFilter was added in 1.4.0 to address problems with the current filter model. DecideRules can be added to a DecidingFilter, and the filter's decision is the result of processing all the included DecideRules. There are DecideRule equivalents for all Filter types mentioned below. See <a href="config.html#decidingscope" title="6.1.1.2.&nbsp;DecidingScope">Section&nbsp;6.1.1.2, &ldquo;DecidingScope&rdquo;</a> for more on the particulars of DecideRules and on the new Deciding model in general.</p></div><p>Aside from scopes, filters are also used in processors. Filters applied to processors always filter URIs <span class="emphasis"><em>out</em></span>. That is to say, any URI matching a filter on a processor will effectively skip over that processor.</p><p>This can be useful to disable (for instance) link extraction on documents coming from a specific section of a given website.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10633"></a>6.2.2.1.&nbsp;Adding, removing and reordering filters</h5></div></div></div><p>The Submodules page of the configuration section of the WUI lists existing filters along with the options to remove, add, or move filters up or down in the listing.</p><p>Adding a new filter requires giving it a unique name (for that list), selecting the class type of the filter from a combobox and clicking the associated add button.
After the filter is added, its custom settings, if any, will appear in the Settings page of the configuration UI.</p><p>Since filters can in turn contain other filters (the OrFilter being the best example of this), these lists can become quite complex and at times confusing.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1063C"></a>6.2.2.2.&nbsp;Provided filters</h5></div></div></div><p>The following is an overview of the most useful of the filters provided with Heritrix.</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10641"></a>6.2.2.2.1.&nbsp;org.archive.crawler.filter.OrFilter</h6></div></div></div><p>Contains any number of filters and returns true if any of them returns true: in effect, a logical OR over its filters.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10646"></a>6.2.2.2.2.&nbsp;org.archive.crawler.filter.URIRegExpFilter</h6></div></div></div><p>Returns true if a URI matches the regular expression set for it. See <a href="glossary.html#regexpr">Regular expressions</a> for more about regular expressions in Heritrix.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1064E"></a>6.2.2.2.3.&nbsp;org.archive.crawler.filter.ContentTypeRegExpFilter</h6></div></div></div><p>This filter runs a regular expression against the response <code class="literal">Content-Type</code> header. Returns true if the content type matches the regular expression. The filter cannot be used until after the fetcher processors have run, because only then is the Content-Type of the response known. A good place for this filter is the writer step in processing.
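</p><p>The decision this filter makes can be pictured with a small sketch. The Python below is illustrative only: <code class="literal">content_type_matches</code>, the sample pattern, and the whole-string matching are assumptions made for the example rather than Heritrix's implementation. A missing header stands for a URI that has not been fetched yet, which is why the filter is only meaningful after the fetch processors.</p>
<pre class="programlisting">
```python
import re

def content_type_matches(pattern, content_type):
    # Before a fetch there is no response header, so nothing can match.
    if content_type is None:
        return False
    # Assumption: the pattern must match the whole header value.
    return re.fullmatch(pattern, content_type) is not None

# Hypothetical pattern: any image type, with or without parameters.
IMAGES = r"(?i)image/.*"

print(content_type_matches(IMAGES, "image/png"))                 # True
print(content_type_matches(IMAGES, "text/html; charset=UTF-8"))  # False
print(content_type_matches(IMAGES, None))                        # False
```
</pre>
<p>Attached to a writer processor, a filter like this would make matching URIs skip the writer, in line with the rule that filters on processors filter URIs out.</p><p>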
See <a href="glossary.html#regexpr">Regular expressions</a> for more about regular expressions in Heritrix.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1065A"></a>6.2.2.2.4.&nbsp;org.archive.crawler.filter.SurtPrefixFilter</h6></div></div></div><p>Returns true if a URI is prefixed by one of the <a href="glossary.html#surtprefix">SURT prefix</a>es supplied by an external file.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10662"></a>6.2.2.2.5.&nbsp;org.archive.crawler.filter.FilePatternFilter</h6></div></div></div><p>Compares the suffix of a passed URI against a regular expression pattern and returns true for matches.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10667"></a>6.2.2.2.6.&nbsp;org.archive.crawler.filter.PathDepthFilter</h6></div></div></div><p>Returns true for every <a href="glossary.html#crawluri">CrawlURI</a> passed in with a path depth less than or equal to its <code class="literal">max-path-depth</code> value.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10673"></a>6.2.2.2.7.&nbsp;org.archive.crawler.filter.PathologicalPathFilter</h6></div></div></div><p>Checks if a URI contains a repeated pattern.</p><p>This filter checks whether a pattern is repeated a specific number of times. Its use is to avoid crawler traps where the server keeps adding the same pattern to the requested URI, like:</p><p><pre class="programlisting">  http://host/img/img/img/img....</pre></p><p>Returns true if such a pattern is found.
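</p><p>One way to detect such repetition is a regular expression with a backreference. The sketch below is an illustration under stated assumptions, not Heritrix's own pattern: <code class="literal">has_repeated_segment</code> and its <code class="literal">repetitions</code> parameter are hypothetical names, and the check only looks for the same path segment occurring several times in a row.</p>
<pre class="programlisting">
```python
import re

def has_repeated_segment(uri, repetitions=3):
    # Match "/seg" followed by (repetitions - 1) further copies of itself,
    # ending at a segment boundary ("/" or the end of the URI).
    pattern = r"(/[^/]+)\1{%d,}(?=/|$)" % (repetitions - 1)
    return re.search(pattern, uri) is not None

print(has_repeated_segment("http://host/img/img/img/img/a.gif"))  # True
print(has_repeated_segment("http://host/img/css/img/a.gif"))      # False
```
</pre>
<p>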
Sometimes used on a            processor but is primarily of use in the exclude section of            scopes.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10680"></a>6.2.2.2.8.&nbsp;org.archive.crawler.filter.HopsFilter</h6></div></div></div><p>Returns true for all URIs passed in with a <a href="glossary.html#link-hop-count">Link hop count</a> greater than the            <code class="literal">max-link-hops</code> value.</p><p>Generally only used in scopes.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1068E"></a>6.2.2.2.9.&nbsp;org.archive.crawler.filter.TransclusionFilter</h6></div></div></div><p>Filter which returns true for <a href="glossary.html#crawluri">CrawlURI</a>            instances which contain more than zero but fewer than            <code class="literal">max-trans-hops</code> embed entries at the end of            their <a href="glossary.html#discoverypath">Discovery path</a>.</p><p>Generally only used in scopes.</p></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="credentials"></a>6.2.3.&nbsp;Credentials</h4></div></div></div><p>In this section you can add login credentials that will allow