are configured under the Settings tab.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="urlcanon"></a>6.2.1. URL Canonicalization Rules</h4></div></div></div><p>Heritrix keeps a list of already seen URLs and, before fetching, looks up this 'already seen' or 'already included' list to see if the URL has already been crawled. Often a URL can be written in multiple ways but the page fetched is the same in each case. For example, the page that is at <code class="literal">http://www.archive.org/index.html</code> is the same page as is at <code class="literal">http://WWW.ARCHIVE.ORG/</code> though the URLs differ (in this case by case only). Before going to the 'already included' list, Heritrix makes an effort at equating the likes of <code class="literal">http://www.archive.org/index.html</code> and <code class="literal">http://ARCHIVE.ORG/</code> by running each URL through a set of canonicalization rules. Heritrix uses the result of this canonicalization process when it goes to test if a URL has already been seen.</p><p>An example of a canonicalization rule would lowercase all URLs. Another might strip the 'www' prefix from domains.</p><p>The <code class="literal">URL Canonicalization Rules</code> screen allows you to specify canonicalization rules and the order in which they are run. A default set lowercases, strips wwws, removes sessionids and does other types of fixup such as removal of any userinfo. The URL page works in the same manner as the <a href="config.html#filters" title="6.2.2. Filters">Section 6.2.2, “Filters”</a> page.</p><p>To watch the canonicalization process, enable <code class="literal">org.archive.crawler.url.Canonicalizer</code> logging in <code class="literal">heritrix.properties</code> (there should already be a commented-out directive in the properties file; search for it). Output will show in <code class="literal">heritrix_out.log</code>. 
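</p><p>As a rough sketch, the logging directive in <code class="literal">heritrix.properties</code> might look like the lines below. The property name is the one this manual quotes for the <code class="literal">Canonicalizer</code> logger; the comment lines are illustrative only, and the surrounding contents of your properties file may differ.</p><p><pre class="programlisting"># Illustrative comment: log each URL before and after canonicalization.
org.archive.crawler.url.Canonicalizer.level = INFO
# Illustrative alternative: log the result of each individual rule.
# org.archive.crawler.url.Canonicalizer.level = FINE</pre></p><p>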
Set the logging level to INFO to see the URL just before and after the transform. Set the level to FINE to see the result of each rule's transform.</p><p>Canonicalization rules can be added as an override so that an added rule only works in the overridden domain.</p><p>Canonicalization rules are NOT run if the URI-to-check is the result of a redirect. We do this for the following reason. Let's say the www canonicalization rule is in place (the rule that equates 'archive.org' and 'www.archive.org'). If the crawler first encounters 'archive.org' but the server at archive.org wants us to come in via 'www.archive.org', it will redirect us to 'www.archive.org'. The alreadyseen database will have been marked with 'archive.org' on the original access of 'archive.org'. The www canonicalization rule would then run and turn 'www.archive.org' into 'archive.org', which has already been seen. If we always ran canonicalization rules regardless, we wouldn't ever crawl 'www.archive.org'.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="urlcanonexample"></a>6.2.1.1. URL Canonicalization Use Case: Stripping Site-Particular Session IDs</h5></div></div></div><p>Say site x.y.z is returning URLs with a session ID key of <code class="literal">cid</code> as in <code class="literal">http://x.y.z/index.html?cid=XYZ123112232112229BCDEFFA0000111</code>. Say the session ID value is always 32 characters. Say also, for simplicity's sake, that it always appears on the end of the URL.</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="urlcanonexamplesoln"></a>6.2.1.1.1. Solution</h6></div></div></div><p>Add a RegexRule override for the domain x.y.z. To do this, pause the crawl, then add an override for x.y.z by clicking on the <code class="literal">overrides</code> tab in the main menu bar and filling in the domain x.y.z. 
Once in the override screen, click on the <code class="literal">URL</code> tab in the override menu bar (the new bar that appears below the main bar when in override mode) and add a <code class="literal">RegexRule</code> canonicalization rule. Name it <code class="literal">cidStripper</code>. Adjust where you'd like it to appear in the running of canonicalization rules (towards the end should be fine). Now browse back to the override settings. The new canonicalization rule <code class="literal">cidStripper</code> should appear in the settings page list of canonicalization rules. Fill in the RegexRule <code class="literal">matching-regex</code> with something like the following: <code class="literal">^(.+?)(?:cid=[0-9a-zA-Z]{32})?$</code> (this matches an optional tail of 'cid=SOME_32_CHAR_STR' and groups all that comes before it; the reluctant <code class="literal">.+?</code> matters, because a greedy <code class="literal">.+</code> would consume the tail itself and leave the URL unchanged). In the <code class="literal">format</code> field, enter <code class="literal">${1}</code> (this will copy the first group from the regex if the regex matched). To see the rule in operation, set the logging level for <code class="literal">org.archive.crawler.url.Canonicalizer</code> in <code class="literal">heritrix.properties</code> (try uncommenting the line <code class="literal">org.archive.crawler.url.Canonicalizer.level = INFO</code>). Study the output and adjust your regex accordingly.</p><p>See also <a href="http://groups.yahoo.com/group/archive-crawler/message/1611" target="_top">msg1611</a> for another's experience getting regex to work.</p></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="filters"></a>6.2.2. Filters</h4></div></div></div><p>Filters are modules that take a <a href="glossary.html#crawluri">CrawlURI</a> and determine if it matches the criteria of the filter. 
If so, the filter returns true; otherwise it returns false.</p><p>Filters are used in a couple of different contexts in Heritrix.</p><p>Their use in scopes has already been discussed in <a href="config.html#scopes" title="6.1.1. Crawl Scope">Section 6.1.1, “Crawl Scope”</a> and the problems with using them in <a href="config.html#scopeproblems" title="6.1.1.1. Problems with the current Scopes">Section 6.1.1.1, “Problems with the current Scopes”</a>. </p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>A DecidingFilter was added in 1.4.0 to address problems with the current filter model. DecideRules can be added into a DecidingFilter; the filter decision is then the result of processing the included set of DecideRules. There are DecideRule equivalents for all Filter-types mentioned below. See <a href="config.html#decidingscope" title="6.1.1.2. DecidingScope">Section 6.1.1.2, “DecidingScope”</a> for more on the particulars of DecideRules and on the new Deciding model in general. </p></div><p>Aside from scopes, filters are also used in processors. Filters applied to processors always filter URIs <span class="emphasis"><em>out</em></span>. That is to say that any URI matching a filter on a processor will effectively skip over that processor.</p><p>This can be useful to disable (for instance) link extraction on documents coming from a specific section of a given website.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10633"></a>6.2.2.1. Adding, removing and reordering filters</h5></div></div></div><p>The Submodules page of the configuration section of the WUI lists existing filters along with the option to remove, add, or move Filters up or down in the listing.</p><p>Adding a new filter requires giving it a unique name (for that list), selecting the class type of the filter from a combobox and clicking the associated add button. 
After the filter is added, its custom settings, if any, will appear in the Settings page of the configuration UI.</p><p>Since filters can in turn contain other filters (the OrFilter being the best example of this), these lists can become quite complex and at times confusing.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1063C"></a>6.2.2.2. Provided filters</h5></div></div></div><p>The following is an overview of the most useful of the filters provided with Heritrix.</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10641"></a>6.2.2.2.1. org.archive.crawler.filter.OrFilter</h6></div></div></div><p>Contains any number of filters and returns true if any of them returns true. In effect, a logical OR over its filters.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10646"></a>6.2.2.2.2. org.archive.crawler.filter.URIRegExpFilter</h6></div></div></div><p>Returns true if a URI matches the regular expression set for it. See <a href="glossary.html#regexpr">Regular expressions</a> for more about regular expressions in Heritrix.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1064E"></a>6.2.2.2.3. org.archive.crawler.filter.ContentTypeRegExpFilter</h6></div></div></div><p>This filter runs a regular expression against the response <code class="literal">Content-Type</code> header. Returns true if the content type matches the regular expression. The ContentTypeRegExpFilter cannot be used until after the fetcher processors have run, because only then is the Content-Type of the response known. A good place for this filter is the writer step in processing. See <a href="glossary.html#regexpr">Regular expressions</a> for more about regular expressions in Heritrix.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1065A"></a>6.2.2.2.4. 
org.archive.crawler.filter.SurtPrefixFilter</h6></div></div></div><p>Returns true if a URI is prefixed by one of the <a href="glossary.html#surtprefix">SURT prefix</a>es supplied by an external file.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10662"></a>6.2.2.2.5. org.archive.crawler.filter.FilePatternFilter</h6></div></div></div><p>Compares the suffix of a passed URI against a regular expression pattern and returns true for matches.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10667"></a>6.2.2.2.6. org.archive.crawler.filter.PathDepthFilter</h6></div></div></div><p>Returns true for every <a href="glossary.html#crawluri">CrawlURI</a> passed in with a path depth less than or equal to its <code class="literal">max-path-depth</code> value.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10673"></a>6.2.2.2.7. org.archive.crawler.filter.PathologicalPathFilter</h6></div></div></div><p>Checks if a URI contains a repeated pattern.</p><p>This filter checks if a pattern is repeated a specific number of times. Its use is to avoid crawler traps where the server keeps adding the same pattern to the requested URI, like:</p><p><pre class="programlisting"> http://host/img/img/img/img....</pre></p><p>Returns true if such a pattern is found. Sometimes used on a processor but is primarily of use in the exclude section of scopes.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10680"></a>6.2.2.2.8. org.archive.crawler.filter.HopsFilter</h6></div></div></div><p>Returns true for all URIs passed in with a <a href="glossary.html#link-hop-count">Link hop count</a> greater than the <code class="literal">max-link-hops</code> value.</p><p>Generally only used in scopes.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1068E"></a>6.2.2.2.9. 
org.archive.crawler.filter.TransclusionFilter</h6></div></div></div><p>Filter which returns true for <a href="glossary.html#crawluri">CrawlURI</a> instances which contain more than zero but fewer than <code class="literal">max-trans-hops</code> embed entries at the end of their <a href="glossary.html#discoverypath">Discovery path</a>.</p><p>Generally only used in scopes.</p></div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="credentials"></a>6.2.3. Credentials</h4></div></div></div><p>In this section you can add login credentials that will allow