Heritrix to gain access to areas of websites requiring authentication. As with all modules they are only added here (supplying a unique name for each credential) and then configured on the settings page (<a href="config.html#settings" title="6.3. Settings">Section 6.3, “Settings”</a>).</p><p>One of the settings for each credential is its <code class="literal">credential-domain</code> and thus it is possible to create all credentials on the global level. However, since this can cause excessive unneeded checking of credentials, it is recommended that credentials be added to the appropriate domain override (see <a href="config.html#overrides" title="6.4. Overrides">Section 6.4, “Overrides”</a> for details). That way the credential is only checked when the relevant domain is being crawled.</p><p>Heritrix can do two types of authentication: <a href="http://www.faqs.org/rfcs/rfc2617.html" target="_top">RFC2617</a> (BASIC and DIGEST Auth) and POST or GET of an HTML Form.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Logging</h3><p>To enable text console logging of authentication interactions (for example, for debugging), set the FetchHTTP and PreconditionEnforcer log levels to FINE:</p><pre class="programlisting">org.archive.crawler.fetcher.FetchHTTP.level = FINE
org.archive.crawler.prefetch.PreconditionEnforcer.level = FINE</pre><p>This is done by editing the <code class="filename">heritrix.properties</code> file under the <code class="filename">conf</code> directory as described in <a href="install.html#heritrix.properties" title="2.2.2.1. heritrix.properties">Section 2.2.2.1, “heritrix.properties”</a>.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N106CD"></a>6.2.3.1. <a href="http://www.faqs.org/rfcs/rfc2617.html" target="_top">RFC2617</a> (BASIC and DIGEST Auth)</h5></div></div></div><p>Supply <a href="#cd" target="_top">credential-domain</a>, <a href="#realm" target="_top">realm</a>, login, and password.</p><p>The way RFC2617 authentication works in Heritrix is that, in response to a 401 response code (Unauthorized), Heritrix uses a key made up of the Credential Domain plus Realm to do a lookup into its Credential Store. If a match is found, the credential is loaded into the CrawlURI and the CrawlURI is marked for immediate retry.</p><p>When the requeued CrawlURI comes around again, this time through, the found credentials are added to the request. If the request succeeds -- result code of 200 -- the credentials are promoted to the CrawlServer and all subsequent requests made against this CrawlServer will preemptively volunteer the credential. If the credential fails -- we get another 401 -- then the URI is allowed to die a natural 401 death.</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="cd"></a>6.2.3.1.1. credential-domain</h6></div></div></div><p>This equates to the canonical root URI of RFC2617; effectively, in our case, it is the CrawlServer name or <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html" target="_top">URI authority</a> (domain plus port if other than port 80). Examples of credential-domain would be 'www.archive.org' or 'www.archive.org:8080', etc.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="realm"></a>6.2.3.1.2. realm</h6></div></div></div><p>Realm as per <a href="http://www.faqs.org/rfcs/rfc2617.html" target="_top">RFC2617</a>. The realm string must match exactly the realm name presented in the authentication challenge served up by the web server.</p>
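<p>For illustration only (the host name, realm string, and account details below are hypothetical), a server protecting an area with BASIC authentication might answer the first request with a challenge such as:</p><pre class="programlisting">HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="Member Area"</pre><p>The matching RFC2617 credential would then be configured on the Settings page with the realm copied verbatim from that challenge:</p><pre class="programlisting">credential-domain: www.example.com
realm: Member Area
login: jdoe
password: secret</pre>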
</div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N106F5"></a>6.2.3.1.3. Known Limitations</h6></div></div></div><div class="simplesect" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N106F8"></a>One Realm per Credential Domain Only</h6></div></div></div><p>Currently, you can only have one realm per credential domain.</p></div><div class="simplesect" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N106FD"></a>Digest Auth works for Apache</h6></div></div></div><p>... but your mileage may vary going up against other servers (see <a href="http://sourceforge.net/tracker/index.php?func=detail&aid=914301&group_id=73833&atid=539102" target="_top">[ 914301 ] Logging in (HTTP POST, Basic Auth, etc.)</a> to learn more).</p></div></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10706"></a>6.2.3.2. HTML Form POST or GET</h5></div></div></div><p>Supply <a href="#cdh" target="_top">credential-domain</a>, <a href="#httpmethod" target="_top">http-method</a>, <a href="#loginuri" target="_top">login-uri</a>, and <a href="#formitems" target="_top">form-items</a>.</p><p>Before a <code class="literal">uri</code> is scheduled, we look for preconditions. Examples of preconditions are the fetching of the dns record for the server that hosts the <code class="literal">uri</code> and the fetching of <code class="literal">robots.txt</code>: i.e. we don't fetch any <code class="literal">uri</code> unless we have first gotten the <code class="literal">robots.txt</code> file. HTML Form Credentials are handled as a precondition. If there are HTML Form Credentials for a particular crawlserver in the credential store, the uri specified in the HTML Form Credential login-uri field is scheduled as a precondition for the site, after the fetching of the dns and robots preconditions.</p><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="cdh"></a>6.2.3.2.1. credential-domain</h6></div></div></div><p>Same as the RFC2617 Credential <a href="#cd" target="_top">credential-domain</a>.</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="loginuri"></a>6.2.3.2.2. login-uri</h6></div></div></div><p>Relative or absolute URI to the page that the HTML Form submits to (not the page that contains the HTML Form).</p></div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="formitems"></a>6.2.3.2.3. form-items</h6></div></div></div><p>Listing of HTML Form key/value pairs. Don't forget to include the form submit button.</p>
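<p>As a purely illustrative sketch (the host, path, and field names below are made up and must be replaced with whatever the target site's login form actually uses), an HTML Form Credential for a site with a POST-based login form might be configured as:</p><pre class="programlisting">credential-domain: www.example.com
http-method: POST
login-uri: /members/login.jsp
form-items:
    username = jdoe
    password = secret
    submit   = Login</pre><p>The form-items keys correspond to the name attributes of the form's input fields, including the submit button, as noted above.</p>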
</div><div class="sect5" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N10747"></a>6.2.3.2.4. Known Limitations</h6></div></div></div><div class="simplesect" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1074A"></a>Site is crawled logged in or not; cannot do both</h6></div></div></div><p>If a site has an HTML Form Credential associated with it, then the next thing done after the fetching of the dns record and the robots.txt is that a login is performed against all listed HTML Form Credential login-uris. This means that the crawler will only ever view sites that have HTML Form Credentials from the 'logged-in' perspective. There is currently no way of telling the crawler to crawl the site 'non-logged-in' and then, when done, log in and crawl the site anew, this time from the 'logged-in' perspective (at least, not as part of a single crawl job).</p></div><div class="simplesect" lang="en"><div class="titlepage"><div><div><h6 class="title"><a name="N1074F"></a>No means of verifying or rerunning login</h6></div></div></div><p>The login is run once only and the crawler continues whether the login succeeded or not. There is no means of telling the crawler to retry upon unsuccessful authentication. Neither is there a means for the crawler to report success or failure (the operator is expected to study the logs to see whether authentication ran successfully).</p></div></div></div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="settings"></a>6.3. Settings</h3></div></div></div><p>This page presents a semi-treelike representation of all the modules (fixed and pluggable alike) that make up the current configuration and allows the user to edit any of their settings. Go to the Modules and SubModules tabs to add, remove, or replace the modules mentioned here on the Settings page.</p><p>The first option presented directly under the top tabs is whether to hide or display 'expert settings'. Expert settings are those settings that are rarely changed and should only be changed by someone with a clear understanding of their implications. This document will not discuss any of the expert settings.</p><p>The first setting is the description of the job previously discussed. The seed list is at the bottom of the page. Between the two are all the other possible settings.</p><p>Module names are presented in bold and a short explanation of each is provided. As discussed in the previous three chapters, some of them can be replaced, removed or augmented.</p><p>Behind each module and setting name a small question mark is present. Clicking it pops up a more detailed explanation of the relevant item. For most settings users should refer to that as their primary source of information.</p><p>Some settings provide a fixed number of possible 'legal' values in combo boxes. Most, however, are typical text input fields. Two types of settings require a bit of additional attention.</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>Lists</strong></span></p><p>Some settings are a list of values. In those cases a list is printed with an associated <span class="emphasis"><em>Remove</em></span> button, and an input box is printed below it with an <span class="emphasis"><em>Add</em></span> button. Only the items in the list box are considered part of the list itself. A value in the input box does not become part of the list until the user clicks <span class="emphasis"><em>Add</em></span>. There is no way to edit existing values beyond removing them and replacing them with corrected values. It is also not possible to reorder the list.</p></li><li><p><span class="bold"><strong>Simple typed maps</strong></span></p><p>Generally, maps in the Heritrix settings framework contain program modules (such as the processors, for example) and are therefore edited elsewhere. However, maps that only accept simple data types (Java primitives) can be edited here.</p><p>Entries are treated as key/value pairs. Two input boxes are provided for new entries, with the first one representing the key and the second the value.
Clicking the associated <span class="emphasis"><em>Add</em></span> button adds the entry to the map. Above the input boxes a list of existing entries is displayed along with a <span class="emphasis"><em>Remove</em></span> option. Simple maps cannot be reordered.</p></li></ul></div><p>Changes on this page are not saved until you navigate to another part of the settings framework or click the submit job/finished tab.</p><p>If there is a problem with one of the settings, a red star will appear next to it. Clicking the star will display the relevant error message.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10788"></a>6.3.1. Basic settings</h4></div></div></div><p>Some settings are always present. They form the so-called crawl order, the root of the settings hierarchy that other modules plug into.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N1078D"></a>6.3.1.1. Crawl limits</h5></div></div></div><p>In addition to limits imposed on the scope of the crawl, it is possible to enforce arbitrary limits on the duration and extent of the crawl with the following settings (see the example below):</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>max-bytes-download</strong></span></p><p>Stop after a fixed number of bytes have been downloaded. 0 means unlimited.</p></li><li><p><span class="bold"><strong>max-document-download</strong></span></p><p>Stop after downloading a fixed number of documents. 0 means unlimited.</p></li><li><p><span class="bold"><strong>max-time-sec</strong></span></p><p>Stop after a certain number of seconds have elapsed. 0 means unlimited.</p><p>For handy reference, there are 3600 seconds in an hour and 86400 seconds in a day.</p></li></ul></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>These are not hard limits. Once one of these limits is hit, it will trigger a graceful termination of the crawl job, which means that URIs already being crawled will be completed. As a result the set limit will be exceeded by some amount.</p></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N107AD"></a>6.3.1.2. max-toe-threads</h5></div></div></div><p>Set the number of toe threads.</p>
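<p>Returning to the crawl limits above, the following is a purely illustrative combination of values (entered on the Settings page; the numbers are arbitrary). It would stop the crawl gracefully as soon as roughly 100 GB has been downloaded, one million documents have been fetched, or three days have elapsed, whichever is hit first:</p><pre class="programlisting">max-bytes-download    = 107374182400   (100 x 1024^3 bytes, i.e. 100 GB)
max-document-download = 1000000
max-time-sec          = 259200         (3 days x 86400 seconds)</pre>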