Credentials enable Heritrix to gain access to areas of websites requiring authentication. As with all modules, they are only added here (supplying a unique name for each credential) and then configured on the settings page (Section 6.3, "Settings").

One of the settings for each credential is its credential-domain, so it is possible to create all credentials at the global level. However, since this can cause excessive, unneeded checking of credentials, it is recommended that credentials be added to the appropriate domain override (see Section 6.4, "Overrides" for details). That way the credential is only checked when the relevant domain is being crawled.

Heritrix can do two types of authentication: RFC2617 (BASIC and DIGEST Auth, http://www.faqs.org/rfcs/rfc2617.html) and POST or GET of an HTML form.

Logging

To enable text console logging of authentication interactions (for example, for debugging), set the FetchHTTP and PreconditionEnforcer log levels to FINE:

    org.archive.crawler.fetcher.FetchHTTP.level = FINE
    org.archive.crawler.prefetch.PreconditionEnforcer.level = FINE

This is done by editing the heritrix.properties file under the conf directory, as described in Section 2.2.2.1, "heritrix.properties".

6.2.3.1. RFC2617 (BASIC and DIGEST Auth)

Supply credential-domain, realm, login, and password.

RFC2617 authentication works in Heritrix as follows: in response to a 401 response code (Unauthorized), Heritrix uses a key made up of the credential domain plus realm to do a lookup into its credential store. If a match is found, the credential is loaded into the CrawlURI and the CrawlURI is marked for immediate retry.

When the requeued CrawlURI comes around again, this time through, the found credentials are added to the request. If the request succeeds -- result code 200 -- the credentials are promoted to the CrawlServer, and all subsequent requests made against this CrawlServer will preemptively volunteer the credential. If the credential fails -- we get another 401 -- then the URI is let die a natural 401 death.
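As a minimal sketch of the flow just described -- this is hypothetical illustration, not Heritrix's actual classes or API -- the store can be thought of as a map keyed by credential-domain plus realm, where the credential-domain is simply the URI authority (java.net.URI.getAuthority() is a real JDK method and yields exactly that form):

    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical illustration of the RFC2617 retry flow described above;
    // the class and method names are invented, not Heritrix's actual API.
    class Rfc2617FlowSketch {
        // Credential store keyed by "credential-domain realm",
        // e.g. "www.archive.org:8080 myrealm".
        private final Map<String, String[]> credentialStore = new HashMap<>();
        // Credentials promoted to a server after a successful 200 response.
        private final Map<String, String[]> promotedToServer = new HashMap<>();

        void addCredential(String credentialDomain, String realm,
                           String login, String password) {
            credentialStore.put(credentialDomain + " " + realm,
                                new String[] {login, password});
        }

        /** On a 401: returns true if the URI should be requeued for retry. */
        boolean onUnauthorized(URI uri, String realmFromChallenge) {
            // credential-domain is the URI authority,
            // e.g. 'www.archive.org' or 'www.archive.org:8080'.
            String key = uri.getAuthority() + " " + realmFromChallenge;
            String[] credential = credentialStore.get(key);
            if (credential == null) {
                return false;  // no match: let the URI die its 401 death
            }
            // Heritrix would load the credential into the CrawlURI here.
            return true;       // marked for immediate retry
        }

        /** On a 200 for the retried request: promote to the CrawlServer. */
        void onSuccess(URI uri, String[] credential) {
            // Subsequent requests to this server volunteer it preemptively.
            promotedToServer.put(uri.getAuthority(), credential);
        }

        public static void main(String[] args) throws Exception {
            Rfc2617FlowSketch sketch = new Rfc2617FlowSketch();
            sketch.addCredential("www.archive.org:8080", "myrealm", "login", "password");
            URI uri = new URI("http://www.archive.org:8080/secret/page.html");
            System.out.println(sketch.onUnauthorized(uri, "myrealm")); // true: retry
        }
    }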
6.2.3.1.1. credential-domain

This equates to the canonical root URI of RFC2617; effectively, in our case, it is the CrawlServer name or URI authority (domain, plus port if other than port 80). Examples of credential-domain would be 'www.archive.org' or 'www.archive.org:8080', etc.

6.2.3.1.2. realm

Realm as per RFC2617. The realm string must match exactly the realm name presented in the authentication challenge served up by the web server.

6.2.3.1.3. Known Limitations

One realm per credential domain only: currently, you can only have one realm per credential domain.

Digest Auth works for Apache: ... but your mileage may vary going up against other servers (see "[914301] Logging in (HTTP POST, Basic Auth, etc.)", http://sourceforge.net/tracker/index.php?func=detail&aid=914301&group_id=73833&atid=539102, to learn more).

6.2.3.2. HTML Form POST or GET

Supply credential-domain, http-method, login-uri, and form-items.

Before a URI is scheduled, we look for preconditions. Examples of preconditions are the getting of the DNS record for the server that hosts the URI and the fetching of the robots.txt: i.e. we don't fetch any URI unless we have first gotten the robots.txt file. The HTML Form Credentials are done as a precondition. If there are HTML Form Credentials for a particular CrawlServer in the credential store, the URI specified in the HTML Form Credential login-uri field is scheduled as a precondition for the site, after the fetching of the DNS and robots preconditions.

6.2.3.2.1. credential-domain

Same as the RFC2617 Credential credential-domain.

6.2.3.2.2. login-uri

Relative or absolute URI to the page that the HTML form submits to (not the page that contains the HTML form).

6.2.3.2.3. form-items

Listing of HTML form key/value pairs. Don't forget to include the form submit button.
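As an illustration only -- the field names below are hypothetical and must match whatever input names the target site's login form actually uses -- the form-items correspond to the form's input name/value pairs, including the submit button:

    import java.util.LinkedHashMap;
    import java.util.Map;

    class FormItemsExample {
        public static void main(String[] args) {
            // Hypothetical form-items; the keys must match the input names
            // in the target site's login form (inspect the form's HTML to find them).
            Map<String, String> formItems = new LinkedHashMap<>();
            formItems.put("username", "myLogin");   // assumed input name
            formItems.put("password", "mySecret");  // assumed input name
            formItems.put("submit", "Login");       // the submit button -- easy to forget
            formItems.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }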
6.2.3.2.4. Known Limitations

Site is crawled logged in or not; cannot do both. If a site has an HTML Form Credential associated with it, the next thing done after the getting of the DNS record and the robots.txt is that a login is performed against all listed HTML Form Credential login-uris. This means that the crawler will only ever view sites that have HTML Form Credentials from the 'logged-in' perspective. There is currently no way of telling the crawler to crawl the site 'non-logged-in' and then, when done, log in and crawl the site anew, this time from the 'logged-in' perspective (at least, not as part of the one crawl job).

No means of verifying or rerunning login. The login is run once only, and the crawler continues whether the login succeeded or not. There is no means of telling the crawler to retry upon unsuccessful authentication. Neither is there a means for the crawler to report success or failure (the operator is expected to study the logs to see whether authentication ran successfully).

6.3. Settings

This page presents a semi-treelike representation of all the modules (fixed and pluggable alike) that make up the current configuration and allows the user to edit any of their settings. Go to the Modules and SubModules tabs to add, remove, or replace the modules mentioned here on the Settings page.

The first option presented directly under the top tabs is whether to hide or display 'expert settings'. Expert settings are those settings that are rarely changed and should only be changed by someone with a clear understanding of their implications. This document will not discuss any of the expert settings.

The first setting is the description of the job, previously discussed. The seed list is at the bottom of the page. Between the two are all the other possible settings.

Module names are presented in bold, and a short explanation of each is provided. As discussed in the previous three chapters, some of them can be replaced, removed, or augmented.

Behind each module and setting name a small question mark is present. Clicking on it pops up a more detailed explanation of the relevant item. For most settings, users should refer to that as their primary source of information.

Some settings provide a fixed number of possible 'legal' values in combo boxes. Most, however, are typical text input fields. Two types of settings require a bit of additional attention:
* Lists

  Some settings are a list of values. In those cases a list is printed with an associated Remove button, and an input box is printed below it with an Add button. Only the items in the list box are considered part of the list itself; a value in the input box does not become part of the list until the user clicks Add. There is no way to edit existing values beyond removing them and replacing them with corrected values. It is also not possible to reorder the list.

* Simple typed maps

  Generally, maps in the Heritrix settings framework contain program modules (such as the processors) and are therefore edited elsewhere. However, maps that only accept simple data types (Java primitives) can be edited here.

  They are treated as key/value pairs. Two input boxes are provided for new entries, the first representing the key and the second the value. Clicking the associated Add button adds the entry to the map. Above the input boxes, a list of existing entries is displayed along with a Remove option. Simple maps cannot be reordered.

Changes on this page are not saved until you navigate to another part of the settings framework or click the Submit job/Finished tab.

If there is a problem with one of the settings, a red star will appear next to it. Clicking the star displays the relevant error message.

6.3.1. Basic settings

Some settings are always present. They form the so-called crawl order: the root of the settings hierarchy that other modules plug into.

6.3.1.1. Crawl limits

In addition to limits imposed on the scope of the crawl, it is possible to enforce arbitrary limits on the duration and extent of the crawl with the following settings:

* max-bytes-download

  Stop after a fixed number of bytes have been downloaded. 0 means unlimited.

* max-document-download

  Stop after downloading a fixed number of documents. 0 means unlimited.

* max-time-sec

  Stop after a certain number of seconds have elapsed. 0 means unlimited.

  For handy reference, there are 3600 seconds in an hour and 86400 seconds in a day.

Note

These are not hard limits. Once one of these limits is hit, it triggers a graceful termination of the crawl job, which means that URIs already being crawled will be completed. As a result, the set limit will be exceeded by some amount.
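To illustrate why such limits are soft, here is a minimal sketch (invented names, not Heritrix code) of a graceful byte limit: the check runs only when a URI finishes, and in-flight URIs are allowed to complete, so the running total can overshoot the configured ceiling:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical illustration of a "graceful" crawl limit.
    class SoftLimitSketch {
        private final long maxBytesDownload;               // 0 means unlimited
        private final AtomicLong bytesDownloaded = new AtomicLong();
        private volatile boolean stopRequested = false;

        SoftLimitSketch(long maxBytesDownload) {
            this.maxBytesDownload = maxBytesDownload;
        }

        // Called by a toe thread after it finishes fetching one URI.
        void uriCompleted(long bytesFetched) {
            long total = bytesDownloaded.addAndGet(bytesFetched);
            // The limit is checked only after a fetch completes, and other
            // in-flight URIs still run to completion, so `total` may end up
            // somewhat above maxBytesDownload before the crawl actually stops.
            if (maxBytesDownload > 0 && total >= maxBytesDownload && !stopRequested) {
                stopRequested = true;  // begin graceful termination
            }
        }

        boolean shouldScheduleMoreUris() {
            return !stopRequested;     // no new URIs once a limit is hit
        }
    }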
6.3.1.2. max-toe-threads

Set the number of toe threads.
