      to use verifying server proffered certs).</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Testing</h3><p>Test whether certificates are volunteered even when we're running in open trust mode. Test how hard it is to append a host-specific keystore to the general Heritrix keystore at runtime.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="ntlmdesc"></a>2.4.&nbsp;NTLM [<a href="#ntlm" title="[ntlm]"><span class="abbrev">ntlm</span></a>]</h3></div></div><div></div></div><div class="blockquote"><blockquote class="blockquote"><p>NTLM is...a proprietary protocol designed by Microsoft with no publicly available specification. Early version of NTLM were less secure than Digest authentication due to faults in the design, however these were fixed in a service pack for Windows NT 4 and the protocol is now considered more secure than Digest authentication... There are some significant differences in the way that NTLM works compared with basic and digest authentication...NTLM authenticates a connection and not a request, so you need to authenticate every time a new connection is made and keeping the connection open during authentication is vital. Due to this, NTLM cannot be used to authenticate with both a proxy and the server, nor can NTLM be used with HTTP 1.0 connections or servers that do not support HTTP keep-alives. [<a href="#httpclient" title="[httpclient]"><span class="abbrev">httpclient</span></a>]</p></blockquote></div><p>NTLM is put outside the scope of this proposal because its nature is antithetical to how Heritrix works: i.e. 
it authenticates the connection, not a session [<span class="citation">Also see <a href="#connbased" title="1.1.3.&nbsp;Connection-based authentication schemes">Section&nbsp;1.1.3, &ldquo;Connection-based authentication schemes&rdquo;</a> </span>]. Relatedly, the httpclient implementation of NTLM is incomplete. NTLM will not be discussed further.</p></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N100CD"></a>3.&nbsp;Proposal</h2></div></div><div></div></div><p>The proposal is to put off implementation of client-side certificates in Heritrix. Rare is the case where it's needed.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Workaround?</h3><p>It should be possible to just add the client certificate to the local truststore and have everything work. Test.</p></div><p>Having cut <a href="#ntlmdesc" title="2.4.&nbsp;NTLM ">Section&nbsp;2.4, &ldquo;NTLM &rdquo;</a> and <a href="#clientcertdesc" title="2.3.&nbsp;X509 Client Certificates">Section&nbsp;2.3, &ldquo;X509 Client Certificates&rdquo;</a>, we're left with <a href="#basicdesc" title="2.1.&nbsp;Basic and Digest Access Authentication ">Section&nbsp;2.1, &ldquo;Basic and Digest Access Authentication &rdquo;</a> and <a href="#postdesc" title="2.2.&nbsp;HTTP POST and GET of Authentication Credentials">Section&nbsp;2.2, &ldquo;HTTP POST and GET of Authentication Credentials&rdquo;</a>, which we assume to be the most commonly used web authentication schemes.</p><p>From reading <a href="#schemes" title="2.&nbsp;Authentication Schemes">Section&nbsp;2, &ldquo;Authentication Schemes&rdquo;</a> above, it may be apparent that no single solution will work for both schemes. 
The discussion in the following two sections -- a section per scheme under consideration -- should bring this fact out and help identify facilities common to the two schemes, detailed later in <a href="#commonage" title="3.3.&nbsp;Commonage">Section&nbsp;3.3, &ldquo;Commonage&rdquo;</a>.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N100ED"></a>3.1.&nbsp;Basic and Digest Access Authentication [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</h3></div></div><div></div></div><p>A basic implementation would, upon receipt of a 401 response status code, extract a realm from the 401 response and use <i class="parameter"><tt>realm + URI canonical root URL</tt></i> as a compound key to look up a store of Basic/Digest Auth credentials. If a match is found, the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> made for the current domain is loaded with the discovered credentials and the 401'ing current URI is marked for retry. (If no matching credentials are found, the current URI is marked failed with a 401 response code.)</p><p>Let it be a given that any rfc2617 credentials found in a <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> always get loaded into the HTTP GET request.</p><p>When our 401'ing URI comes around again for retry, since credentials were loaded the last time this URI was seen, credentials will be found in the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> and will be added to the request headers. This time around the authentication should succeed.</p><p>Any other URI that is a member of this realm will also subsequently authenticate successfully, given the above rule whereby we always load any found credentials into the current request.</p><p>Let the above be the default behavior. 
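</p><p>The default flow above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not Heritrix code: <tt class="classname">Rfc2617Record</tt>, <tt class="classname">ServerState</tt> (a stand-in for the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span>) and <tt class="classname">Rfc2617Sketch</tt> are hypothetical names, the credential store is assumed to be a plain array, the <i class="parameter"><tt>URI canonical root URL</tt></i> is assumed to mean scheme plus authority, and only Basic credentials are rendered into a header.</p><pre class="programlisting">
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical store record: per the proposal it holds a realm, login and password. */
record Rfc2617Record(String rootUrl, String realm, String login, String password) {}

/** Hypothetical stand-in for the persistent domain/virtualdomain object. */
final class ServerState {
    Rfc2617Record credentials; // stays null until a 401 is matched against the store
}

final class Rfc2617Sketch {
    enum Outcome { RETRY, FAILED }

    /** Assumed reading of "URI canonical root URL": scheme://authority/ */
    static String rootUrl(URI uri) {
        return uri.getScheme() + "://" + uri.getAuthority() + "/";
    }

    /** Compound key built from the realm and the root URL. */
    static String key(String realm, String rootUrl) {
        return realm + "|" + rootUrl;
    }

    /** Extracts the realm from a challenge such as: Basic realm="private" */
    static String realmOf(String challenge) {
        String marker = "realm=\"";
        int start = challenge.indexOf(marker);
        if (start == -1) {
            return null;
        }
        start += marker.length();
        int end = challenge.indexOf('"', start);
        return end == -1 ? null : challenge.substring(start, end);
    }

    /** On a 401: look up credentials, load them on a match, and report the URI's fate. */
    static Outcome on401(URI uri, String challenge, Rfc2617Record[] store, ServerState server) {
        String wanted = key(realmOf(challenge), rootUrl(uri));
        for (Rfc2617Record r : store) {
            if (key(r.realm(), r.rootUrl()).equals(wanted)) {
                server.credentials = r;   // loaded for this and all later requests
                return Outcome.RETRY;     // the 401'ing URI is marked for retry
            }
        }
        return Outcome.FAILED;            // no match: the URI fails with its 401
    }

    /** Header added to every request once credentials are loaded (Basic only); null otherwise. */
    static String authorizationHeader(ServerState server) {
        if (server.credentials == null) {
            return null;
        }
        String pair = server.credentials.login() + ":" + server.credentials.password();
        return "Basic " + Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }
}
</pre><p>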
Configuration options would cover:</p><div class="itemizedlist"><ul type="disc"><li><p>Enabling or disabling this feature.</p></li><li><p><a name="preemptiveauth"></a>Pre-population of the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> with all rfc2617 credentials upon construction, thereby avoiding 401s altogether since we'd be sending all credentials in advance of any challenge (preemptive authentication). A domain might have many rfc2617 realms. Preemptive authentication would have us volunteering all of a domain's realms' credentials in each request.</p><p>The query of the store pre-populating the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> would use the <i class="parameter"><tt>URI canonical root URL</tt></i> as its key.</p><p>This configuration could be set globally for all Heritrix requests or per <i class="parameter"><tt>URI canonical root URL</tt></i> by setting a property on the corresponding record in the store.</p></li><li><p>Upon receipt of a 401 and on successfully locating appropriate credentials in the store (or already loaded in the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span>), configuration could enable immediately retrying the request rather than letting the 401 percolate down through the Heritrix processing chain and back up out of the Frontier. (Enabling this configuration would leave no trace of the 401 in the ARC.)</p></li></ul></div><p>The simplest implementation would have us always do <a href="preemptiveauth" target="_top">preemptive authentication</a>. 
Configuration would turn this feature on or off, and that'd be all.</p><p>Below we look in more detail at aspects of the implementation proposed above.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10131"></a>3.1.1.&nbsp;CrawlServer</h4></div></div><div></div></div><p>In Heritrix, the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> is <tt class="classname"><a href="http://crawler.archive.org/xref/org/archive/crawler/datamodel/CrawlServer.html" target="_top">org.archive.crawler.datamodel.CrawlServer</a></tt>. It is created in <a href="http://crawler.archive.org/xref/org/archive/crawler/basic/Frontier.html" target="_top">org.archive.crawler.basic.Frontier#next()</a> if no extant CrawlServer is found in the <a href="org.archive.crawler.datamodel.ServerCache" target="_top">org.archive.crawler.datamodel.ServerCache</a>. The lookup is done using a (decoded) <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html" target="_top">URI authority</a>. The currently processed URI has easy access to its corresponding CrawlServer. See <a href="http://crawler.archive.org/xref/org/archive/crawler/datamodel/CrawlURI.html" target="_top">CrawlURI#getServer()</a>.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1014E"></a>3.1.2.&nbsp;HTTPClient</h4></div></div><div></div></div><p>HTTPClient has built-in support for Basic, Digest and NTLM. 
It takes care of sending the appropriate authentication headers.</p><p>Digest Authentication generally works but has a ways to go, according to the comment made on 2004-03-11 16:21 in <a href="http://issues.apache.org/bugzilla/show_bug.cgi?id=27594" target="_top">Wrong reauthentication when using DigestAuthentication</a>.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Multiple Realms</h3><p>What to do if a host has multiple realms? Will HTTPClient [<a href="#httpclient" title="[httpclient]"><span class="abbrev">httpclient</span></a>] do the right thing and offer all available credentials appropriately? Need to test.</p></div><p>The HTTPClient authentication code was just refactored extensively in HEAD -- post 2.0 release. There are reported problems authenticating via a proxy over SSL.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10162"></a>3.1.3.&nbsp;RFC2617 Record</h4></div></div><div></div></div><p>An RFC2617 record would be keyed by <i class="parameter"><tt>URI canonical root URL</tt></i>. It would contain a realm, login and password. We would not distinguish proxy (407) records.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N1016B"></a>3.2.&nbsp;HTTP POST and GET of Authentication Credentials</h3></div></div><div></div></div><p>Every URI processed by Heritrix first has its preconditions checked. Example preconditions are the fetching of a domain's DNS record and its <tt class="filename">robots.txt</tt> file before proceeding to make requests against the domain. 
This proposal is to add a new <span class="emphasis"><em>login precondition</em></span> after the fashion of the robots and DNS preconditions -- see <a href="org.archive.crawler.prefetch.PreconditionEnforcer" target="_top">org.archive.crawler.prefetch.PreconditionEnforcer</a> -- and a facility for having our HTTP fetcher run a configurable one-time login.</p><p>The new <i class="parameter"><tt>login precondition</tt></i> will test the current URI against a preloaded list of <span class="emphasis"><em>login URI patterns</em></span>. Each <i class="parameter"><tt>login URI pattern</tt></i> describes a protected area of a domain (or virtualdomain): e.g. "http://www.archive.org/private/*". Each <i class="parameter"><tt>login URI pattern</tt></i> serves as a key to an associated <span class="emphasis"><em>login record</em></span>. A <i class="parameter"><tt>login record</tt></i> has all the information necessary to negotiate a successful login, such as the HTML form content to submit -- username, password, submit button name, etc. -- and whether login requires POSTing or GETting the login form. The login record also has a <span class="emphasis"><em>ran login</em></span> flag that says whether or not the login has been run previously against this protected area.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Ran Login flag</h3><p>The <i class="parameter"><tt>ran login</tt></i> flag says whether the login has been <span class="emphasis"><em>run</em></span>, not whether or not login <span class="emphasis"><em>succeeded</em></span>. Gauging whether the login was successful is difficult. It varies with the login implementation, as already noted.</p></div><p>Also part of the login record is a <span class="emphasis"><em>login URI</em></span>.