to use verifying server proffered certs).</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Testing</h3><p>Test to see if certificates are volunteered even when we're running in open trust mode. Test to see how hard it is to append a host-particular keystore to the general Heritrix keystore at runtime.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="ntlmdesc"></a>2.4. NTLM [<a href="#ntlm" title="[ntlm]"><span class="abbrev">ntlm</span></a>]</h3></div></div><div></div></div><div class="blockquote"><blockquote class="blockquote"><p>NTLM is...a proprietary protocol designed by Microsoft with no publicly available specification. Early version of NTLM were less secure than Digest authentication due to faults in the design, however these were fixed in a service pack for Windows NT 4 and the protocol is now considered more secure than Digest authentication... There are some significant differences in the way that NTLM works compared with basic and digest authentication...NTLM authenticates a connection and not a request, so you need to authenticate every time a new connection is made and keeping the connection open during authentication is vital. Due to this, NTLM cannot be used to authenticate with both a proxy and the server, nor can NTLM be used with HTTP 1.0 connections or servers that do not support HTTP keep-alives. [<a href="#httpclient" title="[httpclient]"><span class="abbrev">httpclient</span></a>]</p></blockquote></div><p>NTLM is put outside the scope of this proposal because its nature is antithetical to how Heritrix works: it authenticates the connection, not a session [<span class="citation">Also see <a href="#connbased" title="1.1.3. Connection-based authentication schemes">Section 1.1.3, “Connection-based authentication schemes”</a> </span>]. Relatedly, the NTLM implementation in httpclient is incomplete. 
NTLM will not be discussed further.</p></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N100CD"></a>3. Proposal</h2></div></div><div></div></div><p>The proposal is to put off implementation of client-side certificates in Heritrix. Rare is the case where it's needed.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Workaround?</h3><p>It should be possible to just add the client certificate to the local truststore and all would just work. Test.</p></div><p>Having cut <a href="#ntlmdesc" title="2.4. NTLM ">Section 2.4, “NTLM ”</a> and <a href="#clientcertdesc" title="2.3. X509 Client Certificates">Section 2.3, “X509 Client Certificates”</a>, we're left with <a href="#basicdesc" title="2.1. Basic and Digest Access Authentication ">Section 2.1, “Basic and Digest Access Authentication ”</a> and <a href="#postdesc" title="2.2. HTTP POST and GET of Authentication Credentials">Section 2.2, “HTTP POST and GET of Authentication Credentials”</a>, the assumed most commonly used web authentication schemes.</p><p>From the above, <a href="#schemes" title="2. Authentication Schemes">Section 2, “Authentication Schemes”</a>, it may be apparent that no one solution will work for both schemes. The discussion in the following two sections -- a section per scheme under consideration -- should bring this fact out and help identify the facility common to the two schemes, detailed later in <a href="#commonage" title="3.3. Commonage">Section 3.3, “Commonage”</a>.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N100ED"></a>3.1. 
Basic and Digest Access Authentication [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</h3></div></div><div></div></div><p>A basic implementation would, upon receipt of a 401 response status code, extract a realm from the 401 response and use <i class="parameter"><tt>realm + URI canonical root URL</tt></i> as a compound key to look up a store of Basic/Digest Auth credentials. If a match is found, the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> made for the current domain is loaded with the discovered credentials and the 401'ing current URI is marked for retry. (If no matching credentials are found, the current URI is marked failed with a 401 response code.)</p><p>Let it be a given that any rfc2617 credentials found in a <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> are always loaded into the HTTP GET request.</p><p>When our 401'ing URI comes around again for retry, since credentials were loaded the last time this URI was seen, credentials will be found in the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> and will be added to the request headers. This time around the authentication should succeed.</p><p>Any other URI that is a member of this realm will also subsequently authenticate successfully, given the above rule whereby we always load any found credentials into the current request.</p><p>Let the above be the default behavior. Configuration options would control:</p><div class="itemizedlist"><ul type="disc"><li><p>Whether this feature is enabled or disabled.</p></li><li><p><a name="preemptiveauth"></a>Pre-population of the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> with all rfc2617 credentials upon construction, thereby avoiding 401s altogether since we'd be sending all credentials in advance of any challenge (preemptive authentication). A domain might have many rfc2617 realms. 
Preemptive authentication would have us volunteering all of a domain's realms' credentials in each request.</p><p>The query of the store pre-populating the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> would use the <i class="parameter"><tt>URI canonical root URL</tt></i> as its key.</p><p>This configuration could be set globally for all Heritrix requests or per <i class="parameter"><tt>URI canonical root URL</tt></i> by setting a property on the corresponding record in the store.</p></li><li><p>Upon receipt of a 401 and on successfully locating appropriate credentials in the store (or already loaded in the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span>), configuration could enable immediately retrying the request rather than letting the 401 percolate down through the Heritrix processing chain and back up out of the Frontier. (Enabling this configuration would leave no trace of the 401 in the ARC.)</p></li></ul></div><p>The simplest implementation would have us always do <a href="preemptiveauth" target="_top">preemptive authentication</a>. Configuration would turn this feature on or off, and that'd be all.</p><p>Below we look in more detail at aspects of the above proposed implementation.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10131"></a>3.1.1. CrawlServer</h4></div></div><div></div></div><p>In Heritrix, the <span class="emphasis"><em>persistent domain/virtualdomain object</em></span> is <tt class="classname"><a href="http://crawler.archive.org/xref/org/archive/crawler/datamodel/CrawlServer.html" target="_top">org.archive.crawler.datamodel.CrawlServer</a></tt>. 
It is created in <a href="http://crawler.archive.org/xref/org/archive/crawler/basic/Frontier.html" target="_top">org.archive.crawler.basic.Frontier#next()</a> if no extant CrawlServer is found in the <a href="org.archive.crawler.datamodel.ServerCache" target="_top">org.archive.crawler.datamodel.ServerCache</a>. The lookup is done using a (decoded) <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html" target="_top">URI authority</a>. The currently processed URI has easy access to its corresponding CrawlServer. See <a href="http://crawler.archive.org/xref/org/archive/crawler/datamodel/CrawlURI.html" target="_top">CrawlURI#getServer()</a>.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1014E"></a>3.1.2. HTTPClient</h4></div></div><div></div></div><p>HTTPClient has built-in support for Basic, Digest and NTLM. It takes care of sending appropriate Authentication headers.</p><p>Digest Authentication generally works but has a ways to go according to the comment made on 2004-03-11 16:21 in <a href="http://issues.apache.org/bugzilla/show_bug.cgi?id=27594" target="_top">Wrong reauthentication when using DigestAuthentication</a>.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Multiple Realms</h3><p>What to do if a host has multiple realms? Will HTTPClient [<a href="#httpclient" title="[httpclient]"><span class="abbrev">httpclient</span></a>] do the right thing and offer all available credentials appropriately? Need to test.</p></div><p>The HTTPClient authentication code was just refactored extensively in HEAD -- post 2.0 release. There are reported problems authenticating via a proxy over SSL.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10162"></a>3.1.3. RFC2617 Record</h4></div></div><div></div></div><p>An RFC2617 record would be keyed by <i class="parameter"><tt>URI canonical root URL</tt></i>. 
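</p><p>The lookup described in Section 3.1 could be sketched roughly as below. This is a minimal illustration and not Heritrix code: the class <tt>Rfc2617Store</tt>, its method names, and the simplistic <tt>extractRealm</tt> parsing of a <tt>WWW-Authenticate</tt> challenge are all hypothetical, introduced only for this example. The record fields are the realm, login and password named in this section.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an rfc2617 credential store; not part of Heritrix. */
final class Rfc2617Store {
    /** One record: realm, login and password (proxy/407 records not distinguished). */
    public static final class Record {
        public final String realm;
        public final String login;
        public final String password;

        public Record(String realm, String login, String password) {
            this.realm = realm;
            this.login = login;
            this.password = password;
        }
    }

    // Outer key: URI canonical root URL. Inner key: realm.
    // A single root URL may therefore hold credentials for many realms.
    private final Map<String, Map<String, Record>> byRoot = new HashMap<>();

    public void add(String rootUrl, Record record) {
        byRoot.computeIfAbsent(rootUrl, k -> new HashMap<>()).put(record.realm, record);
    }

    /** Lookup after a 401, using the compound key realm + root URL; null if no match. */
    public Record lookup(String rootUrl, String realm) {
        Map<String, Record> realms = byRoot.get(rootUrl);
        return realms == null ? null : realms.get(realm);
    }

    /** Preemptive case: every record for a root URL, keyed by root URL alone. */
    public List<Record> allFor(String rootUrl) {
        Map<String, Record> realms = byRoot.get(rootUrl);
        return realms == null ? new ArrayList<>() : new ArrayList<>(realms.values());
    }

    /**
     * Pulls the realm out of a challenge such as: Basic realm="private".
     * Simplified assumption: ignores escaped quotes and multiple challenges.
     */
    public static String extractRealm(String challenge) {
        String marker = "realm=\"";
        int start = challenge.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = challenge.indexOf('"', start);
        return end < 0 ? null : challenge.substring(start, end);
    }
}
```

<p>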
The record would contain a realm, login and password. We'd not distinguish proxy (407) records.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N1016B"></a>3.2. HTTP POST and GET of Authentication Credentials</h3></div></div><div></div></div><p>Every URI processed by Heritrix first has preconditions checked. Example preconditions are the fetching of a domain's DNS record and its <tt class="filename">robots.txt</tt> file before proceeding to make requests against the domain. This proposal is to add a new <span class="emphasis"><em>login precondition</em></span> after the fashion of the robots and DNS preconditions -- See <a href="org.archive.crawler.prefetch.PreconditionEnforcer" target="_top">org.archive.crawler.prefetch.PreconditionEnforcer</a> -- and a facility for having our HTTP fetcher run a configurable one-time login.</p><p>The new <i class="parameter"><tt>login precondition</tt></i> will test the current URI against a preloaded list of <span class="emphasis"><em>login URI patterns</em></span>. Each <i class="parameter"><tt>login URI pattern </tt></i>describes a protected area of a domain (or virtualdomain): e.g. "http://www.archive.org/private/*". Each <i class="parameter"><tt>login URI pattern</tt></i> serves as a key to an associated <span class="emphasis"><em>login record</em></span>. A <i class="parameter"><tt>login record</tt></i> has all information necessary for negotiation of a successful login, such as the HTML form content to submit -- username, password, submit button name, etc. -- and whether login requires POSTing or GETting the login form. 
The login record also has a <span class="emphasis"><em>ran login</em></span> flag that says whether or not the login has been run previously against this protected area.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Ran Login flag</h3><p>The <i class="parameter"><tt>ran login</tt></i> flag says whether the login has been <span class="emphasis"><em>run</em></span>, not whether or not login <span class="emphasis"><em>succeeded</em></span>. Gauging whether the login was successful or not is difficult. It varies with the login implementation as already noted.</p></div><p>Also part of the login record is a <span class="emphasis"><em>login URI</em></span>.
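</p><p>The login precondition described above might be sketched as follows. This is only an illustration under stated assumptions, not Heritrix code: the class and method names are hypothetical, and the pattern syntax is assumed to be a literal prefix followed by a trailing <tt>*</tt>, as in the example pattern given earlier.</p>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the login precondition; not part of Heritrix. */
final class LoginPrecondition {
    /** Login record: login URI, HTTP method, form content and the ran-login flag. */
    public static final class LoginRecord {
        public final String loginUri;
        public final boolean usePost; // true: POST the form, false: GET it
        public final Map<String, String> formFields;
        // Says the login was run, not that it succeeded.
        public boolean ranLogin = false;

        public LoginRecord(String loginUri, boolean usePost, Map<String, String> formFields) {
            this.loginUri = loginUri;
            this.usePost = usePost;
            this.formFields = formFields;
        }
    }

    // Each login URI pattern is the key to its login record.
    private final Map<String, LoginRecord> recordsByPattern = new LinkedHashMap<>();

    public void register(String pattern, LoginRecord record) {
        recordsByPattern.put(pattern, record);
    }

    /** Assumed syntax: a trailing '*' means prefix match, otherwise exact match. */
    public static boolean matches(String pattern, String uri) {
        if (pattern.endsWith("*")) {
            return uri.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return uri.equals(pattern);
    }

    /**
     * Returns the record whose login still has to be run before the URI is
     * fetched, or null if the precondition is already met.
     */
    public LoginRecord pendingLogin(String uri) {
        for (Map.Entry<String, LoginRecord> entry : recordsByPattern.entrySet()) {
            if (matches(entry.getKey(), uri) && !entry.getValue().ranLogin) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

<p>In this sketch the caller would run the login against the record's login URI and then set <tt>ranLogin</tt> to true, after which <tt>pendingLogin</tt> returns null for every URI in that protected area, whether or not the login actually succeeded.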