<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Heritrix Negotiation of Authentication Schemes</title><meta content="DocBook XSL Stylesheets V1.61.3" name="generator"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="article" lang="en" id="N10001"><div class="titlepage"><div><div><h2 class="title"><a name="N10001"></a>Heritrix Negotiation of Authentication Schemes</h2></div><div><h3 class="subtitle"><i>A Proposal to address RFE <a href="https://sourceforge.net/tracker/index.php?func=detail&aid=914301&group_id=73833&atid=539102" target="_top">[ 914301 ] Logging in (HTTP POST, Basic Auth, etc.)</a></i></h3></div><div><div class="author"><h3 class="author"><span class="firstname">Michael</span> <span class="surname">Stack</span></h3><div class="affiliation"><span class="orgname">Internet Archive<br></span></div></div></div></div><div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt>1. <a href="#N1001B">Introduction</a></dt><dd><dl><dt>1.1. <a href="#N10024">Scope</a></dt><dt>1.2. <a href="#N10037">Assumptions</a></dt></dl></dd><dt>2. <a href="#schemes">Authentication Schemes</a></dt><dd><dl><dt>2.1. <a href="#basicdesc">Basic and Digest Access Authentication </a></dt><dt>2.2. <a href="#postdesc">HTTP POST and GET of Authentication Credentials</a></dt><dt>2.3. <a href="#clientcertdesc">X509 Client Certificates</a></dt><dt>2.4. <a href="#ntlmdesc">NTLM </a></dt></dl></dd><dt>3. <a href="#N100CD">Proposal</a></dt><dd><dl><dt>3.1. <a href="#N100ED">Basic and Digest Access Authentication </a></dt><dt>3.2. <a href="#N1016B">HTTP POST and GET of Authentication Credentials</a></dt><dt>3.3. <a href="#commonage">Commonage</a></dt></dl></dd><dt>4. <a href="#N10271">Design</a></dt><dd><dl><dt>4.1. <a href="#N10274">Configuration</a></dt><dt>4.2. <a href="#N10279">Credential store</a></dt></dl></dd><dt>5. <a href="#N10287">Future</a></dt><dd><dl><dt>5.1. 
<a href="#N1028C">Same URL different Page Content</a></dt><dt>5.2. <a href="#N10291">Integration with the UI</a></dt></dl></dd><dt><a href="#N10296">Bibliography</a></dt></dl></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Description of common web authentication schemes. Description of the problem of volunteering credentials at the appropriate juncture. Proposal for navigating HTTP POST login and Basic Auth for when Heritrix has been supplied credentials ahead of the authorization challenge.</p></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N1001B"></a>1. Introduction</h2></div></div><div></div></div><p>This document is divided into two parts. The first part discusses common web authentication schemes, eliminating the less common. The second part outlines Heritrix negotiation of HTML login forms and Basic/Digest Auth authentication schemes. At the end is a list of items to consider for future versions of the authentication system.</p><p>The intent of this document is to solicit feedback in advance of implementation.</p><p>The rest of this introduction is given over to scope and assumptions made in this document.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10024"></a>1.1. Scope</h3></div></div><div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10027"></a>1.1.1. Delivery timeline</h4></div></div><div></div></div><p>Delivery on the proposal is to be parcelled out over Heritrix versions. A first cut at Heritrix form-based POST/GET authentication is to be included in version 1.0 (end of April 2004).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1002C"></a>1.1.2. Common web authentication schemes only</h4></div></div><div></div></div><p>This proposal is for the common web authentication schemes only: e.g. 
HTTP POST to an HTML form, and Basic and Digest Auth. This proposal does not cover the Heritrix crawler authenticating against an LDAP server, PAM, getting tickets from a Kerberos server, negotiating single sign-ons, etc.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="connbased"></a>1.1.3. Connection-based authentication schemes</h4></div></div><div></div></div><p>Connection-based authentication schemes are outside the scope of this proposal. They are antithetical to the current Heritrix mode of operation. Consideration of connection-based authentication schemes is postponed until Heritrix moves beyond its HTTP/1.0 behavior of getting a new connection per request.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10037"></a>1.2. Assumptions</h3></div></div><div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1003A"></a>1.2.1. Heritrix has been granted necessary authentication credentials</h4></div></div><div></div></div><p>The assumption is that Heritrix has been granted legitimate access to the site we're trying to log into ahead of the login attempt; that the site owners have given permission and supplied the login/password combination and/or certificates needed to gain access.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="procchainassumption"></a>1.2.2. Heritrix URI processing chain</h4></div></div><div></div></div><p>The assumption is that this proposal integrates with the Heritrix URI processing chains model [<span class="citation">See <a href="http://crawler.archive.org/user.html" target="_top">URI Processing Chains</a> </span>] rather than turning to an authentication framework such as <a href="#jaas" target="_top">JAAS</a> and encapsulating the complete authentication dialog within a JAAS LoginModule plugin, with a plugin per authentication scheme supported. 
On the one hand, the Heritrix URI processing chain lends itself naturally to the processing of the common web authentication mechanisms with its core notions of HTML fetching and extracting, and besides, the authentication dialog will likely have links to harvest. On the other hand, authentication will be spread about the application.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10050"></a>1.2.3. No means of recording credentials used authenticating in an ARC</h4></div></div><div></div></div><p>There is no means currently for recording in an arc file the credentials used getting to pages (If we recorded the request, we'd have some hope of archiving them).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10055"></a>1.2.4. Credentials store does not need to be secured</h4></div></div><div></div></div><p>Assumption is that Heritrix does not need to secure the store in which we keep credentials to offer up during authentications; the credentials store does not need to be saved on disk encrypted and password protected.</p></div></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="schemes"></a>2. Authentication Schemes</h2></div></div><div></div></div><p>This section discusses common web authentication schemes and where applicable, practical issues navigating the schemes' requirements. The first two described, <a href="#basicdesc" title="2.1. Basic and Digest Access Authentication ">Section 2.1, “Basic and Digest Access Authentication ”</a> and <a href="#postdesc" title="2.2. HTTP POST and GET of Authentication Credentials">Section 2.2, “HTTP POST and GET of Authentication Credentials”</a>, are assumed most commonly used.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="basicdesc"></a>2.1. 
Basic and Digest Access Authentication [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</h3></div></div><div></div></div><p>The server returns an HTTP response code of <tt class="constant">401 Unauthorized</tt> or <tt class="constant">407 Proxy Authentication Required</tt> when it requires authentication of the client.</p><div class="blockquote"><blockquote class="blockquote"><p>The realm directive (case-insensitive) is required for all authentication schemes that issue a challenge. The realm value (case-sensitive), in combination with the canonical root URL...of the server being accessed, defines the protection space. [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</p></blockquote></div><p>The canonical root URL is discussed in this message, <a href="http://cert.uni-stuttgart.de/archive/bugtraq/1999/08/msg00380.html" target="_top">Re: IE and cached passwords</a>. It is the scheme + hostname + port only; the path and query string are stripped. Effectively, it equates to scheme + <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html" target="_top">URI authority.</a></p><div class="blockquote"><blockquote class="blockquote"><p>A client SHOULD assume that all paths at or deeper than the depth of the last symbolic element in the path field of the Request-URI also are within the protection space specified by the Basic realm value of the current challenge. A client MAY preemptively send the corresponding Authorization header with requests for resources in that space without receipt of another challenge from the server. [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</p></blockquote></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="postdesc"></a>2.2. HTTP POST and GET of Authentication Credentials</h3></div></div><div></div></div><p>Generally, this scheme works as follows. When an unauthenticated client attempts to access a protected area, it is redirected by the server to a page with an HTML login form. 
The client must then submit the HTML form, by HTTP POST or HTTP GET, with the client's access credentials filled in. Upon verification of the credentials by the server, the client is given access. So that the client does not need to pass credentials on every subsequent access to the protected areas of the site, the server usually marks the client in one of two ways: it writes a special, usually time- and scope-limited, token, or "cookie", back to the client, which the client volunteers on all subsequent accesses, or it serves pages whose embedded URLs have been rewritten to include a special token. The server examines the token for validity on each subsequent access, and access continues while the token remains valid.</p><p>There is no standard for how this dialogue is supposed to proceed. Myriad are the implementations of this basic scheme. Below is a listing of common difficulties:</p><div class="itemizedlist"><ul type="disc"><li><p>Form field names vary.</p></li><li><p>The means by which an unsuccessful login is reported to the client varies. A client can be redirected to a new failed-login page, or the original login page is redrawn with a banner message reporting the failed login.</p></li><li><p>Following on from the previous point, should a solution POST authentication and then do all that is necessary to ensure a successful login -- i.e. follow redirects, regex over the result page to ensure it says "successful login", etc. -- or should a solution do nought but POST and then hand whatever page results to the Heritrix URI processing chain, whether the login succeeded or not?</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Processing of form success page?</h3><p>The result page should probably be let through. It may have valuable links on board. 
The alternative would necessitate our running an out-of-band subset of the Heritrix URI processing chain that POSTs/GETs the authentication and runs extractors to verify the result of the login attempt. This mini authentication chain could be kept tidily encapsulated within a login module -- see <a href="#procchainassumption" title="1.2.2. Heritrix URI processing chain">Section 1.2.2, “Heritrix URI processing chain”</a> -- but transferring such things as cookies from the mini chain over to the main URI processing chain would be ugly.</p></div></li><li><p>The aforementioned differing ways in which the server parks a validated token in the client.</p></li><li><p>What if the login attempt fails? Should we retry? For how long? Does this mean maintaining state across URI processing?</p></li><li><p>Should there be tools to help an operator develop Heritrix authentication configuration? Should a tool be developed that runs the login outside of the Heritrix context to make it easier on the operator developing the authentication configuration?</p></li></ul></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="clientcertdesc"></a>2.3. X509 Client Certificates</h3></div></div><div></div></div><p>To gain access, the client must volunteer a trusted certificate when setting up an SSL connection to the server. Upon receipt, the server tests whether the client is entitled to access.</p><p>It is probably rare that client certificates alone will be used as access protection. More likely, certificates will be used in combination with one of the above-listed schemes.</p><p>The certificate the client is to volunteer needs to be in a local TrustStore available to the Heritrix TrustManager making the SSL connection (Heritrix already maintains its own keystore of certificates