<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Heritrix Negotiation of Authentication Schemes</title><meta content="DocBook XSL Stylesheets V1.61.3" name="generator"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="article" lang="en" id="N10001"><div class="titlepage"><div><div><h2 class="title"><a name="N10001"></a>Heritrix Negotiation of Authentication Schemes</h2></div><div><h3 class="subtitle"><i>A Proposal to address RFE <a href="https://sourceforge.net/tracker/index.php?func=detail&aid=914301&group_id=73833&atid=539102" target="_top">[  914301 ] Logging in (HTTP POST, Basic Auth, etc.)</a></i></h3></div><div><div class="author"><h3 class="author"><span class="firstname">Michael</span> <span class="surname">Stack</span></h3><div class="affiliation"><span class="orgname">Internet Archive<br></span></div></div></div></div><div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt>1. <a href="#N1001B">Introduction</a></dt><dd><dl><dt>1.1. <a href="#N10024">Scope</a></dt><dt>1.2. <a href="#N10037">Assumptions</a></dt></dl></dd><dt>2. <a href="#schemes">Authentication Schemes</a></dt><dd><dl><dt>2.1. <a href="#basicdesc">Basic and Digest Access Authentication </a></dt><dt>2.2. <a href="#postdesc">HTTP POST and GET of Authentication Credentials</a></dt><dt>2.3. <a href="#clientcertdesc">X509 Client Certificates</a></dt><dt>2.4. <a href="#ntlmdesc">NTLM </a></dt></dl></dd><dt>3. <a href="#N100CD">Proposal</a></dt><dd><dl><dt>3.1. <a href="#N100ED">Basic and Digest Access Authentication </a></dt><dt>3.2. <a href="#N1016B">HTTP POST and GET of Authentication Credentials</a></dt><dt>3.3. <a href="#commonage">Commonage</a></dt></dl></dd><dt>4. <a href="#N10271">Design</a></dt><dd><dl><dt>4.1. <a href="#N10274">Configuration</a></dt><dt>4.2. <a href="#N10279">Credential store</a></dt></dl></dd><dt>5. <a href="#N10287">Future</a></dt><dd><dl><dt>5.1. 
<a href="#N1028C">Same URL different Page Content</a></dt><dt>5.2. <a href="#N10291">Integration with the UI</a></dt></dl></dd><dt><a href="#N10296">Bibliography</a></dt></dl></div><div class="abstract"><p class="title"><b>Abstract</b></p><p>Description of common web authentication schemes. Description of the    problem of volunteering credentials at the appropriate juncture. Proposal for    navigating HTTP POST login and Basic Auth when Heritrix has been    supplied credentials ahead of the authorization challenge.</p></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N1001B"></a>1.&nbsp;Introduction</h2></div></div><div></div></div><p>This document is divided into two parts. The first part discusses    common web authentication schemes, eliminating the less common. The second    part outlines Heritrix negotiation of HTML login forms and Basic/Digest    Auth authentication schemes. At the end is a list of items to consider    for future versions of the authentication system.</p><p>The intent of this document is to solicit feedback in advance of    implementation.</p><p>The rest of this introduction is given over to scope and assumptions    made in this document.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10024"></a>1.1.&nbsp;Scope</h3></div></div><div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10027"></a>1.1.1.&nbsp;Delivery timeline</h4></div></div><div></div></div><p>Delivery on the proposal is to be parcelled out over Heritrix        versions. 
A first cut at Heritrix form-based POST/GET authentication        is to be included in version 1.0 (end of April, 2004).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1002C"></a>1.1.2.&nbsp;Common web authentication schemes only</h4></div></div><div></div></div><p>This proposal is for the common web authentication schemes only,        e.g. HTTP POST to an HTML form, and Basic and Digest Auth. This        proposal does not cover the Heritrix crawler authenticating against an        LDAP server, PAM, getting tickets from a Kerberos server, negotiating        single sign-ons, etc.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="connbased"></a>1.1.3.&nbsp;Connection-based authentication schemes</h4></div></div><div></div></div><p>Connection-based authentication schemes are outside the scope of        this proposal. They are antithetical to the current Heritrix mode of        operation. Consideration of connection-based authentication schemes is        postponed until Heritrix moves beyond the HTTP/1.0 behavior of opening        a new connection per request.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10037"></a>1.2.&nbsp;Assumptions</h3></div></div><div></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1003A"></a>1.2.1.&nbsp;Heritrix has been granted necessary authentication        credentials</h4></div></div><div></div></div><p>The assumption is that Heritrix has been granted legitimate access        to the site we're trying to log into ahead of the login attempt; that        the site owners have given permission and the login/password        combination and/or certificates necessary to gain access.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="procchainassumption"></a>1.2.2.&nbsp;Heritrix URI 
processing chain</h4></div></div><div></div></div><p>The assumption is that this proposal integrates with the Heritrix URI        processing chains model [<span class="citation">See <a href="http://crawler.archive.org/user.html" target="_top">URI Processing        Chains</a> </span>] rather than going to an authentication        framework such as <a href="#jaas" target="_top">JAAS</a> and encapsulating the        complete authentication dialog within a JAAS LoginModule plugin, with        a plugin per authentication scheme supported. On the one hand, the        Heritrix URI processing chain lends itself naturally to the processing        of the common web authentication mechanisms with its core notions of        HTML fetching and extracting, and besides, the authentication dialog        will likely have links to harvest. On the other hand, authentication        will be spread about the application.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10050"></a>1.2.3.&nbsp;No means of recording credentials used authenticating in an        ARC</h4></div></div><div></div></div><p>There is currently no means of recording in an ARC file the        credentials used to get to pages (if we recorded the request, we'd        have some hope of archiving them).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10055"></a>1.2.4.&nbsp;Credentials store does not need to be secured</h4></div></div><div></div></div><p>The assumption is that Heritrix does not need to secure the store in        which we keep credentials to offer up during authentications; the        credentials store does not need to be saved on disk encrypted and        password protected.</p></div></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="schemes"></a>2.&nbsp;Authentication Schemes</h2></div></div><div></div></div><p>This section discusses common web 
authentication schemes and, where    applicable, practical issues in navigating the schemes' requirements. The    first two described, <a href="#basicdesc" title="2.1.&nbsp;Basic and Digest Access Authentication ">Section&nbsp;2.1, &ldquo;Basic and Digest Access Authentication &rdquo;</a> and <a href="#postdesc" title="2.2.&nbsp;HTTP POST and GET of Authentication Credentials">Section&nbsp;2.2, &ldquo;HTTP POST and GET of Authentication Credentials&rdquo;</a>, are assumed to be the most commonly used.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="basicdesc"></a>2.1.&nbsp;Basic and Digest Access Authentication [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</h3></div></div><div></div></div><p>The server returns an HTTP response code of <tt class="constant">401      Unauthorized</tt> or <tt class="constant">407 Proxy Authentication      Required</tt> when it requires authentication of the client.</p><div class="blockquote"><blockquote class="blockquote"><p>The realm directive (case-insensitive) is required for all        authentication schemes that issue a challenge. The realm value        (case-sensitive), in combination with the canonical root URL...of the        server being accessed, defines the protection space. [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</p></blockquote></div><p>The canonical root URL is discussed in this message, <a href="http://cert.uni-stuttgart.de/archive/bugtraq/1999/08/msg00380.html" target="_top">Re:      IE and cached passwords</a>. It is scheme + hostname + port only. Path      and query string have been stripped. 
Effectively, it equates to scheme +      <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URI.html" target="_top">URI      authority.</a></p><div class="blockquote"><blockquote class="blockquote"><p>A client SHOULD assume that all paths at or deeper than the        depth of the last symbolic element in the path field of the        Request-URI also are within the protection space specified by the        Basic realm value of the current challenge. A client MAY preemptively        send the corresponding Authorization header with requests for        resources in that space without receipt of another challenge from the        server. [<a href="#rfc2617" title="[rfc2617]">rfc2617</a>]</p></blockquote></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="postdesc"></a>2.2.&nbsp;HTTP POST and GET of Authentication Credentials</h3></div></div><div></div></div><p>Generally, this scheme works as follows. When an unauthenticated      client attempts to access a protected area, they are redirected by the      server to a page with an HTML login form. The client must then submit      the HTML form, via HTTP POST or HTTP GET, with the client access      credentials filled in. Upon verification of the credentials by the      server, the client is given access. So that the client does not need to      pass credentials on all subsequent accesses to the protected areas of      the site, the server usually marks the client in one of two ways: it      writes a special, usually time- and scope-limited, token, or "cookie",      back to the client, which the client volunteers on all subsequent      accesses, or it serves pages whose embedded URLs have been rewritten to      include a special token. The tokens are examined by the server on each      subsequent access for validity, and access continues while the token      remains valid.</p><p>There is no standard for how this dialogue is supposed to proceed.      
Myriad are the implementations of this basic scheme. Below is a listing      of common difficulties:</p><div class="itemizedlist"><ul type="disc"><li><p>Form field names vary.</p></li><li><p>The means by which an unsuccessful login is reported to the          client varies. A client can be redirected to a new failed-login          page, or the original login page is redrawn with a banner message          reporting the failed login.</p></li><li><p>Following on from the previous point, should a solution POST          authentication and then do all that is necessary to ensure a          successful login -- i.e. follow redirects, regex over the result          page to ensure it says "successful login", etc. -- or should a          solution do nought but POST and then hand whatever page results to          the Heritrix URI processing chain, whether successful or not?</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Processing of form success page?</h3><p>The result page should probably be let through. It may have            valuable links on board. The alternative would necessitate our            running an out-of-band subset of the Heritrix URI processing            chain, POSTing/GETting authentication and running extractors to            verify the result of the login attempt. This mini authentication            chain could be kept tidily encapsulated within a login module --            see <a href="#procchainassumption" title="1.2.2.&nbsp;Heritrix URI processing chain">Section&nbsp;1.2.2, &ldquo;Heritrix URI processing chain&rdquo;</a> -- but the awkward part would be how to            transfer items such as cookies from the mini chain over to the            main URI processing chain.</p></div></li><li><p>The aforementioned differing ways in which the server parks          a validated token in the client.</p></li><li><p>What if the login attempt fails? Should we retry? For how long?          
Does this mean maintaining state across URI processing?</p></li><li><p>Should there be tools to help an operator develop Heritrix          authentication configuration? Should a tool be developed that runs          the login outside of the Heritrix context to make things easier for          an operator developing the authentication configuration?</p></li></ul></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="clientcertdesc"></a>2.3.&nbsp;X509 Client Certificates</h3></div></div><div></div></div><p>To gain access, the client must volunteer a trusted certificate      when setting up an SSL connection to the server. Upon receipt, the      server tests that the client is entitled to access.</p><p>It's probably rare that client certificates alone will be used as      access protection. More likely, certificates will be used in combination      with one of the above listed schemes.</p><p>The certificate the client is to volunteer needs to be in a local      TrustStore available to the Heritrix TrustManager making the SSL      connection (Heritrix already maintains its own keystore of certificates
