The <i class="parameter"><tt>login URI</tt></i> is the login page whose successful navigation gives access to the protected space: e.g. if the pattern we tested against was "http://www.archive.org/private/*", the <i class="parameter"><tt>login URI</tt></i> might be "http://www.archive.org/private/login.html".</p><p>If the current URI matches one of the patterns in the <i class="parameter"><tt>login URI pattern</tt></i> list, we pull the matched pattern's associated <i class="parameter"><tt>login record</tt></i>. If the <i class="parameter"><tt>ran login</tt></i> flag has not been set, the <i class="parameter"><tt>login URI</tt></i> is <span class="emphasis"><em>force</em></span> queued. It is force queued in case the URI has already been seen (GET'd). The <i class="parameter"><tt>login URI</tt></i> (somehow) has the <i class="parameter"><tt>login record</tt></i> associated with it; the presence of the <i class="parameter"><tt>login record</tt></i> distinguishes the <i class="parameter"><tt>login URI</tt></i> from an ordinary URI. The current URI is then requeued (precondition not met). Otherwise the current URI is processed as normal.</p><p>When the <i class="parameter"><tt>login URI</tt></i> becomes the current URI and is being processed by the HTTP fetcher, the presence of a <i class="parameter"><tt>login record</tt></i> whose <i class="parameter"><tt>ran login</tt></i> flag is false signals the HTTP fetcher to run the special login sequence rather than do its usual GET. The <i class="parameter"><tt>login record</tt></i> has everything the HTTP fetcher needs to execute the login. Upon completion, the <i class="parameter"><tt>ran login</tt></i> flag is set in the <i class="parameter"><tt>login record</tt></i> and the <i class="parameter"><tt>login record</tt></i> is removed from the <i class="parameter"><tt>login URI</tt></i>.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">GET of the login URI</h3><p>What if we haven't already seen the login page? 
Should the login precondition first force fetch the login URI without the login record loaded, so that it is first GET'd before we run a login?</p></div><p>This implementation cannot guarantee a successful login, nor is there provision for retries. The general notion is that a single run of the login succeeds and that the resulting success cookie or rewritten URI makes it back to the Heritrix client, gaining us access to the protected area.</p><p>Configuration would enable or disable this feature.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10202"></a>3.2.1. Login Record</h4></div></div><div></div></div><p>A login record would be keyed by the pattern it applies to and would contain the aforementioned <i class="parameter"><tt>ran login</tt></i> flag and <i class="parameter"><tt>login URI</tt></i>. Tied to the login URI would be a list of key-value pairs holding the login form content, as well as a specification of whether the form is to be submitted by POST or by GET.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="commonage"></a>3.3. Commonage</h3></div></div><div></div></div><p>Here we discuss features common to the two authentication scheme implementations above.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10215"></a>3.3.1. URI#authority as URI canonical root URL</h4></div></div><div></div></div><p>The proposal is to equate the two. Doing so means there is no need to change CrawlServer. Currently the CrawlServer is constructed wrapping the URI#authority portion of a URI. URI#authority is the <i class="parameter"><tt>URI canonical root URL</tt></i> absent the scheme. 
Assuming CrawlServer is for http only, it should be safe to make this equation.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">DNS</h3><p>Are there CrawlServer instances made for anything but http schemes?</p></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">HTTPS</h3><p>Check that <i class="parameter"><tt>URI canonical root URL</tt></i>s of <tt class="filename">http://www.example.com</tt> and <tt class="filename">https://www.example.com</tt> result in different <tt class="classname">CrawlServer</tt> instances.</p></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10237"></a>3.3.2. Population of Domain/VirtualDomain object with Credentials</h4></div></div><div></div></div><p>The proposal is that CrawlServer encapsulate access to the credential store and that it read the store upon construction.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1023C"></a>3.3.3. Caching of Credentials</h4></div></div><div></div></div><p>Once the credentials are read from the store, we need to cache them in CrawlServer.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10241"></a>3.3.3.1. JAAS Subject, Principal and Credentials [<a href="#jaas" title="[jaas]"><span class="abbrev">jaas</span></a>]</h5></div></div><div></div></div><p>The proposal is that we at least look at selectively exploiting this library for caching credentials. For example, a CrawlServer might hold a javax.security.auth.Subject (Subject is a final class in JAAS, so it would be held rather than implemented). To this Subject, we'd add Principal implementations and credential objects (this makes sense for carrying RFC2617 credentials, less so for login credentials; TBD).</p></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="store"></a>3.3.4. 
Credential Stores</h4></div></div><div></div></div><p>The credential store would be on disk.</p><p>For convenience, particularly when listing credentials in a global file store, credentials can be grouped first by host (the base domain, i.e. the domain minus port #) and then by URI#authority (the domain plus any port #).</p><p>Configuration would allow us to point at a global store of credentials.</p><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10252"></a>3.3.4.1. Layering of Credential Stores</h5></div></div><div></div></div><p>Subsequently, we'd add support for <span class="emphasis"><em>layering</em></span> stores. The model is Apache's <tt class="filename">.htaccess</tt> mechanism for selectively overriding the main server configuration at directory scope or, closer to home, the way Heritrix settings can be overridden on a per-host basis. It would be possible to point the store querying code at a directory whose subdirectories are named for domains, progressing from a root down through the macro level (org, com, gov, etc.) to progressively more precise subdomains: e.g. travel.yahoo.com would be found under the yahoo.com directory, which would be under the com directory. When searching for credentials, we'd search up through the directory structure, going from the current domain on up to the root, looking for the <i class="parameter"><tt>realm + canonical root URL</tt></i> key. If the key is not found in the domain store, or if a domain store does not exist, we'd back up the settings hierarchy until we hit the global store.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10262"></a>3.3.4.2. 
Exploit the settings framework to implement the credential store</h5></div></div><div></div></div><p>We propose extending or adapting the Heritrix settings framework to manage our credential store, so that we can exploit code already written.</p></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10267"></a>3.3.5. Logging</h4></div></div><div></div></div><p>A new log will trace authentication transactions. The log will list the credentials offered, new cookies, query parameters, and pertinent HTTP headers returned by the submitted authentication and, where possible, will report whether authentication succeeded.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1026C"></a>3.3.6. Debugging tool</h4></div></div><div></div></div><p>A command-line tool that runs a single login will aid development and the debugging of logins, and will be of use to operators.</p></div></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N10271"></a>4. Design</h2></div></div><div></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10274"></a>4.1. Configuration</h3></div></div><div></div></div><p>We will add options to the HTTP Fetcher that enable, disable, and configure the two supported authentication types.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10279"></a>4.2. Credential store</h3></div></div><div></div></div><p>Below is a static class model diagram for accessing the credential store.</p><div class="mediaobject"><img src="credentials.gif"></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Implementation looks nothing like the above</h3><p>Ignore the above design. The implementation turned out to be something else altogether. 
The model was effectively inverted (credentials hold domains), and the notion of going through a CredentialManager/CredentialStore for all operations on the store was removed. While the resultant implementation is not a good OOM, it's amenable to UI manipulation (and sits easily atop the Heritrix settings system).</p></div></div></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N10287"></a>5. Future</h2></div></div><div></div></div><p>This section lists issues to be addressed later, probably in a version 2.0 of the authentication system.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N1028C"></a>5.1. Same URL different Page Content</h3></div></div><div></div></div><p>Heritrix distinguishes pages by URI, but the page seen at a given URI can differ depending on whether we are logged in. We'll need some way to force or suggest that sets of URIs be revisited after a login token is received. This might mean the 'fingerprint' of a URI includes any authentication information to be used.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10291"></a>5.2. Integration with the UI</h3></div></div><div></div></div><p>Support adding, editing, and deleting credentials via the UI, and flag 401s and likely HTML login forms to the operator.</p></div></div><div class="bibliography" id="N10296"><div class="titlepage"><div><div><h2 class="title"><a name="N10296"></a>Bibliography</h2></div></div><div></div></div><div class="biblioentry"><a name="heritrix"></a><p>[<span class="abbrev">heritrix</span>] <span class="title"><i><a href="http://crawler.archive.org" target="_top">Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project</a></i>. 
</span></p></div><div class="biblioentry"><a name="httpclient"></a><p>[<span class="abbrev">httpclient</span>] <span class="title"><i>Apache Jakarta Commons HTTPClient <a href="http://jakarta.apache.org/commons/httpclient/authentication.html" target="_top">Authentication Guide</a></i>. </span><span class="edition">Commons HTTPClient version 2.0. </span></p></div><div class="biblioentry"><a name="jaas"></a><p>[<span class="abbrev">jaas</span>] <span class="title"><i><a href="http://java.sun.com/products/jaas/index.jsp" target="_top">Java Authentication and Authorization Service (JAAS)</a></i>. </span></p></div><div class="biblioentry"><a name="ntlm"></a><p>[<span class="abbrev">ntlm</span>] <span class="title"><i>The <a href="http://davenport.sourceforge.net/ntlm.html" target="_top">NTLM Authentication Protocol</a></i>. </span></p></div><div class="biblioentry"><a name="rfc2617"></a><p>[<span class="abbrev">rfc2617</span>] <span class="title"><i>RFC2617 <a href="http://ftp.ics.uci.edu/pub/ietf/http/rfc2617.txt" target="_top">HTTP Authentication: Basic and Digest Access Authentication</a></i>. </span></p></div></div></div></body></html>