📄 ar01s07.html
字号:
<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>7. Some notes on the URI classes</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix developer documentation"><link rel="up" href="index.html" title="Heritrix developer documentation"><link rel="prev" href="chap_modules_common.html" title="6. Common needs for all configurable modules"><link rel="next" href="ar01s08.html" title="8. Writing a Frontier"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">7. Some notes on the URI classes</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="chap_modules_common.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="ar01s08.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N10333"></a>7. Some notes on the URI classes</h2></div></div></div><p>URIs<sup>[<a href="#ftn.N10338" name="N10338">1</a>]</sup> in Heritrix are represented by several classes. The basic building block is <code class="literal">org.archive.datamodel.UURI</code> which subclasses <code class="literal">org.apache.commons.httpclient.URI</code>. "UURI" is an abbrieviation for "Usable URI." This class always normalizes and derelativizes URIs -- implying that if a UURI instance is successfully created, the represented URI will be, at least on its face, "usable" -- neither illegal nor including superficial variances which complicate later processing. It also provides methods for accessing the different parts of the URI.</p><p>We used to base all on <code class="literal">java.net.URI</code> but because of bugs and its strict RFC2396 conformance in the face of a real world that acts otherwise, its facility was was subsumed by UURI.</p><p>Two classes wrap the <code class="literal">UURI</code> in Heritrix:</p><div class="variablelist"><dl><dt><span class="term"><a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CandidateURI.html" target="_top">CandidateURI</a></span></dt><dd><p>A URI, discovered or passed-in, that may be scheduled (and thus become a CrawlURI). Contains just the fields necessary to perform quick in-scope analysis. This class wraps an UURI instance.</p></dd><dt><span class="term"><a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CrawlURI.html" target="_top">CrawlURI</a></span></dt><dd><p>Represents a candidate URI and the associated state it collects as it is crawled. The CrawlURI is an subclass of CandidateURI. It is instances of this class that is fed to the processors.</p></dd></dl></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="urischemes"></a>7.1. Supported Schemes (UnsupportedUriSchemeException)</h3></div></div></div><p>A property in <code class="literal">heritrix.properties</code> named <code class="literal">org.archive.crawler.datamodel.UURIFactory.schemes</code> lists supported schemes. Any scheme not listed here will cause an UnsupportedUriSchemeException which will be reported in uri-errors.log with a 'Unsupported scheme' prefix. If you add a fetcher for a scheme, you'll need to add to this list of supported schemes (Later, Heritrix can ask its fetchers what schemes are supported).</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N1037A"></a>7.2. The CrawlURI's Attribute list</h3></div></div></div><p>The CrawlURI offers a flexible attribute list which is used to keep arbitrary information about the URI while crawling. If you are going to write a processor you almost certainly will use the attribute list. The attribute list is a key/value-pair list accessed by typed accessors/setters. By convention the key values are picked from the <a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CoreAttributeConstants.html" target="_top">CoreAttributeConstants</a> interface which all processors implement. If you use other keys than those listed in this interface, then you must add the handling of that attribute to some other module as well.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10383"></a>7.3. The recorder streams</h3></div></div></div><p>The crawler wouldn't be of much use if it did not give access to the HTTP request and response. For this purpose the CrawlURI has the <a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CrawlURI.html#getHttpRecorder()" target="_top">getHttpRecorder()</a> <sup>[<a href="#ftn.N1038C" name="N1038C">2</a>]</sup> method. The recorder is referenced by the CrawlURI and available to all the processors. Se <a href="ar01s11.html#writing_a_processor">Section 11, “Writing a Processor”</a> for an explanation on how to use the recorder.</p></div><div class="footnotes"><br><hr align="left" width="100"><div class="footnote"><p><sup>[<a href="#N10338" name="ftn.N10338">1</a>] </sup>URI (Uniform Resource Identifiers) defined by <a href="http://www.ietf.org/rfc/rfc2396.txt" target="_top">RFC 2396</a> is a way of identifying resources. The relationship between URI, URL and URN is described in the RFC: [<span class="citation">"A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable."</span>] Although Heritrix uses URIs, only URLs are supported at the moment. For more information on naming and addressing of resources see: <a href="http://www.w3.org/Addressing/" target="_top">Naming and Addressing: URIs, URLs, ...</a> on w3.org's website.</p></div><div class="footnote"><p><sup>[<a href="#N1038C" name="ftn.N1038C">2</a>] </sup>This method will most likely change name see <a href="refactor_HTTPRecorder.html" title="1. The org.archive.util.HTTPRecorder class">Section 1, “The org.archive.util.HTTPRecorder class”</a> for details.</p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="chap_modules_common.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="ar01s08.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">6. Common needs for all configurable modules </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 8. Writing a Frontier</td></tr></table></div></body></html>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -