<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>7.&nbsp;Some notes on the URI classes</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix developer documentation"><link rel="up" href="index.html" title="Heritrix developer documentation"><link rel="prev" href="chap_modules_common.html" title="6.&nbsp;Common needs for all configurable modules"><link rel="next" href="ar01s08.html" title="8.&nbsp;Writing a Frontier"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">7.&nbsp;Some notes on the URI classes</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="chap_modules_common.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="ar01s08.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="N10333"></a>7.&nbsp;Some notes on the URI classes</h2></div></div></div><p>URIs<sup>[<a href="#ftn.N10338" name="N10338">1</a>]</sup> in Heritrix are represented by several classes. The basic building block is <code class="literal">org.archive.datamodel.UURI</code>, which subclasses <code class="literal">org.apache.commons.httpclient.URI</code>. "UURI" is an abbreviation for "Usable URI." This class always normalizes and derelativizes URIs, so if a UURI instance is successfully created, the represented URI will be, at least on its face, "usable": neither illegal nor carrying superficial variances that complicate later processing.
It also provides methods for accessing the different parts of the URI.</p><p>We used to base everything on <code class="literal">java.net.URI</code>, but because of bugs in that class and its strict RFC 2396 conformance in the face of a real world that acts otherwise, its role was subsumed by UURI.</p><p>Two classes wrap the <code class="literal">UURI</code> in Heritrix:</p><div class="variablelist"><dl><dt><span class="term"><a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CandidateURI.html" target="_top">CandidateURI</a></span></dt><dd><p>A URI, discovered or passed in, that may be scheduled (and thus become a CrawlURI). Contains just the fields necessary to perform quick in-scope analysis. This class wraps a UURI instance.</p></dd><dt><span class="term"><a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CrawlURI.html" target="_top">CrawlURI</a></span></dt><dd><p>Represents a candidate URI and the associated state it collects as it is crawled. CrawlURI is a subclass of CandidateURI. It is instances of this class that are fed to the processors.</p></dd></dl></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="urischemes"></a>7.1.&nbsp;Supported Schemes (UnsupportedUriSchemeException)</h3></div></div></div><p>A property in <code class="literal">heritrix.properties</code> named <code class="literal">org.archive.crawler.datamodel.UURIFactory.schemes</code> lists the supported schemes. Any scheme not listed there causes an UnsupportedUriSchemeException, which is reported in uri-errors.log with an 'Unsupported scheme' prefix.
If you add a fetcher for a new scheme, you'll need to add that scheme to this list of supported schemes (in a later version, Heritrix could ask its fetchers which schemes they support).</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N1037A"></a>7.2.&nbsp;The CrawlURI's Attribute list</h3></div></div></div><p>The CrawlURI offers a flexible attribute list which is used to keep arbitrary information about the URI while crawling. If you are going to write a processor, you will almost certainly use the attribute list. The attribute list is a list of key/value pairs accessed through typed getters and setters. By convention the keys are picked from the <a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CoreAttributeConstants.html" target="_top">CoreAttributeConstants</a> interface, which all processors implement. If you use keys other than those listed in this interface, then you must add the handling of that attribute to some other module as well.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10383"></a>7.3.&nbsp;The recorder streams</h3></div></div></div><p>The crawler wouldn't be of much use if it did not give access to the HTTP request and response. For this purpose the CrawlURI has the <a href="http://crawler.archive.org/apidocs/org/archive/crawler/datamodel/CrawlURI.html#getHttpRecorder()" target="_top">getHttpRecorder()</a> <sup>[<a href="#ftn.N1038C" name="N1038C">2</a>]</sup> method. The recorder is referenced by the CrawlURI and available to all the processors.
See <a href="ar01s11.html#writing_a_processor">Section&nbsp;11, &ldquo;Writing a Processor&rdquo;</a> for an explanation of how to use the recorder.</p></div><div class="footnotes"><br><hr align="left" width="100"><div class="footnote"><p><sup>[<a href="#N10338" name="ftn.N10338">1</a>] </sup>A URI (Uniform Resource Identifier), as defined by <a href="http://www.ietf.org/rfc/rfc2396.txt" target="_top">RFC 2396</a>, is a way of identifying a resource. The relationship between URI, URL and URN is described in the RFC: [<span class="citation">"A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URI that identify resources via a representation of their primary access mechanism (e.g., their network "location"), rather than identifying the resource by name or by some other attribute(s) of that resource. The term "Uniform Resource Name" (URN) refers to the subset of URI that are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable."</span>] Although Heritrix uses URIs, only URLs are supported at the moment.
For more information on naming and addressing of resources, see <a href="http://www.w3.org/Addressing/" target="_top">Naming and Addressing: URIs, URLs, ...</a> on w3.org's website.</p></div><div class="footnote"><p><sup>[<a href="#N1038C" name="ftn.N1038C">2</a>] </sup>This method will most likely be renamed; see <a href="refactor_HTTPRecorder.html" title="1.&nbsp;The org.archive.util.HTTPRecorder class">Section&nbsp;1, &ldquo;The org.archive.util.HTTPRecorder class&rdquo;</a> for details.</p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="chap_modules_common.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="ar01s08.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">6.&nbsp;Common needs for all configurable modules&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;8.&nbsp;Writing a Frontier</td></tr></table></div></body></html>