     -7 URI recognized as unsupported or illegal
     -8 Multiple retries all failed, retry limit reached
    -50 Temporary status assigned to URIs awaiting preconditions; appearance
        in logs may be a bug
    -60 Failure status assigned to URIs which could not be queued by the
        Frontier (and may in fact be unfetchable)
    -61 Prerequisite robots.txt-fetch failed, precluding a fetch attempt
    -62 Some other prerequisite failed, precluding a fetch attempt
    -63 A prerequisite (of any type) could not be scheduled, precluding a
        fetch attempt
  -3000 Severe Java 'Error' conditions (OutOfMemoryError, StackOverflowError,
        etc.) during URI processing
  -4000 'Chaff' detection of traps/content of negligible value applied
  -4001 Too many link hops away from seed
  -4002 Too many embed/transitive hops away from last URI in scope
  -5000 Out of scope upon reexamination (only happens if scope changes during
        crawl)
  -5001 Blocked from fetch by user setting
  -5002 Blocked by a custom processor
  -5003 Blocked due to exceeding an established quota
  -5004 Blocked due to exceeding an established runtime
  -6000 Deleted from Frontier by user
  -7000 Processing thread was killed by the operator (perhaps because of a
        hung condition)
  -9998 Robots.txt rules precluded fetch</pre></p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Codes and explanations are also available under the Help link in the web UI.
</p></div><p>Please note that status codes defined by Heritrix may change between versions; in particular, new codes may be added to cover a wider range of situations.</p></dd><dt><a name="surt"></a>SURT</dt><dd><p>SURT stands for Sort-friendly URI Reordering Transform, and is a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names.</p><p>A URI &lt;scheme://domain.tld/path?query&gt; has SURT form &lt;scheme://(tld,domain,)/path?query&gt;.</p><p>Conversion to SURT form also involves making all characters lowercase, and changing the 'https' scheme to 'http'. Further, the '/' after a URI authority component -- for example, the third slash in a regular HTTP URI -- will only appear in the SURT form if it appeared in the plain URI form. (This convention proves important when using real URIs as a shorthand for SURT prefixes, as described below.)</p><p>SURT form URIs are typically not used to specify exact URIs for fetching. Rather, SURT form is useful when comparing or sorting URIs. For example, URIs in SURT form sort into natural groups -- all 'archive.org' URIs will be adjacent, regardless of what subdomains like 'books.archive.org' or 'movies.archive.org' are used.</p><p>Most importantly, a SURT form URI, or a truncated version of a SURT form URI, can be used as a <a href="glossary.html#surtprefix">SURT prefix</a>. A SURT prefix will often correspond to all URIs within a common 'area' of interest for crawling. 
For example, the prefix &lt;http://(is,&gt; will be shared by all URIs in the '.is' top-level domain.</p></dd><dt><a name="surtprefix"></a>SURT prefix</dt><dd><p>A URI in <a href="glossary.html#surt">SURT</a> form, especially if truncated, may be of use as a "SURT prefix", a shared prefix string of all SURT form URIs in the same 'area' of interest for web crawling.</p><p>For example, the prefix &lt;http://(is,&gt; will be shared by all SURT form URIs in the '.is' top-level domain. The prefix &lt;http://(org,archive,www,)/movies&gt; (which is also a valid full SURT form URI) will be shared by all URIs at www.archive.org with a path beginning '/movies'.</p><p>A collection of sorted SURT prefixes is an efficient way to specify a desired crawl scope: any URI whose SURT form starts with any of the prefixes should be included.</p><p>A small set of conventions can also be used to calculate an "implied SURT prefix" from a regular URI, such as a URI supplied as a crawl seed. These conventions are:</p><div class="orderedlist"><ol type="1"><li><p>Convert the URI to its SURT form. </p></li><li><p>If there are at least 3 slashes ('/') in the SURT form, remove everything after the last slash. As examples, &lt;http://(org,example,www,)/main/subsection/&gt; is unchanged; &lt;http://(org,example,www,)/main/subsection&gt; is truncated to &lt;http://(org,example,www,)/main/&gt;; &lt;http://(org,example,www,)/&gt; is unchanged; and &lt;http://(org,example,www,)&gt; is unchanged. </p></li><li><p>If the resulting form ends in a closing parenthesis (')'), remove the closing parenthesis. So each of the above examples except for the last is unchanged, while the last, &lt;http://(org,example,www,)&gt;, becomes &lt;http://(org,example,www,&gt;. 
</p></li></ol></div><p>This allows many seed URIs, in their usual form, to imply the most useful SURT prefixes for crawling related URIs -- with the presence or absence of a trailing '/' on URIs without further path-info being a subtle indicator as to whether subdomains of the supplied domain should be included.</p><p>For example, seed &lt;http://www.archive.org/&gt; has both SURT form and implied SURT prefix &lt;http://(org,archive,www,)/&gt;, which is a prefix of all SURT form URIs on www.archive.org. However, any subdomain URI like &lt;http://homepages.www.archive.org/directory&gt; would be ruled out, because its SURT form &lt;http://(org,archive,www,homepages,)/directory&gt; does not begin with the full SURT prefix, including the ')/', deduced from the seed.</p><p>In contrast, seed &lt;http://www.archive.org&gt; (note the lack of trailing slash) has SURT form &lt;http://(org,archive,www,)&gt; and implied SURT prefix &lt;http://(org,archive,www,&gt; (note the lack of trailing ')'). 
This will be a prefix of the SURT forms of all URIs on www.archive.org, as well as of any subdomain URIs like &lt;http://homepages.www.archive.org/directory&gt;, because the full SURT prefix appears at the start of subdomain URI SURT forms.</p></dd><dt><a name="toethreads"></a>Toe Threads</dt><dd><p>When crawling, Heritrix employs a configurable number of Toe Threads to process URIs.</p><p>Each of these threads will request a URI from the Frontier (<a href="config.html#frontier" title="6.1.2.&nbsp;Frontier">Section&nbsp;6.1.2, &ldquo;Frontier&rdquo;</a>), apply each of the configured Processors (<a href="config.html#processors" title="6.1.3.&nbsp;Processing Chains">Section&nbsp;6.1.3, &ldquo;Processing Chains&rdquo;</a>) to it, and finally report it as completed back to the Frontier.</p></dd></dl></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="usecases.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;</td></tr><tr><td valign="top" align="left" width="40%">A.&nbsp;Common Heritrix Use Cases&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;</td></tr></table></div></body></html>