-7     URI recognized as unsupported or illegal
-8     Multiple retries all failed, retry limit reached
-50    Temporary status assigned to URIs awaiting preconditions; appearance in logs may be a bug
-60    Failure status assigned to URIs which could not be queued by the Frontier (and may in fact be unfetchable)
-61    Prerequisite robots.txt-fetch failed, precluding a fetch attempt
-62    Some other prerequisite failed, precluding a fetch attempt
-63    A prerequisite (of any type) could not be scheduled, precluding a fetch attempt
-3000  Severe Java 'Error' conditions (OutOfMemoryError, StackOverflowError, etc.) during URI processing.
-4000  'chaff' detection of traps/content of negligible value applied
-4001  Too many link hops away from seed
-4002  Too many embed/transitive hops away from last URI in scope
-5000  Out of scope upon reexamination (only happens if scope changes during crawl)
-5001  Blocked from fetch by user setting
-5002  Blocked by a custom processor
-5003  Blocked due to exceeding an established quota
-5004  Blocked due to exceeding an established runtime
-6000  Deleted from Frontier by user
-7000  Processing thread was killed by the operator (perhaps because of a hung condition)
-9998  Robots.txt rules precluded fetch</pre></p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Codes and explanations are also available under the Help link in the web UI. 
</p></div><p>Please note that status codes defined by Heritrix are subject to change between versions; in particular, new codes may be added to cover a wider range of situations.</p></dd>
<dt><a name="surt"></a>SURT</dt><dd><p>SURT stands for Sort-friendly URI Reordering Transform, and is a transformation applied to URIs which makes their left-to-right representation better match the natural hierarchy of domain names.</p>
<p>A URI &lt;scheme://domain.tld/path?query&gt; has SURT form &lt;scheme://(tld,domain,)/path?query&gt;.</p>
<p>Conversion to SURT form also involves making all characters lowercase, and changing the 'https' scheme to 'http'. Further, the '/' after a URI authority component -- for example, the third slash in a regular HTTP URI -- will only appear in the SURT form if it appeared in the plain URI form. (This convention proves important when using real URIs as a shorthand for SURT prefixes, as described below.)</p>
<p>SURT form URIs are typically not used to specify exact URIs for fetching. Rather, SURT form is useful when comparing or sorting URIs. For example, URIs in SURT form sort into natural groups -- all 'archive.org' URIs will be adjacent, regardless of what subdomains like 'books.archive.org' or 'movies.archive.org' are used.</p>
<p>Most importantly, a SURT form URI, or a truncated version of a SURT form URI, can be used as a <a href="glossary.html#surtprefix">SURT prefix</a>. A SURT prefix will often correspond to all URIs within a common 'area' of interest for crawling. For example, the prefix &lt;http://(is,&gt; will be shared by all URIs in the '.is' top-level domain.</p></dd>
<dt><a name="surtprefix"></a>SURT prefix</dt><dd><p>A URI in <a href="glossary.html#surt">SURT</a> form, especially if truncated, may be of use as a "SURT prefix", a shared prefix string of all SURT form URIs in the same 'area' of interest for web crawling.</p>
<p>For example, the prefix &lt;http://(is,&gt; will be shared by all SURT form URIs in the '.is' top-level domain. 
The prefix &lt;http://(org,archive,www,)/movies&gt; (which is also a valid full SURT form URI) will be shared by all URIs at www.archive.org with a path beginning '/movies'.</p>
<p>A collection of sorted SURT prefixes is an efficient way to specify a desired crawl scope: any URI whose SURT form starts with any of the prefixes should be included.</p>
<p>A small set of conventions can also be used to calculate an "implied SURT prefix" from a regular URI, such as a URI supplied as a crawl seed. These conventions are:</p>
<div class="orderedlist"><ol type="1"><li><p>Convert the URI to its SURT form. </p></li>
<li><p>If there are at least 3 slashes ('/') in the SURT form, remove everything after the last slash. As examples, &lt;http://(org,example,www,)/main/subsection/&gt; is unchanged; &lt;http://(org,example,www,)/main/subsection&gt; is truncated to &lt;http://(org,example,www,)/main/&gt;; &lt;http://(org,example,www,)/&gt; is unchanged; and &lt;http://(org,example,www,)&gt; is unchanged. </p></li>
<li><p>If the resulting form ends in a closing parenthesis (')'), remove that parenthesis. So each of the above examples except for the last is unchanged, while the last, &lt;http://(org,example,www,)&gt;, becomes &lt;http://(org,example,www,&gt;. </p></li></ol></div>
<p>This allows many seed URIs, in their usual form, to imply the most useful SURT prefixes for crawling related URIs -- with the presence or absence of a trailing '/' on URIs without further path-info being a subtle indicator as to whether subdomains of the supplied domain should be included.</p>
<p>For example, seed &lt;http://www.archive.org/&gt; has both SURT form and implied SURT prefix &lt;http://(org,archive,www,)/&gt;, which is a prefix of all SURT form URIs on www.archive.org. 
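</p><p>The SURT conversion and the three conventions listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not Heritrix's own implementation: the function names to_surt, implied_surt_prefix and in_scope are hypothetical, and the sketch assumes a plain scheme://host/path URI, ignoring user-info, ports and other special cases. Calling implied_surt_prefix on a seed and passing the result to in_scope reproduces the trailing-slash behaviour described in this entry.</p>

```python
import re

# Hypothetical helpers for illustration; not part of Heritrix.
_URI_RE = re.compile(r"^([a-z][a-z0-9+.-]*)://([^/?#]*)(.*)$")


def to_surt(uri):
    """Return a simplified SURT form of a plain scheme://host/path URI."""
    uri = uri.lower()                      # SURT form is all lowercase
    match = _URI_RE.match(uri)
    if match is None:
        raise ValueError("not a scheme://authority URI: " + uri)
    scheme, host, rest = match.groups()
    if scheme == "https":                  # 'https' is changed to 'http'
        scheme = "http"
    # Reverse the dot-separated host labels and end each with a comma.
    labels = "".join(label + "," for label in reversed(host.split(".")))
    # 'rest' keeps the slash after the authority only if the URI had one.
    return scheme + "://(" + labels + ")" + rest


def implied_surt_prefix(uri):
    """Apply the three conventions to a plain URI such as a crawl seed."""
    surt = to_surt(uri)                    # step 1: convert to SURT form
    if surt.count("/") >= 3:               # step 2: cut after the last slash
        surt = surt[: surt.rfind("/") + 1]
    if surt.endswith(")"):                 # step 3: drop a trailing ')'
        surt = surt[:-1]
    return surt


def in_scope(uri, prefixes):
    """True if the URI's SURT form starts with any of the given prefixes."""
    surt = to_surt(uri)
    return any(surt.startswith(prefix) for prefix in prefixes)
```

<p>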
However, any subdomain URI like &lt;http://homepages.www.archive.org/directory&gt; would be ruled out, because its SURT form &lt;http://(org,archive,www,homepages,)/directory&gt; does not begin with the full SURT prefix, including the ')/', deduced from the seed.</p>
<p>In contrast, seed &lt;http://www.archive.org&gt; (note the lack of trailing slash) will become SURT form &lt;http://(org,archive,www,)&gt;, and implied SURT prefix &lt;http://(org,archive,www,&gt; (note the lack of trailing ')'). This will be the prefix of all URIs on www.archive.org, as well as any subdomain URIs like &lt;http://homepages.www.archive.org/directory&gt;, because the SURT forms of subdomain URIs also begin with the full SURT prefix.</p></dd>
<dt><a name="toethreads"></a>Toe Threads</dt><dd><p>When crawling, Heritrix employs a configurable number of Toe Threads to process URIs.</p>
<p>Each of these threads will request a URI from the Frontier (<a href="config.html#frontier" title="6.1.2. Frontier">Section 6.1.2, “Frontier”</a>), apply each of the configured Processors (<a href="config.html#processors" title="6.1.3. Processing Chains">Section 6.1.3, “Processing Chains”</a>) to it and finally report it as completed back to the Frontier.</p></dd></dl></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="usecases.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> </td></tr><tr><td valign="top" align="left" width="40%">A. Common Heritrix Use Cases </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> </td></tr></table></div></body></html>