<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Glossary</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="usecases.html" title="A. Common Heritrix Use Cases"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">Glossary</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="usecases.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> </td></tr></table><hr></div><div class="glossary" id="glossary"><div class="titlepage"><div><div><h2 class="title"><a name="glossary"></a>Glossary</h2></div></div></div><div class="glossdiv"><h3 class="title">Some definitions</h3><dl><dt><a name="bytes"></a>Bytes, KB and statistics</dt><dd><p>Heritrix adheres to the following conventions for displaying byte and bit amounts:</p><p><pre class="programlisting">
Legend  Type
B       Bytes
KB      Kilobytes - 1 KB = 1024 B
MB      Megabytes - 1 MB = 1024 KB
GB      Gigabytes - 1 GB = 1024 MB
b       bits
Kb      Kilobits  - 1 Kb = 1000 b
Mb      Megabits  - 1 Mb = 1000 Kb
Gb      Gigabits  - 1 Gb = 1000 Mb</pre></p><p>This also applies to all logs.</p></dd><dt><a name="checkpointing"></a>Checkpointing</dt><dd><p>Heritrix checkpointing has been heavily influenced by what Mercator provided. In <a href="http://citeseer.nj.nec.com/najork01highperformance.html" target="_top">one of the papers on Mercator</a> it is described this way: “<span class="quote">Checkpointing is an important part of any long-running process such as a web crawl. 
By checkpointing we mean writing a representation of the crawler's state to stable storage that, in the event of a failure, is sufficient to allow the crawler to recover its state by reading the checkpoint and to resume crawling from the exact state it was in at the time of the checkpoint. By this definition, in the event of a failure, any work performed after the most recent checkpoint is lost, but none of the work up to the most recent checkpoint. In Mercator, the frequency with which the background thread performs a checkpoint is user-configurable; we typically checkpoint anywhere from 1 to 4 times per day.</span>”</p><p>See <a href="outside.html#checkpoint" title="9.4. Checkpointing">Section 9.4, “Checkpointing”</a> for discussion of the Heritrix implementation.</p></dd><dt><a name="crawluri"></a>CrawlURI</dt><dd><p>A URI and its associated data, such as the parent URI and the number of links from the seed.</p></dd><dt><a name="dates"></a>Dates and times</dt><dd><p>All times in Heritrix are GMT, assuming the clock and timezone on the local system are correct.</p><p>This means that all dates and times in logs are GMT, all dates and times shown in the WUI are GMT, and any dates or times entered by the user need to be in GMT.</p></dd><dt><a name="discovereduris"></a>Discovered URIs</dt><dd><p>Any URI that has been confirmed to be within 'scope'. This includes URIs that are queued, are being processed, or have finished processing. It does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely because the operator changed the scope definition).</p><p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the combined total of queued, in-process and finished items, because duplicate URIs may be queued and processed. 
This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.</p></div></p></dd><dt><a name="discoverypath"></a>Discovery path</dt><dd><p>Each URI has a discovery path. The path contains one character for each link or embed followed from the seed.</p><p>The character legend is as follows.</p><pre class="programlisting">
R - Redirect
E - Embed
X - Speculative embed (aggressive/Javascript link extraction)
L - Link</pre><p>The discovery path of seeds is an empty string.</p></dd><dt><a name="glossary_frontier"></a>Frontier</dt><dd><p>A Frontier is a pluggable module in Heritrix that maintains the internal state of the crawl. See <a href="config.html#frontier" title="6.1.2. Frontier">Section 6.1.2, “Frontier”</a>.</p></dd><dt><a name="holdingvcrawling"></a>"Holding Jobs" vs. "Crawling Jobs"</dt><dd><p>The mode <span class="emphasis"><em>Crawling Jobs</em></span> generally means that the crawler will start executing a job as soon as one is made available in the pending jobs queue (as long as there is not a job already running).</p><p>If the crawler is in the <span class="emphasis"><em>Holding Jobs</em></span> mode, jobs added to the pending jobs queue will be held; they will not be started, even if there are no jobs currently being run.</p></dd><dt>Host</dt><dd><p>A host can serve multiple domains, and a single domain can be served by multiple hosts. For our purposes so far, host == the hostname in the URI. DNS is not considered; it is volatile and may be unavailable. 
So when Heritrix gets the URIs...<pre class="programlisting">
http://www.example.com
http://search.example.com
http://201.199.7.15</pre>...even if they all point to the 201.199.7.15 IP, they are 3 different logical hosts (at the level of the URI/HTTP protocol).</p><p>We think conformant HTTP proxies behave similarly: even if they know that www.example.com == 201.199.7.15, they will not consider the two interchangeable.</p><p>This is not ideal for politeness, where we would want the rules to apply to the physical host rather than the logical one.</p></dd><dt><a name="link-hop-count"></a>Link hop count</dt><dd><p>Number of links followed from the seed to reach the current URI. Seeds have a link hop count of 0.</p><p>This number is equal to the number of 'L's in a URI's discovery path.</p></dd><dt>Pending URIs</dt><dd><p>Number of URIs that are awaiting detailed processing.</p><p>Number of discovered URIs that have not been inspected for scope or duplicates. Depending on the implementation of the Frontier, this might always be zero. It may also be an adjusted number that tries to account for duplicates by estimation.</p></dd><dt><a name="politeness"></a>Politeness</dt><dd><p>Politeness refers to attempts by the crawler software to limit load on a site. Without politeness restrictions, the crawler might overwhelm smaller sites and even cause moderately sized sites to slow down significantly.</p><p>Unless you have express permission to crawl a site aggressively, you should apply strict politeness rules to any crawl.</p></dd><dt><a name="queueduris"></a>Queued URIs</dt><dd><p>Number of URIs queued up and waiting for processing.</p><p>This includes any URIs that failed but will be retried. 
Basically, this is any discovered URI that has neither been processed nor is currently being processed.</p></dd><dt><a name="regexpr"></a>Regular expressions</dt><dd><p>All regular expressions used by Heritrix are Java regular expressions.</p><p>Java regular expressions differ from those used in Perl, for example, in several ways. For detailed information on Java regular expressions, see the Java API for <code class="literal">java.util.regex.Pattern</code> on Sun's home page (<a href="http://java.sun.com" target="_top">java.sun.com</a>).</p><p>For the API of Java SE v1.4.2, see <a href="http://java.sun.com/j2se/1.4.2/docs/api/index.html" target="_top">http://java.sun.com/j2se/1.4.2/docs/api/index.html</a>. It is recommended that you look up the API for the version of Java that is being used to run Heritrix.</p></dd><dt><a name="server"></a>Server</dt><dd><p>A server is a service on a <a href="glossary.html#host">Host</a>. A host might run more than one service, differentiated by port number.</p></dd><dt><a name="statuscodes"></a>Status codes</dt><dd><p>Each crawled URI gets a status code. 
This code (or number) is an indication of what happened when Heritrix tried to fetch the URI.</p><p>Codes ranging from 200 to 599 are standard HTTP response codes and information about their meanings is available at the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html" target="_top">World Wide Web Consortium's web page</a>.</p><p>Other status codes used by Heritrix (from <a href="http://crawler.archive.org/xref/org/archive/crawler/datamodel/FetchStatusCodes.html#38" target="_top">org.archive.crawler.datamodel.FetchStatusCodes</a>):<pre class="programlisting">
Code  Meaning
1     Successful DNS lookup
0     Fetch never tried (perhaps protocol unsupported or illegal URI)
-1    DNS lookup failed
-2    HTTP connect failed
-3    HTTP connect broken
-4    HTTP timeout (before any meaningful response received)
-5    Unexpected runtime exception; see runtime-errors.log
-6    Prerequisite domain-lookup failed, precluding fetch attempt