of times the URI was tried (this field is '-' if the download was never retried); the literal <code class="literal">lenTrunc</code> if the download was truncated because it exceeded configured size limits; <code class="literal">timeTrunc</code> if the download was truncated because the download time exceeded configured limits; or <code class="literal">midFetchTrunc</code> if a midfetch filter determined that the download should be truncated.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10C56"></a>8.2.2. local-errors.log</h4></div></div></div><p>Errors that occur while processing a URI and that the processors can handle (usually network-related problems encountered while trying to fetch the document) are logged here.</p><p>Generally these can be safely ignored, but they can provide insight to advanced users when other logs or reports contain unusual data.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10C5D"></a>8.2.3. progress-statistics.log</h4></div></div></div><p>This log is written by the <code class="literal">StatisticsTracker</code> (<a href="config.html#stattrack" title="6.1.4. Statistics Tracking">Section 6.1.4, “Statistics Tracking”</a>).</p><p>At configurable intervals a line about the progress of the crawl is written to this file.</p><p>The legend is as follows:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>timestamp</strong></span></p><p>Timestamp indicating when the line was written, in ISO8601 format.</p></li><li><p><span class="bold"><strong>discovered</strong></span></p><p>Number of URIs discovered to date.</p></li><li><p><span class="bold"><strong>queued</strong></span></p><p>Number of URIs currently queued.</p></li><li><p><span class="bold"><strong>downloaded</strong></span></p><p>Number of URIs downloaded to date.</p></li><li><p><span class="bold"><strong>doc/s(avg)</strong></span></p><p>Number of documents downloaded per second since the last snapshot; the figure in parentheses is the average since the crawl began.</p></li><li><p><span class="bold"><strong>KB/s(avg)</strong></span></p><p>Amount downloaded, in kilobytes per second, since the last snapshot; the figure in parentheses is the average since the crawl began.</p></li><li><p><span class="bold"><strong>dl-failures</strong></span></p><p>Number of URIs that Heritrix has failed to download to date.</p></li><li><p><span class="bold"><strong>busy-thread</strong></span></p><p>Number of toe threads currently busy processing a URI.</p></li><li><p><span class="bold"><strong>mem-use-KB</strong></span></p><p>Amount of memory, in kilobytes, currently assigned to the Java Virtual Machine.</p></li></ul></div>
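<p>Put together, a line in this log carries the above fields in order. The line below is only an illustration of that layout; the values are invented, and exact column spacing may vary between Heritrix versions:</p><pre class="screen">2005-07-18T23:22:40Z  12885  9611  3262  4.3(3.9)  85(78)  12  25  33686</pre>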
</div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10CAD"></a>8.2.4. runtime-errors.log</h4></div></div></div><p>This log captures unexpected exceptions and errors that occur during the crawl. Some may be due to hardware limitations (out of memory, although that error may occur without being written to this log), but most are probably caused by software bugs, either in Heritrix's core or, more likely, in one of the pluggable classes.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10CB2"></a>8.2.5. uri-errors.log</h4></div></div></div><p>Contains errors that arose while dealing with encountered URIs; these are usually caused by erroneous URIs. The log is generally only of interest to advanced users trying to explain unexpected crawl behavior.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="recover_gz"></a>8.2.6. recover.gz</h4></div></div></div><p>The recover.gz file is a gzipped journal of Frontier events. It can be used to restore the Frontier after a crash to roughly the state it had before the crash. See <a href="outside.html#recover" title="9.3. Recovery of Frontier State and recover.gz">Section 9.3, “Recovery of Frontier State and recover.gz”</a> to learn more.</p>
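<p>Because the file is gzip-compressed, its contents can be inspected without running Heritrix. The following is a minimal read-only sketch in Java; it assumes the journal is line-oriented text under the gzip compression, and the class name and default path are illustrative. Actual recovery should go through the procedure in Section 9.3 rather than manual processing of this file.</p><pre class="programlisting">import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

/**
 * Read-only peek at a Frontier recovery journal. Heritrix replays the
 * journal itself via the recovery procedure in Section 9.3; this class
 * only prints the first few entries for inspection.
 */
public class RecoverJournalPeek {
    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real location as an argument.
        String path = args.length &gt; 0 ? args[0] : "recover.gz";
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(path)), "UTF-8"))) {
            String line;
            int shown = 0;
            while ((line = in.readLine()) != null &amp;&amp; shown++ &lt; 20) {
                System.out.println(line); // one journalled Frontier event per line (assumed)
            }
        }
    }
}</pre>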
</div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="reports"></a>8.3. Reports</h3></div></div></div><p>Heritrix's WUI offers a couple of reports on ongoing and completed crawl jobs.</p><p>Both are accessible via the Reports tab.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Although jobs are loaded after restarts of the software, their statistics are not reloaded with them. These reports are therefore only available as long as Heritrix is not shut down. All of the information is, however, replicated in report files at the end of each crawl for permanent storage.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="crawlreport"></a>8.3.1. Crawl report</h4></div></div></div><p>At the top of the crawl report some general statistics about the crawl are printed. All of these replicate data from the Console, so refer to <a href="running.html#console" title="7.1. Web Console">Section 7.1, “Web Console”</a> for more information on them.</p><p>Next in line are statistics about the number of URIs pending, discovered, currently queued, downloaded, and so forth. Question marks after most of the values provide pop-up descriptions of those metrics.</p><p>Following that is a breakdown of the distribution of status codes among URIs, sorted from most frequent to least. The number of URIs found for each status code is displayed; only successful fetches are counted here.</p><p>A similar breakdown for file types (MIME types) follows. In addition to the number of URIs per file type, the amount of data for that file type is also displayed.</p><p>Last, a breakdown per host is provided, giving the number of URIs and the amount of data for each. For ongoing crawls the time elapsed since the last URI was finished for each host is also displayed; this value can provide valuable insight into which hosts are still being actively crawled. Note that this value is only available while the crawl is in progress, since it has no meaning afterwards. Any pauses of the crawl may also distort these values, at least in the short term following resumption of crawling; most noticeably, all of these values will continue to grow while the crawl is paused.</p><p>Especially in broad crawls, this list can grow very large.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="seedsreport"></a>8.3.2. Seeds report</h4></div></div></div><p>This report lists all the seeds in the seeds file and also any <span class="emphasis"><em>discovered</em></span> seeds if that option is enabled (that is, if redirects from seeds are treated as new seeds). For each seed the status code of the fetch attempt is presented in verbose form (that is, with a minimal textual description of its meaning). Following that is the seed's disposition, a quick indication of whether the seed was successfully crawled, not attempted, or failed to crawl.</p><p>Successfully crawled seeds are any that Heritrix encountered no internal errors while crawling; the seed may nevertheless have generated a 404 (file not found) error.</p><p>Failure to crawl might be caused by a bug in Heritrix or by an invalid seed (commonly a failed DNS lookup).</p><p>If the report is examined before the crawl is finished, there might still be seeds not yet attempted, especially if there is trouble fetching their prerequisites or if the seed list is exceptionally large.</p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="running.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="outside.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">7. Running a job </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 9. Outside the user interface</td></tr></table></div></body></html>