of times the URI was tried (this field is '-' if the download was never retried); the literal <code class="literal">lenTrunc</code> if the download was truncated because it exceeded configured size limits; <code class="literal">timeTrunc</code> if the download was truncated because the download time exceeded configured limits; or <code class="literal">midFetchTrunc</code> if a midfetch filter determined that the download should be truncated.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10C56"></a>8.2.2. local-errors.log</h4></div></div></div><p>Errors that occur while processing a URI and that the processors can handle (usually network-related problems encountered while trying to fetch the document) are logged here.</p><p>Generally these can be safely ignored, but they can provide insight to advanced users when other logs or reports contain unusual data.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10C5D"></a>8.2.3. progress-statistics.log</h4></div></div></div><p>This log is written by the <code class="literal">StatisticsTracker</code> (<a href="config.html#stattrack" title="6.1.4. Statistics Tracking">Section 6.1.4, “Statistics Tracking”</a>).</p><p>At configurable intervals a line about the progress of the crawl is written to this file.</p><p>The legend is as follows:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>timestamp</strong></span></p><p>Timestamp indicating when the line was written, in ISO8601 format.</p></li><li><p><span class="bold"><strong>discovered</strong></span></p><p>Number of URIs discovered to date.</p></li><li><p><span class="bold"><strong>queued</strong></span></p><p>Number of URIs currently queued.</p></li><li><p><span class="bold"><strong>downloaded</strong></span></p><p>Number of URIs downloaded to date.</p></li><li><p><span class="bold"><strong>doc/s(avg)</strong></span></p><p>Number of documents downloaded per second since the last snapshot; the figure in parentheses is the average since the crawl began.</p></li><li><p><span class="bold"><strong>KB/s(avg)</strong></span></p><p>Amount downloaded, in kilobytes per second, since the last snapshot; the figure in parentheses is the average since the crawl began.</p></li><li><p><span class="bold"><strong>dl-failures</strong></span></p><p>Number of URIs that Heritrix has failed to download to date.</p></li><li><p><span class="bold"><strong>busy-thread</strong></span></p><p>Number of toe threads currently busy processing a URI.</p></li><li><p><span class="bold"><strong>mem-use-KB</strong></span></p><p>Amount of memory, in kilobytes, currently assigned to the Java Virtual Machine.</p></li></ul></div>
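<p>Put together, a line in this log carries the above fields in order. The line below is only an illustration of that layout; the values are invented, and exact column spacing may vary between Heritrix versions:</p><pre class="screen">2005-07-18T23:22:40Z  12885  9611  3262  4.3(3.9)  85(78)  12  25  33686</pre>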
</div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10CAD"></a>8.2.4. runtime-errors.log</h4></div></div></div><p>This log captures unexpected exceptions and errors that occur during the crawl. Some may be due to hardware limitations (out of memory, although that error may occur without being written to this log), but most are probably caused by software bugs, either in Heritrix's core or, more likely, in one of the pluggable classes.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10CB2"></a>8.2.5. uri-errors.log</h4></div></div></div><p>Contains errors that arose while dealing with encountered URIs; these are usually caused by erroneous URIs. The log is generally only of interest to advanced users trying to explain unexpected crawl behavior.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="recover_gz"></a>8.2.6. recover.gz</h4></div></div></div><p>The recover.gz file is a gzipped journal of Frontier events. It can be used to restore the Frontier after a crash to roughly the state it had before the crash. See <a href="outside.html#recover" title="9.3. Recovery of Frontier State and recover.gz">Section 9.3, “Recovery of Frontier State and recover.gz”</a> to learn more.</p>
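<p>Because the file is gzip-compressed, its contents can be inspected without running Heritrix. The following is a minimal read-only sketch in Java; it assumes the journal is line-oriented text under the gzip compression, and the class name and default path are illustrative. Actual recovery should go through the procedure in Section 9.3 rather than manual processing of this file.</p><pre class="programlisting">import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

/**
 * Read-only peek at a Frontier recovery journal. Heritrix replays the
 * journal itself via the recovery procedure in Section 9.3; this class
 * only prints the first few entries for inspection.
 */
public class RecoverJournalPeek {
    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real location as an argument.
        String path = args.length &gt; 0 ? args[0] : "recover.gz";
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(path)), "UTF-8"))) {
            String line;
            int shown = 0;
            while ((line = in.readLine()) != null &amp;&amp; shown++ &lt; 20) {
                System.out.println(line); // one journalled Frontier event per line (assumed)
            }
        }
    }
}</pre>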
</div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="reports"></a>8.3. Reports</h3></div></div></div><p>Heritrix's WUI offers a couple of reports on ongoing and completed crawl jobs.</p><p>Both are accessible via the Reports tab.</p><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>Although jobs are loaded after restarts of the software, their statistics are not reloaded with them. These reports are therefore only available as long as Heritrix is not shut down. All of the information is, however, replicated in report files at the end of each crawl for permanent storage.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="crawlreport"></a>8.3.1. Crawl report</h4></div></div></div><p>At the top of the crawl report some general statistics about the crawl are printed. All of these replicate data from the Console, so refer to <a href="running.html#console" title="7.1. Web Console">Section 7.1, “Web Console”</a> for more information on them.</p><p>Next in line are statistics about the number of URIs pending, discovered, currently queued, downloaded, and so forth. Question marks after most of the values provide pop-up descriptions of those metrics.</p><p>Following that is a breakdown of the distribution of status codes among URIs, sorted from most frequent to least. The number of URIs found for each status code is displayed; only successful fetches are counted here.</p><p>A similar breakdown for file types (MIME types) follows. In addition to the number of URIs per file type, the amount of data for that file type is also displayed.</p><p>Last, a breakdown per host is provided, giving the number of URIs and the amount of data for each. For ongoing crawls the time elapsed since the last URI was finished for each host is also displayed; this value can provide valuable insight into which hosts are still being actively crawled. Note that this value is only available while the crawl is in progress, since it has no meaning afterwards. Any pauses of the crawl may also distort these values, at least in the short term following resumption of crawling; most noticeably, all of these values will continue to grow while the crawl is paused.</p><p>Especially in broad crawls, this list can grow very large.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="seedsreport"></a>8.3.2. Seeds report</h4></div></div></div><p>This report lists all the seeds in the seeds file and also any <span class="emphasis"><em>discovered</em></span> seeds if that option is enabled (that is, if redirects from seeds are treated as new seeds). For each seed the status code of the fetch attempt is presented in verbose form (that is, with a minimal textual description of its meaning). Following that is the seed's disposition, a quick indication of whether the seed was successfully crawled, not attempted, or failed to crawl.</p><p>Successfully crawled seeds are any that Heritrix encountered no internal errors while crawling; the seed may nevertheless have generated a 404 (file not found) error.</p><p>Failure to crawl might be caused by a bug in Heritrix or by an invalid seed (commonly a failed DNS lookup).</p><p>If the report is examined before the crawl is finished, there might still be seeds not yet attempted, especially if there is trouble fetching their prerequisites or if the seed list is exceptionally large.</p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="running.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="outside.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">7. Running a job </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 9. Outside the user interface</td></tr></table></div></body></html>