<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>8. Analysis of jobs</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="running.html" title="7. Running a job"><link rel="next" href="outside.html" title="9. Outside the user interface"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">8. Analysis of jobs</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="running.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="outside.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="analysis"></a>8. Analysis of jobs</h2></div></div></div><p>Heritrix offers several facilities for examining the details of a crawl. The reports and logs are also available at run time.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="completedjobs"></a>8.1. Completed jobs</h3></div></div></div><p>The <span class="emphasis"><em>Jobs</em></span> tab (and the page headers) shows how many jobs have completed, along with a link to a page that lists them.</p><p>The following information and options are provided for each completed job:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>UID</strong></span></p><p>Each job has a unique (generated) ID. This is actually a time stamp.
It differentiates jobs with the same name from one another.</p><p>This ID is used (among other things) for creating the job's directory on disk.</p></li><li><p><span class="bold"><strong>Job name</strong></span></p><p>The name that the user gave the job.</p></li><li><p><span class="bold"><strong>Status</strong></span></p><p>Status of the job. Indicates how it ended.</p></li><li><p><span class="bold"><strong>Options</strong></span></p><p>In addition, the following options are available for each job.</p><div class="itemizedlist"><ul type="circle"><li><p><span class="emphasis"><em>Crawl order</em></span></p><p>Opens the job's XML configuration file in a separate window. Generally only of interest to advanced users.</p></li><li><p><span class="emphasis"><em>Crawl report</em></span></p><p>Takes the user to the job's Crawl report (<a href="analysis.html#crawlreport" title="8.3.1. Crawl report">Section 8.3.1, “Crawl report”</a>).</p></li><li><p><span class="emphasis"><em>Seeds report</em></span></p><p>Takes the user to the job's Seeds report (<a href="analysis.html#seedsreport" title="8.3.2. Seeds report">Section 8.3.2, “Seeds report”</a>).</p></li><li><p><span class="emphasis"><em>Seed file</em></span></p><p>Displays the job's seed file.</p></li><li><p><span class="emphasis"><em>Logs</em></span></p><p>Takes the user to the job's logs (<a href="analysis.html#logs" title="8.2. Logs">Section 8.2, “Logs”</a>).</p></li><li><p><span class="emphasis"><em>Journal</em></span></p><p>Takes the user to the Journal page for the job (<a href="running.html#journal">Section 7.4.1, “Journal”</a>). Users can still add entries to it.</p></li><li><p><span class="emphasis"><em>Delete</em></span></p><p>Marks the job as deleted.
This will remove it from the WUI but not from disk.</p></li></ul></div></li></ul></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>It is not possible to directly access the configuration of completed jobs in the same way as for new, pending and running jobs. Instead, users can look at the actual XML configuration file <span class="emphasis"><em>or</em></span> create a new job based on the old one. The new job (which need never be run) will perfectly mirror the settings of the old one.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="logs"></a>8.2. Logs</h3></div></div></div><p>Heritrix writes several logs as it crawls a job. Each crawl job has its own set of these logs.</p><p>The location where logs are written can be configured (expert setting). Otherwise, refer to <code class="literal">crawl-manifest.txt</code> for the on-disk location of the logs (<a href="outside.html#crawl-manifest.txt" title="9.1.2. crawl-manifest.txt">Section 9.1.2, “crawl-manifest.txt”</a>).</p><p>Logs can be manually rotated. Pause the crawl and a <span class="bold"><strong>Rotate Logs</strong></span> link will appear at the base of the screen. Clicking <span class="bold"><strong>Rotate Logs</strong></span> will move aside all current crawl logs, appending a 14-digit GMT timestamp to the moved-aside logs. New log files will be opened for the crawler to use in subsequent crawling.</p><p>The WUI offers four ways of viewing these logs, by:</p><div class="orderedlist"><ol type="1"><li><p><span class="bold"><strong>Line number</strong></span></p><p>View a section of a log that starts at a given line number, plus the X lines following it. X is configurable and defaults to 50.</p></li><li><p><span class="bold"><strong>Time stamp</strong></span></p><p>View a section of a log that starts at a given time stamp, plus the X lines following it.
X is configurable and defaults to 50. The format of the time stamp is the same as in the logs (YYYY-MM-DDTHH:MM:SS.SSS). It is not necessary to supply more detail than is desired. For instance, the entry 2004-04-25T08 will match the first entry made after 8 a.m. on the 25th of April, 2004.</p></li><li><p><span class="bold"><strong>Regular expression</strong></span></p><p>Filter the log based on a regular expression. Only lines matching it (and, optionally, indented lines following it - indentation usually means they relate to the preceding line) are displayed.</p><p>This can be an expensive operation on very large logs, requiring a lot of time for the page to load.</p></li><li><p><span class="bold"><strong>Tail</strong></span></p><p>Look at just the last X lines of the given log. X is configurable and defaults to 50.</p></li></ol></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="crawllog"></a>8.2.1. crawl.log</h4></div></div></div><p>Each URI tried gets an entry in the <code class="literal">crawl.log</code>, regardless of success or failure.</p><p>Below is a two-line extract from a crawl.log:</p><p><pre class="programlisting">2004-07-21T23:29:40.438Z 200 310 http://127.0.0.1:9999/selftest/Charset/charsetselftest_end.html LLLL http://127.0.0.1:9999/selftest/Charset/shiftjis.jsp text/html #000 20040721232940401+10 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -
2004-07-21T23:29:40.502Z 200 225 http://127.0.0.1:9999/selftest/MaxLinkHops/5.html LLLLL http://127.0.0.1:9999/selftest/MaxLinkHops/4.html text/html #000 20040721232940481+12 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -</pre></p><p>The <span class="emphasis"><em>1st</em></span> column is a timestamp in ISO8601 format, to millisecond resolution. The time is the instant of logging. The <span class="emphasis"><em>2nd</em></span> column is the fetch status code.
Usually this is the HTTP status code, but it can also be a negative number if URL processing was unexpectedly terminated. See <a href="glossary.html#statuscodes">Status codes</a> for a listing of possible values.</p><p>The <span class="emphasis"><em>3rd</em></span> column is the size of the downloaded document in bytes. For HTTP, this is the size of the content only; it excludes the size of the HTTP response headers. For DNS, it is the total size of the DNS response. The <span class="emphasis"><em>4th</em></span> column is the URI of the document downloaded. The <span class="emphasis"><em>5th</em></span> column holds breadcrumb codes showing the trail of downloads that led to the current URI. See <a href="glossary.html#discoverypath">Discovery path</a> for a description of possible code values. The <span class="emphasis"><em>6th</em></span> column holds the URI that immediately referenced this URI ('referrer'). Both of the latter two fields -- the discovery path and the referrer URI -- will be empty for URIs such as the seeds.</p><p>The <span class="emphasis"><em>7th</em></span> column holds the document mime type, the <span class="emphasis"><em>8th</em></span> column has the id of the worker thread that downloaded this document, and the <span class="emphasis"><em>9th</em></span> column holds a timestamp (in RFC2550/ARC condensed digits-only format) indicating when the network fetch was begun and, if appropriate, the millisecond duration of the fetch, separated from the begin-time by a '+' character.</p><p>The <span class="emphasis"><em>10th</em></span> column is a SHA1 digest of the content only (headers are not digested). The <span class="emphasis"><em>11th</em></span> column is the 'source tag' inherited by this URI, if that feature is enabled. Finally, the <span class="emphasis"><em>12th</em></span> column holds “<span class="quote">annotations</span>”, if any have been set. Possible annotations include: the number
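<p>As a rough illustration of the column layout described above, the following Python sketch splits one crawl.log entry into its columns. This is not part of Heritrix; the field names are illustrative, and trailing columns may be absent when empty:</p>

```python
def parse_crawl_log_line(line):
    """Split one crawl.log entry into named columns (illustrative names).

    Columns are whitespace-separated; a lone '-' marks an empty value,
    and trailing columns (e.g. annotations) may be missing entirely.
    """
    names = ["timestamp", "status", "size", "uri", "discovery_path",
             "referrer", "mime_type", "thread", "fetch_begin_duration",
             "digest", "source_tag", "annotations"]
    fields = line.split()
    # zip() stops at the shorter sequence, so absent trailing columns
    # simply do not appear in the resulting dict.
    entry = dict(zip(names, fields))
    return {k: (None if v == "-" else v) for k, v in entry.items()}

# First line of the extract shown above:
line = ("2004-07-21T23:29:40.438Z 200 310 "
        "http://127.0.0.1:9999/selftest/Charset/charsetselftest_end.html "
        "LLLL http://127.0.0.1:9999/selftest/Charset/shiftjis.jsp "
        "text/html #000 20040721232940401+10 "
        "M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -")
entry = parse_crawl_log_line(line)
print(entry["status"], entry["size"], entry["mime_type"])  # prints: 200 310 text/html
```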