<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>8.&nbsp;Analysis of jobs</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="running.html" title="7.&nbsp;Running a job"><link rel="next" href="outside.html" title="9.&nbsp;Outside the user interface"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">8.&nbsp;Analysis of jobs</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="running.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="outside.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="analysis"></a>8.&nbsp;Analysis of jobs</h2></div></div></div><p>Heritrix offers several facilities for examining the details of a crawl. The reports and logs are also available at run time.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="completedjobs"></a>8.1.&nbsp;Completed jobs</h3></div></div></div><p>The <span class="emphasis"><em>Jobs</em></span> tab (and the page headers) shows how many completed jobs there are, along with a link to a page that lists them.</p><p>The following information and options are provided for each completed job:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>UID</strong></span></p><p>Each job has a unique (generated) ID. This is actually a time stamp. 
It differentiates jobs with the same name from one another.</p><p>This ID is used (among other things) for creating the job's directory on disk.</p></li><li><p><span class="bold"><strong>Job name</strong></span></p><p>The name that the user gave the job.</p></li><li><p><span class="bold"><strong>Status</strong></span></p><p>Status of the job. Indicates how it ended.</p></li><li><p><span class="bold"><strong>Options</strong></span></p><p>In addition, the following options are available for each job:</p><div class="itemizedlist"><ul type="circle"><li><p><span class="emphasis"><em>Crawl order</em></span></p><p>Opens the actual XML file of the job's configuration in a separate window. Generally only of interest to advanced users.</p></li><li><p><span class="emphasis"><em>Crawl report</em></span></p><p>Takes the user to the job's Crawl report (<a href="analysis.html#crawlreport" title="8.3.1.&nbsp;Crawl report">Section&nbsp;8.3.1, &ldquo;Crawl report&rdquo;</a>).</p></li><li><p><span class="emphasis"><em>Seeds report</em></span></p><p>Takes the user to the job's Seeds report (<a href="analysis.html#seedsreport" title="8.3.2.&nbsp;Seeds report">Section&nbsp;8.3.2, &ldquo;Seeds report&rdquo;</a>).</p></li><li><p><span class="emphasis"><em>Seed file</em></span></p><p>Displays the seed file.</p></li><li><p><span class="emphasis"><em>Logs</em></span></p><p>Takes the user to the job's logs (<a href="analysis.html#logs" title="8.2.&nbsp;Logs">Section&nbsp;8.2, &ldquo;Logs&rdquo;</a>).</p></li><li><p><span class="emphasis"><em>Journal</em></span></p><p>Takes the user to the Journal page for the job (<a href="running.html#journal">Section&nbsp;7.4.1, &ldquo;Journal&rdquo;</a>). Users can still add entries to it.</p></li><li><p><span class="emphasis"><em>Delete</em></span></p><p>Marks the job as deleted. 
This will remove it from the WUI but not from disk.</p></li></ul></div></li></ul></div><div class="note" style="margin-left: 0.5in; margin-right: 0.5in;"><h3 class="title">Note</h3><p>It is not possible to directly access the configuration for completed jobs in the same way as you can for new, pending and running jobs. Instead, users can look at the actual XML configuration file <span class="emphasis"><em>or</em></span> create a new job based on the old one. The new job (which need never be run) will exactly mirror the settings of the old one.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="logs"></a>8.2.&nbsp;Logs</h3></div></div></div><p>Heritrix writes several logs as it crawls a job. Each crawl job has its own set of these logs.</p><p>The location where logs are written can be configured (expert setting). Otherwise, refer to the <code class="literal">crawl-manifest.txt</code> for the on-disk location of the logs (<a href="outside.html#crawl-manifest.txt" title="9.1.2.&nbsp;crawl-manifest.txt">Section&nbsp;9.1.2, &ldquo;crawl-manifest.txt&rdquo;</a>).</p><p>Logs can be manually rotated. Pause the crawl and a <span class="bold"><strong>Rotate Logs</strong></span> link will appear at the base of the screen. Clicking on <span class="bold"><strong>Rotate Logs</strong></span> will move aside all current crawl logs, appending a 14-digit GMT timestamp to the moved-aside logs. New log files will be opened for the crawler to use in subsequent crawling. </p><p>The WUI offers users four ways of viewing these logs, by:</p><div class="orderedlist"><ol type="1"><li><p><span class="bold"><strong>Line number</strong></span></p><p>View a section of a log that starts at a given line number and includes the next X lines following it. 
X is configurable and is 50 by default.</p></li><li><p><span class="bold"><strong>Time stamp</strong></span></p><p>View a section of a log that starts at a given time stamp and includes the next X lines following it. X is configurable and is 50 by default. The format of the time stamp is the same as in the logs (YYYY-MM-DDTHH:MM:SS.SSS). It is not necessary to supply more detail than is desired. For instance, the entry 2004-04-25T08 will match the first entry made after 8 am on the 25th of April, 2004.</p></li><li><p><span class="bold"><strong>Regular expression</strong></span></p><p>Filter the log based on a regular expression. Only lines matching it (and optionally the indented lines following a match, which are usually related to it) are displayed.</p><p>This can be an expensive operation on really big logs, requiring a lot of time for the page to load.</p></li><li><p><span class="bold"><strong>Tail</strong></span></p><p>Allows users to just look at the last X lines of the given log. 
X is configurable and is 50 by default.</p></li></ol></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="crawllog"></a>8.2.1.&nbsp;crawl.log</h4></div></div></div><p>Each URI tried gets an entry in the <code class="literal">crawl.log</code>, regardless of success or failure.</p><p>Below is a two-line extract from a crawl.log:</p><p><pre class="programlisting">2004-07-21T23:29:40.438Z   200        310 http://127.0.0.1:9999/selftest/Charset/charsetselftest_end.html LLLL http://127.0.0.1:9999/selftest/Charset/shiftjis.jsp text/html #000 20040721232940401+10 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -
2004-07-21T23:29:40.502Z   200        225 http://127.0.0.1:9999/selftest/MaxLinkHops/5.html LLLLL http://127.0.0.1:9999/selftest/MaxLinkHops/4.html text/html #000 20040721232940481+12 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -</pre></p><p>The <span class="emphasis"><em>1st</em></span> column is a timestamp in ISO 8601 format, to millisecond resolution. The time is the instant of logging. The <span class="emphasis"><em>2nd</em></span> column is the fetch status code. Usually this is the HTTP status code, but it can also be a negative number if URL processing was unexpectedly terminated. See <a href="glossary.html#statuscodes">Status codes</a> for a listing of possible values.</p><p>The <span class="emphasis"><em>3rd</em></span> column is the size of the downloaded document in bytes. For HTTP, this is the size of the content only. It excludes the size of the HTTP response headers. For DNS, it is the total size of the DNS response. The <span class="emphasis"><em>4th</em></span> column is the URI of the document downloaded. The <span class="emphasis"><em>5th</em></span> column holds breadcrumb codes showing the trail of downloads that got us to the current URI. 
See <a href="glossary.html#discoverypath">Discovery path</a> for a description of possible code values. The <span class="emphasis"><em>6th</em></span> column holds the URI that immediately referenced this URI ('referrer'). Both of the latter two fields -- the discovery path and the referrer URL -- will be empty for URIs such as the seeds.</p><p>The <span class="emphasis"><em>7th</em></span> column holds the document mime type, the <span class="emphasis"><em>8th</em></span> column has the id of the worker thread that downloaded this document, and the <span class="emphasis"><em>9th</em></span> column holds a timestamp (in RFC2550/ARC condensed digits-only format) indicating when a network fetch was begun and, if appropriate, the millisecond duration of the fetch, separated from the begin-time by a '+' character.</p><p>The <span class="emphasis"><em>10th</em></span> field is a SHA1 digest of the content only (headers are not digested). The <span class="emphasis"><em>11th</em></span> column is the 'source tag' inherited by this URI, if that feature is enabled. Finally, the <span class="emphasis"><em>12th</em></span> column holds &ldquo;<span class="quote">annotations</span>&rdquo;, if any have been set. Possible annotations include: the number
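<p>The column layout described above can be sketched as a small parser. This is an illustrative Python sketch, not part of Heritrix: the function name, the field names, and <code class="literal">SAMPLE</code> are hypothetical. It assumes that columns are separated by whitespace, that a lone '-' marks an empty column (as the trailing '-' in the extract suggests), and that an 11-column line such as the one in the extract simply lacks the source tag column.</p>

```python
# Field names are illustrative labels for the twelve columns described in
# the text; they are not identifiers defined by Heritrix.
FIELDS = [
    "timestamp", "status", "size", "uri", "discovery_path", "referrer",
    "mime_type", "thread", "fetch_time", "digest", "source_tag", "annotations",
]

# First line of the crawl.log extract shown in this section.
SAMPLE = (
    "2004-07-21T23:29:40.438Z   200        310 "
    "http://127.0.0.1:9999/selftest/Charset/charsetselftest_end.html LLLL "
    "http://127.0.0.1:9999/selftest/Charset/shiftjis.jsp text/html #000 "
    "20040721232940401+10 M77KNTBZH2IU6V2SIG5EEG45EJICNQNM -"
)


def parse_crawl_log_line(line):
    """Split one crawl.log line into a dict keyed by the names in FIELDS."""
    # Split on runs of whitespace; anything after the 11th split stays
    # together in the final (annotations) column.
    parts = line.strip().split(None, 11)
    if len(parts) == 11:
        # Assumption: an 11-column line has no source tag column, so an
        # empty placeholder is inserted before the annotations column.
        parts.insert(10, "-")
    if len(parts) != 12:
        raise ValueError("expected 11 or 12 columns, got %d" % len(parts))

    record = dict(zip(FIELDS, parts))
    # Status may be negative when processing ended unexpectedly.
    record["status"] = int(record["status"])
    # Assumption: '-' in the size column means no size was recorded.
    record["size"] = None if record["size"] == "-" else int(record["size"])
    # The 9th column is "<begin timestamp>+<duration in ms>"; the duration
    # part is optional.
    begin, _, duration = record["fetch_time"].partition("+")
    record["fetch_begin"] = begin
    record["fetch_duration_ms"] = int(duration) if duration else None
    return record


if __name__ == "__main__":
    for name, value in parse_crawl_log_line(SAMPLE).items():
        print(name, "=", value)
```

<p>Keeping the discovery path, referrer and digest as plain strings leaves their interpretation to the caller; only the numeric columns are converted.</p>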
