<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>7. Running a job</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="config.html" title="6. Configuring jobs and profiles"><link rel="next" href="analysis.html" title="8. Analysis of jobs"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">7. Running a job</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="config.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="analysis.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="running"></a>7. Running a job</h2></div></div></div><p>Once a crawl job has been created and properly configured it can be run. To start a crawl the user must go to the web Console page (via the Console tab).</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="console"></a>7.1. Web Console</h3></div></div></div><p>The web Console presents an overview of the current status of the crawler.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N109CD"></a>7.1.1. Crawler Status Box</h4></div></div></div><p>The following information is always provided:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>Crawler Status</strong></span></p><p>Is the crawler in <span class="emphasis"><em>Holding Jobs</em></span> or <span class="emphasis"><em>Crawling Jobs</em></span> mode? 
If holding, no new jobs pending or created will be started (but a job already begun will continue). If crawling, the next pending or created job will be started as soon as possible, for example when a previous job finishes. For more detail see <a href="glossary.html#holdingvcrawling">"Holding Jobs" vs. "Crawling Jobs"</a>.</p><p>To the right of the current crawler status, a control link reading either "Start" or "Hold" will toggle the crawler between the two modes.</p></li><li><p><span class="bold"><strong>Jobs</strong></span></p><p>If a current job is in progress, its status and name will appear. Alternatively, "None running" will appear to indicate no job is in progress because the crawler is holding, or "None available" if no job is in progress because no jobs have been queued.</p><p>Below the current job info, the number of jobs pending and completed is shown. The completed count includes those that failed to start for some reason (see <a href="running.html#failedtostart" title="7.3.2. Job failed to start">Section 7.3.2, “Job failed to start”</a> for more on misconfigured jobs).</p></li><li><p><span class="bold"><strong>Alerts</strong></span></p><p>Total number of alerts, and within brackets new alerts, if any.</p><p>See <a href="running.html#alerts">Section 7.3.4, “Alerts”</a> for more on alerts.</p></li><li><p><span class="bold"><strong>Memory</strong></span></p><p>The amount of memory currently used, the size of the Java heap, and the maximum size to which the heap can possibly grow are all displayed, in kilobytes (KB).</p></li></ul></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10A04"></a>7.1.2. 
Job Status Box</h4></div></div></div><p>If a job is in progress -- running, paused, or between job states -- the following information is also provided in a second area underneath the <span class="emphasis"><em>Crawler Status Box</em></span>.</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>Job Status</strong></span></p><p>The current status of the job in progress. Jobs being crawled are usually running or paused.</p><p>To the right of the current status, controls for pausing/resuming or terminating the current job will appear as appropriate.</p><p>When a job is terminated, its status will be marked as 'Ended by operator'. All currently active threads will be allowed to finish behind the scenes even though the WUI will report the job being terminated at once. If the crawler is in "Crawling Jobs" mode, the next pending job, if any, will start immediately.</p><p>When a running job is paused, it may take some time for all the active threads to enter a paused state. Until then the job is considered to be still running and 'pausing'. It is possible to resume from this interim state.</p><p>Once paused, a job is considered to be suspended and time spent in that state does not count towards elapsed job time or rates.</p></li><li><p><span class="bold"><strong>Rates</strong></span></p><p>The number of URIs successfully processed per second is shown, both the rate in the latest sampling interval and (in parentheses) the average rate since the crawl began. The sampling interval is typically about 20 seconds, and is adjustable via the "interval-seconds" setting. The latest rate of progress can fluctuate considerably, as the crawler workload varies and housekeeping memory and file operations occur -- especially if the sampling interval has been set to a low value.</p><p>Also shown is the rate of successful content collection, in KB/sec, for the latest sampling interval and (in parentheses) the average since the crawl began. 
(See <a href="glossary.html#bytes">Bytes, KB and statistics</a>.) </p></li><li><p><span class="bold"><strong>Time</strong></span></p><p>The amount of time that has elapsed since the crawl began (excluding any time spent paused) is displayed, as well as a very crude estimate of the required time remaining. (This estimate does not yet consider the typical certainty of discovering more URIs to crawl, and ignores other factors, so should not be relied upon until it can be improved in future releases.)</p></li><li><p><span class="bold"><strong>Load</strong></span></p><p>A number of measures are shown of how busy or loaded the job has made the crawler. The number of active threads, compared to the total available, is shown. Typically, if only a small number of threads are active, it is because activating more threads would exceed the configured politeness settings, given the remaining URI workload. (For example, if all remaining URIs are on a single host, no more than one thread will be active -- and often none will be, as polite delays are observed between requests.)</p><p>The <span class="emphasis"><em>congestion ratio</em></span> is a rough estimate of how much additional capacity, as a multiple of current capacity, would be necessary to crawl the current workload at the maximum rate allowable by politeness settings. (It is calculated by comparing the number of internal queues that are progressing with those that are waiting for a thread to become available.)</p><p>The <span class="emphasis"><em>deepest queue</em></span> number indicates the longest chain of pending URIs that must be processed sequentially, which is a better indicator of the work remaining than the total number of URIs pending. 
(A thousand URIs in a thousand independent queues can complete in parallel very quickly; a thousand in one queue will take longer.)</p><p>The <span class="emphasis"><em>average depth</em></span> number indicates the average depth of the last URI in every active sequential queue.</p></li><li><p><span class="bold"><strong>Totals</strong></span></p><p>A progress bar indicates the relative percentage of completed URIs to those known and pending. As with the remaining time estimate, no consideration is given to the likelihood of discovering additional URIs to crawl. So, the percentage completed can shrink as well as grow, especially in broader crawls.</p><p>To the left of the progress bar, the total number of URIs successfully downloaded is shown; to the right, the total number of URIs queued for future processing. Beneath the bar, the total of downloaded plus queued is shown, as well as the uncompressed total size of successfully downloaded data in kilobytes. See <a href="glossary.html#bytes">Bytes, KB and statistics</a>. (Compressed ARCs on disk will be somewhat smaller than this figure.)</p></li><li><p><span class="bold"><strong>Paused Operations</strong></span></p><p>When the job is paused, additional options will appear such as <span class="emphasis"><em>View or Edit Frontier URIs</em></span>.</p><p>The <span class="emphasis"><em>View or Edit Frontier URIs</em></span> option takes the operator to a page allowing the lookup and deletion of URIs in the frontier by using a regular expression, or addition of URIs from an external file (even URIs that have already been processed).</p></li></ul></div><p>Some of this information is replicated in the head of each page (see <a href="running.html#header" title="7.3.3. All page status header">Section 7.3.3, “All page status header”</a>).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10A65"></a>7.1.3. 
Console Bottom Operations</h4></div></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10A68"></a>7.1.3.1. Refresh</h5></div></div></div><p>Update the status display. The status display does not update itself and quickly becomes out of date as crawling proceeds. This also refreshes the options available if they've changed as a result of a change in the state of the job being crawled.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10A6D"></a>7.1.3.2. Shut down Heritrix</h5></div></div></div><p>It is possible to shut down Heritrix through this option. Doing so will terminate the Java process running Heritrix; since this also disables the WUI, the only way to start it up again will be via the command line.</p><p>The user is asked to confirm this action twice to prevent accidental shutdowns.</p><p>This option will try to terminate any current job gracefully.</p>
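The crude remaining-time estimate described under <em>Time</em> in Section 7.1.2 amounts to dividing the number of queued URIs by the average processing rate. A minimal sketch in Python (the function and parameter names are illustrative, not part of Heritrix):

```python
def crude_time_remaining(queued_uris, avg_uris_per_sec):
    """Crude estimate of seconds remaining: pending work divided by
    the average rate. As the manual notes, this ignores the likelihood
    of discovering more URIs, so treat it as a floor, not a promise."""
    if avg_uris_per_sec <= 0:
        return float("inf")
    return queued_uris / avg_uris_per_sec

# 1200 queued URIs at an average of 20 URIs/sec -> 60 seconds
```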
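The <em>congestion ratio</em> from Section 7.1.2 can be illustrated with a small sketch. This is only a rough rendering of the comparison the manual describes (queues making progress versus queues waiting for a thread), not Heritrix's exact internal formula:

```python
def congestion_ratio(progressing_queues, waiting_queues):
    """Rough multiple of current capacity that would be needed to
    crawl the workload at the politeness-limited maximum rate."""
    if progressing_queues == 0:
        return float("inf")
    return (progressing_queues + waiting_queues) / progressing_queues
```

A ratio of 1.0 means every queue that could make progress has a thread; a ratio of 4.0 suggests roughly four times the current capacity would be needed.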
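The progress bar under <em>Totals</em> in Section 7.1.2 is simply completed URIs over completed plus queued, which is why the percentage can shrink when newly discovered URIs join the queue. A sketch under that assumption (names are illustrative):

```python
def percent_complete(downloaded, queued):
    """Progress-bar percentage: completed URIs as a share of all
    URIs known so far (completed plus queued)."""
    total = downloaded + queued
    return 100.0 * downloaded / total if total else 0.0

# Discovery can push the percentage backwards:
# percent_complete(500, 500)  -> 50.0
# after 1000 newly discovered URIs are queued:
# percent_complete(500, 1500) -> 25.0
```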