<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>7.&nbsp;Running a job</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix User Manual"><link rel="up" href="index.html" title="Heritrix User Manual"><link rel="prev" href="config.html" title="6.&nbsp;Configuring jobs and profiles"><link rel="next" href="analysis.html" title="8.&nbsp;Analysis of jobs"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">7.&nbsp;Running a job</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="config.html">Prev</a>&nbsp;</td><th align="center" width="60%">&nbsp;</th><td align="right" width="20%">&nbsp;<a accesskey="n" href="analysis.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="running"></a>7.&nbsp;Running a job</h2></div></div></div><p>Once a crawl job has been created and properly configured it can be    run. 
To start a crawl the user must go to the web Console page (via the    Console tab).</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="console"></a>7.1.&nbsp;Web Console</h3></div></div></div><p>The web Console presents an overview of the current status of the      crawler.</p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N109D4"></a>7.1.1.&nbsp;Crawler Status Box</h4></div></div></div><p>The following information is always provided:</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>Crawler Status</strong></span></p><p>Is the crawler in <span class="emphasis"><em>Holding Jobs</em></span> or            <span class="emphasis"><em>Crawling Jobs</em></span> mode? If holding, no new jobs            pending or created will be started (but a job already begun will            continue). If crawling, the next pending or created job will be            started as soon as possible, for example when a previous job            finishes. For more detail see <a href="glossary.html#holdingvcrawling">"Holding Jobs" vs. "Crawling Jobs"</a>.</p><p>To the right of the current crawler status, a control link            reading either "Start" or "Hold" will toggle the crawler between            the two modes.</p></li><li><p><span class="bold"><strong>Jobs</strong></span></p><p>If a current job is in progress, its status and name will            appear. Alternatively, "None running" will appear to indicate no            job is in progress because the crawler is holding, or "None            available" if no job is in progress because no jobs have been            queued.</p><p>Below the current job info, the number of jobs pending and            completed is shown. 
The completed count includes those that failed            to start for some reason (see <a href="running.html#failedtostart" title="7.3.2.&nbsp;Job failed to start">Section&nbsp;7.3.2, &ldquo;Job failed to start&rdquo;</a> for            more on misconfigured jobs).</p></li><li><p><span class="bold"><strong>Alerts</strong></span></p><p>The total number of alerts and, in brackets, the number of new            alerts, if any.</p><p>See <a href="running.html#alerts">Section&nbsp;7.3.4, &ldquo;Alerts&rdquo;</a> for more on alerts.</p></li><li><p><span class="bold"><strong>Memory</strong></span></p><p>The amount of memory currently used, the size of the Java            heap, and the maximum size to which the heap can possibly grow are            all displayed, in kilobytes (KB).</p></li></ul></div></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10A0B"></a>7.1.2.&nbsp;Job Status Box</h4></div></div></div><p>If a job is in progress -- running, paused, or between job        states -- the following information is also provided in a second area        underneath the <span class="emphasis"><em>Crawler Status Box</em></span>.</p><div class="itemizedlist"><ul type="disc"><li><p><span class="bold"><strong>Job Status</strong></span></p><p>The current status of the job in progress. Jobs being            crawled are usually running or paused.</p><p>To the right of the current status, controls for            pausing/resuming or terminating the current job will appear as            appropriate.</p><p>When a job is terminated, its status will be marked as            'Ended by operator'. All currently active threads will be allowed            to finish behind the scenes even though the WUI will report the            job being terminated at once. 
If the crawler is in "Crawling Jobs"            mode, the next pending job, if any, will start immediately.</p><p>When a running job is paused, it may take some time for all            the active threads to enter a paused state. Until then the job is            considered to be still running and 'pausing'. It is possible to            resume from this interim state.</p><p>Once paused a job is considered to be suspended and time            spent in that state does not count towards elapsed job time or            rates.</p></li><li><p><span class="bold"><strong>Rates</strong></span></p><p>The number of URIs successfully processed per second is            shown, both the rate in the latest sampling interval and (in            parentheses) the average rate since the crawl began. The sampling            interval is typically about 20 seconds, and is adjustable via the            "interval-seconds" setting. The latest rate of progress can            fluctuate considerably, as the crawler workload varies and            housekeeping memory and file operations occur -- especially if the            sampling interval has been set to a low value.</p><p>Also shown is the rate of successful content collection, in            KB/sec, for the latest sampling interval and (in parentheses) the            average since the crawl began. (See <a href="glossary.html#bytes">Bytes, KB and statistics</a>.)</p></li><li><p><span class="bold"><strong>Time</strong></span></p><p>The amount of time that has elapsed since the crawl began            (excluding any time spent paused) is displayed, as well as a very            crude estimate of the required time remaining. 
(This estimate does            not yet consider the typical certainty of discovering more URIs to            crawl, and ignores other factors, so should not be relied upon            until it can be improved in future releases.)</p></li><li><p><span class="bold"><strong>Load</strong></span></p><p>A number of measures of how busy or loaded the job            has made the crawler are shown. The number of active threads, compared to            the total available, is shown. Typically, if only a small number            of threads are active, it is because activating more threads would            exceed the configured politeness settings, given the remaining URI            workload. (For example, if all remaining URIs are on a single            host, no more than one thread will be active -- and often none            will be, as polite delays are observed between requests.)</p><p>The <span class="emphasis"><em>congestion ratio</em></span> is a rough            estimate of how much additional capacity, as a multiple of current            capacity, would be necessary to crawl the current workload at the            maximum rate allowable by politeness settings. (It is calculated            by comparing the number of internal queues that are progressing            with those that are waiting for a thread to become            available.)</p><p>The <span class="emphasis"><em>deepest queue</em></span> number indicates the            longest chain of pending URIs that must be processed sequentially,            which is a better indicator of the work remaining than the total            number of URIs pending. 
(A thousand URIs in a thousand independent            queues can complete in parallel very quickly; a thousand in one            queue will take longer.)</p><p>The <span class="emphasis"><em>average depth</em></span> number indicates the            average depth of the last URI in every active sequential            queue.</p></li><li><p><span class="bold"><strong>Totals</strong></span></p><p>A progress bar indicates the relative percentage of            completed URIs to those known and pending. As with the remaining            time estimate, no consideration is given to the likelihood of            discovering additional URIs to crawl. So, the percentage completed            can shrink as well as grow, especially in broader crawls.</p><p>To the left of the progress bar, the total number of URIs            successfully downloaded is shown; to the right, the total number            of URIs queued for future processing. Beneath the bar, the total            of downloaded plus queued is shown, as well as the uncompressed            total size of successfully downloaded data in kilobytes. See <a href="glossary.html#bytes">Bytes, KB and statistics</a>. 
(Compressed ARCs on disk will be somewhat            smaller than this figure.)</p></li><li><p><span class="bold"><strong>Paused Operations</strong></span></p><p>When the job is paused, additional options will appear such            as <span class="emphasis"><em>View or Edit Frontier URIs</em></span>.</p><p>The <span class="emphasis"><em>View or Edit Frontier URIs</em></span> option            takes the operator to a page allowing the lookup and deletion of            URIs in the frontier by using a regular expression, or addition of            URIs from an external file (even URIs that have already been            processed).</p></li></ul></div><p>Some of this information is replicated in the head of each page        (see <a href="running.html#header" title="7.3.3.&nbsp;All page status header">Section&nbsp;7.3.3, &ldquo;All page status header&rdquo;</a>).</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10A6C"></a>7.1.3.&nbsp;Console Bottom Operations</h4></div></div></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10A6F"></a>7.1.3.1.&nbsp;Refresh</h5></div></div></div><p>Update the status display. The status display does not update          itself and quickly becomes out of date as crawling proceeds. This          also refreshes the options available if they've changed as a result          of a change in the state of the job being crawled.</p></div><div class="sect4" lang="en"><div class="titlepage"><div><div><h5 class="title"><a name="N10A74"></a>7.1.3.2.&nbsp;Shut down Heritrix</h5></div></div></div><p>It is possible to shut down Heritrix through this option.          
Doing so will terminate the Java process running Heritrix; since this          also disables the WUI, the only way to start it up again will be via          the command line.</p><p>The user is asked to confirm this action twice to prevent          accidental shutdowns.</p><p>This option will try to terminate any current job gracefully          but will only wait a very short time for active threads to finish.

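The congestion ratio described under <em>Load</em> above can be illustrated with a small calculation. This is a hypothetical sketch based only on the manual's wording (comparing internal queues that are progressing with those waiting for a thread), not Heritrix's actual Java implementation; the function name and signature are inventions for illustration.

```python
def congestion_ratio(progressing: int, waiting: int) -> float:
    """Rough multiple of current capacity needed to crawl the current
    workload at the maximum politeness-allowed rate.

    Hypothetical sketch: the manual says the ratio is calculated by
    comparing the number of internal queues that are progressing with
    those waiting for a thread to become available.
    """
    if progressing == 0:
        # Every queue is waiting: capacity need is unbounded
        # (or the crawler is simply idle).
        return float("inf") if waiting else 1.0
    return (progressing + waiting) / progressing

# 10 queues progressing while 30 wait for a thread: roughly 4x the
# current capacity would be needed to keep every queue busy.
print(congestion_ratio(10, 30))  # 4.0
```

A ratio near 1.0 means politeness, not thread capacity, is the limiting factor; a high ratio suggests more threads (or more crawler instances) could raise throughput.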