7. Running a job

Once a crawl job has been created and properly configured, it can be run. To start a crawl, go to the web Console page (via the Console tab).

7.1. Web Console

The web Console presents an overview of the current status of the crawler.

7.1.1. Crawler Status Box

The following information is always provided:

- Crawler Status

  Is the crawler in "Holding Jobs" or "Crawling Jobs" mode? If holding, no new pending or created jobs will be started (but a job already begun will continue). If crawling, the next pending or created job will be started as soon as possible, for example when a previous job finishes. For more detail see "Holding Jobs" vs. "Crawling Jobs" in the glossary.

  To the right of the current crawler status, a control link reading either "Start" or "Hold" toggles the crawler between the two modes.

- Jobs

  If a job is in progress, its status and name will appear. Otherwise, "None running" indicates that no job is in progress because the crawler is holding, and "None available" indicates that no job is in progress because no jobs have been queued.

  Below the current job info, the number of jobs pending and completed is shown. The completed count includes jobs that failed to start for some reason (see Section 7.3.2, "Job failed to start", for more on misconfigured jobs).

- Alerts

  The total number of alerts and, in brackets, the number of new alerts, if any.

  See Section 7.3.4, "Alerts", for more on alerts.

- Memory

  The amount of memory currently used, the size of the Java heap, and the maximum size to which the heap can grow are all displayed, in kilobytes (KB). A minimal sketch of where these figures come from follows this list.
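The three Memory figures map directly onto what the standard java.lang.Runtime API exposes. A minimal sketch using only plain JDK calls; this is illustrative, not Heritrix's actual reporting code:

```java
// Reading the three heap figures shown in the Crawler Status Box.
// Illustrative only: plain JDK calls, not Heritrix's reporting code.
public class HeapStats {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedKb = (rt.totalMemory() - rt.freeMemory()) / 1024; // memory currently used
        long heapKb = rt.totalMemory() / 1024;                     // current size of the Java heap
        long maxKb  = rt.maxMemory() / 1024;                       // maximum size the heap can grow to
        System.out.println(usedKb + " KB used of " + heapKb + " KB heap (max " + maxKb + " KB)");
    }
}
```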
7.1.2. Job Status Box

If a job is in progress -- running, paused, or between job states -- the following information is also provided in a second area underneath the Crawler Status Box.

- Job Status

  The current status of the job in progress. Jobs being crawled are usually running or paused.

  To the right of the current status, controls for pausing/resuming or terminating the current job appear as appropriate.

  When a job is terminated, its status is marked as "Ended by operator". All currently active threads are allowed to finish behind the scenes, even though the WUI reports the job as terminated at once. If the crawler is in "Crawling Jobs" mode, the next pending job, if any, starts immediately.

  When a running job is paused, it may take some time for all the active threads to enter a paused state. Until then the job is considered to be still running and "pausing". It is possible to resume from this interim state.

  Once paused, a job is considered suspended, and time spent in that state does not count towards elapsed job time or rates.

- Rates

  The number of URIs successfully processed per second is shown, both the rate in the latest sampling interval and (in parentheses) the average rate since the crawl began. The sampling interval is typically about 20 seconds and is adjustable via the "interval-seconds" setting. The latest rate of progress can fluctuate considerably as the crawler workload varies and housekeeping memory and file operations occur, especially if the sampling interval has been set to a low value.

  Also shown is the rate of successful content collection, in KB/sec, for the latest sampling interval and (in parentheses) the average since the crawl began. (See "Bytes, KB and statistics" in the glossary.) The sketch just below shows the arithmetic behind the two rates.
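Both URI rates reduce to counter deltas over timestamps. In the sketch below the class and field names are hypothetical, not Heritrix internals, and the pause-time exclusion described under Job Status is omitted for brevity:

```java
// The arithmetic behind the two URI rates: latest sampling interval and
// average since the crawl began. Names are hypothetical, not Heritrix
// internals, and time spent paused is not excluded here, unlike the real display.
public class RateSampler {
    private final long crawlStartMillis = System.currentTimeMillis();
    private long lastSampleMillis = crawlStartMillis;
    private long lastSampleCount;   // URIs finished as of the previous sample
    private long finishedUriCount;  // URIs finished so far, updated by the crawl

    public void uriFinished() { finishedUriCount++; }

    /** Call once per sampling interval (about 20 seconds by default). */
    public String sample() {
        long now = System.currentTimeMillis();
        double intervalRate = (finishedUriCount - lastSampleCount) * 1000.0
                / Math.max(1, now - lastSampleMillis);   // latest interval
        double averageRate = finishedUriCount * 1000.0
                / Math.max(1, now - crawlStartMillis);   // since crawl began
        lastSampleMillis = now;
        lastSampleCount = finishedUriCount;
        return String.format("%.2f URIs/sec (%.2f avg)", intervalRate, averageRate);
    }
}
```

A short interval makes the latest-interval figure noisy for exactly the reason given above: a burst of housekeeping or a single slow host dominates a small sample.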
- Time

  The amount of time that has elapsed since the crawl began (excluding any time spent paused) is displayed, as well as a very crude estimate of the time remaining. (This estimate does not yet consider the likelihood of discovering more URIs to crawl, and ignores other factors, so it should not be relied upon until it can be improved in future releases.)

- Load

  Several measures are shown of how busy or loaded the job has made the crawler. The number of active threads, compared to the total available, is shown. Typically, if only a small number of threads are active, it is because activating more threads would exceed the configured politeness settings, given the remaining URI workload. (For example, if all remaining URIs are on a single host, no more than one thread will be active -- and often none will be, as polite delays are observed between requests.)

  The congestion ratio is a rough estimate of how much additional capacity, as a multiple of current capacity, would be necessary to crawl the current workload at the maximum rate allowed by the politeness settings. (It is calculated by comparing the number of internal queues that are progressing with those that are waiting for a thread to become available.)

  The deepest queue number indicates the longest chain of pending URIs that must be processed sequentially, which is a better indicator of the work remaining than the total number of URIs pending. (A thousand URIs in a thousand independent queues can complete in parallel very quickly; a thousand URIs in one queue will take much longer.)

  The average depth number indicates the average depth of the last URI in every active sequential queue.

- Totals

  A progress bar indicates the percentage of completed URIs relative to those known and pending. As with the remaining-time estimate, no consideration is given to the likelihood of discovering additional URIs to crawl, so the percentage completed can shrink as well as grow, especially in broader crawls.

  To the left of the progress bar, the total number of URIs successfully downloaded is shown; to the right, the total number of URIs queued for future processing. Beneath the bar, the total of downloaded plus queued is shown, as well as the uncompressed total size of successfully downloaded data in kilobytes. See "Bytes, KB and statistics" in the glossary. (Compressed ARCs on disk will be somewhat smaller than this figure.) The sketch just below reduces these derived figures to arithmetic.
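Read literally, the congestion ratio, the progress-bar percentage, and the crude remaining-time estimate are each a single division over counts the crawler already tracks. The sketch below is one plausible reading of the descriptions above, with hypothetical input names:

```java
// One plausible reading of the derived figures described above.
// Input names are hypothetical; this is not Heritrix's actual code.
public final class ProgressMetrics {

    /** Congestion ratio: capacity wanted (all queues ready to progress) as a
        multiple of current capacity (queues actually progressing). 1.0 means
        politeness-limited full speed; 2.0 means twice the capacity is needed. */
    static double congestionRatio(int progressingQueues, int waitingQueues) {
        if (progressingQueues == 0) return Double.POSITIVE_INFINITY;
        return (progressingQueues + waitingQueues) / (double) progressingQueues;
    }

    /** Progress-bar percentage: finished URIs against all URIs known so far.
        Can shrink over time, since newly discovered URIs grow the denominator. */
    static double percentComplete(long downloadedUris, long queuedUris) {
        long known = downloadedUris + queuedUris;
        return known == 0 ? 0.0 : 100.0 * downloadedUris / known;
    }

    /** The "very crude" remaining-time estimate: queued work at the average
        rate, ignoring URIs not yet discovered, as the caveat above warns. */
    static long estimatedSecondsRemaining(long queuedUris, double avgUrisPerSec) {
        return avgUrisPerSec <= 0 ? Long.MAX_VALUE : (long) (queuedUris / avgUrisPerSec);
    }
}
```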
- Paused Operations

  When the job is paused, additional options appear, such as View or Edit Frontier URIs.

  The View or Edit Frontier URIs option takes the operator to a page that allows looking up and deleting URIs in the frontier by regular expression, or adding URIs from an external file (even URIs that have already been processed).

Some of this information is replicated in the head of each page (see Section 7.3.3, "All page status header").

7.1.3. Console Bottom Operations

7.1.3.1. Refresh

Updates the status display. The status display does not update itself and quickly becomes out of date as crawling proceeds. Refreshing also updates the available options if they have changed as a result of a change in the state of the job being crawled.

7.1.3.2. Shut down Heritrix

It is possible to shut down Heritrix through this option. Doing so terminates the Java process running Heritrix, and the only way to start it again is via the command line, since shutting down also disables the WUI.

The user is asked to confirm this action twice, to prevent accidental shutdowns.

This option will try to terminate any current job gracefully
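The behaviour described here -- report termination at once, let active threads finish behind the scenes, then end the Java process -- is, in general JVM terms, what a shutdown hook provides. A generic sketch of that mechanism, not Heritrix's actual shutdown path:

```java
// Generic JVM shutdown-hook mechanism: cleanup runs before the process exits.
// Illustration of the general technique only; not Heritrix's shutdown code.
public class GracefulShutdown {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            // Stand-in for an active crawl thread finishing its current URI.
            try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
            System.out.println("worker finished its current unit of work");
        });

        // Runs when the JVM begins an orderly exit (System.exit, SIGTERM, ...).
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("shutting down: waiting for active threads...");
            try { worker.join(); } catch (InterruptedException ignored) { }
            System.out.println("all threads finished; exiting");
        }));

        worker.start();
    }
}
```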
