running.html

…but will only wait a very short time for active threads to finish.

7.2. Pending jobs

At any given time there can be any number of crawl jobs waiting for their turn to be crawled.

From the Jobs tab the user can access a list of these pending jobs (it is also possible to reach them from the header; see Section 7.3.3, "All page status header").

The list displays the name of each job and its status (currently all pending jobs have the status 'Pending'), and offers the following options for each job:

- View order: Opens the actual XML configuration file in a separate window. Of interest to advanced users only.
- Edit configuration: Takes the user to the Settings page of the job's configuration (see Section 6.3, "Settings").
- Journal: Takes the user to the job's Journal (see Section 7.4.1, "Journal").
- Delete: Deletes the job (it is only marked as deleted; it is not removed from disk).

7.3. Monitoring a running job

In addition to the logs and reports generally available for all jobs (see Section 8.2, "Logs" and Section 8.3, "Reports"), some information is provided only for jobs that are being crawled.

Note: The Crawl Report (see Section 8.3.1, "Crawl report") contains one piece of information that is only available on active crawls: the amount of time that has elapsed since a URI belonging to each host was last finished.

7.3.1. Internal reports on ongoing crawl

The following reports are only available while the crawler is running. They provide information about the internal state of certain parts of the crawler. Generally this information is only of interest to advanced users who possess detailed knowledge of the internal workings of those modules.

These reports can be accessed from the Reports tab while a job is being crawled.

7.3.1.1. Frontier report

A report on the internal state of the frontier. On large crawls (with thousands of hosts holding pending URIs) it can be unwieldy in size, or in the amount of time and memory it takes to compose.

7.3.1.2. Thread report

Contains information about what each toe thread is doing and how long it has been doing it. It also allows users to terminate threads that have become stuck. Terminated threads are not actually removed from memory, since Java provides no way of doing that. Instead they are isolated from the rest of the running program, and the URI they are working on is reported back to the frontier as if it had failed to be processed.

Caution: Terminating threads should only be done by advanced users who understand the effect of doing so.
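The "isolate rather than destroy" behaviour described above can be pictured with a short sketch. This is not Heritrix's actual implementation; the names (ToeThread, Frontier, reportFailure) are hypothetical and only stand in for the idea that a stuck worker is flagged as retired, its current URI is handed back to the frontier as failed, and the thread object itself is abandoned rather than forcibly killed.

```java
// Illustrative sketch only -- not the real Heritrix classes.
// "Terminating" a stuck worker means isolating it: the thread is marked
// retired, its URI is reported back as failed, and the rest of the
// crawler simply stops consulting it.

interface Frontier {                       // hypothetical frontier API
    String nextUri();
    void reportFailure(String uri);        // URI goes back as "failed"
}

class ToeThread extends Thread {           // hypothetical worker thread
    private final Frontier frontier;
    private volatile boolean retired = false;
    private volatile String currentUri;

    ToeThread(Frontier frontier) { this.frontier = frontier; }

    @Override
    public void run() {
        while (!retired) {
            currentUri = frontier.nextUri();
            try {
                process(currentUri);       // may hang on a bad host
            } catch (Exception e) {
                frontier.reportFailure(currentUri);
            }
        }
    }

    /** Called when the operator terminates a stuck thread from the thread report. */
    void retire() {
        retired = true;                    // isolate: the loop takes no new work
        if (currentUri != null) {
            frontier.reportFailure(currentUri);  // hand the URI back as failed
        }
        interrupt();                       // best effort; the Thread object itself
                                           // cannot be removed from memory
    }

    private void process(String uri) throws Exception {
        // fetch and process the URI (omitted)
    }
}
```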
7.3.1.3. Processors report

A report on each processor. Not all processors provide reports. Typically they report figures such as the number of URIs handled, links extracted, and so on.

This report is saved to a file at the end of the crawl (see Section 9.1.6, "processors-report.txt").

7.3.2. Job failed to start

If a job is misconfigured in such a way that no crawling is possible, it might seem as if it never started. What actually happens is that the crawl is started, but during initialization it is immediately terminated and sent to the list of completed jobs (see Section 8.1, "Completed jobs"). In those instances an explanation of what went wrong is displayed on the completed jobs page, and an alert is also created.

A common cause of this is forgetting to set the HTTP header's user-agent and from attributes to valid values (see Section 6.3.1.3, "HTTP headers").

If no processors are set on the job (or the modules are otherwise badly misconfigured), the job may succeed in initializing but immediately exhaust the seed list, failing to actually download anything. This will not trigger any errors, but a review of the job's logs should highlight the problem. So if a job terminates immediately after starting without errors, the configuration (especially the Modules section) should be reviewed for errors.
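As a rough illustration of the kind of pre-flight check that catches the header misconfiguration above, the sketch below validates user-agent and from values before a job is launched. The exact rules Heritrix applies are defined in Section 6.3.1.3 and are not reproduced here; the rules assumed below (a URL inside the user-agent, an e-mail address in from) are assumptions made for illustration only.

```java
// Illustrative pre-flight check, not Heritrix's actual validation.
// Assumed rules (see Section 6.3.1.3 for the real ones):
//   - user-agent should identify the operator and include a URL
//   - from should look like a contact e-mail address
import java.util.regex.Pattern;

public class HttpHeaderCheck {

    private static final Pattern URL_IN_AGENT =
            Pattern.compile("https?://\\S+");
    private static final Pattern EMAIL =
            Pattern.compile("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");

    /** Returns null if the headers look usable, otherwise a human-readable problem. */
    public static String check(String userAgent, String from) {
        if (userAgent == null || !URL_IN_AGENT.matcher(userAgent).find()) {
            return "user-agent should identify you and include a URL, e.g. "
                 + "\"mycrawler (+http://example.org/crawl-info)\"";
        }
        if (from == null || !EMAIL.matcher(from).matches()) {
            return "from should be a valid contact e-mail address";
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(check("mozilla", "nobody"));   // prints a complaint
        System.out.println(check(
                "mycrawler (+http://example.org/crawl-info)",
                "webmaster@example.org"));                // prints null (OK)
    }
}
```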
7.3.3. All page status header

At the top of every page in the WUI, right next to the Heritrix logo, is a brief overview of the crawler's current status.

Its three lines contain the following information (starting at the top left and working across and down).

The first item is the time at which the page was displayed. This is useful because the status of the crawler continues to change after a page loads, but those changes are not reflected on the page until it is reloaded (usually manually, by the user). As always, this time is in GMT.

Right next to it is the number of current and new alerts.

The second line tells the user whether the crawler is in "Crawling Jobs" or "Holding Jobs" mode (see the Glossary entry "Holding Jobs" vs. "Crawling Jobs"). If a job is in progress, its status and name are also shown.

At the beginning of the final line, the numbers of pending and completed jobs are displayed. Clicking either value takes the user to the related overview page. Finally, if a job is in progress, the total number of URIs completed, the elapsed time and the URIs/sec figure are shown.

7.3.4. Alerts

The number of existing and new alerts is displayed both in the Console (Section 7.1, "Web Console") and in the header of each page (Section 7.3.3, "All page status header").

Clicking the link made up of those numbers takes the user to an overview of the alerts. The alerts are presented as messages, with unread ones clearly marked in bold, and the user is offered the options of reading them, marking them as read and deleting them.

Clicking an alert brings up a screen with its details.

Alerts are generated in response to an error or problem of some form. Alerts have severity levels that mirror the Java log levels.

Serious exceptions will produce alerts with a Severe level. These may be indicative of bugs in the code or of problems with the configuration of a crawl job.
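Since alert severities mirror the standard Java log levels, the ordering defined by java.util.logging.Level is the reference point. The snippet below only lists those levels and shows how a serious exception is typically logged at SEVERE; how Heritrix maps individual alerts onto the levels is determined by the crawler itself.

```java
// Standard Java log levels (java.util.logging.Level), most to least serious:
//   SEVERE > WARNING > INFO > CONFIG > FINE > FINER > FINEST
import java.util.logging.Level;
import java.util.logging.Logger;

public class AlertLevelsDemo {
    private static final Logger log = Logger.getLogger(AlertLevelsDemo.class.getName());

    public static void main(String[] args) {
        // A serious exception would surface as a Severe-level alert.
        try {
            throw new IllegalStateException("simulated crawl-job failure");
        } catch (IllegalStateException e) {
            log.log(Level.SEVERE, "job initialization failed", e);
        }
        // Less serious conditions map to WARNING or INFO.
        log.warning("example of a warning-level condition");
        log.info("example of an informational message");
    }
}
```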
7.4. Editing a running job

The configuration of a job can be edited while it is running. This option is accessed from the Jobs tab (Current job / Edit configuration). When it is selected the user is taken to the Settings section of the job's configuration (see Section 6.3, "Settings").

When a configuration file is edited, the old version is saved to a new file (named <oldFilename>_<timestamp>.xml) before it is updated. This way a record is kept of any changes. This record is only kept for changes made after crawling begins.

It is not possible to edit all aspects of the configuration after crawling starts. Most noticeably, the Modules section is disabled. Also, although it is not enforced by the WUI, changing certain settings (in particular filenames, directory locations, etc.) will have no effect; doing so will typically not harm the crawl, the change is simply ignored.

Most settings, however, can be changed. This includes the number of threads being used and the seeds list, and although it is not possible to remove modules, most of them can be disabled: setting a module's enabled attribute to false effectively removes it from the configuration.

When changing more than an existing atomic value (for example, when adding a new filter), it is good practice to pause the crawl first, as some modifications to composite configuration entities may otherwise not occur in a thread-safe manner with respect to the ongoing crawl.

7.4.1. Journal

The user can add notes to a journal that is kept for each job. No entries are made in the journal automatically; it is only for user-added comments.

It can be useful for documenting the reasons behind configuration changes, preserving that information alongside the changes themselves.

The journal can be accessed from the Pending jobs page (Section 7.2, "Pending jobs") for pending jobs, from the Jobs tab for currently running jobs, and from the Completed jobs page (Section 8.1, "Completed jobs") for completed jobs.

The journal is written to a plain text file that is stored along with the logs.
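Although journal entries are normally added through the WUI, the underlying file is plain text, so the idea is easy to sketch. The example below appends a timestamped operator note to a hypothetical per-job journal file; the path and the entry format are assumptions for illustration, not the actual file name or layout Heritrix uses.

```java
// Minimal sketch: append a timestamped operator note to a per-job journal file.
// The path ("jobs/<jobname>/logs/journal.txt") and the entry format are
// assumptions for illustration; the real file name and layout may differ.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class JournalNote {

    public static void addNote(Path journalFile, String note) throws IOException {
        // Timestamps in GMT, matching the convention used throughout the WUI.
        String stamp = ZonedDateTime.now(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss 'GMT'"));
        String entry = stamp + "  " + note + System.lineSeparator();
        Files.createDirectories(journalFile.getParent());
        Files.write(journalFile, entry.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path journal = Paths.get("jobs", "myjob", "logs", "journal.txt"); // hypothetical path
        addNote(journal, "Raised max toe threads from 50 to 100 after pausing the crawl.");
    }
}
```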
