outside.html

<html><head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>9.&nbsp;Outside the user interface</title>
<link href="../docbook.css" rel="stylesheet" type="text/css">
<meta content="DocBook XSL Stylesheets V1.67.2" name="generator">
<link rel="start" href="index.html" title="Heritrix User Manual">
<link rel="up" href="index.html" title="Heritrix User Manual">
<link rel="prev" href="analysis.html" title="8.&nbsp;Analysis of jobs">
<link rel="next" href="usecases.html" title="A.&nbsp;Common Heritrix Use Cases">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<div class="navheader">
<table summary="Navigation header" width="100%">
<tr><th align="center" colspan="3">9.&nbsp;Outside the user interface</th></tr>
<tr>
<td align="left" width="20%"><a accesskey="p" href="analysis.html">Prev</a>&nbsp;</td>
<th align="center" width="60%">&nbsp;</th>
<td align="right" width="20%">&nbsp;<a accesskey="n" href="usecases.html">Next</a></td>
</tr>
</table>
<hr>
</div>
<div class="sect1" lang="en">
<div class="titlepage"><div><div>
<h2 class="title" style="clear: both"><a name="outside"></a>9.&nbsp;Outside the user interface</h2>
</div></div></div>
<p>While it is possible to do a great many things via Heritrix's WUI, it is worth taking a look at some of what is not available through it.</p>
<div class="sect2" lang="en">
<div class="titlepage"><div><div>
<h3 class="title"><a name="N10CF3"></a>9.1.&nbsp;Generated files</h3>
</div></div></div>
<p>In addition to the logs discussed above (see <a href="analysis.html#logs" title="8.2.&nbsp;Logs">Section&nbsp;8.2, &ldquo;Logs&rdquo;</a>), the following files are generated. Some of the information in them is also available via the WUI.</p>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="N10CFB"></a>9.1.1.&nbsp;heritrix_out.log</h4>
</div></div></div>
<p>Captures what is written to the standard out and standard error streams of the program. Mostly this consists of low-level exceptions (usually indicative of bugs) and some information from third-party modules that do their own output logging.</p>
<p>This file is created in the same directory as the Heritrix JAR file. It is not associated with any one job, but contains output from all jobs run by the crawler.</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="crawl-manifest.txt"></a>9.1.2.&nbsp;crawl-manifest.txt</h4>
</div></div></div>
<p>A manifest of all files (excluding ARC and other data files) created while crawling a job.</p>
<p>An example of this file might be:</p>
<pre class="programlisting">L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/crawl.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/runtime-errors.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/local-errors.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/uri-errors.log
L+ /Heritrix/jobs/quickbroad-20040420191411593/disk/progress-statistics.log
L- /Heritrix/jobs/quickbroad-20040420191411593/disk/recover.gz
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/seeds-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/hosts-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/mimetype-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/responsecode-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/crawl-report.txt
R+ /Heritrix/jobs/quickbroad-20040420191411593/disk/processors-report.txt
C+ /Heritrix/jobs/quickbroad-20040420191411593/job-quickbroad.xml
C+ /Heritrix/jobs/quickbroad-20040420191411593/settings/org/settings.xml
C+ /Heritrix/jobs/quickbroad-20040420191411593/seeds-quickbroad.txt</pre>
<p>The first character of each line indicates the type of file: L for logs, R for reports and C for configuration files.</p>
<p>The second character, a plus or minus sign, indicates whether the file should be included in a standard bundle of the job (see <a href="outside.html#manifest_bundle.pl" title="9.2.1.&nbsp;manifest_bundle.pl">Section&nbsp;9.2.1, &ldquo;manifest_bundle.pl&rdquo;</a>). In the example above, <code class="literal">recover.gz</code> is marked for exclusion because it is generally only of interest if the job crashes and must be restarted; it has negligible value once the job is completed (see <a href="outside.html#recover" title="9.3.&nbsp;Recovery of Frontier State and recover.gz">Section&nbsp;9.3, &ldquo;Recovery of Frontier State and recover.gz&rdquo;</a>).</p>
<p>After this initial legend, the filename with its full path follows.</p>
<p>This file is generated at the very end of the crawl, in the directory indicated by the '<span class="emphasis"><em>disk</em></span>' attribute of the configuration.</p>
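<p>Because each line is just a two-character legend followed by a path, the manifest is easy to consume programmatically. Below is a minimal Java sketch of such a reader; the class name and the manifest's location are illustrative assumptions, not part of Heritrix.</p>
<pre class="programlisting">import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper, not part of Heritrix: splits each manifest line
// into its type letter (L/R/C), its bundle flag (+/-), and the path.
public class ManifestReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new FileReader("crawl-manifest.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.length() &lt; 4) {
                    continue; // skip blank or malformed lines
                }
                char type = line.charAt(0);              // 'L' log, 'R' report, 'C' configuration
                boolean bundled = line.charAt(1) == '+'; // '+' include in bundle, '-' exclude
                String path = line.substring(2).trim();  // full path to the file
                System.out.printf("%c %s %s%n", type, bundled ? "bundle" : "skip", path);
            }
        }
    }
}</pre>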
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="N10D23"></a>9.1.3.&nbsp;crawl-report.txt</h4>
</div></div></div>
<p>Contains some useful metrics about the completed job. This report is created by the <code class="literal">StatisticsTracker</code> (see <a href="config.html#stattrack" title="6.1.4.&nbsp;Statistics Tracking">Section&nbsp;6.1.4, &ldquo;Statistics Tracking&rdquo;</a>).</p>
<p>Written at the very end of the crawl only. See <code class="literal">crawl-manifest.txt</code> for its location.</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="N10D35"></a>9.1.4.&nbsp;hosts-report.txt</h4>
</div></div></div>
<p>Contains an overview of what hosts were crawled and how many documents and bytes were downloaded from each.</p>
<p>This report is created by the <code class="literal">StatisticsTracker</code> (see <a href="config.html#stattrack" title="6.1.4.&nbsp;Statistics Tracking">Section&nbsp;6.1.4, &ldquo;Statistics Tracking&rdquo;</a>) and is written at the very end of the crawl only. See <code class="literal">crawl-manifest.txt</code> for its location.</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="N10D47"></a>9.1.5.&nbsp;mimetype-report.txt</h4>
</div></div></div>
<p>Contains an overview of the number of documents downloaded per mime type, along with the amount of data downloaded per mime type.</p>
<p>This report is created by the <code class="literal">StatisticsTracker</code> (see <a href="config.html#stattrack" title="6.1.4.&nbsp;Statistics Tracking">Section&nbsp;6.1.4, &ldquo;Statistics Tracking&rdquo;</a>) and is written at the very end of the crawl only. See <code class="literal">crawl-manifest.txt</code> for its location.</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="processorsreport.txt"></a>9.1.6.&nbsp;processors-report.txt</h4>
</div></div></div>
<p>Contains the processors report (see <a href="running.html#processorsreport" title="7.3.1.3.&nbsp;Processors report">Section&nbsp;7.3.1.3, &ldquo;Processors report&rdquo;</a>) generated at the very end of the crawl.</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="N10D62"></a>9.1.7.&nbsp;responsecode-report.txt</h4>
</div></div></div>
<p>Contains an overview of the number of documents downloaded per status code (see <a href="glossary.html#statuscodes">Status codes</a>). It covers successful codes only; failures are not tallied. See <code class="literal">crawl.log</code> for that information.</p>
<p>This report is created by the <code class="literal">StatisticsTracker</code> (see <a href="config.html#stattrack" title="6.1.4.&nbsp;Statistics Tracking">Section&nbsp;6.1.4, &ldquo;Statistics Tracking&rdquo;</a>) and is written at the very end of the crawl only. See <code class="literal">crawl-manifest.txt</code> for its location.</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="N10D7B"></a>9.1.8.&nbsp;seeds-report.txt</h4>
</div></div></div>
<p>An overview of the crawling of each seed: whether it succeeded, and what status code was returned.</p>
<p>This report is created by the <code class="literal">StatisticsTracker</code> (see <a href="config.html#stattrack" title="6.1.4.&nbsp;Statistics Tracking">Section&nbsp;6.1.4, &ldquo;Statistics Tracking&rdquo;</a>) and is written at the very end of the crawl only. See <code class="literal">crawl-manifest.txt</code> for its location.</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="arcfiles"></a>9.1.9.&nbsp;ARC files</h4>
</div></div></div>
<p>Assuming that you are using the ARC writer that comes with Heritrix, a number of ARC files will be generated containing the crawled pages.</p>
<p>It is possible to specify the location of these files on the ARCWriter processor in the settings page. Unless this is set as an absolute path, it is a path relative to the <span class="emphasis"><em>job</em></span> directory.</p>
<p>ARC files are named as follows:</p>
<pre class="programlisting">[prefix]-[12-digit-timestamp]-[series#-padded-to-5-digits]-[crawler-hostname].arc.gz</pre>
<p>The <code class="literal">prefix</code> is set by the user when configuring the ARCWriter processor. By default it is IAH.</p>
<p>If you see an ARC file with an extra <code class="literal">.open</code> suffix, the ARC is currently being written to by Heritrix (it usually has more than one ARC open at a time).</p>
<p><a name="invalid"></a>Files with a <code class="literal">.invalid</code> suffix are files Heritrix had trouble writing to (disk full, bad disk, etc.). On an IOException, Heritrix closes the problematic ARC and gives it the <code class="literal">.invalid</code> suffix. These files need to be checked for coherence.</p>
<p>For more on ARC files refer to the <a href="http://crawler.archive.org/apidocs/org/archive/io/arc/ARCWriter.html" target="_top">ARCWriter Javadoc</a> and to the <a href="http://crawler.archive.org/articles/developer_manual/arcs.html" target="_top">ARC Writer developer documentation</a>.</p>
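<p>The naming convention above can be checked mechanically. The following Java sketch uses a regular expression derived from the pattern just described; the pattern and the example filename are illustrative assumptions, not code shipped with Heritrix.</p>
<pre class="programlisting">import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative pattern for [prefix]-[12-digit-timestamp]-[series#-padded-to-
// 5-digits]-[crawler-hostname].arc.gz, optionally carrying the .open or
// .invalid suffix described above.
public class ArcNameCheck {
    private static final Pattern ARC_NAME = Pattern.compile(
            "([^-]+)-(\\d{12})-(\\d{5})-(.+?)\\.arc\\.gz(\\.open|\\.invalid)?");

    public static void main(String[] args) {
        Matcher m = ARC_NAME.matcher("IAH-200404201914-00000-crawler01.arc.gz.open");
        if (m.matches()) {
            System.out.println("prefix: " + m.group(1)); // IAH (the default)
            System.out.println("series: " + m.group(3)); // 00000
            System.out.println("state:  " + m.group(5)); // .open = still being written
        }
    }
}</pre>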
</div>
</div>
<div class="sect2" lang="en">
<div class="titlepage"><div><div>
<h3 class="title"><a name="N10DBF"></a>9.2.&nbsp;Helpful scripts</h3>
</div></div></div>
<p>Heritrix comes bundled with a few helpful scripts for Linux.</p>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="manifest_bundle.pl"></a>9.2.1.&nbsp;manifest_bundle.pl</h4>
</div></div></div>
<p>This script will bundle up all resources referenced by a crawl manifest file (<a href="outside.html#crawl-manifest.txt" title="9.1.2.&nbsp;crawl-manifest.txt">Section&nbsp;9.1.2, &ldquo;crawl-manifest.txt&rdquo;</a>). The output bundle is an uncompressed or compressed tar ball. The directory structure created in the tar ball is as follows:</p>
<div class="itemizedlist"><ul type="disc">
<li><p>Top level directory (crawl name)</p></li>
<li><p>Three default subdirectories (configuration, logs and reports directories)</p></li>
<li><p>Any other arbitrary subdirectories</p></li>
</ul></div>
<p>Usage:</p>
<pre class="programlisting">manifest_bundle.pl crawl_name manifest_file [-f output_tar_file] [-z] [-flag directory]
    -f     output tar file. If omitted, output to stdout.
    -z     compress tar file with gzip.
    -flag  is any upper case letter. Default values C, L, and R are set to
           configuration, logs and reports.</pre>
<p>Example:</p>
<pre class="programlisting">manifest_bundle.pl testcrawl crawl-manifest.txt -f \
    /0/testcrawl/manifest-bundle.tar.gz -z -F filters</pre>
<p>Produced tar ball for this example:</p>
<pre class="programlisting">/0/testcrawl/manifest-bundle.tar.gz</pre>
<p>Bundled directory structure for this example:</p>
<pre class="programlisting">|-testcrawl
    |- configurations
    |- logs
    |- reports
    |- filters</pre>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="hoppath"></a>9.2.2.&nbsp;hoppath.pl</h4>
</div></div></div>
<p>This Perl script, found in $HERITRIX_HOME/bin, recreates the hop path to the specified URL. The hop path is the path of links (URLs) that the crawler followed to get to the specified URL.</p>
<p>Usage:</p>
<pre class="programlisting">Usage: hoppath.pl crawl.log URI_PREFIX
  crawl.log    Full-path to Heritrix crawl.log instance.
  URI_PREFIX   URI we're querying about. Must begin 'http(s)://' or 'dns:'.
               Wrap this parameter in quotes to avoid shell interpretation
               of any '&amp;' present in URI_PREFIX.</pre>
<p>Example:</p>
<pre class="programlisting">% hoppath.pl crawl.log 'http://www.house.gov/'</pre>
<p>Result:</p>
<pre class="programlisting">2004-02-25-02-36-06 - http://www.house.gov/house/MemberWWW_by_State.html
2004-02-25-02-36-06   L http://wwws.house.gov/search97cgi/s97_cgi
2004-02-25-03-30-38    L http://www.house.gov/</pre>
<p>The L in the above example refers to the type of link followed (see <a href="glossary.html#discoverypath">Discovery path</a>).</p>
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="recoverylogmapper"></a>9.2.3.&nbsp;RecoveryLogMapper</h4>
</div></div></div>
<p><code class="literal">org.archive.crawler.util.RecoveryLogMapper</code> is similar to <a href="outside.html#hoppath" title="9.2.2.&nbsp;hoppath.pl">Section&nbsp;9.2.2, &ldquo;hoppath.pl&rdquo;</a>. It was contributed by Mike Schwartz. RecoveryLogMapper parses a Heritrix recovery log file (see <a href="outside.html#recover" title="9.3.&nbsp;Recovery of Frontier State and recover.gz">Section&nbsp;9.3, &ldquo;Recovery of Frontier State and recover.gz&rdquo;</a>) and builds maps that allow a caller to look up any seed URL and get back an Iterator of all URLs successfully crawled from that seed. It also allows lookup on any crawled URL to find the seed URL from which the crawler reached it (through one or more discovered URL hops, which are collapsed in this lookup).</p>
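<p>As a rough illustration of the two lookups just described, a caller might use the class along the following lines. This is a sketch only: the constructor argument and the method names <code class="literal">getCrawledURLsForSeed</code> and <code class="literal">getSeedForCrawledURL</code> are assumptions for illustration and should be verified against the class's Javadoc.</p>
<pre class="programlisting">import java.util.Iterator;
import org.archive.crawler.util.RecoveryLogMapper;

public class SeedLookup {
    public static void main(String[] args) throws Exception {
        // Assumed constructor: parse the recovery log named on the command line.
        RecoveryLogMapper mapper = new RecoveryLogMapper(args[0]);

        // Assumed method: iterate all URLs successfully crawled from a seed.
        Iterator crawled = mapper.getCrawledURLsForSeed("http://www.example.com/");
        while (crawled.hasNext()) {
            System.out.println(crawled.next());
        }

        // Assumed method: map a crawled URL back to its seed (intermediate
        // hops are collapsed, as described above).
        System.out.println(
                mapper.getSeedForCrawledURL("http://www.example.com/about.html"));
    }
}</pre>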
</div>
<div class="sect3" lang="en">
<div class="titlepage"><div><div>
<h4 class="title"><a name="jmxclient"></a>9.2.4.&nbsp;cmdline-jmxclient</h4>
</div></div></div>
<p>This jar file is checked in as a script. It enables command-line control of Heritrix if Heritrix has been started up inside a SUN 1.5.0 JDK. See the <a href="http://crawler.archive.org/cmdline-jmxclient/" target="_top">cmdline-jmxclient project</a> to learn more about this script's capabilities and how to use it.</p>
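<p>The same kind of remote control is also possible directly from Java via the standard <code class="literal">javax.management</code> remote API, which is what such JMX clients use under the hood. The sketch below is illustrative only: the service URL, port and credentials are placeholder assumptions, not values documented here, and it merely lists the MBeans the remote JVM exposes.</p>
<pre class="programlisting">import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPeek {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port and credentials: use whatever values the
        // crawler's JVM was actually started with.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8849/jmxrmi");
        Map&lt;String, Object&gt; env = new HashMap&lt;String, Object&gt;();
        env.put(JMXConnector.CREDENTIALS,
                new String[] {"controlRole", "password"});

        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // List every MBean the remote JVM registers; from these names a
            // client can go on to read attributes or invoke operations.
            for (ObjectName name : conn.queryNames(null, null)) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}</pre>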
