
📄 faq.html

</li>
<li>
  <a href="#nutchwax">I've downloaded all these ARC files, now what?</a>
</li>
</ol>
<p>
  <strong>References</strong>
</p>
<ol>
  <li>
    <a href="#archive_data">Where can I learn more about what is stored at the Internet Archive, the ARC file format, and tools for manipulating ARC files?</a>
  </li>
  <li>
    <a href="#where_refs">Where can I get more background on Heritrix and learn more about "crawling" in general?</a>
  </li>
</ol>
</div>
<div class="section"><a name="General"></a><h2>General</h2>
<dl>
  <dt><a name="heritrix">What does "Heritrix" mean?</a></dt>
  <dd>
    <p><em>Heritrix</em> (sometimes spelled <em>heretrix</em>) is an archaic word for <em>inheritress</em>. Since our crawler seeks to collect the digital artifacts of our culture and <i>preserve</i> them for the benefit of future researchers and generations, this name seemed apt.</p>
  </dd>
</dl>
<dl>
  <dt><a name="introduction">Where can I go to get a good introduction/overview of Heritrix?</a></dt>
  <dd>
    <p><a href="An Introduction to Heritrix.pdf">An Introduction to Heritrix</a>.</p>
  </dd>
</dl>
<dl>
  <dt><a name="user-heritrix">I need to crawl/archive a set of websites, can I use Heritrix?</a></dt>
  <dd>
    <p>Yes. Start by checking out the <a href="articles/user_manual/index.html">Heritrix User Manual</a>.
</p>
  </dd>
</dl>
<dl>
  <dt><a name="developer">I'm a developer, can I help?</a></dt>
  <dd>
    <p>Yes, especially if you have experience with Java, open-source projects, and web protocols! Drop us a message or join our <a href="mail-lists.html">mailing lists</a> for more details. See also the <a href="articles/developer_manual/index.html">Heritrix Developer Manual</a>.</p>
  </dd>
</dl>
<dl>
  <dt><a name="license">What license does Heritrix use?</a></dt>
  <dd>
    <p>The <a href="license.html">GNU LESSER GENERAL PUBLIC LICENSE</a>. For discussion of third-party applications making use of LGPL code, see David Turner on <a href="http://www.gnu.org/licenses/lgpl-java.html" class="externalLink" title="External Link">The LGPL and Java</a>.</p>
  </dd>
</dl>
</div>
<div class="section"><a name="Common_Problems"></a><h2>Common Problems</h2>
<dl>
  <dt><a name="arc_closed">How do I know when Heritrix is done with an ARC file?</a></dt>
  <dd>
    ARCs that are currently in use have a '.open' suffix. Those without the suffix are fair game for copying. Also see
    <a href="https://sourceforge.net/tracker/?func=detail&amp;aid=988277&amp;group_id=73833&amp;atid=539102">[988277] [Need feedback] "Done with ARC file" event</a>
    for a description of how to enable logging of the opening and closing of ARCs.
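<p>The suffix rule above can be sketched in a few lines of Java. This is only an illustration, not part of Heritrix: the class name <code>ClosedArcs</code>, the default directory name <code>arcs</code>, and the assumption that finished files end in <code>.arc</code> or <code>.arc.gz</code> are hypothetical and should be adjusted to match your crawl's output.</p>

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: lists ARC files that no longer carry the '.open' suffix. */
public class ClosedArcs {

    /**
     * Returns true if the file name looks like a finished ARC.
     * Assumption: finished ARCs end in ".arc" or ".arc.gz"; files still
     * being written end in ".open" and therefore match neither pattern.
     */
    public static boolean isClosed(String name) {
        return name.endsWith(".arc") || name.endsWith(".arc.gz");
    }

    /** Returns the finished ARC files in dir, sorted by path; empty if dir is unreadable. */
    public static List<File> closedArcs(File dir) {
        List<File> result = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return result; // not a directory, or an I/O error occurred
        }
        Arrays.sort(entries);
        for (File entry : entries) {
            if (entry.isFile() && isClosed(entry.getName())) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "arcs" is an assumed default; pass the real ARC directory as the first argument.
        File dir = new File(args.length > 0 ? args[0] : "arcs");
        for (File arc : closedArcs(dir)) {
            System.out.println(arc.getPath());
        }
    }
}
```

<p>Running <code>java ClosedArcs /path/to/arcs</code> prints one path per finished ARC, which could then be fed to a copy script. Checking for the expected ending rather than merely the absence of '.open' keeps unrelated files in the directory out of the list.</p>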
    See also conf/heritrix.properties for how to send console logging to a FileHandler as well as to heritrix_out.log.</dd>
</dl>
<dl>
  <dt><a name="limitations">Are there known limitations?</a></dt>
  <dd>See the <a href="articles/releasenotes/index.html">Release Notes</a> page.</dd>
</dl>
<dl>
  <dt><a name="testsfail">Why do unit tests fail when I build?</a></dt>
  <dd>You're probably on a platform other than Linux (or using a 2.6.x kernel and a JVM other than the release version of the SUN 1.5 JDK). See sections 4.1.3/4.1.4 in the <a href="articles/releasenotes/1_0_0.html#984390">Release Notes</a>.</dd>
</dl>
<dl>
  <dt><a name="linuxes">Which Linux distribution should I use to run Heritrix and which kernel version do I need?</a></dt>
  <dd><p>Heritrix does not depend on a specific Linux distribution and should work on any distro on which a suitable Java Virtual Machine can be installed. We know that Heritrix has been successfully deployed on Red Hat 7.2, recent Fedora Core versions (2 and 4), and SUSE 9.3. Heritrix is known to work well with kernel versions 2.4.x. With kernel versions 2.6.x there are issues when using JVMs other than the release version of the SUN 1.5 JDK. See <a href="#testsfail">Why do unit tests fail when I build?</a> above. There are also issues when using the Linux NPTL threading model, particularly with older glibcs (e.g. Debian). See <a href="#glibc2_3_2">Glibc 2.3.2 and NPTL</a> in the release notes.</p></dd>
</dl>
<dl>
  <dt><a name="windows">How do I run Heritrix on Windows?</a></dt>
  <dd>Before you begin, note that Heritrix is not supported on Windows (see <a href="requirements.html">requirements</a>), mostly because we don't have the resources to support more than the Linux platform we use internally at the Internet Archive.
    That said, <code>BAT</code> scripts that load jars in the right order have been pasted to the mailing list (here is one, <a href="http://groups.yahoo.com/group/archive-crawler/message/2072" class="externalLink" title="External Link">2072</a>, and here is another, <a href="http://groups.yahoo.com/group/archive-crawler/message/2576" class="externalLink" title="External Link">2576</a>). Eric Jensen has started work on formally converting the Heritrix start script to BAT format here:
    <a href="http://sourceforge.net/tracker/index.php?func=detail&amp;aid=1514538&amp;group_id=73833&amp;atid=539102" class="externalLink" title="External Link">[1514538] should provide Windows batch file version of scripts</a> (needs a volunteer to finish the job). A working Heritrix Windows bundle has been posted to the list here:
    <a href="http://groups.yahoo.com/group/archive-crawler/message/2817" class="externalLink" title="External Link">2817</a>. Also see <a href="http://groups.yahoo.com/group/archive-crawler/message/3019" class="externalLink" title="External Link">Crawler Stalling on Windows</a>, <a href="http://groups.yahoo.com/group/archive-crawler/message/2085" class="externalLink" title="External Link">2085</a>, and the items below that pertain to Windows:
    <a href="#windowsstart">dns</a> and <a href="#windowsmkdir">mkdir</a>.
  </dd>
</dl>
<dl>
  <dt><a name="windowsstart">The crawler resolves DNS fine but fetches nothing afterwards. Why?</a></dt>
  <dd>If you are running on Windows, it may be because the ordering of jars on the classpath is wrong. See <a href="http://groups.yahoo.com/group/archive-crawler/message/772" class="externalLink" title="External Link">Why crawler [sic] nothing ???</a>.</dd>
</dl>
<dl>
  <dt><a name="windowsmkdir">The crawler, running on Windows, complains it cannot <code>mkdir</code>.
Why?</a></dt>
  <dd>See <a href="http://groups.yahoo.com/group/archive-crawler/message/1880" class="externalLink" title="External Link">1880</a>.</dd>
</dl>
<dl>
  <dt><a name="midfetch">I only want to download <code>text/html</code> and nothing else. Can I do it?</a></dt>
  <dd>Tom Emerson describes one technique here: <a href="http://www.dreamersrealm.net/~tree/blog/?s=text%2Fhtml&amp;submit=GO" class="externalLink" title="External Link">Focusing on HTML</a>. You can also add a filter that excludes all URLs that end in anything other than 'html|htm', etc. Alternatively, if you want to look at document MIME types instead, you can add a <code>ContentTypeRegExpFilter</code> as a <code>midfetch</code> filter to the HTTP fetcher. This filter is checked after the response headers have been downloaded but before the response content is fetched. Configure it to allow through only documents of the desired Content-Type. Apply the same filter at the writer stage of processing so that the response headers of rejected documents are not recorded in ARCs. See the <a href="articles/user_manual/config.html#midfetch">User Manual</a>. (Prerequisite URLs bypass the midfetch filters, so it is not possible to filter out robots.txt using this mechanism.)
  </dd>
</dl>
<dl>
  <dt><a name="crawllogstatuscodes">Where do I go to learn about these cryptic crawl.log status codes (-6, -7, -9998, etc.)?</a></dt>
  <dd>See the <a href="http://crawler.archive.org/articles/user_manual.html#statuscodes" class="externalLink" title="External Link">User Manual Glossary</a>.</dd>
</dl>
<dl>
  <dt><a name="toomanyopenfiles">Why do I get <i>java.io.FileNotFoundException...(Too many open files)</i> or <i>java.io.IOException...(Too many open files)</i>?</a></dt>
  <dd>
    <p>On Linux, a usual upper bound is 1024 file descriptors per process.
    To change this upper bound, there are a couple of things you can do.</p>
    <p>If running the crawler as non-root (recommended), you can configure limits in <code>/etc/security/limits.conf</code>. For example, you can set the open-files limit for all users in the webcrawler group as follows:
    <div class="source"><pre># Each line describes a limit for a user in the form:
#
# domain    type    item    value
#
@webcrawler     hard    nofile  32768</pre></div>
    </p>
    <p>Otherwise, running as root (you need to be root to raise ulimits), you can do the following:
    <code># (ulimit -n 4096; JAVA_OPTS=-Xmx320m bin/heritrix -p 9876)</code>
    to raise the ulimit for the Heritrix process only.
    </p>
    <p>Below is a rough accounting of FDs used in Heritrix 1.0.x.</p>
    <p>In Heritrix, the number of concurrent threads is configurable. The default frontier implementation allocates a thread per server. Per server, the frontier keeps a disk-backed queue. Disk-backed queues maintain three backing files with '.qin', '.qout', and '.top' suffixes (one to read from while the other is being written to, plus a queue-head file). So, per thread, at least three file descriptors are occupied when queues need to persist to disk.</p>
    <p>Apart from the above per-thread FD cost, there is a fixed FD cost for instantiating the crawler:
    <ul>
      <li>The JVM, its native shared libs, and jars account for about 40 FDs.</li>
      <li>There are about 20 or so Heritrix jars and 2 webapps.
</li>
      <li>There are about 10-20 Heritrix logging files, counting lock files.</li>
      <li>Open ARC files.</li>
      <li>Miscellaneous sockets, /dev/null, /dev/random, and stderr/stdout make for 10 or 20 more FDs.</li>
    </ul>
    </p>
  </dd>
</dl>
<dl>
  <dt><a name="oome_broadcrawl">Why do I get an OutOfMemoryError ten minutes after starting a broad-scoped crawl?</a></dt>
  <dd>
    <p>If using a 64-bit JVM, see Gordon's note to the list on 12/19/2005, <a href="http://groups.yahoo.com/group/archive-crawler/message/2450" class="externalLink" title="External Link">Re: Large crawl experience (like, 500M links)</a>.
    </p>
    <p>See the note in
    <a href="https://sourceforge.net/tracker/?func=detail&amp;atid=539102&amp;aid=896772&amp;group_id=73833">[ 896772 ] "Site-first"/'frontline' prioritization</a> and this Release Note, <a href="http://crawler.archive.org/articles/releasenotes.html#1_0_0_limitations" class="externalLink" title="External Link">5.1.1 Crawl Size Upper Bounds</a>. See this note by Kris from the list, <a href="http://groups.yahoo.com/group/archive-crawler/message/1027" class="externalLink" title="External Link">1027</a>, for how to mitigate memory use when using HostQueuesFrontier. The advice is less applicable if you are using a post-1.2.0 Heritrix with BdbFrontier. See the 'Crawl Size Upper Bounds Update' sections in the Release Notes.
    </p>
  </dd>
</dl>
<dl>
  <dt><a name="new_writer">Can I insert the crawl download directly into a MySQL database instead of into an ARC file on disk while crawling?</a></dt>
  <dd>
    <p>Yes.
    See <a href="http://groups.yahoo.com/group/archive-crawler/message/508" class="externalLink" title="External Link">RE: [archive-crawler] Inserting information to MYSQL during crawl</a>
    for pointers on how, but also see the rest of that thread for why you might prefer to do database insertion after the crawl rather than during it.
    </p>
  </dd>
</dl>
<dl>
  <dt><a name="mirror">Does Heritrix have to write ARC files?</a></dt>
  <dd>
    <p>No. See MirrorWriterProcessor, which writes one file per URL to the filesystem, using a name derived from the requested URL.
    </p>
  </dd>
</dl>
<dl>
  <dt><a name="eclipse_assert">Why does Eclipse complain about the 'assert' keyword when running Heritrix?</a></dt>
  <dd>
    <p>You'll need to configure Eclipse for Java 5.0 compliance to get rid of the assert errors ('assert' was not a keyword before Java 1.4, and Eclipse currently defaults to 1.3 compliance).  This can be done by going into
