
📄 faq.html

Written in Java; left over from an experiment. I was going to delete it, but decided to upload it here so everyone can share it.
        "Window&gt;Preferences&gt;Java/Compiler&gt;Compliance and Classfiles" and        setting "Compiler compliance level" to 5.0.  Make sure the        'Use default compliance level' is UNCHECKED and that the        'Generated .class files compatibility' and 'Source compatibility'        are also set to 5.0.</p>        </dd>      </dl><dl>        <dt><a name="crawl_finished">Why won't my crawl finish?</a></dt>        <dd>        <p>The crawl can get hung up on sites that are actually down or are non-responsive.        Manual intervention is necessary in such cases.         Study the frontier to get a picture of what is left to be crawled.        Looking at the local errors log will give let you see the problems with currently        crawled URIs.  Along with robots.txt retries, you will probably also see        httpclient timeouts.        In general you want to look for repetition of problems with particular        host/URIs.</p>         <p>Grepping the local errors log is a bit tricky because        of the shape of its content. Its recommend that you first "flatten"        the local errors file.  Here's an example :            <div class="source"><pre>% cat  local-errors.log | tr -d \\\n | perl -pe 's/([0-9]{17} )/\n$1/g'</pre></div>          </p>         <p>This will remove all new lines and then add a new line in front of 17-digit dates (hopefully only 17-digit tokens followed by a space are dates.).  The result is one line per entry with a 17-digit         date prefix. Makes it easier to parse.         </p>        <p>To eliminate URIs for unresponsive hosts from the frontier queue,        pause the crawl and block the        fetch from that host by creating a new per-host setting         -- an override -- in the preselector processor.</p>         <p>Also, check for any hung threads. This does not happen         anymore (0.8.0+). Check the threads report for threads that         have been active for a long time but that should not be:          i.e. documents being downloaded are small in size.        </p>        <p>Once you've identified hung threads, kill and replace it.</p>        </dd>      </dl><dl>        <dt><a name="traps">What are crawler traps?</a></dt>        <dd>        <p>Traps are infinite page sources put up to occupy ('trap') a crawler.        Traps may be as innocent as a calendar that returns pages years into        the future or not-so-innocent         <a href="http://spiders.must.die.net/" class="externalLink" title="External Link">http://spiders.must.die.net/</a>.        Traps are created by CGIs/server-side code that dynamically conjures        'nonsense' pages or else exploits combination of soft and relative links        to generate URI paths of infinite variety and depth.        Once identified, use filters to guard against falling in.        </p>        <p>Another trap that works by feeding documents of infinite sizes        to the crawler is        http://yahoo.domain.com.au/tools/spiderbait.aspx* as in        http://yahoo.domain.com.au/tools/spiderbait.aspx?state=vic or        http://yahoo.domain.com.au/tools/spiderbait.aspx?state=nsw.        To filter out infinite document size traps, add a maximum doc.        size filter to your crawl order.        </p>        <p>See <a href="#crawl_junk">What to do when I'm crawling "junk"?</a>        </p>        </dd>      </dl><dl>        <dt><a name="crawl_junk">What do I do to avoid crawling "junk"?</a></dt>        <dd>         <p>In the past crawls were stopped when we ran into "junk."           
      <dl>
        <dt><a name="war">Can Heritrix be made to run in Tomcat (or Websphere, or Resin, or Weblogic)? Does it have to be run embedded in Jetty?</a></dt>
        <dd>
        <p>Try out Heritrix bundled as a WAR file. Use the maven 'war' target to produce a heritrix.war, or pull the war from the build <a href="http://crawltools:8080/cruisecontrol/buildresults/HEAD-heritrix" class="externalLink" title="External Link">downloads page</a> (click on the 'Build Artifacts' link). Heritrix as a WAR is available in HEAD only (post-1.2.0) and currently has 'experimental' status (i.e. it needs exercising).</p>
        </dd>
      </dl>
      <dl>
        <dt><a name="embedding">Can I embed Heritrix in another application?</a></dt>
        <dd>
        <p>Sure. Make sure everything in the Heritrix lib directory is on your CLASSPATH (ensuring the heritrix.jar is found first). Thereafter, using HEAD (post-1.2.0), doing the following should get you a long way:</p>
        <div class="source"><pre>Heritrix h = new Heritrix();
h.launch();</pre></div>
        <p>You'll then need to have your program hang around while the crawl runs. See <a href="http://groups.yahoo.com/group/archive-crawler/message/1276" class="externalLink" title="External Link">message 1276</a> for an example. See also the answer to the next question and this page up on our wiki, <a href="http://crawler.archive.org/cgi-bin/wiki.pl?EmbeddingHeritrix" class="externalLink" title="External Link">Embedding Heritrix</a>.</p>
        </dd>
      </dl>
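      <p>To make the "hang around while the crawl runs" part concrete, here is a minimal embedding sketch. Only the constructor and <code>launch()</code> call come from the snippet above; the import assumes the Heritrix 1.x main class <code>org.archive.crawler.Heritrix</code>, and the keep-alive loop is just one way your wrapper application might stay up while the crawl runs, not a documented Heritrix API.</p>
      <div class="source"><pre>import org.archive.crawler.Heritrix;

public class EmbeddedCrawl {
    public static void main(String[] args) throws Exception {
        // From the FAQ snippet above: construct and launch the crawler.
        Heritrix h = new Heritrix();
        h.launch();

        // Keep the embedding application alive while the crawl runs.
        // How and when you shut down (web UI, JMX, your own monitoring)
        // is up to your application; this loop only stops main() from
        // returning immediately.
        while (true) {
            Thread.sleep(60 * 1000L);
        }
    }
}</pre></div>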
      <dl>
        <dt><a name="cmdlinecontrol">Can I stop/pause and get status from a running Heritrix using command-line tools? Can I remote control Heritrix?</a></dt>
        <dd>
        <p>A JMX interface has been added to the crawler. The intent is that all features of the UI are exposed over JMX so Heritrix can be remotely controlled.</p>
        <p>A command-line control utility that makes use of the JMX API has been added. The script can be found in the scripts directory. It's packaged as a jar file named <code>cmdline-jmxclient.X.X.X.jar</code>. It has no dependencies on other jars being found in its classpath, so it can be safely moved from this location. Its only dependency is jdk1.5.0. To obtain client usage, type the following: <code>${PATH_TO_JDK1.5.0}/bin/java -jar cmdline-jmxclient.X.X.X.jar</code>. See also <a href="http://crawler.archive.org/cmdline-jmxclient/" class="externalLink" title="External Link">cmdline-jmxclient</a> to learn more.</p>
        </dd>
      </dl>
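      <p>If you would rather talk to the JMX agent from your own code than via the bundled jar, the standard JDK 5 <code>javax.management.remote</code> API is enough to connect and list the crawler's MBeans. The sketch below uses only standard JMX classes; the host, port and service-URL form are placeholders, and the MBean names you will see depend on how your Heritrix was started, so treat it as an illustration rather than a recipe.</p>
      <div class="source"><pre>import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/**
 * Connects to a remote JMX agent and lists every registered MBean.
 * The URL below is a placeholder; use whatever JMX host/port settings
 * your Heritrix instance was started with.
 */
public class ListCrawlerMBeans {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8849/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url, null);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // A null name and null query match every MBean on the agent.
            for (Object name : mbsc.queryNames(null, null)) {
                System.out.println(name);
            }
        } finally {
            connector.close();
        }
    }
}</pre></div>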
      <dl>
        <dt><a name="more_than_one_job">What techniques exist for crawling more than one job at a time?</a></dt>
        <dd>
        <p>See <a href="http://groups.yahoo.com/group/archive-crawler/message/1182" class="externalLink" title="External Link">message 1182</a>, a note from Tom Emerson, for a suggestion.</p>
        <p>It's also possible, post-1.4.0, to run multiple Heritrix instances in a single JVM. Browse to <code>/local-instances.jsp</code>.</p>
        </dd>
      </dl>
      <dl>
        <dt><a name="toethreads">Why are the main crawler worker threads called "ToeThreads"?</a></dt>
        <dd>
        <p>While the mascots of web crawlers have usually been spider-related, I'd rather think of Heritrix as a centipede or millipede: fast and many-segmented.</p>
        <p>Anything that "crawls" over many things at once would presumably have a lot of feet and toes. Heritrix will often use many hundreds of worker threads to "crawl", but 'WorkerThread' or 'CrawlThread' seem mundane.</p>
        <p>So instead, we have 'ToeThreads'. :)</p>
        </dd>
      </dl>
      <dl>
        <dt><a name="using_heritrix">Who is using Heritrix?</a></dt>
        <dd>
        <p>Below is a listing of users of Heritrix (to qualify for inclusion in the list below, send a description of a couple of lines to the mailing list).</p>
        <ul>
          <li><a href="http://www.bok.hi.is/" class="externalLink" title="External Link">The National and University Library of Iceland</a>: Crawls the entire <i>.is</i> domain (~11,000 domains) using Heritrix. Has performed a complete snapshot using Heritrix 1.0.4 (35 million URIs) and plans on running three more snapshots in 2005. See <a href="http://groups.yahoo.com/group/archive-crawler/message/1385" class="externalLink" title="External Link">1385</a>.</li>
          <li><a href="http://www.lib.helsinki.fi/english/index.htm" class="externalLink" title="External Link">The National Library of Finland</a>: Has used Heritrix to crawl Finnish museum sites and sites pertaining to the June 2004 European parliament elections. The main crawl done in 2004 was of Finnish university sites (~4 million URLs). Kaisa supplies more detail on how this larger crawl was done: <a href="http://groups.yahoo.com/group/archive-crawler/message/1406" class="externalLink" title="External Link">1406</a>.</li>
          <li><a href="http://www.geometa.info" class="externalLink" title="External Link">geometa.info</a>: Geometa.info is a search engine for spatially related geo-data, geo-services and geo-news for Switzerland, Germany and Austria. We use Heritrix with specialised plugins to find geo-relevant data and websites. These are formats like GeoTIFF, GML, Interlis and ESRI files, WFS or WMS services, and other sites with geo-relevant content. Geometabot (Heritrix) is the feeder for the Lucene search engine which provides the core search service for geometa.info.</li>
          <li>Saurabh Pathak and Donna Bergmark have written a module for Heritrix that asks a Rainbow classifier whether a page should be crawled or not. See <a href="http://groups.yahoo.com/group/archive-crawler/message/1905" class="externalLink" title="External Link">1905</a> for their announcement of the project, with links to source and HOWTO documentation.</li>
        </ul>
        </dd>
      </dl>
      <dl>
        <dt><a name="nutchwax">I've downloaded all these ARC files, now what?</a></dt>
        <dd>
        <p>See the <a href="http://crawler.archive.org/articles/developer_manual.html#arcs">Developer's Manual</a> for more on ARCs and tools for reading and writing them. There are also tools for searching ARC collections available over in the <a href="http://archive-access.sourceforge.net/">archive-access</a> project. Check out the nutch-based <a href="http://archive-access.sourceforge.net/projects/nutch/">NutchWAX</a> and its companion viewer application, <a href="http://archive-access.sourceforge.net/projects/nutch/">WERA</a>.</p>
        </dd>
      </dl>
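      <p>If you just want a quick look inside an ARC before reaching for those tools, the sketch below walks an <i>uncompressed</i> version-1 ARC on the assumption that each record starts with a one-line, space-separated header ending in the content length, followed by that many bytes of content. Real ARCs are usually gzipped per record, so use the ARC readers shipped with Heritrix/archive-access for anything beyond a quick experiment.</p>
      <div class="source"><pre>import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Lists record URLs and sizes from an UNCOMPRESSED version-1 ARC file.
 * Assumes each record header is a single space-separated line whose last
 * field is the content length in bytes. Usage: java ArcRecordLister file.arc
 */
public class ArcRecordLister {

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
        try {
            String header;
            while ((header = readLine(in)) != null) {
                if (header.length() == 0) {
                    continue; // blank separator line between records
                }
                String[] fields = header.split(" ");
                long length = Long.parseLong(fields[fields.length - 1]);
                System.out.println(fields[0] + "\t" + length + " bytes");
                skipFully(in, length); // skip the record content
            }
        } finally {
            in.close();
        }
    }

    // Reads one '\n'-terminated line of bytes; returns null at end of file.
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b = in.read();
        if (b == -1) {
            return null;
        }
        while (b != -1) {
            if (b == '\n') {
                break;
            }
            line.write(b);
            b = in.read();
        }
        return line.toString("ISO-8859-1").trim();
    }

    // InputStream.skip() may skip fewer bytes than asked for; loop until done.
    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) {
                n -= skipped;
            } else if (in.read() == -1) {
                break; // unexpected end of file
            } else {
                n--;
            }
        }
    }
}</pre></div>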
    </div>
    <div class="section">
      <a name="References"></a>
      <h2>References</h2>
      <dl>
        <dt><a name="archive_data">Where can I learn more about what is stored at the Internet Archive, the ARC file format, and tools for manipulating ARC files?</a></dt>
        <dd>
        <p>See the ARC section in the <a href="articles/developer_manual.html#arcs">Developer Manual</a>.</p>
        </dd>
      </dl>
      <dl>
        <dt><a name="where_refs">Where can I get more background on Heritrix and learn more about "crawling" in general?</a></dt>
        <dd>
        <p>The following are all worth at least a quick skim:</p>
        <ol>
          <li>The <a href="http://en.wikipedia.org/wiki/Webcrawler" class="externalLink" title="External Link">Wikipedia Web crawler</a> page offers a nice introduction to the general crawling problem. It has a good overview of the current, most-cited literature.</li>
          <li><a href="http://citeseer.nj.nec.com/heydon99mercator.html" class="externalLink" title="External Link">Mercator: A Scalable, Extensible Web Crawler</a> is an overview of the original Mercator design, which the Heritrix crawler parallels in many ways.</li>
          <li><a href="http://citeseer.nj.nec.com/najork01highperformance.html" class="externalLink" title="External Link">High-Performance Web Crawling</a> describes experience scaling Mercator.</li>
          <li><a href="http://citeseer.nj.nec.com/heydon00performance.html" class="externalLink" title="External Link">Performance Limitations of the Java Core Libraries</a> describes Mercator's experience working around Java problems and bottlenecks. Fortunately, many of these issues have been improved for us by later JVMs and Java core API updates -- but some are still issues, and in any case it gives a good flavor for the kinds of problems and profiling one might need to do.</li>
          <li><a href="http://vigna.dsi.unimi.it/ftp/papers/UbiCrawler.pdf" class="externalLink" title="External Link">UbiCrawler</a>, a scalable distributed web crawler.</li>
          <li><a href="http://citeseer.nj.nec.com/leung01towards.html" class="externalLink" title="External Link">Towards Web-Scale Web Archeology</a> is a higher-level view, not as focused on crawling details but rather on the post-crawl needs that motivate crawling in the first place.</li>
          <li>A number of other potentially interesting papers are linked off the "crawl-links.html" file in the <a href="http://groups.yahoo.com/group/archive-crawler/files/" class="externalLink" title="External Link">YahooGroups files section</a>.</li>
          <li><a href="http://groups.yahoo.com/group/archive-crawler/message/1498" class="externalLink" title="External Link">Msg1498</a> is a note from the list on page similarity/containment issues.</li>
          <li>Thesis on the creation of a specialized Frontier and other modules for Heritrix by Kristinn Sigurdsson: <a href="http://vefsofnun.bok.hi.is/thesis/ar.pdf" class="externalLink" title="External Link">Adaptive Revisiting with Heritrix</a>.</li>
        </ol>
        </dd>
      </dl>
    </div>
  </div>
</div>
<div class="clear"><hr/></div>
<div id="footer">
  <div class="xleft"><a href="http://sourceforge.net/projects/archive-crawler/" class="externalLink" title="External Link"><img src="http://sourceforge.net/sflogo.php?group_id=archive-crawler&amp;type=1" border="0" alt="sf logo"/></a></div>
  <div class="xright">&#169; 2003-2006, Internet Archive</div>
  <div class="clear"><hr/></div>
</div>
</body>
</html>
