"Window>Preferences>Java/Compiler>Compliance and Classfiles" and setting "Compiler compliance level" to 5.0. Make sure the 'Use default compliance level' is UNCHECKED and that the 'Generated .class files compatibility' and 'Source compatibility' are also set to 5.0.</p> </dd> </dl><dl> <dt><a name="crawl_finished">Why won't my crawl finish?</a></dt> <dd> <p>The crawl can get hung up on sites that are actually down or are non-responsive. Manual intervention is necessary in such cases. Study the frontier to get a picture of what is left to be crawled. Looking at the local errors log will give let you see the problems with currently crawled URIs. Along with robots.txt retries, you will probably also see httpclient timeouts. In general you want to look for repetition of problems with particular host/URIs.</p> <p>Grepping the local errors log is a bit tricky because of the shape of its content. Its recommend that you first "flatten" the local errors file. Here's an example : <div class="source"><pre>% cat local-errors.log | tr -d \\\n | perl -pe 's/([0-9]{17} )/\n$1/g'</pre></div> </p> <p>This will remove all new lines and then add a new line in front of 17-digit dates (hopefully only 17-digit tokens followed by a space are dates.). The result is one line per entry with a 17-digit date prefix. Makes it easier to parse. </p> <p>To eliminate URIs for unresponsive hosts from the frontier queue, pause the crawl and block the fetch from that host by creating a new per-host setting -- an override -- in the preselector processor.</p> <p>Also, check for any hung threads. This does not happen anymore (0.8.0+). Check the threads report for threads that have been active for a long time but that should not be: i.e. documents being downloaded are small in size. </p> <p>Once you've identified hung threads, kill and replace it.</p> </dd> </dl><dl> <dt><a name="traps">What are crawler traps?</a></dt> <dd> <p>Traps are infinite page sources put up to occupy ('trap') a crawler. Traps may be as innocent as a calendar that returns pages years into the future or not-so-innocent <a href="http://spiders.must.die.net/" class="externalLink" title="External Link">http://spiders.must.die.net/</a>. Traps are created by CGIs/server-side code that dynamically conjures 'nonsense' pages or else exploits combination of soft and relative links to generate URI paths of infinite variety and depth. Once identified, use filters to guard against falling in. </p> <p>Another trap that works by feeding documents of infinite sizes to the crawler is http://yahoo.domain.com.au/tools/spiderbait.aspx* as in http://yahoo.domain.com.au/tools/spiderbait.aspx?state=vic or http://yahoo.domain.com.au/tools/spiderbait.aspx?state=nsw. To filter out infinite document size traps, add a maximum doc. size filter to your crawl order. </p> <p>See <a href="#crawl_junk">What to do when I'm crawling "junk"?</a> </p> </dd> </dl><dl> <dt><a name="crawl_junk">What do I do to avoid crawling "junk"?</a></dt> <dd> <p>In the past crawls were stopped when we ran into "junk." An example of what we mean by "junk" is the crawler stuck in a web calender crawling the year 2020. Nowadays, if "junk" is detected, we'll pause the crawl and set filters to eliminate "junk" and then resume (Eliminated URIs will show in the logs. Helps when doing post-crawl analysis). </p> <p>To help guard against the crawling of "junk" setup the pathological and path-depth filters. This will also help the crawler avoid <a href="#traps">traps</a>. 
<p>To eliminate URIs for unresponsive hosts from the frontier queue, pause the crawl and block fetches from that host by creating a new per-host setting -- an override -- in the preselector processor.</p>
<p>Also, check for any hung threads. Hung threads should no longer occur (0.8.0+), but check the threads report for threads that have been active for a long time when they should not be: i.e. the documents being downloaded are small in size.</p>
<p>Once you've identified hung threads, kill and replace them.</p> </dd> </dl>

<dl> <dt><a name="traps">What are crawler traps?</a></dt> <dd>
<p>Traps are infinite page sources put up to occupy ('trap') a crawler. Traps may be as innocent as a calendar that returns pages years into the future, or not so innocent, like <a href="http://spiders.must.die.net/" class="externalLink" title="External Link">http://spiders.must.die.net/</a>. Traps are created by CGIs/server-side code that dynamically conjures 'nonsense' pages, or else exploits combinations of soft and relative links to generate URI paths of infinite variety and depth. Once a trap is identified, use filters to guard against falling in.</p>
<p>Another kind of trap works by feeding documents of infinite size to the crawler, e.g. http://yahoo.domain.com.au/tools/spiderbait.aspx* as in http://yahoo.domain.com.au/tools/spiderbait.aspx?state=vic or http://yahoo.domain.com.au/tools/spiderbait.aspx?state=nsw. To filter out infinite-document-size traps, add a maximum document size filter to your crawl order.</p>
<p>See <a href="#crawl_junk">What do I do to avoid crawling "junk"?</a></p> </dd> </dl>

<dl> <dt><a name="crawl_junk">What do I do to avoid crawling "junk"?</a></dt> <dd>
<p>In the past, crawls were stopped when we ran into "junk". An example of what we mean by "junk" is the crawler getting stuck in a web calendar, crawling the year 2020. Nowadays, if "junk" is detected, we pause the crawl, set filters to eliminate the "junk", and then resume (eliminated URIs will show in the logs, which helps when doing post-crawl analysis).</p>
<p>To help guard against crawling "junk", set up the pathological and path-depth filters. This will also help the crawler avoid <a href="#traps">traps</a>. The recommended value for the pathological filter is 3 repetitions of the same pattern -- e.g. /images/images/images/... -- and for the path-depth filter, a value of 20.</p> </dd> </dl>

<dl> <dt><a name="war">Can Heritrix be made to run in Tomcat (or Websphere, or Resin, or Weblogic)? Does it have to be run embedded in Jetty?</a></dt> <dd>
<p>Try out Heritrix bundled as a WAR file. Use the maven 'war' target to produce a heritrix.war, or pull the war from the build <a href="http://crawltools:8080/cruisecontrol/buildresults/HEAD-heritrix" class="externalLink" title="External Link">downloads page</a> (click on the 'Build Artifacts' link). Heritrix as a WAR is available in HEAD only (post-1.2.0) and currently has 'experimental' status (i.e. it needs exercising).</p> </dd> </dl>

<dl> <dt><a name="embedding">Can I embed Heritrix in another application?</a></dt> <dd>
<p>Sure. Make sure everything in the Heritrix lib directory is on your CLASSPATH (ensuring heritrix.jar is found first). Thereafter, using HEAD (post-1.2.0), the following should get you a long way:
<div class="source"><pre>Heritrix h = new Heritrix(); h.launch();</pre></div>
You'll then need to have your program hang around while the crawl runs. See <a href="http://groups.yahoo.com/group/archive-crawler/message/1276" class="externalLink" title="External Link">message 1276</a> for an example. See also the answer to the next question and this page on our wiki, <a href="http://crawler.archive.org/cgi-bin/wiki.pl?EmbeddingHeritrix" class="externalLink" title="External Link">Embedding Heritrix</a>.</p> </dd> </dl>

<dl> <dt><a name="cmdlinecontrol">Can I stop/pause and get status from a running Heritrix using command-line tools? Can I remote-control Heritrix?</a></dt> <dd>
<p>A JMX interface has been added to the crawler. The intent is that all features of the UI are exposed via JMX so Heritrix can be remotely controlled.</p>
<p>A command-line control utility that makes use of the JMX API has also been added. It can be found in the scripts directory, packaged as a jar file named <code>cmdline-jmxclient.X.X.X.jar</code>. It has no dependencies on other jars being on its classpath, so it can safely be moved from that location. Its only dependency is JDK 1.5.0. To obtain client usage, type the following: <code>${PATH_TO_JDK1.5.0}/bin/java -jar cmdline-jmxclient.X.X.X.jar</code>. See also <a href="http://crawler.archive.org/cmdline-jmxclient/" class="externalLink" title="External Link">cmdline-jmxclient</a> to learn more.</p>
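<p>As a rough sketch of what an invocation looks like: running the jar with no arguments prints the usage text, and supplying just a login and a host:port lists the MBeans the remote JMX agent has registered. The <code>controlRole:yourpassword</code> login and <code>localhost:8849</code> address below are placeholders; use whatever JMX port and credentials your own Heritrix was started with.
<div class="source"><pre># Print client usage.
% ${PATH_TO_JDK1.5.0}/bin/java -jar cmdline-jmxclient.X.X.X.jar

# List the MBeans registered on a running crawler's JMX agent
# (placeholder credentials and host:port; adjust to your setup).
% ${PATH_TO_JDK1.5.0}/bin/java -jar cmdline-jmxclient.X.X.X.jar controlRole:yourpassword localhost:8849</pre></div>
</p>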
</dd> </dl>

<dl> <dt><a name="more_than_one_job">What techniques exist for crawling more than one job at a time?</a></dt> <dd>
<p>See Tom Emerson's note, <a href="http://groups.yahoo.com/group/archive-crawler/message/1182" class="externalLink" title="External Link">1182</a>, for a suggestion.</p>
<p>It is also possible, post-1.4.0, to run multiple Heritrix instances in a single JVM. Browse to <code>/local-instances.jsp</code>.</p> </dd> </dl>

<dl> <dt><a name="toethreads">Why are the main crawler worker threads called "ToeThreads"?</a></dt> <dd>
<p>While the mascots of web crawlers have usually been spider-related, I'd rather think of Heritrix as a centipede or millipede: fast and many-segmented.</p>
<p>Anything that "crawls" over many things at once would presumably have a lot of feet and toes. Heritrix will often use many hundreds of worker threads to "crawl", but 'WorkerThread' or 'CrawlThread' seem mundane.</p>
<p>So instead, we have 'ToeThreads'. :)</p> </dd> </dl>

<dl> <dt><a name="using_heritrix">Who is using Heritrix?</a></dt> <dd>
<p>Below is a listing of users of Heritrix (to qualify for inclusion in the list below, send a description of a couple of lines to the mailing list).</p>
<ul>
<li><a href="http://www.bok.hi.is/" class="externalLink" title="External Link">The National and University Library of Iceland</a>: Crawls the entire <i>.is</i> domain (~11,000 domains) using Heritrix. Has performed a complete snapshot using Heritrix 1.0.4 (35 million URIs) and plans on running three more snapshots in 2005. See <a href="http://groups.yahoo.com/group/archive-crawler/message/1385" class="externalLink" title="External Link">1385</a>.</li>
<li><a href="http://www.lib.helsinki.fi/english/index.htm" class="externalLink" title="External Link">The National Library of Finland</a>: Has used Heritrix to crawl Finnish museum sites and sites pertaining to the June 2004 European parliament elections. The main crawl done in 2004 was of Finnish university sites (~4 million URLs). Kaisa supplies more detail on how this larger crawl was done: <a href="http://groups.yahoo.com/group/archive-crawler/message/1406" class="externalLink" title="External Link">1406</a>.</li>
<li><a href="http://www.geometa.info" class="externalLink" title="External Link">geometa.info</a>: Geometa.info is a search engine for spatially related geo-data, geo-services and geo-news for Switzerland, Germany and Austria. We use Heritrix with specialised plugins to find geo-relevant data and websites, in formats such as GeoTIFF, GML, Interlis and ESRI files, WFS or WMS services, and other sites with geo-relevant content. Geometabot (Heritrix) is the feeder for the Lucene search engine which provides the core search service for geometa.info.</li>
<li>Saurabh Pathak and Donna Bergmark have written a module for Heritrix that asks a Rainbow classifier whether a page should be crawled. See <a href="http://groups.yahoo.com/group/archive-crawler/message/1905" class="externalLink" title="External Link">1905</a> for their announcement of the project, with links to source and HOWTO documentation.</li>
</ul> </dd> </dl>

<dl> <dt><a name="nutchwax">I've downloaded all these ARC files, now what?</a></dt> <dd>
<p>See the <a href="http://crawler.archive.org/articles/developer_manual.html#arcs" class="externalLink" title="External Link">Developer's Manual</a> for more on ARCs and tools for reading and writing them. There are also tools for searching ARC collections available over in the <a href="http://archive-access.sourceforge.net/" class="externalLink" title="External Link">archive-access</a> project. Check out the Nutch-based <a href="http://archive-access.sourceforge.net/projects/nutch/" class="externalLink" title="External Link">NutchWAX</a> and its companion viewer application, <a href="http://archive-access.sourceforge.net/projects/nutch/" class="externalLink" title="External Link">WERA</a>.</p>
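<p>If you just want a quick peek at what an ARC holds before reaching for those tools, note that an ARC file is a series of concatenated gzip records, so standard gzip tooling can expand it for inspection. A minimal sketch follows; the file name is only a placeholder for one of your own ARCs:
<div class="source"><pre># ARC files are concatenated gzip members; zcat expands them all in sequence.
% zcat IAH-20050101000000-00000-example.arc.gz | less

# Skim just the http record header lines (URL, IP, date, MIME type, length).
% zcat IAH-20050101000000-00000-example.arc.gz | grep '^http' | head</pre></div>
</p>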
</dd> </dl></div>

<div class="section"><a name="References"></a><h2>References</h2>
<dl> <dt><a name="archive_data">Where can I learn more about what is stored at the Internet Archive, the ARC file format, and tools for manipulating ARC files?</a></dt> <dd>
<p>See the ARC section in the <a href="articles/developer_manual.html#arcs">Developer Manual</a>.</p> </dd> </dl>

<dl> <dt><a name="where_refs">Where can I get more background on Heritrix and learn more about "crawling" in general?</a></dt> <dd>
<p>The following are all worth at least a quick skim:</p>
<ol>
<li><a href="http://en.wikipedia.org/wiki/Webcrawler" class="externalLink" title="External Link">The Wikipedia Webcrawler</a> page offers a nice introduction to the general crawling problem. It has a good overview of the current, most-cited literature.</li>
<li><a href="http://citeseer.nj.nec.com/heydon99mercator.html" class="externalLink" title="External Link">Mercator: A Scalable, Extensible Web Crawler</a> is an overview of the original Mercator design, which the Heritrix crawler parallels in many ways.</li>
<li><a href="http://citeseer.nj.nec.com/najork01highperformance.html" class="externalLink" title="External Link">High-performance Web Crawling</a> describes experience scaling Mercator.</li>
<li><a href="http://citeseer.nj.nec.com/heydon00performance.html" class="externalLink" title="External Link">Performance Limitations of the Java Core Libraries</a> describes Mercator's experience working around Java problems and bottlenecks. Fortunately, many of these issues have been improved by later JVMs and Java core API updates -- but some remain issues, and in any case it gives a good flavor of the kinds of problems and profiling one might need to do.</li>
<li><a href="http://vigna.dsi.unimi.it/ftp/papers/UbiCrawler.pdf" class="externalLink" title="External Link">UbiCrawler</a>, a scalable distributed web crawler.</li>
<li><a href="http://citeseer.nj.nec.com/leung01towards.html" class="externalLink" title="External Link">Towards Web-Scale Web Archeology</a> is a higher-level view, focused not so much on crawling details as on the post-crawl needs that motivate crawling in the first place.</li>
<li>A number of other potentially interesting papers are linked off the "crawl-links.html" file in the <a href="http://groups.yahoo.com/group/archive-crawler/files/" class="externalLink" title="External Link">YahooGroups files section</a>.</li>
<li><a href="http://groups.yahoo.com/group/archive-crawler/message/1498" class="externalLink" title="External Link">Msg1498</a> is a note from the list on page similarity/containment issues.</li>
<li>Kristinn Sigurdsson's thesis on the creation of a specialized Frontier and other modules for Heritrix: <a href="http://vefsofnun.bok.hi.is/thesis/ar.pdf" class="externalLink" title="External Link">Adaptive Revisiting with Heritrix</a>.</li>
</ol>
</dd> </dl></div></div></div>
<div class="clear"><hr></hr></div>
<div id="footer"><div class="xleft"><a href="http://sourceforge.net/projects/archive-crawler/" class="externalLink" title="External Link"><img src="http://sourceforge.net/sflogo.php?group_id=archive-crawler&type=1" border="0" alt="sf logo"></img></a></div><div class="xright">&copy; 2003-2006, Internet Archive</div><div class="clear"><hr></hr></div></div></body></html>