<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>Heritrix - Frequently Asked Questions</title><style type="text/css" media="all"> @import url("./style/maven-base.css"); @import url("./style/maven-theme.css");@import url("./style/project.css");</style><link rel="stylesheet" href="./style/print.css" type="text/css" media="print"></link><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta></head><body class="composite"><div id="banner"><a href="http://www.archive.org/" id="organizationLogo"><img alt="Internet Archive" src="http://www.archive.org/images/logo.jpg"></img></a><a href="http://crawler.archive.org" id="projectLogo"><img alt="Heritrix" src="./images/logo.gif"></img></a><div class="clear"><hr></hr></div></div><div id="breadcrumbs"><div class="xleft"> Last published: 27 September 2006 | Doc for 1.10.1</div><div class="xright"></div><div class="clear"><hr></hr></div></div><div id="leftColumn"><div id="navcolumn"><div id="menuOverview"><h5>Overview</h5><ul><li class="none"><a href="license.html">License</a></li><li class="none"><a href="requirements.html">System Requirements</a></li><li class="none"><a href="downloads.html">Downloads</a></li><li class="none"><a href="articles/user_manual/index.html">User Manual</a></li><li class="none"><a href="articles/developer_manual/index.html">Developer Manual</a></li><li class="none"><a href="apidocs/index.html">Javadocs</a></li><li class="none"><strong><a href="faq.html">FAQ</a></strong></li><li class="none"><a href="http://crawler.archive.org/cgi-bin/wiki.pl?HomePage" class="externalLink" title="External Link">Wiki</a></li><li class="none"><a href="http://sourceforge.net/tracker/?group_id=73833&atid=539099" class="externalLink" title="External Link">Browse/Submit a Bug</a></li></ul></div><div id="menuProject_Documentation"><h5>Project Documentation</h5><ul><li class="none"><a href="index.html">About 
Heritrix</a></li><li class="collapsed"><a href="project-info.html">Project Info</a></li><li class="collapsed"><a href="maven-reports.html">Project Reports</a></li><li class="none"><a href="http://maven.apache.org/development-process.html" class="externalLink" title="External Link">Development Process</a></li></ul></div><a href="http://maven.apache.org/" title="Built by Maven" id="poweredBy"><img alt="Built by Maven" src="./images/logos/maven-button-1.png"></img></a></div></div><div id="bodyColumn"><div class="contentBox"><div class="section"><a name="Frequently_Asked_Questions"></a><h2>Frequently Asked Questions</h2><p> <strong>General</strong> </p><ol> <li> <a href="#heritrix"> What does "Heritrix" mean? </a> </li> <li> <a href="#introduction"> Where can I go to get a good introduction/overview of Heritrix? </a> </li> <li> <a href="#user-heritrix"> I need to crawl/archive a set of websites. Can I use Heritrix? </a> </li> <li> <a href="#developer"> I'm a developer. Can I help? </a> </li> <li> <a href="#license"> What license does Heritrix use? </a> </li> </ol><p> <strong>Common Problems</strong> </p><ol> <li> <a href="#arc_closed"> How do I know when Heritrix is done with an ARC file? </a> </li> <li> <a href="#limitations"> Are there known limitations? </a> </li> <li> <a href="#testsfail"> Why do unit tests fail when I build? </a> </li> <li> <a href="#linuxes"> Which Linux distribution should I use to run Heritrix and which kernel version do I need? </a> </li> <li> <a href="#windows"> How do I run Heritrix on Windows? </a> </li> <li> <a href="#windowsstart"> The crawler resolves DNS fine but fetches nothing afterwards. Why? </a> </li> <li> <a href="#windowsmkdir"> The crawler, running on Windows, complains it cannot <code>mkdir</code>. Why? </a> </li> <li> <a href="#midfetch"> I only want to download <code>text/html</code> and nothing else. Can I do it? 
</a> </li> <li> <a href="#crawllogstatuscodes"> Where do I go to learn about these cryptic crawl.log status codes (-6, -7, -9998, etc.)? </a> </li> <li> <a href="#toomanyopenfiles"> Why do I get <i>java.io.FileNotFoundException...(Too many open files)</i> or <i>java.io.IOException...(Too many open files)</i>? </a> </li> <li> <a href="#oome_broadcrawl"> Why do I get an OutOfMemoryError ten minutes after starting a broad-scoped crawl? </a> </li> <li> <a href="#new_writer"> Can I insert the crawl download directly into a MySQL database instead of into an ARC file on disk while crawling? </a> </li> <li> <a href="#mirror"> Does Heritrix have to write ARC files? </a> </li> <li> <a href="#eclipse_assert"> Why does Heritrix complain about the 'assert' keyword when run in Eclipse? </a> </li> <li> <a href="#crawl_finished"> Why won't my crawl finish? </a> </li> <li> <a href="#traps"> What are crawler traps? </a> </li> <li> <a href="#crawl_junk"> What do I do to avoid crawling "junk"? </a> </li> <li> <a href="#war"> Can Heritrix be made to run in Tomcat (or Websphere, or Resin, or Weblogic)? Does it have to be run embedded in Jetty? </a> </li> <li> <a href="#embedding"> Can I embed Heritrix in another application? </a> </li> <li> <a href="#cmdlinecontrol"> Can I stop/pause and get status from a running Heritrix using command-line tools? Can I remote control Heritrix? </a> </li> <li> <a href="#more_than_one_job"> What techniques exist for crawling more than one job at a time? </a> </li> <li> <a href="#toethreads"> Why are the main crawler worker threads called "ToeThreads"? </a> </li> <li> <a href="#using_heritrix"> Who is using Heritrix? </a>