
📄 faq.html

📁 Written in Java; left over from an experiment. I was going to delete it, but I am uploading it instead for everyone to share.
💻 HTML
📖 Page 1 of 3
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><title>Heritrix - Frequently Asked Questions</title><style type="text/css" media="all">          @import url("./style/maven-base.css");          			    @import url("./style/maven-theme.css");@import url("./style/project.css");</style><link rel="stylesheet" href="./style/print.css" type="text/css" media="print"></link><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></meta></head><body class="composite"><div id="banner"><a href="http://www.archive.org/" id="organizationLogo"><img alt="Internet Archive" src="http://www.archive.org/images/logo.jpg"></img></a><a href="http://crawler.archive.org" id="projectLogo"><img alt="Heritrix" src="./images/logo.gif"></img></a><div class="clear"><hr></hr></div></div><div id="breadcrumbs"><div class="xleft">                	Last published: 27 September 2006                  | Doc for 1.10.1</div><div class="xright"></div><div class="clear"><hr></hr></div></div><div id="leftColumn"><div id="navcolumn"><div id="menuOverview"><h5>Overview</h5><ul><li class="none"><a href="license.html">License</a></li><li class="none"><a href="requirements.html">System Requirements</a></li><li class="none"><a href="downloads.html">Downloads</a></li><li class="none"><a href="articles/user_manual/index.html">User Manual</a></li><li class="none"><a href="articles/developer_manual/index.html">Developer Manual</a></li><li class="none"><a href="apidocs/index.html">Javadocs</a></li><li class="none"><strong><a href="faq.html">FAQ</a></strong></li><li class="none"><a href="http://crawler.archive.org/cgi-bin/wiki.pl?HomePage" class="externalLink" title="External Link">Wiki</a></li><li class="none"><a href="http://sourceforge.net/tracker/?group_id=73833&amp;atid=539099" class="externalLink" title="External Link">Browse/Submit a Bug</a></li></ul></div><div id="menuProject_Documentation"><h5>Project 
Documentation</h5><ul><li class="none"><a href="index.html">About Heritrix</a></li><li class="collapsed"><a href="project-info.html">Project Info</a></li><li class="collapsed"><a href="maven-reports.html">Project Reports</a></li><li class="none"><a href="http://maven.apache.org/development-process.html" class="externalLink" title="External Link">Development Process</a></li></ul></div><a href="http://maven.apache.org/" title="Built by Maven" id="poweredBy"><img alt="Built by Maven" src="./images/logos/maven-button-1.png"></img></a></div></div><div id="bodyColumn"><div class="contentBox"><div class="section"><a name="Frequently_Asked_Questions"></a><h2>Frequently Asked Questions</h2><p>              <strong>General</strong>            </p><ol>                            <li>                                                <a href="#heritrix">                              What does "Heritrix" mean?                      </a>              </li>                            <li>                                                <a href="#introduction">                        Where can I go to get a good introduction/overview of Heritrix?                      </a>              </li>                            <li>                                                <a href="#user-heritrix">                      I need to crawl/archive a set of websites, can I use Heritrix?                      </a>              </li>                            <li>                                                <a href="#developer">                      I'm a developer, can I help?                      </a>              </li>                            <li>                                                <a href="#license">                  What license does Heritrix use?                      
</a></li></ol><p><strong>Common Problems</strong></p><ol><li><a href="#arc_closed">How do I know when Heritrix is done with an ARC file?</a></li><li><a href="#limitations">Are there known limitations?</a></li><li><a href="#testsfail">Why do unit tests fail when I build?</a></li><li><a href="#linuxes">Which Linux distribution should I use to run Heritrix, and which kernel version do I need?</a></li><li><a href="#windows">How do I run Heritrix on Windows?</a></li><li><a href="#windowsstart">The crawler resolves DNS fine but fetches nothing after that. Why?</a></li><li><a href="#windowsmkdir">The crawler, running on Windows, complains it cannot <code>mkdir</code>. Why?</a></li><li><a href="#midfetch">I only want to download <code>text/html</code> and nothing else. Can I do it?
</a></li><li><a href="#crawllogstatuscodes">Where do I go to learn about these cryptic crawl.log status codes (-6, -7, -9998, etc.)?</a></li><li><a href="#toomanyopenfiles">Why do I get <i>java.io.FileNotFoundException...(Too many open files)</i> or <i>java.io.IOException...(Too many open files)</i>?</a></li><li><a href="#oome_broadcrawl">Why do I get an OutOfMemoryError ten minutes after starting a broad-scoped crawl?</a></li><li><a href="#new_writer">Can I insert the crawl download directly into a MySQL database instead of into an ARC file on disk while crawling?</a></li><li><a href="#mirror">Does Heritrix have to write ARC files?</a></li><li><a href="#eclipse_assert">Why does Heritrix complain about the 'assert' keyword when run in Eclipse?</a></li><li><a href="#crawl_finished">Why won't my crawl finish?</a></li><li><a href="#traps">What are crawler traps?
</a></li><li><a href="#crawl_junk">What do I do to avoid crawling "junk"?</a></li><li><a href="#war">Can Heritrix be made to run in Tomcat (or WebSphere, Resin, or WebLogic)? Does it have to run embedded in Jetty?</a></li><li><a href="#embedding">Can I embed Heritrix in another application?</a></li><li><a href="#cmdlinecontrol">Can I stop/pause and get status from a running Heritrix using command-line tools? Can I remote control Heritrix?</a></li><li><a href="#more_than_one_job">What techniques exist for crawling more than one job at a time?</a></li><li><a href="#toethreads">Why are the main crawler worker threads called "ToeThreads"?</a></li><li><a href="#using_heritrix">Who is using Heritrix?</a>
