📄 package-summary.html

        The <tt>admin</tt> package contains classes that are used by the Web UI.        This includes some core classes and a specific implementation of the         <tt>Statistics Tracking</tt> interface found in the <tt>framework</tt>        package that is designed to provide the UI with information about         ongoing crawls.    <h2>Pluggable modules</h2>    <p>        The following lists the types of pluggable modules found in         Heritrix, with a brief explanation of each and links to the respective        API documentation.        <h3>Frontier</h3>    <p>        A <tt>Frontier</tt> maintains the internal state of a crawl while it is        in progress: which URIs have been discovered, which should be crawled next,        and so on.    <p>        This is one of the most important modules in any crawl, and        the provided implementation should generally be appropriate unless a very        different strategy for ordering URIs for crawling is desired.    <p>        <A HREF="../../../org/archive/crawler/framework/Frontier.html" title="interface in org.archive.crawler.framework"><CODE>Frontier</CODE></A> is the interface        that all <tt>Frontiers</tt> must implement.<br>        <A HREF="../../../org/archive/crawler/frontier/package-summary.html"><CODE>org.archive.crawler.frontier</CODE></A> package         contains the provided implementation of a <tt>Frontier</tt> along with its        supporting classes.        <h3>Processor</h3><img src="doc-files/processing_steps.png"     alt="Processing Steps" style="width: 198px; height: 470px;"     align="right" />    <p>        When a URI is crawled, a <A HREF="../../../org/archive/crawler/framework/ToeThread.html" title="class in org.archive.crawler.framework"><CODE>ToeThread</CODE></A> will execute a series of <tt>processors</tt> on it.    
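<p>        As a rough illustration of running one URI through an ordered series of        processors, the sketch below groups steps into chains and applies them in        sequence. It is not Heritrix code: <tt>Step</tt>, <tt>ChainSketch</tt>,        <tt>runChains</tt> and <tt>named</tt> are hypothetical names invented for this        example, and the real <tt>Processor</tt> and <tt>ToeThread</tt> classes may be        organized quite differently.

```java
/** Hypothetical stand-in for a processor; NOT the Heritrix Processor class. */
interface Step {
    void process(StringBuilder trace, String uri);
}

/** Minimal sketch of applying chains of steps to one URI, in order. */
class ChainSketch {

    /** Runs every step of every chain on the URI and returns what was recorded. */
    static String runChains(Step[][] chains, String uri) {
        StringBuilder trace = new StringBuilder();
        for (Step[] chain : chains) {       // chains run one after another
            for (Step step : chain) {       // steps run in their listed order
                step.process(trace, uri);
            }
        }
        return trace.toString();
    }

    /** Builds a step that only records its own name (assumption: no real work). */
    static Step named(String name) {
        return (trace, uri) -> trace.append(name).append(' ');
    }

    public static void main(String[] args) {
        Step[][] chains = {
            { named("preselect") },
            { named("fetch") },
            { named("extract-a"), named("extract-b") },
        };
        // prints "preselect fetch extract-a extract-b"
        System.out.println(runChains(chains, "http://example.com/").trim());
    }
}
```

<p>        Nothing in the sketch ties a step to a particular chain, so the grouping is        purely a matter of how the arrays are arranged.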
<p>        The processors are split into 5 distinct chains that are executed in sequence:                              <ol>            <li>Pre-fetch processing chain            <li>Fetch processing chain            <li>Extractor processing chain            <li>Write/Index processing chain            <li>Post-processing chain        </ol>        Each of these chains contains any number of <tt>processors</tt>. The processors        all inherit from a generic <A HREF="../../../org/archive/crawler/framework/Processor.html" title="class in org.archive.crawler.framework"><CODE>Processor</CODE></A>. While the processors are divided into the five categories above, that        is strictly a high-level configuration, and any processor can be in any chain        (although doing link extraction before fetching a document is clearly of no        use).    <p>        Numerous processors are provided with Heritrix in the following packages:<br>        <A HREF="../../../org/archive/crawler/prefetch/package-summary.html"><CODE>org.archive.crawler.prefetch</CODE></A> package        contains processors run before the URI is fetched from the Internet.<br>        <A HREF="../../../org/archive/crawler/fetcher/package-summary.html"><CODE>org.archive.crawler.fetcher</CODE></A> package        contains processors that fetch URIs from the Internet. 
Typically each        processor handles a different protocol.<br>        <A HREF="../../../org/archive/crawler/extractor/package-summary.html"><CODE>org.archive.crawler.extractor</CODE></A> package        contains processors that perform link extraction on various document types.<br>        <A HREF="../../../org/archive/crawler/writer/package-summary.html"><CODE>org.archive.crawler.writer</CODE></A> package contains        a processor that writes an ARC file with the fetched document.<br>        <A HREF="../../../org/archive/crawler/postprocessor/package-summary.html"><CODE>org.archive.crawler.postprocessor</CODE></A>        package contains processors that wrap up the processing, reporting links        back to the Frontier etc.    <h3>Filter</h3>        <h3>Scope</h3>    <p>        Scopes are special filters that are applied to the crawl as a whole to        define its <i>scope</i>. Any given crawl will employ exactly one scope        object to define what URIs are considered 'within scope'.    <p>        Several implementations covering the most commonly        desired scopes are provided (broad, domain, host etc.). However, custom        implementations can be derived from these to define any arbitrary scope.        Note, though, that a limitation on the scope        of a crawl can usually be achieved more easily by using one of the existing scopes and        modifying it with appropriate filters.    <p>        <A HREF="../../../org/archive/crawler/framework/CrawlScope.html" title="class in org.archive.crawler.framework"><CODE>CrawlScope</CODE></A> - Base class for        scopes.<br>        <A HREF="../../../org/archive/crawler/scope/package-summary.html"><CODE>org.archive.crawler.scope</CODE></A> package. Contains         provided scopes.            <h3>Statistics Tracking</h3>    <p>        Any number of statistics tracking modules can be added to a crawl to gather        runtime information about its progress.    
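<p>        To make the idea of attaching any number of tracking modules to one crawl        more concrete, here is a minimal sketch in which several observers are        notified about each finished URI. It does not use the Heritrix API:        <tt>CrawlObserver</tt>, <tt>CountingObserver</tt>, <tt>LastUriObserver</tt> and        <tt>TrackingSketch</tt> are hypothetical names, and the notification signature        is an assumption made only for illustration.

```java
/** Hypothetical callback; NOT the Heritrix StatisticsTracking interface. */
interface CrawlObserver {
    void uriFinished(String uri, boolean success);
}

/** Counts how many URIs succeeded and how many failed. */
class CountingObserver implements CrawlObserver {
    int succeeded;
    int failed;

    @Override
    public void uriFinished(String uri, boolean success) {
        if (success) {
            succeeded++;
        } else {
            failed++;
        }
    }
}

/** Remembers only the most recently finished URI. */
class LastUriObserver implements CrawlObserver {
    String last;

    @Override
    public void uriFinished(String uri, boolean success) {
        last = uri;
    }
}

class TrackingSketch {

    /** Tells every registered observer about one finished URI. */
    static void publish(CrawlObserver[] observers, String uri, boolean success) {
        for (CrawlObserver observer : observers) {
            observer.uriFinished(uri, success);
        }
    }

    public static void main(String[] args) {
        CountingObserver counts = new CountingObserver();
        LastUriObserver last = new LastUriObserver();
        CrawlObserver[] observers = { counts, last };

        publish(observers, "http://example.com/a", true);
        publish(observers, "http://example.com/b", false);

        // prints "1 ok, 1 failed, last=http://example.com/b"
        System.out.println(counts.succeeded + " ok, " + counts.failed
                + " failed, last=" + last.last);
    }
}
```

<p>        Because each observer keeps its own state, adding another one only means        extending the array; the existing observers are unaffected.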
<p>        These modules can interrogate the <tt>Frontier</tt> for what sparse        data it exposes, and they can also subscribe to         <A HREF="../../../org/archive/crawler/event/CrawlURIDispositionListener.html" title="interface in org.archive.crawler.event"><CODE>Crawled URI        Disposition</CODE></A> events to monitor the completion of each URI that is processed.    <p>        An interface for <A HREF="../../../org/archive/crawler/framework/StatisticsTracking.html" title="interface in org.archive.crawler.framework"><CODE>statistics tracking</CODE></A> is provided as well as a partial implementation         (<A HREF="../../../org/archive/crawler/framework/AbstractTracker.html" title="class in org.archive.crawler.framework"><CODE>AbstractTracker</CODE></A>)         that does much of the work common to most statistics tracking modules.    <p>        Furthermore, the <tt>admin</tt> package implements a statistics tracking        module (<A HREF="../../../org/archive/crawler/admin/StatisticsTracker.html" title="class in org.archive.crawler.admin"><CODE>StatisticsTracker</CODE></A>)        that generates a log of the crawler's progress and provides information        that the UI uses. 
It also compiles end-of-crawl reports that contain all of the        information it has gathered in the course of the crawl.<br>        It is highly recommended that it always be used when running crawls via the UI.<P><P><DL></DL><HR><!-- ======= START OF BOTTOM NAVBAR ====== --><A NAME="navbar_bottom"><!-- --></A><A HREF="#skip-navbar_bottom" title="Skip navigation links"></A><TABLE BORDER="0" WIDTH="100%" CELLPADDING="1" CELLSPACING="0" SUMMARY=""><TR><TD COLSPAN=2 BGCOLOR="#EEEEFF" CLASS="NavBarCell1"><A NAME="navbar_bottom_firstrow"><!-- --></A><TABLE BORDER="0" CELLPADDING="0" CELLSPACING="3" SUMMARY="">  <TR ALIGN="center" VALIGN="top">  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="../../../overview-summary.html"><FONT CLASS="NavBarFont1"><B>Overview</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#FFFFFF" CLASS="NavBarCell1Rev"> &nbsp;<FONT CLASS="NavBarFont1Rev"><B>Package</B></FONT>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <FONT CLASS="NavBarFont1">Class</FONT>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="package-use.html"><FONT CLASS="NavBarFont1"><B>Use</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="package-tree.html"><FONT CLASS="NavBarFont1"><B>Tree</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="../../../deprecated-list.html"><FONT CLASS="NavBarFont1"><B>Deprecated</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="../../../index-all.html"><FONT CLASS="NavBarFont1"><B>Index</B></FONT></A>&nbsp;</TD>  <TD BGCOLOR="#EEEEFF" CLASS="NavBarCell1">    <A HREF="../../../help-doc.html"><FONT CLASS="NavBarFont1"><B>Help</B></FONT></A>&nbsp;</TD>  </TR></TABLE></TD><TD ALIGN="right" VALIGN="top" ROWSPAN=3><EM></EM></TD></TR><TR><TD BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">&nbsp;PREV PACKAGE&nbsp;&nbsp;<A HREF="../../../org/archive/crawler/admin/package-summary.html"><B>NEXT PACKAGE</B></A></FONT></TD><TD 
BGCOLOR="white" CLASS="NavBarCell2"><FONT SIZE="-2">  <A HREF="../../../index.html?org/archive/crawler/package-summary.html" target="_top"><B>FRAMES</B></A>  &nbsp;&nbsp;<A HREF="package-summary.html" target="_top"><B>NO FRAMES</B></A>  &nbsp;&nbsp;<SCRIPT type="text/javascript">  <!--  if(window==top) {    document.writeln('<A HREF="../../../allclasses-noframe.html"><B>All Classes</B></A>');  }  //--></SCRIPT><NOSCRIPT>  <A HREF="../../../allclasses-noframe.html"><B>All Classes</B></A></NOSCRIPT></FONT></TD></TR></TABLE><A NAME="skip-navbar_bottom"></A><!-- ======== END OF BOTTOM NAVBAR ======= --><HR>Copyright &copy; 2003-2006 Internet Archive. All Rights Reserved.</BODY></HTML>