The <tt>admin</tt> package contains classes that are used by the Web UI. This includes some core classes and a specific implementation of the <tt>Statistics Tracking</tt> interface found in the <tt>framework</tt> package that is designed to provide the UI with information about ongoing crawls. <h2>Pluggable modules</h2> <p> The following lists the types of pluggable modules found in Heritrix, with a brief explanation of each and links to the respective API documentation. <h3>Frontier</h3> <p> A <tt>Frontier</tt> maintains the internal state of a crawl while it is in progress: which URIs have been discovered, which should be crawled next, and so on. <p> This is one of the most important modules in any crawl. The provided implementation should generally be appropriate unless a very different strategy for ordering URIs for crawling is desired. <p> <A HREF="../../../org/archive/crawler/framework/Frontier.html" title="interface in org.archive.crawler.framework"><CODE>Frontier</CODE></A> is the interface that all <tt>Frontiers</tt> must implement.<br> <A HREF="../../../org/archive/crawler/frontier/package-summary.html"><CODE>org.archive.crawler.frontier</CODE></A> package contains the provided implementation of a <tt>Frontier</tt> along with its supporting classes. <h3>Processor</h3><img src="doc-files/processing_steps.png" alt="Processing Steps" style="width: 198px; height: 470px;" align="right" /> <p> When a URI is crawled, a <A HREF="../../../org/archive/crawler/framework/ToeThread.html" title="class in org.archive.crawler.framework"><CODE>ToeThread</CODE></A> will execute a series of <tt>processors</tt> on it. <p> The processors are split into five distinct chains that are executed in sequence: <ol> <li>Pre-fetch processing chain <li>Fetch processing chain <li>Extractor processing chain <li>Write/Index processing chain <li>Post-processing chain </ol> Each of these chains contains any number of <tt>processors</tt>. 
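<p> The sequence of chains listed above can be sketched roughly as follows. This is only an illustration of the ordering idea, not the Heritrix implementation: <tt>ProcessorChainSketch</tt>, <tt>SketchProcessor</tt>, and the processor names are all hypothetical and do not correspond to classes in the <tt>framework</tt> package.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Heritrix classes.
class ProcessorChainSketch {

    /** Stand-in for a processor: records its name in the trace when run. */
    interface SketchProcessor {
        void process(String uri, List<String> trace);
    }

    /** Creates a processor that only logs its own (made-up) name. */
    static SketchProcessor named(String name) {
        return (uri, trace) -> trace.add(name);
    }

    /** Builds the five chains in the order the text lists them. */
    static Map<String, List<SketchProcessor>> defaultChains() {
        // LinkedHashMap keeps insertion order, so chains run in sequence.
        Map<String, List<SketchProcessor>> chains = new LinkedHashMap<>();
        chains.put("pre-fetch", List.of(named("precondition-check")));
        chains.put("fetch", List.of(named("fetch-dns"), named("fetch-http")));
        chains.put("extractor", List.of(named("extract-html")));
        chains.put("write/index", List.of(named("write-archive")));
        chains.put("post-processing", List.of(named("report-links")));
        return chains;
    }

    /** Runs every processor of every chain, chain by chain, on one URI. */
    static List<String> process(String uri) {
        List<String> trace = new ArrayList<>();
        for (List<SketchProcessor> chain : defaultChains().values()) {
            for (SketchProcessor p : chain) {
                p.process(uri, trace);
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(process("http://example.org/"));
    }
}
```

<p> A chain here is just an ordered list, so a chain may hold any number of processors, and the thread working on a URI simply walks the chains in order.
<p>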
The processors all inherit from a generic <A HREF="../../../org/archive/crawler/framework/Processor.html" title="class in org.archive.crawler.framework"><CODE>Processor</CODE></A>. While the processors are divided into the five categories above, that division is strictly a high-level configuration, and any processor can be placed in any chain (though doing link extraction before fetching a document is clearly of no use). <p> Numerous processors are provided with Heritrix in the following packages:<br> <A HREF="../../../org/archive/crawler/prefetch/package-summary.html"><CODE>org.archive.crawler.prefetch</CODE></A> package contains processors run before the URI is fetched from the Internet.<br> <A HREF="../../../org/archive/crawler/fetcher/package-summary.html"><CODE>org.archive.crawler.fetcher</CODE></A> package contains processors that fetch URIs from the Internet. Typically each processor handles a different protocol.<br> <A HREF="../../../org/archive/crawler/extractor/package-summary.html"><CODE>org.archive.crawler.extractor</CODE></A> package contains processors that perform link extraction on various document types.<br> <A HREF="../../../org/archive/crawler/writer/package-summary.html"><CODE>org.archive.crawler.writer</CODE></A> package contains a processor that writes an ARC file with the fetched document.<br> <A HREF="../../../org/archive/crawler/postprocessor/package-summary.html"><CODE>org.archive.crawler.postprocessor</CODE></A> package contains processors that wrap up the processing, for example by reporting links back to the Frontier. <h3>Filter</h3> <h3>Scope</h3> <p> Scopes are special filters that are applied to the crawl as a whole to define its <i>scope</i>. Any given crawl will employ exactly one scope object to define which URIs are considered 'within scope'. <p> Several implementations covering the most commonly desired scopes are provided (broad, domain, host etc.). However, custom implementations can be derived from these to define any arbitrary scope. 
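<p> As a rough illustration of what a scope decides, the sketch below models a host-style scope as a predicate over URIs and narrows it with one additional filter. It is an assumption-laden example only: <tt>ScopeSketch</tt>, <tt>hostScope</tt>, and <tt>rejectSuffix</tt> are hypothetical names and are not part of the <tt>CrawlScope</tt> API.

```java
import java.net.URI;
import java.util.function.Predicate;

// Hypothetical sketch: these names are not the Heritrix CrawlScope API.
class ScopeSketch {

    /** A host-style scope: accepts only URIs on the same host as the seed. */
    static Predicate<URI> hostScope(URI seed) {
        return candidate -> seed.getHost() != null
                && seed.getHost().equalsIgnoreCase(candidate.getHost());
    }

    /** A filter that rejects URIs whose path ends with the given suffix. */
    static Predicate<URI> rejectSuffix(String suffix) {
        return candidate -> candidate.getPath() == null
                || !candidate.getPath().endsWith(suffix);
    }

    public static void main(String[] args) {
        URI seed = URI.create("http://example.org/");
        // One scope object for the whole crawl, refined by a filter.
        Predicate<URI> scope = hostScope(seed).and(rejectSuffix(".zip"));
        System.out.println(scope.test(URI.create("http://example.org/a.html")));
        System.out.println(scope.test(URI.create("http://example.org/big.zip")));
        System.out.println(scope.test(URI.create("http://other.example/a.html")));
    }
}
```

<p> With the seed above, the first URI is within scope, while the second is rejected by the filter and the third by the host check.
<p>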
Note, though, that most limitations on the scope of a crawl can usually be achieved more easily by taking one of the existing scopes and modifying it with appropriate filters. <p> <A HREF="../../../org/archive/crawler/framework/CrawlScope.html" title="class in org.archive.crawler.framework"><CODE>CrawlScope</CODE></A> - Base class for scopes.<br> <A HREF="../../../org/archive/crawler/scope/package-summary.html"><CODE>org.archive.crawler.scope</CODE></A> package contains the provided scopes. <h3>Statistics Tracking</h3> <p> Any number of statistics tracking modules can be added to a crawl to gather run-time information about its progress. <p> These modules can interrogate the <tt>Frontier</tt> for what sparse data it exposes, and they can also subscribe to <A HREF="../../../org/archive/crawler/event/CrawlURIDispositionListener.html" title="interface in org.archive.crawler.event"><CODE>Crawled URI Disposition</CODE></A> events to monitor the completion of each URI that is processed. <p> An interface for <A HREF="../../../org/archive/crawler/framework/StatisticsTracking.html" title="interface in org.archive.crawler.framework"><CODE>statistics tracking</CODE></A> is provided, as well as a partial implementation (<A HREF="../../../org/archive/crawler/framework/AbstractTracker.html" title="class in org.archive.crawler.framework"><CODE>AbstractTracker</CODE></A>) that does much of the work common to most statistics tracking modules. <p> Furthermore, the <tt>admin</tt> package implements a statistics tracking module (<A HREF="../../../org/archive/crawler/admin/StatisticsTracker.html" title="class in org.archive.crawler.admin"><CODE>StatisticsTracker</CODE></A>) that generates a log of the crawler's progress and provides information that the UI uses. 
It also compiles end-of-crawl reports that contain all of the information it has gathered in the course of the crawl.<br> It is highly recommended that it always be used when running crawls via the UI.<P><HR>Copyright © 2003-2006 Internet Archive. All Rights Reserved.</BODY></HTML>