
📄 frontier.html

📁 Written in Java; left over from an experiment. I meant to delete it, but I'm uploading it instead so everyone can share it.
💻 HTML
📖 Page 1 of 4
<DD><DL></DL></DD><DD><DL><DT><B>Parameters:</B><DD><CODE>caURI</CODE> - The URI to schedule.<DT><B>See Also:</B><DD><A HREF="../../../../org/archive/crawler/datamodel/CandidateURI.html#setSchedulingDirective(int)"><CODE>CandidateURI.setSchedulingDirective(int)</CODE></A></DL></DD></DL>
<HR>
<A NAME="finished(org.archive.crawler.datamodel.CrawlURI)"><!-- --></A><H3>finished</H3>
<PRE>void <B>finished</B>(<A HREF="../../../../org/archive/crawler/datamodel/CrawlURI.html" title="class in org.archive.crawler.datamodel">CrawlURI</A>&nbsp;cURI)</PRE>
<DL><DD>Report a URI being processed as having finished processing. <p>ToeThreads will invoke this method once they have completed work on their assigned URI. <p>This method is synchronized.<P><DD><DL></DL></DD><DD><DL><DT><B>Parameters:</B><DD><CODE>cURI</CODE> - The URI that has finished processing.</DL></DD></DL>
<HR>
<A NAME="discoveredUriCount()"><!-- --></A><H3>discoveredUriCount</H3>
<PRE>long <B>discoveredUriCount</B>()</PRE>
<DL><DD>Number of <i>discovered</i> URIs. <p>That is, any URI that has been confirmed to be within 'scope' (i.e. the Frontier decides that it should be processed). This includes those that have been processed, are being processed and have finished processing. Does not include URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely because the operator changed the scope definition). <p><b>Note:</b> This only counts discovered URIs. Since the same URI can (at least in most frontiers) be fetched multiple times, this number may be somewhat lower than the <i>queued</i>, <i>in process</i> and <i>finished</i> items combined, due to duplicate URIs being queued and processed. This variance is likely to be especially high in Frontiers implementing 'revisit' strategies.<P><DD><DL></DL></DD><DD><DL><DT><B>Returns:</B><DD>Number of discovered URIs.</DL></DD></DL>
<HR>
<A NAME="queuedUriCount()"><!-- --></A><H3>queuedUriCount</H3>
<PRE>long <B>queuedUriCount</B>()</PRE>
<DL><DD>Number of URIs <i>queued</i> up and waiting for processing. <p>This includes any URIs that failed but will be retried. Basically, this is any <i>discovered</i> URI that has not been processed and is not currently being processed. The same discovered URI can be queued multiple times.<P><DD><DL></DL></DD><DD><DL><DT><B>Returns:</B><DD>Number of queued URIs.</DL></DD></DL>
<HR>
<A NAME="deepestUri()"><!-- --></A><H3>deepestUri</H3>
<PRE>long <B>deepestUri</B>()</PRE>
<DL><DD><DL></DL></DD><DD><DL></DL></DD></DL>
<HR>
<A NAME="averageDepth()"><!-- --></A><H3>averageDepth</H3>
<PRE>long <B>averageDepth</B>()</PRE>
<DL><DD><DL></DL></DD><DD><DL></DL></DD></DL>
<HR>
<A NAME="congestionRatio()"><!-- --></A><H3>congestionRatio</H3>
<PRE>float <B>congestionRatio</B>()</PRE>
<DL><DD><DL></DL></DD><DD><DL></DL></DD></DL>
<HR>
<A NAME="finishedUriCount()"><!-- --></A><H3>finishedUriCount</H3>
<PRE>long <B>finishedUriCount</B>()</PRE>
<DL><DD>Number of URIs that have <i>finished</i> processing. <p>Includes both those that were processed successfully and those that failed to be processed (excluding those that failed but will be retried). Does not include those URIs that have been 'forgotten' (deemed out of scope when trying to fetch, most likely because the operator changed the scope definition).<P><DD><DL></DL></DD><DD><DL><DT><B>Returns:</B><DD>Number of finished URIs.</DL></DD></DL>
<HR>
<A NAME="succeededFetchCount()"><!-- --></A><H3>succeededFetchCount</H3>
<PRE>long <B>succeededFetchCount</B>()</PRE>
<DL><DD>Number of <i>successfully</i> processed URIs. <p>Any URI that was processed successfully. This includes URIs that returned 404s and other error codes that do not originate within the crawler.<P><DD><DL></DL></DD><DD><DL><DT><B>Returns:</B><DD>Number of <i>successfully</i> processed URIs.</DL></DD></DL>
<HR>
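The counter methods above lend themselves to a simple progress report. The following sketch is only illustrative: it assumes a Frontier instance has already been obtained from the running crawl (how that happens is outside this page), and the helper class name is made up.

import org.archive.crawler.framework.Frontier;

// Hypothetical helper: summarizes the counters documented above.
public class FrontierStats {

    // Builds a one-line progress report from the documented count methods.
    public static String summarize(Frontier frontier) {
        long discovered = frontier.discoveredUriCount(); // everything confirmed in scope
        long queued     = frontier.queuedUriCount();     // waiting, including retries
        long finished   = frontier.finishedUriCount();   // succeeded plus permanently failed
        long succeeded  = frontier.succeededFetchCount();
        return "discovered=" + discovered
             + " queued=" + queued
             + " finished=" + finished
             + " succeeded=" + succeeded;
    }
}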
<A NAME="failedFetchCount()"><!-- --></A><H3>failedFetchCount</H3>
<PRE>long <B>failedFetchCount</B>()</PRE>
<DL><DD>Number of URIs that <i>failed</i> to process. <p>URIs that could not be processed because of some error or failure in the processing chain. Can include failure to acquire prerequisites, failure to establish a connection with the host and any number of other problems. Does not count those that will be retried, only those that have permanently failed.<P><DD><DL></DL></DD><DD><DL><DT><B>Returns:</B><DD>Number of URIs that failed to process.</DL></DD></DL>
<HR>
<A NAME="disregardedUriCount()"><!-- --></A><H3>disregardedUriCount</H3>
<PRE>long <B>disregardedUriCount</B>()</PRE>
<DL><DD>Number of URIs that were scheduled at one point but have been <i>disregarded</i>. <p>Counts any URI that is scheduled only to be disregarded because it is determined to lie outside the scope of the crawl. Most commonly this will be due to robots.txt exclusions.<P><DD><DL></DL></DD><DD><DL><DT><B>Returns:</B><DD>The number of URIs that have been disregarded.</DL></DD></DL>
<HR>
<A NAME="totalBytesWritten()"><!-- --></A><H3>totalBytesWritten</H3>
<PRE>long <B>totalBytesWritten</B>()</PRE>
<DL><DD>Total number of bytes contained in all URIs that have been processed.<P><DD><DL></DL></DD><DD><DL><DT><B>Returns:</B><DD>The total number of bytes in all processed URIs.</DL></DD></DL>
<HR>
<A NAME="importRecoverLog(java.lang.String, boolean)"><!-- --></A><H3>importRecoverLog</H3>
<PRE>void <B>importRecoverLog</B>(java.lang.String&nbsp;pathToLog, boolean&nbsp;retainFailures) throws java.io.IOException</PRE>
<DL><DD>Recover earlier state by reading a recovery log. <p>Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.<P><DD><DL></DL></DD><DD><DL><DT><B>Parameters:</B><DD><CODE>pathToLog</CODE> - The name (with full path) of the recover log.<DD><CODE>retainFailures</CODE> - If true, failures in the log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.)<DT><B>Throws:</B><DD><CODE>java.io.IOException</CODE> - If problems occur reading the recover log.</DL></DD></DL>
<HR>
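importRecoverLog is the hook for crash recovery. A minimal, hypothetical sketch of calling it follows; the recover-log path comes from the caller, and retainFailures=false is chosen so that failed URIs are retried in the recovered crawl, as the parameter documentation above describes.

import java.io.IOException;
import org.archive.crawler.framework.Frontier;

// Hypothetical recovery step: reload Frontier state from a recover log
// written before a crash. Class name and error handling are illustrative.
public class FrontierRecovery {

    public static void recover(Frontier frontier, String recoverLogPath) {
        try {
            // retainFailures=false: failed entries in the log are ignored,
            // so the corresponding URIs will be retried after recovery.
            frontier.importRecoverLog(recoverLogPath, false);
        } catch (IOException e) {
            // Problems reading the recover log surface as IOException.
            System.err.println("Could not read recover log: " + e.getMessage());
        }
    }
}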
<A NAME="getInitialMarker(java.lang.String, boolean)"><!-- --></A><H3>getInitialMarker</H3>
<PRE><A HREF="../../../../org/archive/crawler/framework/FrontierMarker.html" title="interface in org.archive.crawler.framework">FrontierMarker</A> <B>getInitialMarker</B>(java.lang.String&nbsp;regexpr, boolean&nbsp;inCacheOnly)</PRE>
<DL><DD>Get a <code>URIFrontierMarker</code> initialized with the given regular expression at the 'start' of the Frontier.<P><DD><DL></DL></DD><DD><DL><DT><B>Parameters:</B><DD><CODE>regexpr</CODE> - The regular expression that URIs within the frontier must match to be considered within the scope of this marker.<DD><CODE>inCacheOnly</CODE> - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.<DT><B>Returns:</B><DD>A URIFrontierMarker that is set for the 'start' of the frontier's URI list.</DL></DD></DL>
<HR>
<A NAME="getURIsList(org.archive.crawler.framework.FrontierMarker, int, boolean)"><!-- --></A><H3>getURIsList</H3>
<PRE>java.util.ArrayList <B>getURIsList</B>(<A HREF="../../../../org/archive/crawler/framework/FrontierMarker.html" title="interface in org.archive.crawler.framework">FrontierMarker</A>&nbsp;marker, int&nbsp;numberOfMatches, boolean&nbsp;verbose) throws <A HREF="../../../../org/archive/crawler/framework/exceptions/InvalidFrontierMarkerException.html" title="class in org.archive.crawler.framework.exceptions">InvalidFrontierMarkerException</A></PRE>
<DL><DD>Returns a list of all uncrawled URIs starting from a specified marker until <code>numberOfMatches</code> is reached. <p>Any encountered URI that has not been successfully crawled, has not terminally failed, has not been disregarded and is not currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs. <p>The list is a set of strings containing the URI strings. If verbose is true, the string will include some additional information (path to URI and parent). <p>The <code>URIFrontierMarker</code> will be advanced to the position at which its maximum number of matches is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.
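getInitialMarker and getURIsList together allow paging through the pending URIs. The sketch below is a guess at typical usage: it assumes an empty batch means the marker is exhausted (the page above does not spell this out), and the batch size and regular expression are illustrative.

import java.util.ArrayList;
import org.archive.crawler.framework.Frontier;
import org.archive.crawler.framework.FrontierMarker;
import org.archive.crawler.framework.exceptions.InvalidFrontierMarkerException;

// Hypothetical peek at pending URIs: page through the frontier in batches
// by reusing the same marker, as the documentation above describes.
public class FrontierPeek {

    public static void printPending(Frontier frontier) throws InvalidFrontierMarkerException {
        // Match every URI; inCacheOnly=true keeps this to a quick in-memory peek.
        FrontierMarker marker = frontier.getInitialMarker(".*", true);
        // Reusing the marker returns the 'next' batch on each call.
        ArrayList batch = frontier.getURIsList(marker, 100, false);
        // Assumption: an empty batch signals that no further matches remain.
        while (!batch.isEmpty()) {
            for (Object uri : batch) {
                System.out.println(uri);
            }
            batch = frontier.getURIsList(marker, 100, false);
        }
    }
}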
