 * A very simple extractor. Will assume that any string that matches a
 * configurable regular expression is a link.
 *
 * @author Kristinn Sigurdsson
 */
public class SimpleExtractor extends Processor
    implements CoreAttributeConstants
{
    public static final String ATTR_REGULAR_EXPRESSION = "input-param";
    public static final String DEFAULT_REGULAR_EXPRESSION =
        "http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/"; //Find domains

    int numberOfCURIsHandled = 0;
    int numberOfLinksExtracted = 0;

    public SimpleExtractor(String name) { <a name="simple_1" href="processor.html#co_simple_1"><img border="0" alt="1" src="images/callouts/1.png"></a>
        super(name, "A very simple link extractor. Doesn't do anything useful.");
        Type e;
        e = addElementToDefinition(new SimpleType(ATTR_REGULAR_EXPRESSION,
            "Regular expression used to identify links",
            DEFAULT_REGULAR_EXPRESSION));
        e.setExpertSetting(true);
    }

    protected void innerProcess(CrawlURI curi) {
        if (!curi.isHttpTransaction()) <a name="simple_2" href="processor.html#co_simple_2"><img border="0" alt="2" src="images/callouts/2.png"></a>
        {
            // We only handle HTTP at the moment.
            return;
        }

        numberOfCURIsHandled++; <a name="simple_3" href="processor.html#co_simple_3"><img border="0" alt="3" src="images/callouts/3.png"></a>

        CharSequence cs = curi.getHttpRecorder().getReplayCharSequence(); <a name="simple_4" href="processor.html#co_simple_4"><img border="0" alt="4" src="images/callouts/4.png"></a>
        String regexpr = null;
        try {
            regexpr = (String)getAttribute(ATTR_REGULAR_EXPRESSION,curi); <a name="simple_5" href="processor.html#co_simple_5"><img border="0" alt="5" src="images/callouts/5.png"></a>
        } catch(AttributeNotFoundException e) {
            regexpr = DEFAULT_REGULAR_EXPRESSION;
        }

        Matcher match = TextUtils.getMatcher(regexpr, cs); <a name="simple_6" href="processor.html#co_simple_6"><img border="0" alt="6" src="images/callouts/6.png"></a>

        while (match.find()){
            String link = cs.subSequence(match.start(),match.end()).toString(); <a name="simple_7" href="processor.html#co_simple_7"><img border="0" alt="7" src="images/callouts/7.png"></a>
            curi.createAndAddLink(link, Link.SPECULATIVE_MISC, Link.NAVLINK_HOP);<a name="simple_8" href="processor.html#co_simple_8"><img border="0" alt="8" src="images/callouts/8.png"></a>
            numberOfLinksExtracted++; <a name="simple_9" href="processor.html#co_simple_9"><img border="0" alt="9" src="images/callouts/9.png"></a>
            System.out.println("SimpleExtractor: " + link); <a name="simple_10" href="processor.html#co_simple_10"><img border="0" alt="10" src="images/callouts/10.png"></a>
        }

        TextUtils.recycleMatcher(match); <a name="simple_11" href="processor.html#co_simple_11"><img border="0" alt="11" src="images/callouts/11.png"></a>
    }

    public String report() { <a name="simple_12" href="processor.html#co_simple_12"><img border="0" alt="12" src="images/callouts/12.png"></a>
        StringBuffer ret = new StringBuffer();
        ret.append("Processor: org.archive.crawler.extractor."
            + "SimpleExtractor\n");
        ret.append(" Function: Example extractor\n");
        ret.append(" CrawlURIs handled: " + numberOfCURIsHandled + "\n");
        ret.append(" Links extracted: " + numberOfLinksExtracted + "\n\n");

        return ret.toString();
    }
}</pre><div class="calloutlist"><table summary="Callout list" border="0"><tr><td align="left" valign="top" width="5%"><a name="co_simple_1"></a><a href="#simple_1"><img border="0" alt="1" src="images/callouts/1.png"></a> </td><td align="left" valign="top"><p>The constructor. As with any Heritrix module, it sets up the processor's name, description and configurable parameters. In this case the only configurable parameter is the regular expression that will be used to find links. Both a name and a default value are provided for this parameter. It is also marked as an expert setting.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_2"></a><a href="#simple_2"><img border="0" alt="2" src="images/callouts/2.png"></a> </td><td align="left" valign="top"><p>Check if the URI was fetched via an HTTP transaction. If not, it is probably a DNS lookup or was not fetched. Either way, regular link extraction is not possible.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_3"></a><a href="#simple_3"><img border="0" alt="3" src="images/callouts/3.png"></a> </td><td align="left" valign="top"><p>If we get this far then we have a URI that the processor will try to extract links from. Bump the URI counter up by one.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_4"></a><a href="#simple_4"><img border="0" alt="4" src="images/callouts/4.png"></a> </td><td align="left" valign="top"><p>Get the ReplayCharSequence. Regular expressions can be applied to it directly.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_5"></a><a href="#simple_5"><img border="0" alt="5" src="images/callouts/5.png"></a> </td><td align="left" valign="top"><p>Look up the regular expression to use. 
If the attribute is not found we'll use the default value.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_6"></a><a href="#simple_6"><img border="0" alt="6" src="images/callouts/6.png"></a> </td><td align="left" valign="top"><p>Apply the regular expression. We'll use the <a href="http://crawler.archive.org/apidocs/org/archive/util/TextUtils.html#getMatcher(java.lang.String,%20java.lang.CharSequence)" target="_top">TextUtils.getMatcher()</a> utility method for performance reasons.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_7"></a><a href="#simple_7"><img border="0" alt="7" src="images/callouts/7.png"></a> </td><td align="left" valign="top"><p>Extract a link discovered by the regular expression from the character sequence and store it as a string.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_8"></a><a href="#simple_8"><img border="0" alt="8" src="images/callouts/8.png"></a> </td><td align="left" valign="top"><p>Add discovered link to the collection of regular links extracted from the current URI.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_9"></a><a href="#simple_9"><img border="0" alt="9" src="images/callouts/9.png"></a> </td><td align="left" valign="top"><p>Note that we just discovered another link.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_10"></a><a href="#simple_10"><img border="0" alt="10" src="images/callouts/10.png"></a> </td><td align="left" valign="top"><p>This is a handy debug line that will print each extracted link to the standard output. You would not want this in production code.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_11"></a><a href="#simple_11"><img border="0" alt="11" src="images/callouts/11.png"></a> </td><td align="left" valign="top"><p>Free up the matcher object. This too is for performance. 
See the related <a href="http://crawler.archive.org/apidocs/org/archive/util/TextUtils.html#freeMatcher(java.util.regex.Matcher)" target="_top">javadoc</a>.</p></td></tr><tr><td align="left" valign="top" width="5%"><a name="co_simple_12"></a><a href="#simple_12"><img border="0" alt="12" src="images/callouts/12.png"></a> </td><td align="left" valign="top"><p>The report states the name of the processor, its function and the totals of how many URIs were handled and how many links were extracted. A fairly typical report for an extractor.</p></td></tr></table></div><p>Even though the example above is fairly simple the processor nevertheless works as intended.</p></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10665"></a>11.4. Things to keep in mind when writing a processor</h3></div></div></div><p></p><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10669"></a>11.4.1. Interruptions</h4></div></div></div><p>Classes extending Processor should not trap InterruptedExceptions.</p><p>InterruptedExceptions should be allowed to propagate to the ToeThread executing the processor.</p><p>Also they should immediately exit their main method (<code class="literal">innerProcess()</code>) if the <code class="literal">interrupted</code> flag is set.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1067A"></a>11.4.2. One processor, many threads</h4></div></div></div><p>For each processor only one instance is created per crawl. As there are multiple threads running, these processors must be carefully written so that no conflicts arise. 
This usually means that class variables cannot be used for anything other than gathering incremental statistics and data.</p><p>There is a facility for having an instance per thread but it has not been tested and will not be covered in this document.</p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="scope.html">Prev</a> </td><td align="center" width="20%"> </td><td align="right" width="40%"> <a accesskey="n" href="statistics.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">10. Writing a Scope </td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%"> 12. Writing a Statistics Tracker</td></tr></table></div></body></html>