📄 processor.html

 * A very simple extractor. Will assume that any string that matches a
 * configurable regular expression is a link.
 *
 * @author Kristinn Sigurdsson
 */
public class SimpleExtractor extends Processor
    implements CoreAttributeConstants
{
    public static final String ATTR_REGULAR_EXPRESSION = "input-param";
    public static final String DEFAULT_REGULAR_EXPRESSION =
        "http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/"; //Find domains

    int numberOfCURIsHandled = 0;
    int numberOfLinksExtracted = 0;

    public SimpleExtractor(String name) { <a name="simple_1" href="processor.html#co_simple_1"><img border="0" alt="1" src="images/callouts/1.png"></a>
        super(name, "A very simple link extractor. Doesn't do anything useful.");
        Type e;
        e = addElementToDefinition(new SimpleType(ATTR_REGULAR_EXPRESSION,
            "Regular expression used to find links",
            DEFAULT_REGULAR_EXPRESSION));
        e.setExpertSetting(true);
    }

    protected void innerProcess(CrawlURI curi) {
        if (!curi.isHttpTransaction()) <a name="simple_2" href="processor.html#co_simple_2"><img border="0" alt="2" src="images/callouts/2.png"></a>
        {
            // We only handle HTTP at the moment.
            return;
        }

        numberOfCURIsHandled++; <a name="simple_3" href="processor.html#co_simple_3"><img border="0" alt="3" src="images/callouts/3.png"></a>
        CharSequence cs = curi.getHttpRecorder().getReplayCharSequence(); <a name="simple_4" href="processor.html#co_simple_4"><img border="0" alt="4" src="images/callouts/4.png"></a>
        String regexpr = null;
        try {
            regexpr = (String)getAttribute(ATTR_REGULAR_EXPRESSION,curi); <a name="simple_5" href="processor.html#co_simple_5"><img border="0" alt="5" src="images/callouts/5.png"></a>
        } catch(AttributeNotFoundException e) {
            regexpr = DEFAULT_REGULAR_EXPRESSION;
        }

        Matcher match = TextUtils.getMatcher(regexpr, cs); <a name="simple_6" href="processor.html#co_simple_6"><img border="0" alt="6" src="images/callouts/6.png"></a>

        while (match.find()){
            String link = cs.subSequence(match.start(),match.end()).toString(); <a name="simple_7" href="processor.html#co_simple_7"><img border="0" alt="7" src="images/callouts/7.png"></a>
            curi.createAndAddLink(link, Link.SPECULATIVE_MISC, Link.NAVLINK_HOP);<a name="simple_8" href="processor.html#co_simple_8"><img border="0" alt="8" src="images/callouts/8.png"></a>
            numberOfLinksExtracted++; <a name="simple_9" href="processor.html#co_simple_9"><img border="0" alt="9" src="images/callouts/9.png"></a>
            System.out.println("SimpleExtractor: " + link); <a name="simple_10" href="processor.html#co_simple_10"><img border="0" alt="10" src="images/callouts/10.png"></a>
        }

        TextUtils.recycleMatcher(match); <a name="simple_11" href="processor.html#co_simple_11"><img border="0" alt="11" src="images/callouts/11.png"></a>
    }

    public String report() { <a name="simple_12" href="processor.html#co_simple_12"><img border="0" alt="12" src="images/callouts/12.png"></a>
        StringBuffer ret = new StringBuffer();
        ret.append("Processor: org.archive.crawler.extractor." +
            "SimpleExtractor\n");
        ret.append("  Function:          Example extractor\n");
        ret.append("  CrawlURIs handled: " + numberOfCURIsHandled + "\n");
        ret.append("  Links extracted:   " + numberOfLinksExtracted + "\n\n");

        return ret.toString();
    }
}</pre><div class="calloutlist"><table summary="Callout list" border="0">
<tr><td align="left" valign="top" width="5%"><a name="co_simple_1"></a><a href="#simple_1"><img border="0" alt="1" src="images/callouts/1.png"></a> </td><td align="left" valign="top"><p>The constructor. As with any Heritrix module, it sets up the processor's name, description and configurable parameters. In this case the only configurable parameter is the regular expression that will be used to find links. Both a name and a default value are provided for this parameter. It is also marked as an expert setting.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_2"></a><a href="#simple_2"><img border="0" alt="2" src="images/callouts/2.png"></a> </td><td align="left" valign="top"><p>Check whether the URI was fetched via an HTTP transaction. If not, it is probably a DNS lookup or was not fetched. Either way, regular link extraction is not possible.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_3"></a><a href="#simple_3"><img border="0" alt="3" src="images/callouts/3.png"></a> </td><td align="left" valign="top"><p>If we get this far, we have a URI that the processor will try to extract links from. Increment the URI counter by one.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_4"></a><a href="#simple_4"><img border="0" alt="4" src="images/callouts/4.png"></a> </td><td align="left" valign="top"><p>Get the ReplayCharSequence.
Regular expressions can be applied to it directly.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_5"></a><a href="#simple_5"><img border="0" alt="5" src="images/callouts/5.png"></a> </td><td align="left" valign="top"><p>Look up the regular expression to use. If the attribute is not found, we fall back to the default value.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_6"></a><a href="#simple_6"><img border="0" alt="6" src="images/callouts/6.png"></a> </td><td align="left" valign="top"><p>Apply the regular expression. We use the <a href="http://crawler.archive.org/apidocs/org/archive/util/TextUtils.html#getMatcher(java.lang.String,%20java.lang.CharSequence)" target="_top">TextUtils.getMatcher()</a> utility method for performance reasons.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_7"></a><a href="#simple_7"><img border="0" alt="7" src="images/callouts/7.png"></a> </td><td align="left" valign="top"><p>Extract a link matched by the regular expression from the character sequence and store it as a string.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_8"></a><a href="#simple_8"><img border="0" alt="8" src="images/callouts/8.png"></a> </td><td align="left" valign="top"><p>Add the discovered link to the collection of regular links extracted from the current URI.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_9"></a><a href="#simple_9"><img border="0" alt="9" src="images/callouts/9.png"></a> </td><td align="left" valign="top"><p>Record that another link was discovered.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_10"></a><a href="#simple_10"><img border="0" alt="10" src="images/callouts/10.png"></a> </td><td align="left" valign="top"><p>This is a handy debug line that prints each extracted link to the standard output.
You would not want this in production code.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_11"></a><a href="#simple_11"><img border="0" alt="11" src="images/callouts/11.png"></a> </td><td align="left" valign="top"><p>Free up the matcher object. This, too, is for performance. See the related <a href="http://crawler.archive.org/apidocs/org/archive/util/TextUtils.html#freeMatcher(java.util.regex.Matcher)" target="_top">javadoc</a>.</p></td></tr>
<tr><td align="left" valign="top" width="5%"><a name="co_simple_12"></a><a href="#simple_12"><img border="0" alt="12" src="images/callouts/12.png"></a> </td><td align="left" valign="top"><p>The report states the name of the processor, its function, and the totals of how many URIs were handled and how many links were extracted. This is a fairly typical report for an extractor.</p></td></tr>
</table></div><p>Even though the example above is fairly simple, the processor nevertheless works as intended.</p></div>
<div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10665"></a>11.4.&nbsp;Things to keep in mind when writing a processor</h3></div></div></div><p></p>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10669"></a>11.4.1.&nbsp;Interruptions</h4></div></div></div><p>Classes extending Processor should not trap InterruptedExceptions.</p><p>InterruptedExceptions should be allowed to propagate to the ToeThread executing the processor.</p><p>Processors should also exit their main method (<code class="literal">innerProcess()</code>) immediately if the <code class="literal">interrupted</code> flag is set.</p></div>
<div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1067A"></a>11.4.2.&nbsp;One processor, many threads</h4></div></div></div><p>Only one instance of each processor is created per crawl.
Because multiple threads run concurrently, processors must be written carefully so that no conflicts arise. This usually means that class variables cannot be used for anything other than gathering incremental statistics and data.</p><p>There is a facility for having an instance per thread, but it has not been tested and will not be covered in this document.</p></div></div></div><div class="navfooter"><hr><table summary="Navigation footer" width="100%"><tr><td align="left" width="40%"><a accesskey="p" href="scope.html">Prev</a>&nbsp;</td><td align="center" width="20%">&nbsp;</td><td align="right" width="40%">&nbsp;<a accesskey="n" href="statistics.html">Next</a></td></tr><tr><td valign="top" align="left" width="40%">10.&nbsp;Writing a Scope&nbsp;</td><td align="center" width="20%"><a accesskey="h" href="index.html">Home</a></td><td valign="top" align="right" width="40%">&nbsp;12.&nbsp;Writing a Statistics Tracker</td></tr></table></div></body></html>