<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>6. Common needs for all configurable modules</title><link href="../docbook.css" rel="stylesheet" type="text/css"><meta content="DocBook XSL Stylesheets V1.67.2" name="generator"><link rel="start" href="index.html" title="Heritrix developer documentation"><link rel="up" href="index.html" title="Heritrix developer documentation"><link rel="prev" href="ar01s05.html" title="5. Settings"><link rel="next" href="ar01s07.html" title="7. Some notes on the URI classes"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table summary="Navigation header" width="100%"><tr><th align="center" colspan="3">6. Common needs for all configurable modules</th></tr><tr><td align="left" width="20%"><a accesskey="p" href="ar01s05.html">Prev</a> </td><th align="center" width="60%"> </th><td align="right" width="20%"> <a accesskey="n" href="ar01s07.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="chap_modules_common"></a>6. Common needs for all configurable modules</h2></div></div></div><p>As mentioned earlier, all configurable modules in Heritrix subclass ComplexType (or one of its descendants). When you write your own module, you should inherit from <a href="http://crawler.archive.org/apidocs/org/archive/crawler/settings/ModuleType.html" target="_top">ModuleType</a>, which is a subclass of ComplexType intended to be subclassed by all modules in Heritrix.</p><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N10279"></a>6.1. Definition of a module</h3></div></div></div><p>Heritrix knows how to handle a ComplexType and how to get the information it needs to render the user interface for it. 
To make this happen, your module has to obey a few rules.</p><div class="orderedlist"><ol type="1"><li><p>A module should always implement a constructor taking exactly one argument - the name argument (<a href="http://crawler.archive.org/apidocs/org/archive/crawler/settings/ModuleType.html#ModuleType(java.lang.String)" target="_top">see ModuleType(String name)</a>).</p></li><li><p>All attributes you want to be configurable should be defined in the constructor of the module.</p></li></ol></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N1028B"></a>6.1.1. The obligatory one argument constructor</h4></div></div></div><p>All modules need to have a constructor taking a String argument. This string is used to identify the module. When a module replaces an existing module of which there can be only one, it is important that the same name is used. In this case the constructor may ignore the name string and substitute a hard-coded one. This is the case with the Frontier, for example. The name of the Frontier should always be the string "frontier". For this reason the Frontier interface that all Frontiers should implement has a static variable: <pre class="programlisting">public static final String ATTR_NAME = "frontier";</pre> which implementations of the Frontier use instead of the string argument submitted to the constructor. Here is the part of the default Frontier's constructor that shows how this should be done. <pre class="programlisting">public Frontier(String name) {
    //The 'name' of all frontiers should be the same (Frontier.ATTR_NAME)
    //therefore we'll ignore the supplied parameter.
    super(Frontier.ATTR_NAME, "HostQueuesFrontier. Maintains the internal" +
        " state of the crawl. It dictates the order in which URIs" +
        " will be scheduled. \nThis frontier is mostly a breadth-first" +
        " frontier, which refrains from emitting more than one" +
        " CrawlURI of the same \'key\' (host) at once, and respects" +
        " minimum-delay and delay-factor specifications for" +
        " politeness.");</pre> As shown in this example, the constructor must call the superclass's constructor. This example also shows how to set the description of a module. The description is used by the user interface to guide the user in configuring the crawl. If you don't want to set a description (strongly discouraged), ModuleType also has a one-argument constructor taking just the name.</p></div><div class="sect3" lang="en"><div class="titlepage"><div><div><h4 class="title"><a name="N10298"></a>6.1.2. Defining attributes</h4></div></div></div><p>The attributes of a module that you want to be configurable must be defined in the module's constructor. For this purpose ComplexType has a method <a href="http://crawler.archive.org/apidocs/org/archive/crawler/settings/ComplexType.html#addElementToDefinition(org.archive.crawler.settings.Type)" target="_top">addElementToDefinition(Type type)</a>. The argument given to this method is a definition of the attribute. The <a href="http://crawler.archive.org/apidocs/org/archive/crawler/settings/Type.html" target="_top">Type</a> class is the superclass of all the attribute definitions allowed for a ModuleType. Since ComplexType, which ModuleType inherits from, is itself a subclass of Type, you can add new ModuleTypes as attributes to your module. The Type class implements configuration methods common to all Types that define an attribute on your module. The addElementToDefinition method returns the added Type so that it is easy to refine the configuration of the Type. 
Let's look at an example (also from the default Frontier) of an attribute definition.<pre class="programlisting">public final static String ATTR_MAX_OVERALL_BANDWIDTH_USAGE =
    "total-bandwidth-usage-KB-sec";
private final static Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE =
    new Integer(0);
...
Type t;
t = addElementToDefinition(
    new SimpleType(ATTR_MAX_OVERALL_BANDWIDTH_USAGE,
    "The maximum average bandwidth the crawler is allowed to use. " +
    "The actual readspeed is not affected by this setting, it only " +
    "holds back new URIs from being processed when the bandwidth " +
    "usage has been to high.\n0 means no bandwidth limitation.",
    DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE));
t.setOverrideable(false);</pre> Here we add an attribute definition of the SimpleType (which is a subclass of Type). The SimpleType's constructor takes three arguments: name, description and a default value. Usually the name and default value are defined as constants, as they are here, but this is optional. The line <span><strong class="command">t.setOverrideable(false);</strong></span> tells the settings framework not to allow overrides of this attribute. For a full list of methods for configuring a Type see the <a href="http://crawler.archive.org/apidocs/org/archive/crawler/settings/Type.html" target="_top">Type</a> class.</p></div></div><div class="sect2" lang="en"><div class="titlepage"><div><div><h3 class="title"><a name="N102B2"></a>6.2. Accessing attributes</h3></div></div></div><p>In most cases when the module needs to access its own attributes, a CrawlURI is available. To make sure that all overrides and refinements are taken into account, use the method <a href="http://crawler.archive.org/apidocs/org/archive/crawler/settings/ComplexType.html#getAttribute(java.lang.String,%20org.archive.crawler.datamodel.CrawlURI)" target="_top">getAttribute(String name, CrawlURI uri)</a> to get the attribute. 
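Why the URI matters for the lookup can be illustrated with a minimal, self-contained sketch. It does not use the Heritrix API: the class OverrideSketch, both attribute names and the host example.org are hypothetical, and java.util.Properties with a defaults chain merely stands in for the settings framework. <pre class="programlisting">import java.util.Properties;

public class OverrideSketch {
    // Hypothetical attribute names, used only for this illustration.
    public static final String ATTR_MAX_RETRIES = "max-retries";
    public static final String ATTR_RETRY_DELAY = "retry-delay-seconds";

    // Global settings: the values that apply when nothing overrides them.
    private final Properties global = new Properties();

    // Override for one hypothetical host. Properties consults its
    // defaults (here: global) for every key it does not define itself.
    private final Properties exampleOrg = new Properties(global);

    public OverrideSketch() {
        global.setProperty(ATTR_MAX_RETRIES, "30");
        global.setProperty(ATTR_RETRY_DELAY, "900");
        // Only max-retries is overridden for example.org.
        exampleOrg.setProperty(ATTR_MAX_RETRIES, "5");
    }

    // Stand-in for a context-aware lookup: the host of the URI being
    // processed selects which settings apply.
    public String getAttribute(String name, String host) {
        Properties settings = "example.org".equals(host) ? exampleOrg : global;
        return settings.getProperty(name);
    }

    public static void main(String[] args) {
        OverrideSketch module = new OverrideSketch();
        // Overridden for this host: prints 5
        System.out.println(module.getAttribute(ATTR_MAX_RETRIES, "example.org"));
        // Not overridden for this host, falls back to global: prints 900
        System.out.println(module.getAttribute(ATTR_RETRY_DELAY, "example.org"));
        // No override for this host at all: prints 30
        System.out.println(module.getAttribute(ATTR_MAX_RETRIES, "archive.org"));
    }
}</pre> In this sketch a lookup made without the host could only ever return the global value, which is the kind of mistake that passing the CrawlURI avoids. 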
Sometimes the context you are working in is defined by an object other than a CrawlURI. In that case, use the <a href="http://crawler.archive.org/apidocs/org/archive/crawler/settings/ComplexType.html#getAttribute(java.lang.Object,%20java.lang.String)" target="_top">getAttribute(Object context, String name)</a> method to get the value. This method tries its best to extract useful context information from the object. It checks whether the context is any kind of URI or a settings