
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html">
   <title>Larbin : Crawling the web, such is my passion</title>
</head>
<body bgcolor="#FFFFFF">
<center><font color="#FF0000"><h1>How to customize Larbin</h1></font></center>

<h2>Where do the pages arrive ?</h2>
Every time a page is fetched from the web, one of the following
functions is called (they are in <a href="../src/xinterf/output.cc">src/xinterf/output.cc</a>) :
<ul>
<li>void loaded (html *page) : This function is called when the
fetch ended with success. From the page object, you can
<ul>
<li>get the url of the page by calling the method getUrl()
<li>get the content of the page by calling the method getPage()
<li>get the list of the links found in the page (its sons) by calling the method getLinks()
<li>get the http headers by calling the method getHeaders()
</ul>
For more details, see
<a href="../include/xfetcher/file.h">include/xfetcher/file.h</a>
(for html),
<a href="../include/xutils/url.h">include/xutils/url.h</a>,
<a href="../include/xutils/string.h">include/xutils/string.h</a>,
<a href="../include/xutils/Vector.h">include/xutils/Vector.h</a>.
<li>void fetchFailInteresting (url *u, FetchError reason) : This function is
called when the fetch ended with an error, but the page has the right
MIME type (only called with specificSearch). u describes the url of the
page. A description of its class can be found in
<a href="../include/xutils/url.h">include/xutils/url.h</a>.
reason explains why the fetch failed. enum FetchError is defined in
<a href="../include/types.h">include/types.h</a>.
<li>void fetchFail (url *u, FetchError reason) : This function is
called when the fetch ended with an error. u describes the url of the
page. A description of its class can be found in
<a href="../include/xutils/url.h">include/xutils/url.h</a>.
reason explains why the fetch failed. 
enum FetchError is defined in <a href="../include/types.h">include/types.h</a>.
</ul>

<h2>Simple customizations</h2>

<h4><a href="../larbin.conf">larbin.conf</a></h4>
The basic configuration is done in larbin.conf. Here are the
different fields of this file :
<ul>
<li>From : YOUR mail address, sent with the http headers. It is very useful when someone
wants to complain about the robot :-(
<li>UserAgent : name of the robot (sent with each request)
<li>httpPort : port on which the http statistics webserver is launched
(see http://localhost:8081/ when larbin is running)
<li>inputPort : port on which you can submit urls to fetch
(should not be &lt; 30 : it might be bad for the server)
<li>pagesConnexions : number of pages fetched in parallel (adjust it
to your network speed). Decrease this if you get too many
timeouts (see stats) : 5% seems ok, but not more.
<li>dnsConnexions : number of dns calls in parallel. 10 should be ok
<li>deapthInSite : how deep you want to go in a site
<li>waitDuration : time in seconds between 2 calls to the same server.
It should never be less than 30 s. Even a value of 60 s
does not reduce the overall speed of the crawler.
<li>proxy : if you want to connect through a proxy (host port)
<li>StartUrl : where the search starts
<li>limitToDomain : with this option enabled, you will only crawl
pages of some specific domains (.fr and .dk for example).
<li>specificSearch : this option allows you to look for a specific
kind of page (wap page, xml page, mp3...)
<li>forbiddenExtensions : the extensions you don't want
(write all of them and terminate your list with end)
</ul>

<h4><a href="../include/types.h">include/types.h</a></h4>
If you want to tune larbin a little more, go and see this file (it is
supposed to be commented enough). 
Of course, for those changes to take
effect, you have to recompile larbin.

<h2>More customizations</h2>
If you need something more, you'll have to go into the code (or ask me
to do so :-)).

<hr>
<table border=0 width="100%">
<tr>
<td><a HREF="mailto:sebastien.ailleret@inria.fr">sebastien.ailleret@inria.fr</a></td>
<td align="right"><a href="http://pauillac.inria.fr/~ailleret/index-eng.html">Home Page</a></td>
</tr>
</table>
</body>
</html>
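
The callbacks listed under "Where do the pages arrive ?" might be filled in along the following lines. This is only a sketch under stated assumptions: the names loaded, fetchFail, html, url, FetchError, getUrl(), getPage(), getLinks() and getHeaders() come from the page above, but the class layouts, return types, enum values, the str() accessor and the CrawlStats structure are hypothetical stand-ins so that the example compiles on its own. The real declarations are in include/xfetcher/file.h, include/xutils/url.h and include/types.h and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical enum values; the real FetchError is defined in include/types.h.
enum FetchError { fetchTimeout, fetchNoConnection };

// Stand-in for the url class of include/xutils/url.h (layout assumed).
class url {
public:
    explicit url(std::string text) : text_(std::move(text)) {}
    const std::string &str() const { return text_; }  // hypothetical accessor
private:
    std::string text_;
};

// Stand-in for the html class of include/xfetcher/file.h.
// Only the four method names are taken from the documentation;
// the return types are assumptions.
class html {
public:
    html(url here, std::string headers, std::string page,
         std::vector<std::string> links)
        : here_(std::move(here)), headers_(std::move(headers)),
          page_(std::move(page)), links_(std::move(links)) {}
    url *getUrl() { return &here_; }
    const std::string &getPage() const { return page_; }
    const std::string &getHeaders() const { return headers_; }
    std::vector<std::string> *getLinks() { return &links_; }
private:
    url here_;
    std::string headers_;
    std::string page_;
    std::vector<std::string> links_;
};

// Hypothetical bookkeeping updated by the callbacks.
struct CrawlStats {
    int pagesOk = 0;
    int pagesFailed = 0;
    std::size_t bytes = 0;
    std::size_t linksSeen = 0;
    std::string lastFailedUrl;
    FetchError lastReason = fetchTimeout;
};
CrawlStats stats;

// Called when a fetch ended with success: count the page, its size
// and the number of links found in it.
void loaded(html *page) {
    assert(page != nullptr);
    stats.pagesOk++;
    stats.bytes += page->getPage().size();
    stats.linksSeen += page->getLinks()->size();
}

// Called when a fetch ended with an error: remember which url failed and why.
void fetchFail(url *u, FetchError reason) {
    assert(u != nullptr);
    stats.pagesFailed++;
    stats.lastFailedUrl = u->str();
    stats.lastReason = reason;
}
```

In a real build these bodies would go into src/xinterf/output.cc in place of the stand-in classes, and larbin would need to be recompiled afterwards.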
