<?xml version="1.0" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<!-- saved from url=(0017)http://localhost/ -->
<script language="JavaScript" src="../../displayToc.js"></script>
<script language="JavaScript" src="../../tocParas.js"></script>
<script language="JavaScript" src="../../tocTab.js"></script>
<link rel="stylesheet" type="text/css" href="../../scineplex.css">
<title>WWW::RobotRules - database of robots.txt-derived permissions</title>
<link rel="stylesheet" href="../../Active.css" type="text/css" />
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body>
<script>writelinks('__top__',2);</script>
<h1><a>WWW::RobotRules - database of robots.txt-derived permissions</a></h1>
<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<li><a href="#robots_txt">ROBOTS.TXT</a></li>
<li><a href="#robots_txt_examples">ROBOTS.TXT EXAMPLES</a></li>
<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<!-- INDEX END -->
<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>WWW::RobotRules - database of robots.txt-derived permissions</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
<span class="keyword">use</span> <span class="variable">WWW::RobotRules</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$rules</span> <span class="operator">=</span> <span class="variable">WWW::RobotRules</span><span class="operator">-></span><span class="variable">new</span><span class="operator">(</span><span class="string">'MOMspider/1.0'</span><span class="operator">);</span>
</pre>
<pre>
<span class="keyword">use</span> <span class="variable">LWP::Simple</span> <span class="string">qw(get)</span><span class="operator">;</span>
</pre>
<pre>
<span class="operator">{</span>
  <span class="keyword">my</span> <span class="variable">$url</span> <span class="operator">=</span> <span class="string">"http://some.place/robots.txt"</span><span class="operator">;</span>
  <span class="keyword">my</span> <span class="variable">$robots_txt</span> <span class="operator">=</span> <span class="variable">get</span> <span class="variable">$url</span><span class="operator">;</span>
  <span class="variable">$rules</span><span class="operator">-></span><span class="variable">parse</span><span class="operator">(</span><span class="variable">$url</span><span class="operator">,</span> <span class="variable">$robots_txt</span><span class="operator">)</span> <span class="keyword">if</span> <span class="keyword">defined</span> <span class="variable">$robots_txt</span><span class="operator">;</span>
<span class="operator">}</span>
</pre>
<pre>
<span class="operator">{</span>
  <span class="keyword">my</span> <span class="variable">$url</span> <span class="operator">=</span> <span class="string">"http://some.other.place/robots.txt"</span><span class="operator">;</span>
  <span class="keyword">my</span> <span class="variable">$robots_txt</span> <span class="operator">=</span> <span class="variable">get</span> <span class="variable">$url</span><span class="operator">;</span>
  <span class="variable">$rules</span><span class="operator">-></span><span class="variable">parse</span><span class="operator">(</span><span class="variable">$url</span><span class="operator">,</span> <span class="variable">$robots_txt</span><span class="operator">)</span> <span class="keyword">if</span> <span class="keyword">defined</span> <span class="variable">$robots_txt</span><span class="operator">;</span>
<span class="operator">}</span>
</pre>
<pre>
<span class="comment"># Now we can check if a URL is valid for those servers</span>
<span class="comment"># whose "robots.txt" files we've gotten and parsed:</span>
<span class="keyword">if</span><span class="operator">(</span><span class="variable">$rules</span><span class="operator">-></span><span class="variable">allowed</span><span class="operator">(</span><span class="variable">$url</span><span class="operator">))</span> <span class="operator">{</span>
  <span class="variable">$c</span> <span class="operator">=</span> <span class="variable">get</span> <span class="variable">$url</span><span class="operator">;</span>
  <span class="operator">...</span>
<span class="operator">}</span>
</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>This module parses <em>/robots.txt</em> files as specified in
"A Standard for Robot Exclusion", at
&lt;http://www.robotstxt.org/wc/norobots.html&gt;.
Webmasters can use the <em>/robots.txt</em> file to forbid conforming
robots from accessing parts of their web site.</p>
<p>The parsed files are kept in a WWW::RobotRules object, and this object
provides methods to check if access to a given URL is prohibited. The
same WWW::RobotRules object can be used for one or more parsed
<em>/robots.txt</em> files on any number of hosts.</p>
<p>The following methods are provided:</p>
<dl>
<dt><strong><a name="item_new">$rules = WWW::RobotRules-><code>new($robot_name)</code></a></strong></dt>
<dd>
<p>This is the constructor for WWW::RobotRules objects. The first
argument given to <a href="#item_new"><code>new()</code></a> is the name of the robot.</p>
</dd>
<dt><strong><a name="item_parse">$rules->parse($robot_txt_url, $content, $fresh_until)</a></strong></dt>
<dd>
<p>The <a href="#item_parse"><code>parse()</code></a> method takes as arguments the URL that was used to
retrieve the <em>/robots.txt</em> file, and the contents of the file. The
optional third argument, <code>$fresh_until</code>, is the time (in seconds since
the epoch) until which the parsed rules are considered fresh.</p>
</dd>
<dt><strong><a name="item_allowed">$rules-><code>allowed($uri)</code></a></strong></dt>
<dd>
<p>Returns TRUE if this robot is allowed to retrieve this URL.</p>
</dd>
<dt><strong><a name="item_agent">$rules-><code>agent([$name])</code></a></strong></dt>
<dd>
<p>Get/set the agent name. NOTE: Changing the agent name clears the cached
robots.txt rules and their expiry times.</p>
</dd>
</dl>
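<p>The four methods above can be exercised together without any network
access. The following is a minimal sketch, not part of the module's own
documentation: the robot names, the host <em>www.example.com</em> and the inline
file content are hypothetical, and the one-hour value passed as
<code>$fresh_until</code> is an arbitrary choice.</p>
<pre>
use strict;
use warnings;
use WWW::RobotRules;

# 'ExampleBot/1.0' and www.example.com are hypothetical names.
my $rules = WWW::RobotRules->new('ExampleBot/1.0');

# Inline robots.txt content, so the sketch needs no network access.
my $robots_txt = join "\n",
    'User-agent: *',
    'Disallow: /private/',
    '';

# Assumption: $fresh_until is an epoch time; keep these rules for one hour.
my $fresh_until = time + 3600;
$rules->parse('http://www.example.com/robots.txt', $robots_txt, $fresh_until);

my $open   = $rules->allowed('http://www.example.com/index.html');
my $closed = $rules->allowed('http://www.example.com/private/data.html');
print "index.html: ",   ($open   ? 'allowed' : 'disallowed'), "\n";
print "private page: ", ($closed ? 'allowed' : 'disallowed'), "\n";

# Setting a different agent name discards the rules parsed so far.
$rules->agent('OtherBot/2.0');
my $after = $rules->allowed('http://www.example.com/private/data.html');
print "after agent change: $after\n";
</pre>
<p>As an assumption drawn from the module source rather than from the text
above, the last call is expected to return -1, which signals that no fresh
rules are held for that host. Because -1 is a true value in Perl, a caller
that needs to tell "allowed" apart from "rules unknown" would have to compare
the return value numerically instead of testing it for truth.</p>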
<p>
</p>
<hr />
<h1><a name="robots_txt">ROBOTS.TXT</a></h1>
<p>The format and semantics of the "/robots.txt" file are as follows
(this is an edited abstract of
&lt;http://www.robotstxt.org/wc/norobots.html&gt; ):</p>
<p>The file consists of one or more records separated by one or more
blank lines. Each record contains lines of the form</p>
<pre>
&lt;field-name&gt;: &lt;value&gt;</pre>
<p>The field name is case insensitive. Text after the '#' character on a
line is ignored during parsing. This is used for comments. The
following &lt;field-names&gt; can be used:</p>
<dl>
<dt><strong><a name="item_user_2dagent">User-Agent</a></strong></dt>
<dd>
<p>The value of this field is the name of the robot the record is
describing access policy for. If more than one <em>User-Agent</em> field is
present, the record describes an identical access policy for more than
one robot. At least one <em>User-Agent</em> field must be present per record. If the
value is '*', the record describes the default access policy for any
robot that has not matched any of the other records.</p>
</dd>
<dd>
<p>The <em>User-Agent</em> fields must occur before the <em>Disallow</em> fields. If a
record contains a <em>User-Agent</em> field after a <em>Disallow</em> field, that
constitutes a malformed record. This parser will assume that a blank
line should have been placed before that <em>User-Agent</em> field, and will
break the record into two. All the fields before the <em>User-Agent</em> field
will constitute a record, and the <em>User-Agent</em> field will be the first
field in a new record.</p>
</dd>
<dt><strong><a name="item_disallow">Disallow</a></strong></dt>
<dd>
<p>The value of this field specifies a partial URL that is not to be
visited. This can be a full path, or a partial path; any URL that
starts with this value will not be retrieved. An empty value means
that all URLs may be retrieved.</p>
</dd>
</dl>
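<p>The line format and the prefix rule described above can be illustrated
with a few lines of plain Perl. This is only a sketch of the stated rules
(case-insensitive field names, '#' comments, "starts with" matching); the
helper names <code>parse_line</code> and <code>is_disallowed</code> are hypothetical and
this is not the parser that WWW::RobotRules itself uses.</p>
<pre>
use strict;
use warnings;

# Split one robots.txt line into a lower-cased field name and its value.
# Returns an empty list for blank or comment-only lines.
sub parse_line {
    my ($line) = @_;
    $line =~ s/\s*#.*//;    # text after '#' is a comment
    return unless $line =~ /^\s*([^:\s]+)\s*:\s*(.*?)\s*$/;
    return (lc $1, $2);
}

# True when $path starts with the Disallow value $rule.
# An empty rule disallows nothing.
sub is_disallowed {
    my ($path, $rule) = @_;
    return 0 unless length $rule;
    return index($path, $rule) == 0 ? 1 : 0;
}

my ($field, $value) = parse_line('DISALLOW: /tmp/   # these will soon disappear');
print "$field => $value\n";                                         # disallow => /tmp/
print is_disallowed('/tmp/cache.html', $value) ? "blocked\n" : "ok\n";   # blocked
print is_disallowed('/index.html',     $value) ? "blocked\n" : "ok\n";   # ok
</pre>
<p>Matching with <code>index()</code> rather than a regular expression keeps the
Disallow value literal, so characters such as '.' or '?' in a path are not
treated as pattern syntax.</p>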
<p>
</p>
<hr />
<h1><a name="robots_txt_examples">ROBOTS.TXT EXAMPLES</a></h1>
<p>The following example "/robots.txt" file specifies that no robots
should visit any URL starting with "/cyberworld/map/" or "/tmp/":</p>
<pre>
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear</pre>
<p>This example "/robots.txt" file specifies that no robots should visit
any URL starting with "/cyberworld/map/", except the robot called
"cybermapper":</p>
<pre>
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space</pre>
<pre>
# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:</pre>
<p>This example indicates that no robots should visit this site further:</p>
<pre>
# go away
User-agent: *
Disallow: /</pre>
<p>This is an example of a malformed robots.txt file.</p>
<pre>
# robots.txt for ancientcastle.example.com
# I've locked myself away.
User-agent: *
Disallow: /
# The castle is your home now, so you can go anywhere you like.
User-agent: Belle
Disallow: /west-wing/ # except the west wing!
# It's good to be the Prince...
User-agent: Beast
Disallow:</pre>
<p>This file is missing the required blank lines between records.
However, the intention is clear.</p>
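<p>How the parser copes with this malformed file can be checked directly.
The sketch below feeds the file above to three robots; "Gaston" is a
hypothetical robot name that no record mentions, and the two test paths are
made up for illustration.</p>
<pre>
use strict;
use warnings;
use WWW::RobotRules;

# The malformed file from above, with no blank lines between records.
my $robots_txt = q{
# robots.txt for ancientcastle.example.com
# I've locked myself away.
User-agent: *
Disallow: /
# The castle is your home now, so you can go anywhere you like.
User-agent: Belle
Disallow: /west-wing/ # except the west wing!
# It's good to be the Prince...
User-agent: Beast
Disallow:
};

my $base = 'http://ancientcastle.example.com';
my %result;

# A separate rules object per robot, since the rules depend on the agent name.
for my $robot ('Belle', 'Beast', 'Gaston') {
    my $rules = WWW::RobotRules->new($robot);
    $rules->parse("$base/robots.txt", $robots_txt);
    for my $path ('/library/', '/west-wing/rose.html') {
        $result{$robot}{$path} = $rules->allowed("$base$path") ? 'allowed' : 'disallowed';
        print "$robot $path: $result{$robot}{$path}\n";
    }
}
</pre>
<p>If the file is split into three records as described, Belle should be
refused only under "/west-wing/", Beast should be allowed everywhere, and
Gaston should fall back to the '*' record and be refused everywhere.</p>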
<p>
</p>
<hr />
<h1><a name="see_also">SEE ALSO</a></h1>
<p><a href="../../lib/LWP/RobotUA.html">the LWP::RobotUA manpage</a>, <a href="../../lib/WWW/RobotRules/AnyDBM_File.html">the WWW::RobotRules::AnyDBM_File manpage</a></p>
</body>
</html>