<pre>
<span class="keyword">use</span> <span class="variable">URI</span><span class="operator">;</span>
<span class="variable">$abs</span> <span class="operator">=</span> <span class="variable">URI</span><span class="operator">-></span><span class="variable">new_abs</span><span class="operator">(</span><span class="variable">$maybe_relative</span><span class="operator">,</span> <span class="variable">$base</span><span class="operator">);</span>
</pre>
<p>For example, consider this program that matches URLs in the HTML
list of new modules in CPAN:</p>
<pre>
<span class="keyword">use</span> <span class="variable">strict</span><span class="operator">;</span>
<span class="keyword">use</span> <span class="variable">warnings</span><span class="operator">;</span>
<span class="keyword">use</span> <span class="variable">LWP</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$browser</span> <span class="operator">=</span> <span class="variable">LWP::UserAgent</span><span class="operator">-></span><span class="variable">new</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$url</span> <span class="operator">=</span> <span class="string">'http://www.cpan.org/RECENT.html'</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$response</span> <span class="operator">=</span> <span class="variable">$browser</span><span class="operator">-></span><span class="variable">get</span><span class="operator">(</span><span class="variable">$url</span><span class="operator">);</span>
<span class="keyword">die</span> <span class="string">"Can't get $url -- "</span><span class="operator">,</span> <span class="variable">$response</span><span class="operator">-></span><span class="variable">status_line</span>
<span class="keyword">unless</span> <span class="variable">$response</span><span class="operator">-></span><span class="variable">is_success</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$html</span> <span class="operator">=</span> <span class="variable">$response</span><span class="operator">-></span><span class="variable">decoded_content</span><span class="operator">;</span>
<span class="keyword">while</span><span class="operator">(</span> <span class="variable">$html</span> <span class="operator">=~</span> <span class="regex">m/<A HREF=\"(.*?)\"/g</span> <span class="operator">)</span> <span class="operator">{</span>
<span class="keyword">print</span> <span class="string">"$1\n"</span><span class="operator">;</span>
<span class="operator">}</span>
</pre>
<p>When run, it emits output that starts out something like this:</p>
<pre>
MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...</pre>
<p>However, if you actually want to have those be absolute URLs, you
can use the URI module's <code>new_abs</code> method, by changing the <code>while</code>
loop to this:</p>
<pre>
<span class="keyword">while</span><span class="operator">(</span> <span class="variable">$html</span> <span class="operator">=~</span> <span class="regex">m/<A HREF=\"(.*?)\"/g</span> <span class="operator">)</span> <span class="operator">{</span>
<span class="keyword">print</span> <span class="variable">URI</span><span class="operator">-></span><span class="variable">new_abs</span><span class="operator">(</span> <span class="variable">$1</span><span class="operator">,</span> <span class="variable">$response</span><span class="operator">-></span><span class="variable">base</span> <span class="operator">)</span> <span class="operator">,</span><span class="string">"\n"</span><span class="operator">;</span>
<span class="operator">}</span>
</pre>
<p>(The <code>$response->base</code> method from <a href="../lib/HTTP/Message.html">HTTP::Message</a>
returns the URL that should be used for resolving relative URLs; it's
usually just the same as the URL that you requested.)</p>
<p>That program then emits nicely absolute URLs:</p>
<pre>
<a href="http://www.cpan.org/MIRRORING.FROM">http://www.cpan.org/MIRRORING.FROM</a>
<a href="http://www.cpan.org/RECENT">http://www.cpan.org/RECENT</a>
<a href="http://www.cpan.org/RECENT.html">http://www.cpan.org/RECENT.html</a>
<a href="http://www.cpan.org/authors/00whois.html">http://www.cpan.org/authors/00whois.html</a>
<a href="http://www.cpan.org/authors/01mailrc.txt.gz">http://www.cpan.org/authors/01mailrc.txt.gz</a>
<a href="http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS">http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS</a>
...</pre>
<p>See chapter 4 of <em>Perl & LWP</em> for a longer discussion of URI objects.</p>
<p>Of course, using a regexp to match hrefs is a bit simplistic. For
more robust programs, you'll probably want to use an HTML-parsing module
like <a href="../lib/HTML/LinkExtor.html">the HTML::LinkExtor manpage</a> or <a href="../lib/HTML/TokeParser.html">the HTML::TokeParser manpage</a> or maybe even
<a href="../lib/HTML/TreeBuilder.html">the HTML::TreeBuilder manpage</a>.</p>
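<p>As a rough sketch of the parser-based approach, the following uses
HTML::LinkExtor with a callback and a base URL. The <code>$html</code> and
<code>$base</code> values here are made-up sample input; in the program above
they would come from <code>$response->decoded_content</code> and
<code>$response->base</code>.</p>
<pre>
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical sample input, standing in for a fetched page.
my $base = 'http://www.cpan.org/RECENT.html';
my $html = '<a href="authors/00whois.html">who</a> <A HREF="RECENT">recent</A>';

my @urls;
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attr) = @_;
        # Tag and attribute names arrive lowercased, whatever the source used.
        push @urls, "$attr{href}" if $tag eq 'a' and defined $attr{href};
    },
    $base,    # with a base given, links are reported as absolute URLs
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @urls;
</pre>
<p>Unlike the regexp, this handles both upper- and lowercase markup, and it
resolves relative links without a separate call to <code>new_abs</code>.</p>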
<p>
</p>
<h2><a name="other_browser_attributes">Other Browser Attributes</a></h2>
<p>LWP::UserAgent objects have many attributes for controlling how they
work. Here are a few notable ones:</p>
<ul>
<li>
<p><code>$browser->timeout(15);</code></p>
<p>This sets this browser object to give up on requests that don't answer
within 15 seconds.</p>
</li>
<li>
<p><code>$browser->protocols_allowed( [ 'http', 'gopher'] );</code></p>
<p>This sets this browser object to not speak any protocols other than HTTP
and gopher. If it tries accessing any other kind of URL (like an "ftp:"
or "mailto:" or "news:" URL), then it won't actually try connecting, but
instead will immediately return an error code 500, with a message like
"Access to 'ftp' URIs has been disabled".</p>
</li>
<li>
<p><code>use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new());</code></p>
<p>This tells the browser object to try using the HTTP/1.1 "Keep-Alive"
feature, which speeds up requests by reusing the same socket connection
for multiple requests to the same server.</p>
</li>
<li>
<p><code>$browser->agent( 'SomeName/1.23 (more info here maybe)' )</code></p>
<p>This changes how the browser object identifies itself in
the default "User-Agent" line of its HTTP requests. By default,
it sends "libwww-perl/<em>versionnumber</em>", like
"libwww-perl/5.65". You can change that to something more descriptive
like this:</p>
<pre>
<span class="variable">$browser</span><span class="operator">-></span><span class="variable">agent</span><span class="operator">(</span> <span class="string">'SomeName/3.14 (contact@robotplexus.int)'</span> <span class="operator">);</span>
</pre>
<p>Or if need be, you can go in disguise, like this:</p>
<pre>
<span class="variable">$browser</span><span class="operator">-></span><span class="variable">agent</span><span class="operator">(</span> <span class="string">'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)'</span> <span class="operator">);</span>
</pre>
</li>
<li>
<p><code>push @{ $browser->requests_redirectable }, 'POST';</code></p>
<p>This tells this browser to obey redirection responses to POST requests
(like most modern interactive browsers), even though the HTTP RFC says
that should not normally be done.</p>
</li>
</ul>
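<p>Putting the attributes above together, a browser object might be
configured roughly as follows. This is only a sketch: the agent string is
the placeholder from the list above, and <code>ftp.example.int</code> is a
made-up host used to show a disallowed scheme being refused.</p>
<pre>
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new;
$browser->timeout(15);                                  # give up after 15 seconds
$browser->protocols_allowed( [ 'http', 'gopher' ] );    # refuse other schemes
$browser->conn_cache( LWP::ConnCache->new );            # reuse connections
$browser->agent('SomeName/1.23 (more info here maybe)');
push @{ $browser->requests_redirectable }, 'POST';      # follow POST redirects

# Each setter doubles as a getter when called without arguments.
print "timeout:      ", $browser->timeout, "\n";
print "agent:        ", $browser->agent, "\n";
print "protocols:    @{ $browser->protocols_allowed }\n";
print "redirectable: @{ $browser->requests_redirectable }\n";

# ftp is not in the allowed list, so this is refused without connecting.
my $response = $browser->get('ftp://ftp.example.int/');
print "ftp attempt:  ", $response->status_line, "\n";
</pre>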
<p>For more options and information, see <a href="../lib/LWP/UserAgent.html">the full documentation for LWP::UserAgent</a>.</p>
<p>
</p>
<h2><a name="writing_polite_robots">Writing Polite Robots</a></h2>
<p>If you want to make sure that your LWP-based program respects <em>robots.txt</em>
files and doesn't make too many requests too fast, you can use the LWP::RobotUA
class instead of the LWP::UserAgent class.</p>
<p>The LWP::RobotUA class works just like LWP::UserAgent, and you can use it like so:</p>
<pre>
<span class="keyword">use</span> <span class="variable">LWP::RobotUA</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$browser</span> <span class="operator">=</span> <span class="variable">LWP::RobotUA</span><span class="operator">-></span><span class="variable">new</span><span class="operator">(</span><span class="string">'YourSuperBot/1.34'</span><span class="operator">,</span> <span class="string">'you@yoursite.com'</span><span class="operator">);</span>
<span class="comment"># Your bot's name and your email address</span>
</pre>
<pre>
<span class="keyword">my</span> <span class="variable">$response</span> <span class="operator">=</span> <span class="variable">$browser</span><span class="operator">-></span><span class="variable">get</span><span class="operator">(</span><span class="variable">$url</span><span class="operator">);</span>
</pre>
<p>But LWP::RobotUA adds these features:</p>
<ul>
<li>
<p>If the <em>robots.txt</em> on <code>$url</code>'s server forbids you from accessing
<code>$url</code>, then the <code>$browser</code> object (assuming it's of class LWP::RobotUA)
won't actually request it, but instead will give you back (in <code>$response</code>) a 403 error
with a message "Forbidden by robots.txt". That is, if you have this line:</p>
<pre>
<span class="keyword">die</span> <span class="string">"$url -- "</span><span class="operator">,</span> <span class="variable">$response</span><span class="operator">-></span><span class="variable">status_line</span><span class="operator">,</span> <span class="string">"\nAborted"</span>
<span class="keyword">unless</span> <span class="variable">$response</span><span class="operator">-></span><span class="variable">is_success</span><span class="operator">;</span>
</pre>
<p>then the program would die with an error message like this:</p>
<pre>
<a href="http://whatever.site.int/pith/x.html">http://whatever.site.int/pith/x.html</a> -- 403 Forbidden by robots.txt
Aborted at whateverprogram.pl line 1234</pre>
</li>
<li>
<p>If this <code>$browser</code> object sees that the last time it talked to
<code>$url</code>'s server was too recent, then it will pause (via <a href="../lib/Pod/perlfunc.html#item_sleep"><code>sleep</code></a>) to
avoid making too many requests too often. By default it pauses for
one minute, but you can control that with the
<code>$browser->delay( <em>minutes</em> )</code> attribute.</p>
<p>For example, this code:</p>
<pre>
<span class="variable">$browser</span><span class="operator">-></span><span class="variable">delay</span><span class="operator">(</span> <span class="number">7</span><span class="operator">/</span><span class="number">60</span> <span class="operator">);</span>
</pre>
<p>...means that this browser will pause when it needs to avoid talking to
any given server more than once every 7 seconds.</p>
</li>
</ul>
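<p>The two features above can be seen together in a small fetch loop
like the one below. It is a minimal sketch: the bot name and address are
the placeholders used earlier, and the URLs are assumed to come from the
command line (with none given, nothing is fetched).</p>
<pre>
use strict;
use warnings;
use LWP::RobotUA;

# Bot name and contact address are placeholders.
my $browser = LWP::RobotUA->new('YourSuperBot/1.34', 'you@yoursite.com');
$browser->delay( 7/60 );    # delay is in minutes; this is 7 seconds

for my $url (@ARGV) {
    # get() sleeps first if this server was contacted too recently.
    my $response = $browser->get($url);
    if ($response->is_success) {
        printf "%s -- %d bytes\n", $url, length $response->content;
    }
    else {
        # This branch also covers "403 Forbidden by robots.txt".
        warn "$url -- ", $response->status_line, "\n";
    }
}
</pre>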
<p>For more options and information, see <a href="../lib/LWP/RobotUA.html">the full documentation for LWP::RobotUA</a>.</p>
<p>
</p>
<h2><a name="using_proxies">Using Proxies</a></h2>
<p>In some cases, you will want to (or will have to) use proxies for
accessing certain sites and/or using certain protocols. This is most
commonly the case when your LWP program is running (or could be running)
on a machine that is behind a firewall.</p>
<p>To make a browser object use proxies that are defined in the usual
environment variables (<code>HTTP_PROXY</code>, etc.), just call the <code>env_proxy</code>
method on a user-agent object before you make any requests with it.
Specifically:</p>
<pre>
<span class="keyword">use</span> <span class="variable">LWP::UserAgent</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$browser</span> <span class="operator">=</span> <span class="variable">LWP::UserAgent</span><span class="operator">-></span><span class="variable">new</span><span class="operator">;</span>
<span class="comment"># And before you go making any requests:</span>
<span class="variable">$browser</span><span class="operator">-></span><span class="variable">env_proxy</span><span class="operator">;</span>
</pre>
<p>For more information on proxy parameters, see <a href="../lib/LWP/UserAgent.html">the LWP::UserAgent documentation</a>, specifically the <code>proxy</code>, <code>env_proxy</code>,
and <code>no_proxy</code> methods.</p>
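<p>If the environment variables are not set, the <code>proxy</code> and
<code>no_proxy</code> methods can be called directly. The sketch below
assumes a made-up proxy at <code>proxy.example.int</code> and made-up
domains to exclude; substitute your own.</p>
<pre>
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Hypothetical proxy address and domains, for illustration only.
$browser->proxy( [ 'http', 'ftp' ], 'http://proxy.example.int:8001/' );
$browser->no_proxy( 'localhost', 'intranet.example.int' );

# Called with just a scheme, proxy() reports the current setting.
print "http proxy: ", $browser->proxy('http'), "\n";
</pre>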
<p>
</p>
<h2><a name="http_authentication">HTTP Authentication</a></h2>
<p>Many web sites restrict access to documents by using "HTTP
Authentication". This isn't just any form of "enter your password"
restriction, but is a specific mechanism where the HTTP server sends the
browser an HTTP code that says "That document is part of a protected
'realm', and you can access it only if you re-request it and add some
special authorization headers to your request".</p>
<p>For example, the Unicode.org admins stop email-harvesting bots from
harvesting the contents of their mailing list archives, by protecting
them with HTTP Authentication, and then publicly stating the username
and password (at <code>http://www.unicode.org/mail-arch/</code>) -- namely
username "unicode-ml" and password "unicode".</p>
<p>For example, consider this URL, which is part of the protected
area of the web site:</p>
<pre>