Or if need be, you can go in disguise, like this:
.Sp
.Vb 1
\&  $browser\->agent( \*(AqMozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)\*(Aq );
.Ve
.IP "\(bu" 4
\&\f(CW\*(C`push @{ $ua\->requests_redirectable }, \*(AqPOST\*(Aq;\*(C'\fR
.Sp
This tells this browser to obey redirection responses to \s-1POST\s0 requests
(like most modern interactive browsers), even though the \s-1HTTP\s0 \s-1RFC\s0 says
that should not normally be done.
.PP
For more options and information, see the full documentation for
LWP::UserAgent.
.Sh "Writing Polite Robots"
.IX Subsection "Writing Polite Robots"
If you want to make sure that your LWP-based program respects \fIrobots.txt\fR
files and doesn't make too many requests too fast, you can use the LWP::RobotUA
class instead of the LWP::UserAgent class.
.PP
The LWP::RobotUA class is just like LWP::UserAgent, and you can use it like so:
.PP
.Vb 3
\&  use LWP::RobotUA;
\&  my $browser = LWP::RobotUA\->new(\*(AqYourSuperBot/1.34\*(Aq, \*(Aqyou@yoursite.com\*(Aq);
\&    # Your bot\*(Aqs name and your email address
\&
\&  my $response = $browser\->get($url);
.Ve
.PP
But LWP::RobotUA adds these features:
.IP "\(bu" 4
If the \fIrobots.txt\fR on \f(CW$url\fR's server forbids you from accessing
\&\f(CW$url\fR, then the \f(CW$browser\fR object (assuming it's of class LWP::RobotUA)
won't actually request it, but instead will give you back (in \f(CW$response\fR) a 403 error
with a message \*(L"Forbidden by robots.txt\*(R".  That is, if you have this line:
.Sp
.Vb 2
\&  die "$url \-\- ", $response\->status_line, "\enAborted"
\&   unless $response\->is_success;
.Ve
.Sp
then the program would die with an error message like this:
.Sp
.Vb 2
\&  http://whatever.site.int/pith/x.html \-\- 403 Forbidden by robots.txt
\&  Aborted at whateverprogram.pl line 1234
.Ve
.IP "\(bu" 4
If this \f(CW$browser\fR object sees that the last time it talked to
\&\f(CW$url\fR's server was too recent, then it will pause (via \f(CW\*(C`sleep\*(C'\fR) to
avoid making too many requests too often.
By default it pauses for one minute, but you can control the interval
with the \f(CW\*(C`$browser\->delay( \f(CIminutes\f(CW )\*(C'\fR attribute.
.Sp
For example, this code:
.Sp
.Vb 1
\&  $browser\->delay( 7/60 );
.Ve
.Sp
\&...means that this browser will pause when it needs to avoid talking to
any given server more than once every 7 seconds.
.PP
For more options and information, see the full documentation for
LWP::RobotUA.
.Sh "Using Proxies"
.IX Subsection "Using Proxies"
In some cases, you will want to (or will have to) use proxies for
accessing certain sites and/or using certain protocols.  This is most
commonly the case when your \s-1LWP\s0 program is running (or could be running)
on a machine that is behind a firewall.
.PP
To make a browser object use proxies that are defined in the usual
environment variables (\f(CW\*(C`HTTP_PROXY\*(C'\fR, etc.), just call the \f(CW\*(C`env_proxy\*(C'\fR
method on a user-agent object before you make any requests with it.
Specifically:
.PP
.Vb 2
\&  use LWP::UserAgent;
\&  my $browser = LWP::UserAgent\->new;
\&
\&  # And before you go making any requests:
\&  $browser\->env_proxy;
.Ve
.PP
For more information on proxy parameters, see the LWP::UserAgent
documentation, specifically the \f(CW\*(C`proxy\*(C'\fR, \f(CW\*(C`env_proxy\*(C'\fR,
and \f(CW\*(C`no_proxy\*(C'\fR methods.
.Sh "\s-1HTTP\s0 Authentication"
.IX Subsection "HTTP Authentication"
Many web sites restrict access to documents by using \*(L"\s-1HTTP\s0
Authentication\*(R".
This isn't just any form of \*(L"enter your password\*(R"
restriction, but is a specific mechanism where the \s-1HTTP\s0 server sends the
browser an \s-1HTTP\s0 code that says \*(L"That document is part of a protected
\&'realm', and you can access it only if you re-request it and add some
special authorization headers to your request\*(R".
.PP
For example, the Unicode.org admins stop email-harvesting bots from
harvesting the contents of their mailing list archives, by protecting
them with \s-1HTTP\s0 Authentication, and then publicly stating the username
and password (at \f(CW\*(C`http://www.unicode.org/mail\-arch/\*(C'\fR) \*(-- namely
username \*(L"unicode-ml\*(R" and password \*(L"unicode\*(R".
.PP
Consider this \s-1URL\s0, which is part of the protected
area of the web site:
.PP
.Vb 1
\&  http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html
.Ve
.PP
If you access that with a browser, you'll get a prompt
like \&\*(L"Enter username and password for 'Unicode\-MailList\-Archives' at server
\&'www.unicode.org'\*(R".
.PP
In \s-1LWP\s0, if you just request that \s-1URL\s0, like this:
.PP
.Vb 2
\&  use LWP;
\&  my $browser = LWP::UserAgent\->new;
\&
\&  my $url =
\&   \*(Aqhttp://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html\*(Aq;
\&  my $response = $browser\->get($url);
\&
\&  die "Error: ", $response\->header(\*(AqWWW\-Authenticate\*(Aq) || \*(AqError accessing\*(Aq,
\&    #  (\*(AqWWW\-Authenticate\*(Aq is the realm\-name)
\&    "\en ", $response\->status_line, "\en at $url\en Aborting"
\&   unless $response\->is_success;
.Ve
.PP
Then you'll get this error:
.PP
.Vb 4
\&  Error: Basic realm="Unicode\-MailList\-Archives"
\&   401 Authorization Required
\&   at http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html
\&   Aborting at auth1.pl line 9.  [or wherever]
.Ve
.PP
\&...because the \f(CW$browser\fR doesn't know the username and password
for that realm (\*(L"Unicode-MailList-Archives\*(R") at that host
(\*(L"www.unicode.org\*(R").
The simplest way to let the browser know about this
is to use the \f(CW\*(C`credentials\*(C'\fR method to let it know about a username and
password that it can try using for that realm at that host.  The syntax is:
.PP
.Vb 5
\&  $browser\->credentials(
\&    \*(Aqservername:portnumber\*(Aq,
\&    \*(Aqrealm\-name\*(Aq,
\&    \*(Aqusername\*(Aq => \*(Aqpassword\*(Aq
\&  );
.Ve
.PP
In most cases, the port number is 80, the default \s-1TCP/IP\s0 port for \s-1HTTP\s0; and
you usually call the \f(CW\*(C`credentials\*(C'\fR method before you make any requests.
For example:
.PP
.Vb 5
\&  $browser\->credentials(
\&    \*(Aqreports.mybazouki.com:80\*(Aq,
\&    \*(Aqweb_server_usage_reports\*(Aq,
\&    \*(Aqplinky\*(Aq => \*(Aqbanjo123\*(Aq
\&  );
.Ve
.PP
So if we add the following to the program above, right after the \f(CW\*(C`$browser = LWP::UserAgent\->new;\*(C'\fR line...
.PP
.Vb 5
\&  $browser\->credentials(  # add this to our $browser \*(Aqs "key ring"
\&    \*(Aqwww.unicode.org:80\*(Aq,
\&    \*(AqUnicode\-MailList\-Archives\*(Aq,
\&    \*(Aqunicode\-ml\*(Aq => \*(Aqunicode\*(Aq
\&  );
.Ve
.PP
\&...then when we run it, the request succeeds, instead of causing the
\&\f(CW\*(C`die\*(C'\fR to be called.
.Sh "Accessing \s-1HTTPS\s0 URLs"
.IX Subsection "Accessing HTTPS URLs"
When you access an \s-1HTTPS\s0 \s-1URL\s0, it'll work for you just like an \s-1HTTP\s0 \s-1URL\s0
would \*(-- if your \s-1LWP\s0 installation has \s-1HTTPS\s0 support (via an appropriate
Secure Sockets Layer library).  For example:
.PP
.Vb 8
\&  use LWP;
\&  my $url = \*(Aqhttps://www.paypal.com/\*(Aq;   # Yes, HTTPS!
\&  my $browser = LWP::UserAgent\->new;
\&  my $response = $browser\->get($url);
\&  die "Error at $url\en ", $response\->status_line, "\en Aborting"
\&   unless $response\->is_success;
\&  print "Whee, it worked!  I got that ",
\&   $response\->content_type, " document!\en";
.Ve
.PP
If your \s-1LWP\s0 installation doesn't have \s-1HTTPS\s0 support set up, then the
response will be unsuccessful, and you'll get this error message:
.PP
.Vb 3
\&  Error at https://www.paypal.com/
\&   501 Protocol scheme \*(Aqhttps\*(Aq is not supported
\&   Aborting at paypal.pl line 7.   [or whatever program and line]
.Ve
.PP
If your \s-1LWP\s0 installation \fIdoes\fR have \s-1HTTPS\s0 support installed, then the
response should be successful, and you should be able to consult
\&\f(CW$response\fR just like with any normal \s-1HTTP\s0 response.
.PP
For information about installing \s-1HTTPS\s0 support for your \s-1LWP\s0
installation, see the helpful \fI\s-1README\s0.SSL\fR file that comes in the
libwww-perl distribution.
.Sh "Getting Large Documents"
.IX Subsection "Getting Large Documents"
When you're requesting a large (or at least potentially large) document,
a problem with the normal way of using the request methods (like \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR) is that the response object in
memory will have to hold the whole document \*(-- \fIin memory\fR.  If the
response is a thirty megabyte file, this is likely to be quite an
imposition on this process's memory usage.
.PP
A notable alternative is to have \s-1LWP\s0 save the content to a file on disk,
instead of saving it up in memory.
This is the syntax to use:
.PP
.Vb 3
\&  $response = $ua\->get($url,
\&                         \*(Aq:content_file\*(Aq => $filespec,
\&                      );
.Ve
.PP
For example,
.PP
.Vb 3
\&  $response = $ua\->get(\*(Aqhttp://search.cpan.org/\*(Aq,
\&                         \*(Aq:content_file\*(Aq => \*(Aq/tmp/sco.html\*(Aq
\&                      );
.Ve
.PP
When you use this \f(CW\*(C`:content_file\*(C'\fR option, the \f(CW$response\fR will have
all the normal header lines, but \f(CW\*(C`$response\->content\*(C'\fR will be
empty.
.PP
Note that this \*(L":content_file\*(R" option isn't supported under older
versions of \s-1LWP\s0, so you should consider adding \f(CW\*(C`use LWP 5.66;\*(C'\fR to check
the \s-1LWP\s0 version, if you think your program might run on systems with
older versions.
.PP
If you need to be compatible with older \s-1LWP\s0 versions, then use
this syntax, which does the same thing:
.PP
.Vb 2
\&  use HTTP::Request::Common;
\&  $response = $ua\->request( GET($url), $filespec );
.Ve
.SH "SEE ALSO"
.IX Header "SEE ALSO"
Remember, this article is just the most rudimentary introduction to
\&\s-1LWP\s0 \*(-- to learn more about \s-1LWP\s0 and LWP-related tasks, you really
must read from the following:
.IP "\(bu" 4
LWP::Simple \*(-- simple functions for getting/heading/mirroring URLs
.IP "\(bu" 4
\&\s-1LWP\s0 \*(-- overview of the libwww-perl modules
.IP "\(bu" 4
LWP::UserAgent \*(-- the class for objects that represent \*(L"virtual browsers\*(R"
.IP "\(bu" 4
HTTP::Response \*(-- the class for objects that represent the response to
an \s-1LWP\s0 request, as in \f(CW\*(C`$response = $browser\->get(...)\*(C'\fR
.IP "\(bu" 4
HTTP::Message and HTTP::Headers \*(-- classes that provide more methods
to HTTP::Response.
.IP "\(bu" 4
\&\s-1URI\s0 \*(-- class for objects that represent absolute or relative URLs
.IP "\(bu" 4
URI::Escape \*(-- functions for URL-escaping and URL-unescaping strings
(like turning \*(L"this & that\*(R" to and from \*(L"this%20%26%20that\*(R").
.IP "\(bu" 4
HTML::Entities \*(-- functions for HTML-escaping and HTML-unescaping strings
(like turning \*(L"C. & E.
Bronte\*:\*(R" to and from \*(L"C. &amp; E. Bront&euml;\*(R")
.IP "\(bu" 4
HTML::TokeParser and HTML::TreeBuilder \*(-- classes for parsing \s-1HTML\s0
.IP "\(bu" 4
HTML::LinkExtor \*(-- class for finding links in \s-1HTML\s0 documents
.IP "\(bu" 4
The book \fIPerl & \s-1LWP\s0\fR by Sean M. Burke.  O'Reilly & Associates, 2002.
\&\s-1ISBN:\s0 0\-596\-00178\-9.  \f(CW\*(C`http://www.oreilly.com/catalog/perllwp/\*(C'\fR
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 2002, Sean M. Burke.  You can redistribute this document and/or
modify it, but only under the same terms as Perl itself.
.SH "AUTHOR"
.IX Header "AUTHOR"
Sean M. Burke \f(CW\*(C`sburke@cpan.org\*(C'\fR