=head1 NAME

lwptut -- An LWP Tutorial

=head1 DESCRIPTION

LWP (short for "Library for WWW in Perl") is a very popular group of
Perl modules for accessing data on the Web. Like most Perl
module-distributions, each of LWP's component modules comes with
documentation that is a complete reference to its interface. However,
there are so many modules in LWP that it's hard to know where to start
looking for information on how to do even the simplest, most common
things.

Really introducing you to using LWP would require a whole book -- a book
that just happens to exist, called I<Perl & LWP>. But this article
should give you a taste of how you can go about some common tasks with
LWP.

=head2 Getting documents with LWP::Simple

If you just want to get what's at a particular URL, the simplest way
to do it is with LWP::Simple's functions.

In a Perl program, you can call its C<get($url)> function. It will try
getting that URL's content. If it works, then it'll return the
content; but if there's some error, it'll return undef.

  my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
    # Just an example: the URL for the most recent /Fresh Air/ show

  use LWP::Simple;
  my $content = get $url;
  die "Couldn't get $url" unless defined $content;

  # Then go do things with $content, like this:

  if($content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
  } else {
    print "Fresh Air is apparently jazzless today.\n";
  }

The handiest variant on C<get> is C<getprint>, which is useful in Perl
one-liners. If it can get the page whose URL you provide, it sends it
to STDOUT; otherwise it complains to STDERR.

  % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"

That is the URL of a plaintext file that lists new files in CPAN in
the past two weeks. You can easily make it part of a tidy little
shell command, like this one that mails you the list of new
C<Acme::> modules:

  % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" \
     | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER

There are other useful functions in LWP::Simple, including one function
for running a HEAD request on a URL (useful for checking links, or
getting the last-revised time of a URL), and two functions for
saving/mirroring a URL to a local file. See L<the LWP::Simple
documentation|LWP::Simple> for the full details, or chapter 2 of I<Perl
& LWP> for more examples.

=for comment
 ##########################################################################

=head2 The Basics of the LWP Class Model

LWP::Simple's functions are handy for simple cases, but its functions
don't support cookies or authorization, don't support setting header
lines in the HTTP request, and generally don't support reading header
lines in the HTTP response (notably the full HTTP error message, in case
of an error). To get at all those features, you'll have to use the full
LWP class model.

While LWP consists of dozens of classes, the main two that you have to
understand are L<LWP::UserAgent> and L<HTTP::Response>. LWP::UserAgent
is a class for "virtual browsers" which you use for performing requests,
and L<HTTP::Response> is a class for the responses (or error messages)
that you get back from those requests.

The basic idiom is C<< $response = $browser->get($url) >>, or more fully
illustrated:

  # Early in your program:

  use LWP 5.64; # Loads all important LWP classes, and makes
                #  sure your version is reasonably recent.

  my $browser = LWP::UserAgent->new;

  ...

  # Then later, whenever you need to make a get request:
  my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

  my $response = $browser->get( $url );
  die "Can't get $url -- ", $response->status_line
   unless $response->is_success;

  die "Hey, I was expecting HTML, not ", $response->content_type
   unless $response->content_type eq 'text/html';
     # or whatever content-type you're equipped to deal with

  # Otherwise, process the content somehow:

  if($response->decoded_content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
  } else {
    print "Fresh Air is apparently jazzless today.\n";
  }

There are two objects involved: C<$browser>, which holds an object of
class LWP::UserAgent, and then the C<$response> object, which is of
class HTTP::Response. You really need only one browser object per
program; but every time you make a request, you get back a new
HTTP::Response object, which will have some interesting attributes:

=over

=item *

A status code indicating success or failure
(which you can test with C<< $response->is_success >>).

=item *

An HTTP status line that is hopefully informative if there's failure
(which you can see with C<< $response->status_line >>,
returning something like "404 Not Found").

=item *

A MIME content-type like "text/html", "image/gif",
"application/xml", etc., which you can see with
C<< $response->content_type >>.

=item *

The actual content of the response, in C<< $response->decoded_content >>.
If the response is HTML, that's where the HTML source will be; if
it's a GIF, then C<< $response->decoded_content >> will be the binary
GIF data.

=item *

And dozens of other convenient and more specific methods that are
documented in the docs for L<HTTP::Response>, and its superclasses
L<HTTP::Message> and L<HTTP::Headers>.

=back

=for comment
 ##########################################################################

=head2 Adding Other HTTP Request Headers

The most commonly used syntax for requests is
C<< $response = $browser->get($url) >>, but in truth, you can add extra
HTTP header lines to the request by adding a list of key-value pairs
after the URL, like so:

  $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );

For example, here's how to send some more Netscape-like headers, in case
you're dealing with a site that would otherwise reject your request:

  my @ns_headers = (
   'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
   'Accept' =>
     'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
   'Accept-Charset' => 'iso-8859-1,*,utf-8',
   'Accept-Language' => 'en-US',
  );

  ...

  $response = $browser->get($url, @ns_headers);

If you weren't reusing that array, you could just go ahead and do this:

  $response = $browser->get($url,
   'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
   'Accept' =>
     'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
   'Accept-Charset' => 'iso-8859-1,*,utf-8',
   'Accept-Language' => 'en-US',
  );

If you were only ever changing the 'User-Agent' line, you could just change
the C<$browser> object's default line from "libwww-perl/5.65" (or the like)
to whatever you like, using the LWP::UserAgent C<agent> method:

  $browser->agent('Mozilla/4.76 [en] (Win98; U)');

=for comment
 ##########################################################################

=head2 Enabling Cookies

A default LWP::UserAgent object acts like a browser with its cookie
support turned off. You turn it on by setting its C<cookie_jar>
attribute, which you can do in various ways. A "cookie jar" is an object
representing a little database of all the HTTP cookies that a browser
can know about.
It can correspond to a
file on disk (the way Netscape uses its F<cookies.txt> file), or it can
be just an in-memory object that starts out empty, and whose collection of
cookies will disappear once the program is finished running.

To give a browser an in-memory empty cookie jar, you set its C<cookie_jar>
attribute like so:

  $browser->cookie_jar({});

To give it a copy that will be read from a file on disk, and will be saved
to it when the program is finished running, set the C<cookie_jar> attribute
like this:

  use HTTP::Cookies;
  $browser->cookie_jar( HTTP::Cookies->new(
    'file' => '/some/where/cookies.lwp',
        # where to read/write cookies
    'autosave' => 1,
        # save it to disk when done
  ));

That file will be in an LWP-specific format. If you want to access the
cookies in your Netscape cookies file, you can use the
HTTP::Cookies::Netscape class:

  use HTTP::Cookies;
    # yes, loads HTTP::Cookies::Netscape too

  $browser->cookie_jar( HTTP::Cookies::Netscape->new(
    'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
        # where to read cookies
  ));

You could add an C<< 'autosave' => 1 >> line as further above, but at
time of writing, it's uncertain whether Netscape might discard some of
the cookies you could be writing back to disk.

=for comment
 ##########################################################################

=head2 Posting Form Data

Many HTML forms send data to their server using an HTTP POST request, which
you can send with this syntax:

  $response = $browser->post( $url,
    [
      formkey1 => value1,
      formkey2 => value2,
      ...
    ],
  );

Or if you need to send HTTP headers:

  $response = $browser->post( $url,
    [
      formkey1 => value1,
      formkey2 => value2,
      ...
    ],
    headerkey1 => value1,
    headerkey2 => value2,
  );

For example, the following program makes a search request to AltaVista
(by sending some form data via an HTTP POST request), and extracts from
the HTML the report of the number of matches:

  use strict;
  use warnings;
  use LWP 5.64;
  my $browser = LWP::UserAgent->new;

  my $word = 'tarragon';

  my $url = 'http://www.altavista.com/sites/search/web';
  my $response = $browser->post( $url,
    [ 'q' => $word,  # the AltaVista query string
      'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
    ]
  );
  die "$url error: ", $response->status_line
   unless $response->is_success;
  die "Weird content type at $url -- ", $response->content_type
   unless $response->content_type eq 'text/html';

  if( $response->decoded_content =~ m{AltaVista found ([0-9,]+) results} ) {
    # The substring will be like "AltaVista found 2,345 results"
    print "$word: $1\n";
  } else {
    print "Couldn't find the match-string in the response\n";
  }

=for comment
 ##########################################################################

=head2 Sending GET Form Data

Some HTML forms convey their form data not by sending the data
in an HTTP POST request, but by making a normal GET request with
the data stuck on the end of the URL.
For example, if you went to
C<imdb.com> and ran a search on "Blade Runner", the URL you'd see
in your browser window would be:

  http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV

To run the same search with LWP, you'd use this idiom, which involves
the URI class:

  use URI;
  my $url = URI->new( 'http://us.imdb.com/Tsearch' );
    # makes an object representing the URL

  $url->query_form(  # And here the form data pairs:
    'title'    => 'Blade Runner',
    'restrict' => 'Movies and TV',
  );

  my $response = $browser->get($url);

See chapter 5 of I<Perl & LWP> for a longer discussion of HTML forms
and of form data, and chapters 6 through 9 for a longer discussion of
extracting data from HTML.

=head2 Absolutizing URLs

The URI class that we just mentioned above provides all sorts of methods
for accessing and modifying parts of URLs (such as asking what sort of
URL it is with C<< $url->scheme >>, asking what host it refers to with
C<< $url->host >>, and so on, as described in
L<the docs for the URI class|URI>).
However, the methods of most immediate interest
are the C<query_form> method seen above, and now the C<new_abs> method
for taking a probably-relative URL string (like "../foo.html") and getting
back an absolute URL (like "http://www.perl.com/stuff/foo.html"), as
shown here:

  use URI;
  $abs = URI->new_abs($maybe_relative, $base);

For example, consider this program that matches URLs in the HTML
list of new modules in CPAN:

  use strict;
  use warnings;
  use LWP;
  my $browser = LWP::UserAgent->new;

  my $url = 'http://www.cpan.org/RECENT.html';
  my $response = $browser->get($url);
  die "Can't get $url -- ", $response->status_line
   unless $response->is_success;

  my $html = $response->decoded_content;
  while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
    print "$1\n";
  }

When run, it emits output that starts out something like this:

  MIRRORING.FROM
  RECENT
  RECENT.html
  authors/00whois.html
  authors/01mailrc.txt.gz
  authors/id/A/AA/AASSAD/CHECKSUMS
  ...

However, if you actually want to have those be absolute URLs, you
can use the URI module's C<new_abs> method, by changing the C<while>
loop to this:

  while( $html =~ m/<A HREF=\"(.*?)\"/g ) {
    print URI->new_abs( $1, $response->base ) ,"\n";
  }

(The C<< $response->base >> method from L<HTTP::Response>
returns the URL that should be used for resolving relative URLs -- it's
usually just the same as the URL that you requested.)

That program then emits nicely absolute URLs:

  http://www.cpan.org/MIRRORING.FROM
  http://www.cpan.org/RECENT
  http://www.cpan.org/RECENT.html
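
As a rough offline sketch of what C<new_abs> does with different kinds of
links, the following resolves a few strings against a base URL without
making any request. The C<absolutize> helper, the base URL, and the link
strings are hypothetical, chosen only for illustration; they are not part
of LWP or URI:

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve $link against $base and return a plain string.
sub absolutize {
  my ($link, $base) = @_;
  return URI->new_abs( $link, $base )->as_string;
}

# Made-up base URL, standing in for what $response->base would return.
my $base = 'http://www.cpan.org/modules/index.html';

my @links = (
  'RECENT.html',              # relative to the base URL's directory
  '../authors/00whois.html',  # climbs one directory first
  '/MIRRORING.FROM',          # relative to the server root
  'http://search.cpan.org/',  # already absolute
);

for my $link (@links) {
  print "$link => ", absolutize( $link, $base ), "\n";
}
```

Run as is, this should print each link next to its absolute form. A link
that is already absolute comes back unchanged, so it is safe to pass every
matched URL through C<new_abs> without checking it first.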