📄 lw.pm
字号:
HTTP response code).callback- crawl will call this function (if this is a reference to a function), passing it the current URI and the @ST array (which has host, port, SSL, etc info). If the function returns a TRUE value, then crawl will skip that URI. Set to value 0 (zero) if you do not want to use a callback.slashdot_bug- slashdot.org uses a screwy piece of invalid (yet it works) HTML in the form of <FORM ACTION="//slashdot.org/somefile">. So basically, when a URL starts with '//' and slashdot_bug is set to 1 (which it is by default), then the proper 'http:' or 'https:' will be prepended to the URL.source_callback- crawl will call this function (if this is a reference to a function), passing references to %hin and %hout, right before it parses the page for HTML links. This allows the callback function to review or modify the HTML before it's parsed for links. Return value is ignored. url_limit- number or URLs that crawl will queue up at one time; defaults to 1000do_head- use head requests to determine if a file has a content-type worth downloading. Potentially saves some time, assuming the server properly supports HEAD requests. Set to value 1 to use (0/off by default).=cutsub crawl_set_config { return if(!defined $_[0]); my %opts=@_; while( my($k,$v)=each %opts){ $LW::crawl_config{lc($k)}=$v; }}#####################################################=pod=head1 - Function: LW::crawl_extract_links_test (INTERNAL) Params: $TAG, \%elements, \$html_data, $offset, $lenReturn: nothingThis is the callback function used by the crawl function, and passed to html_find_tags. It will find URL/URI links and place them in @LW::crawl_urls.=cutsub crawl_extract_links_test { my ($TAG, $hr, $dr, $start, $len)=(lc(shift),@_); my $t; # this should be most of the time... return undef if(!defined ($t=$LW::crawl_linktags{$TAG})); return undef if(!scalar %$hr); # fastpath quickie while( my ($key,$val)= each %$hr){ # normalize element values $$hr{lc($key)} = $val; } if(ref($t)){ foreach (@$t){ push(@LW::crawl_urls,$$hr{$_}) if(defined $$hr{$_}); } } else { push(@LW::crawl_urls,$$hr{$t}) if(defined $$hr{$t}); } if($TAG eq 'form' && defined $$hr{action}){ my $u=$LW::crawl_config{'ref_hout'}->{'whisker'}->{'uri'}; $LW::crawl_forms{utils_absolute_uri($$hr{action},$u,1)}++; } return undef;}################################################################=pod=head1 - Function: LW::crawl_make_request (INTERNAL) Params: \%hin, \%houtReturn: $status_codeThis is an internal function used by LW::crawl(), and is responsible formaking HTTP requests, including any HEAD pre-requests and following moveresponses. Status codes are: 0 Success 1 Error during request 2 Error on connection setup 3 Move request; follow Location header 4 File not of text/htm(l) type 5 File not available=cutsub crawl_do_request { my ($hrin,$hrout) = @_; my $ret; if($LW::crawl_config{'do_head'}){ my $save=$$hrin{'whisker'}->{'method'}; $$hrin{'whisker'}->{'method'}='HEAD'; $ret=http_do_request($hrin,$hrout); $$hrin{'whisker'}->{'method'}=$save; return 2 if($ret==2); # if there was connection error, do not continue if($ret==0){ # successful request if($$hrout{'whisker'}->{'http_resp'}==501){ # HEAD not allowed $LW::crawl_config{'do_head'}=0; # no more HEAD requests } if($$hrout{'whisker'}->{'http_resp'} <308 && $$hrout{'whisker'}->{'http_resp'} >300){ if($LW::crawl_config{'follow_moves'} >0){ return 3 if(defined $$hrout{'location'}); } return 5; # not avail } if($$hrout{'whisker'}->{'http_resp'}==200){ # no content-type is treated as text/htm if(defined $$hrout{'content-type'} && $$hrout{'content-type'}!~/^text\/htm/i){ return 4; } # fall through to GET request below } } # request errors are essentially redone via GET, below } return http_do_request($hrin,$hrout);}#####################################################=pod=head1 ++ Sub package: dumpThe dump subpackage contains various utility functions which emulatethe basic functionality provided by Data::Dumper.=cut########################################################################=pod=head1 - Function: LW::dumper Params: $name, \@array [, $name, \%hash, $name, \$scalar ]Return: $code, undef on errorThe dumper function will take the given $name and data reference, andwill create an ASCII perl code representation suitable for eval'inglater to recreate the same structure. $name is the name of the variablethat it will be saved as. Example: $output = LW::dumper('hin',\%hin);NOTE: dumper() creates anonymous structures under the name given. Forexample, if you dump the hash %hin under the name 'hin', then when youeval the dumped code you will need to use %$hin, since $hin is now a*reference* to a hash.=cutsub dumper { my %what=@_; my ($final,$k,$v)=(''); while( ($k,$v)=each %what){ return undef if(ref($k) || !ref($v)); $final.="\$$k = "._dump(1,$v,1); $final=~s#,\n$##; $final.=";\n"; } return $final;}########################################################################=pod=head1 - Function: LW::dumper_writefile Params: $file, $name, \@array [, $name, \%hash, $name, \@scalar ]Return: 0 if success; 1 if errorThis calls dumper() and saves the output to the specified $file. Note: LW does not checking on the validity of the file name, it'screation, or anything of the sort. Files are opened in overwritemode.=cutsub dumper_writefile { my $file=shift; my $output=dumper(@_); return 1 if(!open(OUT,">$file") || $output eq 'ERROR'); print OUT $output; close(OUT);}########################################################################=pod=head1 - Function: LW::_dump (INTERNAL) Params: $tabs, $refReturn: $outputThis is an internal function to dumper() which will dereference allelements and produce the resulting code.This function is not intended for external use.=cutsub _dump { # dereference and dump an element my ($t, $ref, $depth)=@_; my ($out,$k,$v)=(''); $depth||=1; # to protect against circular loops return 'undef' if($depth > 128); if(!defined $ref){ return 'undef'; } elsif(ref($ref) eq 'HASH'){ $out.="{\n"; while( ($k,$v)=each %$ref){ next if($k eq ''); $out.= "\t"x$t; $out.=_dumpd($k).' => '; if(ref($v)){ $out.=_dump($t+1,$v,$depth+1); } else { $out.=_dumpd($v); } $out.=",\n" unless( substr($out,-2,2) eq ",\n"); } $out=~s#,\n$#\n#; $out.="\t"x($t-1); $out.="},\n"; } elsif(ref($ref) eq 'ARRAY'){ $out.="["; if(~~@$ref){ $out.="\n"; foreach $v (@$ref) { $out.= "\t"x$t; if(ref($v)){ $out.=_dump($t+1,$v,$depth+1); } else { $out.=_dumpd($v); } $out.=",\n" unless( substr($out,-2,2) eq ",\n"); } $out=~s#,\n$#\n#; $out.="\t"x($t-1); } $out.="],\n"; } elsif(ref($ref) eq 'SCALAR'){ $out.=_dumpd($$ref); } elsif(ref($ref) eq 'REF'){ $out.=_dump($t,$$ref,$depth+1); } elsif(ref($ref)){ # unknown/unsupported ref $out.="undef"; } else { # normal scalar $out.=_dumpd($ref); } return $out;}########################################################################=pod=head1 - Function: LW::_dumpd (INTERNAL) Params: $stringReturn: $escaped_stringThis is an internal function to dumper() which will escape the givenstring to make it suitable for printing.This function is not intended for external use.=cutsub _dumpd { # escape a scalar string my $v=shift; return 'undef' if(!defined $v); return "''" if($v eq ''); return "$v" if($v!~tr/0-9//c); return "'$v'" if($v!~tr/ !-~//c); $v=~s#\\#\\\\#g; $v=~s#"#\\"#g; $v=~s#\r#\\r#g; $v=~s#\n#\\n#g; $v=~s#\0#\\0#g; $v=~s#\t#\\t#g; $v=~s#([^!-~ ])#sprintf('\\x%02x',ord($1))#eg; return "\"$v\"";}########################################################################=pod=head1 ++ Sub package: easyThe 'easy' subpackage contains many high-level/simple functions todo basic web tasks. This should make it easier to use libwhiskerto do basic tasks.=cut########################################################################=pod=head1 - Function: LW::get_page Params: $url [, \%hin_request]Return: $code, $data ($code will be set to undef on error, $data will contain error message)This function will fetch the page at the given URL, and return the HTTP response codeand page contents. Use this in the form of:($code,$html)=LW::get_page("http://host.com/page.html")The optional %hin_request will be used if supplied. This allows you to setheaders and other parameters.=cutsub get_page { my ($URL,$hr)=(shift,shift); return (undef,"No URL supplied") if(length($URL)==0); my (%req,%resp); my $rptr; if(defined $hr && ref($hr)){ $rptr=$hr; } else { $rptr=\%req; LW::http_init_request(\%req); } LW::utils_split_uri($URL,$rptr); # this is newer >=1.1 syntax LW::http_fixup_request($rptr); if(http_do_request($rptr,\%resp)){ return (undef,$resp{'whisker'}->{'error'}); } return ($resp{'whisker'}->{'code'}, $resp{'whisker'}->{'data'});}########################################################################=pod=head1 - Function: LW::get_page_hash Params: $url [, \%hin_request]Return: $hash_ref (undef on no URL)This function will fetch the page at the given URL, and return the whiskerHTTP response hash. The return code of the function is set to$hash_ref->{whisker}->{get_page_hash}, and uses the LW::http_do_request()response values.Note: undef is returned if no URL is supplied=cutsub get_page_hash { my ($URL,$hr)=(shift,shift); return undef if(length($URL)==0); my (%req,%resp); my $rptr; if(defined $hr && ref($hr)){ $rptr=$hr; } else { $rptr=\%req; LW::http_init_request(\%req); } LW::utils_split_uri($URL,$rptr); # this is newer >=1.1 syntax LW::http_fixup_request($rptr); my $r=http_do_request($rptr,\%resp); $resp{whisker}->{get_page_hash}=$r; return \%resp;}########################################################################=pod=head1 - Function: LW::get_page_to_file Params: $url, $filepath [, \%hin_request]Return: $code ($code will be set to undef on error)This function will fetch the page at the given URL, place the resulting HTMLin the file specified, and return the HTTP response code. The optional%hin_request hash sets the default parameters to be used in the request.NOTE: libwhisker does not do any file checking; libwhisker will open thesupplied filepath for writing, overwriting any previously-existing files.Libwhisker does not differentiate between a bad request, and a bad fileopen. If you're having troubles making this function work, make surethat your $filepath is legal and valid, and that you have appropriatewrite permissions to create/overwrite that file.=cutsub get_page_to_file { my ($URL, $filepath, $hr)=@_; return undef if(length($URL)==0); return undef if(length($filepath)==0); my (%req,%resp); my $rptr; if(defined $hr && ref($hr)){ $rptr=$hr; } else { $rptr=\%req; LW::http_init_request(\%req); } LW::utils_split_uri($URL,$rptr); # this is newer >=1.1 syntax LW::http_fixup_request($rptr); if(http_do_request($rptr,\%resp)){ return undef; } open(OUT,">$filepath") || return undef; binmode(OUT); # stupid Windows print OUT $resp{'whisker'}->{'data'}; close(OUT); return $resp{'whisker'}->{'code'};}########################################################################=pod=head1 - Function: LW::upload_file Params: $url, $filepath, $paramname [, \%hin_request]Return: $code ($code will be set to undef on error)This function will upload the specified $file to the given $url asthe parameter named $paramname via a multipart POST request. The optional $hin_request hash lets you set any other particular requestparameters.NOTE: this is a highly simplied function for basic uploads. If you
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -