📄 pavuk.1.in
Available only on platforms which have any supported RE implementation.
.TP
.I -url_pattern $pattern
This option allows you to specify a wildcard pattern for URLs. All URLs
are tested against this pattern.
.br
.B Example:
.br
-url_pattern http://\\*.idata.sk:\\*/~ondrej/\\*
.br
This option enables all HTTP URLs from the domain .idata.sk, on all ports,
which are located under /~ondrej/.
.TP
.I -url_rpattern $reg_exp
This option is the same as the previous one, but it uses regular expressions.
Available only on platforms which have any supported RE implementation.
.TP
.I -skip_url_pattern $pattern
This option allows you to specify a wildcard pattern for URLs that should be skipped.
All URLs are tested against this pattern.
.TP
.I -skip_url_rpattern $reg_exp
This option is the same as the previous one, but it uses regular expressions.
Available only on platforms which have any supported RE implementation.
.TP
.I -aip_pattern $re
This option allows you to limit the set of transferred documents by server IP address.
The IP address can be specified as a regular expression, so it is possible to specify
a set of IP addresses with one expression.
Available only on platforms which have any supported RE implementation.
.TP
.I -dip_pattern $re
This option is similar to the previous option, but it is used to specify the set of disallowed
IP addresses.
Available only on platforms which have any supported RE implementation.
.TP
.I -tag_pattern $tag $attrib $url
A more powerful version of the \fI-url_pattern\fR option, for more precise matching
of allowed URLs based on an HTML tag name pattern, an HTML tag attribute name pattern
and a URL pattern. You can use wildcard patterns in all three parameters of this option,
thus something like \fB-tag_pattern '*' '*' url_pattern\fR is equivalent
to \fB-url_pattern url_pattern\fR. The \fI$tag\fR and \fI$attrib\fR parameters
are always matched against uppercase strings.
For example, if you want pavuk to follow only regular links and ignore
stylesheets, images, etc., use the option \fB-tag_pattern A HREF '*'\fR.
.TP
.I -tag_rpattern $tag $attrib $url
This is a variation on \fI-tag_pattern\fR. It uses regular expression
patterns in its parameters instead of the wildcard patterns used in the previous option.
.SH Limitation Protocol Option
.sp
.TP
.I -noHTTP/-HTTP
This switch suppresses all transfers through the HTTP protocol.
By default, transfer through HTTP is enabled.
.TP
.I -noSSL/-SSL
This switch suppresses all transfers through the HTTPS protocol (HTTP over SSL).
By default, transfer through HTTPS is enabled.
This option is available only when compiled with SSL support (you need the SSLeay or OpenSSL libraries and development headers).
.TP
.I -noGopher/-Gopher
Suppress all transfers through the Gopher Internet protocol.
By default, transfer through Gopher is enabled.
.TP
.I -noFTP/-FTP
This switch prevents processing of documents located on FTP servers.
By default, transfer through FTP is enabled.
.TP
.I -noFTPS/-FTPS
This switch prevents processing of documents located on FTP servers accessed through SSL.
By default, transfer through FTPS is enabled.
This option is available only when compiled with SSL support (you need the SSLeay or OpenSSL libraries and development headers).
.TP
.I -FTPhtml/-noFTPhtml
With the -FTPhtml option you can force pavuk to process HTML files downloaded with the FTP protocol.
By default pavuk won't parse HTML files from FTP servers.
.TP
.I -FTPdir/-noFTPdir
Force recursive processing of FTP directories too.
By default, recursive downloading from FTP servers is denied.
.TP
.I -disable_html_tag $TAG,[$ATTRIB][;...]
.I -enable_html_tag $TAG,[$ATTRIB][;...]
Enable or disable processing of particular HTML tags or attributes.
By default all supported HTML tags are enabled.
.sp
For example, if you don't want to process any images, use the option
.B -disable_html_tag 'IMG,SRC;INPUT,SRC;BODY,BACKGROUND' .
.SH Other Limitation Options
.sp
.TP
.I -subdir $dir
Subdirectory of local tree
directory, to limit some of the modes {sync, resumeregets, linkupdate} in its tree scan.
.TP
.I -dont_leave_site/-leave_site
(Don't) leave the starting site. By default pavuk can span hosts when recursing through the WWW tree.
.TP
.I -dont_leave_dir/-leave_dir
(Don't) leave the starting directory. If the -dont_leave_dir option is used,
pavuk will stay in the starting directory (including its subdirectories).
By default pavuk can leave the starting directories.
.TP
.I -leave_site_enter_dir/-dont_leave_site_enter_dir
If you are downloading a WWW tree which spans multiple hosts with huge trees, you may want to restrict downloading to documents which are in the directory hierarchy below the directory which was visited first on each site. To obtain this, use the option -dont_leave_site_enter_dir. By default pavuk will also go to higher directory levels on that site.
.TP
.I -lmax $nr
Set the maximum allowed level of tree traversal. The default is 0,
which means that pavuk can traverse ad infinitum.
As of version 0.8pl1, inline objects of HTML pages are placed at the same
level as the parent HTML page.
.TP
.I -leave_level $nr
Maximum level of documents outside the site of the starting URL.
The default is 0, which means that no checking is applied.
.TP
.I -site_level $nr
Maximum level of sites outside the site of the starting URL.
The default is 0, which means that no checking is applied.
.TP
.I -dmax $nr
Set the maximum allowed number of documents that are processed.
The default value is 0,
which means that no restriction is placed on the number of processed documents.
.TP
.I -singlepage/-nosinglepage
The option \fB-singlepage\fR allows you to transfer just an HTML page with all its
inlined objects (pictures, sounds, frame documents, ...).
By default single page transfer is disabled. This option makes the \fB-mode singlepage\fR
option obsolete.
.TP
.I -limit_inlines/-dont_limit_inlines
With this option you can control whether limiting options also apply to inline
objects (pictures, sounds, ...).
This is useful when you want to download a
specified set of HTML pages with all inline objects without any restrictions.
.TP
.I -user_condition $str
Script or program name for the user's own conditions.
You can write any script which decides by its exit value whether to download a URL or not.
The script gets any number of options from pavuk, with this meaning:
.in +3
.sp
.B -url $url
- processed URL
.br
.B -parent $url
- any number of parent URLs
.br
.B -level $nr
- level of this URL from the starting URL
.br
.B -size $nr
- size of the requested URL
.br
.B -date $datenr
- modification time of the requested URL in the format
.I YYYYMMDDhhmmss
.br
.in -3
.sp
The exit status 0 of the script or program means that the current URL should be rejected, and a nonzero exit status means that the URL should be accepted.
.br
.B Warning :
use user conditions only if required, because of the big slowdowns
caused by forking a script for each checked URL.
.TP
.I -follow_cmd $str
This option allows you to specify a script or program which can decide by its exit status whether to follow URLs from the current HTML document. This script will be called after the download of each HTML document.
The script will get the following options as its parameters:
.in +3
.sp
.B -url $url
- URL of the current HTML document
.br
.B -infile $file
- local file where the HTML document is stored
.in -3
.sp
The exit status 0 of the script or program means that URLs from the current document will be disallowed; any other exit status means that pavuk can follow links from the current HTML document.
.SH Javascript support
Support for scripting languages like JavaScript or VBScript in pavuk is done in a somewhat hacky way. There is no interpreter for these languages, so not all things will work. All the support which pavuk has for these scripting languages is based on regular expression patterns specified by the user. Pavuk searches for these patterns in DOM event attributes of HTML tags, in javascript:...
URLs, in inline scripts in HTML documents enclosed between <script></script> tags and in separate javascript files.
Support for scripting languages is only available when pavuk is compiled with a proper regular expression library (POSIX/GNU/PCRE).
.TP
.I -enable_js/-disable_js
These options are used to enable or disable processing of the Javascript parts of HTML documents. You must enable this option to be able to use processing of javascript patterns.
.TP
.I -js_pattern $re
With this option you specify which patterns match the interesting parts of Javascript for extracting URLs. The parameter must be an RE pattern with exactly one subpattern which matches exactly the URL part. For example, to match the URL in the following type of javascript expression:
.br
 document.b1.src='pics/button1_pre.jpg'
.br
you can use this pattern
.br
 "^document\\.[a-zA-Z0-9_]*\\.src[ \\t]*=[ \\t]*'(.*)'$"
.TP
.I -js_transform $p $t $h $a
This option is similar to the previous one, but you can use custom transform rules for the URL parts of patterns and also specify the exact HTML tag and attribute in which to look for this pattern. \fB$p\fR is the pattern to match the interesting part of the script. \fB$t\fR is the transform rule for the URL; in this parameter the \fB$x\fR parts will be replaced by the x-th subpattern of the \fB$p\fR pattern. The \fB$h\fR parameter is either the exact HTML tag, or "*" when the rule applies to javascript: URLs or DOM event attributes, or "" (empty string) when it applies to the javascript body of an HTML document or a separate JS file. The \fB$a\fR parameter is the exact HTML attribute of the tag, or "" (empty string) when this rule applies to the javascript body.
.TP
.I -js_transform2 $p $t $h $a
This option is very similar to the previous one. The meaning of all parameters is the same,
except that the pattern \fB$p\fR can have only one substring which will be used in
the transform rule \fB$t\fR. This is required to allow rewriting of the URL parts
of the tags and scripts.
This option can also be used to force pavuk to
recognize HTML tag/attribute pairs which pavuk does not support.
.SH Cookie
.sp
.TP
.I -cookie_file $file
File where cookie information is stored. This file must be in Netscape cookie file format (generated with Netscape Navigator or Communicator ...).
.TP
.I -cookie_send/-nocookie_send
Use collected cookies in HTTP/HTTPS requests.
By default pavuk will not send cookies.
.TP
.I -cookie_recv/-nocookie_recv
Store cookies received in HTTP/HTTPS responses into the memory cookie cache.
By default pavuk will not remember received cookies.
.TP
.I -cookie_update/-nocookie_update
Update the cookie file on disk and synchronize it with changes made by any concurrent processes.
By default pavuk will not update the cookie file on disk.
.TP
.I -cookies_max $nr
Maximum number of cookies in the memory cookie cache.
The default value is 0, which means no restriction on the number of cookies.
.TP
.I -disabled_cookie_domains $list
Comma-separated list of cookie domains which are permitted to send cookies stored into cookie cache.
.TP
.I -cookie_check/-nocookie_check
When receiving a cookie, check whether the cookie domain is equal to the domain of the server which sends this cookie. By default pavuk checks whether the server is setting cookies for its own domain, and if it tries to set a cookie for a foreign domain, pavuk will complain about that and reject the cookie.
.SH HTML rewriting engine tuning options
.sp
.TP
.I -noRelocate/-Relocate
This switch prevents the program from rewriting relative URLs to absolute after the
HTML document is transferred. The default pavuk behavior is to maintain link
consistency of HTML documents. So whenever an HTML document is downloaded, pavuk
will rewrite all URLs to point to the local document if it is available, and if it
is not available, to point to the remote document.
After a document is properly
downloaded, pavuk will update the links in HTML documents which point to it.
.TP
.I -all_to_local/-noall_to_local
This option forces pavuk to change all URLs inside an HTML document to local URLs
immediately after the document is downloaded. By default this option is disabled.
.TP
.I -sel_to_local/-nosel_to_local
This option forces pavuk to change all URLs which satisfy the conditions for download
to local URLs inside the HTML document, immediately after the document is downloaded.
I recommend using this option when you are sure that the transfer will proceed
without any problems. This option can save a lot of processor time.
By default this option is disabled.
.TP
.I -all_to_remote/-noall_to_remote
This option forces pavuk to change all URLs inside an HTML document to remote URLs
immediately after the document is downloaded.
By default this option is disabled.
.TP
.I -post_update/-nopost_update
This option is designed specifically to allow rules in the \fB-fnrules\fR option
to be based on the MIME type of a document. This option forces pavuk
to generate local names for documents just after pavuk knows
the MIME type of the document. This has a big impact on the rewriting engine
for links inside HTML documents. This option causes other
options for controlling the link rewriting engine to malfunction. Use this option only
when you know what you are doing :-)
.TP
.I -dont_touch_url_pattern $pat
This option serves to deny rewriting and processing of particular URLs
in HTML documents by the pavuk HTML rewriting engine. This option accepts
wildcard patterns to specify such URLs. Matching is done against untouched
URLs, so when the URL is relative, you must use a pattern which matches the
relative URL; when it is absolute, you must use a pattern which matches the absolute URL.
.TP
.I -dont_touch_url_rpattern $pat
This option is a variation on the previous option. It uses regular expression patterns
for matching URLs instead of the wildcard patterns used by the
\fB-dont_touch_url_pattern\fR option.
This option is available only when pavuk
is compiled with support for regular expression patterns.
.TP
.I -dont_touch_tag_rpattern $pat
This option is a variation on the previous option, except that matching is done on the full HTML
tag, including the <>. This option accepts regular expression patterns. It is
available only when pavuk is compiled with support for regular expression
patterns.
.SH Filename/URL Conversion Option
.sp
.TP
.I -tr_del_chr $str
All characters found in \fB$str\fR will be deleted from the local name of the document.
\fB$str\fR can contain escape sequences similar to those in the tr command:
.br
\fB\\n\fR
- newline
.br
\fB\\r\fR
- carriage return
.br
\fB\\t\fR
- horizontal tab space
.br
\fB\\0xXX\fR
- hexadecimal ASCII value
.br
.B [:upper:]
- all uppercase letters
.br
.B [:lower:]
- all lowercase letters
.br
.B [:alpha:]
- all letters
.br
.B [:alnum:]
- all letters and digits
.br
.B [:digit:]
- all digits
.br
.B [:xdigit:]