
.\" pavuk.1.in

.TP
.I -dumpfd $nr
For scripting it is sometimes useful to download a document directly to a
pipe or variable instead of storing it in a regular file. In such a case you
can use this option to dump the data, for example, to stdout ($nr = 1).
.TP
.I -dump_after/-nodump_after
When the \fB-dumpfd\fR option is used in multithreaded pavuk, each document
must be dumped in one piece, because documents downloaded in multiple threads
could otherwise overlap. This option is also useful when you want to dump a
document after pavuk adjusts the links inside HTML documents.
.TP
.I -dump_response/-nodump_response
This option has an effect only when used with the \fB-dumpfd\fR option.
It is used to dump HTTP response headers.
.TP
.I -dump_urlfd $nr
When you use this option, pavuk outputs all URLs found in HTML
documents to file descriptor $nr. You can use this option to extract and
convert all URLs to absolute ones.
.SH Scenario/Task options
.sp
.TP
.I -scenario $str
Name of the scenario to load and/or run. Scenarios are files with a structure
similar to the \fB.pavukrc\fR file.
Scenarios contain saved configurations. You can use them for periodic
mirroring.
Parameters from scenarios specified on the command line
can be overridden by command line parameters.
To be able to use this option, you need to specify the scenario base
directory with the \fB-scndir\fR option.
.TP
.I -dumpscn $filename
Store the current configuration into a scenario file named \fB$filename\fR.
This is useful for quickly creating pre-configured scenarios for manual
editing.
.SH Directory options
.sp
.TP
.I -msgcat $dir
Directory which contains the message catalog for pavuk.
If you do not have permission
to store a pavuk message catalog in the system directory, you should simply
create a directory structure in your home directory similar to the one on
your system.
.sp
.B For example:
.sp
Your native language is German, and your home directory is
/home/jano.
.sp
You should first create the directory
/home/jano/locales/de/LC_MESSAGES/, then put the German pavuk.mo
there and set -msgcat to /home/jano/locales/.  If you have properly
set the locale environment variables, you will see pavuk speaking German.
This option is available only when pavuk is compiled with support for GNU
gettext message internationalization.
.TP
.I -cdir $dir
Directory where all retrieved documents are stored. If not
specified, the current directory is used. If the specified directory doesn't
exist, it will be created.
.TP
.I -scndir $dir
Directory in which your scenarios are stored.
You must use this option when you are loading or storing scenario files.
.SH Preserve options
.sp
.TP
.I -preserve_time/-nopreserve_time
Store the downloaded document with the same modification time as on the remote site.
The modification
time is set only when such information is available (some FTP servers do not
support the \fBMDTM\fR command, and some documents on HTTP servers are
generated dynamically, so pavuk can't retrieve their modification time).
By default the modification time of documents isn't preserved.
.TP
.I -preserve_perm/-nopreserve_perm
Store the downloaded document with the same permissions as on the remote site.
This option has an effect only when downloading a file through the FTP protocol and
assumes that the \fB-ftplist\fR option is used. By default permissions are not
preserved.
.TP
.I -preserve_slinks/-nopreserve_slinks
Set symbolic links to point to exactly the same location as on the remote server;
don't do any relocations.
This option has an effect only when downloading a file through the FTP protocol and
assumes that the \fB-ftplist\fR option is used.
By default symbolic links are not preserved, and are retrieved as regular
documents with the full contents of the linked file.
.sp
For example, assume that on the FTP server ftp.xx.org there is a symbolic link
/pub/pavuk/pavuk-current.tgz, which points to /tmp/pub/pavuk-0.9pl11.tgz.
Pavuk will create the symbolic link ftp/ftp.xx.org_21/pub/pavuk/pavuk-current.tgz.
.br
If the -preserve_slinks option is used, this symbolic link will point to
/tmp/pub/pavuk-0.9pl11.tgz.
.br
If the -preserve_slinks option is not used, this symbolic link will point to
../../tmp/pub/pavuk-0.9pl11.tgz
.TP
.I -retrieve_symlink/-noretrieve_symlink
Retrieve files behind symbolic links instead of replicating symlinks in the local tree.
.SH Proxy options
.sp
.TP
.I -http_proxy $site[:$port]
If this parameter is used, then all HTTP requests go through this proxy
server. This is useful if your site resides behind a firewall, or if you want
to use an HTTP proxy cache server.
The default port number is 8080.
Pavuk allows you to specify multiple HTTP proxies
(using multiple -http_proxy options); it rotates among them in round-robin
order and disables proxies that produce errors.
.TP
.I -nocache/-cache
Use this option whenever you want to get the document directly from
the site and not from your HTTP proxy cache server. By default pavuk allows
the transfer of document copies from the cache.
.TP
.I -ftp_proxy $site[:$port]
If this parameter is used, then all FTP requests go through this proxy
server.
This is useful when your site resides behind a firewall, or if you want to use
an FTP proxy cache server.  The default port number is 22.
Pavuk supports three different types of proxies for FTP; see the options
\fB-ftp_httpgw, -ftp_dirtyproxy.\fR
If none of the mentioned options is used, then pavuk assumes a regular
FTP proxy with \fBUSER user@host\fR connecting to the remote FTP server.
.TP
.I -ftp_httpgw/-noftp_httpgw
The specified FTP proxy is an HTTP gateway for the FTP protocol.
By default the FTP proxy is a regular FTP proxy.
.TP
.I -ftp_dirtyproxy/-noftp_dirtyproxy
The specified FTP proxy is an HTTP proxy which supports the \fBCONNECT\fR request
(pavuk should use the full FTP protocol, except for active data connections).
By default the FTP proxy is a regular FTP proxy.
If both -ftp_dirtyproxy and -ftp_httpgw are specified, -ftp_dirtyproxy is preferred.
.TP
.I -gopher_proxy $site[:$port]
Gopher gateway or proxy/cache server.
.TP
.I -gopher_httpgw/-nogopher_httpgw
The specified Gopher proxy server is an HTTP gateway for the Gopher protocol.
When \fB-gopher_proxy\fR is set and the \fB-gopher_httpgw\fR option isn't used,
pavuk uses the proxy as an HTTP tunnel with the \fBCONNECT\fR request to open
connections to Gopher servers.
.TP
.I -ssl_proxy $site[:$port]
SSL proxy (tunneling) server [as in CERN httpd + patch or in
Squid] with the \fBCONNECT\fR request enabled (at least on port 443).
This
option is available only when pavuk is compiled with SSL support (you need
the SSLeay or OpenSSL libraries with development headers).
.SH Proxy Authentication
.sp
.TP
.I -http_proxy_user $user
Username for HTTP proxy authentication.
.TP
.I -http_proxy_pass $pass
Password for HTTP proxy authentication.
.TP
.I -http_proxy_auth {1/2/3/4/user/Basic/Digest/NTLM}
Authentication scheme for proxy access. It has a similar meaning to the
\fB-auth_scheme\fR option (see the help for that option for more details).
The default is 2 (Basic scheme).
.TP
.I -auth_proxy_ntlm_domain $str
NT or LM domain used for authorization against the HTTP proxy server when the
NTLM authentication scheme is required. This option is available only when
pavuk is compiled with the OpenSSL or libdes libraries.
.TP
.I -auth_reuse_proxy_nonce/-noauth_reuse_proxy_nonce
When using the HTTP proxy Digest access authentication scheme, use the
first received nonce value in multiple following requests.
.TP
.I -ftp_proxy_user $user
Username for FTP proxy authentication.
.TP
.I -ftp_proxy_pass $pass
Password for FTP proxy authentication.
.SH Protocol/Download Options
.sp
.TP
.I -ftp_passive
Use passive FTP when downloading via FTP.
.TP
.I -ftp_active
Use active FTP when downloading via FTP.
.TP
.I -active_ftp_port_range $min:$max
This option lets you specify the ports used for
active FTP. This permits easier firewall configuration, since the range
of ports can be restricted.
.sp
Pavuk will randomly choose a number from within the specified
range until an open port is found. Should no open ports be found
within the given range, pavuk will default to a normal
kernel-assigned port, and a message (debug level net) is output.
.sp
The port range selected must be in the non-privileged range
(i.e.
greater than or equal to 1024); it is STRONGLY RECOMMENDED that
the chosen range be large enough to handle many simultaneous
active connections (for example, 49152-65534, the IANA-registered
ephemeral port range).
.TP
.I -always_mdtm/-noalways_mdtm
Force pavuk to always use "MDTM" to determine the file modification
time and never use the cached times determined when listing the remote
files.
.TP
.I -remove_before_store/-noremove_before_store
Force unlinking of files before new content is stored to a file. This
is helpful if the local files are hardlinked to some other directory
and the hardlinks are checked after mirroring. All "broken" hardlinks
indicate a file update.
.TP
.I -retry $nr
Set the number of attempts to transfer a processed document.
The default is 1, which means pavuk will retry once to get documents
which failed on the first attempt.
.TP
.I -nregets $nr
Set the number of allowed regets on a single document after a broken transfer.
The default value for this option is 2.
.TP
.I -nredirs $nr
Set the number of allowed HTTP redirects (use this to prevent loops).
The default value for this option is 5, which conforms to the HTTP specification.
.TP
.I -force_reget/-noforce_reget
Force regetting of the whole document after a broken transfer when
the server doesn't support retrieving partial content.
Pavuk's default behavior is to stop getting documents which don't allow
restarting the transfer from a specified position.
.TP
.I -timeout $nr
Timeout for stalled connections in minutes. This value is also used for
connection timeouts. For sub-minute timeouts you can use floating point numbers.
The default timeout is 0, which means timeout checking is disabled.
.TP
.I -noRobots/-Robots
This switch suppresses the use of the \fBrobots.txt\fR
standard, which is used to restrict access of Web robots to some
locations on the web server. By default, checking of robots.txt
files on HTTP servers is enabled.
Always keep this checking enabled when you are downloading
huge sets of pages with unpredictable layout.
This prevents you from upsetting server administrators :-).
.TP
.I -noEnc/-Enc
This switch suppresses the use of \fBgzip\fR, \fBcompress\fR or \fBdeflate\fR
encoding in transfer. I don't know whether some servers are broken, but they
report the MIME types application/gzip or application/compress as
encodings. Turn this option off when you have neither libz support compiled in
nor the \fBgzip\fR program, which is used to decode documents encoded this way.
By default, decoding of downloaded documents is disabled.
.TP
.I -check_size/-nocheck_size
The option -nocheck_size should be used if you are trying to
download pages from an HTTP server which sends a wrong
\fBContent-Length:\fR field in the MIME header of the response.
Pavuk's default behavior is to check this field and complain
when something is wrong.
.TP
.I -maxrate $nr
If you don't want to give all your transfer bandwidth to pavuk, use
this option to set pavuk's maximum transfer rate. This option
accepts a floating point number to specify the transfer rate in kB/s. If
you want to get optimal settings, you also have to play with the size of the read
buffer (option \fB-bufsize\fR) because pavuk does flow control only at
the application level.
By default pavuk uses the full bandwidth.
.TP
.I -minrate $nr
If you hate slow transfer rates, this option allows you to break
transfers with slow speed. You can set the minimum transfer rate,
and if the connection gets slower than the given rate, the transfer
will be stopped. The minimum transfer rate is given in kB/s.
By default pavuk doesn't check this limit.
.TP
.I -bufsize $nr
This option is used to specify the size of the read buffer (default
size: 32kB).  If you have a very fast connection, you may increase
the size of the buffer to get better read performance. If you need
to decrease the transfer rate, you may need to decrease the size of
the buffer and set the maximum transfer rate with the \fB-maxrate\fR
option.
This option accepts the size of the buffer in kB.
.TP
.I -fs_quota $nr
If you are running pavuk on a multiuser system, you may need to
avoid filling up your file system. This option lets you specify how
much space must remain free. If pavuk detects that the
free space has dropped below this limit, it will stop downloading files.
Specify this quota in
kB. The default value is 0, which means this quota is not checked.
.TP
.I -file_quota $nr
This option is useful when you want to limit downloading of big files,
but want to download at least $nr kilobytes from big files.
A big file will be transferred, and when it reaches the specified size,
the transfer will break. Such a document will be processed as properly downloaded,
so be careful when using this option.
By default pavuk transfers the full size of documents.
.TP
.I -trans_quota $nr
If you are aware that your selection will cover a large amount of
data, you can use this option to limit the amount of transferred data.
By default the transfer size is unlimited.
.TP
.I -max_time $nr
Set the maximum amount of time for the program run. After the time is exceeded, pavuk
will stop downloading. Time is specified in minutes. The default value is 0,
which means downloading time is not limited.
.TP
.I -url_strategy $strategy
This option allows you to specify the downloading order for URLs in the document tree.
This option accepts the following strings as parameters:
.br
.sp
.RS
.B level
- will order URLs as it loads them from HTML files (default)
.br
.B leveli
- as previous, but URLs of inline objects come first
.br
.B pre
- will insert URLs from the current HTML document at the start, before the others
.br
.B prei
- as previous, but URLs of inline objects come first
.br
.RE
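The four -url_strategy values can be read as rules for how newly found URLs
enter a pending queue. The following Python sketch illustrates that reading
only; it is not pavuk's implementation. The function name order_queue, the
(url, is_inline) pairs and the sample URLs are hypothetical, and the
interpretation of "level" as appending at the end and "pre" as inserting at
the front is an assumption drawn from the descriptions above.

```python
from collections import deque

STRATEGIES = ("level", "leveli", "pre", "prei")

def order_queue(queue, found, strategy):
    """Add URLs found in one HTML document to the pending queue.

    queue    -- deque of URLs still waiting to be downloaded
    found    -- list of (url, is_inline) pairs in document order
    strategy -- one of the -url_strategy names (assumed semantics)
    """
    if strategy not in STRATEGIES:
        raise ValueError("unknown strategy: %s" % strategy)

    items = list(found)
    if strategy.endswith("i"):
        # Inline objects first; the stable sort keeps document order
        # within the inline and non-inline groups.
        items.sort(key=lambda item: not item[1])
    urls = [url for url, _ in items]

    if strategy.startswith("pre"):
        # Insert at the start of the queue, before the other URLs.
        queue.extendleft(reversed(urls))
    else:
        # "level" and "leveli": append in the order the URLs were loaded.
        queue.extend(urls)
    return queue

if __name__ == "__main__":
    found = [("p1.html", False), ("img.png", True), ("p2.html", False)]
    for name in STRATEGIES:
        print(name, list(order_queue(deque(["a", "b"]), found, name)))
```

Under these assumptions, "level" behaves like a breadth-first crawl, while
"pre" follows the links of the most recently parsed document first.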
