presenting a broken link.  The fact that the former links are converted to relative links ensures that you can move the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been downloaded.  Because of that, the work done by @samp{-k} will be performed at the end of all the downloads.

@cindex backing up converted files
@item -K
@itemx --backup-converted
When converting a file, back up the original version with a @samp{.orig} suffix.  Affects the behavior of @samp{-N} (@pxref{HTTP Time-Stamping Internals}).

@item -m
@itemx --mirror
Turn on options suitable for mirroring.  This option turns on recursion and time-stamping, sets infinite recursion depth and keeps @sc{ftp} directory listings.  It is currently equivalent to @samp{-r -N -l inf -nr}.

@cindex page requisites
@cindex required images, downloading
@item -p
@itemx --page-requisites
This option causes Wget to download all the files that are necessary to properly display a given @sc{html} page.  This includes such things as inlined images, sounds, and referenced stylesheets.

Ordinarily, when downloading a single @sc{html} page, any requisite documents that may be needed to display it properly are not downloaded.  Using @samp{-r} together with @samp{-l} can help, but since Wget does not ordinarily distinguish between external and inlined documents, one is generally left with ``leaf documents'' that are missing their requisites.

For instance, say document @file{1.html} contains an @code{<IMG>} tag referencing @file{1.gif} and an @code{<A>} tag pointing to external document @file{2.html}.  Say that @file{2.html} is similar but that its image is @file{2.gif} and it links to @file{3.html}.  Say this continues up to some arbitrarily high number.

If one executes the command:

@example
wget -r -l 2 http://@var{site}/1.html
@end example

then @file{1.html}, @file{1.gif}, @file{2.html}, @file{2.gif}, and @file{3.html} will be downloaded.  As you can see, @file{3.html} is without its requisite @file{3.gif} because Wget is simply counting the number of hops (up to 2) away from @file{1.html} in order to determine where to stop the recursion.  However, with this command:

@example
wget -r -l 2 -p http://@var{site}/1.html
@end example

all the above files @emph{and} @file{3.html}'s requisite @file{3.gif} will be downloaded.  Similarly,

@example
wget -r -l 1 -p http://@var{site}/1.html
@end example

will cause @file{1.html}, @file{1.gif}, @file{2.html}, and @file{2.gif} to be downloaded.  One might think that:

@example
wget -r -l 0 -p http://@var{site}/1.html
@end example

would download just @file{1.html} and @file{1.gif}, but unfortunately this is not the case, because @samp{-l 0} is equivalent to @samp{-l inf}---that is, infinite recursion.  To download a single @sc{html} page (or a handful of them, all specified on the command-line or in a @samp{-i} @sc{url} input file) and its (or their) requisites, simply leave off @samp{-r} and @samp{-l}:

@example
wget -p http://@var{site}/1.html
@end example

Note that Wget will behave as if @samp{-r} had been specified, but only that single page and its requisites will be downloaded.  Links from that page to external documents will not be followed.
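The @samp{-i} variant mentioned above works the same way.  As a sketch (@var{file} is only a placeholder), one could list several @sc{url}s, one per line, in a file and fetch each of them together with its requisites:

@example
wget -p -i @var{file}
@end example

Each @sc{url} in @var{file} is then treated as if it had been given on the command line.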
Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to @samp{-p}:

@example
wget -E -H -k -K -p http://@var{site}/@var{document}
@end example

To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL specified in an @code{<A>} tag, an @code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK REL="stylesheet">}.

@cindex @sc{html} comments
@cindex comments, @sc{html}
@item --strict-comments
Turn on strict parsing of @sc{html} comments.  The default is to terminate comments at the first occurrence of @samp{-->}.

According to specifications, @sc{html} comments are expressed as @sc{sgml} @dfn{declarations}.  A declaration is special markup that begins with @samp{<!} and ends with @samp{>}, such as @samp{<!DOCTYPE ...>}, and may contain comments between a pair of @samp{--} delimiters.  @sc{html} comments are ``empty declarations'', @sc{sgml} declarations without any non-comment text.  Therefore, @samp{<!--foo-->} is a valid comment, and so is @samp{<!--one-- --two-->}, but @samp{<!--1--2-->} is not.

On the other hand, most @sc{html} writers don't perceive comments as anything other than text delimited with @samp{<!--} and @samp{-->}, which is not quite the same.  For example, something like @samp{<!------------>} works as a valid comment as long as the number of dashes is a multiple of four (!).  If not, the comment technically lasts until the next @samp{--}, which may be at the other end of the document.  Because of this, many popular browsers completely ignore the specification and implement what users have come to expect: comments delimited with @samp{<!--} and @samp{-->}.

Until version 1.9, Wget interpreted comments strictly, which resulted in missing links in many web pages that displayed fine in browsers, but had the misfortune of containing non-compliant comments.  Beginning with version 1.9, Wget has joined the ranks of clients that implement ``naive'' comments, terminating each comment at the first occurrence of @samp{-->}.

If, for whatever reason, you want strict comment parsing, use this option to turn it on.
@end table

@node Recursive Accept/Reject Options,  , Recursive Retrieval Options, Invoking
@section Recursive Accept/Reject Options

@table @samp
@item -A @var{acclist} --accept @var{acclist}
@itemx -R @var{rejlist} --reject @var{rejlist}
Specify comma-separated lists of file name suffixes or patterns to accept or reject (@pxref{Types of Files} for more details).

@item -D @var{domain-list}
@itemx --domains=@var{domain-list}
Set domains to be followed.  @var{domain-list} is a comma-separated list of domains.  Note that it does @emph{not} turn on @samp{-H}.

@item --exclude-domains @var{domain-list}
Specify the domains that are @emph{not} to be followed (@pxref{Spanning Hosts}).

@cindex follow FTP links
@item --follow-ftp
Follow @sc{ftp} links from @sc{html} documents.  Without this option, Wget will ignore all the @sc{ftp} links.

@cindex tag-based recursive pruning
@item --follow-tags=@var{list}
Wget has an internal table of @sc{html} tag / attribute pairs that it considers when looking for linked documents during a recursive retrieval.  If a user wants only a subset of those tags to be considered, however, he or she should specify such tags in a comma-separated @var{list} with this option.
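For example (the host name is hypothetical), restricting a recursive retrieval so that only @code{<A>} and @code{<AREA>} tags are considered might look like:

@example
wget -r --follow-tags=a,area http://@var{site}/
@end example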
@item -G @var{list}
@itemx --ignore-tags=@var{list}
This is the opposite of the @samp{--follow-tags} option.  To skip certain @sc{html} tags when recursively looking for documents to download, specify them in a comma-separated @var{list}.

In the past, the @samp{-G} option was the best bet for downloading a single page and its requisites, using a command-line like:

@example
wget -Ga,area -H -k -K -r http://@var{site}/@var{document}
@end example

However, the author of this option came across a page with tags like @code{<LINK REL="home" HREF="/">} and came to the realization that @samp{-G} was not enough.  One can't just tell Wget to ignore @code{<LINK>}, because then stylesheets will not be downloaded.  Now the best bet for downloading a single page and its requisites is the dedicated @samp{--page-requisites} option.

@item -H
@itemx --span-hosts
Enable spanning across hosts when doing recursive retrieving (@pxref{Spanning Hosts}).

@item -L
@itemx --relative
Follow relative links only.  Useful for retrieving a specific home page without any distractions, not even those from the same hosts (@pxref{Relative Links}).

@item -I @var{list}
@itemx --include-directories=@var{list}
Specify a comma-separated list of directories you wish to follow when downloading (@pxref{Directory-Based Limits} for more details).  Elements of @var{list} may contain wildcards.

@item -X @var{list}
@itemx --exclude-directories=@var{list}
Specify a comma-separated list of directories you wish to exclude from download (@pxref{Directory-Based Limits} for more details).  Elements of @var{list} may contain wildcards.

@item -np
@itemx --no-parent
Do not ever ascend to the parent directory when retrieving recursively.  This is a useful option, since it guarantees that only the files @emph{below} a certain hierarchy will be downloaded.  @xref{Directory-Based Limits}, for more details.
@end table
@c man end

@node Recursive Retrieval, Following Links, Invoking, Top
@chapter Recursive Retrieval
@cindex recursion
@cindex retrieving
@cindex recursive retrieval

GNU Wget is capable of traversing parts of the Web (or a single @sc{http} or @sc{ftp} server), following links and directory structure.  We refer to this as @dfn{recursive retrieval}, or @dfn{recursion}.

With @sc{http} @sc{url}s, Wget retrieves and parses the @sc{html} from the given @sc{url}, retrieving the files the @sc{html} document refers to, through markup like @code{href} or @code{src}.  If the freshly downloaded file is also of type @code{text/html} or @code{application/xhtml+xml}, it will be parsed and followed further.

Recursive retrieval of @sc{http} and @sc{html} content is @dfn{breadth-first}.  This means that Wget first downloads the requested @sc{html} document, then the documents linked from that document, then the documents linked by them, and so on.  In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

The maximum @dfn{depth} to which the retrieval may descend is specified with the @samp{-l} option.  The default maximum depth is five layers.

When retrieving an @sc{ftp} @sc{url} recursively, Wget will retrieve all the data from the given directory tree (including the subdirectories up to the specified depth) on the remote server, creating its mirror image locally.  @sc{ftp} retrieval is also limited by the @code{depth} parameter.  Unlike @sc{http} recursion, @sc{ftp} recursion is performed depth-first.

By default, Wget will create a local directory tree, corresponding to the one found on the remote server.
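As a sketch of the above (the host name and directory are hypothetical), retrieving an @sc{ftp} directory tree up to three levels deep and recreating it locally might look like:

@example
wget -r -l 3 ftp://@var{site}/pub/
@end example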
Recursive retrieving can find a number of applications, the most important of which is mirroring.  It is also useful for @sc{www} presentations, and any other situations where slow network connections should be bypassed by storing the files locally.

You should be warned that recursive downloads can overload the remote servers.  Because of that, many administrators frown upon them and may ban access from your site if they detect very fast downloads of big amounts of content.  When downloading from Internet servers, consider using the @samp{-w} option to introduce a delay between accesses to the server.  The download will take a while longer, but the server administrator will not be alarmed by your rudeness.

Of course, recursive download may cause problems on your machine.  If left to run unchecked, it can easily fill up the disk.  If downloading from the local network, it can also take bandwidth on the system, as well as consume memory and CPU.

Try to specify the criteria that match the kind of download you are trying to achieve.  If you want to download only one page, use @samp{--page-requisites} without any additional recursion.  If you want to download things under one directory, use @samp{-np} to avoid downloading things from other directories.  If you want to download all the files from one directory, use @samp{-l 1} to make sure the recursion depth never exceeds one.  @xref{Following Links}, for more information about this.

Recursive retrieval should be used with care.  Don't say you were not warned.

@node Following Links, Time-Stamping, Recursive Retrieval, Top
@chapter Following Links
@cindex links
@cindex following links

When retrieving recursively, one does not wish to retrieve loads of unnecessary data.  Most of the time the users bear in mind exactly what they want to download, and want Wget to follow only specific links.

For example, if you wish to download the music archive from @samp{fly.srk.fer.hr}, you will not want to download all the home pages that happen to be referenced by an obscure part of the archive.

Wget possesses several mechanisms that allow you to fine-tune which links it will follow.

@menu
* Spanning Hosts::         (Un)limiting retrieval based on host name.
* Types of Files::         Getting only certain files.
* Directory-Based Limits:: Getting only certain directories.
* Relative Links::         Follow relative links only.
* FTP Links::              Following FTP links.
@end menu

@node Spanning Hosts, Types of Files, Following Links, Following Links
@section Spanning Hosts
@cindex spanning hosts
@cindex hosts, spanning

Wget's recursive retrieval normally refuses to visit hosts different than the one you specified on the command line.  This is a reasonable default; without it, every retrieval would have the potential to turn your Wget into a small version of Google.

However, visiting different hosts, or @dfn{host spanning}, is sometimes a useful option.  Maybe the images are served from a different server.  Maybe you're mirroring a site that consists of pages interlinked between three servers.  Maybe the server has two equivalent names, and the @sc{html} pages refer to both interchangeably.

@table @asis
@item Span to any host---@samp{-H}
The @samp{-H} option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link.
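A minimal sketch of such a run (the host name is hypothetical; in practice one would normally combine @samp{-H} with the limiting mechanisms this chapter describes) might be:

@example
wget -r -H http://@var{site}/@var{document}
@end example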
