By Tom Christiansen and Nathan Torkington. ISBN 1-56592-243-3. First Edition, published August 1998.
<HTML><HEAD><TITLE>Recipe 20.7. Finding Stale Links (Perl Cookbook)</TITLE>
<META NAME="DC.title" CONTENT="Perl Cookbook">
<META NAME="DC.creator" CONTENT="Tom Christiansen &amp; Nathan Torkington">
<META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc.">
<META NAME="DC.date" CONTENT="1999-07-02T01:45:59Z">
<META NAME="DC.type" CONTENT="Text.Monograph">
<META NAME="DC.format" CONTENT="text/html" SCHEME="MIME">
<META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN">
<META NAME="DC.language" CONTENT="en-US">
<META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0">
<LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments">
<LINK REL="up" HREF="ch20_01.htm" TITLE="20. Web Automation">
<LINK REL="prev" HREF="ch20_07.htm" TITLE="20.6. Extracting or Removing HTML Tags">
<LINK REL="next" HREF="ch20_09.htm" TITLE="20.8. Finding Fresh Links">
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map>
<div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_07.htm" TITLE="20.6. Extracting or Removing HTML Tags"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 20.6. Extracting or Removing HTML Tags" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"><A CLASS="chapter" REL="up" HREF="ch20_01.htm" TITLE="20. Web Automation"></A></FONT></B></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_09.htm" TITLE="20.8. Finding Fresh Links"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.8. Finding Fresh Links" BORDER="0"></A></TD></TR></TABLE></DIV>
<DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch20-14595">20.7. Finding Stale Links</A></H2>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-1000002646">Problem<A CLASS="indexterm" NAME="ch20-idx-1000002650-0"></A><A CLASS="indexterm" NAME="ch20-idx-1000002650-1"></A><A CLASS="indexterm" NAME="ch20-idx-1000002650-2"></A><A CLASS="indexterm" NAME="ch20-idx-1000002650-3"></A><A CLASS="indexterm" NAME="ch20-idx-1000002650-4"></A><A CLASS="indexterm" NAME="ch20-idx-1000002650-5"></A></A></H3>
<P CLASS="para">You want to check whether a document contains invalid links.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-777">Solution</A></H3>
<P CLASS="para">Use the technique outlined in <A CLASS="xref" HREF="ch20_04.htm" TITLE="Extracting URLs">Recipe 20.3</A> to extract each link, and then use the LWP::Simple module's <CODE CLASS="literal">head</CODE> function to make sure that link exists.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-783">Discussion</A></H3>
<P CLASS="para"><A CLASS="xref" HREF="ch20_08.htm#ch20-77587" TITLE="churl">Example 20.5</A> is an applied example of the link-extraction technique. Instead of just printing the name of the link, we call the LWP::Simple module's <CODE CLASS="literal">head</CODE> function on it. The HEAD method fetches the remote document's metainformation to determine status information without downloading the whole document. If it fails, then the link is bad so we print an appropriate message.</P>
<P CLASS="para">Because this program uses the <CODE CLASS="literal">get</CODE> function from LWP::Simple, it is expecting a URL, not a filename.
If you want to supply either, use the <A CLASS="indexterm" NAME="ch20-idx-1000003872-0"></A>URI::Heuristic module described in <A CLASS="xref" HREF="ch20_02.htm" TITLE="Fetching a URL from a Perl Script">Recipe 20.1</A>.</P>
<DIV CLASS="example"><H4 CLASS="example"><A CLASS="title" NAME="ch20-77587">Example 20.5: churl</A></H4>
<PRE CLASS="programlisting">#!/usr/bin/perl -w
# churl - check urls

use HTML::LinkExtor;
use LWP::Simple qw(get head);

$base_url = shift
    or die &quot;usage: $0 &lt;start_url&gt;\n&quot;;

# fetch the starting page; get() returns undef on failure
$html = get($base_url);
die &quot;Couldn't fetch $base_url\n&quot; unless defined $html;

# passing $base_url makes the parser return absolute URLs
$parser = HTML::LinkExtor-&gt;new(undef, $base_url);
$parser-&gt;parse($html);
@links = $parser-&gt;links;

print &quot;$base_url: \n&quot;;
foreach $linkarray (@links) {
    # each entry is [ tag, attr =&gt; url, attr =&gt; url, ... ]
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name , $attr_value) = splice(@element, 0, 2);
        # only check schemes that head() can sensibly test
        if ($attr_value-&gt;scheme =~ /\b(ftp|https?|file)\b/) {
            print &quot;  $attr_value: &quot;, head($attr_value) ? &quot;OK&quot; : &quot;BAD&quot;, &quot;\n&quot;;
        }
    }
}</PRE></DIV>
<P CLASS="para">Here's an example of a program run:</P>
<PRE CLASS="programlisting">% churl http://www.wizards.com
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>http://www.wizards.com:</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>  FrontPage/FP_Color.gif:  OK</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>  FrontPage/FP_BW.gif:  BAD</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>  #FP_Map:  OK</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>  Games_Library/Welcome.html:  OK</I></CODE></B></CODE></PRE>
<P CLASS="para">This program has the same limitation as the HTML::LinkExtor program in <A CLASS="xref" HREF="ch20_04.htm" TITLE="Extracting URLs">Recipe
20.3</A>.<A CLASS="indexterm" NAME="ch20-idx-1000003913-0"></A><A CLASS="indexterm" NAME="ch20-idx-1000003913-1"></A></P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-851">See Also</A></H3>
<P CLASS="para">The documentation for the CPAN modules HTML::LinkExtor, LWP::Simple, LWP::UserAgent, and HTTP::Response; <A CLASS="xref" HREF="ch20_09.htm" TITLE="Finding Fresh Links">Recipe 20.8</A></P></DIV></DIV>
<DIV CLASS="htmlnav"><P></P><HR ALIGN="LEFT" WIDTH="684" TITLE="footer"><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_07.htm" TITLE="20.6. Extracting or Removing HTML Tags"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 20.6. Extracting or Removing HTML Tags" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="book" HREF="index.htm" TITLE="Perl Cookbook"><IMG SRC="../gifs/txthome.gif" ALT="Perl Cookbook" BORDER="0"></A></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_09.htm" TITLE="20.8. Finding Fresh Links"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.8. Finding Fresh Links" BORDER="0"></A></TD></TR><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228">20.6. Extracting or Removing HTML Tags</TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="index" HREF="index/index.htm" TITLE="Book Index"><IMG SRC="../gifs/index.gif" ALT="Book Index" BORDER="0"></A></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228">20.8. Finding Fresh Links</TD></TR></TABLE><HR ALIGN="LEFT" WIDTH="684" TITLE="footer"><FONT SIZE="-1"></DIV>
<!-- LIBRARY NAV BAR --> <img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p> <a href="copyrght.htm">Copyright &copy; 2002</a> O'Reilly &amp; Associates. All rights reserved.</font> </p>
<map name="library-map"> <area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map> </BODY></HTML>