📄 ch20_04.htm

📁 By Tom Christiansen and Nathan Torkington ISBN 1-56592-243-3 First Edition, published August 1998
<HTML><HEAD><TITLE>Recipe 20.3. Extracting URLs (Perl Cookbook)</TITLE><META NAME="DC.title" CONTENT="Perl Cookbook"><META NAME="DC.creator" CONTENT="Tom Christiansen &amp; Nathan Torkington"><META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc."><META NAME="DC.date" CONTENT="1999-07-02T01:45:56Z"><META NAME="DC.type" CONTENT="Text.Monograph"><META NAME="DC.format" CONTENT="text/html" SCHEME="MIME"><META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN"><META NAME="DC.language" CONTENT="en-US"><META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0"><LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments"><LINK REL="up" HREF="ch20_01.htm" TITLE="20. Web Automation"><LINK REL="prev" HREF="ch20_03.htm" TITLE="20.2. Automating Form Submission"><LINK REL="next" HREF="ch20_05.htm" TITLE="20.4. Converting ASCII to HTML"></HEAD><BODY BGCOLOR="#FFFFFF"><img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_03.htm" TITLE="20.2. Automating Form Submission"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 20.2. Automating Form Submission" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"><A CLASS="chapter" REL="up" HREF="ch20_01.htm" TITLE="20. Web Automation"></A></FONT></B></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_05.htm" TITLE="20.4. Converting ASCII to HTML"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.4. Converting ASCII to HTML" BORDER="0"></A></TD></TR></TABLE></DIV><DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch20-25551">20.3. 
Extracting URLs<A CLASS="indexterm" NAME="ch20-idx-1000002602-0"></A><A CLASS="indexterm" NAME="ch20-idx-1000002602-1"></A><A CLASS="indexterm" NAME="ch20-idx-1000002602-2"></A><A CLASS="indexterm" NAME="ch20-idx-1000002602-3"></A><A CLASS="indexterm" NAME="ch20-idx-1000002602-4"></A><A CLASS="indexterm" NAME="ch20-idx-1000002602-5"></A></A></H2><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-289">Problem</A></H3><P CLASS="para">You want to extract all URLs from an HTML file.</P></DIV><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-295">Solution</A></H3><P CLASS="para">Use the HTML::LinkExtor module from CPAN:</P><PRE CLASS="programlisting">use HTML::LinkExtor;

$parser = HTML::LinkExtor-&gt;new(undef, $base_url);
$parser-&gt;parse_file($filename);
@links = $parser-&gt;links;
foreach $linkarray (@links) {
    my @element = @$linkarray;
    my $elt_type = shift @element;                  # element type

    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
}</PRE></DIV><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-333">Discussion</A></H3><P CLASS="para">You can use HTML::LinkExtor in two different ways: either call <CODE CLASS="literal">links</CODE> to get a list of all links in the document once it is completely parsed, or pass a code reference as the first argument to <CODE CLASS="literal">new</CODE>. The referenced function is then called on each link as the document is parsed.</P><P CLASS="para">The <CODE CLASS="literal">links</CODE> method clears the link list, so you can call it only once per parsed document. It returns a list of array references, one for each element that contains links. Each array holds the tag name (in lowercase) at the front, followed by attribute name and attribute value pairs. 
For instance, the HTML:</P><PRE CLASS="programlisting">&lt;A HREF=&quot;http://www.perl.com/&quot;&gt;Home page&lt;/A&gt;
&lt;IMG SRC=&quot;images/big.gif&quot; LOWSRC=&quot;images/big-lowres.gif&quot;&gt;</PRE><P CLASS="para">would return a data structure like this:</P><PRE CLASS="programlisting">[
  [ a,   href   =&gt; &quot;http://www.perl.com/&quot; ],
  [ img, src    =&gt; &quot;images/big.gif&quot;,
         lowsrc =&gt; &quot;images/big-lowres.gif&quot; ]
]</PRE><P CLASS="para">Here's an example of how you would use the <CODE CLASS="literal">$elt_type</CODE> and the <CODE CLASS="literal">$attr_name</CODE> to print out an anchor and an image:</P><PRE CLASS="programlisting">if ($elt_type eq 'a' &amp;&amp; $attr_name eq 'href') {
    # scheme is a URI method, so this assumes a base URL was passed
    # to new and the attribute values are URI objects
    print &quot;ANCHOR: $attr_value\n&quot;
        if $attr_value-&gt;scheme =~ /http|ftp/;
}
if ($elt_type eq 'img' &amp;&amp; $attr_name eq 'src') {
    print &quot;IMAGE:  $attr_value\n&quot;;
}</PRE><P CLASS="para"><A CLASS="xref" HREF="ch20_04.htm#ch20-42565" TITLE="xurl">Example 20.2</A> is a complete program that takes a URL as its argument, like file:///tmp/testing.html or http://www.ora.com/, and prints an alphabetically sorted list of unique URLs to standard output.</P><DIV CLASS="example"><H4 CLASS="example"><A CLASS="title" NAME="ch20-42565">Example 20.2: xurl</A></H4><PRE CLASS="programlisting">#!/usr/bin/perl -w
# xurl - extract unique, sorted list of links from URL
use HTML::LinkExtor;
use LWP::Simple;

$base_url = shift;
$parser = HTML::LinkExtor-&gt;new(undef, $base_url);
$parser-&gt;parse(get($base_url))-&gt;eof;
@links = $parser-&gt;links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name , $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}
for (sort keys %seen) { print $_, &quot;\n&quot; }</PRE></DIV><P CLASS="para">This program does have a limitation: if the <CODE CLASS="literal">get</CODE> of <CODE CLASS="literal">$base_url</CODE> involves a 
redirection, your links will all be resolved against the original URL instead of the URL at the end of the redirection. To fix this, fetch the document with LWP::UserAgent and examine the response code to find out whether a redirection occurred. Once you know the post-redirection URL (if any), construct the HTML::LinkExtor object.</P><P CLASS="para">Here's an example of the run:</P><PRE CLASS="programlisting">% xurl http://www.perl.com/CPAN
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>ftp://ftp@ftp.perl.com/CPAN/CPAN.html</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>http://language.perl.com/misc/CPAN.cgi</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>http://language.perl.com/misc/cpan_module</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>http://language.perl.com/misc/getcpan</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>http://www.perl.com/gifs/lcb.xbm</I></CODE></B></CODE>
<CODE CLASS="userinput"><B><CODE CLASS="replaceable"><I>http://www.perl.com/index.html</I></CODE></B></CODE></PRE><P CLASS="para">Often in mail or Usenet messages, you'll see URLs written as:</P><PRE CLASS="programlisting">&lt;URL:http://www.perl.com&gt;</PRE><P CLASS="para">This convention is meant to make it easy to pick URLs out of messages:</P><PRE CLASS="programlisting">@URLs = ($message =~ /&lt;URL:(.*?)&gt;/g);<A CLASS="indexterm" NAME="ch20-idx-1000002604-0"></A><A CLASS="indexterm" NAME="ch20-idx-1000002604-1"></A><A CLASS="indexterm" NAME="ch20-idx-1000002604-2"></A><A CLASS="indexterm" NAME="ch20-idx-1000002604-3"></A><A CLASS="indexterm" NAME="ch20-idx-1000002604-4"></A><A CLASS="indexterm" NAME="ch20-idx-1000002604-5"></A></PRE></DIV><DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-437">See Also</A></H3><P CLASS="para">The documentation for the CPAN modules LWP::Simple, HTML::LinkExtor, and HTML::Entities; <A CLASS="xref" HREF="ch20_02.htm" TITLE="Fetching a URL from a Perl 
Script">Recipe 20.1</A></P></DIV></DIV><DIV CLASS="htmlnav"><P></P><HR ALIGN="LEFT" WIDTH="684" TITLE="footer"><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_03.htm" TITLE="20.2. Automating Form Submission"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 20.2. Automating Form Submission" BORDER="0"></A></TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="book" HREF="index.htm" TITLE="Perl Cookbook"><IMG SRC="../gifs/txthome.gif" ALT="Perl Cookbook" BORDER="0"></A></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_05.htm" TITLE="20.4. Converting ASCII to HTML"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.4. Converting ASCII to HTML" BORDER="0"></A></TD></TR><TR><TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228">20.2. Automating Form Submission</TD><TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="index" HREF="index/index.htm" TITLE="Book Index"><IMG SRC="../gifs/index.gif" ALT="Book Index" BORDER="0"></A></TD><TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228">20.4. Converting ASCII to HTML</TD></TR></TABLE><HR ALIGN="LEFT" WIDTH="684" TITLE="footer"><FONT SIZE="-1"></DIV><!-- LIBRARY NAV BAR --> <img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p> <a href="copyrght.htm">Copyright &copy; 2002</a> O'Reilly &amp; Associates. All rights reserved.</font> </p> <map name="library-map"> <area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map> </BODY></HTML>
