📄 ch20_07.htm

📁 By Tom Christiansen and Nathan Torkington ISBN 1-56592-243-3 First Edition, published August 1998
<HTML><HEAD><TITLE>Recipe 20.6. Extracting or Removing HTML Tags (Perl Cookbook)</TITLE>
<META NAME="DC.title" CONTENT="Perl Cookbook">
<META NAME="DC.creator" CONTENT="Tom Christiansen &amp; Nathan Torkington">
<META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc.">
<META NAME="DC.date" CONTENT="1999-07-02T01:45:58Z">
<META NAME="DC.type" CONTENT="Text.Monograph">
<META NAME="DC.format" CONTENT="text/html" SCHEME="MIME">
<META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN">
<META NAME="DC.language" CONTENT="en-US">
<META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0">
<LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments">
<LINK REL="up" HREF="ch20_01.htm" TITLE="20. Web Automation">
<LINK REL="prev" HREF="ch20_06.htm" TITLE="20.5. Converting HTML to ASCII">
<LINK REL="next" HREF="ch20_08.htm" TITLE="20.7. Finding Stale Links">
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map>
<div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_06.htm" TITLE="20.5. Converting HTML to ASCII"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 20.5. Converting HTML to ASCII" BORDER="0"></A></TD>
<TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"><A CLASS="chapter" REL="up" HREF="ch20_01.htm" TITLE="20. Web Automation"></A></FONT></B></TD>
<TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_08.htm" TITLE="20.7. Finding Stale Links"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.7. Finding Stale Links" BORDER="0"></A></TD>
</TR></TABLE></DIV>
<DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch20-22334">20.6. 
Extracting or Removing HTML Tags</A></H2>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-599">Problem<A CLASS="indexterm" NAME="ch20-idx-1000002635-0"></A><A CLASS="indexterm" NAME="ch20-idx-1000002635-1"></A><A CLASS="indexterm" NAME="ch20-idx-1000002635-2"></A><A CLASS="indexterm" NAME="ch20-idx-1000002635-3"></A><A CLASS="indexterm" NAME="ch20-idx-1000002635-4"></A><A CLASS="indexterm" NAME="ch20-idx-1000002635-5"></A><A CLASS="indexterm" NAME="ch20-idx-1000002635-6"></A><A CLASS="indexterm" NAME="ch20-idx-1000002635-7"></A></A></H3>
<P CLASS="para">You want to remove HTML tags from a string, leaving just plain text.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-605">Solution</A></H3>
<P CLASS="para">The following oft-cited solution is simple but wrong on all but the most trivial HTML:</P>
<PRE CLASS="programlisting">($plain_text = $html_text) =~ s/&lt;[^&gt;]*&gt;//gs;     #WRONG</PRE>
<P CLASS="para">A correct but slower and slightly more complicated way is to use the CPAN modules:</P>
<PRE CLASS="programlisting">use HTML::Parse;
use HTML::FormatText;
$plain_text = HTML::FormatText-&gt;new-&gt;format(parse_html($html_text));</PRE></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-621">Discussion</A></H3>
<P CLASS="para">As with almost everything else, there is more than one way to do it. Each solution attempts to strike a balance between speed and flexibility. 
Occasionally you may find HTML that's simple enough that a trivial command-line call will work:</P>
<PRE CLASS="programlisting">% perl -pe 's/&lt;[^&gt;]*&gt;//g' file</PRE>
<P CLASS="para">However, this will break on files whose tags cross line boundaries, like this:</P>
<PRE CLASS="programlisting">&lt;IMG SRC = &quot;foo.gif&quot;
     ALT = &quot;Flurp!&quot;&gt;</PRE>
<P CLASS="para">So, you'll see people doing this instead:</P>
<PRE CLASS="programlisting">% perl -0777 -pe 's/&lt;[^&gt;]*&gt;//gs' file</PRE>
<P CLASS="para">or its scripted equivalent:</P>
<PRE CLASS="programlisting">{
    local $/;               # temporary whole-file input mode
    $html = &lt;FILE&gt;;
    $html =~ s/&lt;[^&gt;]*&gt;//gs;
}</PRE>
<P CLASS="para">But even that isn't good enough except for simplistic HTML without any interesting bits in it. This approach fails for the following examples of valid HTML (among many others):</P>
<PRE CLASS="programlisting">&lt;IMG SRC = &quot;foo.gif&quot; ALT = &quot;A &gt; B&quot;&gt;
&lt;!-- &lt;A comment&gt; --&gt;
&lt;script&gt;if (a&lt;b &amp;&amp; a&gt;c)&lt;/script&gt;
&lt;# Just data #&gt;
&lt;![INCLUDE CDATA [ &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; ]]&gt;</PRE>
<P CLASS="para">If HTML comments include other tags, those solutions would also break on text like this:</P>
<PRE CLASS="programlisting">&lt;!-- This section commented out.
    &lt;B&gt;You can't see me!&lt;/B&gt;
--&gt;</PRE>
<P CLASS="para">The only solution that works well here is to use the HTML parsing routines from CPAN. 
The second code snippet shown above in the Solution demonstrates this better technique.</P>
<P CLASS="para">For more flexible parsing, subclass the HTML::Parser class and only record the text elements you see:</P>
<PRE CLASS="programlisting">package MyParser;
use HTML::Parser;
use HTML::Entities qw(decode_entities);
@ISA = qw(HTML::Parser);
sub text {
    my($self, $text) = @_;
    print decode_entities($text);
}
package main;
MyParser-&gt;new-&gt;parse_file(*F);</PRE>
<P CLASS="para">If you're only interested in simple tags that don't contain others <A CLASS="indexterm" NAME="ch20-idx-1000003776-0"></A>nested inside, you can often make do with an approach like the following, which extracts the title from a non-tricky HTML document:</P>
<PRE CLASS="programlisting">($title) = ($html =~ m#&lt;TITLE&gt;\s*(.*?)\s*&lt;/TITLE&gt;#is);</PRE>
<P CLASS="para">Again, the regex approach has its flaws, so a more complete solution using LWP to process the HTML is shown in <A CLASS="xref" HREF="ch20_07.htm#ch20-11677" TITLE="htitle">Example 20.4</A>.</P>
<DIV CLASS="example"><H4 CLASS="example"><A CLASS="title" NAME="ch20-11677">Example 20.4: htitle</A></H4>
<PRE CLASS="programlisting">#!/usr/bin/perl
# htitle - get html title from URL
die &quot;usage: $0 url ...\n&quot; unless @ARGV;
require LWP;
foreach $url (@ARGV) {
    $ua = LWP::UserAgent-&gt;new();
    $res = $ua-&gt;request(HTTP::Request-&gt;new(GET =&gt; $url));
    print &quot;$url: &quot; if @ARGV &gt; 1;
    if ($res-&gt;is_success) {
        print $res-&gt;title, &quot;\n&quot;;
    } else {
        print $res-&gt;status_line, &quot;\n&quot;;
    }
}</PRE></DIV>
<P CLASS="para">Here's an example of the output:</P>
<PRE CLASS="programlisting">% htitle http://www.ora.com
www.oreilly.com -- Welcome to O'Reilly &amp; Associates!
% htitle http://www.perl.com/ http://www.perl.com/nullvoid
http://www.perl.com/: The www.perl.com Home Page
http://www.perl.com/nullvoid: 404 File Not Found<A CLASS="indexterm" NAME="ch20-idx-1000002637-0"></A><A CLASS="indexterm" NAME="ch20-idx-1000002637-1"></A><A CLASS="indexterm" NAME="ch20-idx-1000002637-2"></A><A CLASS="indexterm" NAME="ch20-idx-1000002637-3"></A><A CLASS="indexterm" NAME="ch20-idx-1000002637-4"></A><A CLASS="indexterm" NAME="ch20-idx-1000002637-5"></A><A CLASS="indexterm" NAME="ch20-idx-1000002637-6"></A></PRE></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch20-pgfId-761">See Also</A></H3>
<P CLASS="para">The documentation for the CPAN modules HTML::TreeBuilder, HTML::Parser, HTML::Entities, and LWP::UserAgent; <A CLASS="xref" HREF="ch20_06.htm" TITLE="Converting HTML to ASCII">Recipe 20.5</A></P></DIV></DIV>
<DIV CLASS="htmlnav"><P></P><HR ALIGN="LEFT" WIDTH="684" TITLE="footer">
<TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_06.htm" TITLE="20.5. Converting HTML to ASCII"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 20.5. Converting HTML to ASCII" BORDER="0"></A></TD>
<TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="book" HREF="index.htm" TITLE="Perl Cookbook"><IMG SRC="../gifs/txthome.gif" ALT="Perl Cookbook" BORDER="0"></A></TD>
<TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch20_08.htm" TITLE="20.7. Finding Stale Links"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 20.7. Finding Stale Links" BORDER="0"></A></TD>
</TR><TR>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228">20.5. Converting HTML to ASCII</TD>
<TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><A CLASS="index" HREF="index/index.htm" TITLE="Book Index"><IMG SRC="../gifs/index.gif" ALT="Book Index" BORDER="0"></A></TD>
<TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228">20.7. Finding Stale Links</TD>
</TR></TABLE>
<HR ALIGN="LEFT" WIDTH="684" TITLE="footer"><FONT SIZE="-1"></DIV><!-- LIBRARY NAV BAR --> <img src="../gifs/smnavbar.gif" usemap="#library-map" border="0" alt="Library Navigation Links"><p> <a href="copyrght.htm">Copyright &copy; 2002</a> O'Reilly &amp; Associates. 
All rights reserved.</font> </p> <map name="library-map"> <area shape="rect" coords="1,0,85,94" href="../index.htm"><area shape="rect" coords="86,1,178,103" href="../lwp/index.htm"><area shape="rect" coords="180,0,265,103" href="../lperl/index.htm"><area shape="rect" coords="267,0,353,105" href="../perlnut/index.htm"><area shape="rect" coords="354,1,446,115" href="../prog/index.htm"><area shape="rect" coords="448,0,526,132" href="../tk/index.htm"><area shape="rect" coords="528,1,615,119" href="../cookbook/index.htm"><area shape="rect" coords="617,0,690,135" href="../pxml/index.htm"></map> </BODY></HTML>
