
📄 ch20_04.htm

📁 by Randal L. Schwartz and Tom Phoenix ISBN 0-596-00132-0 Third Edition, published July 2001. (See
<html><head><title>The HTML Modules (Perl in a Nutshell, 2nd Edition)</title><link rel="stylesheet" type="text/css" href="../style/style1.css" /><meta name="DC.Creator" content="Stephen Spainhour" /><meta name="DC.Format" content="text/xml" scheme="MIME" /><meta name="DC.Language" content="en-US" /><meta name="DC.Publisher" content="O'Reilly &amp; Associates, Inc." /><meta name="DC.Source" scheme="ISBN" content="0596002416L" /><meta name="DC.Subject.Keyword" content="stuff" /><meta name="DC.Title" content="Perl in a Nutshell, 2nd Edition" /><meta name="DC.Type" content="Text.Monograph" /></head><body bgcolor="#ffffff"><img src="gifs/smbanner.gif" usemap="#banner-map" border="0" alt="Book Home" /><map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Java and XSLT" /><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map><div class="navbar"><table width="684" border="0"><tr><td align="left" valign="top" width="228"><a href="ch20_03.htm"><img src="../gifs/txtpreva.gif" alt="Previous" border="0" /></a></td><td align="center" valign="top" width="228" /><td align="right" valign="top" width="228"><a href="ch20_05.htm"><img src="../gifs/txtnexta.gif" alt="Next" border="0" /></a></td></tr></table></div>
<h2 class="sect1">20.4. The HTML Modules</h2>
<p><a name="INDEX-2574" /><a name="INDEX-2575" />The HTML modules provide an interface for parsing HTML documents. After you parse a document, you can print or display it according to the markup tags or extract specific information such as hyperlinks.</p>
<p><a name="INDEX-2576" /><a name="INDEX-2577" />The HTML::Parser module provides methods for, literally, parsing HTML. It can handle HTML text from a string or file and can separate out the syntactic structures and data. You shouldn't use HTML::Parser directly, however, since its interface hasn't been designed to make your life easy when you parse HTML. 
It's merely a base class from which you can build your own parser to deal with HTML in any way you want. And if you don't want to roll your own HTML parser or parser class, then there's always HTML::TokeParser and HTML::TreeBuilder, both of which are covered in this chapter.</p>
<p>HTML::TreeBuilder is a class that parses HTML into a syntax tree. In a syntax tree, each element of the HTML, such as container elements with beginning and end tags, is stored relative to other elements. This preserves the nested structure and behavior of HTML and its hierarchy.</p>
<p>A syntax tree of the TreeBuilder class is formed of connected nodes that represent each element of the HTML document. These nodes are saved as objects from the HTML::Element class. An HTML::Element object stores all the information from an HTML tag: the start tag, end tag, attributes, plain text, and pointers to any nested elements.</p>
<p>The remaining classes of the HTML modules use syntax trees and their element nodes to output useful information from the HTML documents. The format classes, such as HTML::FormatText and HTML::FormatPS, allow you to produce text and PostScript from HTML. The HTML::LinkExtor class extracts all of the links from a document. Additional modules provide means for replacing HTML character entities and implementing HTML tags as subroutines.</p>
<a name="perlnut2-CHP-20-SECT-4.1" /><div class="sect2"><h3 class="sect2">20.4.1. HTML::Parser</h3>
<p><a name="INDEX-2578" /><a name="INDEX-2579" />This module implements the base class for the other HTML modules. A parser object is created with the <tt class="literal">new</tt> constructor:</p>
<blockquote><pre class="code">$p = HTML::Parser-&gt;new( );</pre></blockquote>
<p>The constructor takes no arguments.</p>
<p>The parser object provides methods that read in HTML from a string or a file. The string-reading method can take data in several smaller chunks if the HTML is too big. 
Each chunk of HTML will be appended to the object, and the <tt class="literal">eof</tt> method indicates the end of the document. These basic methods are described below.</p>
<a name="INDEX-2580" /><div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>eof</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>$<em class="replaceable">p</em>-&gt;eof(  )</pre>
<p><a name="INDEX-2580" />Indicates the end of a document and flushes any buffered text. Returns the parser object.</p></div>
<a name="INDEX-2581" /><div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>parse</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>$<em class="replaceable">p</em>-&gt;parse(<em class="replaceable">string</em>)</pre>
<p><a name="INDEX-2581" />Reads HTML into the parser object from a given <em class="replaceable"><tt>string</tt></em>. Performance problems occur if the string is too large, so the HTML can be broken up into smaller pieces, which will be appended to the data already contained in the object. 
The parse can be terminated with a call to the <tt class="literal">eof</tt> method.</p></div>
<a name="INDEX-2582" /><div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>parse_file</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>$<em class="replaceable">p</em>-&gt;parse_file(<em class="replaceable">file</em>)</pre>
<p><a name="INDEX-2582" />Reads HTML into the parser object from the given <em class="replaceable"><tt>file</tt></em>, which can be a filename or an open filehandle.</p></div>
<p>When the <tt class="literal">parse</tt> or <tt class="literal">parse_file</tt> method is called, it parses the incoming HTML with a few internal methods. In HTML::Parser, these methods are defined, but empty. Additional HTML parsing classes (included in the HTML modules or ones you write yourself) override these methods for their own purposes. For example:</p>
<blockquote><pre class="code">package HTML::MyParser;
require HTML::Parser;
@ISA = qw(HTML::Parser);    # inherit from the base parser class

sub start {
    <em class="replaceable"><tt>your subroutine defined here</tt></em>
}</pre></blockquote>
<p>The following list shows the internal methods contained in HTML::Parser.</p>
<div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>comment</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>comment(<em class="replaceable">comment</em>)</pre>
<p>Invoked on comments from HTML (text between <tt class="literal">&lt;!--</tt> and <tt class="literal">--&gt;</tt>). 
The text of the comment (without the tags) is given to the method as the string <em class="replaceable"><tt>comment</tt></em>.</p></div>
<div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>end</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>end(<em class="replaceable">tag</em>, <em class="replaceable">origtext</em>)</pre>
<p>Invoked on end tags (those with the <tt class="literal">&lt;/tag&gt;</tt> form). The first argument, <em class="replaceable"><tt>tag</tt></em>, is the tag name in lowercase, and the second argument, <em class="replaceable"><tt>origtext</tt></em>, is the original HTML text of the tag.</p></div>
<div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>start</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>start(<em class="replaceable">tag</em>, $<em class="replaceable">attr</em>, <em class="replaceable">attrseq</em>, <em class="replaceable">origtext</em>)</pre>
<p>Invoked on start tags. The first argument, <em class="replaceable"><tt>tag</tt></em>, is the name of the tag in lowercase. The second argument is a reference to a hash, <em class="replaceable"><tt>attr</tt></em>. This hash contains all the attributes and their values in key/value pairs. The keys are the names of the attributes in lowercase. The third argument, <em class="replaceable"><tt>attrseq</tt></em>, is a reference to an array that contains the names of all the attributes in the order they appeared in the tag. 
The fourth argument, <em class="replaceable"><tt>origtext</tt></em>, is a string that contains the original text of the tag.</p></div>
<a name="INDEX-2583" /><a name="INDEX-2584" /><div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>text</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>text(<em class="replaceable">text</em>)</pre>
<p>Invoked on plain text in the document. The text is passed unmodified and may contain newlines. Character entities in the text are not expanded<a name="INDEX-2583" /><a name="INDEX-2584" />.</p></div>
<div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>xml_mode</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>xml_mode(<em class="replaceable">bool</em>)</pre>
<p>Enabling this attribute changes the parser to allow some XML constructs such as empty element tags and XML processing instructions. It also disables forcing tag and attribute names to lowercase when they are reported by the <tt class="literal">tagname</tt> and <tt class="literal">attr</tt> arguments, and suppresses special treatment of elements parsed as CDATA for HTML.</p></div></div>
<a name="perlnut2-CHP-20-SECT-4.2" /><div class="sect2"><h3 class="sect2">20.4.2. HTML::TokeParser</h3>
<p>As we said, you should use a subclassed HTML parser if you want a better interface to HTML parsing features than what HTML::Parser gives you. HTML::TokeParser by Gisle Aas is one such example. 
While HTML::TokeParser is actually a subclass of HTML::PullParser, it can help you do many useful things, such as link extraction and HTML checking.</p>
<p>In short, HTML::TokeParser breaks an HTML document into tokens, attributes, and content, in which the HTML <tt class="literal">&lt;a href="http://url"&gt;link&lt;/a&gt;</tt> would break down as:</p>
<blockquote><pre class="code">token: a
    attrib: href
content: http://url
content: link
token: /a</pre></blockquote>
<p>For example, you can use HTML::TokeParser to extract links from a string that contains HTML:</p>
<blockquote><pre class="code">#!/usr/local/bin/perl -w
require HTML::TokeParser;

# Our string that turns out to be HTML!
my $html = '&lt;p&gt;Some text. &lt;a href="http://blah"&gt;My name is Nate!&lt;/a&gt;&lt;/p&gt;';

my $parser = HTML::TokeParser-&gt;new(\$html);

# get_tag( ) tells TokeParser to match a tag by name
while (my $token = $parser-&gt;get_tag("a")) {
    my $url = $token-&gt;[1]{href} || "-";
    my $text = $parser-&gt;get_trimmed_text("/a");
    print "URL is: $url.\nURL text is: $text.\n";
}</pre></blockquote>
<a name="perlnut2-CHP-20-SECT-4.2.1" /><div class="sect3"><h3 class="sect3">20.4.2.1. HTML::TokeParser methods</h3>
<div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>new</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>new(  )</pre>
<p>Constructor. Takes a filename, a filehandle, or a reference to a scalar as its argument. The argument represents the content that will be parsed. If a plain scalar is passed, <tt class="literal">new</tt> treats it as the filename <tt class="literal">$scalar</tt>. If a reference to a scalar is passed, <tt class="literal">new</tt> looks for HTML in <tt class="literal">\$scalar</tt>. <tt class="literal">new</tt> will read filehandles until end-of-file. 
Returns <tt class="literal">undef</tt> on failure.</p></div>
<div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>get_tag</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>get_tag(  )</pre>
<p>Returns the next start or end tag in a document. If there are no remaining start or end tags, <tt class="literal">get_tag</tt> returns <tt class="literal">undef</tt>. When called with a tag name, as in <tt class="literal">get_tag("a")</tt>, it is useful because it skips unwanted tokens and matches only the tag that you want, if it exists. When a start tag is found, it is returned as an array reference, like so: <tt class="literal">[$tag, $attr, $attrseq, $text]</tt>. If an end tag is found, it is returned as <tt class="literal">["/$tag", $text]</tt>, with the tag name prefixed by a slash.</p></div>
<div class="refentry"><table width="515" border="0" cellpadding="5"><tr><td align="left"><font size="+1"><b>get_text</b></font></td><td align="right"><i></i></td></tr></table><hr width="515" size="3" noshade="true" align="left" color="black" /><pre>get_text(  )</pre>
<p>Returns all text found at the current position. If the next token is not text, <tt class="literal">get_text</tt> returns a zero-length string. You can pass an <tt class="literal">"$end_tag"</tt> argument, such as <tt class="literal">"/a"</tt>, to <tt class="literal">get_text</tt> to return all of the text before that end tag.</p></div>
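<p>The <tt class="literal">get_tag</tt> and <tt class="literal">get_text</tt> methods can be combined as in the short script below. It is only a sketch, and it assumes that HTML::TokeParser is installed from CPAN. The sample markup, the example.com URL, and the <tt class="literal">@links</tt> array are made up for the illustration.</p>
<blockquote><pre class="code">#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, invented for this illustration.
my $html = '&lt;h1&gt;Links&lt;/h1&gt;&lt;p&gt;&lt;a href="http://www.example.com/"&gt;Example&lt;/a&gt; '
         . 'and &lt;a name="top"&gt;an anchor&lt;/a&gt;&lt;/p&gt;';

# A reference to a scalar tells new( ) to parse the string itself.
my $parser = HTML::TokeParser-&gt;new(\$html)
    or die "Cannot create parser";

my @links;

# get_tag("a") skips every token up to the next &lt;a&gt; start tag.
while (my $tag = $parser-&gt;get_tag("a")) {
    # A start tag arrives as [$tag, $attr, $attrseq, $text];
    # $attr is a hash reference keyed by lowercase attribute name.
    my $href = $tag-&gt;[1]{href};

    # get_text("/a") returns the text up to the closing &lt;/a&gt; tag.
    my $text = $parser-&gt;get_text("/a");

    # Skip anchors that carry no href attribute.
    next unless defined $href;
    push @links, [$href, $text];
}

print "$_-&gt;[0] =&gt; $_-&gt;[1]\n" for @links;</pre></blockquote>
<p>Collecting the pairs in an array rather than printing them inside the loop keeps the extraction separate from the output, so the same loop can feed a link checker or a report.</p>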
