<?xml version="1.0" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<!-- saved from url=(0017)http://localhost/ -->
<script language="JavaScript" src="../../../displayToc.js"></script>
<script language="JavaScript" src="../../../tocParas.js"></script>
<script language="JavaScript" src="../../../tocTab.js"></script>
<link rel="stylesheet" type="text/css" href="../../../scineplex.css">
<title>HTML::Tree::Scanning -- article: &quot;Scanning HTML&quot;</title>
<link rel="stylesheet" href="../../../Active.css" type="text/css" />
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>

<body>

<script>writelinks('__top__',3);</script>
<h1><a>HTML::Tree::Scanning -- article: &quot;Scanning HTML&quot;</a></h1>
<p><a name="__index__"></a></p>

<!-- INDEX BEGIN -->

<ul>

	<li><a href="#name">NAME</a></li>
	<li><a href="#synopsis">SYNOPSIS</a></li>
	<li><a href="#description">DESCRIPTION</a></li>
	<li><a href="#scanning_html">Scanning HTML</a></li>
	<ul>

		<li><a href="#html__parser__html__treebuilder__and_html__element">HTML::Parser, HTML::TreeBuilder, and HTML::Element</a></li>
		<li><a href="#scanning_html_trees">Scanning HTML trees</a></li>
		<li><a href="#complex_criteria_in_tree_scanning">Complex Criteria in Tree Scanning</a></li>
		<li><a href="#a_case_study__scanning_yahoo_news_s_html">A Case Study: Scanning Yahoo News's HTML</a></li>
		<li><a href="#regardez__duvet_"><em>Regardez, duvet!</em></a></li>
		<li><a href="#_author_credit_">[Author Credit]</a></li>
	</ul>

	<li><a href="#back">BACK</a></li>
</ul>
<!-- INDEX END -->

<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>HTML::Tree::Scanning -- article: &quot;Scanning HTML&quot;</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
  <span class="comment"># This is an article, not a module.</span>
</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>The following article by Sean M. Burke first appeared in <em>The Perl
Journal</em> #19 and is copyright 2000 The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal.  This document may be
distributed under the same terms as Perl itself.</p>
<p>
</p>
<hr />
<h1><a name="scanning_html">Scanning HTML</a></h1>
<p>-- Sean M. Burke</p>
<p>In <em>The Perl Journal</em> issue 17, Ken MacFarlane's article &quot;Parsing
HTML with HTML::Parser&quot; describes how the HTML::Parser module scans
HTML source as a stream of start-tags, end-tags, text, comments, etc.
In TPJ #18, my &quot;Trees&quot; article kicked around the idea of tree-shaped
data structures.  Now I'll try to tie it together, in a discussion of
HTML trees.</p>
<p>The CPAN module HTML::TreeBuilder takes the
tags that HTML::Parser picks out, and builds a parse tree -- a
tree-shaped network of objects...</p>
<p>Footnote:
And if you need a quick explanation of objects, see my TPJ #17 article &quot;A
User's View of Object-Oriented Modules&quot;; or go whole hog and get Damian
Conway's excellent book <em>Object-Oriented Perl</em>, from Manning
Publications.</p>
<p>...representing the structured content of the HTML document.  And once
the document is parsed as a tree, you'll find the common tasks
of extracting data from that HTML document/tree to be quite
straightforward.</p>
<p>
</p>
<h2><a name="html__parser__html__treebuilder__and_html__element">HTML::Parser, HTML::TreeBuilder, and HTML::Element</a></h2>
<p>You use HTML::TreeBuilder to make a parse tree out of an HTML source
file, by simply saying:</p>
<pre>
  <span class="keyword">use</span> <span class="variable">HTML::TreeBuilder</span><span class="operator">;</span>
  <span class="keyword">my</span> <span class="variable">$tree</span> <span class="operator">=</span> <span class="variable">HTML::TreeBuilder</span><span class="operator">-&gt;</span><span class="variable">new</span><span class="operator">();</span>
  <span class="variable">$tree</span><span class="operator">-&gt;</span><span class="variable">parse_file</span><span class="operator">(</span><span class="string">'foo.html'</span><span class="operator">);</span>
</pre>
<p>and then <code>$tree</code> contains a parse tree built from the HTML source in
the file &quot;foo.html&quot;.  This parse tree is represented as a
network of objects -- <code>$tree</code> is the root, an element with tag-name
&quot;html&quot;, and its children typically include a &quot;head&quot; and &quot;body&quot; element,
and so on.  Elements in the tree are objects of the class
HTML::Element.</p>
<p>So, if you take this source:</p>
<pre>
  &lt;html&gt;&lt;head&gt;&lt;title&gt;Doc 1&lt;/title&gt;&lt;/head&gt;
  &lt;body&gt;
  Stuff &lt;hr&gt; 2000-08-17
  &lt;/body&gt;&lt;/html&gt;</pre>
<p>and feed it to HTML::TreeBuilder, it'll return a tree of objects that
looks like this:</p>
<pre>
               html
             /      \
         head        body
        /          /   |  \
     title    &quot;Stuff&quot;  hr  &quot;2000-08-17&quot;
       |
    &quot;Doc 1&quot;</pre>
<p>This is a pretty simple document, but if it were any more complex,
it'd be a bit hard to draw in that style, since it'd sprawl left and
right.  The same tree can be represented a bit more easily sideways,
with indenting:</p>
<pre>
  . html
     . head
        . title
           . &quot;Doc 1&quot;
     . body
        . &quot;Stuff&quot;
        . hr
        . &quot;2000-08-17&quot;</pre>
<p>Either way expresses the same structure.  In that structure, the root
node is an object of the class HTML::Element</p>
<p>Footnote:
Well actually, the root is of the class HTML::TreeBuilder, but that's
just a subclass of HTML::Element, plus the few extra methods like
<code>parse_file</code> that elaborate the tree</p>
<p>, with the tag name &quot;html&quot;, and with two children: HTML::Element
objects whose tag names are &quot;head&quot; and &quot;body&quot;.  And each of those
elements has children, and so on down.  Not all elements (as we'll
call the objects of class HTML::Element) have children -- the &quot;hr&quot;
element doesn't.  And not all nodes in the tree are elements -- the
text nodes (&quot;Doc 1&quot;, &quot;Stuff&quot;, and &quot;2000-08-17&quot;) are just strings.</p>
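<p>As a rough illustration of that last point, here is a minimal sketch
that parses the sample source from a string and tells elements apart
from text nodes with <code>ref</code>.  It assumes HTML::TreeBuilder is
installed; the variable names <code>$source</code> and <code>$body</code> are only
illustrative, and feeding a string through <code>parse</code> and <code>eof</code>
(rather than <code>parse_file</code>) is a choice made here so that the example
is self-contained.</p>
<pre>
  use strict;
  use warnings;
  use HTML::TreeBuilder;

  # The same sample document as above, held in a string.
  my $source = &lt;&lt;'END_HTML';
  &lt;html&gt;&lt;head&gt;&lt;title&gt;Doc 1&lt;/title&gt;&lt;/head&gt;
  &lt;body&gt;
  Stuff &lt;hr&gt; 2000-08-17
  &lt;/body&gt;&lt;/html&gt;
  END_HTML

  my $tree = HTML::TreeBuilder-&gt;new();
  $tree-&gt;parse($source);   # feed the string to the parser
  $tree-&gt;eof();            # signal that there is no more input

  # Pick the "body" element out of the root's children.
  my ($body) = grep { ref $_ and $_-&gt;tag eq 'body' } $tree-&gt;content_list;

  # Elements are objects (so ref is true); text nodes are plain strings.
  for my $node ($body-&gt;content_list) {
      if (ref $node) {
          print "element: ", $node-&gt;tag, "\n";
      } else {
          print "text:    '$node'\n";
      }
  }
</pre>
<p>Calling <code>$tree-&gt;dump</code> instead of the loop prints an indented
outline of the whole tree, which is handy for checking what the parser
actually built.</p>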
<p>Objects of the class HTML::Element each have three noteworthy attributes:</p>
<dl>
<dt><strong><a name="item_lowercased"><code>_tag</code> -- (best accessed as <code>$e-&gt;tag</code>)
this element's tag-name, lowercased (e.g., &quot;em&quot; for an &quot;em&quot; element).</a></strong>

<p>Footnote: Yes, this is misnamed.  In proper SGML terminology, this is
instead called a &quot;GI&quot;, short for &quot;generic identifier&quot;; and the term
&quot;tag&quot; is used for a token of SGML source that represents either
the start of an element (a start-tag like &quot;&lt;em lang='fr'&gt;&quot;) or the end
of an element (an end-tag like &quot;&lt;/em&gt;&quot;).  However, since more people
claim to have been abducted by aliens than to have ever seen the
SGML standard, and since both encounters typically involve a feeling of
&quot;missing time&quot;, it's not surprising that the terminology of the SGML
standard is not closely followed.</p>
<dt><strong><a name="item_parent"><code>_parent</code> -- (best accessed as <code>$e-&gt;parent</code>)
the element that is <code>$e</code>'s parent, or undef if this element is the
root of its tree.</a></strong>

<dt><strong><a name="item_nodes"><code>_content</code> -- (best accessed as <code>$e-&gt;content_list</code>)
the list of nodes (i.e., elements or text segments) that are <code>$e</code>'s
children.</a></strong>

</dl>
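<p>The three accessors listed above can be tried out with a short sketch
like the following.  It assumes HTML::TreeBuilder is installed and
reuses the small sample document from earlier; the names <code>$head</code>,
<code>$title</code>, and <code>@children</code> are hypothetical, chosen just for
this example.</p>
<pre>
  use strict;
  use warnings;
  use HTML::TreeBuilder;

  my $tree = HTML::TreeBuilder-&gt;new();
  $tree-&gt;parse(
      '&lt;html&gt;&lt;head&gt;&lt;title&gt;Doc 1&lt;/title&gt;&lt;/head&gt;'
    . '&lt;body&gt;Stuff &lt;hr&gt; 2000-08-17&lt;/body&gt;&lt;/html&gt;'
  );
  $tree-&gt;eof();

  # Step down from the root to "head", then to its first child, "title".
  my ($head)  = grep { ref $_ and $_-&gt;tag eq 'head' } $tree-&gt;content_list;
  my ($title) = $head-&gt;content_list;

  # tag: this element's lowercased tag name.
  print "tag:      ", $title-&gt;tag, "\n";

  # parent: the enclosing element, or undef at the root.
  print "parent:   ", $title-&gt;parent-&gt;tag, "\n";
  my $where = defined($tree-&gt;parent) ? 'has a parent' : 'is the root';
  print "\$tree $where\n";

  # content_list: child nodes, elements and text segments alike.
  my @children = $title-&gt;content_list;
  print "children: ", scalar(@children), "\n";
</pre>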
<p>Moreover, if an element object has any attributes in the SGML sense of
the word, then those are readable as <code>$e-&gt;attr('name')</code> -- for
example, with the object built from having parsed &quot;&lt;a
<strong>id='foo'</strong>&gt;bar&lt;/a&gt;&quot;, <code>$e-&gt;attr('id')</code> will return
the string &quot;foo&quot;.  Moreover, <code>$e-&gt;tag</code> on that object returns the
string &quot;a&quot;, <code>$e-&gt;content_list</code> returns a list consisting of just
