<?xml version="1.0" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<!-- saved from url=(0017)http://localhost/ -->
<script language="JavaScript" src="../../../displayToc.js"></script>
<script language="JavaScript" src="../../../tocParas.js"></script>
<script language="JavaScript" src="../../../tocTab.js"></script>
<link rel="stylesheet" type="text/css" href="../../../scineplex.css">
<title>HTML::Tree::Scanning -- article: "Scanning HTML"</title>
<link rel="stylesheet" href="../../../Active.css" type="text/css" />
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>
<body>
<script>writelinks('__top__',3);</script>
<h1><a>HTML::Tree::Scanning -- article: "Scanning HTML"</a></h1>
<p><a name="__index__"></a></p>
<!-- INDEX BEGIN -->
<ul>
<li><a href="#name">NAME</a></li>
<li><a href="#synopsis">SYNOPSIS</a></li>
<li><a href="#description">DESCRIPTION</a></li>
<li><a href="#scanning_html">Scanning HTML</a></li>
<ul>
<li><a href="#html__parser__html__treebuilder__and_html__element">HTML::Parser, HTML::TreeBuilder, and HTML::Element</a></li>
<li><a href="#scanning_html_trees">Scanning HTML trees</a></li>
<li><a href="#complex_criteria_in_tree_scanning">Complex Criteria in Tree Scanning</a></li>
<li><a href="#a_case_study__scanning_yahoo_news_s_html">A Case Study: Scanning Yahoo News's HTML</a></li>
<li><a href="#regardez__duvet_"><em>Regardez, duvet!</em></a></li>
<li><a href="#_author_credit_">[Author Credit]</a></li>
</ul>
<li><a href="#back">BACK</a></li>
</ul>
<!-- INDEX END -->
<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>HTML::Tree::Scanning -- article: "Scanning HTML"</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
<span class="comment"># This is an article, not a module.</span>
</pre>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>The following article by Sean M. Burke first appeared in <em>The Perl
Journal</em> #19 and is copyright 2000 The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal. This document may be
distributed under the same terms as Perl itself.</p>
<p>
</p>
<hr />
<h1><a name="scanning_html">Scanning HTML</a></h1>
<p>-- Sean M. Burke</p>
<p>In <em>The Perl Journal</em> issue 17, Ken MacFarlane's article "Parsing
HTML with HTML::Parser" describes how the HTML::Parser module scans
HTML source as a stream of start-tags, end-tags, text, comments, etc.
In TPJ #18, my "Trees" article kicked around the idea of tree-shaped
data structures. Now I'll try to tie it together, in a discussion of
HTML trees.</p>
<p>The CPAN module HTML::TreeBuilder takes the
tags that HTML::Parser picks out, and builds a parse tree -- a
tree-shaped network of objects...</p>
<p>Footnote:
And if you need a quick explanation of objects, see my TPJ17 article "A
User's View of Object-Oriented Modules"; or go whole hog and get Damian
Conway's excellent book <em>Object-Oriented Perl</em>, from Manning
Publications.</p>
<p>...representing the structured content of the HTML document. And once
the document is parsed as a tree, you'll find the common tasks
of extracting data from that HTML document/tree to be quite
straightforward.</p>
<p>
</p>
<h2><a name="html__parser__html__treebuilder__and_html__element">HTML::Parser, HTML::TreeBuilder, and HTML::Element</a></h2>
<p>You use HTML::TreeBuilder to make a parse tree out of an HTML source
file, by simply saying:</p>
<pre>
<span class="keyword">use</span> <span class="variable">HTML::TreeBuilder</span><span class="operator">;</span>
<span class="keyword">my</span> <span class="variable">$tree</span> <span class="operator">=</span> <span class="variable">HTML::TreeBuilder</span><span class="operator">-></span><span class="variable">new</span><span class="operator">();</span>
<span class="variable">$tree</span><span class="operator">-></span><span class="variable">parse_file</span><span class="operator">(</span><span class="string">'foo.html'</span><span class="operator">);</span>
</pre>
<p>and then <code>$tree</code> contains a parse tree built from the HTML source from
the file "foo.html". The way this parse tree is represented is with a
network of objects -- <code>$tree</code> is the root, an element with tag-name
"html", and its children typically include a "head" and "body" element,
and so on. Elements in the tree are objects of the class
HTML::Element.</p>
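<p>As a minimal sketch of the same idea, assuming HTML::TreeBuilder is
installed from CPAN, the snippet below parses a small in-memory string (a
made-up sample, not taken from the article) with <code>parse</code> and
<code>eof</code> instead of reading a file, and then checks that the root
is an "html" element whose class derives from HTML::Element:</p>

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module; assumed to be installed

# A made-up sample document, held in a string instead of a file.
my $html = '<html><head><title>Doc 1</title></head><body>Stuff</body></html>';

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);   # feed the source text to the parser
$tree->eof;            # signal end of input so the tree is completed

# Capture a few facts about the root before the tree is torn down.
my $root_tag   = $tree->tag;
my $is_element = $tree->isa('HTML::Element') ? 1 : 0;
my @child_tags = map { ref($_) ? $_->tag : $_ } $tree->content_list;

print "root tag: $root_tag\n";
print "root isa HTML::Element: $is_element\n";
print "children of root: @child_tags\n";

$tree->delete;   # release the tree's parent/child links
```

<p>Calling <code>$tree->delete</code> when you are done is a common habit
with this module, since parent and child nodes refer to each other.</p>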
<p>So, if you take this source:</p>
<pre>
&lt;html&gt;&lt;head&gt;&lt;title&gt;Doc 1&lt;/title&gt;&lt;/head&gt;
&lt;body&gt;
Stuff &lt;hr&gt; 2000-08-17
&lt;/body&gt;&lt;/html&gt;</pre>
<p>and feed it to HTML::TreeBuilder, it'll return a tree of objects that
looks like this:</p>
<pre>
                     html
                   /      \
               head        body
              /          /   |  \
           title    "Stuff"  hr  "2000-08-17"
             |
          "Doc 1"</pre>
<p>This is a pretty simple document, but if it were any more complex,
it'd be a bit hard to draw in that style, since it'd sprawl left and
right. The same tree can be represented a bit more easily sideways,
with indenting:</p>
<pre>
    . html
       . head
          . title
             . "Doc 1"
       . body
          . "Stuff"
          . hr
          . "2000-08-17"</pre>
<p>Either way expresses the same structure. In that structure, the root
node is an object of the class HTML::Element</p>
<p>Footnote:
Well actually, the root is of the class HTML::TreeBuilder, but that's
just a subclass of HTML::Element, plus the few extra methods like
<code>parse_file</code> that elaborate the tree.</p>
<p>, with the tag name "html", and with two children: HTML::Element
objects whose tag names are "head" and "body". And each of those
elements has children, and so on down. Not all elements (as we'll
call the objects of class HTML::Element) have children -- the "hr"
element doesn't. And not all nodes in the tree are elements -- the
text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.</p>
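<p>The distinction between element nodes and plain-string text nodes can
be sketched as a short recursive walk. This is only an illustration,
assuming HTML::TreeBuilder is installed; the <code>outline</code> helper is
a hypothetical name, and the whitespace trimming is there purely to keep
the display tidy:</p>

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module; assumed to be installed

my $html = <<'END_HTML';
<html><head><title>Doc 1</title></head>
<body>
Stuff <hr> 2000-08-17
</body></html>
END_HTML

# Hypothetical helper: collect one indented outline line per node.
sub outline {
    my ($node, $depth, $lines) = @_;
    my $indent = '   ' x $depth;
    if (ref $node) {
        # An element object: record its tag name, then visit its children.
        push @$lines, "$indent. " . $node->tag;
        outline($_, $depth + 1, $lines) for $node->content_list;
    }
    else {
        # A text node is just a string; trim surrounding whitespace for display.
        (my $text = $node) =~ s/^\s+|\s+$//g;
        push @$lines, qq{$indent. "$text"} if length $text;
    }
    return $lines;
}

my $tree = HTML::TreeBuilder->new();
$tree->parse($html);
$tree->eof;

my $lines = outline($tree, 0, []);
print "$_\n" for @$lines;

$tree->delete;   # release the tree's parent/child links
```

<p>The <code>ref $node</code> test is what separates the two kinds of node:
it is true for an HTML::Element object and false for a text string.</p>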
<p>Objects of the class HTML::Element each have three noteworthy attributes:</p>
<dl>
<dt><strong><a name="item_lowercased"><code>_tag</code> -- (best accessed as <code>$e->tag</code>)
this element's tag-name, lowercased (e.g., "em" for an "em" element).</a></strong>
<p>Footnote: Yes, this is misnamed. In proper SGML terminology, this is
instead called a "GI", short for "generic identifier"; and the term
"tag" is used for a token of SGML source that represents either
the start of an element (a start-tag like "&lt;em lang='fr'&gt;") or the end
of an element (an end-tag like "&lt;/em&gt;"). However, since more people
claim to have been abducted by aliens than to have ever seen the
SGML standard, and since both encounters typically involve a feeling of
"missing time", it's not surprising that the terminology of the SGML
standard is not closely followed.</p>
<dt><strong><a name="item_parent"><code>_parent</code> -- (best accessed as <code>$e->parent</code>)
the element that is <code>$e</code>'s parent, or undef if this element is the
root of its tree.</a></strong>
<dt><strong><a name="item_nodes"><code>_content</code> -- (best accessed as <code>$e->content_list</code>)
the list of nodes (i.e., elements or text segments) that are <code>$e</code>'s
children.</a></strong>
</dl>
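<p>To make the three accessors concrete, here is a small sketch, again
assuming HTML::TreeBuilder is installed and using a made-up one-line
document. It looks up the "body" element among the root's children and
then reads its tag name, its parent, and its content list:</p>

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module; assumed to be installed

my $tree = HTML::TreeBuilder->new();
$tree->parse('<html><head><title>Doc 1</title></head><body>Stuff<hr></body></html>');
$tree->eof;

# The root of the tree has no parent, so this is undef.
my $root_parent = $tree->parent;

# Pick the "body" element out of the root's children.
my ($body) = grep { ref($_) && $_->tag eq 'body' } $tree->content_list;

my $body_tag        = $body->tag;            # lowercased tag name
my $body_parent_tag = $body->parent->tag;    # the parent is the root element
my @body_children   = $body->content_list;   # elements and text strings

print "root parent: ", (defined $root_parent ? $root_parent->tag : 'undef'), "\n";
print "body tag: $body_tag, parent tag: $body_parent_tag\n";
for my $child (@body_children) {
    # Children are either element objects or plain text strings.
    print "  child: ", (ref $child ? 'element ' . $child->tag : qq{text "$child"}), "\n";
}

# Remember the child count before the tree is torn down.
my $child_count = scalar @body_children;

$tree->delete;   # release the tree's parent/child links
```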
<p>Moreover, if an element object has any attributes in the SGML sense of
the word, then those are readable as <code>$e->attr('name')</code> -- for
example, with the object built from having parsed "&lt;a
<strong>id='foo'</strong>&gt;bar&lt;/a&gt;", <code>$e->attr('id')</code> will return
the string "foo". Moreover, <code>$e->tag</code> on that object returns the
string "a", <code>$e->content_list</code> returns a list consisting of just