⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 module-htmllib.html

📁 一本很好的python的说明书,适合对python感兴趣的人
💻 HTML
字号:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>13.2 htmllib -- A parser for HTML documents</title>
<META NAME="description" CONTENT="13.2 htmllib -- A parser for HTML documents">
<META NAME="keywords" CONTENT="lib">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="STYLESHEET" href="lib.css" tppabs="http://www.python.org/doc/current/lib/lib.css">
<LINK REL="next" href="module-htmlentitydefs.html" tppabs="http://www.python.org/doc/current/lib/module-htmlentitydefs.html">
<LINK REL="previous" href="module-sgmllib.html" tppabs="http://www.python.org/doc/current/lib/module-sgmllib.html">
<LINK REL="up" href="markup.html" tppabs="http://www.python.org/doc/current/lib/markup.html">
<LINK REL="next" href="html-parser-objects.html" tppabs="http://www.python.org/doc/current/lib/html-parser-objects.html">
</head>
<body>
<DIV CLASS="navigation"><table align="center" width="100%" cellpadding="0" cellspacing="2">
<tr>
<td><A href="module-sgmllib.html" tppabs="http://www.python.org/doc/current/lib/module-sgmllib.html"><img src="previous.gif" tppabs="http://www.python.org/doc/current/icons/previous.gif" border="0" height="32"
  alt="Previous Page" width="32"></A></td>
<td><A href="markup.html" tppabs="http://www.python.org/doc/current/lib/markup.html"><img src="up.gif" tppabs="http://www.python.org/doc/current/icons/up.gif" border="0" height="32"
  alt="Up One Level" width="32"></A></td>
<td><A href="html-parser-objects.html" tppabs="http://www.python.org/doc/current/lib/html-parser-objects.html"><img src="next.gif" tppabs="http://www.python.org/doc/current/icons/next.gif" border="0" height="32"
  alt="Next Page" width="32"></A></td>
<td align="center" width="100%">Python Library Reference</td>
<td><A href="contents.html" tppabs="http://www.python.org/doc/current/lib/contents.html"><img src="contents.gif" tppabs="http://www.python.org/doc/current/icons/contents.gif" border="0" height="32"
  alt="Contents" width="32"></A></td>
<td><a href="modindex.html" tppabs="http://www.python.org/doc/current/lib/modindex.html" title="Module Index"><img src="modules.gif" tppabs="http://www.python.org/doc/current/icons/modules.gif" border="0" height="32"
  alt="Module Index" width="32"></a></td>
<td><A href="genindex.html" tppabs="http://www.python.org/doc/current/lib/genindex.html"><img src="index.gif" tppabs="http://www.python.org/doc/current/icons/index.gif" border="0" height="32"
  alt="Index" width="32"></A></td>
</tr></table>
<b class="navlabel">Previous:</b> <a class="sectref" href="module-sgmllib.html" tppabs="http://www.python.org/doc/current/lib/module-sgmllib.html">13.1 sgmllib  </A>
<b class="navlabel">Up:</b> <a class="sectref" href="markup.html" tppabs="http://www.python.org/doc/current/lib/markup.html">13. Structured Markup Processing</A>
<b class="navlabel">Next:</b> <a class="sectref" href="html-parser-objects.html" tppabs="http://www.python.org/doc/current/lib/html-parser-objects.html">13.2.1 HTMLParser Objects</A>
<br><hr></DIV>
<!--End of Navigation Panel-->

<H1><A NAME="SECTION0015200000000000000000">
13.2 <tt class="module">htmllib</tt> --
         A parser for HTML documents</A>
</H1>

<P>


<P>


<P>
This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Mark-up Language (HTML).  The class
is not directly concerned with I/O -- it must be provided with input
in string form via a method, and makes calls to methods of a
``formatter'' object in order to produce output.  The
<tt class="class">HTMLParser</tt> class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden.  In turn, this class is derived
from and extends the <tt class="class">SGMLParser</tt> class defined in module
<tt class='module'><a href="module-sgmllib.html" tppabs="http://www.python.org/doc/current/lib/module-sgmllib.html">sgmllib</a></tt>.  The <tt class="class">HTMLParser</tt>
implementation supports the HTML 2.0 language as described in
<a class="rfc" name="rfcref-47703"
href="javascript:if(confirm('http://www.ietf.org/rfc/rfc1866.txt  \n\nThis file was not retrieved by Teleport Pro, because it is addressed on a domain or path outside the boundaries set for its Starting Address.  \n\nDo you want to open it from the server?'))window.location='http://www.ietf.org/rfc/rfc1866.txt'" tppabs="http://www.ietf.org/rfc/rfc1866.txt">RFC 1866 <img src="offsite.gif" tppabs="http://www.python.org/doc/current/icons/offsite.gif"
  border='0' class='offsitelink' height='15' width='17' alt='[off-site link]'
  ></a>.  Two implementations of formatter objects are provided in
the <tt class='module'><a href="module-formatter.html" tppabs="http://www.python.org/doc/current/lib/module-formatter.html">formatter</a></tt> module; refer to the
documentation for that module for information on the formatter
interface.

<P>
The following is a summary of the interface defined by
<tt class="class">sgmllib.SGMLParser</tt>:

<P>

<UL>
<LI>The interface to feed data to an instance is through the <tt class="method">feed()</tt>
method, which takes a string argument.  This can be called with as
little or as much text at a time as desired; "<tt class="samp">p.feed(a);
p.feed(b)</tt>" has the same effect as "<tt class="samp">p.feed(a+b)</tt>".  When the data
contains complete HTML tags, these are processed immediately;
incomplete elements are saved in a buffer.  To force processing of all
unprocessed data, call the <tt class="method">close()</tt> method.

<P>
For example, to parse the entire contents of a file, use:
<dl><dd><pre class="verbatim">
parser.feed(open('myfile.html').read())
parser.close()
</pre></dl>

<P>
</LI>
<LI>The interface to define semantics for HTML tags is very simple: derive
a class and define methods called <tt class="method">start_<var>tag</var>()</tt>,
<tt class="method">end_<var>tag</var>()</tt>, or <tt class="method">do_<var>tag</var>()</tt>.  The parser will
call these at appropriate moments: <tt class="method">start_<var>tag</var></tt> or
<tt class="method">do_<var>tag</var>()</tt> is called when an opening tag of the form
<code>&lt;<var>tag</var> ...&gt;</code> is encountered; <tt class="method">end_<var>tag</var>()</tt> is called
when a closing tag of the form <code>&lt;<var>tag</var>&gt;</code> is encountered.  If
an opening tag requires a corresponding closing tag, like <code>&lt;H1&gt;</code>
... <code>&lt;/H1&gt;</code>, the class should define the <tt class="method">start_<var>tag</var>()</tt>
method; if a tag requires no closing tag, like <code>&lt;P&gt;</code>, the class
should define the <tt class="method">do_<var>tag</var>()</tt> method.

<P>
</LI>
</UL>

<P>
The module defines a single class:

<P>
<dl><dt><b><a name='l2h-2580'><tt class='class'>HTMLParser</tt></a></b> (<var>formatter</var>)
<dd>
This is the basic HTML parser class.  It supports all entity names
required by the HTML 2.0 specification (<a class="rfc" name="rfcref-47706"
href="javascript:if(confirm('http://www.ietf.org/rfc/rfc1866.txt  \n\nThis file was not retrieved by Teleport Pro, because it is addressed on a domain or path outside the boundaries set for its Starting Address.  \n\nDo you want to open it from the server?'))window.location='http://www.ietf.org/rfc/rfc1866.txt'" tppabs="http://www.ietf.org/rfc/rfc1866.txt">RFC 1866 <img src="offsite.gif" tppabs="http://www.python.org/doc/current/icons/offsite.gif"
  border='0' class='offsitelink' height='15' width='17' alt='[off-site link]'
  ></a>).  It also defines
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
</dl>

<P>
<div class='seealso'>
  <p class='heading'><b>See Also:</b></p>

  <dl compact class="seemodule">
    <dt>Module <b><tt class='module'><a href="module-htmlentitydefs.html" tppabs="http://www.python.org/doc/current/lib/module-htmlentitydefs.html">htmlentitydefs</a></tt>:</b>
    <dd>Definition of replacement text for HTML
                             2.0 entities.
  </dl>
  <dl compact class="seemodule">
    <dt>Module <b><tt class='module'><a href="module-sgmllib.html" tppabs="http://www.python.org/doc/current/lib/module-sgmllib.html">sgmllib</a></tt>:</b>
    <dd>Base class for <tt class="class">HTMLParser</tt>.
  </dl>
</div>

<P>

<p><hr>
<!--Table of Child-Links-->
<A NAME="CHILD_LINKS"><STRONG>Subsections</STRONG></A>

<UL>
<LI><A NAME="tex2html4789"
  href="html-parser-objects.html" tppabs="http://www.python.org/doc/current/lib/html-parser-objects.html">13.2.1 HTMLParser Objects </A>
</UL>
<!--End of Table of Child-Links-->

<DIV CLASS="navigation"><p><hr><table align="center" width="100%" cellpadding="0" cellspacing="2">
<tr>
<td><A href="module-sgmllib.html" tppabs="http://www.python.org/doc/current/lib/module-sgmllib.html"><img src="previous.gif" tppabs="http://www.python.org/doc/current/icons/previous.gif" border="0" height="32"
  alt="Previous Page" width="32"></A></td>
<td><A href="markup.html" tppabs="http://www.python.org/doc/current/lib/markup.html"><img src="up.gif" tppabs="http://www.python.org/doc/current/icons/up.gif" border="0" height="32"
  alt="Up One Level" width="32"></A></td>
<td><A href="html-parser-objects.html" tppabs="http://www.python.org/doc/current/lib/html-parser-objects.html"><img src="next.gif" tppabs="http://www.python.org/doc/current/icons/next.gif" border="0" height="32"
  alt="Next Page" width="32"></A></td>
<td align="center" width="100%">Python Library Reference</td>
<td><A href="contents.html" tppabs="http://www.python.org/doc/current/lib/contents.html"><img src="contents.gif" tppabs="http://www.python.org/doc/current/icons/contents.gif" border="0" height="32"
  alt="Contents" width="32"></A></td>
<td><a href="modindex.html" tppabs="http://www.python.org/doc/current/lib/modindex.html" title="Module Index"><img src="modules.gif" tppabs="http://www.python.org/doc/current/icons/modules.gif" border="0" height="32"
  alt="Module Index" width="32"></a></td>
<td><A href="genindex.html" tppabs="http://www.python.org/doc/current/lib/genindex.html"><img src="index.gif" tppabs="http://www.python.org/doc/current/icons/index.gif" border="0" height="32"
  alt="Index" width="32"></A></td>
</tr></table>
<b class="navlabel">Previous:</b> <a class="sectref" href="module-sgmllib.html" tppabs="http://www.python.org/doc/current/lib/module-sgmllib.html">13.1 sgmllib  </A>
<b class="navlabel">Up:</b> <a class="sectref" href="markup.html" tppabs="http://www.python.org/doc/current/lib/markup.html">13. Structured Markup Processing</A>
<b class="navlabel">Next:</b> <a class="sectref" href="html-parser-objects.html" tppabs="http://www.python.org/doc/current/lib/html-parser-objects.html">13.2.1 HTMLParser Objects</A>
</DIV>
<!--End of Navigation Panel-->
<ADDRESS>
<hr>See <i><a href="about.html" tppabs="http://www.python.org/doc/current/lib/about.html">About this document...</a></i> for information on suggesting changes.
</ADDRESS>
</BODY>
</HTML>

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -