<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Robots" content="INDEX,NOFOLLOW">
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<TITLE>Safari | Python Developer's Handbook -> XML Processing</TITLE>
<LINK REL="stylesheet" HREF="oreillyi/oreillyN.css">
</HEAD>
<BODY bgcolor="white" text="black" link="#990000" vlink="#990000" alink="#990000" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0">
<TABLE width=100% bgcolor=white border=0 cellspacing=0 cellpadding=5><TR><TD>
<FONT><h3>
XML Processing</h3>
<p>The first standard that you will learn how to manipulate in Python is
XML.</p>
<P>The Web already has a standard for defining markup languages like
HTML, which is called
<A NAme="idx1073747053"></a>
<a NAME="idx1073747054"></a>
<a naME="idx1073747055"></A>SGML. HTML is actually defined in SGML. SGML could have been
used as this new standard, and browsers could have been extended with SGML
parsers. However, SGML is quite complex to implement and contains a lot of
features that are very rarely used.</P>
<p>SGML is much more than a Web standard because it was around long
before the Web. HTML is an application of SGML, and XML is a subset of it.</p>
<p>SGML also lacks character set support, and it is difficult to
interpret an SGML document without having the definition of the markup language
(the DTD, or Document Type Definition) available.</p>
<p>Consequently, it was decided to develop a simplified version of SGML,
which was called XML. The main point of XML is that you, by defining your own
markup language, can encode the information of your documents more precisely
than is possible with HTML. This means that programs processing these documents
can "understand" them much better and therefore process the
information in ways that are impossible with HTML (or ordinary word processor
documents).</p>
<h4>Introduction to XML</h4>
<p>The <i>Extensible Markup Language (XML)</i> is a
subset of SGML. Its goal is to enable generic SGML to be served, received, and
processed on the Web in the way that is now possible with HTML. XML has been
designed for ease of implementation and for interoperability with both SGML and
HTML.
<a name="idx1073747056"></a>
<a naMe="idx1073747057"></a>
<A namE="idx1073747058"></a>
<a naMe="idx1073747059"></a></p>
<P>XML describes a class of data objects called XML documents and
partially describes the behavior of computer programs that process them. XML is
an application profile or restricted form of <I>SGML, </I>the
<I>Standard Generalized Markup Language</i> (ISO 8879). By
construction, XML documents are conforming SGML documents. An XML parser can
check whether an XML document is well-formed without the aid of a DTD.
<a naME="idx1073747060"></A>
<A name="idx1073747061"></A>
<A NAme="idx1073747062"></a></p>
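<p>As a minimal sketch of that last point, the following snippet checks well-formedness with the standard library's <tt class="monofont">xml.parsers.expat</tt> module and never consults a DTD. The helper name <tt class="monofont">is_well_formed</tt> and the two sample strings are assumptions made for illustration.</p>

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if text parses as well-formed XML, otherwise False."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Illustrative inputs: the second one never closes its NAME element.
good = "<SURVEY><CLIENT><NAME>Lessaworld Corp.</NAME></CLIENT></SURVEY>"
bad = "<SURVEY><CLIENT><NAME>Lessaworld Corp.</CLIENT></SURVEY>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

<p>Well-formedness only covers the syntax rules of XML itself (matching tags, a single root element, and so on). Whether the elements are the right ones, in the right order, is a separate question of validity.</p>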
<P>XML documents are made up of storage units called
<A NAme="idx1073747063"></a><i>entities</i>, which contain either parsed
or unparsed data. Parsed data is made up of
characters, some of which form character data, and some of which form markup.
The logical structure of a document is expressed as <i>elements</i>, which are
delimited by tags. Markup encodes a description of the document's storage layout and
logical structure. XML provides a mechanism to impose constraints on the
storage layout and logical structure.</p>
<p>A software module called an
<a name="idx1073747064"></a>
<a name="idx1073747065"></a>XML parser is used to read XML documents and provide access
to their content and structure. It is assumed that an XML parser is doing its
work on behalf of another module, called the application. The XML specification
describes the required behavior of an XML parser in terms of how it must read
XML data and the information it must provide to the application. For more
information, see</p>
<BloCkquOte>
<p>Extensible Markup Language (XML) Recommendation</p>
<p>W3C Recommendation: Extensible Markup Language (XML) 1.0</p>
<p><a tarGET="_blank" Href="http://www.w3.org/TR/REC-xml.html">http://www.w3.org/TR/REC-xml.html</a></p>
</BLockqUOTE>
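<p>The parser/application split described above can be sketched with the standard library's <tt class="monofont">xml.sax</tt> module: the parser reads the document and reports what it finds to a handler object that plays the role of the application. The class name <tt class="monofont">OutlineHandler</tt> and the small sample document are assumptions made for illustration.</p>

```python
import xml.sax


class OutlineHandler(xml.sax.ContentHandler):
    """Application-side object: records what the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values.
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # Ignore whitespace-only chunks; the parser may deliver text in pieces.
        if content.strip():
            self.events.append(("text", content.strip()))

    def endElement(self, name):
        self.events.append(("end", name))


# Illustrative input, not a complete survey document.
document = (
    b'<SECTION SECTION_ID="1">'
    b"<QUESTION_DESC>What is your favorite language?</QUESTION_DESC>"
    b"</SECTION>"
)

handler = OutlineHandler()
xml.sax.parseString(document, handler)
for event in handler.events:
    print(event)
```

<p>The handler never touches the raw text of the document. It sees only the structure and content that the parser chooses to hand over.</p>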
<h4>
Writing an XML File</h4>
<p>Defining your own markup language with XML is simple. The following block of code is the content of a file called <tt class="monofont">survey.xml</tt>, which defines a specific markup language for a given survey.</p>
<pre>
<!DOCTYPE SURVEY SYSTEM "SURVEY.DTD">
<SURVEY>
  <CLIENT>
    <NAME> Lessaworld Corp. </NAME>
    <LOCATION> Pittsburgh, PA </LOCATION>
    <CONTACT> Andre Lessa </CONTACT>
    <EMAIL> webmaster@lessaworld.com </EMAIL>
    <TELEPHONE> (412)555-5555 </TELEPHONE>
  </CLIENT>
  <SECTION SECTION_ID="1">
    <QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
      <QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
      <Op1>Python</Op1>
      <Op2>Perl</Op2>
    </QUESTION>
    <QUESTION QUESTION_ID="2" QUESTION_LEVEL="1">
      <QUESTION_DESC>Do you use this language at work?</QUESTION_DESC>
      <Op1>Yes</Op1>
      <Op2>No</Op2>
    </QUESTION>
    <QUESTION QUESTION_ID="3" QUESTION_LEVEL="1">
      <QUESTION_DESC>Did you expect the Spanish inquisition?</QUESTION_DESC>
      <Op1>No</Op1>
      <Op2>Of course not</Op2>
    </QUESTION>
  </SECTION>
</SURVEY>
</pre>
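<p>To give a feel for how a program might read this markup, here is a minimal sketch using the standard library's <tt class="monofont">xml.etree.ElementTree</tt> module. It embeds a shortened copy of the survey as a string rather than reading <tt class="monofont">survey.xml</tt> from disk; the variable names are assumptions made for illustration. This parser does not load the external DTD named in the first line, so the file <tt class="monofont">SURVEY.DTD</tt> does not need to exist for the sketch to run.</p>

```python
import xml.etree.ElementTree as ET

# Shortened copy of the survey document, embedded for illustration.
SURVEY_XML = """<!DOCTYPE SURVEY SYSTEM "SURVEY.DTD">
<SURVEY>
<CLIENT>
<NAME> Lessaworld Corp. </NAME>
<LOCATION> Pittsburgh, PA </LOCATION>
</CLIENT>
<SECTION SECTION_ID="1">
<QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
<QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
<Op1>Python</Op1>
<Op2>Perl</Op2>
</QUESTION>
<QUESTION QUESTION_ID="2" QUESTION_LEVEL="1">
<QUESTION_DESC>Do you use this language at work?</QUESTION_DESC>
<Op1>Yes</Op1>
<Op2>No</Op2>
</QUESTION>
</SECTION>
</SURVEY>"""

root = ET.fromstring(SURVEY_XML)

# Element text keeps the surrounding spaces from the file, so strip it.
client_name = root.findtext("CLIENT/NAME").strip()

# Collect (id, description, option 1, option 2) for every QUESTION element.
questions = [
    (q.get("QUESTION_ID"), q.findtext("QUESTION_DESC"),
     q.findtext("Op1"), q.findtext("Op2"))
    for q in root.iter("QUESTION")
]

print(client_name)
for question in questions:
    print(question)
```

<p>Because the tags carry meaning (<tt class="monofont">QUESTION_DESC</tt>, <tt class="monofont">Op1</tt>, and so on), the program can ask for exactly the pieces it needs instead of guessing from layout.</p>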
<p>To complement the XML document shown previously, we
need a
<A naMe="idx1073747066"></a>
<a Name="idx1073747067"></a><I>Document Type Definition (DTD)</i>, such as
the following one. The DTD can be part of the XML file, or it can be
stored as an independent file, as we are doing here. Note the first line of the
XML file, which names the DTD file
(<tt CLASs="monofont">survey.dtd</tt>). Also, the XML community appears to be
moving toward XML Schemas rather than DTDs.</p>
<PRE>
<!ELEMENT SURVEY (CLIENT, SECTION+)>
<!ELEMENT CLIENT (NAME, LOCATION, CONTACT?, EMAIL?, TELEPHONE?)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT LOCATION (#PCDATA)>
<!ELEMENT CONTACT (#PCDATA)>
<!ELEMENT EMAIL (#PCDATA)>
<!ELEMENT TELEPHONE (#PCDATA)>
<!ELEMENT SECTION (QUESTION+)>
<!ELEMENT QUESTION (QUESTION_DESC, Op1, Op2)>
<!ELEMENT QUESTION_DESC (#PCDATA)>
<!ELEMENT Op1 (#PCDATA)>
<!ELEMENT Op2 (#PCDATA)>
<!ATTLIST SECTION SECTION_ID CDATA #IMPLIED>
<!ATTLIST QUESTION QUESTION_ID CDATA #IMPLIED
QUESTION_LEVEL CDATA #IMPLIED>
</Pre>
<p>Now, let's look at how a DTD works. For a simple example like
this one, we need two kinds of declarations: <tT CLAss="monofont"><!ELEMENT></tt>
and <TT CLass="monofont"><!ATTLIST></tt>.
<a name="idx1073747068"></a>
<a name="idx1073747069"></a>
<a naMe="idx1073747070"></a>
<A namE="idx1073747071"></a>
<a naMe="idx1073747072"></a>
<a NAME="idx1073747073"></a>
<a naME="idx1073747074"></A>
<A name="idx1073747075"></A>
<A NAme="idx1073747076"></a></p>
<P>The
<A NAme="idx1073747077"></a>
<a name="idx1073747078"></a><tt class="monofont"><!ELEMENT></tt> declaration is used
to define the elements that can appear in the XML file. The general syntax is</p>
<pRe>
<!ELEMENT NAME CONTENTS>
</pRe>
<p>The first argument
<a Name="idx1073747079"></a>
<A namE="idx1073747080"></A>(<TT clasS="monofont">NAME</TT>) gives the name of the element,
and the second one
<A name="idx1073747081"></A>
<A NAme="idx1073747082"></a>(<tT CLAss="monofont">CONTENTS</tt>) lists the element names that
are allowed to be underneath the element that we are defining.</p>
<p>The ordering that we use to list the contents is important. When we
say, for example,</p>
<pre>
<!ELEMENT SURVEY (CLIENT, SECTION+)>
</pre>
<p>it means that we must have a <tt clasS="monofont">CLIENT</tt> first,
followed by a <Tt clAss="monofont">SECTION</tt>. Note the special character
(the plus sign) just after the second element in the content list. This
character, like a few others, has a special meaning:</P>
<ul>
<lI>
<P>A
<A Name="idx1073747083"></a>
<A NAMe="idx1073747084"></a><tt CLASs="monofont">+</tt> sign after an element means that
it must appear one or more times.</p>
</LI>
<LI>
<p>A
<a name="idx1073747085"></a>
<a name="idx1073747086"></a><tt class="monofont">?</Tt> sign indicates that the element
is optional: it can be skipped or included once.</p>
</Li>
<li>
<P>A
<a namE="idx1073747087"></a>
<a nAME="idx1073747088"></A><tt clASS="monofont">*</Tt> sign indicates an element that can
be skipped or included one or more times.</p>
</li>
</UL>
<DIv claSS="note"><P Class="notetitle"><b>Note</b></p>
<p>These characters have meanings similar to those they have in regular
expressions. (Of course, not everything you can use in an <tt class="monofont">re</tt> pattern can
be used in a DTD.)</p>
</Div>
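<p>The analogy with regular expressions can be made concrete. The sketch below is not a DTD validator and does not read <tt class="monofont">survey.dtd</tt>. It assumes each content model has been rewritten by hand as an <tt class="monofont">re</tt> pattern over the names of an element's children, and the names <tt class="monofont">CONTENT_MODELS</tt>, <tt class="monofont">children_match</tt>, and <tt class="monofont">check_tree</tt> are hypothetical helpers introduced here for illustration.</p>

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex equivalents of the content models in survey.dtd.
# Every child name is followed by one space so that names cannot run together.
CONTENT_MODELS = {
    "SURVEY": r"CLIENT (SECTION )+",
    "CLIENT": r"NAME LOCATION (CONTACT )?(EMAIL )?(TELEPHONE )?",
    "SECTION": r"(QUESTION )+",
    "QUESTION": r"QUESTION_DESC Op1 Op2 ",
}


def children_match(element):
    """Check an element's direct children against its hand-written pattern."""
    pattern = CONTENT_MODELS.get(element.tag)
    if pattern is None:
        return True  # no model recorded here (for example, #PCDATA elements)
    child_names = "".join(child.tag + " " for child in element)
    return re.fullmatch(pattern, child_names) is not None


def check_tree(root):
    """Return the tags of all elements whose children break their pattern."""
    return [el.tag for el in root.iter() if not children_match(el)]


# Illustrative documents: one follows the content models, one does not.
valid = ET.fromstring(
    "<SURVEY><CLIENT><NAME>A</NAME><LOCATION>B</LOCATION></CLIENT>"
    "<SECTION><QUESTION><QUESTION_DESC>Q</QUESTION_DESC>"
    "<Op1>x</Op1><Op2>y</Op2></QUESTION></SECTION></SURVEY>"
)
invalid = ET.fromstring("<SURVEY><SECTION/><CLIENT/></SURVEY>")

print(check_tree(valid))
print(check_tree(invalid))
```

<p>This checks only the order and number of child elements. Attributes, character data, and everything else a validating parser verifies are left out of the sketch.</p>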
<Br>
<br>