<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Robots" content="INDEX,NOFOLLOW">
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<TITLE>Safari | Python Developer's Handbook -&gt; XML Processing</TITLE>
<LINK REL="stylesheet" HREF="oreillyi/oreillyN.css">
</HEAD>
<BODY bgcolor="white" text="black" link="#990000" vlink="#990000" alink="#990000" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0">

<TABLE width=100% bgcolor=white border=0 cellspacing=0 cellpadding=5><TR><TD>
<FONT><h3>



XML Processing</h3>
				<p>The first standard that you will learn how to manipulate in Python is
XML.</p>

				<P>The Web already has a standard for defining markup languages like
HTML, which is called 
<A NAme="idx1073747053"></a>
					<a NAME="idx1073747054"></a>
					<a naME="idx1073747055"></A>SGML. HTML itself is defined in SGML. SGML could have been
adopted directly as the new standard, and browsers could have been extended with SGML
parsers. However, SGML is quite complex to implement and contains many
features that are very rarely used.</P>

				<p>SGML is much more than a Web standard; it was around long
before the Web. HTML is an application of SGML, and XML is a subset of it.</p>

				<p>SGML also lacks character set support, and it is difficult to
interpret an SGML document without having the definition of the markup language
(the DTD, or Document Type Definition) available.</p>

				<p>Consequently, it was decided to develop a simplified version of SGML,
which was called XML. The main point of XML is that you, by defining your own
markup language, can encode the information of your documents more precisely
than is possible with HTML. This means that programs processing these documents
can "understand" them much better and therefore process the
information in ways that are impossible with HTML (or ordinary word processor
documents).</p>

				
					<h4>Introduction to XML</h4>
					<p>The <i>Extensible Markup Language (XML)</i> is a
  subset of SGML. Its goal is to enable generic SGML to be served, received, and
  processed on the Web in the way that is now possible with HTML. XML has been
  designed for ease of implementation and for interoperability with both SGML and
  HTML.
  <a name="idx1073747056"></a>
						<a naMe="idx1073747057"></a>
						<A namE="idx1073747058"></a>
						<a naMe="idx1073747059"></a></p>

					<P>XML describes a class of data objects called XML documents and
  partially describes the behavior of computer programs that process them. XML is
  an application profile or restricted form of <I>SGML, </I>the
  <I>Standard Generalized Markup Language</i> (ISO 8879). By
  construction, XML documents are conforming SGML documents. An XML parser can
  check whether an XML document is well-formed without the aid of a DTD.
  <a naME="idx1073747060"></A>
						<A name="idx1073747061"></A>
						<A NAme="idx1073747062"></a></p>
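					<p>As a rough illustration of checking a document without a DTD, the sketch below uses the standard library's <tt class="monofont">xml.etree.ElementTree</tt> module to test whether a string is well-formed (tags properly nested and closed). The helper name <tt class="monofont">is_well_formed</tt> and the sample strings are assumptions made for this example; they do not come from the original text.</p>

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML. No DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents for illustration only.
good = "<SURVEY><CLIENT><NAME>Lessaworld Corp.</NAME></CLIENT></SURVEY>"
bad = "<SURVEY><CLIENT><NAME>Lessaworld Corp.</CLIENT></NAME></SURVEY>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

					<p>This only checks well-formedness. Checking that a document also obeys the rules of a DTD (validity) is a separate step that this sketch does not attempt.</p>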

					<P>XML documents are made up of storage units called 
  <A NAme="idx1073747063"></a><i>entities</i>, which contain either parsed
  or unparsed data. Parsed data is made up of
  characters, some of which form character data, and some of which form markup,
  such as the tags that delimit <i>elements</i>. Markup encodes a description of the document's storage layout and
  logical structure. XML provides a mechanism to impose constraints on the
  storage layout and logical structure.</p>

					<p>A software module called an 
  <a name="idx1073747064"></a>
						<a name="idx1073747065"></a>XML parser is used to read XML documents and provide access
  to their content and structure. It is assumed that an XML parser is doing its
  work on behalf of another module, called the application. The XML specification
  describes the required behavior of an XML parser in terms of how it must read
  XML data and the information it must provide to the application. For more
  information, see</p>

					<BloCkquOte>
<p>Extensible Markup Language (XML) Recommendation</p>
<p>W3C Recommendation: Extensible Markup Language (XML) 1.0</p>
<p><a tarGET="_blank" Href="http://www.w3.org/TR/REC-xml.html">http://www.w3.org/TR/REC-xml.html</a></p>
</BLockqUOTE>
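					<p>The division of labor between parser and application can be sketched with the standard library's <tt class="monofont">xml.sax</tt> module: the parser reads the document and calls methods on a handler object supplied by the application. The class name <tt class="monofont">ElementCounter</tt> and the sample document are assumptions made for this example.</p>

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Application-side handler; the parser calls these methods as it reads."""

    def __init__(self):
        super().__init__()
        self.counts = {}   # element name -> number of start tags seen
        self.chunks = []   # raw character data, in document order

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # The parser may deliver text in several pieces, so just collect it.
        self.chunks.append(content)


# Hypothetical sample document for illustration only.
document = b"<SURVEY><CLIENT><NAME>Lessaworld Corp.</NAME></CLIENT></SURVEY>"

handler = ElementCounter()
xml.sax.parseString(document, handler)

print(handler.counts)
print("".join(handler.chunks).strip())
```

					<p>The application never touches the raw markup here; it only sees the events that the parser reports.</p>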
				
				
					<h4>
  
  
  Writing an XML File</h4>
					<p>As shown next, it is simple to define your own markup language with XML. The following block is the content of a file called <tt class="monofont">survey.xml</tt>. This code uses a markup language defined specifically for a survey.</p>

					<pre>
						
&lt;!DOCTYPE SURVEY SYSTEM "survey.dtd"&gt;
&lt;SURVEY&gt;
  &lt;CLIENT&gt;
    &lt;NAME&gt;         Lessaworld Corp.           &lt;/NAME&gt;
    &lt;LOCATION&gt;     Pittsburgh, PA             &lt;/LOCATION&gt;
    &lt;CONTACT&gt;      Andre Lessa                &lt;/CONTACT&gt;
    &lt;EMAIL&gt;        webmaster@lessaworld.com   &lt;/EMAIL&gt;
    &lt;TELEPHONE&gt;    (412)555-5555              &lt;/TELEPHONE&gt;
  &lt;/CLIENT&gt;
  &lt;SECTION SECTION_ID="1"&gt;
    &lt;QUESTION QUESTION_ID="1" QUESTION_LEVEL="1"&gt;
      &lt;QUESTION_DESC&gt;What is your favorite language?&lt;/QUESTION_DESC&gt;
      &lt;Op1&gt;Python&lt;/Op1&gt;
      &lt;Op2&gt;Perl&lt;/Op2&gt;
    &lt;/QUESTION&gt;
    &lt;QUESTION QUESTION_ID="2" QUESTION_LEVEL="1"&gt;
      &lt;QUESTION_DESC&gt;Do you use this language at work?&lt;/QUESTION_DESC&gt;
      &lt;Op1&gt;Yes&lt;/Op1&gt;
      &lt;Op2&gt;No&lt;/Op2&gt;
    &lt;/QUESTION&gt;
    &lt;QUESTION QUESTION_ID="3" QUESTION_LEVEL="1"&gt;
      &lt;QUESTION_DESC&gt;Did you expect the Spanish inquisition?&lt;/QUESTION_DESC&gt;
      &lt;Op1&gt;No&lt;/Op1&gt;
      &lt;Op2&gt;Of course not&lt;/Op2&gt;
    &lt;/QUESTION&gt;
  &lt;/SECTION&gt;
&lt;/SURVEY&gt;

					</pre>
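					<p>To give a feel for how a program might read such a file, here is a minimal sketch using the standard library's <tt class="monofont">xml.etree.ElementTree</tt> module. It works on an abbreviated copy of the survey held in a string, with the <tt class="monofont">DOCTYPE</tt> line left out for simplicity; the variable names are assumptions made for this example.</p>

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the survey document, kept inline so the sketch runs alone.
SURVEY_XML = """\
<SURVEY>
  <CLIENT>
    <NAME>   Lessaworld Corp.   </NAME>
  </CLIENT>
  <SECTION SECTION_ID="1">
    <QUESTION QUESTION_ID="1" QUESTION_LEVEL="1">
      <QUESTION_DESC>What is your favorite language?</QUESTION_DESC>
      <Op1>Python</Op1>
      <Op2>Perl</Op2>
    </QUESTION>
    <QUESTION QUESTION_ID="2" QUESTION_LEVEL="1">
      <QUESTION_DESC>Do you use this language at work?</QUESTION_DESC>
      <Op1>Yes</Op1>
      <Op2>No</Op2>
    </QUESTION>
  </SECTION>
</SURVEY>
"""

root = ET.fromstring(SURVEY_XML)

# Element text keeps its surrounding whitespace, so strip the padding.
client_name = root.findtext("CLIENT/NAME", default="").strip()

questions = []
for question in root.iter("QUESTION"):
    questions.append({
        "id": question.get("QUESTION_ID"),  # attributes are read with get()
        "text": question.findtext("QUESTION_DESC", default="").strip(),
        "options": [question.findtext("Op1"), question.findtext("Op2")],
    })

print(client_name)
for q in questions:
    print(q["id"], q["text"], q["options"])
```

					<p>Because the element names describe the data, the program can ask for a question's description or options by name instead of guessing from layout.</p>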

					<p>To complement the XML document shown previously, we
  need a 
  <A naMe="idx1073747066"></a>
						<a Name="idx1073747067"></a><I>Document Type Definition (DTD)</i>, such as
  the following one. The DTD can be part of the XML file, or it can be
  stored as an independent file, as we are doing here. Note the first line of the
  XML file, where we give the name of the DTD file
  (<tt CLASs="monofont">survey.dtd</tt>). Also note that the XML community appears to be moving toward
  XML Schemas rather than DTDs.</p>

					<PRE>
						
&lt;!ELEMENT SURVEY      (CLIENT, SECTION+)&gt;

&lt;!ELEMENT CLIENT (NAME, LOCATION, CONTACT?, EMAIL?, TELEPHONE?)&gt;
&lt;!ELEMENT NAME         (#PCDATA)&gt;
&lt;!ELEMENT LOCATION     (#PCDATA)&gt;
&lt;!ELEMENT CONTACT      (#PCDATA)&gt; 
&lt;!ELEMENT EMAIL        (#PCDATA)&gt;
&lt;!ELEMENT TELEPHONE    (#PCDATA)&gt;

&lt;!ELEMENT SECTION     (QUESTION+)&gt;
&lt;!ELEMENT QUESTION    (QUESTION_DESC, Op1, Op2)&gt;

&lt;!ELEMENT QUESTION_DESC   (#PCDATA)&gt;
&lt;!ELEMENT Op1              (#PCDATA)&gt;
&lt;!ELEMENT Op2              (#PCDATA)&gt;

&lt;!ATTLIST SECTION    SECTION_ID     CDATA #IMPLIED&gt;
&lt;!ATTLIST QUESTION   QUESTION_ID    CDATA #IMPLIED
                     QUESTION_LEVEL CDATA #IMPLIED&gt;

					</Pre>

					<p>Now, let's see how a DTD works. For a simple example like
  this one, we need two kinds of declarations: <tT CLAss="monofont">&lt;!ELEMENT&gt;</tt>
  and <TT CLass="monofont">&lt;!ATTLIST&gt;</tt>.
						<a name="idx1073747068"></a>
						<a name="idx1073747069"></a>
						<a naMe="idx1073747070"></a>
						<A namE="idx1073747071"></a>
						<a naMe="idx1073747072"></a>
						<a NAME="idx1073747073"></a>
						<a naME="idx1073747074"></A>
						<A name="idx1073747075"></A>
						<A NAme="idx1073747076"></a></p>

					<P>The 
  <A NAme="idx1073747077"></a>
						<a name="idx1073747078"></a><tt class="monofont">&lt;!ELEMENT&gt;</tt> declaration is used
  to define the elements that can appear in the XML file. The general syntax is</p>

					<pRe>
						
&lt;!ELEMENT NAME CONTENTS&gt;
					</pRe>

					<p>The first argument 
  <a Name="idx1073747079"></a>
						<A namE="idx1073747080"></A>(<TT clasS="monofont">NAME</TT>) gives the name of the element,
  and the second one 
  <A name="idx1073747081"></A>
						<A NAme="idx1073747082"></a>(<tT CLAss="monofont">CONTENTS</tt>) lists the element names that
  are allowed to be underneath the element that we are defining.</p>
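					<p>One way to see the name and contents of each declaration from Python is to let the standard library's <tt class="monofont">xml.parsers.expat</tt> module report them. The sketch below assumes a small document whose DTD is embedded in the file (rather than stored separately as in our example), and the handler name <tt class="monofont">on_element_decl</tt> and the <tt class="monofont">declarations</tt> dictionary are hypothetical names chosen for illustration.</p>

```python
from xml.parsers import expat

# Hypothetical document with the DTD embedded, so the parser sees it directly.
DOCUMENT = """\
<!DOCTYPE SURVEY [
  <!ELEMENT SURVEY (CLIENT, SECTION+)>
  <!ELEMENT CLIENT (#PCDATA)>
  <!ELEMENT SECTION (#PCDATA)>
]>
<SURVEY><CLIENT>Lessaworld Corp.</CLIENT><SECTION>one</SECTION></SURVEY>
"""

# Map expat's quantifier codes back to the characters used in the DTD.
QUANTIFIERS = {
    expat.model.XML_CQUANT_NONE: "",
    expat.model.XML_CQUANT_OPT: "?",
    expat.model.XML_CQUANT_REP: "*",
    expat.model.XML_CQUANT_PLUS: "+",
}

declarations = {}


def on_element_decl(name, model):
    # model is a tuple (type, quantifier, name, children);
    # each child has the same four-part shape.
    _type, _quant, _name, children = model
    declarations[name] = [child[2] + QUANTIFIERS[child[1]] for child in children]


parser = expat.ParserCreate()
parser.ElementDeclHandler = on_element_decl
parser.Parse(DOCUMENT, True)

for element_name, contents in declarations.items():
    print(element_name, contents)
```

					<p>Elements that hold only character data are reported with no child elements, which matches the idea that nothing else may appear underneath them.</p>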

					<p>The ordering that we use to list the contents is important. When we
  say, for example,</p>

					<pre>
						
&lt;!ELEMENT SURVEY (CLIENT, SECTION+)&gt;
					</pre>

					<p>it means that we must have a <tt clasS="monofont">CLIENT</tt> first,
  followed by a <Tt clAss="monofont">SECTION</tt>. Note that we have a special character
  (the plus sign) just after the second element in the content list. This
  character, as well as some others, has a special meaning:</P>

					<ul>
<lI>
							<P>A 
<A Name="idx1073747083"></a>
								<A NAMe="idx1073747084"></a><tt CLASs="monofont">+</tt> sign after an element means that
it can be included one or more times.</p>

						</LI>
<LI>
							<p>A 
<a name="idx1073747085"></a>
								<a name="idx1073747086"></a><tt class="monofont">?</Tt> sign indicates that the element
can be skipped.</p>

						</Li>
<li>
							<P>A 
<a namE="idx1073747087"></a>
								<a nAME="idx1073747088"></A><tt clASS="monofont">*</Tt> sign indicates that the element can
be skipped or included one or more times.</p>

						</li>
</UL>
					<DIv claSS="note"><P Class="notetitle"><b>Note</b></p><p>

						<p>These characters have similar meanings to what they do in regular
 expressions. (Of course, not everything you use in an <tt class="monofont">re</tt> can
 be used in a DTD.)</p>

					</p></Div>
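					<p>The analogy with regular expressions can be made concrete. The sketch below translates a flat, comma-separated content list into a pattern for the <tt class="monofont">re</tt> module and tests a sequence of child element names against it. The function names are hypothetical, and the translation is a deliberate simplification: it handles only plain sequences with the three suffix characters, not choices or nested groups.</p>

```python
import re


def content_model_to_regex(model: str) -> "re.Pattern[str]":
    """Translate a flat content list such as '(CLIENT, SECTION+)' into a regex.

    Each element name is followed by a space in the pattern so that one name
    cannot match the beginning of a longer name.
    """
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "+?*" else ""
        name = item.rstrip("+?*")
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def children_match(model: str, child_names: list) -> bool:
    """Return True if the sequence of child names satisfies the content list."""
    pattern = content_model_to_regex(model)
    sequence = "".join(name + " " for name in child_names)
    return pattern.fullmatch(sequence) is not None


survey_model = "(CLIENT, SECTION+)"
client_model = "(NAME, LOCATION, CONTACT?, EMAIL?, TELEPHONE?)"

print(children_match(survey_model, ["CLIENT", "SECTION", "SECTION"]))
print(children_match(survey_model, ["CLIENT"]))
print(children_match(client_model, ["NAME", "LOCATION", "EMAIL"]))
print(children_match(client_model, ["NAME", "EMAIL", "LOCATION"]))
```

					<p>The second and fourth checks fail for the reasons described in the text: at least one <tt class="monofont">SECTION</tt> is required, and the order of the names in the content list matters.</p>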
<Br>
<br>