📄 xml-python.txt
字号:
Author: David Beazley (beazley@cs.uchicago.edu)
XML Processing in Python
========================
This chapter provides an general overview of XML processing topics in
Python. Particular attention is given to the PyXML toolkit that has
been developed by the Python XML-SIG and which is available at
http://www.python.org. However, readers should be aware that a wide
variety of other XML processing tools have been developed (a fairly
complete list is available at
http://pyxml.sourceforge.net/topics/software.html). Also, this chapter
is not intended to be a comprehensive reference. More details about
the SAX interface are provided in the next chapter, but the Python
library reference documentation should be consulted for further
details of other topics.
Installing the Software
-----------------------
Starting with Python 2.0, the Python standard library includes a
number of modules related to XML processing (xml.sax, xml.dom, xmllib,
etc.). However, most of these modules are only high-level interfaces that
are useless unless you also install a suitable XML parsing module.
Unfortunately, no such module is included in a standard Python source
distribution. Therefore, to enable this capability you may need to
download a binary distribution that includes XML support or install a
common XML parser such as Expat on your machine prior to building
Python.
For the most trouble-free results, you should download and install the
PyXML package. Information about this package is available at
http://www.python.org/sigs/xml-sig/. The version used for the
examples in this book are based on PyXML-0.6.6. The PyXML package
includes a number of useful features including the xmlproc validating
XML parser, SAX and DOM interfaces, and a variety of utility modules.
Installing the PyXML package is easy. Simply download the software
and type the following (make sure you have permission to write to the
Python library directory first):
$ tar -xf PyXML-0.6.6.tar
$ cd PyXML-0.6.6
$ python setup.py build # Build the package
$ python setup.py install # Install the package
In addition, for some of the examples presented in this chapter,
you may need to download the Expat XML parser which is available at
http://www.jclark.com/xml/expat.html. To see if you need Expat or
not, simply type the following:
$ python
>>> from xml.parsers import expat
If you get an ImportError exception, then it needs to be installed.
If you built Python from source, you can enable the Expat module by
first installing the Expat libraries and headers in a common location
(e.g., /usr/local on Unix). Then, go to the Python source directory
and type the following:
$ cd Python-2.1
$ python setup.py build
$ python setup.py install
Creating XML documents
----------------------
To create an XML document, simply use any standard text editor such as
vi or emacs. Although XML documents may contain wide Unicode
characters, using a normal ASCII editor is usually fine. If working
with 8-bit data, you will probably want to include specific encoding
information in the document. For example:
<?xml version="1.0" encoding="utf-8"?>
or
<?xml version="1.0" encoding="iso-8859-1"?>
If a XML document is encoded with wide-characters such as UTF-16, it
may be difficult to view or edit in certain editors. However, Python
2.0 and all newer releases support Unicode text as does the PyXML
toolkit.
It is also simple to generate XML from a program by
outputting text that includes an appropriate set of XML tags.
For example:
# Python program that outputs XML
recipe = {
'title' : 'Famous Guacamole',
'description' : 'A southwest favorite!',
'ingredients' : [ ('4',None,'Large avocados, chopped'),
('1',None,'Tomato, chopped'),
('1/2', 'C', 'White onion, chopped'),
('2','tbl','Fresh squeezed lemon juice'),
('1',None,'Jalapeno pepper, diced'),
('1','tbl','Fresh cilantro, minced'),
('1','tbl','Garlic, minced'),
('3','tsp','Salt'),
('12','bottles','Ice-cold beer')
],
'directions' : 'Combine all ingredients and hand whisk to desired '
'consistency. Serve and enjoy with ice-cold beers'
}
# Turn a recipe into XML
def make_xml(r,f):
# Utility function to format an item
def make_item(num,units,desc):
if units:
return '<item num="%s" units="%s">%s</item>\n' % (num,units,desc)
else:
return '<item num="%s">%s</item>\n' % (num,desc)
# Make ingredient item list
ingr_str = "".join([make_item(*i) for i in r["ingredients"]])
f.write(
"""
<recipe>
<title>%(title)s</title>
<description>
%(description)s
</description>
""" % recipe)
f.write(
"""
<ingredients>
%s
</ingredients>
""" % ingr_str)
f.write(
"""
<directions>
%(directions)s
</directions>
</recipe>
""" % recipe)
# Try it out
f = open("foo.xml","w")
f.write('<?xml version="1.0" encoding="iso-8859-1"?>\n')
make_xml(recipe,f)
f.close()
There is one subtle point of generating XML documents from
Python that needs to be mentioned. Since XML documents may contain
Unicode characters, it is common for programs that produce XML to work
with Unicode strings. In Python, Unicode strings are specified by
preceding a string with a u prefix like this:
s = u'Jalape\u00f1o Pepper'
In addition, arbitrary unicode characters are specified using the
\uxxxx escape sequence as shown above.
One problem with Unicode strings is that you can't just print them out
using the standard print or file write() method statement because you
may get an exception. For example:
>>> print s
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>>
To work around this problem, you need to write Unicode data using a
special data encoder that converts Unicode characters into a standard
encoding format such as UTF-8 or UTF-16. The Python codecs modules
contains functions for dealing with this problem. Here is a simple
example that shows how you would write a Unicode string in UTF-8
format.
import codecs
f = codecs.open("foo","w","utf-8")
s = u'Jalape\u00f1o Pepper'
f.write(s)
f.close()
To modify the earlier example to work properly with Unicode, you might
modify the last part of it to read like this:
import codecs
f = codecs.open("foo.xml","w","utf-8")
f.write('<?xml version="1.0" encoding="utf-8"?>\n')
make_xml(recipe,f)
f.close()
For further details about Unicode strings and I/O, consult a reference
such as the Python Essential Reference (New Riders).
Validating XML documents
------------------------
Once you've created a few XML documents, it is often quite useful to
validate them for proper syntax and structure. When the PyXML toolkit
is installed, two command line tools (both written in Python) are
installed for this purpose. The xmlproc_parse command can be used to
see if an XML document is well-formed. For example:
$ xmlproc_parse guac.xml
xmlproc version 0.70
Parsing 'guac.xml'
Parse complete, 0 error(s) and 0 warning(s)
$
If you make a mistake like misspelling a tag, you will get an error.
For example, suppose that <ingredients> was closed with </ingredient>
by accident:
<?xml version="1.0" encoding="utf-8"?>
<recipe>
<title> Famous Guacamole </title>
<description>
A southwest favorite!
</description>
<ingredients>
<item num="4"> Large avocados, chopped </item>
...
</ingredient>
<directions>
Combine all ingredients and hand whisk to desired consistency.
Serve and enjoy with ice-cold beers.
</directions>
</recipe>
Then the xmlproc_parse tool will report the following:
$ xmlproc_parse guac.xml
xmlproc version 0.70
Parsing 'guac.xml'
E:guac.xml:18:17: End tag for 'ingredient' seen, but 'ingredients' expected
E:guac.xml:23:10: End tag for 'recipe' seen, but 'ingredients' expected
Parse complete, 2 error(s) and 0 warning(s)
$
xmlproc_parse only checks the structure of the XML document. If you
want to validate the document against a DTD, the xmlproc_val tool can
be used. If you run xmlproc_val on a document without supplying a
DTD, you will get a large number of errors. For example:
$xmlproc_val guac.xml
xmlproc version 0.70
Parsing 'guac.xml'
E:guac.xml:3:9: Element 'recipe' not declared
E:guac.xml:4:11: Element 'title' not declared
E:guac.xml:5:17: Element 'description' not declared
E:guac.xml:8:17: Element 'ingredients' not declared
E:guac.xml:9:23: Element 'item' not declared
...
E:guac.xml:19:16: Element 'directions' not declared
Parse complete, 14 error(s) and 0 warning(s)
$
Therefore, your document needs to either include a DTD directly or
import one from elsewhere. For example, consider the simple recipe
DTD from the previous chapter:
<!DOCTYPE recipe [
<!-- The recipe DTD appears here -->
<!ELEMENT recipe (title, description?, ingredients, directions)>
<!ELEMENT ingredients (item+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ELEMENT directions (#PCDATA)>
<!ATTLIST item num CDATA #REQUIRED
units (C | tsp | tbl | bottles | none) "none">
]>
<!-- End of DTD -->
Now, let's make a few mistakes in the XML file by reordering two
elements and changing one of the attribute names.
<recipe>
<description>
A southwest favorite!
</description>
<title> Famous Guacamole </title>
<ingredients>
<item num="4"> Large avocados, chopped </item>
...
<item quantity="12" units="bottles"> Ice-cold beer </item>
</ingredient>
<directions>
Combine all ingredients and hand whisk to desired consistency.
Serve and enjoy with ice-cold beers.
</directions>
</recipe>
Now, look at the output of xmlproc_val:
$ xmlproc_val guac1.xml
xmlproc version 0.70
Parsing 'guac1.xml'
E:guac1.xml:17:17: Element 'description' not allowed here
E:guac1.xml:31:45: Attribute 'quantity' not declared
Parse complete, 2 error(s) and 0 warning(s)
$
Although you probably won't use xmlproc_parse and xmlproc_val directly
in your own XML programs, they are useful debugging tools to make
sure you've actually created a valid XML document.
XML Parsing with Regular Expressions
------------------------------------
If you only need to extract a few very simple pieces of information
from an XML document, you might consider the use of Python regular
expression pattern matching. For example, if you only wanted to
extract a recipe title, you might write a short script like this:
#!/usr/local/bin/python
import sys, re
pat = r'<title>(.*?)</title>'
data = open(sys.argv[1]).read()
# Search for a match
m = re.search(p,data)
if m:
print m.group(1)
Now, running it:
$ title.py guac.xml
Famous Guacamole
$
Here is a more complicated script that extracts the ingredient list
from a document.
#!/usr/local/bin/python
import sys
import re
pat = r'<ingredients>((.|\n)*?)</ingredients>'
filename = sys.argv[1]
data = open(filename).read()
m = re.search(pat,data)
if not m:
print "No ingredients found"
print sys.exit(1)
ingredients = m.group(1)
pat = r'<item\s.*?>(.*?)</item>'
all = re.findall(pat,ingredients)
for item in all:
print item
Although such techniques work in very simple cases, XML's treatment of
whitespace and formatting will present problems. For instance, in the
first example, the pattern for matching a title was specified as this:
pat = r'<title>(.*?)</title>'
However, the <title> element might appear in a variety of formats. For
example:
<title>
Famous Guacamole
</title>
<title >
Famous Guacamole
</title >
<title xml:lang="en">Famous Guacamole
</title>
<title
xml:lang="en">
Famous Guacamole
</title>
To deal with this properly, regular expressions start to get
complicated. For instance, you might rewrite the pattern like this:
pat = r'<title\s*(.|\n)*?>((.|\n)*?)</title\s*(.|\n)*?>
In addition to this, you have to be a little careful when trying to
match up starting and ending tags. For instance, if you wrote a
pattern like this:
pat = r'<item>(.|\n)*</item>'
And applied it to this text:
<item>Large avocado</item>
<item>Tomato</item>
It will find the longest possible match. In this case, it matches the
entire text instead of the just the first item. To prevent this, you
need to use special modifiers such as ? to force shortest length
matching. For example:
pay = r'<item>(.|\n)*?</item>
There are other issues with regular expressions that might cause
problems. For instance, text in XML comments or CDATA sections can
throw off pattern matching (for instance, if a comment included text
that was important to the pattern matching). In addition, XML
entities aren't expanded properly. For example, if you matched this:
<item num="1"> Jalapeño pepper, diced </item>
the ñ sequence would be left unmodified.
Needless to say, you should only consider the use of regular expressions
if you need a quick and dirty hack or if you have a simple
application in which it makes sense to do this.
An Introduction to XML Parsing with Expat
-----------------------------------------
This section describes the fundamentals of XML parsing using the
Python Expat parsing module. Expat is a freely available
non-validating XML parser that was developed by James Clark and is
used in a variety of projects including Python, Perl, and Mozilla. It
is also the only XML parser that a standard Python distribution will
recognize without additionally installing another XML package such as
PyXML. It is written in C so it has the advantage of being extremely
fast---which may be important if you are working with very large
documents.
The primary goal of this chapter is to introduce simple XML processing
by looking at the operation of a simple, low-level XML parser. In the
next section we show how the techniques in this chapter extend to the
more popular SAX and DOM interfaces for manipulating XML.
Python provides an interface to Expat in the xml.parsers.expat module.
In order for this module to be available, you may need to install the
Expat library on your machine and build the corresponding Python extension
module. Follow the instructions in the Python distribution and the first
part of this chapter for details.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -