The expat module works by providing a special extension type
object that refers to an Expat parser. To control the parser, you
simply set different attributes on the object. Here is a very simple
example that prints out all of the starting tags, ending tags, and
character data in a document:
import sys
from xml.parsers import expat

def start_element(name,attrs):
    print 'Start:', name, attrs

def character_data(data):
    print 'Data:', repr(data)

def end_element(name):
    print 'End: ', name

# Create an Expat parser
p = expat.ParserCreate()

# Attach some functions to the parser
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = character_data

# Run it on a file
p.ParseFile(open(sys.argv[1]))
When run on our example file, it produces the following output:
$ python simple.py guac.xml
Start: recipe {}
Data: u'\n'
Data: u' '
Start: title {}
Data: u'\n'
Data: u' Famous Guacamole'
Data: u'\n'
Data: u' '
End: title
Data: u'\n'
Data: u' '
Start: description {}
Data: u'\n'
Data: u' A southwest favorite!'
Data: u'\n'
Data: u' '
End: description
Data: u'\n'
Data: u' '
Start: ingredients {}
Data: u'\n'
Data: u' '
Start: item {u'num': u'4', u'units': u'none'}
Data: u' Large avocados, chopped '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'1', u'units': u'none'}
Data: u' Tomato, chopped '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'1/2', u'units': u'C'}
Data: u' White onion, chopped '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'2', u'units': u'tbl'}
Data: u' Fresh squeezed lemon juice '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'1', u'units': u'none'}
Data: u' Jalape'
Data: u'\xf1'
Data: u'o pepper, diced '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'1', u'units': u'tbl'}
Data: u' Fresh cilantro, minced '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'1', u'units': u'tbl'}
Data: u' Garlic, minced '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'3', u'units': u'tsp'}
Data: u' Salt '
End: item
Data: u'\n'
Data: u' '
Start: item {u'num': u'12', u'units': u'bottles'}
Data: u' Ice-cold beer '
End: item
Data: u'\n'
Data: u' '
End: ingredients
Data: u'\n'
Data: u' '
Start: directions {}
Data: u'\n'
Data: u' Combine all ingredients and hand whisk to desired consistency. '
Data: u'\n'
Data: u' Serve and enjoy with ice-cold beers.'
Data: u'\n'
Data: u' '
End: directions
Data: u'\n'
End: recipe
In this example, the document is read sequentially and functions such
as start_element(), end_element() and character_data() are called as
different parts of the document are encountered. From the output, you
can see how the starting and ending tags nest. Furthermore, when
starting tags are encountered, the attributes are conveniently
provided as a Python dictionary. For instance, the XML element
<item num="12" units="bottles">Ice-cold beer</item>
produces the following output:
Start: item {u'num': u'12', u'units': u'bottles'}
Data: u' Ice-cold beer '
End: item
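The callback sequence described above can also be captured in a list instead of being printed. The following is a minimal sketch, not part of the original example: it is written for Python 3 (the rest of this document uses the Python 2 print statement), it feeds the parser a string through Parse() rather than a file, and the events list is a name introduced only for illustration.

```python
from xml.parsers import expat

# Hypothetical container: one tuple per parser callback, in document order.
events = []

def start_element(name, attrs):
    # attrs arrives as an ordinary dictionary of attribute names to values
    events.append(("start", name, attrs))

def end_element(name):
    events.append(("end", name))

def character_data(data):
    events.append(("data", data))

p = expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = character_data

# Parse() takes the document text; True marks this as the final chunk.
p.Parse('<item num="12" units="bottles">Ice-cold beer</item>', True)

for event in events:
    print(event)
```

Recording events this way makes it easy to inspect the order of calls and the attribute dictionary without relying on printed output.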
When character data is encountered, the character_data() function is
called. In almost all cases, XML parsers return character data and
attributes as Unicode strings, as denoted by the u'...' syntax in
the above output. An important feature of receiving character data is
that it is often supplied in small chunks that must be concatenated.
In the example, you can see this by observing the multiple calls to
the character_data() function that occur between certain starting and
ending tags. For example, the XML element
<directions>
Combine all ingredients and hand whisk to desired consistency.
Serve and enjoy with ice-cold beers.
</directions>
produces the following sequence of calls:
Start: directions {}
Data: u'\n'
Data: u' Combine all ingredients and hand whisk to desired consistency. '
Data: u'\n'
Data: u' Serve and enjoy with ice-cold beers.'
Data: u'\n'
Data: u' '
End: directions
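Because text arrives in pieces, a handler normally appends each piece to a list and joins the pieces later. The sketch below illustrates this in Python 3 (an assumption; the examples in this document are Python 2). It also shows the parser's buffer_text attribute, which asks pyexpat to accumulate adjacent text before calling the handler. The collect_text() helper and the shortened DOC string are hypothetical and exist only for this illustration.

```python
from xml.parsers import expat

# Shortened stand-in for the <directions> element (hypothetical sample text).
DOC = "<directions>\n  Combine all ingredients.\n  Serve and enjoy.\n</directions>"

def collect_text(xml_text, buffer_text=False):
    """Return the list of chunks passed to the character data handler."""
    chunks = []
    p = expat.ParserCreate()
    # When buffer_text is true, pyexpat gathers adjacent character data
    # and delivers it in as few handler calls as possible.
    p.buffer_text = buffer_text
    p.CharacterDataHandler = chunks.append
    p.Parse(xml_text, True)
    return chunks

raw_chunks = collect_text(DOC)
buffered_chunks = collect_text(DOC, buffer_text=True)

# Joining the chunks recovers the full text no matter how it was split.
full_text = "".join(raw_chunks)
print(len(raw_chunks), len(buffered_chunks))
print(repr(full_text))
```

Either way, the joined result is the same; buffering only changes how many times the handler is called.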
Another useful feature of the parser is that special sequences like &lt;, &gt;,
and &#241; are automatically expanded. For example, the element
<item num="1" units="none"> Jalape&#241;o pepper, diced </item>
is expanded as:
Start: item {u'num': u'1', u'units': u'none'}
Data: u' Jalape'
Data: u'\xf1'
Data: u'o pepper, diced '
End: item
Notice how the &#241; text is replaced by the Unicode character u'\xf1'.
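The expansion can be checked directly by joining the character data for a small element. This is a sketch in Python 3 (an assumption, since the surrounding examples are Python 2); element_text() is a hypothetical helper, and the sample element is made up to exercise both a numeric character reference and the predefined entities.

```python
from xml.parsers import expat

def element_text(xml_text):
    """Parse a small document and return all of its character data joined together."""
    chunks = []
    p = expat.ParserCreate()
    p.CharacterDataHandler = chunks.append
    p.Parse(xml_text, True)
    return "".join(chunks)

# &#241; is a numeric character reference; &lt;, &gt; and &amp; are predefined entities.
text = element_text("<item>Jalape&#241;o &lt;diced&gt; &amp; salted</item>")
print(repr(text))
```

The handler never sees the escaped forms; it only receives the characters they stand for.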
One problematic aspect of the XML parsing in our example is the handling of
whitespace. In general, the parser does nothing to strip whitespace from
text. Therefore, the following set of elements
<title>Famous Guacamole</title>
<title> Famous Guacamole </title>
<title>
Famous Guacamole
</title>
produce three entirely different text strings:
u'Famous Guacamole'
u' Famous Guacamole '
u'\nFamous Guacamole\n'
In many cases, you may not want the extra whitespace. Therefore,
it is fairly common to write a normalization function that removes
redundant whitespace. For example:
def normalize_whitespace(text):
    return " ".join(text.split())
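As a quick check of what this function does, the sketch below applies it to the three title strings shown earlier plus one extra string with an internal run of whitespace. It is written so that it runs under Python 3; the samples and normalized names are introduced only for this illustration.

```python
def normalize_whitespace(text):
    # split() with no arguments drops leading and trailing whitespace and
    # splits on any run of whitespace; join() puts single spaces back.
    return " ".join(text.split())

samples = [
    "Famous Guacamole",
    " Famous Guacamole ",
    "\nFamous Guacamole\n",
    "Famous \n\t Guacamole",   # extra case: whitespace run inside the text
]
normalized = [normalize_whitespace(s) for s in samples]
print(normalized)
```

All four inputs collapse to the same string, which is the property the later examples rely on.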
Here is a slightly different example that collects text and uses
this function to normalize it:
import sys
from xml.parsers import expat

def normalize_whitespace(text):
    return " ".join(text.split())

class SimpleParse:
    def __init__(self):
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.character_data
        self.cdata = [ ]
    def parse(self,file):
        self.parser.ParseFile(file)
    def print_cdata(self):
        # Join the collected chunks, normalize them, and print non-empty text
        txt = normalize_whitespace("".join(self.cdata))
        if txt: print repr(txt)
        self.cdata = [ ]
    def start_element(self,name,attrs):
        self.print_cdata()
        print "Start:",name,attrs
    def character_data(self,data):
        self.cdata.append(data)
    def end_element(self,name):
        self.print_cdata()
        print "End:", name

p = SimpleParse()
p.parse(open(sys.argv[1]))
Now, the output appears somewhat more readable:
$ python simple.py guac.xml
Start: recipe {}
Start: title {}
u'Famous Guacamole'
End: title
Start: description {}
u'A southwest favorite!'
End: description
Start: ingredients {}
Start: item {u'num': u'4', u'units': u'none'}
u'Large avocados, chopped'
End: item
Start: item {u'num': u'1', u'units': u'none'}
u'Tomato, chopped'
End: item
Start: item {u'num': u'1/2', u'units': u'C'}
u'White onion, chopped'
End: item
Start: item {u'num': u'2', u'units': u'tbl'}
u'Fresh squeezed lemon juice'
End: item
Start: item {u'num': u'1', u'units': u'none'}
u'Jalape\xf1o pepper, diced'
End: item
Start: item {u'num': u'1', u'units': u'tbl'}
u'Fresh cilantro, minced'
End: item
Start: item {u'num': u'1', u'units': u'tbl'}
u'Garlic, minced'
End: item
Start: item {u'num': u'3', u'units': u'tsp'}
u'Salt'
End: item
Start: item {u'num': u'12', u'units': u'bottles'}
u'Ice-cold beer'
End: item
End: ingredients
Start: directions {}
u'Combine all ingredients and hand whisk to desired consistency. Serve and enjoy with ice-cold beers.'
End: directions
End: recipe
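The same collect-then-flush pattern can be written so that it returns the events rather than printing them, which is convenient for testing. The following is a sketch in Python 3 (an assumption; the class above is Python 2). The parse_events() function, its tuple layout, and the small inline document are hypothetical and are not part of the original program.

```python
from xml.parsers import expat

def normalize_whitespace(text):
    return " ".join(text.split())

def parse_events(xml_text):
    """Return start/text/end events with whitespace-normalized text."""
    events = []
    pending = []   # character data chunks seen since the last tag

    def flush():
        # Join the pending chunks, normalize them, and keep only non-empty text.
        text = normalize_whitespace("".join(pending))
        if text:
            events.append(("text", text))
        pending.clear()

    def start(name, attrs):
        flush()
        events.append(("start", name, attrs))

    def end(name):
        flush()
        events.append(("end", name))

    p = expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.CharacterDataHandler = pending.append
    p.Parse(xml_text, True)
    return events

events = parse_events(
    "<recipe>\n  <title>\n    Famous Guacamole\n  </title>\n</recipe>"
)
for event in events:
    print(event)
```

Flushing at every tag boundary is what makes the whitespace-only chunks between elements disappear from the result.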
This style of XML processing, in which special functions are invoked
for different document features as the document is read, is more
formally known as "event-driven parsing." In some sense, this is the
simplest way to read an XML document. However, in certain
situations, it is useful to work with an entire XML document that is
stored in a tree structure. Building a tree from an event-driven
parser is usually straightforward. For example, you might start with
code similar to the following:
import sys
from xml.parsers import expat

def normalize_whitespace(text):
    return " ".join(text.split())

class Node:
    def __init__(self):
        self.nextSibling = None
        self.prevSibling = None
        self.parentNode = None
        self.firstChild = None
        self.lastChild = None
    def appendChild(self,c):
        if self.firstChild:
            c.prevSibling = self.lastChild
            self.lastChild.nextSibling = c
        else:
            self.firstChild = c
        c.parentNode = self
        self.lastChild = c

class DocumentNode(Node):
    def __str__(self):
        return "DOCUMENT_NODE:"

class ElementNode(Node):
    def __init__(self,name,attrs):
        Node.__init__(self)
        self.attributes = attrs
        self.nodeName = name
    def __str__(self):
        return "ELEMENT_NODE: %s %s" % (self.nodeName, self.attributes)

class TextNode(Node):
    def __init__(self,text):
        Node.__init__(self)
        self.nodeValue = text
    def __str__(self):
        return "TEXT_NODE : %s" % repr(self.nodeValue)

class TreeParser:
    def __init__(self):
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.character_data
        # topNode always refers to the node currently being filled in
        self.topNode = DocumentNode()
    def parse(self,file):
        self.parser.ParseFile(file)
        return self.topNode
    def start_element(self,name,attrs):
        n = ElementNode(name,attrs)
        self.topNode.appendChild(n)
        self.topNode = n
    def character_data(self,data):
        n = TextNode(data)
        self.topNode.appendChild(n)
    def end_element(self,name):
        self.topNode = self.topNode.parentNode

p = TreeParser()
doc = p.parse(open(sys.argv[1]))

def print_tree(node,indent=0):
    if not node: return
    ispace = " "*4
    while node:
        print "%s%s" % (ispace*indent,node)
        print_tree(node.firstChild,indent+1)
        node = node.nextSibling

print_tree(doc)
When applied to our example, the output shows how it has captured
the hierarchical structure of the document:
DOCUMENT_NODE:
    ELEMENT_NODE: recipe {}
        TEXT_NODE : u'\n'
        TEXT_NODE : u' '
        ELEMENT_NODE: title {}
            TEXT_NODE : u' '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' Famous Guacamole'
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
        TEXT_NODE : u'\n'
        TEXT_NODE : u' '
        ELEMENT_NODE: description {}
            TEXT_NODE : u'\n'
            TEXT_NODE : u' A southwest favorite!'
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
        TEXT_NODE : u'\n'
        TEXT_NODE : u' '
        ELEMENT_NODE: ingredients {}
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'4', u'units': u'none'}
                TEXT_NODE : u' Large avocados, chopped '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'1', u'units': u'none'}
                TEXT_NODE : u' Tomato, chopped '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'1/2', u'units': u'C'}
                TEXT_NODE : u' White onion, chopped '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'2', u'units': u'tbl'}
                TEXT_NODE : u' Fresh squeezed lemon juice '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'1', u'units': u'none'}
                TEXT_NODE : u' Jalape'
                TEXT_NODE : u'\xf1'
                TEXT_NODE : u'o pepper, diced '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'1', u'units': u'tbl'}
                TEXT_NODE : u' Fresh cilantro, minced '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'1', u'units': u'tbl'}
                TEXT_NODE : u' Garlic, minced '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'3', u'units': u'tsp'}
                TEXT_NODE : u' Salt '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
            ELEMENT_NODE: item {u'num': u'12', u'units': u'bottles'}
                TEXT_NODE : u' Ice-cold beer '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
        TEXT_NODE : u'\n'
        TEXT_NODE : u' '
        ELEMENT_NODE: directions {}
            TEXT_NODE : u'\n'
            TEXT_NODE : u' Combine all ingredients and hand whisk to desired consistency. '
            TEXT_NODE : u'\n'
            TEXT_NODE : u' Serve and enjoy with ice-cold beers.'
            TEXT_NODE : u'\n'
            TEXT_NODE : u' '
        TEXT_NODE : u'\n'
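Much of the tree printed above consists of whitespace-only text nodes, and removing them is a simple example of a tree transformation. The sketch below is written for Python 3 (an assumption; the program above is Python 2) and deliberately uses a simpler representation than the Node classes: each element keeps a plain list of children. The Element class, build_tree(), and strip_blank_text() are hypothetical names introduced only for this illustration.

```python
from xml.parsers import expat

class Element:
    """Hypothetical minimal node: a name, attributes, and ordered children."""
    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.children = []   # Element objects or str text chunks

def build_tree(xml_text):
    """Build a tree by keeping a stack of currently open elements."""
    root = Element("#document", {})
    stack = [root]

    def start(name, attrs):
        node = Element(name, attrs)
        stack[-1].children.append(node)
        stack.append(node)

    def end(name):
        stack.pop()

    def data(text):
        stack[-1].children.append(text)

    p = expat.ParserCreate()
    p.StartElementHandler = start
    p.EndElementHandler = end
    p.CharacterDataHandler = data
    p.Parse(xml_text, True)
    return root

def strip_blank_text(node):
    """Remove whitespace-only text children, recursively and in place."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            if child.strip():
                kept.append(child)
        else:
            strip_blank_text(child)
            kept.append(child)
    node.children = kept

doc = build_tree("<recipe>\n  <title>Famous Guacamole</title>\n</recipe>")
strip_blank_text(doc)
recipe = doc.children[0]
title = recipe.children[0]
print(recipe.name, [c.name for c in recipe.children], "".join(title.children))
```

The stack plays the same role as the topNode attribute in TreeParser: its last entry is always the element currently being filled in.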
With the tree structure in hand, it is possible to look at the
document as a whole and to make tree transformations (e.g., nodes can
be removed, moved to new locations, reordered, etc.). It is even
possible to turn the tree back into XML by supplying a few additional
methods to the Node class. For example, the generate() method in the following:
class DocumentNode(Node):
    def __str__(self):
        return "DOCUMENT_NODE:"
    def generate(self,f):
        c = self.firstChild
        while c: