            c.generate(f)
            c = c.nextSibling
        
class ElementNode(Node):
    def __init__(self,name,attrs):
        Node.__init__(self)        
        self.attributes = attrs
        self.nodeName = name

    def __str__(self):
        return "ELEMENT_NODE: %s %s" % (self.nodeName, self.attributes)

    def generate(self,f):
        f.write('<%s ' % self.nodeName)
        for key,value in self.attributes.items():
            f.write('%s="%s" ' % (key, value.replace('"','&quot;')))
        f.write(">")
        c = self.firstChild
        while c:
            c.generate(f)
            c = c.nextSibling
        f.write("</%s>" % self.nodeName)
    
class TextNode(Node):
    def __init__(self,text):
        Node.__init__(self)                
        self.nodeValue = text
    def __str__(self):
        return "TEXT_NODE   : %s" % repr(self.nodeValue)

    def generate(self,f):
        f.write(self.nodeValue)

...

# Read the document
p = TreeParser()
doc = p.parse(open(sys.argv[1]))

# Do something to it
...

# Write it back out
import codecs
f = codecs.open("foo.xml","w","utf-8")
doc.generate(f)
f.close()

Although the examples in this section have only focused on parsing
XML elements and text sections, parsing of other XML document features
is performed in an entirely analogous manner.  For example, if you wanted to
capture XML comments you would write a function like this:

   def comment(data):
       print "Comment: ", repr(data)

   ...
   parser.CommentHandler = comment

If you wanted to know where XML CDATA sections started and ended (so that
you could preserve formatting), you would supply a pair of functions:

   def start_cdata():
       print "CDATA start:"

   def end_cdata():
       print "CDATA end:"

   ...
   parser.StartCdataSectionHandler = start_cdata
   parser.EndCdataSectionHandler = end_cdata

Similar functions can be written for capturing information from a DTD,
entities, and other XML features (the full details can be found in
the library documentation for the xml.parsers.expat module).
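
The comment and CDATA handlers above can be tried out together in a
small self-contained script.  The following is only a sketch: it is
written in Python 3 syntax (print as a function), and the events list
and the sample document are made up for illustration.

```python
# Sketch: record comments, CDATA boundaries, and text with expat.
import xml.parsers.expat

events = []   # hypothetical list that collects what the parser reports

def comment(data):
    events.append(("comment", data))

def start_cdata():
    events.append(("cdata-start", None))

def end_cdata():
    events.append(("cdata-end", None))

def char_data(data):
    # Text may arrive in more than one piece
    events.append(("text", data))

parser = xml.parsers.expat.ParserCreate()
parser.CommentHandler = comment
parser.StartCdataSectionHandler = start_cdata
parser.EndCdataSectionHandler = end_cdata
parser.CharacterDataHandler = char_data

# Made-up sample document
sample = "<note><!-- keep cold --><![CDATA[a < b]]></note>"
parser.Parse(sample, True)

for kind, data in events:
    print(kind, repr(data))
```

Because the CDATA start and end events bracket the text events, a
program that rewrites the document can decide whether to emit the text
inside a CDATA section again or to escape it instead.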

Although several simple examples of parsing with expat have been
presented, this is rarely the approach that you would take in
practice.  Instead, a number of standardized high-level parsing
interfaces have been developed.  This leads to the next section.

XML Processing with SAX and DOM
-------------------------------
One issue that arises with XML processing is that there is a wide
variety of parsing libraries and modules for working with XML
documents.  Although the previous section described the Expat module,
it is not the only XML parsing engine available for Python.  If you
wanted to use a different parsing module such as xmlproc, you would
have to change your program to work with its API.  Clearly this is
problematic.

To address this problem, there are two commonly used XML processing
interfaces:

  SAX (Simple API for XML).  SAX is an API for event-driven XML
  processing.

  DOM (Document Object Model).  DOM is a specification for
  manipulating XML documents as a tree.

It is important to note that SAX and DOM do not refer to a specific
package or library.  Instead, they merely standardize the interface
that is used to process an XML document.  In fact, these interfaces
are far more general than Python and extend to other languages such as
Java and C++.

The SAX interface is very similar to the Expat interface described in
the previous section.  To parse a document, you implement special
methods in a handler class that get invoked as different parts of the
document such as elements, text, and entities are encountered.  For
example:

# Simple SAX example
import sys
from xml.sax import handler
from xml.sax import make_parser

# Define a simple SAX handler that prints elements and text.
# ContentHandler is the standard base class for content events.
class SimpleHandler(handler.ContentHandler):
    def startElement(self,name,attrs):
        print 'Start: ',name,attrs
    def endElement(self,name):
        print 'End: ',name
    def characters(self,data):
        print 'Data: ', repr(data)

# Create an XML parser object
parser = make_parser()

# Create a simple handler
sh = SimpleHandler()

# Tell the parser about it
parser.setContentHandler(sh)

# Parse a file
parser.parse(open(sys.argv[1]))

When run on the example file, this produces the following output:

Start:  recipe <xml.sax.xmlreader.AttributesImpl instance at 16a59c>
Data:  u'\n'
Data:  u'   '
Start:  title <xml.sax.xmlreader.AttributesImpl instance at 16a59c>
Data:  u' '
Data:  u'\n'
Data:  u'   Famous Guacamole'
Data:  u'\n'
Data:  u'   '
End:  title
Data:  u'\n'
Data:  u'   '
Start:  description <xml.sax.xmlreader.AttributesImpl instance at 16a754>
Data:  u'\n'
Data:  u'   A southwest favorite!'
Data:  u'\n'
Data:  u'   '
End:  description
Data:  u'\n'
Data:  u'   '
Start:  ingredients <xml.sax.xmlreader.AttributesImpl instance at 16a754>
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a59c>
Data:  u' Large avocados, chopped '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a59c>
Data:  u' Tomato, chopped '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a59c>
Data:  u' White onion, chopped '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a59c>
Data:  u' Fresh squeezed lemon juice '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a59c>
Data:  u' Jalape'
Data:  u'\xf1'
Data:  u'o pepper, diced '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a77c>
Data:  u' Fresh cilantro, minced '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a77c>
Data:  u' Garlic, minced '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a77c>
Data:  u' Salt '
End:  item
Data:  u'\n'
Data:  u'        '
Start:  item <xml.sax.xmlreader.AttributesImpl instance at 16a77c>
Data:  u' Ice-cold beer '
End:  item
Data:  u'\n'
Data:  u'   '
End:  ingredients
Data:  u'\n'
Data:  u'   '
Start:  directions <xml.sax.xmlreader.AttributesImpl instance at 16a77c>
Data:  u'\n'
Data:  u'   Combine all ingredients and hand whisk to desired consistency.  '
Data:  u'\n'
Data:  u'   Serve and enjoy with ice-cold beers.'
Data:  u'\n'
Data:  u'   '
End:  directions
Data:  u'\n'
End:  recipe

The primary difference between this and the earlier Expat example is
that actions are defined as methods of a handler class.  In addition,
attributes are stored by special objects that work much like a
dictionary, but provide a few additional methods for access.  For
example, to print the attributes you might change startElement() to
the following:

    def startElement(self,name,attrs):
        print 'Start: ',name
        for n in attrs.getNames():
            print "    %s = %r" % (n, attrs.getValue(n))
            
This produces output such as this:

    Start:  item
        num = u'12'
        units = u'bottles'
    Data:  u' Ice-cold beer '
    End:  item
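
The dictionary-like behavior of the attribute object can be sketched
as follows.  This is an illustration only, written in Python 3 syntax;
the AttrCollector class and its seen and lookups lists are hypothetical
names, and the one-element document simply repeats the item shown
above.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class AttrCollector(ContentHandler):
    """Record each element name with a plain-dict copy of its attributes."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.seen = []
        self.lookups = []

    def startElement(self, name, attrs):
        # items() returns (name, value) pairs, just like a dictionary
        self.seen.append((name, dict(attrs.items())))
        # Membership tests and get() with a default also work
        self.lookups.append(("units" in attrs, attrs.get("size", "n/a")))

handler = AttrCollector()
xml.sax.parseString(b'<item num="12" units="bottles"> Ice-cold beer </item>',
                    handler)
print(handler.seen)
print(handler.lookups)
```

Copying the attributes into an ordinary dictionary is a reasonable
habit if the values are needed after startElement() returns, since the
handler then no longer depends on the parser's own object.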

One advantage of SAX is that the handler interface is relatively
simple to describe and use.  In addition, the event-driven interface
allows large XML documents to be scanned quickly without having to
store the entire document in memory.  Since the SAX interface is one
of the most common XML handling techniques, a full discussion and
reference is included in the next chapter.  Please consult that for
further details.
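
As a rough sketch of this scanning style, the following handler keeps
only the text of item elements and discards everything else as it goes
by.  It also joins the pieces passed to characters(), since the output
shown earlier demonstrates that a single run of text may be delivered
in several calls.  The code uses Python 3 syntax; the ItemCollector
class and the sample document are assumptions made for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class ItemCollector(ContentHandler):
    """Collect the text of each <item>, joining the pieces SAX delivers."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.items = []
        self._buf = None          # None means "not inside an <item>"

    def startElement(self, name, attrs):
        if name == "item":
            self._buf = []

    def characters(self, data):
        # Ignore text that is not inside an <item>
        if self._buf is not None:
            self._buf.append(data)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buf).strip())
            self._buf = None

# Made-up sample; &#241; is a character reference for n-with-tilde
sample = (b"<ingredients><item> Salt </item>"
          b"<item> Jalape&#241;o pepper, diced </item></ingredients>")

handler = ItemCollector()
xml.sax.parseString(sample, handler)
print(handler.items)
```

Memory use here depends on the collected items rather than on the size
of the document, which is the point of the event-driven approach.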


The Document Object Model (DOM) is an interface for manipulating XML
documents as a tree.  Each document is represented by a top-level
Document node.  This node contains a single Element node that
corresponds to the root (outermost) element of the document.
Sub-elements, text, and other content are then added beneath it in a
way that mirrors the underlying structure of the document.

The DOM approach is useful in certain contexts because it allows you
to traverse and manipulate the tree as a whole.  For example, you can
add new nodes, remove subtrees, or reorder the document in various
ways.  The DOM interface also tends to be used in interactive
applications such as browsers or XML editors (in fact, much of the DOM
interface was originally developed to better support the integration
between JavaScript and HTML in browsers).
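
The kinds of changes just mentioned can be sketched with the minidom
module from the standard library.  This is a minimal illustration in
Python 3 syntax, not a complete program; the small ingredients
document is made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<ingredients><item>Salt</item><item>Garlic</item>"
    "<item>Beer</item></ingredients>")
root = doc.documentElement
items = root.getElementsByTagName("item")   # Salt, Garlic, Beer

# Add a new node at the end
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("Lime"))
root.appendChild(new_item)

# Remove a subtree (the Beer item and its text)
root.removeChild(items[2])

# Reorder: move Garlic in front of Salt
root.insertBefore(items[1], items[0])

print(root.toxml())
```

Each operation works on whole nodes, so moving or removing an element
carries its children along with it.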

A full description of the DOM interface is available at
http://www.w3.org/DOM.  Since the specification is quite large and is
surprisingly readable on its own, it is not presented in detail
here. However, a few simple examples are presented to illustrate its
operation.

First, in order to work with an XML document using DOM, you have to parse it into
a DOM tree.   Although you could do this yourself using a program such as
the tree construction code presented at the end of the last section, it
is much easier to have one constructed automatically using an existing
Python module.  An easy way to do this is as follows:

    # Make a DOM tree for file supplied on command line
    import sys
    from xml.dom import minidom
    doc = minidom.parse(open(sys.argv[1]))

With the tree in hand, DOM defines a few standard attribute names 
for walking through nodes:

    n.nextSibling
    n.previousSibling
    n.firstChild
    n.lastChild
    n.parentNode

For example, the following function walks the entire DOM tree
and prints out the nodes:

    def print_tree(n,indent=0):
        while n:
            print " "*indent,n
            print_tree(n.firstChild,indent+4)
            n = n.nextSibling
 
    print_tree(doc)

It produces output similar to the following:

 <xml.dom.minidom.Document instance at 1cc674>
     <DOM Element: recipe at 1969980>
         <DOM Text node "\n">
         <DOM Text node "   ">
         <DOM Element: title at 1970220>
             <DOM Text node " ">
             <DOM Text node "\n">
             <DOM Text node "   Famous ...">
             <DOM Text node "\n">
             <DOM Text node "   ">
         <DOM Text node "\n">
         <DOM Text node "   ">
         <DOM Element: description at 2025004>
             <DOM Text node "\n">
             <DOM Text node "   A south...">
             <DOM Text node "\n">
             <DOM Text node "   ">
         <DOM Text node "\n">
         <DOM Text node "   ">
         <DOM Element: ingredients at 2026204>
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2030604>
                 <DOM Text node " Large avo...">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2031564>
                 <DOM Text node " Tomato, c...">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2032524>
                 <DOM Text node " White oni...">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2048412>
                 <DOM Text node " Fresh squ...">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2049372>
                 <DOM Text node " Jalape">
                 <DOM Text node "\xf1">
                 <DOM Text node "o pepper, ...">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2057044>
                 <DOM Text node " Fresh cil...">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2058004>
                 <DOM Text node " Garlic, m...">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2058964>
                 <DOM Text node " Salt ">
             <DOM Text node "\n">
             <DOM Text node "        ">
             <DOM Element: item at 2066564>
                 <DOM Text node " Ice-cold ...">
             <DOM Text node "\n">
             <DOM Text node "   ">
         <DOM Text node "\n">
         <DOM Text node "   ">
         <DOM Element: directions at 2068044>
             <DOM Text node "\n">
             <DOM Text node "   Combine...">
             <DOM Text node "\n">
             <DOM Text node "   Serve a...">
             <DOM Text node "\n">
             <DOM Text node "   ">
         <DOM Text node "\n">


Nodes in a DOM tree are classified into a few different types.  For
example:

  DocumentNode      -  Top-level node representing the entire document

  ElementNode       -  Node representing an XML element such as
                       <title>...</title>

  TextNode          -  Raw text data (characters inside an element)

In addition, the full DOM specification defines a number of other
types.  The type of each node is identified by an integer code that is
available in the n.nodeType attribute.  Its value is one of a set of
constants such as the following:

   xml.dom.Node.ELEMENT_NODE
   xml.dom.Node.DOCUMENT_NODE
   xml.dom.Node.TEXT_NODE
   xml.dom.Node.ENTITY_NODE
   ...
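
As a small sketch of how these codes might be used, the following
function counts the nodes of each type in a tree.  It is written in
Python 3 syntax; the TYPE_NAMES table, the count_types() function, and
the sample document are hypothetical and exist only for this example.

```python
from collections import Counter
from xml.dom import minidom, Node

# Hypothetical lookup table from integer codes back to readable names
TYPE_NAMES = {
    Node.DOCUMENT_NODE: "DOCUMENT_NODE",
    Node.ELEMENT_NODE: "ELEMENT_NODE",
    Node.TEXT_NODE: "TEXT_NODE",
    Node.COMMENT_NODE: "COMMENT_NODE",
}

def count_types(n, counts):
    """Walk the tree starting at n and tally nodes by type name."""
    while n is not None:
        counts[TYPE_NAMES.get(n.nodeType, "other")] += 1
        count_types(n.firstChild, counts)
        n = n.nextSibling
    return counts

doc = minidom.parseString(
    "<recipe><!-- draft --><title>Guacamole</title></recipe>")
counts = count_types(doc, Counter())

for name, total in sorted(counts.items()):
    print(name, total)
```

Comparing n.nodeType against the named constants, rather than against
literal integers, keeps the code readable.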

The following program uses nodeType to pick out Element nodes.  It
then changes some element and attribute names in a recipe and outputs
a new XML file:

import sys
from xml.dom.ext import PrettyPrint
from xml.dom import minidom

# Parse the file named on the command line into a DOM tree
doc = minidom.parse(open(sys.argv[1]))

# Generic tree walker 
def walk_tree(n,func):
    while n:
        func(n)
        walk_tree(n.firstChild,func)
        n = n.nextSibling

# Fix some nodes
def fix_node(n):
    if n.nodeType == n.ELEMENT_NODE:
        if n.tagName == "ingredients":
            n.tagName = "ingredientlist"
        if n.tagName == "item":
            # Change the name of "item" to "ingredient"
            n.tagName = "ingredient"
            # Change attribute name
            attr = n.getAttribute("num")
            n.setAttribute("quantity",attr)
            n.removeAttribute("num")

walk_tree(doc,fix_node)
PrettyPrint(doc)

The output of the program might look like this:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE recipe>
<recipe>
  <title>   Famous Guacamole</title>
  <description>   A southwest favorite!</description>
  <ingredientlist>
    <ingredient quantity='4' units='none'> Large avocados, chopped </ingredient>
    <ingredient quantity='1' units='none'> Tomato, chopped </ingredient>
    <ingredient quantity='1/2' units='C'> White onion, chopped </ingredient>
    <ingredient quantity='2' units='tbl'> Fresh squeezed lemon juice </ingredient>
    <ingredient quantity='1' units='none'> Jalapeno pepper, diced </ingredient>
    <ingredient quantity='1' units='tbl'> Fresh cilantro, minced </ingredient>
    <ingredient quantity='1' units='tbl'> Garlic, minced </ingredient>
    <ingredient quantity='3' units='tsp'> Salt </ingredient>
    <ingredient quantity='12' units='bottles'> Ice-cold beer </ingredient>
  </ingredientlist>