<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css"><!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  --></style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>
<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>
<p></p>
<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>
<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>
<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
  node</a></li>
  <li><a href="#Extracting1">Extracting information for the
  attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
  operations</a></li>
</ul>
<p></p>
<h2><a name="Introducti">Introduction: why a new API</a></h2>
<p>The libxml2 <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document
completely into memory and exposes it as a tree of
nodes all available at the same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the memory available. Libxml2 also provides a
<a href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early
<a href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX is
also not formally defined for C. SAX basically works by registering callbacks
which are called directly by the parser as it progresses through the document
stream. The problem is that this programming model is relatively complex and
not well standardized, cannot provide validation directly, and makes entity,
namespace and base processing relatively hard.</p>
<p>The <a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling, and
adding DTD validation on top of it was relatively simple. This API is really
close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX.
Moreover, integrating extension features based on
the tree seems relatively easy.</p>
<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>
<h2><a name="Walking">Walking a simple tree</a></h2>
<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>
<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* Returns 0 on success, -1 if the file cannot be opened or parsed. */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
            return -1;
        }
    } else {
        printf("Unable to open %s\n", filename);
        return -1;
    }
    return 0;
}</pre>
<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>
<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print "unable to open %s" % (filename)
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print "%s : failed to parse" % (filename)</pre>
<p>The only things worth adding are that the
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>
<h2><a name="Extracting">Extracting information for the current node</a></h2>
<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance.
The list of properties is based on the
<a href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri:</em> the base URI of the node.
See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth:</em> the depth of the node in the tree, starting at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks whether the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
<p>Let's first look at a small example to put this into practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print "%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement())</pre>
<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>
<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>
<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>
<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected.
The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;&lt;c/&gt;&lt;/doc&gt;</pre>
<p>We modify the processNode() function to also report the node Value:</p>
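<p>A minimal sketch of such a processNode() follows, written for Python 3 and
not taken from the libxml2 sources. It only appends <code>reader.Value()</code>
to the fields printed earlier. The <code>format_node()</code> helper and the
<code>StubReader</code> class are hypothetical names introduced here so the
sketch can run without the libxml2 bindings, and the values fed to the stub are
made up for illustration rather than captured from a real parse.</p>

```python
# Sketch only. format_node() and StubReader are hypothetical helpers introduced
# for illustration; they are not part of libxml2. The reader methods used
# (Depth, NodeType, Name, IsEmptyElement, Value) are the property methods
# listed in this tutorial.

def format_node(reader):
    """Build one trace line for the node the reader is currently stopped on."""
    return "%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                               reader.Name(), reader.IsEmptyElement(),
                               reader.Value())


def processNode(reader):
    """Same role as processNode() in the samples above, with Value appended."""
    print(format_node(reader))


class StubReader(object):
    """Stand-in exposing the five methods used above, so the sketch runs
    without the libxml2 bindings. The values it returns are made up."""

    def __init__(self, depth, node_type, name, is_empty, value):
        self._depth = depth
        self._node_type = node_type
        self._name = name
        self._is_empty = is_empty
        self._value = value

    def Depth(self):
        return self._depth

    def NodeType(self):
        return self._node_type

    def Name(self):
        return self._name

    def IsEmptyElement(self):
        return self._is_empty

    def Value(self):
        return self._value


if __name__ == "__main__":
    # An illustrative text node (type 3) at depth 2 holding "some text".
    processNode(StubReader(2, 3, "#text", 0, "some text"))
```

<p>With the real bindings, the reader created by
<code>libxml2.newTextReaderFilename()</code> in streamFile() would be passed to
processNode() in place of the stub.</p>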