Author: David Beazley (beazley@cs.uchicago.edu)
Chapter: A Bird's Eye View of XML
This chapter provides a high-level overview of XML (eXtensible Markup
Language), an increasingly popular document and data encoding standard
that is used on the Internet and in many other areas of computation.
Although a comprehensive discussion of XML is far beyond the scope of
a single chapter, the essential features of XML are described with an
emphasis on the structure of XML documents. Interested readers are
well-advised to consult the references at the end of this chapter for
more detailed coverage of XML and related technologies.
1. Background
-------------
XML is a standard that specifies how to define customized markup
languages for various types of documents. Developed by the World Wide
Web Consortium (W3C) and first published in 1998, XML is most often
described as a solution to various shortcomings in HTML. However, XML
is more accurately described as a simplified subset of SGML
(Standard Generalized Markup Language)--a standard for defining
structured documents that long predates the development of the World
Wide Web.
Given the amount of hype that surrounds XML, it is easy to be
overwhelmed with acronyms and terminology such as XML, XSL, XSLT,
XLinks, XPointers, XHTML, RDF, CDF, DTDs, schemas, CSS, SAX, DOM,
XML-RPC, SOAP, and so forth. To make sense of this mess, keep in mind
that most of these refer to specific XML-related technologies rather
than XML itself. The XML standard itself, although nontrivial, is
relatively simple and self-contained.
Much of the excitement that surrounds XML is due to its use in
representing almost any kind of structured data ranging from documents
to system configuration files. By storing such data in a highly
standardized way, it is easier to transfer data between applications
and to process that data using a common set of tools and libraries
(contrast this to creating a customized or proprietary data format for
each application).
2. Shortcomings of HTML
-----------------------
Much of XML's popularity has been driven by perceived shortcomings of
HTML. For example, consider the following HTML document for a
delicious recipe:
<html>
<body>
<h1>Famous Guacamole</h1>
<p>
A southwest favorite!
<h2>Ingredients</h2>
<ul>
<li>4 Large avocados, chopped
<li>1 Tomato, chopped
<li>1/2 C. White onion, chopped
<li>2 tbl. Fresh squeezed lemon juice
<li>1 Jalapeno pepper, diced
<li>1 tbl. Fresh cilantro, minced
<li>1 tbl. Minced garlic
<li>3 tsp. Salt
<li>12 bottles Ice-cold beer
</ul>
<h2>Directions</h2>
<p>
Combine all ingredients and hand whisk to desired consistency.
Serve and enjoy with ice-cold beers.
</body>
</html>
Although this document generates a perfectly viewable web page, it also
contains information that might be useful in other contexts. For
example, you might want to make an easily searchable database of
recipes. Or perhaps you would like to be able to easily change the
formatting so that you could include the recipe on other types of web
pages (or to produce a printed book). Maybe you would like to make it
easy to export the ingredient list to other applications such as a
shopping list or online ordering system.
Unfortunately, these sorts of tasks are difficult to implement because
HTML documents are confined to a relatively small number of generic
formatting elements such as <h1>, <h2>, <ul>, <li>, and so forth.
Although it is theoretically possible to dissect a document based on
the information enclosed within each type of tag, this tends to be
rather ad-hoc and error-prone. For instance, there is no predefined
structure that defines how the tags are supposed to be arranged.
Furthermore, documents may use the tags in a wildly inconsistent
manner (e.g., another recipe page might use a slightly different set
of HTML tags depending on the author). Clearly this makes it
difficult to write scripts that both manage and extract data from raw
HTML files.
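To make the difficulty concrete, here is a minimal sketch of such a
script, written in Python with the standard html.parser module. The
choice of Python, the shortened copy of the recipe page, and the names
ListItemScraper and scrape_list_items are assumptions made for this
illustration. Notice that the script has to guess where each item ends
(the closing </li> tags are missing) and that it has no way of knowing
whether a given list holds ingredients or something else entirely.

```python
from html.parser import HTMLParser

# Shortened copy of the recipe page; like the original, it omits </li>.
HTML_RECIPE = """<html>
<body>
<h1>Famous Guacamole</h1>
<h2>Ingredients</h2>
<ul>
<li>4 Large avocados, chopped
<li>1 Tomato, chopped
<li>3 tsp. Salt
</ul>
<h2>Directions</h2>
<p>
Combine all ingredients and hand whisk to desired consistency.
</body>
</html>"""


class ListItemScraper(HTMLParser):
    """Collect the text that follows each <li> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.in_item = False

    def handle_starttag(self, tag, attrs):
        # A new <li> is the only reliable sign that the previous item ended.
        if tag == "li":
            self.items.append("")
            self.in_item = True

    def handle_endtag(self, tag):
        # Guess that the end of the list also ends the last item.
        if tag in ("li", "ul", "ol"):
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


def scrape_list_items(html_text):
    """Return the text of every list item found anywhere in the page."""
    scraper = ListItemScraper()
    scraper.feed(html_text)
    scraper.close()
    return [item.strip() for item in scraper.items]


items = scrape_list_items(HTML_RECIPE)
for item in items:
    print(item)
```

The same function returns the entries of any other list on the page
(a navigation menu, for example), so further guesswork based on nearby
headings would be needed to single out the ingredients.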
The main problem with HTML is that it really only serves as a
low-level formatting language for defining how a document is supposed
to appear in a browser. Thus, even though it provides a collection of
document elements such as paragraphs, headings, tables, and bulleted
lists, none of these really captures the underlying semantic
structure of the document. As an analogy, you might compare the
presentation of the above recipe to one formatted in PostScript for
printing. Although the PostScript document clearly contains all of
the recipe information, that information is deeply embedded in a bunch
of formatting instructions related to page placement, fonts, and so
forth. As with HTML, this makes it difficult to extract specific
document information such as the list of ingredients or even the name
of the recipe.
3. XML
-------
XML differs from HTML in that it allows you to create a user-definable
markup language that is customized for your specific application. In
fact, XML is not a markup language like HTML, but a set of rules
that describe how to create new markup languages in a highly
standardized manner. For instance, rather than describing a recipe in
terms of HTML tags, XML allows you to use a much more descriptive
document such as this:
<?xml version="1.0" encoding="utf-8"?>
<recipe>
<title>Famous Guacamole</title>
<description>
A southwest favorite!
</description>
<ingredients>
<item num="4"> Large avocados, chopped </item>
<item num="1"> Tomato, chopped </item>
<item num="1/2" units="C"> White onion, chopped </item>
<item num="2" units="tbl"> Fresh squeezed lemon juice </item>
<item num="1"> Jalapeno pepper, diced </item>
<item num="1" units="tbl"> Fresh cilantro, minced </item>
<item num="1" units="tbl"> Garlic, minced </item>
<item num="3" units="tsp"> Salt </item>
<item num="12" units="bottles"> Ice-cold beer </item>
</ingredients>
<directions>
Combine all ingredients and hand whisk to desired consistency.
Serve and enjoy with ice-cold beers.
</directions>
</recipe>
In this example, user-definable elements such as <description> and
<ingredients> have been used to precisely describe different parts of
the document. Because each section of the recipe (the description, the
ingredient list, the directions) is identified by its own element
name, it is much easier to write software for transforming or
extracting information.
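As a rough illustration of that claim, the following sketch uses
Python's standard xml.etree.ElementTree module to pull the ingredient
list out of a shortened copy of the recipe. The choice of Python, the
abbreviated ingredient list, and the helper name ingredient_list are
assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML recipe shown above.
XML_RECIPE = """<?xml version="1.0"?>
<recipe>
<title>Famous Guacamole</title>
<ingredients>
<item num="4"> Large avocados, chopped </item>
<item num="1/2" units="C"> White onion, chopped </item>
<item num="3" units="tsp"> Salt </item>
</ingredients>
</recipe>
"""


def ingredient_list(xml_text):
    """Return (num, units, name) for each <item> under <ingredients>."""
    root = ET.fromstring(xml_text)          # root is the <recipe> element
    result = []
    for item in root.findall("ingredients/item"):
        # get() returns None when an attribute such as units is absent.
        result.append((item.get("num"), item.get("units"), item.text.strip()))
    return result


for num, units, name in ingredient_list(XML_RECIPE):
    print(num, units or "", name)
```

No guessing is involved here: the element names say which part of the
document is the ingredient list, and the attributes separate the
quantity and units from the name of each ingredient.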
By allowing user-definable elements, XML documents are encoded in a
manner that preserves the underlying semantic structure. For
instance, the <ingredients> element in the XML version precisely
describes that part of the document whereas the <ul> element in HTML
really only describes formatting and is vague about the content that
is actually being formatted.
Because XML allows user-definable elements, it can be applied in
essentially any domain that involves highly structured data. For
example, a book written in XML would include elements such as
chapters, sections, subsections, paragraphs, quotations, tables,
figures, and so forth. Similarly, XML can be used in settings not
normally associated with documents. For example, a program with a
graphical user interface might use XML as a convenient format for
specifying the structure of windows, menu bars, and pull-down menus.
Similarly, a compiler could use XML to export a parse tree to a code
generation tool. A scientific simulation could use XML to store
information about simulation parameters or as a highly structured
output format. Naturally, you can also use XML to generate web pages.
4. The Structure of XML documents
---------------------------------
Most XML documents start with a standard header that looks like this:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
This header defines the XML version and the document encoding. Since
XML documents may contain any Unicode character, the encoding attribute
specifies how those characters are stored as bytes. "utf-8" tends to be
the most common encoding, in part because plain 7-bit ASCII text is
already valid UTF-8. However, general purpose XML processing tools
should be written to anticipate other encodings such as "utf-16",
"utf-16le", "utf-16be", and so forth. If the header is omitted
entirely, the XML version is assumed to be "1.0"; when the header is
present, the version number is required. If the encoding is omitted,
the document must be in UTF-8 or UTF-16, and a parser tells the two
apart by looking for a byte-order mark.
The optional standalone attribute, if supplied, specifies whether or
not the document relies on an external Document Type Definition (DTD)
to be processed. This topic is described in the next section.
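The role of the encoding attribute can be seen in a small experiment,
sketched here in Python with the standard xml.etree.ElementTree module.
The one-element document, the accented character (added so that the two
encodings actually differ), and the helper name encoded_document are
assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Text with one non-ASCII character (n with tilde), added for illustration.
TEXT = "Jalape\u00f1o pepper, diced"


def encoded_document(encoding):
    """Return the same one-element document as bytes in the given encoding."""
    doc = '<?xml version="1.0" encoding="%s"?><item>%s</item>' % (encoding, TEXT)
    return doc.encode(encoding)


utf8_bytes = encoded_document("utf-8")
utf16_bytes = encoded_document("utf-16")    # Python prepends a byte-order mark

# The byte sequences differ, but the parser reads the header (and the
# byte-order mark) and recovers the same characters from both.
parsed = [ET.fromstring(data).text for data in (utf8_bytes, utf16_bytes)]
print(len(utf8_bytes), len(utf16_bytes), parsed)
```

A tool that assumed every document was ASCII or UTF-8 would misread the
second byte string, which is why the header should be honored.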
Following the <?xml ...?> header, many XML documents contain a
collection of XML markup declarations denoted by the following syntax:
<! ... >
Although not shown in our earlier example, these declarations are
used to control the behavior of XML processing itself. Typically,
a markup declaration is used to include a Document Type Definition (DTD) that
defines the set of allowable document elements and attributes.
For example:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE recipe SYSTEM "Recipe.DTD">
The following XML syntax is used to denote a comment:
<!-- Some comment -->
Comments may appear anywhere in a document after the initial XML
header, except inside a tag or inside another comment.
In some cases, additional processing instructions are supplied. These
are always enclosed in <? ... ?> just like the <?xml ... ?> header.
Typically these are used to control the behavior of programs that process
XML documents. For example, an XML document that is going to be processed
using XSLT (a language for transforming XML documents) might include a
processing instruction like this:
<?xml-stylesheet type="text/xml" href="recipe.xsl"?>
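A program that wants to act on such instructions has to ask its parser
for them. The sketch below does this in Python with the standard
xml.dom.minidom module. The choice of minidom and the empty <recipe/>
element used to complete the document are assumptions made for this
example.

```python
from xml.dom import minidom

DOCUMENT = """<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xml" href="recipe.xsl"?>
<recipe/>
"""

doc = minidom.parseString(DOCUMENT)

# Processing instructions that precede the top-level element appear as
# children of the document node. Each has a target (the name after "<?")
# and a data string (everything else up to "?>").
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

Only the xml-stylesheet instruction is collected. The <?xml ... ?>
header shares the same syntax, but the parser consumes it itself and
does not report it as a processing instruction.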
Following the XML header, markup declarations, and processing
instructions, the rest of the document must be contained within a
single top-level element. For instance, in our recipe
example, the entire document is enclosed by a pair of tags like this:
<recipe>
...
</recipe>
This structure is analogous to HTML where everything in the document has to be
enclosed by a single pair of <html> tags:
<html>
...
</html>
It is important to note that only one top-level element is allowed. Therefore,
an XML document does not allow the following:
<?xml version="1.0"?>
<recipe>
...
</recipe>
<recipe> <!-- Illegal -->
...
</recipe>
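A parser rejects such a document outright. The usual remedy is to wrap
the repeated elements in one enclosing element, as in this Python
sketch based on the standard xml.etree.ElementTree module. The
<cookbook> wrapper, the second recipe title, and the helper name
count_recipes are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Two top-level elements: not a legal XML document.
TWO_ROOTS = (
    "<recipe><title>Famous Guacamole</title></recipe>\n"
    "<recipe><title>Salsa</title></recipe>"
)

# The same content enclosed in a single (hypothetical) top-level element.
WRAPPED = "<cookbook>\n" + TWO_ROOTS + "\n</cookbook>"


def count_recipes(xml_text):
    """Return the number of <recipe> children, or None if parsing fails."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    return len(root.findall("recipe"))


print(count_recipes(TWO_ROOTS))
print(count_recipes(WRAPPED))
```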
Within the document, elements are defined by simply surrounding
sections of text with a pair of opening and closing tags. For example,
<title>Famous Guacamole</title>
defines a "title" element. XML always requires elements to be
surrounded by both an opening and a closing tag. This differs from
HTML which is somewhat ad-hoc in its handling of tags (for example, in
HTML it is fairly common to omit certain closing tags for elements
such as </p> or </li>). This is not allowed in XML--every opening tag
must have a matching closing tag.
Elements may be nested, but the opening and closing tags of inner elements
must be wholly contained within any enclosing element.
For example, this is legal:
<foo><bar>Blah</bar></foo>
whereas this is not:
<foo><bar>Blah</foo></bar> <!-- Bad XML -->
Again, this differs from HTML which is rather permissive in how tags
may be nested. For instance, most browsers have no trouble dealing
with text such as <b><i>Bold italics</b></i> even though this is
illegal in XML.
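These rules are easy to check mechanically, because an XML parser
refuses any input that breaks them. The following Python sketch, which
assumes the standard xml.etree.ElementTree module and a hypothetical
helper named well_formed_error, runs the fragments from this section
through the parser.

```python
import xml.etree.ElementTree as ET


def well_formed_error(xml_text):
    """Return None if xml_text parses, otherwise the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


samples = [
    "<foo><bar>Blah</bar></foo>",   # properly nested
    "<foo><bar>Blah</foo></bar>",   # closing tags in the wrong order
    "<b><i>Bold italics</b></i>",   # tolerated by browsers, not by XML
    "<ul><li>Salt</ul>",            # missing </li>
]

results = [well_formed_error(text) is None for text in samples]
for text, ok in zip(samples, results):
    print("ok " if ok else "bad", text)
```

Only the first fragment is accepted. The other three are rejected
before any application code gets to see the data.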
The nesting restriction ensures that all XML documents define a tree of elements.
For example, the recipe example defines the tree shown in Figure 1.
recipe
|
|----- title
|
|----- description
|
|----- ingredients
| |