📄 xmlwriter.java

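📁 Web-Harvest is an open-source Java tool for Web data extraction. It collects specified Web pages and extracts useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery, and regular expressions to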
package org.webharvest.utils;

//XMLWriter.java - serialize an XML document.
//Written by David Megginson, david@megginson.com
//NO WARRANTY!  This class is in the public domain.
//Modified by John Cowan and Leigh Klotz for the TagSoup project.  Still in the public domain.
//New features:
//	it is a LexicalHandler
//	it prints a comment if the LexicalHandler#comment method is called
//	it supports certain XSLT output properties using get/setOutputProperty

//$Id: XMLWriter.java,v 1.1 2004/01/28 05:35:43 joe Exp $

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Properties;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.NamespaceSupport;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.ext.LexicalHandler;

/**
 * Filter to write an XML document from a SAX event stream.
 * 
 * <p>
 * This class can be used by itself or as part of a SAX event stream: it takes
 * as input a series of SAX2 ContentHandler events and uses the information in
 * those events to write an XML document. Since this class is a filter, it can
 * also pass the events on down a filter chain for further processing (you can
 * use the XMLWriter to take a snapshot of the current state at any point in a
 * filter chain), and it can be used directly as a ContentHandler for a SAX2
 * XMLReader.
 * </p>
 * 
 * <p>
 * The client creates a document by invoking the methods for standard SAX2
 * events, always beginning with the {@link #startDocument startDocument} method
 * and ending with the {@link #endDocument endDocument} method. There are
 * convenience methods provided so that clients do not have to create empty
 * attribute lists or provide empty strings as parameters; for example, the
 * method invocation
 * </p>
 * 
 * <pre>
 * w.startElement(&quot;foo&quot;);
 * </pre>
 * 
 * <p>
 * is equivalent to the regular SAX2 ContentHandler method
 * </p>
 * 
 * <pre>
 * w.startElement(&quot;&quot;, &quot;foo&quot;, &quot;&quot;, new AttributesImpl());
 * </pre>
 * 
 * <p>
 * except that it is more efficient because it does not allocate a new empty
 * attribute list each time. The following code will send a simple XML document
 * to standard output:
 * </p>
 * 
 * <pre>
 * XMLWriter w = new XMLWriter();
 * 
 * w.startDocument();
 * w.startElement(&quot;greeting&quot;);
 * w.characters(&quot;Hello, world!&quot;);
 * w.endElement(&quot;greeting&quot;);
 * w.endDocument();
 * </pre>
 * 
 * <p>
 * The resulting document will look like this:
 * </p>
 * 
 * <pre>
 *  &lt;?xml version=&quot;1.0&quot; standalone=&quot;yes&quot;?&gt;
 * 
 *  &lt;greeting&gt;Hello, world!&lt;/greeting&gt;
 * </pre>
 * 
 * <p>
 * In fact, there is an even simpler convenience method, <var>dataElement</var>,
 * designed for writing elements that contain only character data, so the code
 * to generate the document could be shortened to
 * </p>
 * 
 * <pre>
 * XMLWriter w = new XMLWriter();
 * 
 * w.startDocument();
 * w.dataElement(&quot;greeting&quot;, &quot;Hello, world!&quot;);
 * w.endDocument();
 * </pre>
 * 
 * <h2>Whitespace</h2>
 * 
 * <p>
 * According to the XML Recommendation, <em>all</em> whitespace in an XML
 * document is potentially significant to an application, so this class never
 * adds newlines or indentation. If you insert three elements in a row, as in
 * </p>
 * 
 * <pre>
 * w.dataElement(&quot;item&quot;, &quot;1&quot;);
 * w.dataElement(&quot;item&quot;, &quot;2&quot;);
 * w.dataElement(&quot;item&quot;, &quot;3&quot;);
 * </pre>
 * 
 * <p>
 * you will end up with
 * </p>
 * 
 * <pre>
 *  &lt;item&gt;1&lt;/item&gt;&lt;item&gt;2&lt;/item&gt;&lt;item&gt;3&lt;/item&gt;
 * </pre>
 * 
 * <p>
 * You need to invoke one of the <var>characters</var> methods explicitly to
 * add newlines or indentation. Alternatively, you can use
 * {@link com.megginson.sax.DataWriter DataWriter}, which is derived from this
 * class -- it is optimized for writing purely data-oriented (or field-oriented)
 * XML, and does automatic linebreaks and indentation (but does not support
 * mixed content properly).
 * </p>
 * 
 * 
 * <h2>Namespace Support</h2>
 * 
 * <p>
 * The writer contains extensive support for XML Namespaces, so that a client
 * application does not have to keep track of prefixes and supply <var>xmlns</var>
 * attributes. By default, the XML writer will generate Namespace prefixes in
 * the form _NS1, _NS2, etc., and declare them wherever they are needed, as in
 * the following example:
 * </p>
 * 
 * <pre>
 * w.startDocument();
 * w.emptyElement(&quot;http://www.foo.com/ns/&quot;, &quot;foo&quot;);
 * w.endDocument();
 * </pre>
 * 
 * <p>
 * The resulting document will look like this:
 * </p>
 * 
 * <pre>
 *  &lt;?xml version=&quot;1.0&quot; standalone=&quot;yes&quot;?&gt;
 * 
 *  &lt;_NS1:foo xmlns:_NS1=&quot;http://www.foo.com/ns/&quot;/&gt;
 * </pre>
 * 
 * <p>
 * In many cases, document authors will prefer to choose their own prefixes
 * rather than using the (ugly) default names. The XML writer allows two methods
 * for selecting prefixes:
 * </p>
 * 
 * <ol>
 * <li>the qualified name</li>
 * <li>the {@link #setPrefix setPrefix} method.</li>
 * </ol>
 * 
 * <p>
 * Whenever the XML writer finds a new Namespace URI, it checks to see if a
 * qualified (prefixed) name is also available; if so it attempts to use the
 * name's prefix (as long as the prefix is not already in use for another
 * Namespace URI).
 * </p>
 * 
 * <p>
 * Before writing a document, the client can also pre-map a prefix to a
 * Namespace URI with the setPrefix method:
 * </p>
 * 
 * <pre>
 * w.setPrefix(&quot;http://www.foo.com/ns/&quot;, &quot;foo&quot;);
 * w.startDocument();
 * w.emptyElement(&quot;http://www.foo.com/ns/&quot;, &quot;foo&quot;);
 * w.endDocument();
 * </pre>
 * 
 * <p>
 * The resulting document will look like this:
 * </p>
 * 
 * <pre>
 *  &lt;?xml version=&quot;1.0&quot; standalone=&quot;yes&quot;?&gt;
 * 
 *  &lt;foo:foo xmlns:foo=&quot;http://www.foo.com/ns/&quot;/&gt;
 * </pre>
 * 
 * <p>
 * The default Namespace simply uses an empty string as the prefix:
 * </p>
 * 
 * <pre>
 * w.setPrefix(&quot;http://www.foo.com/ns/&quot;, &quot;&quot;);
 * w.startDocument();
 * w.emptyElement(&quot;http://www.foo.com/ns/&quot;, &quot;foo&quot;);
 * w.endDocument();
 * </pre>
 * 
 * <p>
 * The resulting document will look like this:
 * </p>
 * 
 * <pre>
 *  &lt;?xml version=&quot;1.0&quot; standalone=&quot;yes&quot;?&gt;
 * 
 *  &lt;foo xmlns=&quot;http://www.foo.com/ns/&quot;/&gt;
 * </pre>
 * 
 * <p>
 * By default, the XML writer will not declare a Namespace until it is actually
 * used. Sometimes, this approach will create a large number of Namespace
 * declarations, as in the following example:
 * </p>
 * 
 * <pre>
 *  &lt;?xml version=&quot;1.0&quot; standalone=&quot;yes&quot;?&gt;
 * 
 *  &lt;rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot;&gt;
 *   &lt;rdf:Description about=&quot;http://www.foo.com/ids/books/12345&quot;&gt;
 *    &lt;dc:title xmlns:dc=&quot;http://www.purl.org/dc/&quot;&gt;A Dark Night&lt;/dc:title&gt;
 *    &lt;dc:creator xmlns:dc=&quot;http://www.purl.org/dc/&quot;&gt;Jane Smith&lt;/dc:creator&gt;
 *    &lt;dc:date xmlns:dc=&quot;http://www.purl.org/dc/&quot;&gt;2000-09-09&lt;/dc:date&gt;
 *   &lt;/rdf:Description&gt;
 *  &lt;/rdf:RDF&gt;
 * </pre>
 * 
 * <p>
 * The "rdf" prefix is declared only once, because the RDF Namespace is used by
 * the root element and can be inherited by all of its descendants; the "dc"
 * prefix, on the other hand, is declared three times, because no higher element
 * uses the Namespace. To solve this problem, you can instruct the XML writer to
 * predeclare Namespaces on the root element even if they are not used there:
 * </p>
 * 
 * <pre>
 * w.forceNSDecl(&quot;http://www.purl.org/dc/&quot;);
 * </pre>
 * 
 * <p>
 * Now, the "dc" prefix will be declared on the root element even though it's
 * not needed there, and can be inherited by its descendants:
 * </p>
 * 
 * <pre>
 *  &lt;?xml version=&quot;1.0&quot; standalone=&quot;yes&quot;?&gt;
 * 
 *  &lt;rdf:RDF xmlns:rdf=&quot;http://www.w3.org/1999/02/22-rdf-syntax-ns#&quot;
 *              xmlns:dc=&quot;http://www.purl.org/dc/&quot;&gt;
 *   &lt;rdf:Description about=&quot;http://www.foo.com/ids/books/12345&quot;&gt;
 *    &lt;dc:title&gt;A Dark Night&lt;/dc:title&gt;
 *    &lt;dc:creator&gt;Jane Smith&lt;/dc:creator&gt;
 *    &lt;dc:date&gt;2000-09-09&lt;/dc:date&gt;
 *   &lt;/rdf:Description&gt;
 *  &lt;/rdf:RDF&gt;
 * </pre>
 * 
 * <p>
 * This approach is also useful for declaring Namespace prefixes that will be used by
 * qualified names appearing in attribute values or character data.
 * </p>
 * 
 * @author David Megginson, david@megginson.com
 * @version 0.2
 * @see org.xml.sax.XMLFilter
 * @see org.xml.sax.ContentHandler
 */
public class XMLWriter extends XMLFilterImpl implements LexicalHandler {

	// //////////////////////////////////////////////////////////////////
	// Constructors.
	// //////////////////////////////////////////////////////////////////

	/**
	 * Create a new XML writer.
	 * 
	 * <p>
	 * Write to standard output.
	 * </p>
	 */
	public XMLWriter() {
		init(null);
	}

	/**
	 * Create a new XML writer.
	 * 
	 * <p>
	 * Write to the writer provided.
	 * </p>
	 * 
	 * @param writer
	 *            The output destination, or null to use standard output.
	 */
	public XMLWriter(Writer writer) {
		init(writer);
	}

	/**
	 * Create a new XML writer.
	 * 
	 * <p>
	 * Use the specified XML reader as the parent.
	 * </p>
	 * 
	 * @param xmlreader
	 *            The parent in the filter chain, or null for no parent.
	 */
	public XMLWriter(XMLReader xmlreader) {
		super(xmlreader);
		init(null);
	}

	/**
	 * Create a new XML writer.
	 * 
	 * <p>
	 * Use the specified XML reader as the parent, and write to the specified
	 * writer.
	 * </p>
	 * 
	 * @param xmlreader
	 *            The parent in the filter chain, or null for no parent.
	 * @param writer
	 *            The output destination, or null to use standard output.
	 */
	public XMLWriter(XMLReader xmlreader, Writer writer) {
		super(xmlreader);
		init(writer);
	}

	/**
	 * Internal initialization method.
	 * 
	 * <p>
	 * All of the public constructors invoke this method.
	 * </p>
	 * 
	 * @param writer
	 *            The output destination, or null to use standard output.
	 */
	private void init(Writer writer) {
		setOutput(writer);
		nsSupport = new NamespaceSupport();
		prefixTable = new Hashtable();
		forcedDeclTable = new Hashtable();
		doneDeclTable = new Hashtable();
		outputProperties = new Properties();
	}

	// //////////////////////////////////////////////////////////////////
	// Public methods.
	// //////////////////////////////////////////////////////////////////

	/**
	 * Reset the writer.
	 * 
	 * <p>
	 * This method is especially useful if the writer throws an exception before
	 * it is finished, and you want to reuse the writer for a new document. It
	 * is usually a good idea to invoke {@link #flush flush} before resetting
	 * the writer, to make sure that no output is lost.
	 * </p>
	 * 
	 * <p>
	 * This method is invoked automatically by the
	 * {@link #startDocument startDocument} method before writing a new
	 * document.
	 * </p>
	 * 
	 * <p>
	 * <strong>Note:</strong> this method will <em>not</em> clear the prefix
	 * or URI information in the writer or the selected output writer.
	 * </p>
	 * 
	 * @see #flush
	 */
	public void reset() {
		elementLevel = 0;
		prefixCounter = 0;
		nsSupport.reset();
	}

	/**
	 * Flush the output.
	 * 
	 * <p>
	 * This method flushes the output stream. It is especially useful when you
	 * need to make certain that the entire document has been written to output
	 * but do not want to close the output stream.
	 * </p>
	 * 
	 * <p>
	 * This method is invoked automatically by the
	 * {@link #endDocument endDocument} method after writing a document.
	 * </p>
	 * 
	 * @see #reset
	 */
	public void flush() throws IOException {
		output.flush();
	}

	/**
	 * Set a new output destination for the document.
	 * 
	 * @param writer
	 *            The output destination, or null to use standard output.
	 * @see #flush
	 */
	public void setOutput(Writer writer) {
		if (writer == null) {
			output = new OutputStreamWriter(System.out);
		} else {
			output = writer;
		}
	}

	/**
	 * Specify a preferred prefix for a Namespace URI.
	 * 
	 * <p>
	 * Note that this method does not actually force the Namespace to be
	 * declared; to do that, use the {@link #forceNSDecl(java.lang.String)
	 * forceNSDecl} method as well.
	 * </p>
	 * 
	 * @param uri
	 *            The Namespace URI.
	 * @param prefix
	 *            The preferred prefix, or "" to select the default Namespace.
	 * @see #getPrefix
	 * @see #forceNSDecl(java.lang.String)
	 * @see #forceNSDecl(java.lang.String,java.lang.String)
	 */
	public void setPrefix(String uri, String prefix) {
		prefixTable.put(uri, prefix);
	}