// Jericho HTML Parser - Java based library for analysing and manipulating HTML
// Version 3.0
// Copyright (C) 2007 Martin Jericho
// http://jerichohtml.sourceforge.net/
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of either one of the following licences:
//
// 1. The Eclipse Public License (EPL) version 1.0,
// included in this distribution in the file licence-epl-1.0.html
// or available at http://www.eclipse.org/legal/epl-v10.html
//
// 2. The GNU Lesser General Public License (LGPL) version 2.1 or later,
// included in this distribution in the file licence-lgpl-2.1.txt
// or available at http://www.gnu.org/licenses/lgpl.txt
//
// This library is distributed on an "AS IS" basis,
// WITHOUT WARRANTY OF ANY KIND, either express or implied.
// See the individual licence texts for more details.

package net.htmlparser.jericho;

import java.util.*;
import java.io.*;
import java.net.*;

/**
 * Formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent.
 * <p>
 * Any indentation present in the original source text is removed.
 * <p>
 * Use one of the following methods to obtain the output:
 * <ul>
 *  <li>{@link #writeTo(Writer)}</li>
 *  <li>{@link #appendTo(Appendable)}</li>
 *  <li>{@link #toString()}</li>
 *  <li>{@link CharStreamSourceUtil#getReader(CharStreamSource) CharStreamSourceUtil.getReader(this)}</li>
 * </ul>
 * <p>
 * The output text is functionally equivalent to the original source and should be rendered identically unless specified below.
 * <p>
 * The following points describe the process in general terms.
 * Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.
 * <p>
 * <ul>
 *  <li>Every element that is not an {@linkplain HTMLElements#getInlineLevelElementNames() inline-level element} appears on a new line
 *   with an indent corresponding to its {@linkplain Element#getDepth() depth} in the <a href="Source.html#DocumentElementHierarchy">document element hierarchy</a>.
 *  <li>The indent is formed by writing <i>n</i> repetitions of the string specified in the {@link #setIndentString(String) IndentString} property,
 *   where <i>n</i> is the depth of the indentation.
 *  <li>The {@linkplain Element#getContent() content} of an indented element starts on a new line and is indented at a depth one greater than that of the element,
 *   with the end tag appearing on a new line at the same depth as the start tag.
 *   If the content contains only text and {@linkplain HTMLElements#getInlineLevelElementNames() inline-level elements},
 *   it may continue on the same line as the start tag.  Additionally, if the output content contains no new lines, the end tag may also continue on the same line.
 *  <li>The content of preformatted elements such as {@link HTMLElementName#PRE PRE} and {@link HTMLElementName#TEXTAREA TEXTAREA} is not indented,
 *   nor is the white space modified in any way.
 *  <li>Only {@linkplain StartTagType#NORMAL normal} and {@linkplain StartTagType#DOCTYPE_DECLARATION document type declaration} elements are indented.
 *   All others are treated as {@linkplain HTMLElements#getInlineLevelElementNames() inline-level elements}.
 *  <li>White space and indentation inside HTML {@linkplain StartTagType#COMMENT comments}, {@linkplain StartTagType#CDATA_SECTION CDATA sections}, or any
 *   {@linkplain TagType#isServerTag() server tag} are preserved,
 *   but with the indentation of new lines starting at a depth one greater than that of the surrounding text.
 *  <li>White space and indentation inside {@link HTMLElementName#SCRIPT SCRIPT} elements are preserved,
 *   but with the indentation of new lines starting at a depth one greater than that of the <code>SCRIPT</code> element.
 *  <li>If the {@link #setTidyTags(boolean) TidyTags} property is set to <code>true</code>,
 *   every tag in the document is replaced with the output from its {@link Tag#tidy()} method.
 *   If this property is set to <code>false</code>, the tag from the original text is used, including all white space,
 *   but with any new lines indented at a depth one greater than that of the element.
 *  <li>If the {@link #setCollapseWhiteSpace(boolean) CollapseWhiteSpace} property
 *   is set to <code>true</code>, every string of one or more {@linkplain Segment#isWhiteSpace(char) white space} characters
 *   located outside of a tag is replaced with a single space in the output.
 *   White space located adjacent to a non-inline-level element tag (except {@linkplain TagType#isServerTag() server tags}) may be removed.
 *  <li>If the {@link #setIndentAllElements(boolean) IndentAllElements} property
 *   is set to <code>true</code>, every element appears indented on a new line, including {@linkplain HTMLElements#getInlineLevelElementNames() inline-level elements}.
 *   This generates output that is a good representation of the actual <a href="Source.html#DocumentElementHierarchy">document element hierarchy</a>,
 *   but is very likely to introduce white space that compromises the functional equivalency of the document.
 *  <li>The {@link #setNewLine(String) NewLine} property specifies the character sequence
 *   to use for each <a target="_blank" href="http://en.wikipedia.org/wiki/Newline">newline</a> in the output document.
 *  <li>If the source document contains {@linkplain TagType#isServerTag() server tags}, the functional equivalency of the output document may be compromised.
 * </ul>
 * <p>
 * Formatting an entire {@link Source} object performs a {@linkplain Source#fullSequentialParse() full sequential parse} automatically.
 */
public final class SourceFormatter implements CharStreamSource {
	private final Segment segment;
	private String indentString="\t";
	private boolean tidyTags=false;
	private boolean collapseWhiteSpace=false;
	private boolean removeLineBreaks=false;
	private boolean indentAllElements=false;
	private String newLine=null;

	/**
	 * Constructs a new <code>SourceFormatter</code> based on the specified {@link Segment}.
	 * @param segment  the segment containing the HTML to be formatted.
	 * @see Source#getSourceFormatter()
	 */
	public SourceFormatter(final Segment segment) {
		this.segment=segment;
	}

	// Documentation inherited from CharStreamSource
	public void writeTo(final Writer writer) throws IOException {
		appendTo(writer);
		writer.flush();
	}

	// Documentation inherited from CharStreamSource
	public void appendTo(final Appendable appendable) throws IOException {
		new Processor(segment,getIndentString(),getTidyTags(),getCollapseWhiteSpace(),getRemoveLineBreaks(),getIndentAllElements(),getIndentAllElements(),getNewLine()).appendTo(appendable);
	}

	// Documentation inherited from CharStreamSource
	public long getEstimatedMaximumOutputLength() {
		return segment.length()*2L; // long arithmetic so the estimate cannot overflow int for very large segments
	}

	// Documentation inherited from CharStreamSource
	public String toString() {
		return CharStreamSourceUtil.toString(this);
	}

	/**
	 * Sets the string to be used for indentation.
	 * <p>
	 * The default value is a string containing a single tab character (U+0009).
	 * <p>
	 * The most commonly used indent strings are <code>"\t"</code> (single tab), <code>"&nbsp;"</code> (single space), <code>"&nbsp;&nbsp;"</code> (2 spaces), and <code>"&nbsp;&nbsp;&nbsp;&nbsp;"</code> (4 spaces).
	 * 
	 * @param indentString  the string to be used for indentation, must not be <code>null</code>.
	 * @return this <code>SourceFormatter</code> instance, allowing multiple property setting methods to be chained in a single statement. 
	 * @see #getIndentString()
	 */
	public SourceFormatter setIndentString(final String indentString) {
		if (indentString==null) throw new IllegalArgumentException("indentString property must not be null");
		this.indentString=indentString;
		return this;
	}

	/**
	 * Returns the string to be used for indentation.
	 * <p>
	 * See the {@link #setIndentString(String)} method for a full description of this property.
	 *
	 * @return the string to be used for indentation.
	 */
	public String getIndentString() {
		return indentString;
	}

	/**
	 * Sets whether the original text of each tag is to be replaced with the output from its {@link Tag#tidy()} method.
	 * <p>
	 * The default value is <code>false</code>.
	 * <p>
	 * If this property is set to <code>false</code>, the tag from the original text is used, including all white space,
	 * but with any new lines indented at a depth one greater than that of the element.
	 *
	 * @param tidyTags  specifies whether the original text of each tag is to be replaced with the output from its {@link Tag#tidy()} method.
	 * @return this <code>SourceFormatter</code> instance, allowing multiple property setting methods to be chained in a single statement. 
	 * @see #getTidyTags()
	 */
	public SourceFormatter setTidyTags(final boolean tidyTags) {
		this.tidyTags=tidyTags;
		return this;
	}

	/**
	 * Indicates whether the original text of each tag is to be replaced with the output from its {@link Tag#tidy()} method.
	 * <p>
	 * See the {@link #setTidyTags(boolean)} method for a full description of this property.
	 * 
	 * @return <code>true</code> if the original text of each tag is to be replaced with the output from its {@link Tag#tidy()} method, otherwise <code>false</code>.
	 */
	public boolean getTidyTags() {
		return tidyTags;
	}

	/**
	 * Sets whether {@linkplain Segment#isWhiteSpace(char) white space} in the text between the tags is to be collapsed.
	 * <p>
	 * The default value is <code>false</code>.
	 * <p>
	 * If this property is set to <code>true</code>, every string of one or more {@linkplain Segment#isWhiteSpace(char) white space} characters
	 * located outside of a tag is replaced with a single space in the output.
	 * White space located adjacent to a non-inline-level element tag (except {@linkplain TagType#isServerTag() server tags}) may be removed.
	 *
	 * @param collapseWhiteSpace  specifies whether {@linkplain Segment#isWhiteSpace(char) white space} in the text between the tags is to be collapsed.
	 * @return this <code>SourceFormatter</code> instance, allowing multiple property setting methods to be chained in a single statement. 
	 * @see #getCollapseWhiteSpace()
	 */
	public SourceFormatter setCollapseWhiteSpace(final boolean collapseWhiteSpace) {
		this.collapseWhiteSpace=collapseWhiteSpace;
		return this;
	}
	
	/**
	 * Indicates whether {@linkplain Segment#isWhiteSpace(char) white space} in the text between the tags is to be collapsed.
	 * <p>
	 * See the {@link #setCollapseWhiteSpace(boolean collapseWhiteSpace)} method for a full description of this property.
	 * 
	 * @return <code>true</code> if {@linkplain Segment#isWhiteSpace(char) white space} in the text between the tags is to be collapsed, otherwise <code>false</code>.
	 */
	public boolean getCollapseWhiteSpace() {
		return collapseWhiteSpace;
	}

	/**
	 * Sets whether all non-essential line breaks are removed.
	 * <p>
	 * The default value is <code>false</code>.
	 * <p>
	 * If this property is set to <code>true</code>, only essential line breaks are retained in the output.
	 * <p>
	 * Setting this property automatically engages the {@link #setCollapseWhiteSpace(boolean) CollapseWhiteSpace} option, regardless of its property setting.
	 * <p>
	 * It is recommended to set the {@link #setTidyTags(boolean) TidyTags} property when this option is used so that non-essential line breaks are also removed from tags.
	 *
	 * @param removeLineBreaks  specifies whether all non-essential line breaks are removed.
	 * @return this <code>SourceFormatter</code> instance, allowing multiple property setting methods to be chained in a single statement. 
	 * @see #getRemoveLineBreaks()
	 */
	SourceFormatter setRemoveLineBreaks(final boolean removeLineBreaks) {
		this.removeLineBreaks=removeLineBreaks;
		return this;
	}
	
	/**
	 * Indicates whether all non-essential line breaks are removed.
	 * <p>
	 * See the {@link #setRemoveLineBreaks(boolean removeLineBreaks)} method for a full description of this property.
	 * 
	 * @return <code>true</code> if all non-essential line breaks are removed, otherwise <code>false</code>.
	 */
	boolean getRemoveLineBreaks() {
		return removeLineBreaks;
	}

	/**
	 * Sets whether all elements are to be indented, including {@linkplain HTMLElements#getInlineLevelElementNames() inline-level elements} and those with preformatted contents.
	 * <p>
	 * The default value is <code>false</code>.
	 * <p>
	 * If this property is set to <code>true</code>, every element appears indented on a new line, including
	 * {@linkplain HTMLElements#getInlineLevelElementNames() inline-level elements}.
	 * <p>
	 * This generates output that is a good representation of the actual <a href="Source.html#DocumentElementHierarchy">document element hierarchy</a>,
	 * but is very likely to introduce white space that compromises the functional equivalency of the document.
	 *

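The listing above is cut off at this point. As a rough illustration of two ideas it documents, the indent rule (<i>n</i> repetitions of the indent string, where <i>n</i> is the depth) and the chained property setters that return <code>this</code>, here is a minimal self-contained sketch. The class <code>IndentSketch</code> and its <code>indent</code> and <code>line</code> methods are hypothetical names invented for this example; they are not part of the Jericho library and do not reproduce its internal <code>Processor</code>.

```java
// Hypothetical illustration only: not part of the Jericho HTML Parser API.
final class IndentSketch {
	// Same default as documented for SourceFormatter: a single tab character.
	private String indentString="\t";

	// Fluent setter that rejects null and returns this instance,
	// mirroring the pattern of SourceFormatter.setIndentString(String).
	IndentSketch setIndentString(final String indentString) {
		if (indentString==null) throw new IllegalArgumentException("indentString property must not be null");
		this.indentString=indentString;
		return this;
	}

	// Builds the indent for the given depth: depth repetitions of indentString.
	String indent(final int depth) {
		final StringBuilder sb=new StringBuilder();
		for (int i=0; i<depth; i++) sb.append(indentString);
		return sb.toString();
	}

	// Returns the text prefixed with the indent for the given depth.
	String line(final int depth, final String text) {
		return indent(depth)+text;
	}
}
```

With the real class, the setters shown in the listing would presumably be chained the same way, for example <code>new SourceFormatter(segment).setIndentString("  ").setTidyTags(true).toString()</code>, where <code>segment</code> is any <code>Segment</code> the caller already holds.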