📄 tag.java

📁 HTML解析器是一个Java库
💻 JAVA
📖 第 1 页 / 共 4 页
字号:
12 3 4 下一页
// Jericho HTML Parser - Java based library for analysing and manipulating HTML
// Version 3.0
// Copyright (C) 2007 Martin Jericho
// http://jerichohtml.sourceforge.net/
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of either one of the following licences:
//
// 1. The Eclipse Public License (EPL) version 1.0,
// included in this distribution in the file licence-epl-1.0.html
// or available at http://www.eclipse.org/legal/epl-v10.html
//
// 2. The GNU Lesser General Public License (LGPL) version 2.1 or later,
// included in this distribution in the file licence-lgpl-2.1.txt
// or available at http://www.gnu.org/licenses/lgpl.txt
//
// This library is distributed on an "AS IS" basis,
// WITHOUT WARRANTY OF ANY KIND, either express or implied.
// See the individual licence texts for more details.

package net.htmlparser.jericho;

import java.util.*;

/**
 * Represents either a {@link StartTag} or {@link EndTag} in a specific {@linkplain Source source} document.
 * <p>
 * Take the following HTML segment as an example:
 * <p>
 * <code>&lt;p&gt;This is a sample paragraph.&lt;/p&gt;</code>
 * <p>
 * The "<code>&lt;p&gt;</code>" is represented by a {@link StartTag} object, and the "<code>&lt;/p&gt;</code>" is represented by an {@link EndTag} object,
 * both of which are subclasses of the <code>Tag</code> class.
 * The whole segment, including the start tag, its corresponding end tag and all of the content in between, is represented by an {@link Element} object.
 *
 * <h3><a name="ParsingProcess">Tag Parsing Process</a></h3>
 * The following process describes how each tag is identified by the parser:
 * <ol class="Separated">
 *  <li>
 *   Every '<code>&lt;</code>' character found in the source document is considered to be the start of a tag.
 *   The characters following it are compared with the {@linkplain TagType#getStartDelimiter() start delimiters}
 *   of all the {@linkplain TagType#register() registered} {@linkplain TagType tag types}, and a list of matching tag types
 *   is determined.
 *  <li>
 *   A more detailed analysis of the source is performed according to the features of each matching tag type from the first step,
 *   in order of <a href="TagType.html#Precedence">precedence</a>, until a valid tag is able to be constructed.
 *   <p>
 *   The analysis performed in relation to each candidate tag type is a two-stage process:
 *   <ol>
 *    <li>
 *     The position of the tag is checked to determine whether it is {@linkplain TagType#isValidPosition(Source,int,int[]) valid}.
 *     In theory, a {@linkplain TagType#isServerTag() server tag} is valid in any position, but a non-server tag is not valid inside any other tag,
 *     nor inside elements with CDATA content such as {@link HTMLElementName#SCRIPT SCRIPT} and {@link HTMLElementName#STYLE STYLE} elements.
 *     Theory dictates therefore that {@linkplain StartTagType#COMMENT comments} and explicit {@linkplain StartTagType#CDATA_SECTION CDATA sections}
 *     inside script elements should not be recognised as tags.
 *     The behaviour of the parser however does not always strictly adhere to the theory, to maintain compatability with major browsers
 *     and also for efficiency reasons.
 *     <p>
 *     The {@link TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData)} method is responsible for this check
 *     and has a common default implementation for all tag types
 *     (although <a href="TagType.html#custom">custom</a> tag types can override it if necessary).
 *     Its behaviour differs depending on whether or not a {@linkplain Source#fullSequentialParse() full sequential parse} is peformed.
 *     See the documentation of the {@link TagType#isValidPosition(Source,int,int[]) isValidPosition} method for full details.
 *    <li>
 *     A final analysis is performed by the {@link TagType#constructTagAt(Source, int pos)} method of the candidate tag type.
 *     This method returns a valid {@link Tag} object if all conditions of the candidate tag type are met, otherwise it returns
 *     <code>null</code> and the process continues with the next candidate tag type.
 *   </ol>
 *  <li>
 *   If the source does not match the start delimiter or syntax of any registered tag type, the segment spanning it and the next
 *   '<code>&gt;</code>' character is taken to be an {@linkplain #isUnregistered() unregistered} tag.
 *   Some tag search methods ignore unregistered tags.  See the {@link #isUnregistered()} method for more information.
 * </ol>
 * <p>
 * See the documentation of the {@link TagType} class for more details on how tags are recognised.
 *
 * <h3><a name="TagSearchMethods">Tag Search Methods</a></h3>
 * <p>
 * Methods that get tags in a source document are collectively referred to as <i>Tag Search Methods</i>.
 * They are found mostly in the {@link Source} and {@link Segment} classes, and can be generally categorised as follows:
 * <dl class="Separated">
 *  <dt><a name="OpenSearch">Open Search:</a>
 *   <dd>These methods search for tags of any {@linkplain #getName() name} and {@linkplain #getTagType() type}.
 *    <ul class="Unseparated">
 *     <li>{@link Tag#getNextTag()}
 *     <li>{@link Tag#getPreviousTag()}
 *     <li>{@link Segment#getAllElements()}
 *     <li>{@link Segment#getFirstElement()}
 *     <li>{@link Source#getTagAt(int pos)}
 *     <li>{@link Source#getPreviousTag(int pos)}
 *     <li>{@link Source#getNextTag(int pos)}
 *     <li>{@link Source#getEnclosingTag(int pos)}
 *     <li>{@link Segment#getAllTags()}
 *     <li>{@link Segment#getAllStartTags()}
 *     <li>{@link Segment#getFirstStartTag()}
 *     <li>{@link Source#getPreviousStartTag(int pos)}
 *     <li>{@link Source#getNextStartTag(int pos)}
 *     <li>{@link Source#getPreviousEndTag(int pos)}
 *     <li>{@link Source#getNextEndTag(int pos)}
 *    </ul>
 *  <dt><a name="NamedSearch">Named Search:</a>
 *   <dd>These methods include a parameter called <code>name</code> which is used to specify the {@linkplain #getName() name} of the tag to search for.
 *    Specifying a name that ends in a colon (<code>:</code>) searches for all elements or tags in the specified XML namespace.
 *    <ul class="Unseparated">
 *     <li>{@link Segment#getAllElements(String name)}
 *     <li>{@link Segment#getFirstElement(String name)}
 *     <li>{@link Segment#getAllStartTags(String name)}
 *     <li>{@link Segment#getFirstStartTag(String name)}
 *     <li>{@link Source#getPreviousStartTag(int pos, String name)}
 *     <li>{@link Source#getNextStartTag(int pos, String name)}
 *     <li>{@link Source#getPreviousEndTag(int pos, String name)}
 *     <li>{@link Source#getNextEndTag(int pos, String name)}
 *     <li>{@link Source#getNextEndTag(int pos, String name, EndTagType)}
 *    </ul>
 *  <dt><a name="TagTypeSearch">Tag Type Search:</a>
 *   <dd>These methods typically include a parameter called <code>tagType</code> which is used to specify the {@linkplain #getTagType() type} of the tag to search for.
 *    In some methods the search parameter is restricted to the {@link StartTagType} or {@link EndTagType} subclass of <code>TagType</code>.
 *    <ul class="Unseparated">
 *     <li>{@link Segment#getAllElements(StartTagType)}
 *     <li>{@link Segment#getAllTags(TagType)}
 *     <li>{@link Segment#getAllStartTags(StartTagType)}
 *     <li>{@link Segment#getFirstStartTag(StartTagType)}
 *     <li>{@link Source#getPreviousTag(int pos, TagType)}
 *     <li>{@link Source#getPreviousStartTag(int pos, StartTagType)}
 *     <li>{@link Source#getPreviousEndTag(int pos, EndTagType)}
 *     <li>{@link Source#getNextTag(int pos, TagType)}
 *     <li>{@link Source#getNextStartTag(int pos, StartTagType)}
 *     <li>{@link Source#getNextEndTag(int pos, EndTagType)}
 *     <li>{@link Source#getEnclosingTag(int pos, TagType)}
 *     <li>{@link Source#getNextEndTag(int pos, String name, EndTagType)}
 *    </ul>
 *  <dt><a name="OtherSearch">Attribute Search:</a>
 *   <dd>These methods (with one exception) include parameters called <code>attributeName</code> and <code>value</code>, which are used to specify an attribute name/value pair to search for.
 *    <ul class="Unseparated">
 *     <li>{@link Segment#getAllElements(String attributeName, String value, boolean valueCaseSensitive)}
 *     <li>{@link Segment#getFirstElement(String attributeName, String value, boolean valueCaseSensitive)}
 *     <li>{@link Segment#getAllStartTags(String attributeName, String value, boolean valueCaseSensitive)}
 *     <li>{@link Segment#getFirstStartTag(String attributeName, String value, boolean valueCaseSensitive)}
 *     <li>{@link Source#getElementById(String id)}
 *     <li>{@link Source#getNextStartTag(int pos, String attributeName, String value, boolean valueCaseSensitive)}
 *    </ul>
 * </dl>
 */
public abstract class Tag extends Segment {
	String name=null; // always lower case, can always use == operator to compare with constants in HTMLElementName interface
	private Object userData=null;
	// cached values:
	Element element=Element.NOT_CACHED;
	private Tag previousTag=NOT_CACHED; // does not include unregistered tags
	private Tag nextTag=NOT_CACHED; // does not include unregistered tags
	// A NOT_CACHED value in nextTag can also indicate that this tag is not in the cache. See isOrphaned() for details.

	static final Tag NOT_CACHED=new StartTag();

	private static final boolean INCLUDE_UNREGISTERED_IN_SEARCH=false; // determines whether unregistered tags are included in searches

	Tag(final Source source, final int begin, final int end, final String name) {
		super(source,begin,end);
		this.name=HTMLElements.getConstantElementName(name.toLowerCase());
	}

	// only used to create Tag.NOT_CACHED
	Tag() {}

	/**
	 * Returns the {@linkplain Element element} that is started or ended by this tag.
	 * <p>
	 * {@link StartTag#getElement()} is guaranteed not <code>null</code>.
	 * <p>
	 * {@link EndTag#getElement()} can return <code>null</code> if the end tag is not properly matched to a start tag.
	 *
	 * @return the {@linkplain Element element} that is started or ended by this tag.
	 */
	public abstract Element getElement();

	/**
	 * Returns the name of this tag, always in lower case.
	 * <p>
	 * The name always starts with the {@linkplain TagType#getNamePrefix() name prefix} defined in this tag's {@linkplain TagType type}.
	 * For some tag types, the name consists only of this prefix, while in others it must be followed by a valid
	 * <a target="_blank" href="http://www.w3.org/TR/REC-xml/#NT-Name">XML name</a>
	 * (see {@link StartTagType#isNameAfterPrefixRequired()}).
	 * <p>
	 * If the name is equal to one of the constants defined in the {@link HTMLElementName} interface, this method is guaranteed to return
	 * the constant itself.
	 * This allows comparisons to be performed using the <code>==</code> operator instead of the less efficient
	 * <code>String.equals(Object)</code> method.
	 * <p>
	 * For example, the following expression can be used to test whether a {@link StartTag} is from a
	 * <code><a target="_blank" href="http://www.w3.org/TR/html401/interact/forms.html#edef-SELECT">SELECT</a></code> element:
	 * <br /><code>startTag.getName()==HTMLElementName.SELECT</code>
	 * <p>
	 * To get the name of this tag in its original case, use {@link #getNameSegment()}<code>.toString()</code>.
	 *
	 * @return the name of this tag, always in lower case.
	 */
	public final String getName() {
		return name;
	}

	/**
	 * Returns the segment spanning the {@linkplain #getName() name} of this tag.
	 * <p>
	 * The code <code>getNameSegment().toString()</code> can be used to retrieve the name of this tag in its original case.
	 * <p>
	 * Every call to this method constructs a new <code>Segment</code> object.
	 *
	 * @return the segment spanning the {@linkplain #getName() name} of this tag.
	 * @see #getName()
	 */
	public Segment getNameSegment() {
		final int nameSegmentBegin=begin+getTagType().startDelimiterPrefix.length();
		return new Segment(source,nameSegmentBegin,nameSegmentBegin+name.length());
	}

	/**
	 * Returns the {@linkplain TagType type} of this tag.	
	 * @return the {@linkplain TagType type} of this tag.	
	 */
	public abstract TagType getTagType();

	/**
	 * Returns the general purpose user data object that has previously been associated with this tag via the {@link #setUserData(Object)} method.
	 * <p>
	 * If {@link #setUserData(Object)} has not been called, this method returns <code>null</code>.
	 *
	 * @return the generic data object that has previously been associated with this tag via the {@link #setUserData(Object)} method.
	 */
	public Object getUserData() {
		return userData;
	}

	/**
	 * Associates the specified general purpose user data object with this tag.
	 * <p>
	 * This property can be useful for applications that need to associate extra information with tags.
	 * The object can be retrieved later via the {@link #getUserData()} method.
	 *
	 * @param userData  general purpose user data of any type.
	 */
	public void setUserData(final Object userData) {
		this.userData=userData;
	}

	/**
	 * Returns the next tag in the source document.
	 * <p>
	 * This method also returns {@linkplain TagType#isServerTag() server tags}.
	 * <p>
	 * The result of a call to this method is cached.
	 * Performing a {@linkplain Source#fullSequentialParse() full sequential parse} prepopulates this cache.
	 * <p>
	 * If the result is not cached, a call to this method is equivalent to <code>source.</code>{@link Source#getNextTag(int) getNextTag}<code>(</code>{@link #getBegin() getBegin()}<code>+1)</code>.
	 * <p>
	 * See the {@link Tag} class documentation for more details about the behaviour of this method.
	 *
	 * @return the next tag in the source document, or <code>null</code> if this is the last tag.
	 */
	public Tag getNextTag() {
		if (nextTag==NOT_CACHED) {
			final Tag localNextTag=getNextTag(source,begin+1);
			if (source.wasFullSequentialParseCalled()) return localNextTag; // Don't set nextTag if this is an orphaned tag. See isOrphaned() for details.
			nextTag=localNextTag;
		}
		return nextTag;
	}

	/**
	 * Returns the previous tag in the source document.
	 * <p>
	 * This method also returns {@linkplain TagType#isServerTag() server tags}.
	 * <p>
	 * The result of a call to this method is cached.
	 * Performing a {@linkplain Source#fullSequentialParse() full sequential parse} prepopulates this cache.
	 * <p>
	 * If the result is not cached, a call to this method is equivalent to <code>source.</code>{@link Source#getPreviousTag(int) getPreviousTag}<code>(</code>{@link #getBegin() getBegin()}<code>-1)</code>.
	 * <p>
	 * See the {@link Tag} class documentation for more details about the behaviour of this method.
	 *
	 * @return the previous tag in the source document, or <code>null</code> if this is the first tag.
	 */
	public Tag getPreviousTag() {
		if (previousTag==NOT_CACHED) previousTag=getPreviousTag(source,begin-1);
		return previousTag;
	}

	/**
	 * Indicates whether this tag has a syntax that does not match any of the {@linkplain TagType#register() registered} {@linkplain TagType tag types}.
	 * <p>
 	 * The only requirement of an unregistered tag type is that it {@linkplain TagType#getStartDelimiter() starts} with
 	 * '<code>&lt;</code>' and there is a {@linkplain TagType#getClosingDelimiter() closing} '<code>&gt;</code>' character
 	 * at some position after it in the source document.
	 * <p>
	 * The absence or presence of a '<code>/</code>' character after the initial '<code>&lt;</code>' determines whether an
	 * unregistered tag is respectively a
	 * {@link StartTag} with a {@linkplain #getTagType() type} of {@link StartTagType#UNREGISTERED} or an
	 * {@link EndTag} with a {@linkplain #getTagType() type} of {@link EndTagType#UNREGISTERED}.
	 * <p>
	 * There are no restrictions on the characters that might appear between these delimiters, including other '<code>&lt;</code>'
	 * characters.  This may result in a '<code>&gt;</code>' character that is identified as the closing delimiter of two
	 * separate tags, one an unregistered tag, and the other a tag of any type that {@linkplain #getBegin() begins} in the middle 
	 * of the unregistered tag.  As explained below, unregistered tags are usually only found when specifically looking for them,
	 * so it is up to the user to detect and deal with any such nonsensical results.
	 * <p>
	 * Unregistered tags are only returned by the {@link Source#getTagAt(int pos)} method,
	 * <a href="Tag.html#NamedSearch">named search</a> methods, where the specified <code>name</code>
	 * matches the first characters inside the tag, and by <a href="Tag.html#TagTypeSearch">tag type search</a> methods, where the
	 * specified <code>tagType</code> is either {@link StartTagType#UNREGISTERED} or {@link EndTagType#UNREGISTERED}.
	 * <p>
	 * <a href="Tag.html#OpenSearch">Open</a> tag searches and <a href="Tag.html#OtherSearch">other</a> searches always ignore
	 * unregistered tags, although every discovery of an unregistered tag is {@linkplain Source#getLogger() logged} by the parser.
	 * <p>
	 * The logic behind this design is that unregistered tag types are usually the result of a '<code>&lt;</code>' character 
	 * in the text that was mistakenly left {@linkplain CharacterReference#encode(CharSequence) unencoded}, or a less-than 
	 * operator inside a script, or some other occurrence which is of no interest to the user.
	 * By returning unregistered tags in <a href="Tag.html#NamedSearch">named</a> and <a href="Tag.html#TagTypeSearch">tag type</a>
	 * search methods, the library allows the user to specifically search for tags with a certain syntax that does not match any
	 * existing {@link TagType}.  This expediency feature avoids the need for the user to create a
	 * <a href="TagType.html#Custom">custom tag type</a> to define the syntax before searching for these tags.
	 * By not returning unregistered tags in the less specific search methods, it is providing only the information that 
	 * most users are interested in.
	 *
	 * @return <code>true</code> if this tag has a syntax that does not match any of the {@linkplain TagType#register() registered} {@linkplain TagType tag types}, otherwise <code>false</code>.
	 */
	public abstract boolean isUnregistered();
12 3 4 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -