		return false;
	}

	/**
	 * Returns the <a target="_blank" href="http://en.wikipedia.org/wiki/Newline">newline</a> character sequence used in the source document.
	 * <p>
	 * If the document does not contain any newline characters, this method returns <code>null</code>.
	 * <p>
	 * The three possible return values (aside from <code>null</code>) are <code>"\n"</code>, <code>"\r\n"</code> and <code>"\r"</code>.
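	 * <p>
	 * As the implementation shows, the sequence is taken from the first line break found in the document.
	 * <p>
	 * <dl>
	 *  <dt>Example:</dt>
	 *  <dd>
	 *   <p>
	 *   A minimal usage sketch; the sample text and variable names are illustrative assumptions only.
	 *   <p>
	 * <pre>
	 * Source source=new Source("first line\r\nsecond line");
	 * String newLine=source.getNewLine();
	 * if (newLine==null) {
	 *   // the document contains no newline characters
	 * } else if (newLine.equals("\r\n")) {
	 *   // the first line break in this sample text is a CR+LF pair
	 * }</pre>
	 *  </dd>
	 * </dl>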
	 *
	 * @return the <a target="_blank" href="http://en.wikipedia.org/wiki/Newline">newline</a> character sequence used in the source document, or <code>null</code> if none is present.
	 */
	public String getNewLine() {
		if (newLine!=UNINITIALISED) return newLine;
		for (int i=0; i<end; i++) {
			char ch=string.charAt(i);
			if (ch=='\n') return newLine=lastNewLine=LF;
			if (ch=='\r') return newLine=lastNewLine=(++i<end && string.charAt(i)=='\n') ? CRLF : CR;
		}
		return newLine=null;
	}

	String getBestGuessNewLine() {
		final String newLine=getNewLine();
		if (newLine!=null) return newLine;
		if (lastNewLine!=null) return lastNewLine;
		return Config.NewLine;
	}

	/**
	 * Returns the row number of the specified character position in the source document.
	 * @param pos  the position in the source document.
	 * @return the row number of the specified character position in the source document.
	 * @throws IndexOutOfBoundsException if the specified position is not within the bounds of the document.
	 * @see #getColumn(int pos)
	 * @see #getRowColumnVector(int pos)
	 */
	public int getRow(final int pos) {
		return getRowColumnVector(pos).getRow();
	}

	/**
	 * Returns the column number of the specified character position in the source document.
	 * @param pos  the position in the source document.
	 * @return the column number of the specified character position in the source document.
	 * @throws IndexOutOfBoundsException if the specified position is not within the bounds of the document.
	 * @see #getRow(int pos)
	 * @see #getRowColumnVector(int pos)
	 */
	public int getColumn(final int pos) {
		return getRowColumnVector(pos).getColumn();
	}

	/**
	 * Returns a {@link RowColumnVector} object representing the row and column number of the specified character position in the source document.
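	 * <p>
	 * <dl>
	 *  <dt>Example:</dt>
	 *  <dd>
	 *   <p>
	 *   A minimal usage sketch for reporting the location of each tag. It assumes <code>source</code> is an existing <code>Source</code> object;
	 *   the message format is illustrative only.
	 *   <p>
	 * <pre>
	 * for (Tag tag : source.getAllTags()) {
	 *   RowColumnVector location=source.getRowColumnVector(tag.getBegin());
	 *   System.out.println(tag.getName()+" at row "+location.getRow()+", column "+location.getColumn());
	 * }</pre>
	 *  </dd>
	 * </dl>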
	 * @param pos  the position in the source document.
	 * @return a {@link RowColumnVector} object representing the row and column number of the specified character position in the source document.
	 * @throws IndexOutOfBoundsException if the specified position is not within the bounds of the document.
	 * @see #getRow(int pos)
	 * @see #getColumn(int pos)
	 */
	public RowColumnVector getRowColumnVector(final int pos) {
		if (pos<0 || pos>end) throw new IndexOutOfBoundsException();
		if (rowColumnVectorCacheArray==null) rowColumnVectorCacheArray=RowColumnVector.getCacheArray(this);
		return RowColumnVector.get(rowColumnVectorCacheArray,pos);
	}
	
	/**
	 * Returns the source text as a <code>String</code>.
	 * @return the source text as a <code>String</code>.
	 */
	public String toString() {
		return string;
	}

	/**
	 * Parses all of the {@linkplain Tag tags} in this source document sequentially from beginning to end.
	 * <p>
	 * Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.
	 * <p>
	 * Calling the {@link #getAllTags()}, {@link #getAllStartTags()}, {@link #getAllElements()}, {@link #getChildElements()},
	 * {@link #iterator()} or {@link #getNodeIterator()}
	 * method on the <code>Source</code> object performs a full sequential parse automatically.
	 * There are, however, still circumstances where it should be called manually, such as when it is known that most or all of the tags in the document will need to be parsed,
	 * but none of the above-mentioned methods are used, or they are called only after calling one or more other <a href="Tag.html#TagSearchMethods">tag search methods</a>.
	 * <p>
	 * If this method is called manually, it should be called soon after the <code>Source</code> object is created,
	 * before any <a href="Tag.html#TagSearchMethods">tag search methods</a> are called.
	 * <p>
	 * By default, tags are parsed only as needed, which is referred to as <i><a name="ParseOnDemand">parse on demand</a></i> mode.
	 * In this mode, every call to a tag search method that is not returning previously cached tags must perform a relatively complex check to determine whether a
	 * potential tag is in a {@linkplain TagType#isValidPosition(Source,int,int[]) valid position}.
	 * <p>
	 * Generally speaking, a tag is in a valid position if it does not appear inside any other tag.
	 * {@linkplain TagType#isServerTag() Server tags} can appear anywhere in a document, including inside other tags, so this relates only to non-server tags.
	 * Theoretically, checking whether a specified position in the document is enclosed in another tag is only possible if every preceding tag has been parsed,
	 * otherwise it is impossible to tell whether one of the delimiters of the enclosing tag was in fact enclosed by some other tag before it, thereby invalidating it.
	 * <p>
	 * When this method is called, each tag is parsed in sequence starting from the beginning of the document, making it easy to check whether each potential
	 * tag is in a valid position.
	 * In <i>parse on demand</i> mode a compromise technique must be used for this check, since the theoretical requirement of having parsed all preceding tags 
	 * is no longer practical.  
	 * This compromise involves only checking whether the position is enclosed by other tags with {@linkplain TagType#getTagTypesIgnoringEnclosedMarkup() certain tag types}.
	 * The added complexity of this technique makes parsing each tag slower compared to when a full sequential parse is performed, but when only a few tags need
	 * parsing this is an extremely beneficial trade-off.
	 * <p>
	 * The documentation of the {@link TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData)} method,
	 * which is called internally by the parser to perform the valid position check,
	 * includes a more detailed explanation of the differences between the two modes of operation.
	 * <p>
	 * Calling this method a second or subsequent time has no effect.
	 * <p>
	 * This method returns the same list of tags as the {@link Source#getAllTags() Source.getAllTags()} method, but as an array instead of a list.
	 * <p>
	 * If this method is called after any of the <a href="Tag.html#TagSearchMethods">tag search methods</a> are called,
	 * the {@linkplain #getCacheDebugInfo() cache} is cleared of any previously found tags before being restocked via the full sequential parse.
	 * This means that if you still have references to tags or elements from before the full sequential parse, they will not be the same objects as those
	 * that are returned by tag search methods after the full sequential parse, which can cause confusion if you are allocating
	 * {@linkplain Tag#setUserData(Object) user data} to tags.
	 * It is also significant if the {@link Segment#ignoreWhenParsing()} method has been called since the tags were first found, as any tags inside the
	 * ignored segments will no longer be returned by any of the <a href="Tag.html#TagSearchMethods">tag search methods</a>.
	 * <p>
	 * See also the {@link Tag} class documentation for more general details about how tags are parsed.
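	 * <p>
	 * <dl>
	 *  <dt>Example:</dt>
	 *  <dd>
	 *   <p>
	 *   A minimal sketch of the call order recommended above. It assumes <code>html</code> is a <code>String</code> holding the document text;
	 *   the processing inside the loop is illustrative only.
	 *   <p>
	 * <pre>
	 * Source source=new Source(html);
	 * source.fullSequentialParse(); // parse every tag once, before any tag search methods are called
	 * for (Element element : source.getAllElements()) {
	 *   System.out.println(element.getName());
	 * }</pre>
	 *  </dd>
	 * </dl>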
	 *
	 * @return an array of all {@linkplain Tag tags} in this source document.
	 */
	public Tag[] fullSequentialParse() {
		// The assumeNoNestedTags flag tells the parser not to bother checking for tags inside other tags
		// if the user knows that the document doesn't contain any server tags.
		// This results in a more efficient search, but the difference during benchmark tests was only minimal -
		// about 12% speed improvement in a 1MB document containing 70,000 tags, 75% of which were inside a comment tag.
		// With such a small improvement in a document specifically designed to show an exaggerated improvement,
		// it is not worth documenting this feature.
		// The flag has been retained internally however as it does not have a measurable performance impact to check for it.
		if (allTagsArray!=null) return allTagsArray;
		final boolean assumeNoNestedTags=false;
		if (cache.getTagCount()!=0) {
			logger.warn("Full sequential parse clearing all tags from cache. Consider calling Source.fullSequentialParse() manually immediately after construction of Source.");
			cache.clear();
		}
		final boolean useAllTypesCacheSave=useAllTypesCache;
		try {
			useAllTypesCache=false;
			useSpecialTypesCache=false;
			return Tag.parseAll(this,assumeNoNestedTags);
		} finally {
			useAllTypesCache=useAllTypesCacheSave;
			useSpecialTypesCache=true;
		}
	}

	/**
	 * Returns an iterator over every tag and text segment contained within this source document.
	 * <p>
	 * Every tag found in the {@link #getAllTags()} list is included in this iterator, including all {@linkplain TagType#isServerTag() server tags}.
	 * <p>
	 * Segments of the document between the tags are also included, resulting in a sequential walk-through of every "node" in the source document, where a node is either
	 * a tag or a segment of text.
	 * The {@linkplain #getEnd() end} position of each segment should correspond with the {@linkplain #getBegin() begin} position of the subsequent segment,
	 * unless any of the tags are enclosed by other tags, which is common when {@linkplain TagType#isServerTag() server tags} are present.
	 * <p>
	 * The {@link CharacterReference#decodeCollapseWhiteSpace(CharSequence)} method can be used to retrieve the text from each text segment.
	 * <p>
	 * This method is implemented by simply calling the {@link Segment#getNodeIterator()} method of the {@link Segment} superclass. 
	 * <p>
	 * <dl>
	 *  <dt>Example:</dt>
	 *  <dd>
	 *   <p>
	 *   The following code demonstrates the typical (implied) usage of this method through the <code>Iterable</code> interface.
	 *   <p>
	 * <pre>
	 * for (Segment segment : source) {
	 *   if (segment instanceof Tag) {
	 *     Tag tag=(Tag)segment;
	 *     if (tag.getTagType().isServerTag()) continue; // ignore server tags
	 *     // Process the tag (just output it in this example):
	 *     System.out.println(tag.tidy());
	 *   } else {
	 *     // Segment is a text segment.
	 *     // Process the text segment (just output its text in this example):
	 *     String text=CharacterReference.decodeCollapseWhiteSpace(segment);
	 *     System.out.println(text);
	 *   }
	 * }</pre>
	 *  </dd>
	 * </dl>
	 * @return an iterator over every tag and text segment contained within this source document.
	 */
	public Iterator<Segment> iterator() {
		return getNodeIterator();
	}

	/**
	 * Returns a list of the top-level {@linkplain Element elements} in the document element hierarchy.
	 * <p>
	 * The objects in the list are all of type {@link Element}.
	 * <p>
	 * The term <i><a name="TopLevelElement">top-level element</a></i> refers to an element that is not nested within any other element in the document.
	 * <p>
	 * The term <i><a name="DocumentElementHierarchy">document element hierarchy</a></i> refers to the hierarchy of elements that make up this source document.
	 * The source document itself is not considered to be part of the hierarchy, meaning there is typically more than one top-level element.
	 * Even when the source represents an entire HTML document, the {@linkplain StartTagType#DOCTYPE_DECLARATION document type declaration} and/or an
	 * {@linkplain StartTagType#XML_DECLARATION XML declaration} often exist as top-level elements along with the {@link HTMLElementName#HTML HTML} element itself.
	 * <p>
	 * The {@link Element#getChildElements()} method can be used to get the children of the top-level elements, with recursive use providing a means to
	 * visit every element in the document hierarchy.
	 * <p>
	 * The document element hierarchy differs from that of the <a target="_blank" href="http://en.wikipedia.org/wiki/Document_Object_Model">Document Object Model</a>
	 * in that it is only a representation of the elements that are physically present in the source text.  Unlike the DOM, it does not include any "implied" HTML elements