📄 bulletparser.java

📁 MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections
package it.unimi.dsi.mg4j.util.parser;

/*
 * MG4J: Managing Gigabytes for Java
 *
 * Copyright (C) 2005-2007 Sebastiano Vigna
 *
 *  This library is free software; you can redistribute it and/or modify it
 *  under the terms of the GNU Lesser General Public License as published by the Free
 *  Software Foundation; either version 2.1 of the License, or (at your option)
 *  any later version.
 *
 *  This library is distributed in the hope that it will be useful, but
 *  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 *  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License
 *  for more details.
 *
 *  You should have received a copy of the GNU Lesser General Public License
 *  along with this program; if not, write to the Free Software
 *  Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 */

import it.unimi.dsi.fastutil.objects.Reference2ObjectArrayMap;
import it.unimi.dsi.fastutil.objects.Reference2ObjectMap;
import it.unimi.dsi.fastutil.objects.ReferenceArraySet;
import it.unimi.dsi.fastutil.objects.ReferenceSet;
import it.unimi.dsi.fastutil.objects.ReferenceSets;
import it.unimi.dsi.mg4j.util.MutableString;
import it.unimi.dsi.mg4j.util.TextPattern;
import it.unimi.dsi.mg4j.util.parser.callback.Callback;

/** A fast, lightweight, on-demand (X)HTML parser.
 *
 * <p>The bullet parser has been written with two specific goals in mind:
 * web crawling and targeted data extraction from massive web data sets.
 * To be usable in such environments, a parser must obey a number of
 * restrictions:
 * <ul>
 * <li>it should avoid excessive object creation (which, for instance,
 * rules out any significant use of Java strings);
 * <li>it should tolerate invalid syntax and recover reasonably; in fact,
 * it should never throw exceptions;
 * <li>it should perform actual parsing only on a settable feature subset:
 * there is no reason to parse the attributes of a <samp>P</samp>
 * element while searching for links;
 * <li>it should parse HTML as a <em>regular language</em>, and leave context-free
 * properties (e.g., stack maintenance and repair) to suitably designed callbacks.
 * </ul>
 *
 * <p>Thus, in fact the bullet parser is not a parser. It is a bunch of
 * spaghetti code that analyses a stream of characters pretending that
 * it is an (X)HTML document. It has a very defensive attitude towards
 * the character stream it is parsing, but at the same time it is
 * forgiving of all typical (X)HTML mistakes.
 *
 * <p>The bullet parser is officially StringFree&trade;.
 * <a href="http://mg4j.dsi.unimi.it/docs/it/unimi/dsi/mg4j/util/MutableString.html"><code>MutableString</code>s</a>
 * are used for internal processing, and Java strings are used only to return attribute
 * values. All internal maps are {@linkplain it.unimi.dsi.fastutil.objects.Reference2ObjectMap reference-based maps}
 * from <a href="http://fastutil.dsi.unimi.it/"><samp>fastutil</samp></a>, which
 * further speeds up the parsing process.
 *
 * <h2>HTML data</h2>
 *
 * <p>The bullet parser uses attributes and methods of {@link it.unimi.dsi.mg4j.util.parser.HTMLFactory},
 * {@link it.unimi.dsi.mg4j.util.parser.Element}, {@link it.unimi.dsi.mg4j.util.parser.Attribute}
 * and {@link it.unimi.dsi.mg4j.util.parser.Entity}.
 * Thus, for instance, whenever an element is to be passed around it is one
 * of the shared objects contained in {@link it.unimi.dsi.mg4j.util.parser.Element}
 * (e.g., {@link it.unimi.dsi.mg4j.util.parser.Element#BODY}).
 *
 * <h2>Callbacks</h2>
 *
 * <p>The result of the parsing process is the invocation of a callback.
 * The {@linkplain it.unimi.dsi.parser.callback.Callback callback interface}
 * of the bullet parser closely resembles SAX2, but it has some additional
 * methods targeted at (X)HTML, such as {@link Callback#cdata(it.unimi.dsi.mg4j.util.parser.Element,char[],int,int)},
 * which returns characters found in a CDATA section (e.g., a stylesheet).
 *
 * <p>Each callback must configure the parser by requesting the analysis
 * and the callbacks it requires. A callback that wants to
 * extract and tokenise text, for instance, will certainly require
 * {@link #parseText(boolean) parseText(true)}, but not {@link #parseTags(boolean) parseTags(true)}.
 * On the other hand, a callback wishing to extract links will need
 * to {@linkplain #parseAttribute(Attribute) parse selectively} certain attribute types.
 *
 * <p>A more precise description follows.
 *
 * <h2>Writing callbacks</h2>
 *
 * <p>The first important issue is what must be requested from the parser. A newly
 * created parser does not invoke any callback. It is up to every callback
 * to add features so that it can do its job. Remember that since many
 * callbacks can be {@linkplain it.unimi.dsi.parser.callback.ComposedCallbackBuilder composed},
 * you must always <em>add</em> features, never <em>remove</em> them, and moreover
 * your callbacks must be ready to be invoked with features they did not
 * request (e.g., attribute types added by another callback).
 *
 * <p>The following parse features
 * may be configured; most of them are just boolean features, a.k.a.
 * flags:
 * unless otherwise specified, by default all flags are set to false (e.g., by
 * default the parser will <em>not</em> parse tags):
 * <ul>
 * <li><em>tags</em> ({@link #parseTags(boolean)} method): whether tags
 * should be parsed;
 * <li><em>attributes</em> ({@link #parseAttributes(boolean)} and
 * {@link #parseAttribute(Attribute)} methods):
 * whether attributes should be parsed (of course, setting this flag is useless
 * if you are not parsing tags); note that setting this flag will just
 * activate the attribute parsing feature, but you must also
 * {@linkplain #parseAttribute(Attribute) register} every attribute
 * whose value you want to obtain.
 * <li><em>text</em> ({@link #parseText(boolean)} method): whether text
 * should be parsed; if this flag is set, the parser will call the
 * {@link it.unimi.dsi.parser.callback.Callback#characters(char[], int, int, boolean)}
 * method for every text chunk found.
 * <li><em>CDATA sections</em> ({@link #parseCDATA(boolean)} method): whether CDATA
 * sections (stylesheets &amp; scripts)
 * should be parsed; if this flag is set, the parser will call the
 * {@link Callback#cdata(Element,char[],int,int)}
 * method for every CDATA section found.
 * </ul>
 *
 * <h2>Invoking the parser</h2>
 *
 * <p>After {@linkplain #setCallback(Callback) setting the parser callback},
 * you just call {@link #parse(char[], int, int)}.
 * @deprecated Moved to <code>dsiutils</code>.
 */
@Deprecated
public class BulletParser {

	private static final boolean DEBUG = false;

	/** Scanning text. */
	protected static final int STATE_TEXT = 0;
	/** Positioned before the name of a start tag. */
	protected static final int STATE_BEFORE_START_TAG_NAME = 1;
	/** Positioned before the name of an end tag. */
	protected static final int STATE_BEFORE_END_TAG_NAME = 2;
	/** Scanning attribute name/value pairs. */
	protected static final int STATE_IN_START_TAG = 3;
	/** Scanning a closing tag. */
	protected static final int STATE_IN_END_TAG = 4;

	/** The maximum Unicode value accepted for a numeric entity.
	 */
	protected static final int MAX_ENTITY_VALUE = 65535;
	/** The base for non-decimal entity. */
	protected static final int HEXADECIMAL = 16;
	/** The maximum number of digits of a hexadecimal numeric entity. */
	protected static final int MAX_HEX_ENTITY_LENGTH = 8;
	/** The maximum number of digits of a decimal numeric entity. */
	protected static final int MAX_DEC_ENTITY_LENGTH = 9;

	/** Closing tag for a script element. */
	protected static final TextPattern SCRIPT_CLOSE_TAG_PATTERN = new TextPattern( "</script>", TextPattern.CASE_INSENSITIVE );
	/** Closing tag for a style element. */
	protected static final TextPattern STYLE_CLOSE_TAG_PATTERN = new TextPattern( "</style>", TextPattern.CASE_INSENSITIVE );

	/** An array containing the non-space whitespace. */
	protected static final char[] NONSPACE_WHITESPACE = { '\n', '\r', '\t' };
	/** An array, parallel to {@link #NONSPACE_WHITESPACE}, containing spaces. */
	protected static final char[] SPACE = { ' ', ' ', ' ' };

	/** Closed comment. It should be "-->", but mistakes are common. */
	protected static final TextPattern CLOSED_COMMENT = new TextPattern( "->" );
	/** Closed ASP or similar tag. */
	protected static final TextPattern CLOSED_PERCENT = new TextPattern( "%>" );
	/** Closed processing instruction. */
	protected static final TextPattern CLOSED_PIC = new TextPattern( "?>" );
	/** Closed section (conditional, etc.). */
	protected static final TextPattern CLOSED_SECTION = new TextPattern( "]>" );
	/** Closed section (conditional, CDATA, etc.). */
	protected static final TextPattern CLOSED_CDATA = new TextPattern( "]]>" );
	/** TODO: what is this?. */
	//protected static final TextPattern CLOSED_BOH = new TextPattern( "!>" );

	/** The parsing factory used by this parser. */
	public final ParsingFactory factory;

	/** The callback of this parser. */
	protected Callback callback;
	/** A map from attributes to attribute values.
	 */
	protected Reference2ObjectMap<Attribute,MutableString> attrMap;
	/** Whether we should invoke the text handler. */
	protected boolean parseText;
	/** Whether we should invoke the CDATA section handler. */
	protected boolean parseCDATA;
	/** Whether we should parse tags. */
	protected boolean parseTags;
	/** Whether we should parse attributes. */
	protected boolean parseAttributes;
	/**
	 * The subset of attributes whose values will be actually parsed (if, of
	 * course, {@link #parseAttributes} is true).
	 */
	protected ReferenceArraySet<Attribute> parsedAttrs = new ReferenceArraySet<Attribute>();
	/**
	 * An externally visible, immutable subset of attributes whose values will
	 * be actually parsed.
	 */
	public ReferenceSet<Attribute> parsedAttributes = ReferenceSets.unmodifiable( parsedAttrs );
	/** The character represented by the last scanned entity. */
	protected char lastEntity;

	/** Creates a new bullet parser. */
	public BulletParser( final ParsingFactory factory ) {
		this.factory = factory;
	}

	/** Creates a new bullet parser using the default factory {@link HTMLFactory#INSTANCE}. */
	public BulletParser() {
		this( HTMLFactory.INSTANCE );
	}

	/**
	 * Returns whether this parser will invoke the text handler.
	 *
	 * @return whether this parser will invoke the text handler.
	 * @see #parseText(boolean)
	 */
	public boolean parseText() {
		return parseText;
	}

	/**
	 * Sets the text handler flag.
	 *
	 * @param parseText
	 *            the new value.
	 * @return this parser.
	 */
	public BulletParser parseText( final boolean parseText ) {
		this.parseText = parseText;
		return this;
	}

	/**
	 * Returns whether this parser will invoke the CDATA-section handler.
	 *
	 * @return whether this parser will invoke the CDATA-section handler.
	 * @see #parseCDATA(boolean)
	 */
	public boolean parseCDATA() {
		return parseCDATA;
	}

	/**
	 * Sets the CDATA-section handler flag.
	 *
	 * @param parseCDATA
	 *            the new value.
	 * @return this parser.
	 */
	public BulletParser parseCDATA( final boolean parseCDATA ) {
		this.parseCDATA = parseCDATA;
		return this;
	}

	/**
	 * Returns whether this parser will parse tags and invoke element handlers.
	 *
	 * @return whether this parser will parse tags and invoke element handlers.
	 * @see #parseTags(boolean)
	 */
	public boolean parseTags() {
		return parseTags;
	}

	/**
	 * Sets whether this parser will parse tags and invoke element handlers.
	 *
	 * @param parseTags
	 *            the new value.
	 * @return this parser.
	 */
	public BulletParser parseTags( final boolean parseTags ) {
		this.parseTags = parseTags;
		return this;
	}

	/**
	 * Returns whether this parser will parse attributes.
	 *
	 * @return whether this parser will parse attributes.
	 * @see #parseAttributes(boolean)
	 */
	public boolean parseAttributes() {
		return parseAttributes;
	}

	/**
	 * Sets the attribute parsing flag.
	 *
	 * @param parseAttributes
	 *            the new value for the flag.
	 * @return this parser.
	 */
	public BulletParser parseAttributes( final boolean parseAttributes ) {
		this.parseAttributes = parseAttributes;
		return this;
	}

	/**
	 * Adds the given attribute to the set of attributes to be parsed.
	 *
	 * @param attribute
	 *            an attribute that should be parsed.
	 * @throws IllegalStateException
	 *             if {@link #parseAttributes(boolean) parseAttributes(true)}
	 *             has not been invoked on this parser.
	 * @return this parser.
	 */
	public BulletParser parseAttribute( final Attribute attribute ) {
		parsedAttrs.add( attribute );
		return this;
	}

	/** Sets the callback for this parser, resetting at the same time all parsing flags.
	 *
	 * @param callback the new callback.
	 * @return this parser.
	 */
	public BulletParser setCallback( final Callback callback ) {
		this.callback = callback;
		parseCDATA = parseText = parseAttributes = parseTags = false;
		parsedAttrs.clear();
		callback.configure( this );
		return this;
	}

	/** Returns the character corresponding to a given entity name.
	 *
	 * @param name the name of an entity.
	 * @return the character corresponding to the entity, or an ASCII NUL if no entity with that name was found.
	 */
	protected char entity2Char( final MutableString name ) {
		final Entity e = factory.getEntity( name );
		return e == null ? (char)0 : e.character;
	}

	/** Searches for the end of an entity.
	 *
	 * <P>This method will search for the end of an entity starting at the given offset (the offset
	 * must correspond to the ampersand).
	 *
	 * <P>Real-world HTML pages often contain hundreds of misplaced ampersands, due to the
	 * unfortunate idea of using the ampersand as query separator (<em>please</em> use the comma
	 * in new code!). All such ampersands should be specified as <samp>&amp;amp;</samp>.
	 * If named entities are delimited using a transition
	 * from alphabetical to non-alphabetical characters, we can easily get false positives. If the parameter
	 * <code>loose</code> is false, named entities can be delimited only by whitespace or by a comma.
	 *
	 * @param a a character array containing the entity.
	 * @param offset the offset at which the entity starts (the offset must point at the ampersand).
	 * @param length an upper bound to the maximum returned position.
	 * @param loose if true, named entities can be terminated by any non-alphabetical character
	 * (instead of whitespace or comma).
	 * @param entity a support mutable string used to query {@link ParsingFactory#getEntity(MutableString)}.
	 * @return the position of the last character of the entity, or -1 if no entity was found.
	 */
	protected int scanEntity( final char[] a, final int offset, final int length, final boolean loose, final MutableString entity ) {