📄 parser.java

📁 Mobile applications using the Java Micro Edition (Java ME) platform
💻 JAVA
📖 Page 1 of 4
/*
 * @(#)Parser.java    1.47 06/02/26
 *
 * Copyright 2006 Sun Microsystems, Inc. All rights reserved.
 * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
 */

package javax.swing.text.html.parser;

import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.ChangedCharSetException;
import java.io.*;
import java.util.Hashtable;
import java.util.Properties;
import java.util.Vector;
import java.util.Enumeration;
import java.net.URL;

import sun.misc.MessageUtils;

/**
 * A simple DTD-driven HTML parser. The parser reads an
 * HTML file from an InputStream and calls various methods
 * (which should be overridden in a subclass) when tags and
 * data are encountered.
 * <p>
 * Unfortunately there are many badly implemented HTML parsers
 * out there, and as a result there are many badly formatted
 * HTML files. This parser attempts to parse most HTML files.
 * This means that the implementation sometimes deviates from
 * the SGML specification in favor of HTML.
 * <p>
 * The parser treats \r and \r\n as \n. Newlines after starttags
 * and before end tags are ignored just as specified in the SGML/HTML
 * specification.
 * <p>
 * The html spec does not specify how spaces are to be coalesced very well.
 * Specifically, the following scenarios are not discussed (note that a
 * space should be used here, but I am using &amp;nbsp to force the space to
 * be displayed):
 * <p>
 * '&lt;b>blah&nbsp;&lt;i>&nbsp;&lt;strike>&nbsp;foo' which can be treated as:
 * '&lt;b>blah&nbsp;&lt;i>&lt;strike>foo'
 * <p>as well as:
 * '&lt;p>&lt;a href="xx">&nbsp;&lt;em>Using&lt;/em>&lt;/a>&lt;/p>'
 * which appears to be treated as:
 * '&lt;p>&lt;a href="xx">&lt;em>Using&lt;/em>&lt;/a>&lt;/p>'
 * <p>
 * If <code>strict</code> is false, when a tag that breaks flow,
 * (<code>TagElement.breaksFlow</code>) or trailing whitespace is
 * encountered, all whitespace will be ignored until a non whitespace
 * character is encountered. This appears to give behavior closer to
 * the popular browsers.
 *
 * @see DTD
 * @see TagElement
 * @see SimpleAttributeSet
 * @version 1.47, 02/26/06
 * @author Arthur van Hoff
 * @author Sunita Mani
 */
public class Parser implements DTDConstants {

    private char text[] = new char[1024];
    private int textpos = 0;
    private TagElement last;
    private boolean space;

    private char str[] = new char[128];
    private int strpos = 0;

    protected DTD dtd = null;

    private int ch;
    private int ln;
    private Reader in;

    private Element recent;
    private TagStack stack;
    private boolean skipTag = false;
    private TagElement lastFormSent = null;
    private SimpleAttributeSet attributes = new SimpleAttributeSet();

    // State for <html>, <head> and <body>.  Since people like to slap
    // together HTML documents without thinking, occasionally they
    // have multiple instances of these tags.  These booleans track
    // the first sightings of these tags so they can be safely ignored
    // by the parser if repeated.
    private boolean seenHtml = false;
    private boolean seenHead = false;
    private boolean seenBody = false;

    /**
     * The html spec does not specify how spaces are coalesced very well.
     * If strict == false, ignoreSpace is used to try and mimic the behavior
     * of the popular browsers.
     * <p>
     * The problematic scenarios are:
     * '&lt;b>blah &lt;i> &lt;strike> foo' which can be treated as:
     * '&lt;b>blah &lt;i>&lt;strike>foo'
     * as well as:
     * '&lt;p>&lt;a href="xx"> &lt;em>Using&lt;/em>&lt;/a>&lt;/p>'
     * which appears to be treated as:
     * '&lt;p>&lt;a href="xx">&lt;em>Using&lt;/em>&lt;/a>&lt;/p>'
     * <p>
     * When a tag that breaks flow, or trailing whitespace is encountered
     * ignoreSpace is set to true. From then on, all whitespace will be
     * ignored.
     * ignoreSpace will be set back to false the first time a
     * non whitespace character is encountered. This appears to give
     * behavior closer to the popular browsers.
     */
    private boolean ignoreSpace;

    /**
     * This flag determines whether or not the Parser will be strict
     * in enforcing SGML compatibility.  If false, it will be lenient
     * with certain common classes of erroneous HTML constructs.
     * Strict or not, in either case an error will be recorded.
     *
     */
    protected boolean strict = false;

    /** Number of \r\n's encountered. */
    private int crlfCount;
    /** Number of \r's encountered. A \r\n will not increment this. */
    private int crCount;
    /** Number of \n's encountered. A \r\n will not increment this. */
    private int lfCount;

    //
    // To correctly identify the start of a tag/comment/text we need two
    // ivars. Two are needed as handleText isn't invoked until the tag
    // after the text has been parsed, that is the parser parses the text,
    // then a tag, then invokes handleText followed by handleStart.
    //
    /** The start position of the current block. Block is overloaded here,
     * it really means the current start position for the current comment,
     * tag, text. Use getBlockStartPosition to access this. */
    private int currentBlockStartPos;
    /** Start position of the last block. */
    private int lastBlockStartPos;

    /**
     * array for mapping numeric references in range
     * 130-159 to displayable Unicode characters.
     */
    private static final char[] cp1252Map = {
        8218,  // &#130;
        402,   // &#131;
        8222,  // &#132;
        8230,  // &#133;
        8224,  // &#134;
        8225,  // &#135;
        710,   // &#136;
        8240,  // &#137;
        352,   // &#138;
        8249,  // &#139;
        338,   // &#140;
        141,   // &#141;
        142,   // &#142;
        143,   // &#143;
        144,   // &#144;
        8216,  // &#145;
        8217,  // &#146;
        8220,  // &#147;
        8221,  // &#148;
        8226,  // &#149;
        8211,  // &#150;
        8212,  // &#151;
        732,   // &#152;
        8482,  // &#153;
        353,   // &#154;
        8250,  // &#155;
        339,   // &#156;
        157,   // &#157;
        158,   // &#158;
        376    // &#159;
    };

    public Parser(DTD dtd) {
        this.dtd = dtd;
    }

    /**
     * @return the line number of the line currently being parsed
     */
    protected int getCurrentLine() {
        return ln;
    }

    /**
     * Returns the start position of the current block. Block is
     * overloaded here, it really means the current start position for
     * the current comment tag, text, block.... This is provided for
     * subclassers that wish to know the start of the current block when
     * called with one of the handleXXX methods.
     */
    int getBlockStartPosition() {
        return Math.max(0, lastBlockStartPos - 1);
    }

    /**
     * Makes a TagElement.
     */
    protected TagElement makeTag(Element elem, boolean fictional) {
        return new TagElement(elem, fictional);
    }

    protected TagElement makeTag(Element elem) {
        return makeTag(elem, false);
    }

    protected SimpleAttributeSet getAttributes() {
        return attributes;
    }

    protected void flushAttributes() {
        attributes.removeAttributes(attributes);
    }

    /**
     * Called when PCDATA is encountered.
     */
    protected void handleText(char text[]) {
    }

    /**
     * Called when an HTML title tag is encountered.
     */
    protected void handleTitle(char text[]) {
        // default behavior is to call handleText. Subclasses
        // can override if necessary.
        handleText(text);
    }

    /**
     * Called when an HTML comment is encountered.
     */
    protected void handleComment(char text[]) {
    }

    protected void handleEOFInComment() {
        // We've reached EOF.  Our recovery strategy is to
        // see if we have more than one line in the comment;
        // if so, we pretend that the comment was an unterminated
        // single line comment, and reparse the lines after the
        // first line as normal HTML content.

        int commentEndPos = strIndexOf('\n');
        if (commentEndPos >= 0) {
            handleComment(getChars(0, commentEndPos));
            try {
                in.close();
                in = new CharArrayReader(getChars(commentEndPos + 1));
                ch = '>';
            } catch (IOException e) {
                error("ioexception");
            }

            resetStrBuffer();
        } else {
            // no newline, so signal an error
            error("eof.comment");
        }
    }

    /**
     * Called when an empty tag is encountered.
     */
    protected void handleEmptyTag(TagElement tag) throws ChangedCharSetException {
    }

    /**
     * Called when a start tag is encountered.
     */
    protected void handleStartTag(TagElement tag) {
    }

    /**
     * Called when an end tag is encountered.
     */
    protected void handleEndTag(TagElement tag) {
    }

    /**
     * An error has occurred.
     */
    protected void handleError(int ln, String msg) {
        /*
        Thread.dumpStack();
        System.out.println("**** " + stack);
        System.out.println("line " + ln + ": error: " + msg);
        System.out.println();
        */
    }

    /**
     * Output text.
     */
    void handleText(TagElement tag) {
        if (tag.breaksFlow()) {
            space = false;
            if (!strict) {
                ignoreSpace = true;
            }
        }
        if (textpos == 0) {
            if ((!space) || (stack == null) || last.breaksFlow() ||
                !stack.advance(dtd.pcdata)) {
                last = tag;
                space = false;
                lastBlockStartPos = currentBlockStartPos;
                return;
            }
        }
        if (space) {
            if (!ignoreSpace) {
                // enlarge buffer if needed
                if (textpos + 1 > text.length) {
                    char newtext[] = new char[text.length + 200];
                    System.arraycopy(text, 0, newtext, 0, text.length);
                    text = newtext;
                }

                // output pending space
                text[textpos++] = ' ';
                if (!strict && !tag.getElement().isEmpty()) {
                    ignoreSpace = true;
                }
            }
            space = false;
        }
        char newtext[] = new char[textpos];
        System.arraycopy(text, 0, newtext, 0, textpos);
        // Handles cases of bad html where the title tag
        // was getting lost when we did error recovery.
        if (tag.getElement().getName().equals("title")) {
            handleTitle(newtext);
        } else {
            handleText(newtext);
        }
        lastBlockStartPos = currentBlockStartPos;
        textpos = 0;
        last = tag;
        space = false;
    }

    /**
     * Invoke the error handler.
     */
    protected void error(String err, String arg1, String arg2,
        String arg3) {
        handleError(ln, err + " " + arg1 + " " + arg2 + " " + arg3);
    }

    protected void error(String err, String arg1, String arg2) {
        error(err, arg1, arg2, "?");
    }

    protected void error(String err, String arg1) {
        error(err, arg1, "?", "?");
    }

    protected void error(String err) {
        error(err, "?", "?", "?");
    }

    /**
     * Handle a start tag. The new tag is pushed
     * onto the tag stack. The attribute list is
     * checked for required attributes.
     */
    protected void startTag(TagElement tag) throws ChangedCharSetException {
        Element elem = tag.getElement();

        // If the tag is an empty tag and textpos != 0
        // this implies that there is text before the
        // start tag that needs to be processed before
        // handling the tag.
        //
        if (!elem.isEmpty() || textpos != 0) {
            handleText(tag);
        } else {
            // this variable gets updated in handleText().
            // Since in this case we do not call handleText()
            // we need to update it here.
            //
            last = tag;
            // Note that we should really check last.breaksFlow before
            // assuming this should be false.
            space = false;
        }
        lastBlockStartPos = currentBlockStartPos;

        // check required attributes
        for (AttributeList a = elem.atts ; a != null ; a = a.next) {
            if ((a.modifier == REQUIRED) &&
                ((attributes.isEmpty()) ||
                 ((!attributes.isDefined(a.name)) &&
                  (!attributes.isDefined(HTML.getAttributeKey(a.name)))))) {
                error("req.att ", a.getName(), elem.getName());
            }
        }

        if (elem.isEmpty()) {
            handleEmptyTag(tag);
            /*
        } else if (elem.getName().equals("form")) {
            handleStartTag(tag);
            */
        } else {
            recent = elem;
            stack = new TagStack(tag, stack);
            handleStartTag(tag);
        }
    }

    /**
     * Handle an end tag. The end tag is popped
     * from the tag stack.
     */
    protected void endTag(boolean omitted) {
        handleText(stack.tag);

        if (omitted && !stack.elem.omitEnd()) {
            error("end.missing", stack.elem.getName());
        } else if (!stack.terminate()) {
            error("end.unexpected", stack.elem.getName());
        }

        // handle the tag
        handleEndTag(stack.tag);
        stack = stack.next;
        recent = (stack != null) ? stack.elem : null;
    }

    boolean ignoreElement(Element elem) {

        String stackElement = stack.elem.getName();
        String elemName = elem.getName();
        /* We ignore all elements that are not valid in the context of
           a table except <td>, <th> (these we handle in
           legalElementContext()) and #pcdata.
           We also ignore the
           <font> tag in the context of <ul> and <ol>. We additionally
           ignore the <meta> and the <style> tag if the body tag has
           been seen. **/
        if ((elemName.equals("html") && seenHtml) ||
            (elemName.equals("head") && seenHead) ||
            (elemName.equals("body") && seenBody)) {
            return true;
        }
        if (elemName.equals("dt") || elemName.equals("dd")) {
            TagStack s = stack;
            while (s != null && !s.elem.getName().equals("dl")) {
                s = s.next;
            }
            if (s == null) {
                return true;
            }
        }

        if (((stackElement.equals("table")) &&
             (!elemName.equals("#pcdata")) && (!elemName.equals("input"))) ||
            ((elemName.equals("font")) &&
             (stackElement.equals("ul") || stackElement.equals("ol"))) ||
            (elemName.equals("meta") && stack != null) ||
            (elemName.equals("style") && seenBody) ||
            (stackElement.equals("table") && elemName.equals("a"))) {
            return true;
        }
        return false;
    }

    /**
     * Marks the first time a tag has been seen in a document
     */
    protected void markFirstTime(Element elem) {
        String elemName = elem.getName();
        if (elemName.equals("html")) {
            seenHtml = true;
        } else if (elemName.equals("head")) {
            seenHead = true;
        } else if (elemName.equals("body")) {
            if (buf.length == 1) {
                // Refer to note in definition of buf for details on this.
                char[] newBuf = new char[256];

                newBuf[0] = buf[0];
                buf = newBuf;
            }
            seenBody = true;
        }
    }

    /**
     * Create a legal content for an element.
     */
    boolean legalElementContext(Element elem) throws ChangedCharSetException {

        // System.out.println("-- legalContext -- " + elem);

        // Deal with the empty stack
        if (stack == null) {
            // System.out.println("-- stack is empty");
            if (elem != dtd.html) {
                // System.out.println("-- pushing html");
                startTag(makeTag(dtd.html, true));
                return legalElementContext(elem);
            }
            return true;
        }

        // Is it allowed in the current context
        if (stack.advance(elem)) {
            // System.out.println("-- legal context");
            markFirstTime(elem);
            return true;
        }
        boolean insertTag = false;

        // The use of all error recovery strategies are contingent
        // on the value of the strict property.
        //
        // These are commonly occurring errors.  if insertTag is true,
        // then we want to adopt an error recovery strategy that
        // involves attempting to insert an additional tag to
        // legalize the context.  The two errors addressed here
        // are:
        // 1) when a <td> or <th> is seen soon after a <table> tag.
        //    In this case we insert a <tr>.
        // 2) when any other tag apart from a <tr> is seen
        //    in the context of a <tr>.  In this case we would
        //    like to add a <td>.  If a <tr> is seen within a
        //    <tr> context, then we will close out the current
        //    <tr>.
        //
        // This insertion strategy is handled later in the method.
        // The reason for checking this now, is that in other cases
        // we would like to apply other error recovery strategies for example
        // ignoring tags.
        //
        // In certain cases it is better to ignore a tag than try to
        // fix the situation.  So the first test is to see if this
        // is what we need to do.
        //
        String stackElemName = stack.elem.getName();
        String elemName = elem.getName();

        if (!strict &&
            ((stackElemName.equals("table") && elemName.equals("td")) ||
             (stackElemName.equals("table") && elemName.equals("th")) ||
             (stackElemName.equals("tr") && !elemName.equals("tr")))){
             insertTag = true;
        }

        if (!strict && !insertTag && (stack.elem.getName() != elem.getName() ||
                                      elem.getName().equals("body"))) {
            if (skipTag = ignoreElement(elem)) {
                error("tag.ignore", elem.getName());
                return skipTag;
            }
        }

        // Check for anything after the start of the table besides tr, td, th
        // or caption, and if those aren't there, insert the <tr> and call
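
The `cp1252Map` table in the listing is documented as mapping numeric character references in the range 130-159 to displayable Unicode characters. The method that reads the table is not on this page, so the following is only a minimal standalone sketch of how such a table could be indexed. It is not the parser's own code: the class name `Cp1252Sketch`, the method `mapNumericReference`, and the pass-through behavior for codes outside 130-159 are assumptions made for illustration.

```java
// Hypothetical standalone sketch; not part of Parser.java.
final class Cp1252Sketch {

    // Same 30 values as cp1252Map in the listing: entry i corresponds
    // to the numeric reference &#(130 + i);.
    private static final char[] CP1252_MAP = {
        8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
        338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
        8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376
    };

    /**
     * Returns the code point to display for a numeric reference.
     * Assumption: codes outside 130-159 are passed through unchanged.
     */
    static int mapNumericReference(int code) {
        if (code < 130 || code > 159) {
            return code;
        }
        return CP1252_MAP[code - 130];
    }

    public static void main(String[] args) {
        // &#150; maps to U+2013 (decimal 8211).
        System.out.println(mapNumericReference(150));
        // &#65; is outside the table and is returned as-is.
        System.out.println(mapNumericReference(65));
    }
}
```

Under these assumptions, `&#150;` resolves to 8211 (U+2013), while entries such as 141 map to themselves because the table stores the same value at those positions.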
