📄 lib_4.0_1.fix
字号:
<HTML><HEAD><TITLE>SGML fix</TITLE><!-- Changed by: Henrik Frystyk Nielsen, 18-Jan-1996 --></HEAD><BODY><H1>SGML Fix</H1>I have a proposition to improve a little the SGML module. A small patch I send here improves theerror recovery in parsing ill-formed HTML documents. I have a proposition to improve a little theSGML module. A small patch I send here improves the error recovery in parsing ill-formed HTMLdocuments. <P>The end_element function in sgml.c hanles ill-nested tags pretty well, ignoring bad end tags orassuming corresponding start tags. The rule used now states that if we encounter an ending tagwithout the corresponding start tag on the top of the element=20 stack, we do one of the following:<OL><LI>If we are on the last level (that is there is only <HTML> on the element stack), we ignorethe bad ending tag.<LI>If there are some other elements (start tags) on the stack, we assume the corresponding end tags(except </HTML>), free the stack from those elements, and=20 then check again the tag whichcaused the whole problem against the rules 1 and 2.</OL>In some cases this algorithm doesn't work well. Consider the example:<PRE> <HTML> <BODY> <UL> <LI> <A...> <B> ... </A> </I> </B> <LI> ... </UL> </BODY> </HTML></PRE>There are two errors in this example:<OL><LI><A> and <B> are overlapping<LI>There is an </I> without any <I></OL>According to the rules given above, this is parsed exactly as shown below:<PRE> <HTML> <BODY> <UL> <LI> <A...> <B> ... </B> </A> </UL> </BODY> <LI> ... </HTML></PRE>This is because:<OL><LI>When the parser encounters </A> (line 4 of the original example) it assumes </B> and then goesto </A>, which is now nested perfectly.<LI>Next it encounters </I>, so it assumes all open start tags (except <HTML>); this gives</UL> </BODY>.<LI> Again with </I>, the parser is now on the last level (only <HTML> hasn't been closed), so itignores </I>.<LI> Now the parser encounters <LI>. Note that it's a <LI> outside <UL>! What=20 happen nextdepends on the quality of the HTML module (my version of html.c was crashing until I corrected it).</OL>The problem is that too many ending tags are assumed. I propose to ignore all the ill-nested endtags which doesn't have a corresponding opening tag on the element stack (on the whole stack, notonly on its top). Considering the example, after seeing=20 </I> we simply ignore it (and notassume all pending tags and then ignore it). <P>So, after this modification the example is parsed as:<PRE> <HTML> <BODY> <UL> <LI> <A...> <B> ... </B> </A> <LI> ... </UL> </BODY> </HTML></PRE>Here's the patch:<PRE>/*** Helper function to check if the tag is on the stack*/PRIVATE BOOL lookup_element_stack (HTElement* stack, HTTag *tag){ HTElement* elem; for (elem = stack; elem != NULL; elem = elem->next) { if (elem->tag == tag) return TRUE; } return FALSE;}/* ** Modified end_element function** Only one line is added, it's marked with <==*/PRIVATE void end_element (HTStream * context, HTTag * old_tag){ if (SGML_TRACE) TTYPrint(TDEST, "SGML: End </%s>\n", old_tag->name); if (old_tag->contents == SGML_EMPTY) { if (SGML_TRACE) TTYPrint(TDEST,"SGML: Illegal end tag </%s> found.\n", old_tag->name); return; } while (context->element_stack) {/* Loop is error path only */ HTElement * N = context->element_stack; HTTag * t = N->tag; if (old_tag != t) { /* Mismatch: syntax error */ if (context->element_stack->next /* This is not the last level */ && lookup_element_stack(context->element_stack, old_tag)) { /* <== */ if (SGML_TRACE) TTYPrint(TDEST, "SGML: Found </%s> when expecting </%s>. </%s> assumed.\n", old_tag->name, t->name, t->name); } else { /* last level */ if (SGML_TRACE) TTYPrint(TDEST, "SGML: Found </%s> when expecting </%s>. </%s> Ignored.\n", old_tag->name, t->name, old_tag->name); return; /* Ignore */ } } context->element_stack = N->next; /* Remove from stack */ free(N); (*context->actions->end_element)(context->target, t - context->dtd->tags); if (old_tag == t) return; /* Correct sequence */ /* Syntax error path only */ } if (SGML_TRACE) TTYPrint(TDEST, "SGML: Extra end tag </%s> found and ignored.\n", old_tag->name);}</PRE>Tha't all for now. I have also another proposition concerning SGML module (better handling of <P>tags), but in my opinion it will need some discussion, and I don't know if you are interested inthis.<P><HR><ADDRESS>Maciej Puzio, puzio@laser.mimuw.edu.pl</ADDRESS></BODY></HTML>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -