📄 htuml2txt.lex
/********
 * $Id: htuml2txt.lex,v 1.1 2000/05/25 18:07:05 golda Exp $
 * $Log: htuml2txt.lex,v $
 * Revision 1.1 2000/05/25 18:07:05 golda
 * Added Christian's changes to allow dynamic filters. I believe this has
 * only been tested on Linux systems. --GV
 *
 * Revision 1.5 1999/11/06 21:25:07 cvogler
 * - Fixed bug that did not recognize the end of a comment correctly.
 *
 * Revision 1.4 1999/11/06 06:55:08 cvogler
 * - Added support for &gt; and &lt; (greater than, and less than).
 * - Fixed problems with the matching rules for non-spacing tags that
 *   caused linefeeds to be incorrectly suppressed. As a result, jumping
 *   to line numbers from webglimpse searches did not work.
 *
 *
 * htuml2txt.lex
 *
 * A faster HTML filter for WebGlimpse than htuml2txt.pl. I found that
 * the spawning of all the perl processes by glimpse was way too expensive
 * to be practical. In particular, searching 2000 files for a frequently
 * occurring term took more than 30 seconds on a PII-400/Linux 2.2.5
 * machine. Rewriting the filter as a set of lex rules reduced the search
 * time to 5 seconds, which is on par with the simple html2txt filter.
 *
 * Suggested options for compiling on i386/Linux with egcs 1.1.2/flex 2.5.4:
 *   flex -F -8 htuml2txt.lex
 *   gcc -O3 -fomit-frame-pointer -o htuml2txt lex.yy.c -lfl
 *
 * Note: For a smaller, slightly slower executable, omit the -F switch in
 * the call to flex.
 *
 * Caution: The -8 switch MUST be specified if -f or -F is specified!
 *
 * Note: It is also necessary to edit .glimpse_filters in the
 * WebGlimpse database directories.
 *
 * Suggested options for compiling with AT&T-style lex:
 *   lex htuml2txt.lex
 *   cc -O -o htuml2txt lex.yy.c -ll
 *
 * Written on 5/16/1999 by Christian Vogler
 * Send bug reports and suggestions to cvogler@gradient.cis.upenn.edu.
 ******/

STRING  \"([^\"\n\\]|\\\")*\"
WHITE   [\ \t]

/* HTML tags that are to be eliminated altogether, without even a */
/* substitution with a space */
A       [aA]
B       [bB]
I       [iI]
EM      [eE][mM]
FONT    [fF][oO][nN][tT]
STRONG  [sS][tT][rR][oO][nN][gG]
BIG     [bB][iI][gG]
SUP     [sS][uU][pP]
SUB     [sS][uU][bB]
U       [uU]
STRIKE  [sS][tT][rR][iI][kK][eE]
STYLE   [sS][tT][yY][lL][eE]
NSPTAGS ({A}|{B}|{I}|{EM}|{FONT}|{STRONG}|{BIG}|{SUP}|{SUB}|{U}|{STRIKE}|{STYLE})

/* These allocate the necessary space to make AT&T lex work. */
/* flex ignores them. */
%e 4000
%p 10000
%n 2000

/* treat inside of HTML comments and tags specially, to ensure that */
/* everything inside them is eliminated, even if they contain quotes */
%s COMMENT
%s TAG
%s BEGINTAG

%%

<COMMENT>[^\-\"\n\r]+       {/* This ruleset eats up all */}
<COMMENT>-+[^\-\>\"\n\r]+   {/* HTML comments */}
<COMMENT>-\>                {/* none */}
<COMMENT>{STRING}           {/* none */}
<COMMENT>-{2,}\>            BEGIN(INITIAL);

<TAG>[^\"\>\r\n]+           {/* This ruleset discards all */}
<TAG>{STRING}               {/* HTML tags */}
<TAG>\>                     BEGIN(INITIAL);

<BEGINTAG>{WHITE}+          {/* eat whitespace to find tag name */}
<BEGINTAG>!--               BEGIN(COMMENT); /* HTML comment */
<BEGINTAG>\/                {/* eat slash in tags */}
<BEGINTAG>{NSPTAGS}         BEGIN(TAG); /* tag to be eliminated altogether */
<BEGINTAG>\>                { fputc(' ', yyout); BEGIN(INITIAL); /* whoa. Empty tag?!? Replace with space */ };
<BEGINTAG>[A-Za-z0-9]+      |
<BEGINTAG>[^\r\n]           { fputc(' ', yyout); BEGIN(TAG); /* all else is a tag to be replaced with a space */ }

<INITIAL>\<                 BEGIN(BEGINTAG); /* tag that must be analyzed further (comment, spacing tag, non-spacing tag) */
<INITIAL>&nbsp;             fputc(' ', yyout); /* replace special */
<INITIAL>&iexcl;            fputc('