⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 lex-docs.txt

📁 windowns 环境的lex和yacc编译器工具
💻 TXT
📖 第 1 页 / 共 4 页
字号:
http://www.cs.ucsb.edu/~cs160/machines/lex-docs.txt                     Lex - A Lexical Analyzer Generator                          M. E. Lesk and E. Schmidt                                  ABSTRACT       Lex helps write programs whose control flow is  directed  by  instances of regular expressions in the input stream.  It is well  suited for editor-script type transformations and for  segmenting  input in preparation for a parsing routine.       Lex source is a table of regular expressions and correspond-  ing  program  fragments.   The  table  is translated to a program  which reads an input stream, copying it to an output  stream  and  partitioning the input into strings which match the given expres-  sions.  As each such string is recognized the corresponding  pro-  gram fragment is executed.  The recognition of the expressions is  performed by a deterministic finite automaton generated  by  Lex.  The  program  fragments  written  by the user are executed in the  order in which the corresponding regular expressions occur in the  input stream.       The lexical analysis programs written with Lex accept  ambi-  guous  specifications  and  choose  the longest match possible at  each input point.  If necessary, substantial  lookahead  is  per-  formed  on  the  input, but the input stream will be backed up to  the end of the current partition, so that the  user  has  general  freedom to manipulate it.       Lex can generate analyzers in either C or Ratfor, a language  which can be translated automatically to portable Fortran.  It is  available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems.  This manual, however, will only discuss generating analyzers in C  on the UNIX system, which is the only supported form of Lex under  UNIX  Version  7.   Lex  is designed to simplify interfacing with  Yacc, for those with access to this compiler-compiler system.1.  Introduction.     Lex is a program generator designed for lexical processing ofcharacter input streams.  It accepts a high-level, problemoriented specification for character string matching, and producesa program in a general purpose language which recognizes regularexpressions.  The regular expressions are specified by the user inthe source specifications given to Lex.  The Lex written coderecognizes these expressions in an input stream and partitions theinput stream into strings matching the expressions.  At theboundaries between strings program sections provided by the userare executed.  The Lex source file associates the regularexpressions and the program fragments.  As each expression appearsin the input to the program written by Lex, the correspondingfragment is executed.     The user supplies the additional code beyond expressionmatching needed to complete his tasks, possibly including codewritten by other generators.  The program that recognizes theexpressions is generated in the general purpose programminglanguage employed for the user's program fragments.  Thus, a highlevel expression language is provided to write the stringexpressions to be matched while the user's freedom to writeactions is unimpaired.  This avoids forcing the user who wishes touse a string manipulation language for input analysis to writeprocessing programs in the same and often inappropriate stringhandling language.     Lex is not a complete language, but rather a generatorrepresenting a new language feature which can be added todifferent programming languages, called ``host languages.'' Justas general purpose languages can produce code to run on differentcomputer hardware, Lex can write code in different host languages.The host language is used for the output code generated by Lex andalso for the program fragments added by the user.  Compatiblerun-time libraries for the different host languages are alsoprovided.  This makes Lex adaptable to different environments anddifferent users.  Each application may be directed to thecombination of hardware and host language appropriate to the task,the user's background, and the properties of local implementa-tions.  At present, the only supported host language is C,although Fortran (in the form of Ratfor [2] has been available inthe past.  Lex itself exists on UNIX, GCOS, and OS/370; but thecode generated by Lex may be taken anywhere the appropriatecompilers exist.     Lex turns the user's expressions and actions (called sourcein this memo) into the host general-purpose language; thegenerated program is named yylex.  The yylex program willrecognize expressions in a stream (called input in this memo) andperform the specified actions for each expression as it isdetected.  See Figure 1.                                  +-------+                        Source -> |  Lex  |  -> yylex                                  +-------+                                  +-------+                        Input ->  | yylex | -> Output                                  +-------+                             An overview of Lex                                  Figure 1     For a trivial example, consider a program to delete from theinput all blanks or tabs at the ends of lines.                                 %%                                 [ \t]+$   ;is all that is required.  The program contains a %% delimiter tomark the beginning of the rules, and one rule.  This rule containsa regular expression which matches one or more instances of thecharacters blank or tab (written \t for visibility, in accordancewith the C language convention) just prior to the end of a line.The brackets indicate the character class made of blank and tab;the + indicates ``one or more ...''; and the $ indicates ``endof line,'' as in QED.  No action is specified, so the programgenerated by Lex (yylex) will ignore these characters.  Everythingelse will be copied.  To change any remaining string of blanks ortabs to a single blank, add another rule:                           %%                           [ \t]+$   ;                           [ \t]+    printf(" ");The finite automaton generated for this source will scan for bothrules at once, observing at the termination of the string ofblanks or tabs whether or not there is a newline character, andexecuting the desired rule action.  The first rule matches allstrings of blanks or tabs at the end of lines, and the second ruleall remaining strings of blanks or tabs.     Lex can be used alone for simple transformations, or foranalysis and statistics gathering on a lexical level.  Lex canalso be used with a parser generator to perform the lexicalanalysis phase; it is particularly easy to interface Lex and Yacc[3].  Lex programs recognize only regular expressions; Yacc writesparsers that accept a large class of context free grammars, butrequire a lower level analyzer to recognize input tokens.  Thus, acombination of Lex and Yacc is often appropriate.  When used asa preprocessor for a later parser generator, Lex is used topartition the input stream, and the parser generator assignsstructure to the resulting pieces.  The flow of control in sucha case (which might be the first half of a compiler, for exam-ple) is shown in Figure 2.  Additional programs, written by othergenerators or by hand, can be added easily to programs written byLex.                        lexical        grammar                         rules          rules                           |              |                           v              v                      +---------+    +---------+                      |   Lex   |    |  Yacc   |                      +---------+    +---------+                           |              |                           v              v                      +---------+    +---------+             Input -> |  yylex  | -> | yyparse | -> Parsed input                      +---------+    +---------+                            Lex with Yacc                               Figure 2Yacc users will realize that the name yylex is what Yacc expectsits lexical analyzer to be named, so that the use of this name byLex simplifies interfacing.     Lex generates a deterministic finite automaton from theregular expressions in the source [4].  The automaton isinterpreted, rather than compiled, in order to save space.  Theresult is still a fast analyzer.  In particular, the time taken bya Lex program to recognize and partition an input stream isproportional to the length of the input.  The number of Lex rulesor the complexity of the rules is not important in determiningspeed, unless rules which include forward context require asignificant amount of rescanning.  What does increase with thenumber and complexity of rules is the size of the finiteautomaton, and therefore the size of the program generated by Lex.     In the program written by Lex, the user's fragments(representing the actions to be performed as each regularexpression is found) are gathered as cases of a switch.  Theautomaton interpreter directs the control flow.  Opportunity isprovided for the user to insert either declarations or addi-tional statements in the routine containing the actions, or to addsubroutines outside this action routine.     Lex is not limited to source which can be interpreted on thebasis of one character lookahead.  For example, if there are tworules, one looking for ab and another for abcdefg, and the inputstream is abcdefh, Lex will recognize ab and leave the inputpointer just before cd. . .  Such backup is more costly than theprocessing of simpler languages.2.  Lex Source.     The general format of Lex source is:                             {definitions}                             %%                             {rules}                             %%                             {user subroutines}where the definitions and the user subroutines are often omitted.The second %% is optional, but the first is required to mark thebeginning of the rules.  The absolute minimum Lex program is thus                                     %%(no definitions, no rules) which translates into a program whichcopies the input to the output unchanged.     In the outline of Lex programs shown above, the rulesrepresent the user's control decisions; they are a table, in whichthe left column contains regular expressions (see section 3) andthe right column contains actions, program fragments to beexecuted when the expressions are recognized.  Thus an individualrule might appear                   integer   printf("found keyword INT");to look for the string integer in the input stream and print themessage ``found keyword INT'' whenever it appears.  In thisexample the host procedural language is C and the C libraryfunction printf is used to print the string.  The end of theexpression is indicated by the first blank or tab character.  Ifthe action is merely a single C expression, it can just be givenon the right side of the line; if it is compound, or takes morethan a line, it should be enclosed in braces.  As a slightly moreuseful example, suppose it is desired to change a number of wordsfrom British to American spelling.  Lex rules such as                      colour      printf("color");                      mechanise   printf("mechanize");                      petrol      printf("gas");would be a start.  These rules are not quite enough, since theword petroleum would become gaseum; a way of dealing with thiswill be described later.3.  Lex Regular Expressions.     The definitions of regular expressions are very similar tothose in QED [5].  A regular expression specifies a set of stringsto be matched.  It contains text characters (which match thecorresponding characters in the strings being compared) andoperator characters (which specify repetitions, choices, and otherfeatures).  The letters of the alphabet and the digits are alwaystext characters; thus the regular expression                                 integermatches the string integer wherever it appears and the expression                                    a57Dlooks for the string a57D.     Operators.  The operator characters are                   " \ [ ] ^ - ? . * + | ( ) $ / { } % < >and if they are to be used as text characters, an escape should beused.  The quotation mark operator (") indicates that whatever iscontained between a pair of quotes is to be taken as textcharacters.  Thus                                   xyz"++"matches the string xyz++ when it appears.  Note that a part of astring may be quoted.  It is harmless but unnecessary to quote anordinary text character; the expression                                   "xyz++"is the same as the one above.  Thus by quoting everynon-alphanumeric character being used as a text character, theuser can avoid remembering the list above of current operatorcharacters, and is safe should further extensions to Lex lengthenthe list.     An operator character may also be turned into a textcharacter by preceding it with \ as in                                   xyz\+\+which is another, less readable, equivalent of the aboveexpressions.  Another use of the quoting mechanism is to get ablank into an expression; normally, as explained above, blanks ortabs end a rule.  Any blank character not contained within [] (seebelow) must be quoted.  Several normal C escapes with \ arerecognized: \n is newline, \t is tab, and \b is backspace.  Toenter \ itself, use \\.  Since newline is illegal in anexpression, \n must be used; it is not required to escape tab andbackspace.  Every character but blank, tab, newline and the listabove is always a text character.     Character classes.  Classes of characters can be specifiedusing the operator pair [].  The construction [abc] matches asingle character, which may be a, b, or c.  Within squarebrackets, most operator meanings are ignored.  Only threecharacters are special: these are \ - and ^.  The - characterindicates ranges.  For example,                                 [a-z0-9<>_]indicates the character class containing all the lower caseletters, the digits, the angle brackets, and underline.  Rangesmay be given in either order.  Using - between any pair of

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -