📄 lex-docs.txt
字号:
http://www.cs.ucsb.edu/~cs160/machines/lex-docs.txt Lex - A Lexical Analyzer Generator M. E. Lesk and E. Schmidt ABSTRACT Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex source is a table of regular expressions and correspond- ing program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expres- sions. As each such string is recognized the corresponding pro- gram fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream. The lexical analysis programs written with Lex accept ambi- guous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is per- formed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it. Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this compiler-compiler system.1. Introduction. Lex is a program generator designed for lexical processing ofcharacter input streams. It accepts a high-level, problemoriented specification for character string matching, and producesa program in a general purpose language which recognizes regularexpressions. The regular expressions are specified by the user inthe source specifications given to Lex. The Lex written coderecognizes these expressions in an input stream and partitions theinput stream into strings matching the expressions. At theboundaries between strings program sections provided by the userare executed. The Lex source file associates the regularexpressions and the program fragments. As each expression appearsin the input to the program written by Lex, the correspondingfragment is executed. The user supplies the additional code beyond expressionmatching needed to complete his tasks, possibly including codewritten by other generators. The program that recognizes theexpressions is generated in the general purpose programminglanguage employed for the user's program fragments. Thus, a highlevel expression language is provided to write the stringexpressions to be matched while the user's freedom to writeactions is unimpaired. This avoids forcing the user who wishes touse a string manipulation language for input analysis to writeprocessing programs in the same and often inappropriate stringhandling language. Lex is not a complete language, but rather a generatorrepresenting a new language feature which can be added todifferent programming languages, called ``host languages.'' Justas general purpose languages can produce code to run on differentcomputer hardware, Lex can write code in different host languages.The host language is used for the output code generated by Lex andalso for the program fragments added by the user. Compatiblerun-time libraries for the different host languages are alsoprovided. This makes Lex adaptable to different environments anddifferent users. Each application may be directed to thecombination of hardware and host language appropriate to the task,the user's background, and the properties of local implementa-tions. At present, the only supported host language is C,although Fortran (in the form of Ratfor [2] has been available inthe past. Lex itself exists on UNIX, GCOS, and OS/370; but thecode generated by Lex may be taken anywhere the appropriatecompilers exist. Lex turns the user's expressions and actions (called sourcein this memo) into the host general-purpose language; thegenerated program is named yylex. The yylex program willrecognize expressions in a stream (called input in this memo) andperform the specified actions for each expression as it isdetected. See Figure 1. +-------+ Source -> | Lex | -> yylex +-------+ +-------+ Input -> | yylex | -> Output +-------+ An overview of Lex Figure 1 For a trivial example, consider a program to delete from theinput all blanks or tabs at the ends of lines. %% [ \t]+$ ;is all that is required. The program contains a %% delimiter tomark the beginning of the rules, and one rule. This rule containsa regular expression which matches one or more instances of thecharacters blank or tab (written \t for visibility, in accordancewith the C language convention) just prior to the end of a line.The brackets indicate the character class made of blank and tab;the + indicates ``one or more ...''; and the $ indicates ``endof line,'' as in QED. No action is specified, so the programgenerated by Lex (yylex) will ignore these characters. Everythingelse will be copied. To change any remaining string of blanks ortabs to a single blank, add another rule: %% [ \t]+$ ; [ \t]+ printf(" ");The finite automaton generated for this source will scan for bothrules at once, observing at the termination of the string ofblanks or tabs whether or not there is a newline character, andexecuting the desired rule action. The first rule matches allstrings of blanks or tabs at the end of lines, and the second ruleall remaining strings of blanks or tabs. Lex can be used alone for simple transformations, or foranalysis and statistics gathering on a lexical level. Lex canalso be used with a parser generator to perform the lexicalanalysis phase; it is particularly easy to interface Lex and Yacc[3]. Lex programs recognize only regular expressions; Yacc writesparsers that accept a large class of context free grammars, butrequire a lower level analyzer to recognize input tokens. Thus, acombination of Lex and Yacc is often appropriate. When used asa preprocessor for a later parser generator, Lex is used topartition the input stream, and the parser generator assignsstructure to the resulting pieces. The flow of control in sucha case (which might be the first half of a compiler, for exam-ple) is shown in Figure 2. Additional programs, written by othergenerators or by hand, can be added easily to programs written byLex. lexical grammar rules rules | | v v +---------+ +---------+ | Lex | | Yacc | +---------+ +---------+ | | v v +---------+ +---------+ Input -> | yylex | -> | yyparse | -> Parsed input +---------+ +---------+ Lex with Yacc Figure 2Yacc users will realize that the name yylex is what Yacc expectsits lexical analyzer to be named, so that the use of this name byLex simplifies interfacing. Lex generates a deterministic finite automaton from theregular expressions in the source [4]. The automaton isinterpreted, rather than compiled, in order to save space. Theresult is still a fast analyzer. In particular, the time taken bya Lex program to recognize and partition an input stream isproportional to the length of the input. The number of Lex rulesor the complexity of the rules is not important in determiningspeed, unless rules which include forward context require asignificant amount of rescanning. What does increase with thenumber and complexity of rules is the size of the finiteautomaton, and therefore the size of the program generated by Lex. In the program written by Lex, the user's fragments(representing the actions to be performed as each regularexpression is found) are gathered as cases of a switch. Theautomaton interpreter directs the control flow. Opportunity isprovided for the user to insert either declarations or addi-tional statements in the routine containing the actions, or to addsubroutines outside this action routine. Lex is not limited to source which can be interpreted on thebasis of one character lookahead. For example, if there are tworules, one looking for ab and another for abcdefg, and the inputstream is abcdefh, Lex will recognize ab and leave the inputpointer just before cd. . . Such backup is more costly than theprocessing of simpler languages.2. Lex Source. The general format of Lex source is: {definitions} %% {rules} %% {user subroutines}where the definitions and the user subroutines are often omitted.The second %% is optional, but the first is required to mark thebeginning of the rules. The absolute minimum Lex program is thus %%(no definitions, no rules) which translates into a program whichcopies the input to the output unchanged. In the outline of Lex programs shown above, the rulesrepresent the user's control decisions; they are a table, in whichthe left column contains regular expressions (see section 3) andthe right column contains actions, program fragments to beexecuted when the expressions are recognized. Thus an individualrule might appear integer printf("found keyword INT");to look for the string integer in the input stream and print themessage ``found keyword INT'' whenever it appears. In thisexample the host procedural language is C and the C libraryfunction printf is used to print the string. The end of theexpression is indicated by the first blank or tab character. Ifthe action is merely a single C expression, it can just be givenon the right side of the line; if it is compound, or takes morethan a line, it should be enclosed in braces. As a slightly moreuseful example, suppose it is desired to change a number of wordsfrom British to American spelling. Lex rules such as colour printf("color"); mechanise printf("mechanize"); petrol printf("gas");would be a start. These rules are not quite enough, since theword petroleum would become gaseum; a way of dealing with thiswill be described later.3. Lex Regular Expressions. The definitions of regular expressions are very similar tothose in QED [5]. A regular expression specifies a set of stringsto be matched. It contains text characters (which match thecorresponding characters in the strings being compared) andoperator characters (which specify repetitions, choices, and otherfeatures). The letters of the alphabet and the digits are alwaystext characters; thus the regular expression integermatches the string integer wherever it appears and the expression a57Dlooks for the string a57D. Operators. The operator characters are " \ [ ] ^ - ? . * + | ( ) $ / { } % < >and if they are to be used as text characters, an escape should beused. The quotation mark operator (") indicates that whatever iscontained between a pair of quotes is to be taken as textcharacters. Thus xyz"++"matches the string xyz++ when it appears. Note that a part of astring may be quoted. It is harmless but unnecessary to quote anordinary text character; the expression "xyz++"is the same as the one above. Thus by quoting everynon-alphanumeric character being used as a text character, theuser can avoid remembering the list above of current operatorcharacters, and is safe should further extensions to Lex lengthenthe list. An operator character may also be turned into a textcharacter by preceding it with \ as in xyz\+\+which is another, less readable, equivalent of the aboveexpressions. Another use of the quoting mechanism is to get ablank into an expression; normally, as explained above, blanks ortabs end a rule. Any blank character not contained within [] (seebelow) must be quoted. Several normal C escapes with \ arerecognized: \n is newline, \t is tab, and \b is backspace. Toenter \ itself, use \\. Since newline is illegal in anexpression, \n must be used; it is not required to escape tab andbackspace. Every character but blank, tab, newline and the listabove is always a text character. Character classes. Classes of characters can be specifiedusing the operator pair []. The construction [abc] matches asingle character, which may be a, b, or c. Within squarebrackets, most operator meanings are ignored. Only threecharacters are special: these are \ - and ^. The - characterindicates ranges. For example, [a-z0-9<>_]indicates the character class containing all the lower caseletters, the digits, the angle brackets, and underline. Rangesmay be given in either order. Using - between any pair of
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -