might seem a good way of recognizing a string in single quotes.
But it is an invitation for the program to read far ahead, looking
for a distant single quote. Presented with the input

        'first' quoted string here, 'second' here

the above expression will match

        'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the
form

        '[^'\n]*'

which, on the above input, will stop after 'first'. The
consequences of errors like this are mitigated by the fact that
the . operator will not match newline. Thus expressions like .*
stop on the current line. Don't try to defeat this with
expressions like (.|\n)+ or equivalents; the Lex generated
program will try to read the entire input file, causing internal
buffer overflows.

Note that Lex is normally partitioning the input stream, not
searching for all possible matches of each expression. This means
that each character is accounted for once and only once. For
example, suppose it is desired to count occurrences of both she
and he in an input text. Some Lex rules to do this might be

        she     s++;
        he      h++;
        \n      |
        .       ;

where the last two rules ignore everything besides he and she.
Remember that . does not include newline. Since she includes he,
Lex will normally not recognize the instances of he included in
she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The
action REJECT means ``go do the next alternative.'' It causes
whatever rule was second choice after the current rule to be
executed. The position of the input pointer is adjusted
accordingly. Suppose the user really wants to count the included
instances of he:

        she     {s++; REJECT;}
        he      {h++; REJECT;}
        \n      |
        .       ;

these rules are one way of changing the previous example to do
just that.
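The difference between the two rule sets can be imitated outside
Lex. The C sketch below is only an illustration, not the code Lex
generates: count_partition consumes each matched string, as the
rules without REJECT do, while count_overlapping tests every
starting position, which is the effect REJECT has here. The struct
and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: number of "she" and "he" matches. */
struct counts { int she; int he; };

/* Partitioning scan: the longest match at each position is taken
   and its characters are consumed, so the "he" inside "she" is
   never seen. */
struct counts count_partition(const char *text)
{
    struct counts c = {0, 0};
    size_t i = 0, n;

    assert(text != NULL);
    n = strlen(text);
    while (i < n) {
        if (strncmp(text + i, "she", 3) == 0) {
            c.she++;
            i += 3;
        } else if (strncmp(text + i, "he", 2) == 0) {
            c.he++;
            i += 2;
        } else {
            i++;            /* like the \n and . rules: skip one */
        }
    }
    return c;
}

/* Overlapping scan: both patterns are tried at every position and
   the scan advances one character at a time. */
struct counts count_overlapping(const char *text)
{
    struct counts c = {0, 0};
    size_t i, n;

    assert(text != NULL);
    n = strlen(text);
    for (i = 0; i < n; i++) {
        if (strncmp(text + i, "she", 3) == 0)
            c.she++;
        if (strncmp(text + i, "he", 2) == 0)
            c.he++;
    }
    return c;
}
```

On the input "she he" the first function reports one she and one
he; the second reports one she and two instances of he.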
After counting each expression, it is rejected; whenever
appropriate, the other expression will then be counted. In this
example, of course, the user could note that she includes he but
not vice versa, and omit the REJECT action on he; in other cases,
however, it would not be possible a priori to tell which input
characters were in both classes.

Consider the two rules

        a[bc]+  { ... ; REJECT;}
        a[cd]+  { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only
the second matches. The input string accb matches the first rule
for four characters and then the second rule for three characters.
In contrast, the input accd agrees with the second rule for four
characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to
partition the input stream but to detect all examples of some
items in the input, and the instances of these items may overlap
or include each other. Suppose a digram table of the input is
desired; normally the digrams overlap, that is the word the is
considered to contain both th and he. Assuming a two-dimensional
array named digram to be incremented, the appropriate source is

        %%
        [a-z][a-z]      { digram[yytext[0]][yytext[1]]++; REJECT; }
        .               ;
        \n              ;

where the REJECT is necessary to pick up a letter pair beginning
at every character, rather than at every other character.

6. Lex Source Definitions.

Remember the format of the Lex source:

        {definitions}
        %%
        {rules}
        %%
        {user routines}

So far only the rules have been described. The user needs
additional options, though, to define variables for use in his
program and for use by Lex. These can go either in the
definitions section or in the rules section.

Remember that Lex is turning the rules into a program. Any source
not intercepted by Lex is copied into the generated program.
There are three classes of such things.

1) Any line which is not part of a Lex rule or action which begins
with a blank or tab is copied into the Lex generated program.
Such source input prior to the first %% delimiter will be external
to any function in the code; if it appears immediately after the
first %%, it appears in an appropriate place for declarations in
the function written by Lex which contains the actions. This
material must look like program fragments, and should precede the
first Lex rule. As a side effect of the above, lines which begin
with a blank or tab, and which contain a comment, are passed
through to the generated program. This can be used to include
comments in either the Lex source or the generated code. The
comments should follow the host language convention.

2) Anything included between lines containing only %{ and %} is
copied out as above. The delimiters are discarded. This format
permits entering text like preprocessor statements that must begin
in column 1, or copying lines that do not look like programs.

3) Anything after the third %% delimiter, regardless of formats,
etc., is copied out after the Lex output.

Definitions intended for Lex are given before the first %%
delimiter. Any line in this section not contained between %{ and
%}, and beginning in column 1, is assumed to define Lex
substitution strings. The format of such lines is

        name translation

and it causes the string given as a translation to be associated
with the name. The name and translation must be separated by at
least one blank or tab, and the name must begin with a letter.
The translation can then be called out by the {name} syntax in a
rule. Using {D} for the digits and {E} for an exponent field, for
example, might abbreviate rules to recognize numbers:

        D       [0-9]
        E       [DEde][-+]?{D}+
        %%
        {D}+                    printf("integer");
        {D}+"."{D}*({E})?       |
        {D}*"."{D}+({E})?       |
        {D}+{E}                 printf("real");

Note the first two rules for real numbers; both require a decimal
point and contain an optional exponent field, but the first
requires at least one digit before the decimal point and the
second requires at least one digit after the decimal point.
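The number rules can be checked by hand against sample strings.
The C sketch below is a hand-written equivalent under stated
assumptions, not output from Lex: it tests whether a whole string
matches one of the rules, whereas a Lex program would match the
longest prefix of its input. The names classify, digits, exponent
and the enum values are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum kind { NOT_NUMBER, INTEGER, REAL };

/* Length of the run of digits at s, i.e. of {D}*. */
static size_t digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of an exponent field {E} at s, or 0 if there is none. */
static size_t exponent(const char *s)
{
    size_t n, d;

    if (*s == '\0' || strchr("DEde", *s) == NULL)
        return 0;
    n = 1;
    if (s[n] == '+' || s[n] == '-')
        n++;
    d = digits(s + n);
    return d > 0 ? n + d : 0;   /* {D}+ requires at least one digit */
}

/* Classify a complete string according to the four rules. */
enum kind classify(const char *s)
{
    size_t before, after, e;
    const char *p;

    assert(s != NULL);
    before = digits(s);
    p = s + before;
    if (*p == '\0')                     /* {D}+ */
        return before > 0 ? INTEGER : NOT_NUMBER;
    if (*p == '.') {
        after = digits(p + 1);
        if (before == 0 && after == 0)  /* a lone "." is not a number */
            return NOT_NUMBER;
        p += 1 + after;
        if (*p == '\0')                 /* no exponent field */
            return REAL;
    } else if (before == 0) {
        return NOT_NUMBER;              /* {D}+{E} needs leading digits */
    }
    e = exponent(p);                    /* the rest must be exactly {E} */
    return (e > 0 && p[e] == '\0') ? REAL : NOT_NUMBER;
}
```

With these definitions 35 is classified as an integer, while 35.,
.5, 1e10 and 3.5D-2 are classified as real.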
To correctly handle the problem posed by a Fortran expression such
as 35.EQ.I, which does not contain a real number, a
context-sensitive rule such as

        [0-9]+/"."EQ    printf("integer");

could be used in addition to the normal rule for integers.

The definitions section may also contain other commands, including
the selection of a host language, a character set table, a list of
start conditions, or adjustments to the default size of arrays
within Lex itself for larger source programs. These possibilities
are discussed below under ``Summary of Source Format,'' section 12.

7. Usage.

There are two steps in compiling a Lex source program. First, the
Lex source must be turned into a generated program in the host
general purpose language. Then this program must be compiled and
loaded, usually with a library of Lex subroutines. The generated
program is on a file named lex.yy.c. The I/O library is defined
in terms of the C standard library [6].

The C programs generated by Lex are slightly different on OS/370,
because the OS compiler is less powerful than the UNIX or GCOS
compilers, and does less at compile time. C programs generated on
GCOS and UNIX are the same.

UNIX. The library is accessed by the loader flag -ll. So an
appropriate set of commands is

        lex source
        cc lex.yy.c -ll

The resulting program is placed on the usual file a.out for later
execution. To use Lex with Yacc see below. Although the default
Lex I/O routines use the C standard library, the Lex automata
themselves do not do so; if private versions of input, output and
unput are given, the library can be avoided.

8. Lex and Yacc.

If you want to use Lex with Yacc, note that what Lex writes is a
program named yylex(), the name required by Yacc for its analyzer.
Normally, the default main program on the Lex library calls this
routine, but if Yacc is loaded, and its main program is used, Yacc
will call yylex(). In this case each Lex rule should end with

        return(token);

where the appropriate token value is returned.
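The calling convention can be pictured with a hand-written stand-in
for the analyzer. The C sketch below is an assumption-laden
illustration, not Lex or Yacc output: toy_yylex plays the part of
yylex(), returning one token code per call and 0 at end of input,
and toy_count_tokens plays the part of a caller such as the parser.
All names and the token codes 257 and 258 are hypothetical; in a
real program Yacc defines the token names.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; 0 marks end of input. */
enum { TOK_END = 0, TOK_NUMBER = 257, TOK_WORD = 258 };

/* Remaining input of the toy analyzer. */
static const char *toy_input = "";

/* Stand-in for yylex(): each branch ends by returning a token
   code, just as each Lex rule would end with return(token). */
int toy_yylex(void)
{
    while (*toy_input == ' ' || *toy_input == '\t' || *toy_input == '\n')
        toy_input++;                        /* skip white space */
    if (*toy_input == '\0')
        return TOK_END;
    if (isdigit((unsigned char)*toy_input)) {
        while (isdigit((unsigned char)*toy_input))
            toy_input++;
        return TOK_NUMBER;
    }
    if (isalpha((unsigned char)*toy_input)) {
        while (isalpha((unsigned char)*toy_input))
            toy_input++;
        return TOK_WORD;
    }
    return (unsigned char)*toy_input++;     /* any other character
                                               is its own token */
}

/* Stand-in for the caller: requests tokens until end of input and
   counts those whose code equals the one asked for. */
int toy_count_tokens(const char *text, int code)
{
    int t, n = 0;

    assert(text != NULL);
    toy_input = text;
    while ((t = toy_yylex()) != TOK_END)
        if (t == code)
            n++;
    return n;
}
```

For the input "a 12 b 7" the caller sees two TOK_WORD and two
TOK_NUMBER tokens before the analyzer returns 0.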
An easy way to get access to Yacc's names for tokens is to compile
the Lex output file as part of the Yacc output file by placing the
line

        # include "lex.yy.c"

in the last section of Yacc input. Supposing the grammar to be
named ``good'' and the lexical rules to be named ``better'' the
UNIX command sequence can just be:

        yacc good
        lex better
        cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to
obtain a main program which invokes the Yacc parser. The
generations of Lex and Yacc programs can be done in either order.

9. Examples.

As a trivial problem, consider copying an input file while adding
3 to every positive number divisible by 7. Here is a suitable Lex
source program

        %%
                int k;
        [0-9]+  {
                k = atoi(yytext);
                if (k%7 == 0)
                        printf("%d", k+3);
                else
                        printf("%d",k);
                }

to do just that. The rule [0-9]+ recognizes strings of digits;
atoi converts the digits to binary and stores the result in k.
The operator % (remainder) is used to check whether k is divisible
by 7; if it is, it is incremented by 3 as it is written out. It
may be objected that this program will alter such input items as
49.63 or X7. Furthermore, it increments the absolute value of all
negative numbers divisible by 7. To avoid this, just add a few
more rules after the active one, as here:

        %%
                int k;
        -?[0-9]+        {
                        k = atoi(yytext);
                        printf("%d", k%7 == 0 ? k+3 : k);
                        }
        -?[0-9.]+       ECHO;
        [A-Za-z][A-Za-z0-9]+    ECHO;

Numerical strings containing a ``.'' or preceded by a letter will
be picked up by one of the last two rules, and not changed. The
if-else has been replaced by a C conditional expression to save
space; the form a?b:c means ``if a then b else c''.

For an example of statistics gathering, here is a program which
histograms the lengths of words, where a word is defined as a
string of letters.

                int lengs[100];
        %%
        [a-z]+  lengs[yyleng]++;
        .       |
        \n      ;
        %%
        yywrap()
        {
        int i;
        printf("Length No. words\n");
        for(i=0; i<100; i++)
                if (lengs[i] > 0)
                        printf("%5d%10d\n",i,lengs[i]);
        return(1);
        }

This program accumulates the histogram, while producing no output.
At the end of the input it prints the table. The final statement
return(1); indicates that Lex is to perform wrapup. If yywrap
returns zero (false) it implies that further input is available
and the program is to continue reading and processing. To provide
a yywrap that never returns true causes an infinite loop.

As a larger example, here are some parts of a program written by
N. L. Schryer to convert double precision Fortran to single
precision Fortran. Because Fortran does not distinguish upper and
lower case letters, this routine begins by defining a set of
classes including both cases of each letter:

        a       [aA]
        b       [bB]
        c       [cC]
        ...
        z       [zZ]

An additional class recognizes white space:

        W       [ \t]*

The first rule changes ``double precision'' to ``real'', or
``DOUBLE PRECISION'' to ``REAL''.

        {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
                printf(yytext[0]=='d'? "real" : "REAL");
                }

Care is taken throughout this program to preserve the case (upper
or lower) of the original program. The conditional operator is
used to select the proper form of the keyword. The next rule
copies continuation card indications to avoid confusing them with
constants:

        ^"     "[^ 0]   ECHO;

In the regular expression, the quotes surround the blanks. It is
interpreted as ``beginning of line, then five blanks, then
anything but blank or zero.'' Note the two different meanings of
^. There follow some rules to change double precision constants
to ordinary floating constants.

        [0-9]+{W}{d}{W}[+-]?{W}[0-9]+           |
        [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
        "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {
                /* convert constants */
                for(p=yytext; *p != 0; p++)