The expression
'.*'
might seem a good way of recognizing a string in single quotes.
But it is an invitation for the program to read far ahead, looking
for a distant single quote. Presented with the input
'first' quoted string here, 'second' here
the above expression will match
'first' quoted string here, 'second'
which is probably not what was wanted. A better rule is of the
form
'[^'\n]*'
which, on the above input, will stop after 'first'. The
consequences of errors like this are mitigated by the fact that
the . operator will not match newline. Thus expressions like .*
stop on the current line. Don't try to defeat this with
expressions like (.|\n)+ or equivalents; the Lex generated
program will try to read the entire input file, causing internal
buffer overflows.
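As a minimal sketch, a scanner built around the safer rule might
look as follows; the printf format and the fall-through rules that
discard other text are illustrative additions, not part of the
rule itself.
%%
'[^'\n]*'	printf("quoted: %s\n", yytext);
\n	|
.	;
On the input above, this reports 'first' and 'second' separately.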
Note that Lex is normally partitioning the input stream, not
searching for all possible matches of each expression. This means
that each character is accounted for once and only once. For
example, suppose it is desired to count occurrences of both she
and he in an input text. Some Lex rules to do this might be
she s++;
he h++;
\n |
. ;
where the last two rules ignore everything besides he and she.
Remember that . does not include newline. Since she includes he,
Lex will normally not recognize the instances of he included in
she, since once it has passed a she those characters are gone.
Sometimes the user would like to override this choice. The
action REJECT means ``go do the next alternative.'' It causes
whatever rule was second choice after the current rule to be
executed. The position of the input pointer is adjusted
accordingly. Suppose the user really wants to count the included
instances of he:
she {s++; REJECT;}
he {h++; REJECT;}
\n |
. ;
these rules are one way of changing the previous example to do
just that. After counting each expression, it is rejected;
whenever appropriate, the other expression will then be counted.
In this example, of course, the user could note that she includes
he but not vice versa, and omit the REJECT action on he; in other
cases, however, it would not be possible a priori to tell which
input characters were in both classes.
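Filled out into a complete program, with the counter declarations
and a yywrap routine to print the totals added for illustration,
the example might read:
	int s, h;	/* counts of she and he */
%%
she	{s++; REJECT;}
he	{h++; REJECT;}
\n	|
.	;
%%
yywrap()
{
	printf("she %d\nhe %d\n", s, h);
	return(1);
}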
Consider the two rules
a[bc]+ { ... ; REJECT;}
a[cd]+ { ... ; REJECT;}
If the input is ab, only the first rule matches, and on ad only
the second matches. The input string accb matches the first rule
for four characters and then the second rule for three characters.
In contrast, the input accd agrees with the second rule for four
characters and then the first rule for three.
In general, REJECT is useful whenever the purpose of Lex is
not to partition the input stream but to detect all examples of
some items in the input, and the instances of these items may
overlap or include each other. Suppose a digram table of the
input is desired; normally the digrams overlap, that is, the word
the is considered to contain both th and he. Assuming a
two-dimensional array named digram to be incremented, the
appropriate source is
%%
[a-z][a-z]	{
		digram[yytext[0]][yytext[1]]++;
		REJECT;
	}
. ;
\n ;
where the REJECT is necessary to pick up a letter pair beginning
at every character, rather than at every other character.
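As a complete sketch, with the array declaration and a yywrap that
prints the nonzero counts added for illustration, the digram
counter might read:
	int digram[128][128];
%%
[a-z][a-z]	{
		digram[yytext[0]][yytext[1]]++;
		REJECT;
	}
.	;
\n	;
%%
yywrap()
{
	int i, j;
	for (i = 'a'; i <= 'z'; i++)
		for (j = 'a'; j <= 'z'; j++)
			if (digram[i][j] > 0)
				printf("%c%c %d\n", i, j, digram[i][j]);
	return(1);
}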
6. Lex Source Definitions.
Remember the format of the Lex source:
{definitions}
%%
{rules}
%%
{user routines}
So far only the rules have been described. The user needs
additional options, though, to define variables for use in his
program and for use by Lex. These can go either in the
definitions section or in the rules section.
Remember that Lex is turning the rules into a program. Any
source not intercepted by Lex is copied into the generated
program. There are three classes of such things.
1) Any line which is not part of a Lex rule or action which
begins with a blank or tab is copied into the Lex generated
program. Such source input prior to the first %% delimiter will
be external to any function in the code; if it appears immediately
after the first %%, it appears in an appropriate place for
declarations in the function written by Lex which contains the
actions. This material must look like program fragments, and
should precede the first Lex rule. As a side effect of the above,
lines which begin with a blank or tab, and which contain a
comment, are passed through to the generated program. This can be
used to include comments in either the Lex source or the generated
code. The comments should follow the host language convention.
2) Anything included between lines containing only %{ and %} is
copied out as above. The delimiters are discarded. This format
permits entering text like preprocessor statements that must begin
in column 1, or copying lines that do not look like programs.
3) Anything after the second %% delimiter, regardless of format,
is copied out after the Lex output.
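A short fragment illustrating all three classes, assuming C as the
host language:
%{
#include <stdio.h>	/* class 2: copied although # is in column 1 */
%}
	int count;	/* class 1: begins with a tab; an external declaration */
%%
	/* class 1: placed with the declarations in the scanning function */
[a-z]+	count++;
%%
/* class 3: copied out after the Lex output */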
Definitions intended for Lex are given before the first %%
delimiter. Any line in this section not contained between %{ and
%}, and beginning in column 1, is assumed to define Lex
substitution strings. The format of such lines is
name translation
and it causes the string given as a translation to be associated
with the name. The name and translation must be separated by at
least one blank or tab, and the name must begin with a letter.
The translation can then be called out by the {name} syntax in a
rule. Using {D} for the digits and {E} for an exponent field, for
example, might abbreviate rules to recognize numbers:
D [0-9]
E [DEde][-+]?{D}+
%%
{D}+ printf("integer");
{D}+"."{D}*({E})? |
{D}*"."{D}+({E})? |
{D}+{E}	printf("real");
Note the first two rules for real numbers; both require a decimal
point and contain an optional exponent field, but the first
requires at least one digit before the decimal point and the
second requires at least one digit after the decimal point. To
correctly handle the problem posed by a Fortran expression such as
35.EQ.I, which does not contain a real number, a context-sensitive
rule such as
[0-9]+/"."EQ printf("integer");
could be used in addition to the normal rule for integers.
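Assembled into one source, with fall-through rules added for
illustration to echo anything unrecognized, the number recognizer
reads:
D	[0-9]
E	[DEde][-+]?{D}+
%%
{D}+/"."EQ	printf("integer");
{D}+	printf("integer");
{D}+"."{D}*({E})?	|
{D}*"."{D}+({E})?	|
{D}+{E}	printf("real");
\n	|
.	ECHO;
On input such as 35.EQ.I, the trailing context rule matches 35 as
an integer without consuming the decimal point.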
The definitions section may also contain other commands,
including the selection of a host language, a character set table,
a list of start conditions, or adjustments to the default size
of arrays within Lex itself for larger source programs. These
possibilities are discussed below under ``Summary of Source
Format,'' section 12.
7. Usage.
There are two steps in compiling a Lex source program.
First, the Lex source must be turned into a generated program in
the host general purpose language. Then this program must be
compiled and loaded, usually with a library of Lex subroutines.
The generated program is on a file named lex.yy.c. The I/O
library is defined in terms of the C standard library [6].
The C programs generated by Lex are slightly different on
OS/370, because the OS compiler is less powerful than the UNIX or
GCOS compilers, and does less at compile time. C programs
generated on GCOS and UNIX are the same.
UNIX. The library is accessed by the loader flag -ll. So an
appropriate set of commands is
lex source
cc lex.yy.c -ll
The resulting program is placed on the usual file a.out for later
execution. To use Lex with Yacc see below. Although the default
Lex I/O routines use the C standard library, the Lex automata
themselves do not do so; if private versions of input, output and
unput are given, the library can be avoided.
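As a sketch of what private versions might look like, the routines
below scan an in-memory string instead of the standard input. The
interface assumed here is the traditional character-level one:
input() returns the next character (zero at end of input),
unput(c) pushes a character back, and output(c) writes one
character. The string sourcetext is hypothetical, and if the
generated scanner defines these routines as macros, those
definitions must be removed or overridden first.
char *sourcetext = "text to be scanned\n";	/* hypothetical input */
static char pushback[64];
static int npushed = 0;

input()
{
	if (npushed > 0)
		return(pushback[--npushed]);
	return(*sourcetext ? *sourcetext++ : 0);
}

unput(c)
{
	pushback[npushed++] = c;
}

output(c)
{
	putchar(c);
}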
8. Lex and Yacc.
If you want to use Lex with Yacc, note that what Lex writes
is a program named yylex(), the name required by Yacc for its
analyzer. Normally, the default main program on the Lex library
calls this routine, but if Yacc is loaded, and its main program is
used, Yacc will call yylex(). In this case each Lex rule should
end with
return(token);
where the appropriate token value is returned. An easy way to get
access to Yacc's names for tokens is to compile the Lex output
file as part of the Yacc output file by placing the line
# include "lex.yy.c"
in the last section of Yacc input. Supposing the grammar to be
named ``good'' and the lexical rules to be named ``better'' the
UNIX command sequence can just be:
yacc good
lex better
cc y.tab.c -ly -ll
The Yacc library (-ly) should be loaded before the Lex library, to
obtain a main program which invokes the Yacc parser. The
generations of Lex and Yacc programs can be done in either order.
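A rule in better might then end by returning a token value known
to Yacc; the token name NUMBER below is illustrative and would be
defined by the grammar:
[0-9]+	{ yylval = atoi(yytext); return(NUMBER); }
while the file good would end with
%%
# include "lex.yy.c"
so that lex.yy.c is compiled with the token definitions in scope.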
9. Examples.
As a trivial problem, consider copying an input file while
adding 3 to every positive number divisible by 7. Here is a
suitable Lex source program
%%
	int k;
[0-9]+	{
		k = atoi(yytext);
		if (k%7 == 0)
			printf("%d", k+3);
		else
			printf("%d", k);
	}
to do just that. The rule [0-9]+ recognizes strings of digits;
atoi converts the digits to binary and stores the result in k.
The operator % (remainder) is used to check whether k is divisible
by 7; if it is, it is incremented by 3 as it is written out. It
may be objected that this program will alter such input items as
49.63 or X7. Furthermore, it increments the absolute value of all
negative numbers divisible by 7. To avoid this, just add a few
more rules after the active one, as here:
%%
	int k;
-?[0-9]+	{
		k = atoi(yytext);
		printf("%d", k%7 == 0 ? k+3 : k);
	}
-?[0-9.]+ ECHO;
[A-Za-z][A-Za-z0-9]+ ECHO;
Numerical strings containing a ``.'' or preceded by a letter will
be picked up by one of the last two rules, and not changed. The
if-else has been replaced by a C conditional expression to save
space; the form a?b:c means ``if a then b else c''.
For an example of statistics gathering, here is a program
which histograms the lengths of words, where a word is defined
as a string of letters.
	int lengs[100];
%%
[a-z]+ lengs[yyleng]++;
. |
\n ;
%%
yywrap()
{
	int i;
	printf("Length No. words\n");
	for (i = 0; i < 100; i++)
		if (lengs[i] > 0)
			printf("%5d%10d\n", i, lengs[i]);
	return(1);
}
This program accumulates the histogram, while producing no output.
At the end of the input it prints the table. The final statement
return(1); indicates that Lex is to perform wrapup. If yywrap
returns zero (false) it implies that further input is available
and the program is to continue reading and processing. Providing
a yywrap that never returns true causes an infinite loop.
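One common pattern is a yywrap that returns false exactly once, to
switch to a second input file before finishing. In this sketch the
file name is hypothetical; yyin is the input file pointer used by
the generated program.
yywrap()
{
	static int switched = 0;
	if (switched)
		return(1);	/* second call: really done */
	switched = 1;
	yyin = fopen("more.input", "r");	/* hypothetical next file */
	return(yyin == NULL);	/* zero means continue reading */
}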
As a larger example, here are some parts of a program written
by N. L. Schryer to convert double precision Fortran to single
precision Fortran. Because Fortran does not distinguish upper and
lower case letters, this routine begins by defining a set of
classes including both cases of each letter:
a [aA]
b [bB]
c [cC]
...
z [zZ]
An additional class recognizes white space:
W [ \t]*
The first rule changes ``double precision'' to ``real'', or
``DOUBLE PRECISION'' to ``REAL''.
{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
printf(yytext[0]=='d'? "real" : "REAL");
}
Care is taken throughout this program to preserve the case (upper
or lower) of the original program. The conditional operator is
used to select the proper form of the keyword. The next rule
copies continuation card indications to avoid confusing them
with constants:
^" "[^ 0] ECHO;
In the regular expression, the quotes surround the blanks. It is
interpreted as ``beginning of line, then five blanks, then
anything but blank or zero.'' Note the two different meanings of
^. There follow some rules to change double precision constants
to ordinary floating constants.
[0-9]+{W}{d}{W}[+-]?{W}[0-9]+ |
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+ |
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+ {
/* convert constants */
	for (p = yytext; *p != 0; p++)	/* p is a char * declared elsewhere */
		if (*p == 'd' || *p == 'D')
			*p += 'e' - 'd';	/* d or D becomes e or E, preserving case */
	ECHO;
	}