{
if (*p == 'd' || *p == 'D')
	*p += 'e' - 'd';
ECHO;
}
After the floating point constant is recognized, it is scanned by
the for loop to find the letter d or D. The program then adds
'e'-'d', which converts it to the next letter of the alphabet.
The modified constant, now single-precision, is written out again.
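The enclosing rule is not shown in full above; a sketch of how
it might read is given below. The exact patterns, the {W}
definition for optional embedded blanks, and the declaration of
the pointer p (say, char *p; in the definitions section) are
assumptions here, not part of the original text.
[0-9]+{W}[dD]{W}[+-]?{W}[0-9]+	|
[0-9]+{W}"."{W}[0-9]*{W}[dD]{W}[+-]?{W}[0-9]+	{
	/* convert a double-precision constant to single precision;
	   {W} is assumed to stand for optional blanks, e.g. [ \t]* */
	for (p = yytext; *p != 0; p++)
		if (*p == 'd' || *p == 'D')
			*p += 'e' - 'd';
	ECHO;	/* write the modified constant out once */
	}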
There follow a series of names which must be respelled to remove
their initial d. By using the array yytext the same action
suffices for all the names (only a sample of a rather long list is
given here).
{d}{s}{i}{n} |
{d}{c}{o}{s} |
{d}{s}{q}{r}{t} |
{d}{a}{t}{a}{n} |
...
{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
Another list of names must have initial d changed to initial a:
{d}{l}{o}{g} |
{d}{l}{o}{g}10 |
{d}{m}{i}{n}1 |
{d}{m}{a}{x}1	{
	yytext[0] += 'a' - 'd';
	ECHO;
	}
And one routine must have initial d changed to initial r:
{d}1{m}{a}{c}{h}	{yytext[0] += 'r' - 'd';
	ECHO;
	}
To avoid such names as dsinx being detected as instances of dsin,
some final rules pick up longer words as identifiers and copy some
surviving characters:
[A-Za-z][A-Za-z0-9]* |
[0-9]+ |
\n |
. ECHO;
Note that this program is not complete; it does not deal with the
spacing problems in Fortran or with the use of keywords as
identifiers.
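Still, taken together the rules above would turn a hypothetical
line such as
	x = dsqrt(dlog(y) + 1.0d-3) + d1mach(1) + dsinx
into
	x = sqrt(alog(y) + 1.0e-3) + r1mach(1) + dsinx
with dsinx left alone, because the identifier rule matches the
longer string.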
10. Left Context Sensitivity.
Sometimes it is desirable to have several sets of lexical
rules applied at different times in the input. For example,
a compiler preprocessor might distinguish preprocessor
statements and analyze them differently from ordinary statements.
This requires sensitivity to prior context, and there are several
ways of handling such problems. The ^ operator, for example, is
a prior context operator, recognizing immediately preceding left
context just as $ recognizes immediately following right
context. Adjacent left context could be extended, to produce a
facility similar to that for adjacent right context, but it is
unlikely to be as useful, since often the relevant left context
appeared some time earlier, such as at the beginning of a line.
This section describes three means of dealing with different
environments: a simple use of flags, when only a few rules
change from one environment to another; the use of start
conditions on rules; and the possibility of making multiple
lexical analyzers all run together. In each case, there are rules
which recognize the need to change the environment in which the
following input text is analyzed, and set some parameter to
reflect the change. This may be a flag explicitly tested by the
user's action code; such a flag is the simplest way of dealing
with the problem, since Lex is not involved at all. It may be
more convenient, however, to have Lex remember the flags as
initial conditions on the rules. Any rule may be associated with
a start condition. It will only be recognized when Lex is in that
start condition. The current start condition may be changed at
any time. Finally, if the sets of rules for the different
environments are very dissimilar, clarity may be best achieved by
writing several distinct lexical analyzers, and switching from one
to another as desired.
Consider the following problem: copy the input to the output,
changing the word magic to first on every line which began with
the letter a, changing magic to second on every line which began
with the letter b, and changing magic to third on every line which
began with the letter c. All other words and all other lines are
left unchanged.
These rules are so simple that the easiest way to do this job
is with a flag:
int flag;
%%
^a {flag = 'a'; ECHO;}
^b {flag = 'b'; ECHO;}
^c {flag = 'c'; ECHO;}
\n {flag = 0 ; ECHO;}
magic	{
	switch (flag)
	{
	case 'a': printf("first"); break;
	case 'b': printf("second"); break;
	case 'c': printf("third"); break;
	default: ECHO; break;
	}
	}
should be adequate.
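For instance, given the hypothetical input
	a magic bird
	b magic bird
	x magic bird
this scanner writes
	a first bird
	b second bird
	x magic bird
since each newline clears the flag and nothing sets it on the
third line, so magic is simply echoed there.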
To handle the same problem with start conditions, each start
condition must be introduced to Lex in the definitions section
with a line reading
%Start name1 name2 ...
where the conditions may be named in any order. The word Start
may be abbreviated to s or S. The conditions may be referenced
at the head of a rule with the <> brackets:
<name1>expression
is a rule which is only recognized when Lex is in the start
condition name1. To enter a start condition, execute the action
statement
BEGIN name1;
which changes the start condition to name1. To resume the normal
state,
BEGIN 0;
resets the initial condition of the Lex automaton interpreter. A
rule may be active in several start conditions:
<name1,name2,name3>
is a legal prefix. Any rule not beginning with the <> prefix
operator is always active.
The same example as before can be written:
%START AA BB CC
%%
^a {ECHO; BEGIN AA;}
^b {ECHO; BEGIN BB;}
^c {ECHO; BEGIN CC;}
\n {ECHO; BEGIN 0;}
<AA>magic printf("first");
<BB>magic printf("second");
<CC>magic printf("third");
where the logic is exactly the same as in the previous method of
handling the problem, but Lex does the work rather than the user's
code.
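Either version is turned into a running program in the usual
way. Assuming the source is in a file named magic.l (the name is
arbitrary), a sequence along the lines of
	lex magic.l
	cc lex.yy.c -ll
	a.out <input >output
generates the scanner in lex.yy.c, compiles it, and loads it with
the Lex library, whose default main simply calls yylex on the
standard input.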
11. Character Set.
The programs generated by Lex handle character I/O only
through the routines input, output, and unput. Thus the
character representation provided in these routines is accepted by
Lex and employed to return values in yytext. For internal use a
character is represented as a small integer which, if the standard
library is used, has a value equal to the integer value of the bit
pattern representing the character on the host computer.
Normally, the letter a is represented in the same form as the
character constant 'a'. If this interpretation is changed, by
providing I/O routines which translate the characters, Lex must be
told about it, by giving a translation table. This table must be
in the definitions section, and must be bracketed by lines
containing only ``%T''. The table contains lines of the form
{integer} {character string}
which indicate the value associated with each character. Thus the
next example
%T
1 Aa
2 Bb
...
26 Zz
27 \n
28 +
29 -
30 0
31 1
...
39 9
%T
Sample character table.
maps the lower and upper case letters together into the integers 1
through 26, newline into 27, + and - into 28 and 29, and the
digits into 30 through 39. Note the escape for newline. If a
table is supplied, every character that is to appear either in the
rules or in any valid input must be included in the table. No
character may be assigned the number 0, and no character may be
assigned a bigger number than the size of the hardware character
set.
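As an illustration of the kind of translation such a table
describes, the C sketch below maps external characters to the
internal codes of the sample table above and back again. The
routine names, and the way they would be folded into input,
output, and unput, are illustrative assumptions; Lex itself only
requires that the %T table agree with whatever codes the I/O
routines actually deliver.
	#include <ctype.h>

	/* Illustrative helpers, not part of Lex. */

	/* external character -> internal code, per the sample table */
	int to_internal(int c)
	{
		if (isupper(c))
			c = tolower(c);		/* Aa ... Zz map together */
		if (islower(c))
			return c - 'a' + 1;	/* letters -> 1 .. 26 */
		if (c == '\n')
			return 27;
		if (c == '+')
			return 28;
		if (c == '-')
			return 29;
		if (isdigit(c))
			return c - '0' + 30;	/* digits -> 30 .. 39 */
		return 0;			/* absent from the table; 0 is reserved */
	}

	/* internal code -> external character, for output */
	int to_external(int v)
	{
		static char ext[] = "?abcdefghijklmnopqrstuvwxyz\n+-0123456789";
		return (v >= 1 && v <= 39) ? ext[v] : '?';
	}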
12. Summary of Source Format.
The general form of a Lex source file is:
{definitions}
%%
{rules}
%%
{user subroutines}
The definitions section contains a combination of
1) Definitions, in the form ``name space translation''.
2) Included code, in the form ``space code''.
3) Included code, in the form
%{
code
%}
4) Start conditions, given in the form
%S name1 name2 ...
5) Character set tables, in the form
%T
number space character-string
...
%T
6) Changes to internal array sizes, in the form
%x nnn
where nnn is a decimal integer representing an array size and
x selects the parameter as follows:
Letter Parameter
p positions
n states
e tree nodes
a transitions
k packed character classes
o output array size
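For example, a source that needs more positions than the default
allocation provides might include a line such as
	%p 3000
in its definitions section; the value actually required depends
on the particular source.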
Lines in the rules section have the form ``expression action''
where the action may be continued on succeeding lines by using
braces to delimit it.
Regular expressions in Lex use the following operators:
x the character "x"
"x" an "x", even if x is an operator.
\x an "x", even if x is an operator.
[xy] the character x or y.
[x-z] the characters x, y or z.
[^x] any character but x.
. any character but newline.
^x an x at the beginning of a line.
<y>x an x when Lex is in start condition y.
x$ an x at the end of a line.
x? an optional x.
x* 0,1,2, ... instances of x.
x+ 1,2,3, ... instances of x.
x|y an x or a y.
(x) an x.
x/y an x but only if followed by y.
{xx} the translation of xx from the
definitions section.
x{m,n} m through n occurrences of x.
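As a hypothetical worked example pulling several of these pieces
together, the short Lex source below uses a definition, included
code, a start condition, and rules built from the ^, <>, {}, +,
and . operators; it copies its input while counting the numbers
that appear outside lines beginning with #:
%{
int nnumbers;	/* included code, used by the actions below */
%}
D	[0-9]
%S NOCOUNT
%%
^#		{ECHO; BEGIN NOCOUNT;}
\n		{ECHO; BEGIN 0;}
<NOCOUNT>{D}+	ECHO;
{D}+		{nnumbers++; ECHO;}
.		ECHO;
%%
int yywrap() {return(1);}
When a digit string is seen on a line beginning with #, the
<NOCOUNT> rule and the plain {D}+ rule match the same length of
text, and Lex prefers the rule given first, so the number is
echoed but not counted. The yywrap routine is included only to
illustrate the user subroutines section; the Lex library supplies
a default yywrap and main when none is given.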
13. Caveats and Bugs.
There are pathological expressions which produce exponential
growth of the tables when converted to deterministic machines;
fortunately, they are rare.
REJECT does not rescan the input; instead it remembers the
results of the previous scan. This means that if a rule with
trailing context is found, and REJECT executed, the user must not
have used unput to change the characters forthcoming from the
input stream. This is the only restriction on the user's ability
to manipulate the not-yet-processed input.
14. Acknowledgments.
As should be obvious from the above, the outside of Lex is
patterned on Yacc and the inside on Aho's string matching
routines. Therefore, both S. C. Johnson and A. V. Aho are really
originators of much of Lex, as well as debuggers of it. Many
thanks are due to both.
The code of the current version of Lex was designed, written,
and debugged by Eric Schmidt.
15. References.
1. B. W. Kernighan and D. M. Ritchie, The C Programming
Language, Prentice-Hall, N. J. (1978).
2. B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran,
Software Practice and Experience, 5, pp. 395-406 (1975).
3. S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing
Science Technical Report No. 32, 1975, Bell Laboratories,
Murray Hill, NJ 07974.
4. A. V. Aho and M. J. Corasick, Efficient String Matching:
An Aid to Bibliographic Search, Comm. ACM 18, 333-340 (1975).
5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text
Editor, Computing Science Technical Report No. 5, 1972,
Bell Laboratories, Murray Hill, NJ 07974.
6. D. M. Ritchie, private communication. See also M. E. Lesk,
The Portable C Library, Computing Science Technical Report
No. 31, Bell Laboratories, Murray Hill, NJ 07974.