📄 tutor11.doc
字号:
O
PA2A
LET'S BUILD A COMPILER!
By
Jack W. Crenshaw, Ph.D.
3 June 1989
Part XI: LEXICAL SCAN REVISITED
PA2A
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
* *
*****************************************************************
INTRODUCTION
I've got some good news and some bad news. The bad news is that
this installment is not the one I promised last time. What's
more, the one after this one won't be, either.
The good news is the reason for this installment: I've found a
way to simplify and improve the lexical scanning part of the
compiler. Let me explain.
BACKGROUND
If you'll remember, we talked at length about the subject of
lexical scanners in Part VII, and I left you with a design for a
distributed scanner that I felt was about as simple as I could
make it ... more than most that I've seen elsewhere. We used
that idea in Part X. The compiler structure that resulted was
simple, and it got the job done.
Recently, though, I've begun to have problems, and they're the
kind that send a message that you might be doing something wrong.
The whole thing came to a head when I tried to address the issue
of semicolons. Several people have asked me about them, and
whether or not KISS will have them separating the statements. My
intention has been NOT to use semicolons, simply because I don't
like them and, as you can see, they have not proved necessary.
But I know that many of you, like me, have gotten used to them,
and so I set out to write a short installment to show you how
they could easily be added, if you were so inclined.
Well, it turned out that they weren't easy to add at all. In
fact it was darned difficult.
I guess I should have realized that something was wrong, because
of the issue of newlines. In the last couple of installments
we've addressed that issue, and I've shown you how to deal with
newlines with a procedure called, appropriately enough, NewLine.
In TINY Version 1.0, I sprinkled calls to this procedure in
strategic spots in the code.
It seems that every time I've addressed the issue of newlines,
though, I've found it to be tricky, and the resulting parserA*2A*
- 2 -
PA2A
turned out to be quite fragile ... one addition or deletion here
or there and things tended to go to pot. Looking back on it, I
realize that there was a message in this that I just wasn't
paying attention to.
When I tried to add semicolons on top of the newlines, that was
the last straw. I ended up with much too complex a solution. I
began to realize that something fundamental had to change.
So, in a way this installment will cause us to backtrack a bit
and revisit the issue of scanning all over again. Sorry about
that. That's the price you pay for watching me do this in real
time. But the new version is definitely an improvement, and will
serve us well for what is to come.
As I said, the scanner we used in Part X was about as simple as
one can get. But anything can be improved. The new scanner is
more like the classical scanner, and not as simple as before.
But the overall compiler structure is even simpler than before.
It's also more robust, and easier to add to and/or modify. I
think that's worth the time spent in this digression. So in this
installment, I'll be showing you the new structure. No doubt
you'll be happy to know that, while the changes affect many
procedures, they aren't very profound and so we lose very little
of what's been done so far.
Ironically, the new scanner is much more conventional than the
old one, and is very much like the more generic scanner I showed
you earlier in Part VII. Then I started trying to get clever,
and I almost clevered myself clean out of business. You'd think
one day I'd learn: K-I-S-S!
THE PROBLEM
The problem begins to show itself in procedure Block, which I've
reproduced below:
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Scan;
while not(Token in ['e', 'l']) do begin
case Token of
'i': DoIf;
'w': DoWhile;
'R': DoRead;
'W': DoWrite;
else Assignment;
end;
Scan;A*2A*
- 3 -
PA2A
end;
end;
{--------------------------------------------------------------}
As you can see, Block is oriented to individual program
statements. At each pass through the loop, we know that we are
at the beginning of a statement. We exit the block when we have
scanned an END or an ELSE.
But suppose that we see a semicolon instead. The procedure as
it's shown above can't handle that, because procedure Scan only
expects and can only accept tokens that begin with a letter.
I tinkered around for quite awhile to come up with a fix. I
found many possible approaches, but none were very satisfying. I
finally figured out the reason.
Recall that when we started with our single-character parsers, we
adopted a convention that the lookahead character would always be
prefetched. That is, we would have the character that
corresponds to our current position in the input stream fetched
into the global character Look, so that we could examine it as
many times as needed. The rule we adopted was that EVERY
recognizer, if it found its target token, would advance Look to
the next character in the input stream.
That simple and fixed convention served us very well when we had
single-character tokens, and it still does. It would make a lot
of sense to apply the same rule to multi-character tokens.
But when we got into lexical scanning, I began to violate that
simple rule. The scanner of Part X did indeed advance to the
next token if it found an identifier or keyword, but it DIDN'T do
that if it found a carriage return, a whitespace character, or an
operator.
Now, that sort of mixed-mode operation gets us into deep trouble
in procedure Block, because whether or not the input stream has
been advanced depends upon the kind of token we encounter. If
it's a keyword or the target of an assignment statement, the
"cursor," as defined by the contents of Look, has been advanced
to the next token OR to the beginning of whitespace. If, on the
other hand, the token is a semicolon, or if we have hit a
carriage return, the cursor has NOT advanced.
Needless to say, we can add enough logic to keep us on track.
But it's tricky, and makes the whole parser very fragile.
There's a much better way, and that's just to adopt that same
rule that's worked so well before, to apply to TOKENS as well as
single characters. In other words, we'll prefetch tokens just as
we've always done for characters. It seems so obvious once you
think about it that way.A*2A*
- 4 -
PA2A
Interestingly enough, if we do things this way the problem that
we've had with newline characters goes away. We can just lump
them in as whitespace characters, which means that the handling
of newlines becomes very trivial, and MUCH less prone to error
than we've had to deal with in the past.
THE SOLUTION
Let's begin to fix the problem by re-introducing the two
procedures:
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
begin
SkipWhite;
if Not IsAlpha(Look) then Expected('Identifier');
Token := 'x';
Value := '';
repeat
Value := Value + UpCase(Look);
GetChar;
until not IsAlNum(Look);
end;
{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
SkipWhite;
if not IsDigit(Look) then Expected('Number');
Token := '#';
Value := '';
repeat
Value := Value + Look;
GetChar;
until not IsDigit(Look);
end;
{--------------------------------------------------------------}
These two procedures are functionally almost identical to the
ones I showed you in Part VII. They each fetch the current
token, either an identifier or a number, into the global string
Value. They also set the encoded version, Token, to the
appropriate code. The input stream is left with Look containing
the first character NOT part of the token.
We can do the same thing for operators, even multi-character
operators, with a procedure such as:A*2A*
- 5 -
PA2A
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
Token := Look;
Value := '';
repeat
Value := Value + Look;
GetChar;
until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
end;
{--------------------------------------------------------------}
Note that GetOp returns, as its encoded token, the FIRST
character of the operator. This is important, because it means
that we can now use that single character to drive the parser,
instead of the lookahead character.
We need to tie these procedures together into a single procedure
that can handle all three cases. The following procedure will
read any one of the token types and always leave the input stream
advanced beyond it:
{--------------------------------------------------------------}
{ Get the Next Input Token }
procedure Next;
begin
SkipWhite;
if IsAlpha(Look) then GetName
else if IsDigit(Look) then GetNum
else GetOp;
end;
{--------------------------------------------------------------}
***NOTE that here I have put SkipWhite BEFORE the calls rather
than after. This means that, in general, the variable Look will
NOT have a meaningful value in it, and therefore we should NOT
use it as a test value for parsing, as we have been doing so far.
That's the big departure from our normal approach.
Now, remember that before I was careful not to treat the carriage
return (CR) and line feed (LF) characters as white space. This
was because, with SkipWhite called as the last thing in the
scanner, the encounter with LF would trigger a read statement.
If we were on the last line of the program, we couldn't get out
until we input another line with a non-white character. That's
why I needed the second procedure, NewLine, to handle the CRLF's.
But now, with the call to SkipWhite coming first, that's exactly
the behavior we want. The compiler must know there's anotherA*2A*
- 6 -
PA2A
token coming or it wouldn't be calling Next. In other words, it
hasn't found the terminating END yet. So we're going to insist
on more data until we find something.
All this means that we can greatly simplify both the program and
the concepts, by treating CR and LF as whitespace characters, and
eliminating NewLine. You can do that simply by modifying the
function IsWhite:
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [' ', TAB, CR, LF];
end;
{--------------------------------------------------------------}
We've already tried similar routines in Part VII, but you might
as well try these new ones out. Add them to a copy of the Cradle
and call Next with the following main program:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Next;
WriteLn(Token, ' ', Value);
until Token = '.';
end.
{--------------------------------------------------------------}
Compile it and verify that you can separate a program into a
series of tokens, and that you get the right encoding for each
token.
This ALMOST works, but not quite. There are two potential
problems: First, in KISS/TINY almost all of our operators are
single-character operators. The only exceptions are the relops
>=, <=, and <>. It seems a shame to treat all operators as
strings and do a string compare, when only a single character
compare will almost always suffice. Second, and much more
important, the thing doesn't WORK when two operators appear
together, as in (a+b)*(c+d). Here the string following 'b' would
be interpreted as a single operator ")*(."
It's possible to fix that problem. For example, we could just
give GetOp a list of legal characters, and we could treat theA*2A*
- 7 -
PA2A
parentheses as different operator types than the others. But
this begins to get messy.
Fortunately, there's a better way that solves all the problems.
Since almost all the operators are single characters, let's just
treat them that way, and let GetOp get only one character at a
time. This not only simplifies GetOp, but also speeds things up
quite a bit. We still have the problem of the relops, but we
were treating them as special cases anyway.
So here's the final version of GetOp:
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
SkipWhite;
Token := Look;
Value := Look;
GetChar;
end;
{--------------------------------------------------------------}
Note that I still give the string Value a value. If you're truly
concerned about efficiency, you could leave this out. When we're
expecting an operator, we will only be testing Token anyhow, so
the value of the string won't matter. But to me it seems to be
good practice to give the thing a value just in case.
Try this new version with some realistic-looking code. You
should be able to separate any program into its individual
tokens, with the caveat that the two-character relops will scan
into two separate tokens. That's OK ... we'll parse them that
way.
Now, in Part VII the function of Next was combined with procedure
Scan, which also checked every identifier against a list of
keywords and encoded each one that was found. As I mentioned at
the time, the last thing we would want to do is to use such a
procedure in places where keywords should not appear, such as in
expressions. If we did that, the keyword list would be scanned
for every identifier appearing in the code. Not good.
The right way to deal with that is to simply separate the
functions of fetching tokens and looking for keywords. The
version of Scan shown below does NOTHING but check for keywords.
Notice that it operates on the current token and does NOT advance
the input stream.
{--------------------------------------------------------------}A*2A*
- 8 -
PA2A
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -