📄 tutor10.doc
字号:
O
PA2A
LET'S BUILD A COMPILER!
By
Jack W. Crenshaw, Ph.D.
21 May 1989
Part X: INTRODUCING "TINY"
PA2A
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. *
* *
*****************************************************************
INTRODUCTION
In the last installment, I showed you the general idea for the
top-down development of a compiler. I gave you the first few
steps of the process for compilers for Pascal and C, but I
stopped far short of pushing it through to completion. The
reason was simple: if we're going to produce a real, functional
compiler for any language, I'd rather do it for KISS, the
language that I've been defining in this tutorial series.
In this installment, we're going to do just that, for a subset of
KISS which I've chosen to call TINY.
The process will be essentially that outlined in Installment IX,
except for one notable difference. In that installment, I
suggested that you begin with a full BNF description of the
language. That's fine for something like Pascal or C, for which
the language definition is firm. In the case of TINY, however,
we don't yet have a full description ... we seem to be defining
the language as we go. That's OK. In fact, it's preferable,
since we can tailor the language slightly as we go, to keep the
parsing easy.
So in the development that follows, we'll actually be doing a
top-down development of BOTH the language and its compiler. The
BNF description will grow along with the compiler.
In this process, there will be a number of decisions to be made,
each of which will influence the BNF and therefore the nature of
the language. At each decision point I'll try to remember to
explain the decision and the rationale behind my choice. That
way, if you happen to hold a different opinion and would prefer a
different option, you can choose it instead. You now have the
background to do that. I guess the important thing to note is
that nothing we do here is cast in concrete. When YOU'RE
designing YOUR language, you should feel free to do it YOUR way.
Many of you may be asking at this point: Why bother starting over
from scratch? We had a working subset of KISS as the outcome of
Installment VII (lexical scanning). Why not just extend it as
needed? The answer is threefold. First of all, I have been
making a number of changes to further simplify the program ...
changes like encapsulating the code generation procedures, so
that we can convert to a different target machine more easily.
Second, I want you to see how the development can indeed be doneA*2A*
- 2 -
PA2A
from the top down as outlined in the last installment. Finally,
we both need the practice. Each time I go through this exercise,
I get a little better at it, and you will, also.
GETTING STARTED
Many years ago there were languages called Tiny BASIC, Tiny
Pascal, and Tiny C, each of which was a subset of its parent full
language. Tiny BASIC, for example, had only single-character
variable names and global variables. It supported only a single
data type. Sound familiar? At this point we have almost all the
tools we need to build a compiler like that.
Yet a language called Tiny-anything still carries some baggage
inherited from its parent language. I've often wondered if this
is a good idea. Granted, a language based upon some parent
language will have the advantage of familiarity, but there may
also be some peculiar syntax carried over from the parent that
may tend to add unnecessary complexity to the compiler. (Nowhere
is this more true than in Small C.)
I've wondered just how small and simple a compiler could be made
and still be useful, if it were designed from the outset to be
both easy to use and to parse. Let's find out. This language
will just be called "TINY," period. It's a subset of KISS, which
I also haven't fully defined, so that at least makes us
consistent (!). I suppose you could call it TINY KISS. But that
opens up a whole can of worms involving cuter and cuter (and
perhaps more risque) names, so let's just stick with TINY.
The main limitations of TINY will be because of the things we
haven't yet covered, such as data types. Like its cousins Tiny C
and Tiny BASIC, TINY will have only one data type, the 16-bit
integer. The first version we develop will also have no
procedure calls and will use single-character variable names,
although as you will see we can remove these restrictions without
much effort.
The language I have in mind will share some of the good features
of Pascal, C, and Ada. Taking a lesson from the comparison of
the Pascal and C compilers in the previous installment, though,
TINY will have a decided Pascal flavor. Wherever feasible, a
language structure will be bracketed by keywords or symbols, so
that the parser will know where it's going without having to
guess.
One other ground rule: As we go, I'd like to keep the compiler
producing real, executable code. Even though it may not DO much
at the beginning, it will at least do it correctly.
Finally, I'll use a couple of Pascal restrictions that make
sense: All data and procedures must be declared before they are
used. That makes good sense, even though for now the only dataA*2A*
- 3 -
PA2A
type we'll use is a word. This rule in turn means that the only
reasonable place to put the executable code for the main program
is at the end of the listing.
The top-level definition will be similar to Pascal:
<program> ::= PROGRAM <top-level decl> <main> '.'
Already, we've reached a decision point. My first thought was to
make the main block optional. It doesn't seem to make sense to
write a "program" with no main program, but it does make sense if
we're allowing for multiple modules, linked together. As a
matter of fact, I intend to allow for this in KISS. But then we
begin to open up a can of worms that I'd rather leave closed for
now. For example, the term "PROGRAM" really becomes a misnomer.
The MODULE of Modula-2 or the Unit of Turbo Pascal would be more
appropriate. Second, what about scope rules? We'd need a
convention for dealing with name visibility across modules.
Better for now to just keep it simple and ignore the idea
altogether.
There's also a decision in choosing to require the main program
to be last. I toyed with the idea of making its position
optional, as in C. The nature of SK*DOS, the OS I'm compiling
for, make this very easy to do. But this doesn't really make
much sense in view of the Pascal-like requirement that all data
and procedures be declared before they're referenced. Since the
main program can only call procedures that have already been
declared, the only position that makes sense is at the end, a la
Pascal.
Given the BNF above, let's write a parser that just recognizes
the brackets:
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure Prog;
begin
Match('p');
Header;
Prolog;
Match('.');
Epilog;
end;
{--------------------------------------------------------------}
The procedure Header just emits the startup code required by the
assembler:A62A6
- 4 -A*2A*
PA2A
{--------------------------------------------------------------}
{ Write Header Info }
procedure Header;
begin
WriteLn('WARMST', TAB, 'EQU $A01E');
end;
{--------------------------------------------------------------}
The procedures Prolog and Epilog emit the code for identifying
the main program, and for returning to the OS:
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
PostLabel('MAIN');
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog;
begin
EmitLn('DC WARMST');
EmitLn('END MAIN');
end;
{--------------------------------------------------------------}
The main program just calls Prog, and then looks for a clean
ending:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Prog;
if Look <> CR then Abort('Unexpected data after ''.''');
end.
{--------------------------------------------------------------}
At this point, TINY will accept only one input "program," the
null program:
PROGRAM . (or 'p.' in our shorthand.)A*2A*
- 5 -
PA2A
Note, though, that the compiler DOES generate correct code for
this program. It will run, and do what you'd expect the null
program to do, that is, nothing but return gracefully to the OS.
As a matter of interest, one of my favorite compiler benchmarks
is to compile, link, and execute the null program in whatever
language is involved. You can learn a lot about the
implementation by measuring the overhead in time required to
compile what should be a trivial case. It's also interesting to
measure the amount of code produced. In many compilers, the code
can be fairly large, because they always include the whole run-
time library whether they need it or not. Early versions of
Turbo Pascal produced a 12K object file for this case. VAX C
generates 50K!
The smallest null programs I've seen are those produced by
Modula-2 compilers, and they run about 200-800 bytes.
In the case of TINY, we HAVE no run-time library as yet, so the
object code is indeed tiny: two bytes. That's got to be a
record, and it's likely to remain one since it is the minimum
size required by the OS.
The next step is to process the code for the main program. I'll
use the Pascal BEGIN-block:
<main> ::= BEGIN <block> END
Here, again, we have made a decision. We could have chosen to
require a "PROCEDURE MAIN" sort of declaration, similar to C. I
must admit that this is not a bad idea at all ... I don't
particularly like the Pascal approach since I tend to have
trouble locating the main program in a Pascal listing. But the
alternative is a little awkward, too, since you have to deal with
the error condition where the user omits the main program or
misspells its name. Here I'm taking the easy way out.
Another solution to the "where is the main program" problem might
be to require a name for the program, and then bracket the main
by
BEGIN <name>
END <name>
similar to the convention of Modula 2. This adds a bit of
"syntactic sugar" to the language. Things like this are easy to
add or change to your liking, if the language is your own design.
To parse this definition of a main block, change procedure Prog
to read:A*2A*
- 6 -
PA2A
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure Prog;
begin
Match('p');
Header;
Main;
Match('.');
end;
{--------------------------------------------------------------}
and add the new procedure:
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure Main;
begin
Match('b');
Prolog;
Match('e');
Epilog;
end;
{--------------------------------------------------------------}
Now, the only legal program is:
PROGRAM BEGIN END . (or 'pbe.')
Aren't we making progress??? Well, as usual it gets better. You
might try some deliberate errors here, like omitting the 'b' or
the 'e', and see what happens. As always, the compiler should
flag all illegal inputs.
DECLARATIONS
The obvious next step is to decide what we mean by a declaration.
My intent here is to have two kinds of declarations: variables
and procedures/functions. At the top level, only global
declarations are allowed, just as in C.
For now, there can only be variable declarations, identified by
the keyword VAR (abbreviated 'v'):
<top-level decls> ::= ( <data declaration> )*A62A6
- 7 -A*2A*
PA2A
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -