📄 tutor10.doc

📁 计算机编译原理教材
💻 DOC
📖 第 1 页 / 共 5 页
字号:
12 3 4 5 下一页
O
PA2A



























                            LET'S BUILD A COMPILER!

                                       By

                            Jack W. Crenshaw, Ph.D.

                                  21 May 1989


                           Part X: INTRODUCING "TINY"





























PA2A





       *****************************************************************
       *                                                               *
       *                        COPYRIGHT NOTICE                       *
       *                                                               *
       *   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
       *                                                               *
       *****************************************************************


       INTRODUCTION

       In the last installment, I showed you the general  idea  for  the
       top-down development of  a  compiler.    I gave you the first few
       steps  of  the process for compilers for  Pascal  and  C,  but  I
       stopped  far  short  of  pushing  it through to completion.   The
       reason was simple: if we're going to produce  a  real, functional
       compiler  for  any  language, I'd rather  do  it  for  KISS,  the
       language that I've been defining in this tutorial series.

       In this installment, we're going to do just that, for a subset of
       KISS which I've chosen to call TINY.

       The process  will be essentially that outlined in Installment IX,
       except  for  one  notable  difference.   In that  installment,  I
       suggested  that  you  begin  with  a full BNF description of  the
       language.  That's fine for something like Pascal or C,  for which
       the language definition is  firm.   In the case of TINY, however,
       we don't yet have a full  description  ... we seem to be defining
       the language as we go.  That's OK.    In  fact,  it's preferable,
       since we can tailor the language  slightly  as we go, to keep the
       parsing easy.

       So in the development  that  follows,  we'll  actually be doing a
       top-down development of BOTH the  language and its compiler.  The
       BNF description will grow along with the compiler.

       In this process, there will be a number of decisions to  be made,
       each of which will influence the BNF and therefore the  nature of
       the language.   At  each  decision  point I'll try to remember to
       explain  the  decision  and the rationale behind my choice.  That
       way, if you happen to hold a different opinion and would prefer a
       different option, you can choose it instead.  You  now  have  the
       background  to  do  that.  I guess the important thing to note is
       that  nothing  we  do  here  is  cast  in  concrete.  When YOU'RE
       designing YOUR language, you should feel free to do it YOUR way.

       Many of you may be asking at this point: Why bother starting over
       from  scratch?  We had a working subset of KISS as the outcome of
       Installment  VII  (lexical  scanning).  Why not just extend it as
       needed?  The  answer  is  threefold.    First of all, I have been
       making  a  number  of changes to further simplify the program ...
       changes  like  encapsulating  the  code generation procedures, so
       that  we  can  convert to a different target machine more easily.
       Second, I want you to see how the development can indeed  be doneA*2A*
                                     - 2 -

PA2A





       from the top down as outlined in the last installment.   Finally,
       we both need the practice.  Each time I go through this exercise,
       I get a little better at it, and you will, also.


       GETTING STARTED

       Many  years  ago  there were languages called  Tiny  BASIC,  Tiny
       Pascal, and Tiny C, each of which was a subset of its parent full
       language.  Tiny BASIC,  for  example,  had  only single-character
       variable names and global variables.   It supported only a single
       data type.  Sound familiar?  At this point we have almost all the
       tools we need to build a compiler like that.

       Yet a language called Tiny-anything  still  carries  some baggage
       inherited from its parent language.   I've often wondered if this
       is a  good  idea.    Granted,  a  language based upon some parent
       language will have the  advantage  of  familiarity, but there may
       also  be  some  peculiar syntax carried over from the parent that
       may tend  to add unnecessary complexity to the compiler. (Nowhere
       is this more true than in Small C.)

       I've wondered just how small and simple a compiler could  be made
       and  still  be  useful, if it were designed from the outset to be
       both easy to use and to  parse.    Let's find out.  This language
       will just be called "TINY," period.  It's a subset of KISS, which
       I  also  haven't  fully  defined,  so  that  at  least  makes  us
       consistent (!).  I suppose you could call it TINY KISS.  But that
       opens  up a whole can of worms involving  cuter  and  cuter  (and
       perhaps more risque) names, so let's just stick with TINY.

       The main limitations  of  TINY  will  be because of the things we
       haven't yet covered, such as data types.  Like its cousins Tiny C
       and Tiny BASIC,  TINY  will  have  only one data type, the 16-bit
       integer.    The  first  version  we  develop  will also  have  no
       procedure  calls  and  will  use single-character variable names,
       although as you will see we can remove these restrictions without
       much effort.

       The language I have in mind will share some of the  good features
       of  Pascal,  C,  and Ada.  Taking a lesson from the comparison of
       the Pascal and  C  compilers in the previous installment, though,
       TINY will have a decided Pascal flavor.  Wherever  feasible,    a
       language structure will  be  bracketed by keywords or symbols, so
       that  the parser will know where it's  going  without  having  to
       guess.

       One other ground rule:  As we go, I'd like  to  keep the compiler
       producing real, executable code.  Even though it may not  DO much
       at the beginning, it will at least do it correctly.

       Finally,  I'll  use  a couple of Pascal  restrictions  that  make
       sense:  All data and procedures must be declared before  they are
       used.  That makes good sense,  even  though for now the only dataA*2A*
                                     - 3 -

PA2A





       type we'll use  is a word.  This rule in turn means that the only
       reasonable place to put the  executable code for the main program
       is at the end of the listing.

       The top-level definition will be similar to Pascal:


            <program> ::= PROGRAM <top-level decl> <main> '.'


       Already, we've reached a decision point.  My first thought was to
       make the main block optional.   It  doesn't seem to make sense to
       write a "program" with no main program, but it does make sense if
       we're  allowing  for  multiple modules, linked together.    As  a
       matter of fact,  I intend to allow for this in KISS.  But then we
       begin  to open up a can of worms that I'd rather leave closed for
       now.  For example, the  term "PROGRAM" really becomes a misnomer.
       The MODULE of Modula-2 or the Unit of Turbo Pascal would  be more
       appropriate.  Second,  what  about  scope  rules?    We'd  need a
       convention for  dealing  with  name  visibility  across  modules.
       Better  for  now  to  just  keep  it  simple  and ignore the idea
       altogether.

       There's also a decision in choosing to require  the  main program
       to  be  last.    I  toyed  with  the idea of making its  position
       optional,  as  in  C.  The nature of SK*DOS, the OS I'm compiling
       for, make this very easy to do.   But  this  doesn't  really make
       much sense in view of the Pascal-like requirement  that  all data
       and procedures  be declared before they're referenced.  Since the
       main  program can only call procedures  that  have  already  been
       declared, the only position that makes sense is at the end,  a la
       Pascal.

       Given  the  BNF  above, let's write a parser that just recognizes
       the brackets:


       {--------------------------------------------------------------}
       {  Parse and Translate a Program }

       procedure Prog;
       begin
          Match('p');
          Header;
          Prolog;
          Match('.');
          Epilog;
       end;
       {--------------------------------------------------------------}


       The procedure Header just emits  the startup code required by the
       assembler:A62A6
                                     - 4 -A*2A*

PA2A





       {--------------------------------------------------------------}
       { Write Header Info }

       procedure Header;
       begin
          WriteLn('WARMST', TAB, 'EQU $A01E');
       end;
       {--------------------------------------------------------------}


       The procedures Prolog and  Epilog  emit  the code for identifying
       the main program, and for returning to the OS:


       {--------------------------------------------------------------}
       { Write the Prolog }

       procedure Prolog;
       begin
          PostLabel('MAIN');
       end;


       {--------------------------------------------------------------}
       { Write the Epilog }

       procedure Epilog;
       begin
          EmitLn('DC WARMST');
          EmitLn('END MAIN');
       end;
       {--------------------------------------------------------------}


       The  main program just calls Prog, and then  looks  for  a  clean
       ending:


       {--------------------------------------------------------------}
       { Main Program }

       begin
          Init;
          Prog;
          if Look <> CR then Abort('Unexpected data after ''.''');
       end.
       {--------------------------------------------------------------}


       At this point, TINY  will  accept  only  one input "program," the
       null program:


            PROGRAM .   (or 'p.' in our shorthand.)A*2A*
                                     - 5 -

PA2A





       Note, though, that the  compiler  DOES  generate correct code for
       this program.  It will run, and do  what  you'd  expect  the null
       program to do, that is, nothing but return gracefully to the OS.

       As a matter of interest, one of my  favorite  compiler benchmarks
       is to compile, link,  and  execute  the  null program in whatever
       language   is   involved.     You  can  learn  a  lot  about  the
       implementation by measuring  the  overhead  in  time  required to
       compile what should be a trivial case.  It's also  interesting to
       measure the amount of code produced.  In many compilers, the code
       can be fairly large, because they always include  the  whole run-
       time  library whether they need it or not.    Early  versions  of
       Turbo Pascal produced a 12K object file for  this  case.    VAX C
       generates 50K!

       The  smallest  null  programs  I've  seen are those  produced  by
       Modula-2 compilers, and they run about 200-800 bytes.

       In the case of TINY, we HAVE no run-time library  as  yet, so the
       object code is indeed tiny:  two  bytes.    That's  got  to  be a
       record, and it's  likely  to  remain  one since it is the minimum
       size required by the OS.

       The  next step is to process the code for the main program.  I'll
       use the Pascal BEGIN-block:


            <main> ::= BEGIN <block> END


       Here,  again,  we  have made a decision.  We could have chosen to
       require a "PROCEDURE MAIN" sort of declaration, similar to C.   I
       must  admit  that  this  is  not  a bad idea at all ...  I  don't
       particularly  like  the  Pascal  approach  since I tend  to  have
       trouble locating the main  program  in a Pascal listing.  But the
       alternative is a little awkward, too, since you have to deal with
       the  error condition where the user omits  the  main  program  or
       misspells its name.  Here I'm taking the easy way out.

       Another solution to the "where is the main program" problem might
       be to require a name for  the  program, and then bracket the main
       by


            BEGIN <name>
            END <name>


       similar to the convention of  Modula  2.    This  adds  a  bit of
       "syntactic sugar" to the language.  Things like this are  easy to
       add or change to your liking, if the language is your own design.

       To parse this definition of a main block,  change  procedure Prog
       to read:A*2A*
                                     - 6 -

PA2A





       {--------------------------------------------------------------}
       {  Parse and Translate a Program }

       procedure Prog;
       begin
          Match('p');
          Header;
          Main;
          Match('.');
       end;
       {--------------------------------------------------------------}


       and add the new procedure:


       {--------------------------------------------------------------}
       { Parse and Translate a Main Program }

       procedure Main;
       begin
          Match('b');
          Prolog;
          Match('e');
          Epilog;
       end;
       {--------------------------------------------------------------}


       Now, the only legal program is:


            PROGRAM BEGIN END . (or 'pbe.')


       Aren't we making progress???  Well, as usual it gets better.  You
       might try some deliberate errors here, like omitting  the  'b' or
       the 'e', and see what happens.  As always,  the  compiler  should
       flag all illegal inputs.


       DECLARATIONS

       The obvious next step is to decide what we mean by a declaration.
       My  intent  here  is to have two kinds of declarations: variables
       and  procedures/functions.    At  the  top  level,   only  global
       declarations are allowed, just as in C.

       For now, there  can  only be variable declarations, identified by
       the keyword VAR (abbreviated 'v'):


            <top-level decls> ::= ( <data declaration> )*A62A6
                                     - 7 -A*2A*

PA2A
12 3 4 5 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -