📄 tutor11.doc

📁 计算机编译原理教材
💻 DOC
📖 第 1 页 / 共 4 页
字号:
12 3 4 下一页
O
PA2A



























                            LET'S BUILD A COMPILER!

                                       By

                            Jack W. Crenshaw, Ph.D.

                                  3 June 1989


                        Part XI: LEXICAL SCAN REVISITED





























PA2A





       *****************************************************************
       *                                                               *
       *                        COPYRIGHT NOTICE                       *
       *                                                               *
       *   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
       *                                                               *
       *****************************************************************


       INTRODUCTION

       I've got some  good news and some bad news.  The bad news is that
       this installment is  not  the  one  I promised last time.  What's
       more, the one after this one won't be, either.

       The good news is the reason for this installment:  I've  found  a
       way  to simplify and improve the lexical  scanning  part  of  the
       compiler.  Let me explain.


       BACKGROUND

       If  you'll remember, we talked at length  about  the  subject  of
       lexical  scanners in Part VII, and I left you with a design for a
       distributed scanner that I felt was about as simple  as  I  could
       make it ... more than most that I've  seen  elsewhere.    We used
       that idea in Part X.  The compiler structure  that  resulted  was
       simple, and it got the job done.

       Recently, though, I've begun  to  have  problems, and they're the
       kind that send a message that you might be doing something wrong.

       The  whole thing came to a head when I tried to address the issue
       of  semicolons.  Several people have asked  me  about  them,  and
       whether or not KISS will have them separating the statements.  My
       intention has been NOT to  use semicolons, simply because I don't
       like them and, as you can see, they have not proved necessary.

       But I know that many of you, like me, have  gotten  used to them,
       and so  I  set  out  to write a short installment to show you how
       they could easily be added, if you were so inclined.

       Well, it  turned  out  that  they weren't easy to add at all.  In
       fact it was darned difficult.

       I guess I should have  realized that something was wrong, because
       of the issue  of  newlines.    In the last couple of installments
       we've addressed that issue,  and  I've shown you how to deal with
       newlines with a  procedure called, appropriately enough, NewLine.
       In  TINY  Version  1.0,  I  sprinkled calls to this procedure  in
       strategic spots in the code.

       It  seems  that  every time I've addressed the issue of newlines,
       though,  I've found it to be tricky,  and  the  resulting  parserA*2A*
                                     - 2 -

PA2A





       turned out to be quite fragile ... one addition or  deletion here
       or  there and things tended to go to pot.  Looking back on it,  I
       realize that  there  was  a  message  in  this that I just wasn't
       paying attention to.

       When I tried to add semicolons  on  top of the newlines, that was
       the last straw.   I ended up with much too complex a solution.  I
       began to realize that something fundamental had to change.

       So,  in  a  way this installment will cause us to backtrack a bit
       and revisit the issue of scanning all over again.    Sorry  about
       that.  That's the price you pay for watching me  do  this in real
       time.  But the new version is definitely an improvement, and will
       serve us well for what is to come.

       As  I said, the scanner we used in Part X was about as simple  as
       one can get.  But anything can be improved.   The  new scanner is
       more like the classical  scanner,  and  not  as simple as before.
       But the overall  compiler  structure is even simpler than before.
       It's also more robust, and easier to add  to  and/or  modify.   I
       think that's worth the time spent in this digression.  So in this
       installment, I'll be showing  you  the  new  structure.  No doubt
       you'll  be  happy  to  know  that, while the changes affect  many
       procedures, they aren't very profound  and so we lose very little
       of what's been done so far.

       Ironically, the new scanner  is  much  more conventional than the
       old one, and is very much like the more generic scanner  I showed
       you  earlier  in  Part VII.  Then I started trying to get clever,
       and I almost clevered myself clean out of business.   You'd think
       one day I'd learn: K-I-S-S!


       THE PROBLEM

       The problem begins to show  itself in procedure Block, which I've
       reproduced below:


       {--------------------------------------------------------------}
       { Parse and Translate a Block of Statements }

       procedure Block;
       begin
          Scan;
          while not(Token in ['e', 'l']) do begin
             case Token of
              'i': DoIf;
              'w': DoWhile;
              'R': DoRead;
              'W': DoWrite;
             else Assignment;
             end;
             Scan;A*2A*
                                     - 3 -

PA2A





          end;
       end;
       {--------------------------------------------------------------}


       As  you   can  see,  Block  is  oriented  to  individual  program
       statements.  At each pass through  the  loop, we know that we are
       at  the beginning of a statement.  We exit the block when we have
       scanned an END or an ELSE.

       But suppose that we see a semicolon instead.   The  procedure  as
       it's shown above  can't  handle that, because procedure Scan only
       expects and can only accept tokens that begin with a letter.

       I  tinkered  around for quite awhile to come up with a  fix.    I
       found many possible approaches, but none were very satisfying.  I
       finally figured out the reason.

       Recall that when we started with our single-character parsers, we
       adopted a convention that the lookahead character would always be
       prefetched.    That   is,   we  would  have  the  character  that
       corresponds to our  current  position in the input stream fetched
       into the global character Look, so that we could  examine  it  as
       many  times  as  needed.    The  rule  we  adopted was that EVERY
       recognizer, if it found its target token, would  advance  Look to
       the next character in the input stream.

       That simple and fixed convention served us very well when  we had
       single-character tokens, and it still does.  It would make  a lot
       of sense to apply the same rule to multi-character tokens.

       But when we got into lexical scanning, I began  to  violate  that
       simple rule.  The scanner of Part X  did  indeed  advance  to the
       next token if it found an identifier or keyword, but it DIDN'T do
       that if it found a carriage return, a whitespace character, or an
       operator.

       Now, that sort of mixed-mode  operation gets us into deep trouble
       in procedure Block, because whether or not the  input  stream has
       been advanced depends upon the kind of token we  encounter.    If
       it's  a keyword or the target of  an  assignment  statement,  the
       "cursor," as defined by the contents of Look,  has  been advanced
       to  the next token OR to the beginning of whitespace.  If, on the
       other  hand,  the  token  is  a  semicolon,  or if we have hit  a
       carriage return, the cursor has NOT advanced.

       Needless to say, we can add enough logic  to  keep  us  on track.
       But it's tricky, and makes the whole parser very fragile.

       There's a much  better  way,  and  that's just to adopt that same
       rule that's worked so well before, to apply to TOKENS as  well as
       single characters.  In other words, we'll prefetch tokens just as
       we've always done for  characters.   It seems so obvious once you
       think about it that way.A*2A*
                                     - 4 -

PA2A





       Interestingly enough, if we do things this way  the  problem that
       we've had with newline characters goes away.  We  can  just  lump
       them in as  whitespace  characters, which means that the handling
       of  newlines  becomes  very trivial, and MUCH less prone to error
       than we've had to deal with in the past.


       THE SOLUTION

       Let's  begin  to  fix  the  problem  by  re-introducing  the  two
       procedures:

       {--------------------------------------------------------------}
       { Get an Identifier }

       procedure GetName;
       begin
          SkipWhite;
          if Not IsAlpha(Look) then Expected('Identifier');
          Token := 'x';
          Value := '';
          repeat
             Value := Value + UpCase(Look);
             GetChar;
          until not IsAlNum(Look);
       end;


       {--------------------------------------------------------------}
       { Get a Number }

       procedure GetNum;
       begin
          SkipWhite;
          if not IsDigit(Look) then Expected('Number');
          Token := '#';
          Value := '';
          repeat
             Value := Value + Look;
             GetChar;
          until not IsDigit(Look);
       end;
       {--------------------------------------------------------------}


       These two procedures are  functionally  almost  identical  to the
       ones  I  showed  you in Part VII.  They each  fetch  the  current
       token, either an identifier or a number, into  the  global string
       Value.    They  also  set  the  encoded  version, Token,  to  the
       appropriate code.  The input  stream is left with Look containing
       the first character NOT part of the token.

       We  can do the same thing  for  operators,  even  multi-character
       operators, with a procedure such as:A*2A*
                                     - 5 -

PA2A





       {--------------------------------------------------------------}
       { Get an Operator }

       procedure GetOp;
       begin
          Token := Look;
          Value := '';
          repeat
             Value := Value + Look;
             GetChar;
          until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
       end;
       {--------------------------------------------------------------}

       Note  that  GetOp  returns,  as  its  encoded  token,  the  FIRST
       character of the operator.  This is important,  because  it means
       that we can now use that single character to  drive  the  parser,
       instead of the lookahead character.

       We need to tie these  procedures together into a single procedure
       that can handle all three  cases.  The  following  procedure will
       read any one of the token types and always leave the input stream
       advanced beyond it:


       {--------------------------------------------------------------}
       { Get the Next Input Token }

       procedure Next;
       begin
          SkipWhite;
          if IsAlpha(Look) then GetName
          else if IsDigit(Look) then GetNum
          else GetOp;
       end;
       {--------------------------------------------------------------}


       ***NOTE  that  here  I have put SkipWhite BEFORE the calls rather
       than after.  This means that, in general, the variable  Look will
       NOT have a meaningful value in it, and therefore  we  should  NOT
       use it as a test value for parsing, as we have been doing so far.
       That's the big departure from our normal approach.

       Now, remember that before I was careful not to treat the carriage
       return (CR) and line  feed  (LF) characters as white space.  This
       was  because,  with  SkipWhite  called  as the last thing in  the
       scanner, the encounter with  LF  would  trigger a read statement.
       If we were on the last line of the program,  we  couldn't get out
       until we input another line with a non-white  character.   That's
       why I needed the second procedure, NewLine, to handle the CRLF's.

       But now, with the call  to SkipWhite coming first, that's exactly
       the behavior we want.    The  compiler  must know there's anotherA*2A*
                                     - 6 -

PA2A





       token coming or it wouldn't be calling Next.  In other words,  it
       hasn't found the terminating  END  yet.  So we're going to insist
       on more data until we find something.

       All this means that we can greatly simplify both the  program and
       the concepts, by treating CR and LF as whitespace characters, and
       eliminating NewLine.  You  can  do  that  simply by modifying the
       function IsWhite:


       {--------------------------------------------------------------}
       { Recognize White Space }

       function IsWhite(c: char): boolean;
       begin
          IsWhite := c in [' ', TAB, CR, LF];
       end;
       {--------------------------------------------------------------}


       We've already tried similar routines in Part VII,  but  you might
       as well try these new ones out.  Add them to a copy of the Cradle
       and call Next with the following main program:


       {--------------------------------------------------------------}
       { Main Program }

       begin
          Init;
          repeat
             Next;
             WriteLn(Token, ' ', Value);
          until Token = '.';
       end.
       {--------------------------------------------------------------}


       Compile  it and verify that you can separate  a  program  into  a
       series of tokens, and that you get the right  encoding  for  each
       token.

       This ALMOST works,  but  not  quite.    There  are  two potential
       problems:    First,  in KISS/TINY almost all of our operators are
       single-character operators.  The only exceptions  are  the relops
       >=, <=, and <>.  It seems  a  shame  to  treat  all  operators as
       strings and do a  string  compare,  when  only a single character
       compare  will  almost  always  suffice.   Second, and  much  more
       important, the  thing  doesn't  WORK  when  two  operators appear
       together, as in (a+b)*(c+d).  Here the string following 'b' would
       be interpreted as a single operator ")*(."

       It's possible to fix that problem.  For example,  we  could  just
       give GetOp a  list  of  legal  characters, and we could treat theA*2A*
                                     - 7 -

PA2A





       parentheses as different operator types  than  the  others.   But
       this begins to get messy.

       Fortunately, there's a  better  way that solves all the problems.
       Since almost  all the operators are single characters, let's just
       treat  them  that  way, and let GetOp get only one character at a
       time.  This not only simplifies GetOp, but also speeds  things up
       quite a  bit.    We  still have the problem of the relops, but we
       were treating them as special cases anyway.

       So here's the final version of GetOp:


       {--------------------------------------------------------------}
       { Get an Operator }

       procedure GetOp;
       begin
          SkipWhite;
          Token := Look;
          Value := Look;
          GetChar;
       end;
       {--------------------------------------------------------------}


       Note that I still give the string Value a value.  If you're truly
       concerned about efficiency, you could leave this out.  When we're
       expecting an operator, we will only be testing  Token  anyhow, so
       the  value of the string won't matter.  But to me it seems to  be
       good practice to give the thing a value just in case.

       Try  this  new  version with some realistic-looking  code.    You
       should  be  able  to  separate  any program into  its  individual
       tokens, with the  caveat  that the two-character relops will scan
       into two separate tokens.  That's OK ... we'll  parse  them  that
       way.

       Now, in Part VII the function of Next was combined with procedure
       Scan,  which  also  checked every identifier against  a  list  of
       keywords and encoded each one that was found.  As I  mentioned at
       the time, the last thing we would want  to  do  is  to use such a
       procedure in places where keywords  should not appear, such as in
       expressions.  If we  did  that, the keyword list would be scanned
       for every identifier appearing in the code.  Not good.

       The  right  way  to  deal  with  that  is  to simply separate the
       functions  of  fetching  tokens and looking for  keywords.    The
       version of Scan shown below  does NOTHING but check for keywords.
       Notice that it operates on the current token and does NOT advance
       the input stream.


       {--------------------------------------------------------------}A*2A*
                                     - 8 -

PA2A
12 3 4 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -