📄 chapter 3 lexical analysis.htm

📁 英文版编译器设计：里面详细介绍啦C编译器的设计
💻 HTM
📖 第 1 页 / 共 5 页
字号:
上一页 1 2 3 45

         {"for",         forsy},               {"to",          tosy},
         {"by",          bysy},                  

         {"while",       whilesy},             {"until",       untilsy},
         {"do",          dosy},                {"loop",        loopsy},
         {"continue",    continuesy},          {"break",       breaksy},

         {"select",      selectsy},            {"from",        fromsy},
         {"case",        casesy},  

         {"try",         trysy},               {"catch",       catchsy},
         {"finally",     finallysy},           {"throw",       throwsy},

         {"read",        readsy},              {"write",       writesy},
         {"istream",     istreamsy},           {"ostream",     ostreamsy},

         {"tron",        tronsy},              {"troff",       troffsy},

         {"func",        funcsy},              {"proc",        procsy},
         {"return",      returnsy},            {"asm",         asmsy},

         {"",            badsy}
      };  

      <B>Listing {SALKLST}.</B>   A string list of all the SAL keywords.  This is
      a portion of the listing from SAL-Scanner.CPP.


      for(i=0; kwdsy[i].ksy!=badsy; i++) {
         if(!strcmp(token.data.id, kwdsy[i].key)) {
            token.sy=kwdsy[i].ksy;
            break;
         }
      }

      <B>Listing {SALKEY}.</B>   Recognizing a keyword from the list.
</PRE>

We now have a pretty good idea of what comprises a SAL scanner.  We have only 
two more very basic things to cover.<P>

<H3>3.3.4 Whitespace</H3><!-------------------------------------------------------------------------------->
Whitespace is ignored by the compiler, for the most part.  However, it does play
a subtle part as a separater.  In some older languages, notably BASIC and Fortran,
whenever a keyword appeared, it was recognized regardless of its proximity to the
neighboring token.  Reading a BASIC program, one might see lines like
<PRE>      1250 FORI=1TO100:PRINT"Hello":NEXTI
</PRE>
All modern languages of today make some subtle use of whitespace as a delimiter,
and such monstrosities are seldom seen anymore.<P>

Whitespace characters are listed in table {WHITSP}.
<PRE>        ASCII Esc Name              Description  
      ========================================================================
          9    \t Tab character     Usually 8 spaces.
         10    \n Linefeed          Does not bring the cursor to the left.
         13    \r Carriage Return   Brings the cursor to the left, same line.
         32   ' ' Blank Space       A space character.

      <B>Table {WHITSP}.</B>
</PRE>
There may be others, but they are insignificant.

<H3>3.3.5 Comments</H3><!-------------------------------------------------------------------------------->
In addition to whitespace, the scanner also has to deal with comments.  One might
think that it is easy enough to just eat tokens until a terminating comment is 
found.  However, in SAL, there are two concerns that need to be addressed.  In 
short, comments need to be <I>correctly ignored</I> by the scanner.
<OL>
  <LI>Multiline comments can be nested.  When skipping over comments, the 
      scanner needs to keep track of the level of nesting.
  <LI>Whenever a single line comment appears on the same 
  line as the starter or ender for a multiline comment, the issue of precidence 
  arises. This will be discussed shourly             
                  
        
</LI></OL>

For the most part both styles of comments in SAL look and behave similarly to 
comments in C++ and Java.<P>

<MENU>
<B>Single Line Comments.</B>  Single line comments take on the form of a 
double-slash, "<TT>//</TT>", and behave identically to single line comments in 
C++.  Everything appearing on a line after a "<TT>//</TT>" is ignored.<P>
<PRE>      x:= 6; // Assign 6 to x.
</PRE>
<B>Multiline Comments.</B> These come directly from C, with one noted exception.
<PRE>      /* some comment...
      that spans multiple lines...*/
</PRE>
These can be nested within each other.<P>
<PRE>      /* this is
         /* A nested multiple line comment */
      */
</PRE>
Comments can be nested, and programmers have the ability to  comment out 
large sections of code without having to worry about whether or not 
there are other "<TT>/*...*/</TT>" pairs within a section.<P></P>
</MENU>
Since there are two styles of commenting, the subject of priority comes up whenever
the two are used together.  This issue is solved by letting the first comment take 
priority.  Here is an example of a single line comment inside a multiline comment,
all on the same line.  In this case the double slash is ignored.
<PRE>      /* some comment   // &lt;-- ignore this  */
</PRE>
Here is a multiline comment that spans three lines.  The last line has an imbedded
single line comment.<PRE>     /*
      * A comment
      //  &lt;-- Ignore this */
</PRE>
In each of these cases, the scanner is in the mode of scanning multiline comments.
In each of these cases, the double slashes are ignored.  Although the double 
slashes are ignored, they still need to be recognized.  Here is an example:
<PRE>      /*  comment out this section...
          
          for i:=1 to 10 do   //*** Initialize the table
             table[i]:= 0;
          loop
          
      */
</PRE>
Notice the comment on the for loop.  It is a double slash followed by three stars.
If the compiler just ignored everything and only payed attention to multiline
starters and enders, it might mistakenly interpret the comment on the for loop.  The
second slash preceeds a star, and looks like a starter for another multiline comment.
This listing shows the correct and incorrect way of handling this (very subtle) bug.
<PRE>      <B>INCORRECT</B>

      <B>/*</B>  comment out this section...
          
          for i:=1 to 10 do   /<B>/*</B>** Initialize the table
             table[i]:= 0;
          loop
          
      <B>*/</B>



      <B>CORRECT</B>

      <B>/*</B>  comment out this section...
          
          for i:=1 to 10 do   <B>//</B>*** Initialize the table
             table[i]:= 0;
          loop
          
      <B>*/</B>
</PRE>
On the other hand, there are times when single line comments take precidence.  When
this happens, it can be mess up a multiline comment.  
<PRE>      //    /*
             *  This comment is illegal
             */
</PRE>
This causes a problem because the <TT>/*</TT> gets commented out by the preceeding
<TT>//</TT>.  This is the only case where the combination of the two styles is
illegal.  In this case, both the starter and ender must appear on the same line.<P>
<PRE>      //        /* This is okay */
</PRE>

The instructor may provide utility for the scanner called <TT>Comment()</TT>.  It 
takes no parameters and returns nothing.  It is used in <TT>GetNextToken()</TT> like so:
<PRE>      if(ch=='/' &amp;&amp; (nch=='*' || nch=='/'))
         Comment();
</PRE>
If <TT>Comment()</TT> is not provided, its implementation will be left as an 
exercise for the student.
<P>

<H2>3.4 Writing a SAL Scanner</H2><!-------------------------------------------------------------------------------->
We are now ready to fully implement a scanner in SAL.  We will not have to do the
entire thing, since this course is intended to fit into one semester; SAL is a
large language.  There are a number of utility procedures designed to help the 
student along.<P>

<H3>3.4.1 Lookahead buffer.</H3><!-------------------------------------------------------------------------------->
This is taken care of for you in NextToken, but you should still know about this.
One of the many things that the scanner has to work with is the token queue.  The
scanner has access to it through two functions.
<MENU>
  <B><TT>UseQueue()</TT></B>  This function returns a Boolean result, and is true
  when the scanner should look in the queue for the next token, and is false when
  it should not.  There are some complex subtelties as to when the scanner should
  really go to the queue, and this function hides those.<P>
<PRE>      bool UseQueue()
</PRE>
  <B><TT>DequeueNexttoken()</TT></B> This function will dequeue the next token, 
  and copy it into the global variable, <TT>token</TT>.  It should only be used
  when <TT>UseQueue()</TT> returns a true.
<PRE>      void DequeueNextToken()
</PRE>
</MENU>

These two functions should be used like so:
<PRE>      //*** Check look-ahead queue first
      if(UseQueue()) {
         DequeueNextToken();
         return;
      }
</PRE>
Using these two functions, <TT>GetNextToken()</TT> never has to know the details of 
the token queue.  We know from having taken a course in computational theory that 
parsers often like to look ahead.  Generally, a parser should be able to look 
ahead any number of tokens, though in practice it would only need to look ahead 
just a few.  It is the Scanner's job to provide this look-ahead service, to not 
loose track of any peeked-at tokens, and to continue to deliver all tokens to the 
parser in the proper order each time it calls for the next token.<P>

This is done by having the scanner queue up tokens as a look-ahead is called for.
When the parser calls for the next token, the scanner should first check the queue
to see if it contains any pre-scanned tokens.  If it does, it removes one of them
from the front of the queue and returns it to the parser.  Otherwise, it scans
from the input stream, as normal.  As the parser calls to look ahead, the scanner
will go to the input stream, scan out the next token, enqueue it, and return it to
the parser.  In this manner, the parser can look ahead an arbitrary distance 
before proceeding.<P>

One significant issue to this process is that any call to get the next token 
resets the look-ahead pointer.  Consider figure {LKAHD}.

<CENTER><IMG src="Chapter 3 Lexical Analysis.files/LKAHD.gif"></CENTER>

The current token being <I>yellow</I>, the parser looks ahead and finds that the 
next three tokens are <I>red</I>, <I>green</I>, and <I>blue</I>.  The parser then 
calls for the next token.  <I>Yellow</I> is discarded, and replaced by <I>red</I>.  
The look ahead pointer is reset, and successive calls to look ahead will yield 
<I>green</I>, <I>blue</I>, and <I>white</I>.  Notice also that nothing is appended
to the token queue by the call to get the next token.  That will not happen
unless the scanner it told to look ahead another two consecutive times, beyond 
<I>blue</I>.<P>


<H3>3.4.2 Internal helper procedures.</H3><!-------------------------------------------------------------------------------->
<B><TT>NextToken()</TT></B> This function is provided as a wrapper to the GetNextToken() function.
It is used for Token queue management and other functions in the compiler. When you write the parser, you will
be calling this function, instead of GetNextToken.
<PRE>      void NextToken(void)
</PRE>
<B><TT>EscapeCh()</TT></B> This function is provided to facilitate the conversion 
of escape sequences.  This is a common task when converting string and character 
constants.<P>
<PRE>      char EscapeCh(void)
</PRE>

<B><TT>Comment()</TT></B> This function will scan over comments.  As soon as the
scanner detects a <TT>/*</TT> or a <TT>//</TT>, we can call this functi
上一页 1 2 3 45
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -