📄 chapter 3 lexical analysis.htm

📁 英文版编译器设计：里面详细介绍啦C编译器的设计
💻 HTM
📖 第 1 页 / 共 5 页
字号:
            For instance, '6' equals 6.  'D' = 13 (in hex).
        <LI>Add the character's binary equivalent to our final 
    number.        
      </LI></OL>
</LI></OL>
The process continues until there are no more characters, or the final number 
overflows.  When the integer's base happens to be a power of two, we can employ a 
faster method of using left-shifts instead of multiplys and using ors instead of 
adds.  Listing {ADTOI} demonstrates the conversion process for decimal integers, 
and listing {AOTOI} demonstrates the same process for octal integers.<P>
<PRE>      //*** DECIMAL
      lim = MAXUQWORD/10;
      token.data.num = 0;

      while (buff[i]) {
         if(token.data.num &gt; lim)
      { Error(ER_NUM2BIG);
      //  The
      multiply  will

      be too big
         token.data.num =  
                          
            0;break;
            }
         else
         if ((buff[i]-'0') &gt;  (MAXUQWORD - (token.data.num*10))) {
            Error(ER_NUM2BIG);        // The add will be too big
            token.data.num=0;
            break;
         }
         token.data.num = (token.data.num*10) + (buff[i++]-'0');
      }
      token.base=10;

      <B>Listing {ADTOI}</B>


      lim = MAXUQWORD


      &gt;&gt;  3;  while
      (buf[i]) { if(token.data.num
         &gt;  lim) {
            Error(ER_NUM2BIG);        // The shift will be too big
            token.data.num=0;
            break;
         }
         token.data.num = (token.data.num&lt;&lt;3) | (buf[i++]-'0');
      }
      token.base=8;

      <B>Listing {AOTOI}</B> Notice the use of shifts instead of multiplys, and 
      ors instead of adds.
</PRE>

Notice that the number is placed into <TT>token.data.num</TT> as it is computed.
The variable <TT>lim</TT> is used to determine whether the next shift or multiply
will overflow the number.  If the number overflows, an error is signaled.<P>

In the case of integers, we cannot make use of the facilities provided in the C
language, since SAL has the ability to work with binary constants and quadword 
sized integers (64 bits).  Hence, we do the conversion ourselves.<P>

             
   Integers in SAL can be represented in base binary, octal, decimal, and in
hexadecimal  form. The base is presented by prepending a letter '0b',  '0o', 
'0d', or '0x' respectively. In the case  of decimal integers, the '0d' is optional. 
All hexadecimal digits must be in upper case.  The numbers 0b11111111, 0o377, 255, 0d255, and 
0xFF all represent the same quantity.  Integers and cardinals (unsigned integers)
in SAL default to 32 bits, but are stored in <TT>token</TT> as 64 bits.<P><!-------------------------------------------------------------------------------->
<B>Converting ASCII reals to binary.</B> This case is much simpler.  We can simply
make use of the C function <TT>atof()</TT>.  See listing {ATOF}.
<PRE>      //*** FLOAT
      token.sy = realconsy;
      token.data.rnum = atof(buff);
      token.base=10;

      <B>Listing {ATOF}</B>
</PRE>

SAL is very flexible about the way real numbers can be represented.  They can 
take on many forms.
<UL>
  <LI>A period followed by digits: .25, .5, .3333333
  <LI>Digits followed by a period: 1., 0.
  <LI>Digits, period, and digits: 1.0, 0.25, 0.0
  <LI>Scientific notation: digits of one of the previous 
  formats, or a decimal integer format followed by an E or an e, followed by an 
  optional sign, and then more digits. 1e2, 1.0E+2 are two examples of the 
  number 100.0 written in scientific format.           
                     
                    
        
</LI></UL>
A regular expression would be something like this:
<PRE>      . digit{digit}
      digit{digit} . [ digit{digit} ]
      digit{digit} [ . digit{digit} ] E|e [+|-] digit{digit}
      . digit{digit} E|e [+|-] digit{digit}
</PRE>
Real numbers in SAL are only in base 10.  The use of the b, o, d, and h suffixes
are illegal for reals, and should not be recognized.  The string 1.1b should be
interpreted as a real (1.1), and an identifier (b).  Real numbers in SAL default 
to double precision (64 bits).<P>

Because real numbers may begin with a period, there arises a possible ambiguity 
in distinguishing between real numbers and the period symbol, i.e., the period can 
appear by itself as a token.  SAL handles this by looking ahead to see whether or 
not the next character is a digit.  If not, then the the current token is set to a
period, and the numeric portion of the scanner is never entered.<P>

Here are state the machines representing integers and real numbers in SAL:
<CENTER>&nbsp;</CENTER><P><IMG alt="" src="Chapter 3 Lexical Analysis.files/specs-Numbers.gif">
{SPECS-INTEGER/REAL}
Notice that the state machine for real numbers also recognizes a period.  This was
included for clarity.<P>

<B>3.3.2.2 String and Character Constants</B><BR><!-------------------------------------------------------------------------------->
A character in SAL is enclosed in single quotes, like in C and C++.  In order to 
represent non-printable or non-typable characters, SAL has escape characters 
similar to those of C/C++.  There are some subtle differences, however.  These 
rules apply:
<OL>
  <LI>Standard escape sequences like <TT>\n</TT>, <TT>\t</TT>, and <TT>\"</TT> all
      behave like in C/C++.  SAL recognizes the same set of these as C and C++.
      Consult the <A href="http://topaz.cs.byu.edu/SAL/SAL_Tutorial_Expressions.html#character">SAL documentation</A> for more details.
  <LI>Characters may be represented in hexadecimal by a <TT>\x</TT> followed by
      <I>two</I> hexadecimal digits.  The number must be in the range of 00 to FF.
      Digits from A to F must be in upper case.
  <LI>Characters may be represented in decimal by a <TT>\d</TT> followed by
      <I>three</I> decimal digits.  The number must be in the range of 000 to 255.
  <LI>Characters may be represented in octal by a <TT>\</TT> followed by
      <I>three</I> octal 
  digits. The number must be in the range of 000 to 377.              
</LI></OL>
Character recognition is simple, and involves identifying a single quote
from the input stream, reading an additional character, and then verifying another
single quote from the input stream.  In the case of an escape sequence, the items
in the above list can be implemented using an FSA.  If necessary a multiply-add
algorithm can be used to convert numeric escape sequences.<P>

Characters are stored in a token as follows:
<PRE>      //*** CHARACTER
      token.sy = charconsy;
      token.data.ch = ch;

      <B>Listing {SALCHAR}</B>
</PRE>

Strings are stored a little differently in SAL.  They are read from the input 
stream into a temporary buffer.  A temporary buffer exists in <TT>SAL-Scanner.CPP</TT>:
<PRE>      #define TMPBUFFMAX 255                    // max buffer size                          
      static  CHAR tmpbuff[TMPBUFFMAX+1];       // Temp buffer used to store parsed numbers and strings
</PRE>
Probably the most straightforward way to recognize a string is to detect a double
quote, and then do the following:
<PRE>      ULONG i = stringaddr;               // a counter variable

      //*** Put the string into tmpbuff
      for(i = 0; ch!='"'; i++) {
         if(i &lt; TMPBUFFMAX)
            tmpbuff[i] = ch;
         NextCh();
      }
      NextCh();                          // eat the terminating "

      if(i &lt; TMPBUFFMAX)
         tmpbuff[i] = 0;                 // null-terminate the string
      else
         ErrorMsg("string is too long"); // or detect a string that is too long

      //*** Set up token data
      token.sy = stringsy;
      token.data.str.txt    = tmpbuff;
      token.data.str.length = i;

      <B>Listing {SALSTR}</B> General method for recognizing strings in SAL
</PRE>
This algorithm does not take into account some errors like unterminated strings,
and it also does not translate escape sequences.  Both of these things must be
accounted for in the actual implementation<P>

Here are two state machines that represent characters and strings in SAL.  They
each call a third state machine that interprets escape sequences.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-Char.gif"></CENTER><P>
Notice that this FSA can recognize two quotes with nothing in between.  This is
a null character.  Its ASCII value is zero.

<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-String.gif"></CENTER><P>
This state machine recognizes escape sequences.  With ASCII representation (e.g.,
hex or decimal) the state machine only determines whether or not the correct 
number of characters are present.  It does not check to see if the number is 
within the appropriate range.  In <TT>SAL-Scanner.CPP</TT> there is a function
to convert escape sequences called <TT>EscapeCh()</TT>.  It works with <TT>ch</TT>
and <TT>NextCh()</TT> and returns the equivalent character.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-EscSeq.gif"></CENTER><P>

<H3>3.3.3 Identifiers and Keywords</H3><!-------------------------------------------------------------------------------->
Keywords and identifiers are similar, and we can merge the operation that 
recognizes the two types of token into one routine.  The strategy is to assume 
if a token begins with a letter or an underscore it is an identifier.  If the 
identifier then happens to match one of the items from a list of known keywords, 
then it is a keyword.<P>

Identifiers in SAL can be composed of (currently up to 16) letters in either upper
or lower case, numbers, and the underscore.  Identifiers may not begin with a 
number, however.  We saw an FSA to recognize identifiers in figure {IDFSA} in 
section 3.1.3.  Detecting an identifier is simple:
<PRE>      if (isalpha(ch) || ch=='_') {        // First char is a
      alpha or _ i=                
         0;  do
         { //                              following 16 are alpha-numeric or _ if
            (i &lt; = ALFAMAX)
               token.data.id[i++] = ch;
            NextCh();
         } while (isalnum(ch) || ch=='_');
         token.data.id[i] = 0;
      }

      <B>Listing {SALID}</B>
</PRE>
In order to recognize keywords, we use a list of strings and matching token 
values, much like what was discussed in section 3.3.1.  Listing {SALKLST} shows
a string list for all of the SAL keywords, and listing {SALKEY} shows a method for
searching the list to see if the latest identifier is really a keyword.<P>
<PRE>      typedef struct tagtokensym {
         Alfa key;
         Symbol ksy;
      } tokensym;

      static tokensym kwdsy[] = {
         {"div",         divsy},               {"mod",         modsy},
         {"and",         andsy},               {"or",          orsy},
         {"not",         notsy},

         {"const",       constsy},             {"type",        typesy},
         {"is",          issy},                {"var",         varsy},
         {"array",       arraysy},             {"of",          ofsy},
         {"record",      recordsy},            {"class",       classsy},
         {"extends",     extendssy},           {"virtual",     virtualsy},
         {"shared",      sharedsy},            {"public",      publicsy},
         {"private",     privatesy},           {"constructor", constructorsy},
         {"destructor",  destructorsy},        {"super",       supersy},

         {"program",     programsy},           {"library",     librarysy},
         {"import",      importsy},            {"export",      exportsy},

         {"begin",       beginsy},             {"end",         endsy},

         {"if",          ifsy},                {"then",        thensy},
         {"elseif",      elseifsy},            {"else",        elsesy},
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -