📄 chapter 3 lexical analysis.htm

📁 英文版编译器设计：里面详细介绍啦C编译器的设计
💻 HTM
📖 第 1 页 / 共 5 页
字号:
the <I>n</I>th character in the buffer, use 
<TT>sourceline[CurrCol+<I>n</I>]</TT>. With the current implementation, it is 
safe to look only two characters ahead. Incrementing <TT>CurrCol</TT> is 
equivalent to advancing to the next character. <TT>CurrCol</TT> should not be 
incremented beyond <TT>LineLen</TT>.
<P>The last character in the buffer is found at <TT>sourceline[LineLen]</TT>. 
The only way to force the input interface to get the next line of source is to 
set <TT>CurrCol</TT> to <TT>LineLen</TT> and then call <TT>NextCh()</TT>.
<P>
<MENU></MENU>The buffer is already properly managed by <TT>NextCh()</TT>, and 
should never be touched unless absolutely necessary. The function 
<TT>NextCh()</TT>manages all the features of the input interface, including the 
following: 
<OL>
  <LI>Buffering the input and providing a continuous stream of characters to the 
  scanner, 
  <LI>Providing a line-by-line printout of the source program. This includes 
  line numbers and the PC offset of the current line, 
  <LI>Provide facilities for formatting error message output, particularly the 
  proper positioning of the error on the line, so the user can see where an 
  error occurs on a line, and not just the line where an error occured. </LI></OL>
<P>
<H3>3.2.2 Token Representation</H3><!-------------------------------------------------------------------------------->A 
token is a basic symbol within a language. As an analogy, tokens are to a 
computer language as words are to any human language. The process of scanning is 
recognizing these basic symbols as they appear in the input stream, and building 
an internal structure that holds all their data (if there is any)
<P>Each token is represented by a single enumerated value. There are at present, 
97 unique tokens within the SAL language. Most of these are symbols (delimeters) 
and operators. They are listed in listing {SALTOK}. <PRE>      typedef enum {
         badsy = -1,

      //*** Factor types
         identsy, intconsy, realconsy, charconsy, stringsy,

      //*** Operators (in order of precidence)             
         expsy,                                            //  Arithmetic:
         compsy, notsy,
         timessy, rdivsy, divsy, modsy,
         plussy, minussy,
         
         bshrsy, bsshrsy, bshlsy, brorsy, brolsy,          // Bitwise
         bandsy,
         borsy, bxorsy,
         
         eqlsy, neqsy, gtrsy, lsssy, geqsy, leqsy,         // Boolean/comparison
         andsy,
         orsy,
         
         becomessy,                                        // Assignment

      //*** Symbols and delimiters
         lparentsy, rparentsy, lbracksy, rbracksy, commasy, 
         semicolonsy, periodsy, colonsy, scopesy, atsy,

      //*** Declaration keywords
         constsy, typesy,issy, varsy, pointersy,
         arraysy,ofsy,
         recordsy,
         classsy,extendssy,virtualsy,sharedsy, publicsy,privatesy, 
         constructorsy,destructorsy, supersy,

      //*** import/export keywords
         importsy, exportsy,

      //*** Statement keywords
         ifsy, thensy, elseifsy,elsesy,
         forsy,tosy,bysy,
         untilsy,whilesy,dosy,loopsy, continuesy,breaksy,
         selectsy,fromsy,casesy,
         trysy,catchsy,finallysy,throwsy,
         tronsy,troffsy, readsy,writesy, returnsy,

      //*** Function/module keywords
         procsy, funcsy, istreamsy,ostreamsy, netinsy, netoutsy,
         beginsy, endsy, programsy, librarysy,
         eopsy
      } Symbol;

      <B>Listing {SALTOK}</B>
</PRE>At this level we will not concern ourselves with all 97 of these. The SAL 
language has developed far beyond the limits of one or two semesters' work.
<P>The current token is usually stored in a structure of some sort. In an 
object-oriented design it could be retrieved through an access method or as a 
member of an instance of a scanner object. In the SAL implementation that we are 
developing in this book, the current token exists as a global instance of 
<TT>struct&nbsp;Token</TT>. The structure of a token is shown in listing 
{TKSTRCT} <PRE>//*** token class *********************
struct Token {
   Symbol   sy;            // last symbol read by GetNextToken
   union {
      Alfa     id;         // actual identifier from GetNextToken
      char     ch;
      QWORD    num;        // number values are ALWAYS unsigned
      DOUBLE   rnum;       // Same goes for real numbers
      struct {
         char *txt;        // last char or string from GetNextToken
         WORD length;      // string length
      } str;
   } data;
   CHAR base;
   long number;

   Token(Token &amp;other);    // copy constructor
   Token();                // default constructor
};

      <B>Listing {TKSTRCT}</B>
</PRE>Some tokens require additional data. For instance, it does little good if 
the current token is <TT>stringsy</TT> and you do not know anything about the 
string itself. The two classes of tokens that have an associated value are 
constants and identifiers.
<P>Aside from the two constructors, this structure contains a host of fields. 
Let's go through these one by one.
<P>
<MENU><B><TT>sy</TT></B> The enum value of the current token is kept here. It 
  is one of the values from <TT>enum symbol</TT> from listing {SALTOK}
  <P><B><TT>union data.</TT></B> This field holds the extra data for the current 
  token. It has several fields. Which one is valid depends upon the value of 
  <TT>sy</TT>. 
  <MENU><B><TT>alfa id</TT></B>. An <TT>alfa</TT> is a character array of 16 
    elements. If <TT>sy</TT> is equal to <TT>identsy</TT>, this field holds the 
    name of the current identifier.<BR><B><TT>char ch</TT></B> If <TT>sy</TT> is 
    equal to <TT>charconsy</TT>, then this field is valid, and contains the 
    character retrieved by the scanner.<BR><B><TT>QWORD num</TT></B> This field 
    is valid when <TT>sy</TT> is equal to <TT>intconsy</TT>, and it contains the 
    number that was recognized by the scanner. This structure holds a 64 bit 
    integer. The number held here is <I>unsigned</I>.<BR><B><TT>DOUBLE 
    rnum</TT></B> When <TT>sy</TT> is equal to <TT>realconsy</TT>, this field 
    contains the value of the real number that was scanned in. The number held 
    here is <I>unsigned</I>.<BR><B><TT>struct str</TT></B> This struct contains 
    two items. The first is a pointer to the string (within the string table), 
    and the second is the length of the string. It is valid when <TT>sy</TT> 
    equals <TT>stringsy</TT><BR></MENU>
  <P><B><TT>CHAR base</TT></B> This field works only when <TT>sy</TT> equals 
  <TT>intconsy</TT> It holds the base of the number scanned in, and it is used 
  in resolving the difference between signed and unsigned integers when there is 
  an ambiguity.
  <P><B><TT>long number</TT></B> This field increments with each new token. Its 
  pupose is to let the compiler know when it has gotten stuck in a loop during 
  error-detection.
  <P></P></MENU>You will notice that the fields <TT>num</TT> and <TT>rnum</TT>. 
contain an unsigned number. The scanner does not try to detect sign in a number, 
like <PRE>      -1.5
</PRE>The scanner should interpret this line as two separate tokens, which would 
be <TT>minussy</TT>, and <TT>reealsy</TT> with a value of 1.5.
<P>
<H2>3.3 Scanning SAL Tokens</H2><!-------------------------------------------------------------------------------->By 
now, we can begin to piece together the entire puzzle and the method of 
implementation can begin to take form. We know that FSAs are useful for 
recognizing a language's various tokens from the input stream. We know that the 
accept state outputs one of these tokens once it is recognized, and the output 
is in the form of a <TT>token</TT> struct, from listing {TKSTRCT}. As far as 
code goes, we have seen the SAL input interface, which consists of <TT>ch</TT>, 
<TT>nch</TT>, and <TT>NextCh()</TT>. We have also seen the <TT>Token</TT> 
structure, and have a basic understanding of what its fields are used for. We 
are now ready to discuss writing a scanner.
<P>There are four different classes of tokens. Each class is distinguished by 
the type of additional data that it carries, if there is any. 
<OL>
  <LI><B>Symbols and Delimiters.</B> These are the simplest class of tokens. 
  They consist of non-alphanumeric characters, and they are between one to three 
  characters in length. These tokens are scattered throughout the enum in 
  listing {SALTOK}. 
  <LI><B>Constants.</B> There are several types of these. There are numeric 
  constants--both integer and real, there are character constants, and there are 
  strings. The type of constants that we are refering to are <I>literal</I> 
  constants, like 3.1415 or 1000. The other type of constants, like <TT>pi</TT> 
  and <TT>ARRAYMAX</TT> are symbolic, and are dealt with at the symbol table 
  level. Some other examples of literal constants are <TT>123</TT>, 
  <TT>3.1415</TT>, <TT>'A'</TT>, and <TT>"Hello, world!"</TT>. The symbols 
  <TT>intconsy</TT>, <TT>realconsy</TT>, <TT>charconsy</TT>, and 
  <TT>stringsy</TT> correspond to these four types, respectively. 
  <LI><B>Identifiers.</B> Identifiers in SAL are like those of C. They can 
  consist of (currently up to 16) letters, numbers, and an the underscore 
  character. They cannot begin with a number. Some SAL identifiers are 
  <TT>abc</TT>, <TT>_ABC</TT>, and <TT>abc123</TT>. The string <TT>123abc</TT> 
  would be handled by the scanner as a literal constant <TT>123</TT> and an 
  identifier <TT>abc</TT>. Identifiers are represented by the symbol, 
  <TT>identsy</TT> of listing {SALTOK} 
  <LI><B>Keywords.</B> Keywords are a special class of identifier, which has a 
  specific meaning within the language, itself. Some examples are 
  <TT>program</TT>, <TT>if</TT>, and <TT>while</TT>. Some items recognized by 
  the compiler like <TT>int</TT>, <TT>card</TT>, <TT>real</TT>, and 
  <TT>char</TT> are not keywords , but are <I>intrinsic types</I>. Though 
  technically keywords, they are not treated as such; they are treated like 
  identifiers. This will make sense later on. </LI></OL>
<P>In designing each section of code that recognizes tokens, it is important to 
remember that, regardless of the input alphabet for a given FSA, we are working 
with an input stream that contains characters from the entire ASCII set. All 
portions of code for each of the four classes of tokens must adhere to two 
rules: 
<OL>
  <LI>FSAs recognize characters until they find a character that is not within 
  their input alphabet, then they stop. 
  <LI>FSAs signal an error iff they have not stopped in an accept state. 
</LI></OL>The reason for these two rules, in particular the second is that the 
offending character might be a legitimate part of the next token.
<P>
<H3>3.3.1 Symbols and Delimeters</H3><!-------------------------------------------------------------------------------->There 
are two methods to scanning symbols and delimeters. The first involves the use 
of a switch statement, and the second involves the use of string tables. For 
demonstration, let us suppose that we need to scan for all of the comparison 
operators, i.e., &gt;, &lt;, &gt;=, &lt;=, =, and &lt;&gt;.
<P>The first method involves a switch statement. This method works well for FSAs 
that take on a tree shape. An FSA to recognize these tokens is presented in 
figure {CMPFSA}. 
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/CMPFSA.gif"></CENTER>For this 
FSA, we can easily write an acceptor: <PRE>      switch (ch) {
         case '&gt;':
            token.sy = gtrsy;
            if (nch == '=')
               token.sy = geqsy;
            break;

         case '&lt;':
            token.sy = lsssy;
            if (nch == '=')
               token.sy = leqsy;
            else if (nch == '&gt;')
               token.sy = neqsy;
            break;

         case '=':
            token.sy = eqlsy;
            break;

         default:
            token.sy = badsy;
            ErrorMsg("Illegal symbol");;
            break;
      }
      return;

      <B>Listing {CMPC1}</B>
</PRE>This is perhaps the easiest of all methods for coding an acceptor for 
operators and delimiters. The process is quite straightforward. Each transition 
is a case or an if. When a decision point is reached, the proper field of 
<TT>token</TT> is filled.
<P>Another method, which is perhaps a little more involved makes use of string 
tables, and compares the next and current characters to items in a table of 
strings. Here are the tables for our example.
<P><PRE>      struct _sym1sy {
         char key;
         symbol ksy;
      };
         
      struct _sym2sy {
         char key[3];
         symbol ksy;
      };
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -