📄 chapter 3 lexical analysis.htm

📁 英文版编译器设计：里面详细介绍啦C编译器的设计
💻 HTM
📖 第 1 页 / 共 5 页
字号:
      static _sym1sy sym1sy[] = {
         {'=       
         ',eqlsy}, {'&gt;',
         gtrsy},{'&lt;', lsssy},
         {' ', badsy}
      };

      static _sym2sy sym2sy[] = 
         {{"&lt;&gt;", neqsy},               
         {"&gt;=", geqsy},
         {"&lt;=", leqsy},
         {"",   badsy}
      };
</PRE>Comparison begins with the two-character operators, and if there is no 
match, proceeds with the single-character operators.
<P><PRE>      //*** Try the 2-char symbols.
      for(i=0; sym2sy[i].ksy!=badsy; i++) {
         if(ch == sym2sy[i].key[0] &amp;&amp; nch sym2sy[i].key[1]) {
            NextCh();                 // advance a character
            NextCh();
            token.sy = sym2sy[i].ksy;
            return;
         }
      }

      //*** Else, try the 1-char symbols.
      for(i=0; sym1sy[i].ksy!=badsy; i++) {
         if(ch == sym1sy[i].key) {
            NextCh();                 // advance a character
            token.sy = sym1sy[i].ksy;
            return;
         }
      }

      //*** Illegal character
      token.sy = badsy;
      ErrorMsg("Illegal symbol");;
      return;
</PRE>The key to making this method work is to make sure we compare to the 
longest strings first, then the shorter ones. Notice the token <TT>badsy</TT> is 
used to signal the end of each list. Iterating through the list, we compare the 
strings one by one to the characters in the input stream until we reach the end 
of the list, or a match is found.
<P>Though not as fast, this method does have two other advantages. First is that 
it is far easier to maintain. If we need to add symbols or delimeters to the 
compiler later on we just make new entries in the table. The code is so general 
that we can make such changes in the compiler without it meaning extra work. 
Second is that the code is smaller. The SAL language has many symbols and 
delimeters. The first method would result in a huge, bloated switch statement.
<P>The operators and delimeters in SAL are presented in the table {SYDEL} below: 
<PRE>        Token        Sym  Description
      =====================================================   
        compsy       ~    Bitwise not (1's compliment)
        timessy      *    Multiplication
        rdivsy       /    Division
        plussy       +    Addition
        minussy      -    Subrtaction
        bandsy       &amp;    Bitwise AND
        borsy        |    Bitwise OR
        bxorsy       %    Bitwise XOR
        eqlsy        =    Equal to
        gtrsy        &gt;    Greater than
        lsssy        &lt;    Less than
        pointersy    ^    Pointer dereference
        periodsy     .    Period
        colonsy      :    Semicolon
        lparentsy    (    Left parenthesy
        rparentsy    )    Right parenthesy
        lbracksy     [    Left bracket
        rbracksy     ]    Right bracket
        commasy      ,    Comma
        semicolonsy  ;    Semicolon
        atsy         @    Module designator

        expsy        **   Exponentiation
        bshrsy       &gt;&gt;   Bitwise shift right
        bshlsy       &lt;&lt;   Bitwise shift left
        brolsy       &lt;#   Bitwise rotate left
        brorsy       #&gt;   Bitwise rotate right
        neqsy        &lt;&gt;   Not equal to
        geqsy        &gt;=   Greater than or equal to
        leqsy        &lt;=   Less than or equal to
        becomessy    :=   Assignment
        scopesy      ::   Scope resolution operator

        bsshrsy      &gt;&gt;&gt;  Signed bitwise shift right

      <B>Table {SYDEL}</B> Symbols and delimeters in SAL.
</PRE>Here is a state machine representing each of these. 
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-Symbol.gif"></CENTER>
<H3>3.3.2 Constants</H3><!-------------------------------------------------------------------------------->The 
four types of constants fall into two different categories. Real numbers and 
integers fall into one, and strings and characters fall into the other. These 
two types are very different both in the way they are scanned and in the way 
they are represented.
<P><B>3.3.2.1 Real Numbers and Integers</B><BR><!-------------------------------------------------------------------------------->Scanning 
a number actually happens in two steps. The first step involves using te FSA to 
recognize a valid number. In this step we also save the current character in a 
buffer. This we do in preparation for phase two. If the first step passes, we 
know that we have a valid number, and we have it stored in a temporary buffer in 
ASCII form.
<P>The next step is to convert the number from ASCII to text. Upon exiting the 
FSA, we terminate the buffer with a null character. If the final state is an 
accept-state, the data in the buffer is converted into a binary form. Let's 
discuss these two steps in greater detail.
<P><B>Scanning.</B> For recognizing real numbers and integers we make use of an 
FSA. One way of implementing an FSA is by making use of a state table. A state 
table is a table where each row represents a character from the input alphabet, 
and each column represents a state. At any point in time, the the current 
character from the input stream and the current state can be used to look up a 
cell in the state table by row and column. Each cell in the state table contains 
the next state. Figure {STTRNS} demonstrates this.
<P><PRE>      Current Character: '4'
          Current State: start

                        curr st
                           |
                           V
                 Char  | start  INT   dot   REAL
                 ====================================
      curr ch--&gt; 0..9  | <B>INT</B>    INT   REAL  REAL
                 .     | Error  dot   Error Error

      <B>Figure {STTRNS}.</B>  Looking up the next state based on the current
      state and the current character.  The next state is INT.
</PRE>A token is recognized when the current state is an accept state, and the 
current character is not in the input alphabet or is otherwise invalid. For 
figure {STTRNS}, the accept states are in capitals: INT and REAL. An error 
occurs when the current state is not an accept state, and the current character 
is either invalid for the FSA or not in the input alphabet. Transitions proceed 
from state to state until one of these two conditions are met.
<P>The advantage of using state tables is that we have to force ourselves to 
consider all possible inputs at all states. This forces us to make sure we 
remember all conditions and exceptions that might exist. It is simply easier to 
verify correctness. Making a state table requires some thought and concentration 
however, and errors in a big state table can be hard to find.
<P>We already saw a table for figure {IRFSA} in figure {STTRNS}. It is shown 
again in table {IRTAB}. <PRE>        Char    start  INT   dot   real
      ====================================
        0..9    INT    INT   REAL  REAL
        .       Error  dot   Error Error

      <B>Table {IRTAB}.</B>  A state table for the FSA in figure {IRFSA}.
</PRE>In order to code up a table we need two things. First, we need a list of 
all our states as an enumerated type. See listing {IRENUM}. Notice that we also 
added an state for errors. <PRE>      typedef enum tagIRState {
         IR_START, IR_INT, IR_DOT, IR_REAL, IR_ERROR
      } IRState;

      <B>Listing {IRENUM}.</B>
</PRE>Next, we need a structure of some sort to hold the state table. A 
2-dimensional array would suit the purpose, but in turn would involve a lot of 
type-casting since some compilers will not automatically cast an enum to an int. 
What we will use is an array of structures that contain two fields. The first is 
a character, and the second is an array of next states. It looks like listing 
{IRSTRUCT}:
<P><PRE>      struct IRtab {
         char ch;
         IRState nextstate[IR_ERROR];
      };

      IRtab IRStates[] = {
         {'0', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'1', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'2', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'3', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'4', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'5', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'6', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'7', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'8', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'9', {IR_INT,   IR_INT, IR_REAL,  IR_REAL}},
         {'.', {IR_ERROR, IR_DOT, IR_ERROR, IR_ERROR}},
         {0}
      };

      <B>Listing {IRSTRUCT}</B>
</PRE>Coding a state table is quite simple. The main body is a loop, and each 
pass through the loop represents a transition from one state to the next. As a 
precondition to this piece of code, <TT>ch</TT> already contains the current 
character from the input stream. See listing {IRIMPL}. <PRE>      //*** Initialize some important variables.
      IRState state = IR_START;
      int i=0; j=0;

      while (state != IR_ERROR) {
          //*** Scan the table until the current character matches
          for (i=0; IRStates[i].ch!=0 &amp;&amp; IRStates[i].ch!=ch; i++);

          //*** If the current character is not in the input alphabet
          if (IRStates[i].ch == 0) {
             if (state != IR_REAL &amp;&amp; state != IR_INT)
                state = IR_ERROR;
             break;
          }
          else if (IRStates[i].nextstate[state] == IR_ERROR &amp;&amp; (state == IR_INT || state == IR_REAL))
             break;

          //*** Go to the next state
          state = IRStates[i].nextstate[state];

          //*** Store the current character
          buff[j++] = ch;

          //*** Get the next character
          <!--ch = ->NextCh();
      }
      buff[j] = 0;

      if (state == IR_ERROR)
         Error(ER_BADCH);
      else if (state == IR_INT) {
         //*** Convert data in buff[] to an int
         ...
      }
      else if (state == IR_REAL) {
         //*** Convert data in buff[] to a real
         ...
      }

      <b>Listing {IRIMPL}.</b>
</pre>

The current character from the input stream is compared to the state table in 
order to find the right row.  If the character is not in the state table and the 
current state is not an accept-state, then the current state is set to the 
error-state, <tt>IR_ERROR</tt>, and the computer exits the loop.  Otherwise if 
the current state is an accept state, the loop exits and the current state signals
either a <tt>IR_INT</tt>, or a <tt>IR_REAL</tt>.<p>

If the current character is found in the state table, the next state is looked up, 
and the current character is saved to a buffer.  The process repeats until the 
current character moves the FSA to an error state.<p>

The numeric portion of the SAL scanner utilizes a temporary buffer where the valid 
characters for the number are stored.  As a character is consumed, it is placed in 
this buffer.  Upon exiting from the loop, we terminate the buffer with a null 
character.  If the final state is an accept-state, the data in the buffer is 
converted into a binary form.  This process will be discussed shortly.<p><!-------------------------------------------------------------------------------->
<B>Converting ASCII integers to binary.</B>  The easiest method of performing the
task of converting integers to binary is to employ a leftshift-or (actually it's a 
multiply-add) method.  The process is straightforward.
<OL>
  <LI>Initialize the final number to zero.
  <LI>Initialize the buffer position to zero or the left most character or most significant character (Remember not to include the prefix notation 
      e.g "0x").
  <LI><B>while</B> there are characters in the buffer, iterate the buffer
      position, traversing in left to right order.
      <OL>
        <LI>Multiply our final number by its base.  In other words if we are 
            converting to binary, we multiply our final number by two.  For octal
            its eight, and so on.
        <LI>convert the current character to its binary (not ASCII) equivalent.
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -