<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0062)http://topaz.cs.byu.edu/text/html/Textbook/Chapter3/index.html -->
<HTML><HEAD><TITLE>Chapter 3: Lexical Analysis</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1458" name=GENERATOR></HEAD>
<BODY>
<CENTER>
<H1>Chapter 3<BR>Lexical Analysis</H1></CENTER>
<HR>
<!-------------------------------------------------------------------------------->The
Scanner is that portion of the compiler that performs lexical analysis. For the
most part, the scanner is the easiest part to implement. Despite its simplicity,
there is some basic underlying theory involved, and some practical techniques
that will save the programmer some steps, and a lot of frustration in the long
run. Although its common name is the scanner, you will also hear the terms
tokenizer or simply lexical analyzer (for lack of a better name). These all
refer to that same part of the compiler that performs lexical analysis, i.e.,
converting raw text into the basic language elements called tokens.
<P>The scanner's job is to take a stream of characters and break it up into
words that are recognized by the parser (the next component). These words are
commonly called <I>tokens</I>. A token may be composed of a single character or
a sequence of characters. Examples of tokens are identifiers (names of variables
defined by the programmer), words that have a special meaning
in the language, such as begin and end (the keywords of the language), numbers,
and special character sequences (such as := in SAL or Pascal). The lexical
analyzer also eliminates from the source text any comments that may exist in the
program, as well as all whitespace.
<P>
<H2>3.1 Overview of a scanner</H2><!-------------------------------------------------------------------------------->Most
texts refer to a part of the compiler that is usually called the <I>compiler
interface</I>. This is a nebulous amalgamation of various basic systems, and is
usually a mixture of these things:
<OL>
<LI>Something that takes the source text to be compiled and buffers it,
<LI>Something to produce a listing of the program for error reporting,
<LI>Something to report and keep track of errors,
<LI>Something to output the final machine code object file. </LI></OL>We will
make the distinction, however, that these are all just four of many different
subcomponents within the compiler. Although the scanner is separate from this
interface, it relies directly upon that portion that receives and buffers the
source program. We will call this portion the input interface. The job of the
input interface is threefold:
<OL>
<LI>Receive the raw source file, and provide it to the scanner as a continuous
stream of characters. In other words, the scanner should not have to deal with
the file or input device directly. To the scanner, the source should come as a
continuous stream.
<LI>Provide an output source listing. This is usually a file, although it can
also be dumped to the screen (perhaps <I>stdout</I>). It contains the original
source, line numbers, and other data such as the date and time, the program
counter offset for each line, or any other relevant information.
<LI>Manage the formatting of error reports in the source listing. This means
providing visual cues like the position or column of the line where the error
was detected. It will be the job of the error reporting and recovery system to
actually write errors. This portion just formats their output. </LI></OL>
<P>The basic process of lexical analysis is shown in figure {SCNPRS}. The input
interface takes the source file, buffers it internally, and provides it to the
scanner as a continuous stream of characters.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/SCNPRS.gif"></CENTER>The
scanner then takes the characters, and using a set of internal methods produces
the basic language elements, called tokens, in a continuous stream.
<P>
<H3>3.1.1 Regular Grammars</H3><!-------------------------------------------------------------------------------->Before
proceeding, let's revisit a topic from an earlier course, just as a refresher. We
learned in the course on discrete structures and computational theory that a
certain class of grammars called regular grammars embodies the simplest of all
languages. These are the languages whose productions are of the form,
<MENU>
<MENU><B><I>A = aB<BR>A = a </I></B></MENU></MENU>In other words, on the
left-hand side are only non-terminals, and on the right-hand side is a terminal
followed by an optional single non-terminal. The form shown here is said to be
<I>right linear</I>, in that recognition of a word in that grammar proceeds from
left to right.
<P>Determinism is an extremely important factor in regular grammars. Determinism
means that there is no ambiguity in the grammar. Often in the design of a
regular grammar, we make a rule or a set of rules where the computer has to
"just make the right decision". The appropriate decision at the time might seem
obvious to us--whose brains are so much more powerful at recognizing
patterns--but to the computer the choice is one of great significance. More
often than not the computer will make the wrong choice. An example of an
ambiguous grammar would be
<MENU>
<MENU><B><I>A = aB<BR>A = aC<BR>B = b<BR>C = c<BR></I></B></MENU></MENU>In this
example there is an ambiguity between the first two rules. We cannot say that
<B><I>A = aB1</I></B> and <B><I>A = aB2</I></B>, where <B><I>B1</I> &ne;
<I>B2</I></B>. Upon recognizing an <I>a</I>, there is an ambiguity as to which
production should come next. To correct this problem, create a new state, and
merge the rules together.
<MENU>
<MENU><B><I>A = aX<BR>X = b<BR>X = c<BR></I></B></MENU></MENU>In short, a
grammar is said to be deterministic if and only if its present state and the
next symbol to be accepted uniquely determine the next state.
<P>
<H3>3.1.2 Finite State Automata</H3><!-------------------------------------------------------------------------------->It
can be shown also that regular grammars are functionally equivalent to Finite
State Automata (FSA). An FSA is a diagrammatic representation of a regular
grammar. In this text, it is important that we explain the notation that we will
use for all state diagrams. The reader should be otherwise familiar with these.
<P>
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/STLEG.gif"></CENTER>There
exists for all regular grammars and FSAs an input alphabet. Although there could
be any number of other characters, an FSA only recognizes a select set of these.
If a character is encountered that is not part of the FSA's input alphabet, then
it is an error. Error handling will be discussed in a bit.
<P>States are designated by circles, which may or may not have a label. The
state with a thick arrow pointing to it is the start state. States with a double
circle are called accept states, and for our purposes will signal the successful
recognition of a token.
<P>At any given point in time there is a current state. It begins with the start
state. State transitions work like this:
<OL>
<LI type=A><B>while</B> the current character is in the input alphabet
<OL>
<LI><B>if</B> the current character matches the label for one of the edges
of the current state then
<OL>
<LI type=a>the current state becomes the state pointed to by that edge
<LI type=a>the current character is "consumed" (in other words, we move to
the next character in the input stream). </LI></OL>
<LI><B>else</B> break out of the loop </LI></OL>
<LI type=A><B>if</B> the current state is an accept state then <B>success</B>
else <B>fail</B> </LI></OL>A transition along an edge labeled with a <I>?</I>
may happen if and only if the current character fails to match any other edge
leading from the current state, and the current character is within the input
alphabet. A transition along an edge labeled as an empty move may happen if and
only if there is no edge labeled <I>?</I>, or if the current character is
<I>not</I> a member of the input alphabet. In other words, an empty transition
has lower priority than a <I>?</I>-transition.
<P>A token is recognized only under the following two conditions:
<OL>
<LI>The current state is an accept state.
<LI>The current character does not match any edge, excluding empty
transitions. </LI></OL>If these two conditions are met, then the FSA will output
a valid token, and then perform an empty transition back to the start state.
<P>If there are no matching edges for the current character, or if the current
character is not a member of the input alphabet, then a number of precise steps
are followed.
<OL>
<LI>An error is signaled. This is not represented by our notation, but it
would be equivalent to an empty transition to a state labeled <I>error</I>,
and an appropriate error message is given.
<LI>The current state is reset to the start state (another empty transition).
This is equivalent to saying that the FSA failed to recognize the previous
sequence of characters as anything valid, and is starting over.
<LI>While the current character does not match any edges from the start state,
repeat:
<OL type=A>
<LI>Consume the current character </LI></OL>
<LI>At this point, the FSA operates as normal. </LI></OL>As an example we will
create an FSA to recognize simple real numbers. We first have to decide in some
detail what form the real numbers can take on. For simplicity's sake, we will
say that they only need to be of the format 0.0. Next we should completely
determine the input alphabet. For real numbers, it would be {0..9, .}. The
FSA would look like figure {RNUMSA}. It has four states.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/RNUMFSA.gif"></CENTER>
<H3>3.1.3 Combining FSAs</H3><!-------------------------------------------------------------------------------->The
essence of a scanner is an array of FSAs combined in a logical-OR fashion. Let's
look at how this is done. We shall combine the FSA in figure {RNUMFSA} with
another that recognizes identifiers. In SAL, an identifier can be composed of
letters, numbers, and the underscore character, but it must not begin with a
number. An FSA for SAL identifiers is presented in figure {CIDFSA}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/IDFSA.gif"></CENTER>Combining
two FSAs is a four-step process:
<OL>
<LI><B>Combine the input alphabets.</B> This is a logical union. In this
example, the input alphabet for the real number FSA is {0..9, .}, and the
input alphabet for the identifier FSA is {A..Z, a..z, _, 0..9}. Combining
these two gives us {A..Z, a..z, _, 0..9, .}.
<LI><B>Combine all start states.</B> Combine all start states into one start
state. If the start state for one or more of the FSAs is also an accept state,
then the new start state will also become an accept state. Obviously, two or
more FSAs whose start states are accept states cannot be combined.
<LI><B>Combine states with identical edges.</B> This process removes any
non-deterministic transitions, and will be explained shortly.
<LI><B>Provide a return path</B>. There needs to be a return path from all
accept states to the new start state. The return path should be marked as an
empty transition. </LI></OL>In order for this demonstration to make a little
more sense, let's add another very basic FSA: one that accepts blanks. See
figure {BLNKFSA}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/BLNKFSA.gif"></CENTER>These
three combined gives us an FSA like the one shown in figure {COMBFSA}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/COMBFSA.gif"></CENTER>At
times there may be stark ambiguities in the combination of FSAs. Consider the
case where we combine an FSA to recognize real numbers like figure {RNUMFSA},
and another to recognize integers:
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/INTFSA.gif"></CENTER>We now
have a situation similar to the one described in section 3.1.1. Combining these
two as previously explained would give us an FSA that has two edges with the
same label. This problem is fixed by combining the states. If one of the states
is an accept state, then the new state will also be an accept state.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/IRFSA.gif"></CENTER>This is a
simple example. Depending on the way the FSA is composed, it may need to be
rewritten.
<P>
<H2>3.2 SAL Scanner Overview</H2><!-------------------------------------------------------------------------------->In
order to implement a scanner as a project for this text, several tools and
helper routines are provided. We will first cover a list of provided functions and data types, and
then discuss the general algorithm and some practical methods for scanning and
recognizing all SAL tokens.
<P>The scanner is embodied in a single function, called <TT>GetNextToken()</TT>.
This function sets the fields of a global variable (a record) called
<TT>token</TT>. These two items are made globally available within the compiler.
<P>To get an idea of where the scanner sits in the scheme of services offered,
see the diagram in figure {SCNRDIG}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/SCNRDIG.gif"></CENTER>As we
can see, the scanner interacts with three different components, the input
interface, the parser, and the token queue. To communicate with the input
interface, the scanner has access to the character stream through a function
that gets the next character, and the scanner also has access to the buffer itself,
if the need arises. In order to communicate with the parser, the scanner fills
in the appropriate fields of the global variable, <TT>token</TT>. In order to
communicate with the token queue, the scanner has a way to see if the queue
should be used, and if so, another way to dequeue the next token. We will
discuss the token queue later on.
<P>
<H3>3.2.1 The SAL Input Interface.</H3><!-------------------------------------------------------------------------------->As
discussed in section 3.1, the input interface's job is to manage the source
file, buffer its contents, and provide them to the scanner as a continuous
stream. The scanner should never have to worry about getting the next line, and
should not be concerned about whether or not the buffer is full or empty. The
input interface can be described quite succinctly: it consists of three items.
<MENU>
<TABLE>
<TBODY>
<TR vAlign=top>
<TD width="20%"><B><TT>ch</TT></B></TD>
<TD>This is a <TT>char</TT>. It is a global variable containing the
current character from the input stream. At startup, <TT>ch</TT> is
initialized to a space. This tricks <TT>GetNextToken()</TT> into
thinking that it is scanning whitespace, which it knows to skip over.
<P></P></TD></TR>
<TR vAlign=top>
<TD><B><TT>nch</TT></B></TD>
<TD>This is (effectively) a <TT>char</TT>. It is really a macro that
returns the character following <TT>ch</TT> in the stream. It is used to
peek ahead one character.
<P></P></TD></TR>
<TR vAlign=top>
<TD><B><TT>NextCh()</TT></B></TD>
<TD>This function gets the next character and assigns it to <TT>ch</TT>.
<P></P></TD></TR></TBODY></TABLE></MENU>These three items should be all that is
needed to write a decent scanner. In short, the current character is
<TT>ch</TT>, to peek one character ahead we use <TT>nch</TT>, and to get the
next character we use <TT>NextCh()</TT>. Nevertheless, it is necessary at times
to work directly with the buffer, although this should be done with caution. The
following four items work with it: <PRE> #define SOURCELINEMAX 255 // max size of the buffer
static char sourceline[SOURCELINEMAX+2]; // last line read from source file
static WORD LineLen = 0; // length of current line
WORD CurrCol = 0; // Current position in the character buffer
</PRE>In order to look ahead one character, a macro called <TT>nch</TT> has been
defined. This is the preferred method for looking ahead. In order to look for
</BODY></HTML>