<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0062)http://topaz.cs.byu.edu/text/html/Textbook/Chapter3/index.html -->
<HTML><HEAD><TITLE>Chapter 3: Lexical Analysis</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1458" name=GENERATOR></HEAD>
<BODY>
<CENTER>
<H1>Chapter 3<BR>Lexical Analysis</H1></CENTER>
<HR>
<!-------------------------------------------------------------------------------->The
Scanner is that portion of the compiler that performs lexical analysis. For the
most part, the scanner is the easiest part to implement. Despite its simplicity,
there is some basic underlying theory involved, and some practical techniques
that will save the programmer some steps, and a lot of frustration in the long
run. Although its common name is the scanner, you will also hear the terms
tokenizer or simply lexical analyzer (for lack of a better name). These all
refer to that same part of the compiler that performs lexical analysis, i.e.,
converting raw text into the basic language elements called tokens.
<P>The scanner's job is to take a stream of characters and break it up into
words that are recognized by the parser (the next component). These words are
commonly called <I>tokens</I>. A token may be composed of a single character or
a sequence of characters. Examples of tokens are identifiers (names of variables
defined by the programmer), words that have a special meaning
in the language, such as begin and end (the keywords of the language), numbers,
and special character sequences (such as := in SAL or Pascal). The lexical
analyzer also eliminates from the source text any comments that may exist in the
program, as well as all whitespace.
<P>
<H2>3.1 Overview of a scanner</H2><!-------------------------------------------------------------------------------->Most
texts refer to a part of the compiler that is usually called the <I>compiler
interface</I>. This is a nebulous amalgamation of various basic systems, and is
usually a mixture of these things:
<OL>
<LI>Something that takes the source text to be compiled and buffers it,
<LI>Something to produce a listing of the program for error reporting,
<LI>Something to report and keep track of errors,
<LI>Something to output the final machine code object file. </LI></OL>We will
make the distinction, however, that these are all just four of many different
subcomponents within the compiler. Although the scanner is separate from this
interface, it relies directly upon that portion that receives and buffers the
source program. We will call this portion the input interface. The job of the
input interface is threefold:
<OL>
<LI>Receive the raw source file, and provide it to the scanner as a continuous
stream of characters. In other words, the scanner should not have to deal with
the file or input device directly. To the scanner, the source should come as a
continuous stream.
<LI>Provide an output source listing. This is usually a file, although it can
also be dumped to the screen (perhaps <I>stdout</I>). It contains the original
source, line numbers, and other data such as the date and time, the program
counter offset for each line, or any other relevant information.
<LI>Manage the formatting of error reports in the source listing. This means
providing visual cues like the position or column of the line where the error
was detected. It will be the job of the error reporting and recovery system to
actually write errors. This portion just formats their output. </LI></OL>
<P>The basic process of lexical analysis is shown in figure {SCNPRS}. The input
interface takes the source file, buffers it internally, and provides it to the
scanner as a continuous stream of characters.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/SCNPRS.gif"></CENTER>The
scanner then takes the characters, and using a set of internal methods produces
the basic language elements, called tokens, in a continuous stream.
<P>
<H3>3.1.1 Regular Grammars</H3><!-------------------------------------------------------------------------------->Before
proceeding, let's revisit a topic from an earlier course, just as a refresher. We
learned in the course on discrete structures and computational theory that a
certain class of grammars called regular grammars embodies the simplest of all
languages. These are the languages whose productions are of the form,
<MENU>
<MENU><B><I>A = aB<BR>A = a </I></B></MENU></MENU>In other words, on the
left-hand side are only non-terminals, and on the right-hand side is a terminal
followed by an optional single non-terminal. The form shown here is said to be
<I>right linear</I>, in that recognition of a word in that grammar proceeds from
left to right.
<P>Determinism is an extremely important factor in regular grammars. Determinism
means that there is no ambiguity in the grammar. Often in the design of a
regular grammar, we make a rule or a set of rules where the computer has to
"just make the right decision". The appropriate decision at the time might seem
obvious to us--whose brains are so much more powerful at recognizing
patterns--but to the computer the choice is one of great significance. More
often than not the computer will make the wrong choice. An example of an
ambiguous grammar would be
<MENU>
<MENU><B><I>A = aB<BR>A = aC<BR>B = b<BR>C = c<BR></I></B></MENU></MENU>In this
example there is an ambiguity between the first two rules. We cannot say that
<B><I>A = aB1</I></B> and <B><I>A = aB2</I></B>, where <B><I>B1</I> &ne;
<I>B2</I></B>. Upon recognizing an <I>a</I>, there is an ambiguity as to which
production should come next. To correct this problem, create a new state, and
merge the rules together.
<MENU>
<MENU><B><I>A = aX<BR>X = b<BR>X = c<BR></I></B></MENU></MENU>In short, a
grammar is said to be deterministic if and only if its present state and the
next symbol to be accepted uniquely determine the next state.
<P>
<H3>3.1.2 Finite State Automata</H3><!-------------------------------------------------------------------------------->It
can be shown also that regular grammars are functionally equivalent to Finite
State Automata (FSA). An FSA is a diagrammatic representation of a regular
grammar. In this text, it is important that we explain the notation that we will
use for all state diagrams. The reader should be otherwise familiar with these.
<P>
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/STLEG.gif"></CENTER>There
exists for all regular grammars and FSAs an input alphabet. Although there could
be any number of other characters, an FSA only recognizes a select set of these.
If a character is encountered that is not part of the FSA's input alphabet, then
it is an error. Error handling will be discussed in a bit.
<P>States are designated by circles, which may or may not have a label. The
state with a thick arrow pointing to it is the start state. States with a double
circle are called accept states, and for our purposes will signal the successful
recognition of a token.
<P>At any given point in time there is a current state. It begins with the start
state. State transitions work like this:
<OL>
<LI type=A><B>while</B> the current character is in the input alphabet
<OL>
<LI><B>if</B> the current character matches the label for one of the edges
of the current state then
<OL>
<LI type=a>the current state becomes the state pointed to by that edge
<LI type=a>the current character is "consumed" (in other words, we move to
the next character in the input stream). </LI></OL>
<LI><B>else</B> break out of the loop </LI></OL>
<LI type=A><B>if</B> the current state is an accept state then <B>success</B>
else <B>fail</B> </LI></OL>A transition along an edge labeled with a <I>?</I>
may happen if and only if the current character fails to match any other edge
leading from the current state, and the current character is within the input
alphabet. A transition along an edge labeled as an empty move may happen if and
only if there is no edge labeled <I>?</I>, or if the current character is
<I>not</I> a member of the input alphabet. In other words, an empty transition
has lower priority than a <I>?</I>-transition.
<P>A token is recognized only under the following two conditions:
<OL>
<LI>The current state is an accept state.
<LI>The current character does not match any edge, excluding empty
transitions. </LI></OL>If these two conditions are met, then the FSA will output
a valid token, and then perform an empty transition back to the start state.
<P>If there are no matching edges for the current character, or if the current
character is not a member of the input alphabet, then a number of precise steps
are followed.
<OL>
<LI>An error is signaled. This is not represented by our notation, but it
would be equivalent to an empty transition to a state labeled <I>error</I>,
and an appropriate error message is given.
<LI>The current state is reset to the start state (another empty transition).
This is equivalent to saying that the FSA failed to recognize the previous
sequence of characters as anything valid, and is starting over.
<LI>While the current character does not match any edges from the start state,
repeat:
<OL type=A>
<LI>Consume the current character </LI></OL>
<LI>At this point, the FSA operates as normal. </LI></OL>As an example we will
create an FSA to recognize simple real numbers. We first have to decide in some
detail what form the real numbers can take on. For simplicity's sake, we will
say that they only need to be of the format 0.0. Next we should completely
determine the input alphabet. For real numbers, it would be {0..9, .}. The
FSA would look like figure {RNUMSA}. It has four states.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/RNUMFSA.gif"></CENTER>
<H3>3.1.3 Combining FSAs</H3><!-------------------------------------------------------------------------------->The
essence of a scanner is an array of FSAs combined in a logical-OR fashion. Let's
look at how this is done. We shall combine the FSA in figure {RNUMFSA} with
another that recognizes identifiers. In SAL, an identifier can be composed of
letters, numbers, and the underscore character, but it must not begin with a
number. An FSA for SAL identifiers is presented in figure {CIDFSA}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/IDFSA.gif"></CENTER>Combining
two FSAs is a four-step process:
<OL>
<LI><B>Combine the input alphabets.</B> This is a logical union. In this
example, the input alphabet for the real number FSA is {0..9, .}, and the
input alphabet for the identifier FSA is {A..Z, a..z, _, 0..9}. Combining
these two gives us {A..Z, a..z, _, 0..9, .}.
<LI><B>Combine all start states.</B> Combine all start states into one start
state. If the start state for one or more of the FSAs is also an accept state,
then the new start state will also become an accept state. Obviously, two or
more FSAs whose start states are accept states cannot be combined.
<LI><B>Combine states with identical edges.</B> This process removes any
non-deterministic transitions, and will be explained shortly.
<LI><B>Provide a return path</B>. There needs to be a return path from all
accept states to the new start state. The return path should be marked as an
empty transition. </LI></OL>In order for this demonstration to make a little
more sense, let's add another very basic FSA: one that accepts blanks. See
figure {BLNKFSA}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/BLNKFSA.gif"></CENTER>These
three combined gives us an FSA like the one shown in figure {COMBFSA}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/COMBFSA.gif"></CENTER>At
times there may be stark ambiguities in the combination of FSAs. Consider the
case where we combine an FSA to recognize real numbers like figure {RNUMFSA},
and another to recognize integers:
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/INTFSA.gif"></CENTER>We now
have a situation similar to the one described in section 3.1.1. Combining these
two as previously explained would give us an FSA that has two edges with the
same label. This problem is fixed by combining the states. If one of the states
is an accept state, then the new state will also be an accept state.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/IRFSA.gif"></CENTER>This is a
simple example. Depending on the way the FSA is composed, it may need to be
rewritten.
<P>
<H2>3.2 SAL Scanner Overview</H2><!-------------------------------------------------------------------------------->In
order to implement a scanner as a project for this text, several tools and
helper routines are provided. We will first cover a list of provided functions and data types, and
then discuss the general algorithm and some practical methods for scanning and
recognizing all SAL tokens.
<P>The scanner is embodied in a single function, called <TT>GetNextToken()</TT>.
This function sets the fields of a global variable (a record) called
<TT>token</TT>. These two items are made globally available within the compiler.
<P>To get an idea of where the scanner sits in the scheme of services offered,
see the diagram in figure {SCNRDIG}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/SCNRDIG.gif"></CENTER>As we
can see, the scanner interacts with three different components, the input
interface, the parser, and the token queue. To communicate with the input
interface, the scanner has access to the character stream through a function
that gets the next character, and the scanner also has access to the buffer itself,
if the need arises. In order to communicate with the parser, the scanner fills
in the appropriate fields of the global variable, <TT>token</TT>. In order to
communicate with the token queue, the scanner has a way to see if the queue
should be used, and if so, another way to dequeue the next token. We will
discuss the token queue later on.
<P>
<H3>3.2.1 The SAL Input Interface.</H3><!-------------------------------------------------------------------------------->As
discussed in section 3.1, the input interface's job is to manage the source
file, buffer its contents, and provide them to the scanner as a continuous
stream. The scanner should never have to worry about getting the next line, and
should not be concerned about whether or not the buffer is full or empty. The
input interface can be described quite succinctly: it consists of three items.
<MENU>
<TABLE>
<TBODY>
<TR vAlign=top>
<TD width="20%"><B><TT>ch</TT></B></TD>
<TD>This is a <TT>char</TT>. It is a global variable containing the
current character from the input stream. At startup, <TT>ch</TT> is
initialized to a space. This tricks <TT>GetNextToken()</TT> into
thinking that it is scanning whitespace, which it knows to skip over.
<P></P></TD></TR>
<TR vAlign=top>
<TD><B><TT>nch</TT></B></TD>
<TD>This is (effectively) a <TT>char</TT>. It is really a macro that
returns the character following <TT>ch</TT> in the stream. It is used to
peek ahead one character.
<P></P></TD></TR>
<TR vAlign=top>
<TD><B><TT>NextCh()</TT></B></TD>
<TD>This function gets the next character and assigns it to <TT>ch</TT>.
<P></P></TD></TR></TBODY></TABLE></MENU>These three items should be all that is
needed to write a decent scanner. In short, the current character is
<TT>ch</TT>, to peek one character ahead we use <TT>nch</TT>, and to get the
next character we use <TT>NextCh()</TT>. Nevertheless, it is necessary at times
to work directly with the buffer, although this should be done with caution. The
following four items work with it: <PRE> #define SOURCELINEMAX 255 // max size of the buffer
static char sourceline[SOURCELINEMAX+2]; // last line read from source file
static WORD LineLen = 0; // length of current line
WORD CurrCol = 0; // Current position in the character buffer
</PRE>In order to look ahead one character, a macro called <TT>nch</TT> has been
defined. This is the preferred method for looking ahead. In order to look for
</BODY></HTML>