📄 chapter 3 lexical analysis.htm
字号:
the <I>n</I>th character in the buffer, use
<TT>sourceline[CurrCol+<I>n</I>]</TT>. With the current implementation, it is
safe to look only two characters ahead. Incrementing <TT>CurrCol</TT> is
equivalent to advancing to the next character. <TT>CurrCol</TT> should not be
incremented beyond <TT>LineLen</TT>.
<P>The last character in the buffer is found at <TT>sourceline[LineLen]</TT>.
The only way to force the input interface to get the next line of source is to
set <TT>CurrCol</TT> to <TT>LineLen</TT> and then call <TT>NextCh()</TT>.
<P>
<MENU></MENU>The buffer is already properly managed by <TT>NextCh()</TT>, and
should never be touched unless absolutely necessary. The function
<TT>NextCh()</TT>manages all the features of the input interface, including the
following:
<OL>
<LI>Buffering the input and providing a continuous stream of characters to the
scanner,
<LI>Providing a line-by-line printout of the source program. This includes
line numbers and the PC offset of the current line,
<LI>Provide facilities for formatting error message output, particularly the
proper positioning of the error on the line, so the user can see where an
error occurs on a line, and not just the line where an error occured. </LI></OL>
<P>
<H3>3.2.2 Token Representation</H3><!-------------------------------------------------------------------------------->A
token is a basic symbol within a language. As an analogy, tokens are to a
computer language as words are to any human language. The process of scanning is
recognizing these basic symbols as they appear in the input stream, and building
an internal structure that holds all their data (if there is any)
<P>Each token is represented by a single enumerated value. There are at present,
97 unique tokens within the SAL language. Most of these are symbols (delimeters)
and operators. They are listed in listing {SALTOK}. <PRE> typedef enum {
badsy = -1,
//*** Factor types
identsy, intconsy, realconsy, charconsy, stringsy,
//*** Operators (in order of precidence)
expsy, // Arithmetic:
compsy, notsy,
timessy, rdivsy, divsy, modsy,
plussy, minussy,
bshrsy, bsshrsy, bshlsy, brorsy, brolsy, // Bitwise
bandsy,
borsy, bxorsy,
eqlsy, neqsy, gtrsy, lsssy, geqsy, leqsy, // Boolean/comparison
andsy,
orsy,
becomessy, // Assignment
//*** Symbols and delimiters
lparentsy, rparentsy, lbracksy, rbracksy, commasy,
semicolonsy, periodsy, colonsy, scopesy, atsy,
//*** Declaration keywords
constsy, typesy,issy, varsy, pointersy,
arraysy,ofsy,
recordsy,
classsy,extendssy,virtualsy,sharedsy, publicsy,privatesy,
constructorsy,destructorsy, supersy,
//*** import/export keywords
importsy, exportsy,
//*** Statement keywords
ifsy, thensy, elseifsy,elsesy,
forsy,tosy,bysy,
untilsy,whilesy,dosy,loopsy, continuesy,breaksy,
selectsy,fromsy,casesy,
trysy,catchsy,finallysy,throwsy,
tronsy,troffsy, readsy,writesy, returnsy,
//*** Function/module keywords
procsy, funcsy, istreamsy,ostreamsy, netinsy, netoutsy,
beginsy, endsy, programsy, librarysy,
eopsy
} Symbol;
<B>Listing {SALTOK}</B>
</PRE>At this level we will not concern ourselves with all 97 of these. The SAL
language has developed far beyond the limits of one or two semesters' work.
<P>The current token is usually stored in a structure of some sort. In an
object-oriented design it could be retrieved through an access method or as a
member of an instance of a scanner object. In the SAL implementation that we are
developing in this book, the current token exists as a global instance of
<TT>struct Token</TT>. The structure of a token is shown in listing
{TKSTRCT} <PRE>//*** token class *********************
struct Token {
Symbol sy; // last symbol read by GetNextToken
union {
Alfa id; // actual identifier from GetNextToken
char ch;
QWORD num; // number values are ALWAYS unsigned
DOUBLE rnum; // Same goes for real numbers
struct {
char *txt; // last char or string from GetNextToken
WORD length; // string length
} str;
} data;
CHAR base;
long number;
Token(Token &other); // copy constructor
Token(); // default constructor
};
<B>Listing {TKSTRCT}</B>
</PRE>Some tokens require additional data. For instance, it does little good if
the current token is <TT>stringsy</TT> and you do not know anything about the
string itself. The two classes of tokens that have an associated value are
constants and identifiers.
<P>Aside from the two constructors, this structure contains a host of fields.
Let's go through these one by one.
<P>
<MENU><B><TT>sy</TT></B> The enum value of the current token is kept here. It
is one of the values from <TT>enum symbol</TT> from listing {SALTOK}
<P><B><TT>union data.</TT></B> This field holds the extra data for the current
token. It has several fields. Which one is valid depends upon the value of
<TT>sy</TT>.
<MENU><B><TT>alfa id</TT></B>. An <TT>alfa</TT> is a character array of 16
elements. If <TT>sy</TT> is equal to <TT>identsy</TT>, this field holds the
name of the current identifier.<BR><B><TT>char ch</TT></B> If <TT>sy</TT> is
equal to <TT>charconsy</TT>, then this field is valid, and contains the
character retrieved by the scanner.<BR><B><TT>QWORD num</TT></B> This field
is valid when <TT>sy</TT> is equal to <TT>intconsy</TT>, and it contains the
number that was recognized by the scanner. This structure holds a 64 bit
integer. The number held here is <I>unsigned</I>.<BR><B><TT>DOUBLE
rnum</TT></B> When <TT>sy</TT> is equal to <TT>realconsy</TT>, this field
contains the value of the real number that was scanned in. The number held
here is <I>unsigned</I>.<BR><B><TT>struct str</TT></B> This struct contains
two items. The first is a pointer to the string (within the string table),
and the second is the length of the string. It is valid when <TT>sy</TT>
equals <TT>stringsy</TT><BR></MENU>
<P><B><TT>CHAR base</TT></B> This field works only when <TT>sy</TT> equals
<TT>intconsy</TT> It holds the base of the number scanned in, and it is used
in resolving the difference between signed and unsigned integers when there is
an ambiguity.
<P><B><TT>long number</TT></B> This field increments with each new token. Its
pupose is to let the compiler know when it has gotten stuck in a loop during
error-detection.
<P></P></MENU>You will notice that the fields <TT>num</TT> and <TT>rnum</TT>.
contain an unsigned number. The scanner does not try to detect sign in a number,
like <PRE> -1.5
</PRE>The scanner should interpret this line as two separate tokens, which would
be <TT>minussy</TT>, and <TT>reealsy</TT> with a value of 1.5.
<P>
<H2>3.3 Scanning SAL Tokens</H2><!-------------------------------------------------------------------------------->By
now, we can begin to piece together the entire puzzle and the method of
implementation can begin to take form. We know that FSAs are useful for
recognizing a language's various tokens from the input stream. We know that the
accept state outputs one of these tokens once it is recognized, and the output
is in the form of a <TT>token</TT> struct, from listing {TKSTRCT}. As far as
code goes, we have seen the SAL input interface, which consists of <TT>ch</TT>,
<TT>nch</TT>, and <TT>NextCh()</TT>. We have also seen the <TT>Token</TT>
structure, and have a basic understanding of what its fields are used for. We
are now ready to discuss writing a scanner.
<P>There are four different classes of tokens. Each class is distinguished by
the type of additional data that it carries, if there is any.
<OL>
<LI><B>Symbols and Delimiters.</B> These are the simplest class of tokens.
They consist of non-alphanumeric characters, and they are between one to three
characters in length. These tokens are scattered throughout the enum in
listing {SALTOK}.
<LI><B>Constants.</B> There are several types of these. There are numeric
constants--both integer and real, there are character constants, and there are
strings. The type of constants that we are refering to are <I>literal</I>
constants, like 3.1415 or 1000. The other type of constants, like <TT>pi</TT>
and <TT>ARRAYMAX</TT> are symbolic, and are dealt with at the symbol table
level. Some other examples of literal constants are <TT>123</TT>,
<TT>3.1415</TT>, <TT>'A'</TT>, and <TT>"Hello, world!"</TT>. The symbols
<TT>intconsy</TT>, <TT>realconsy</TT>, <TT>charconsy</TT>, and
<TT>stringsy</TT> correspond to these four types, respectively.
<LI><B>Identifiers.</B> Identifiers in SAL are like those of C. They can
consist of (currently up to 16) letters, numbers, and an the underscore
character. They cannot begin with a number. Some SAL identifiers are
<TT>abc</TT>, <TT>_ABC</TT>, and <TT>abc123</TT>. The string <TT>123abc</TT>
would be handled by the scanner as a literal constant <TT>123</TT> and an
identifier <TT>abc</TT>. Identifiers are represented by the symbol,
<TT>identsy</TT> of listing {SALTOK}
<LI><B>Keywords.</B> Keywords are a special class of identifier, which has a
specific meaning within the language, itself. Some examples are
<TT>program</TT>, <TT>if</TT>, and <TT>while</TT>. Some items recognized by
the compiler like <TT>int</TT>, <TT>card</TT>, <TT>real</TT>, and
<TT>char</TT> are not keywords , but are <I>intrinsic types</I>. Though
technically keywords, they are not treated as such; they are treated like
identifiers. This will make sense later on. </LI></OL>
<P>In designing each section of code that recognizes tokens, it is important to
remember that, regardless of the input alphabet for a given FSA, we are working
with an input stream that contains characters from the entire ASCII set. All
portions of code for each of the four classes of tokens must adhere to two
rules:
<OL>
<LI>FSAs recognize characters until they find a character that is not within
their input alphabet, then they stop.
<LI>FSAs signal an error iff they have not stopped in an accept state.
</LI></OL>The reason for these two rules, in particular the second is that the
offending character might be a legitimate part of the next token.
<P>
<H3>3.3.1 Symbols and Delimeters</H3><!-------------------------------------------------------------------------------->There
are two methods to scanning symbols and delimeters. The first involves the use
of a switch statement, and the second involves the use of string tables. For
demonstration, let us suppose that we need to scan for all of the comparison
operators, i.e., >, <, >=, <=, =, and <>.
<P>The first method involves a switch statement. This method works well for FSAs
that take on a tree shape. An FSA to recognize these tokens is presented in
figure {CMPFSA}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/CMPFSA.gif"></CENTER>For this
FSA, we can easily write an acceptor: <PRE> switch (ch) {
case '>':
token.sy = gtrsy;
if (nch == '=')
token.sy = geqsy;
break;
case '<':
token.sy = lsssy;
if (nch == '=')
token.sy = leqsy;
else if (nch == '>')
token.sy = neqsy;
break;
case '=':
token.sy = eqlsy;
break;
default:
token.sy = badsy;
ErrorMsg("Illegal symbol");;
break;
}
return;
<B>Listing {CMPC1}</B>
</PRE>This is perhaps the easiest of all methods for coding an acceptor for
operators and delimiters. The process is quite straightforward. Each transition
is a case or an if. When a decision point is reached, the proper field of
<TT>token</TT> is filled.
<P>Another method, which is perhaps a little more involved makes use of string
tables, and compares the next and current characters to items in a table of
strings. Here are the tables for our example.
<P><PRE> struct _sym1sy {
char key;
symbol ksy;
};
struct _sym2sy {
char key[3];
symbol ksy;
};
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -