For instance, '6' equals 6. 'D' = 13 (in hex).
<LI>Add the character's binary equivalent to our final
number.
</LI></OL>
</LI></OL>
The process continues until there are no more characters, or the final number
overflows. When the integer's base happens to be a power of two, we can employ a
faster method that uses left shifts instead of multiplies and ORs instead of
adds. Listing {ADTOI} demonstrates the conversion process for decimal integers,
and listing {AOTOI} demonstrates the same process for octal integers.<P>
<PRE>   //*** DECIMAL
   lim = MAXUQWORD/10;
   token.data.num = 0;
   while (buff[i]) {
      if (token.data.num > lim) {
         Error(ER_NUM2BIG);      // The multiply will be too big
         token.data.num = 0;
         break;
      }
      else if ((buff[i]-'0') > (MAXUQWORD - (token.data.num*10))) {
         Error(ER_NUM2BIG);      // The add will be too big
         token.data.num = 0;
         break;
      }
      token.data.num = (token.data.num*10) + (buff[i++]-'0');
   }
   token.base = 10;
<B>Listing {ADTOI}</B>

   //*** OCTAL
   lim = MAXUQWORD >> 3;
   while (buff[i]) {
      if (token.data.num > lim) {
         Error(ER_NUM2BIG);      // The shift will be too big
         token.data.num = 0;
         break;
      }
      token.data.num = (token.data.num<<3) | (buff[i++]-'0');
   }
   token.base = 8;
<B>Listing {AOTOI}</B> Notice the use of shifts instead of multiplies, and
ORs instead of adds.
</PRE>
Notice that the number is placed into <TT>token.data.num</TT> as it is computed.
The variable <TT>lim</TT> is used to determine whether the next shift or multiply
will overflow the number. If the number overflows, an error is signaled.<P>
In the case of integers, we cannot make use of the facilities provided in the C
language, since SAL has the ability to work with binary constants and quadword
sized integers (64 bits). Hence, we do the conversion ourselves.<P>
Integers in SAL can be written in binary, octal, decimal, or hexadecimal
form. The base is indicated by the prefix '0b', '0o',
'0d', or '0x' respectively. In the case of decimal integers, the '0d' is optional.
All hexadecimal digits must be in upper case. The numbers 0b11111111, 0o377, 255, 0d255, and
0xFF all represent the same quantity. Integers and cardinals (unsigned integers)
in SAL default to 32 bits, but are stored in <TT>token</TT> as 64 bits.<P><!-------------------------------------------------------------------------------->
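Selecting the base from the prefix can be sketched as follows. This is an illustration only: the helper name <TT>literal_base</TT> and its <TT>skip</TT> parameter are hypothetical, and the sketch assumes the prefix letters are accepted in lower case only, as they are written above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the SAL sources): inspects the start of
   an integer literal and returns its base (2, 8, 10, or 16). *skip receives
   the number of prefix characters to step over before the digits. */
int literal_base(const char *text, int *skip)
{
    assert(text != NULL && skip != NULL);
    *skip = 0;
    if (text[0] == '0') {
        switch (text[1]) {
        case 'b': *skip = 2; return 2;
        case 'o': *skip = 2; return 8;
        case 'd': *skip = 2; return 10;
        case 'x': *skip = 2; return 16;
        default:  break;            /* a plain leading zero */
        }
    }
    return 10;  /* no prefix: the '0d' is optional, so plain digits are decimal */
}
```

The caller would then run the multiply-add or shift-OR loop for the returned base, starting at <TT>text + skip</TT>.<P>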
<B>Converting ASCII reals to binary.</B> This case is much simpler. We can simply
make use of the C function <TT>atof()</TT>. See listing {ATOF}.
<PRE> //*** FLOAT
token.sy = realconsy;
token.data.rnum = atof(buff);
token.base=10;
<B>Listing {ATOF}</B>
</PRE>
SAL is very flexible about the way real numbers can be represented. They can
take on many forms.
<UL>
<LI>A period followed by digits: .25, .5, .3333333
<LI>Digits followed by a period: 1., 0.
<LI>Digits, period, and digits: 1.0, 0.25, 0.0
<LI>Scientific notation: digits of one of the previous
formats, or a decimal integer format followed by an E or an e, followed by an
optional sign, and then more digits. 1e2, 1.0E+2 are two examples of the
number 100.0 written in scientific format.
</LI></UL>
A regular expression would be something like this:
<PRE> . digit{digit}
digit{digit} . [ digit{digit} ]
digit{digit} [ . digit{digit} ] E|e [+|-] digit{digit}
. digit{digit} E|e [+|-] digit{digit}
</PRE>
Real numbers in SAL are only in base 10. The use of the b, o, d, and h suffixes
is illegal for reals, and they should not be recognized. The string 1.1b should be
interpreted as a real (1.1) followed by an identifier (b). Real numbers in SAL default
to double precision (64 bits).<P>
Because real numbers may begin with a period, there arises a possible ambiguity
in distinguishing between real numbers and the period symbol, i.e., the period can
appear by itself as a token. SAL handles this by looking ahead to see whether or
not the next character is a digit. If not, then the current token is set to a
period, and the numeric portion of the scanner is never entered.<P>
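The forms listed earlier, together with the lone-period case, can be sketched as a function that measures how many characters of the input form a real constant. The name <TT>real_length</TT> is hypothetical, and the sketch works on a string rather than on the scanner's <TT>ch</TT> and <TT>NextCh()</TT>. It follows the prose description, so it assumes an exponent may also follow the "digits, period" form (as in 1.e2).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Counts the run of decimal digits starting at s. */
static size_t digit_run(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Hypothetical helper (not part of the SAL sources): returns the length of
   the real constant at the start of s, or 0 if s does not start with one.
   A lone period and a plain integer both yield 0. */
size_t real_length(const char *s)
{
    size_t i, whole, frac = 0;
    int has_point = 0;

    assert(s != NULL);
    whole = digit_run(s);
    i = whole;
    if (s[i] == '.') {
        has_point = 1;
        frac = digit_run(s + i + 1);
        i += 1 + frac;
    }
    if (whole == 0 && frac == 0)
        return 0;                   /* a lone period, or no number at all */

    /* Optional exponent: E|e [+|-] digit{digit} */
    if (s[i] == 'E' || s[i] == 'e') {
        size_t j = i + 1;
        size_t exp_digits;
        if (s[j] == '+' || s[j] == '-')
            j++;
        exp_digits = digit_run(s + j);
        if (exp_digits > 0)
            return j + exp_digits;
    }
    return has_point ? i : 0;       /* digits alone are an integer, not a real */
}
```

For the string 1.1b the function stops after 1.1 and leaves the b to be scanned as an identifier.<P>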
Here are the state machines representing integers and real numbers in SAL:
<CENTER> </CENTER><P><IMG alt="" src="Chapter 3 Lexical Analysis.files/specs-Numbers.gif">
{SPECS-INTEGER/REAL}
Notice that the state machine for real numbers also recognizes a period. This was
included for clarity.<P>
<B>3.3.2.2 String and Character Constants</B><BR><!-------------------------------------------------------------------------------->
A character in SAL is enclosed in single quotes, like in C and C++. In order to
represent non-printable or non-typable characters, SAL has escape characters
similar to those of C/C++. There are some subtle differences, however. These
rules apply:
<OL>
<LI>Standard escape sequences like <TT>\n</TT>, <TT>\t</TT>, and <TT>\"</TT> all
behave like in C/C++. SAL recognizes the same set of these as C and C++.
Consult the <A href="http://topaz.cs.byu.edu/SAL/SAL_Tutorial_Expressions.html#character">SAL documentation</A> for more details.
<LI>Characters may be represented in hexadecimal by a <TT>\x</TT> followed by
<I>two</I> hexadecimal digits. The number must be in the range of 00 to FF.
Digits from A to F must be in upper case.
<LI>Characters may be represented in decimal by a <TT>\d</TT> followed by
<I>three</I> decimal digits. The number must be in the range of 000 to 255.
<LI>Characters may be represented in octal by a <TT>\</TT> followed by
<I>three</I> octal digits. The number must be in the range of 000 to 377.
</LI></OL>
Character recognition is simple, and involves identifying a single quote
from the input stream, reading an additional character, and then verifying another
single quote from the input stream. In the case of an escape sequence, the items
in the above list can be implemented using an FSA. If necessary, a multiply-add
algorithm can be used to convert numeric escape sequences.<P>
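A multiply-add conversion of the three numeric escape forms might look like the sketch below. It is not the <TT>EscapeCh()</TT> routine from <TT>SAL-Scanner.CPP</TT>: the name <TT>numeric_escape</TT>, its string-based interface, and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Value of digit c in the given base, or -1 if c is not a valid digit.
   Hexadecimal letters must be upper case. */
static int digit_value(char c, int base)
{
    int v;
    if (c >= '0' && c <= '9')
        v = c - '0';
    else if (c >= 'A' && c <= 'F')
        v = c - 'A' + 10;
    else
        return -1;
    return v < base ? v : -1;
}

/* Hypothetical helper (not part of the SAL sources): s points just past the
   backslash. Handles xHH, dDDD, and OOO. Returns the number of characters
   consumed, or 0 if the escape is malformed or its value exceeds 255. */
size_t numeric_escape(const char *s, int *value)
{
    int base = 8, count = 3, num = 0, i;
    size_t start = 0;

    assert(s != NULL && value != NULL);
    if (s[0] == 'x')      { base = 16; count = 2; start = 1; }
    else if (s[0] == 'd') { base = 10; count = 3; start = 1; }

    for (i = 0; i < count; i++) {
        int d = digit_value(s[start + i], base);
        if (d < 0)
            return 0;               /* too few digits, or a bad digit */
        num = num * base + d;       /* multiply-add conversion */
    }
    if (num > 255)
        return 0;                   /* outside the range of a character */
    *value = num;
    return start + (size_t)count;
}
```

The sequences x41, d065, and 101 all decode to 65, the ASCII code for 'A'.<P>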
Characters are stored in a token as follows:
<PRE> //*** CHARACTER
token.sy = charconsy;
token.data.ch = ch;
<B>Listing {SALCHAR}</B>
</PRE>
Strings are stored a little differently in SAL. They are read from the input
stream into a temporary buffer. A temporary buffer exists in <TT>SAL-Scanner.CPP</TT>:
<PRE> #define TMPBUFFMAX 255 // max buffer size
static CHAR tmpbuff[TMPBUFFMAX+1]; // Temp buffer used to store parsed numbers and strings
</PRE>
Probably the most straightforward way to recognize a string is to detect a double
quote, and then do the following:
<PRE> ULONG i = stringaddr; // a counter variable
//*** Put the string into tmpbuff
for(i = 0; ch!='"'; i++) {
if(i < TMPBUFFMAX)
tmpbuff[i] = ch;
NextCh();
}
NextCh(); // eat the terminating "
if(i < TMPBUFFMAX)
tmpbuff[i] = 0; // null-terminate the string
else
ErrorMsg("string is too long"); // or detect a string that is too long
//*** Set up token data
token.sy = stringsy;
token.data.str.txt = tmpbuff;
token.data.str.length = i;
<B>Listing {SALSTR}</B> General method for recognizing strings in SAL
</PRE>
This algorithm does not take into account some errors like unterminated strings,
and it also does not translate escape sequences. Both of these things must be
accounted for in the actual implementation.<P>
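One way to catch unterminated strings is sketched below. It is an illustration, not the SAL implementation: the function <TT>scan_string</TT>, the status codes, and the string-based interface are hypothetical, and the sketch assumes a string may not run past the end of a line. Escape sequences are still not translated.

```c
#include <assert.h>
#include <stddef.h>

enum { STR_OK, STR_UNTERMINATED, STR_TOO_LONG };

/* Hypothetical helper (not part of the SAL sources): src points just past
   the opening double quote. Copies at most max characters into out, which
   must have room for max+1, and always null-terminates it. */
int scan_string(const char *src, char *out, size_t max, size_t *length)
{
    size_t i = 0;
    int status = STR_OK;

    assert(src != NULL && out != NULL && length != NULL);
    while (*src != '"') {
        if (*src == '\0' || *src == '\n') {
            status = STR_UNTERMINATED;  /* input or line ended before the quote */
            break;
        }
        if (i < max)
            out[i++] = *src;
        else
            status = STR_TOO_LONG;      /* keep scanning, but drop the excess */
        src++;
    }
    out[i] = '\0';
    *length = i;                        /* number of characters actually stored */
    return status;
}
```

Continuing to scan after the buffer fills lets the scanner resume after the closing quote instead of reporting the rest of the string as further errors.<P>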
Here are two state machines that represent characters and strings in SAL. They
each call a third state machine that interprets escape sequences.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-Char.gif"></CENTER><P>
Notice that this FSA can recognize two quotes with nothing in between. This is
a null character. Its ASCII value is zero.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-String.gif"></CENTER><P>
This state machine recognizes escape sequences. For numeric escape sequences (e.g.,
hex or decimal), the state machine only determines whether or not the correct
number of digits is present. It does not check to see if the number is
within the appropriate range. In <TT>SAL-Scanner.CPP</TT> there is a function
to convert escape sequences called <TT>EscapeCh()</TT>. It works with <TT>ch</TT>
and <TT>NextCh()</TT> and returns the equivalent character.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-EscSeq.gif"></CENTER><P>
<H3>3.3.3 Identifiers and Keywords</H3><!-------------------------------------------------------------------------------->
Keywords and identifiers are similar, and we can merge the operation that
recognizes the two types of token into one routine. The strategy is to assume that
if a token begins with a letter or an underscore, it is an identifier. If the
identifier then happens to match one of the items from a list of known keywords,
then it is a keyword.<P>
Identifiers in SAL can be composed of (currently up to 16) letters in either upper
or lower case, digits, and the underscore. Identifiers may not begin with a
digit, however. We saw an FSA to recognize identifiers in figure {IDFSA} in
section 3.1.3. Detecting an identifier is simple:
<PRE>   if (isalpha(ch) || ch=='_') {      // First char is alpha or _
      i = 0;
      do {                             // following 16 are alpha-numeric or _
         if (i <= ALFAMAX)
            token.data.id[i++] = ch;
         NextCh();
      } while (isalnum(ch) || ch=='_');
      token.data.id[i] = 0;
   }
<B>Listing {SALID}</B>
</PRE>
In order to recognize keywords, we use a list of strings and matching token
values, much like what was discussed in section 3.3.1. Listing {SALKLST} shows
a string list for all of the SAL keywords, and listing {SALKEY} shows a method for
searching the list to see if the latest identifier is really a keyword.<P>
<PRE> typedef struct tagtokensym {
Alfa key;
Symbol ksy;
} tokensym;
static tokensym kwdsy[] = {
{"div", divsy}, {"mod", modsy},
{"and", andsy}, {"or", orsy},
{"not", notsy},
{"const", constsy}, {"type", typesy},
{"is", issy}, {"var", varsy},
{"array", arraysy}, {"of", ofsy},
{"record", recordsy}, {"class", classsy},
{"extends", extendssy}, {"virtual", virtualsy},
{"shared", sharedsy}, {"public", publicsy},
{"private", privatesy}, {"constructor", constructorsy},
{"destructor", destructorsy}, {"super", supersy},
{"program", programsy}, {"library", librarysy},
{"import", importsy}, {"export", exportsy},
{"begin", beginsy}, {"end", endsy},
{"if", ifsy}, {"then", thensy},
{"elseif", elseifsy}, {"else", elsesy},