📄 chapter 3 lexical analysis.htm
字号:
static _sym1sy sym1sy[] = {
{'=
',eqlsy}, {'>',
gtrsy},{'<', lsssy},
{' ', badsy}
};
static _sym2sy sym2sy[] =
{{"<>", neqsy},
{">=", geqsy},
{"<=", leqsy},
{"", badsy}
};
</PRE>Comparison begins with the two-character operators, and if there is no
match, proceeds with the single-character operators.
<P><PRE> //*** Try the 2-char symbols.
for(i=0; sym2sy[i].ksy!=badsy; i++) {
if(ch == sym2sy[i].key[0] && nch sym2sy[i].key[1]) {
NextCh(); // advance a character
NextCh();
token.sy = sym2sy[i].ksy;
return;
}
}
//*** Else, try the 1-char symbols.
for(i=0; sym1sy[i].ksy!=badsy; i++) {
if(ch == sym1sy[i].key) {
NextCh(); // advance a character
token.sy = sym1sy[i].ksy;
return;
}
}
//*** Illegal character
token.sy = badsy;
ErrorMsg("Illegal symbol");;
return;
</PRE>The key to making this method work is to make sure we compare to the
longest strings first, then the shorter ones. Notice the token <TT>badsy</TT> is
used to signal the end of each list. Iterating through the list, we compare the
strings one by one to the characters in the input stream until we reach the end
of the list, or a match is found.
<P>Though not as fast, this method does have two other advantages. First is that
it is far easier to maintain. If we need to add symbols or delimeters to the
compiler later on we just make new entries in the table. The code is so general
that we can make such changes in the compiler without it meaning extra work.
Second is that the code is smaller. The SAL language has many symbols and
delimeters. The first method would result in a huge, bloated switch statement.
<P>The operators and delimeters in SAL are presented in the table {SYDEL} below:
<PRE> Token Sym Description
=====================================================
compsy ~ Bitwise not (1's compliment)
timessy * Multiplication
rdivsy / Division
plussy + Addition
minussy - Subrtaction
bandsy & Bitwise AND
borsy | Bitwise OR
bxorsy % Bitwise XOR
eqlsy = Equal to
gtrsy > Greater than
lsssy < Less than
pointersy ^ Pointer dereference
periodsy . Period
colonsy : Semicolon
lparentsy ( Left parenthesy
rparentsy ) Right parenthesy
lbracksy [ Left bracket
rbracksy ] Right bracket
commasy , Comma
semicolonsy ; Semicolon
atsy @ Module designator
expsy ** Exponentiation
bshrsy >> Bitwise shift right
bshlsy << Bitwise shift left
brolsy <# Bitwise rotate left
brorsy #> Bitwise rotate right
neqsy <> Not equal to
geqsy >= Greater than or equal to
leqsy <= Less than or equal to
becomessy := Assignment
scopesy :: Scope resolution operator
bsshrsy >>> Signed bitwise shift right
<B>Table {SYDEL}</B> Symbols and delimeters in SAL.
</PRE>Here is a state machine representing each of these.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/specs-Symbol.gif"></CENTER>
<H3>3.3.2 Constants</H3><!-------------------------------------------------------------------------------->The
four types of constants fall into two different categories. Real numbers and
integers fall into one, and strings and characters fall into the other. These
two types are very different both in the way they are scanned and in the way
they are represented.
<P><B>3.3.2.1 Real Numbers and Integers</B><BR><!-------------------------------------------------------------------------------->Scanning
a number actually happens in two steps. The first step involves using te FSA to
recognize a valid number. In this step we also save the current character in a
buffer. This we do in preparation for phase two. If the first step passes, we
know that we have a valid number, and we have it stored in a temporary buffer in
ASCII form.
<P>The next step is to convert the number from ASCII to text. Upon exiting the
FSA, we terminate the buffer with a null character. If the final state is an
accept-state, the data in the buffer is converted into a binary form. Let's
discuss these two steps in greater detail.
<P><B>Scanning.</B> For recognizing real numbers and integers we make use of an
FSA. One way of implementing an FSA is by making use of a state table. A state
table is a table where each row represents a character from the input alphabet,
and each column represents a state. At any point in time, the the current
character from the input stream and the current state can be used to look up a
cell in the state table by row and column. Each cell in the state table contains
the next state. Figure {STTRNS} demonstrates this.
<P><PRE> Current Character: '4'
Current State: start
curr st
|
V
Char | start INT dot REAL
====================================
curr ch--> 0..9 | <B>INT</B> INT REAL REAL
. | Error dot Error Error
<B>Figure {STTRNS}.</B> Looking up the next state based on the current
state and the current character. The next state is INT.
</PRE>A token is recognized when the current state is an accept state, and the
current character is not in the input alphabet or is otherwise invalid. For
figure {STTRNS}, the accept states are in capitals: INT and REAL. An error
occurs when the current state is not an accept state, and the current character
is either invalid for the FSA or not in the input alphabet. Transitions proceed
from state to state until one of these two conditions are met.
<P>The advantage of using state tables is that we have to force ourselves to
consider all possible inputs at all states. This forces us to make sure we
remember all conditions and exceptions that might exist. It is simply easier to
verify correctness. Making a state table requires some thought and concentration
however, and errors in a big state table can be hard to find.
<P>We already saw a table for figure {IRFSA} in figure {STTRNS}. It is shown
again in table {IRTAB}. <PRE> Char start INT dot real
====================================
0..9 INT INT REAL REAL
. Error dot Error Error
<B>Table {IRTAB}.</B> A state table for the FSA in figure {IRFSA}.
</PRE>In order to code up a table we need two things. First, we need a list of
all our states as an enumerated type. See listing {IRENUM}. Notice that we also
added an state for errors. <PRE> typedef enum tagIRState {
IR_START, IR_INT, IR_DOT, IR_REAL, IR_ERROR
} IRState;
<B>Listing {IRENUM}.</B>
</PRE>Next, we need a structure of some sort to hold the state table. A
2-dimensional array would suit the purpose, but in turn would involve a lot of
type-casting since some compilers will not automatically cast an enum to an int.
What we will use is an array of structures that contain two fields. The first is
a character, and the second is an array of next states. It looks like listing
{IRSTRUCT}:
<P><PRE> struct IRtab {
char ch;
IRState nextstate[IR_ERROR];
};
IRtab IRStates[] = {
{'0', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'1', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'2', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'3', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'4', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'5', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'6', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'7', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'8', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'9', {IR_INT, IR_INT, IR_REAL, IR_REAL}},
{'.', {IR_ERROR, IR_DOT, IR_ERROR, IR_ERROR}},
{0}
};
<B>Listing {IRSTRUCT}</B>
</PRE>Coding a state table is quite simple. The main body is a loop, and each
pass through the loop represents a transition from one state to the next. As a
precondition to this piece of code, <TT>ch</TT> already contains the current
character from the input stream. See listing {IRIMPL}. <PRE> //*** Initialize some important variables.
IRState state = IR_START;
int i=0; j=0;
while (state != IR_ERROR) {
//*** Scan the table until the current character matches
for (i=0; IRStates[i].ch!=0 && IRStates[i].ch!=ch; i++);
//*** If the current character is not in the input alphabet
if (IRStates[i].ch == 0) {
if (state != IR_REAL && state != IR_INT)
state = IR_ERROR;
break;
}
else if (IRStates[i].nextstate[state] == IR_ERROR && (state == IR_INT || state == IR_REAL))
break;
//*** Go to the next state
state = IRStates[i].nextstate[state];
//*** Store the current character
buff[j++] = ch;
//*** Get the next character
<!--ch = ->NextCh();
}
buff[j] = 0;
if (state == IR_ERROR)
Error(ER_BADCH);
else if (state == IR_INT) {
//*** Convert data in buff[] to an int
...
}
else if (state == IR_REAL) {
//*** Convert data in buff[] to a real
...
}
<b>Listing {IRIMPL}.</b>
</pre>
The current character from the input stream is compared to the state table in
order to find the right row. If the character is not in the state table and the
current state is not an accept-state, then the current state is set to the
error-state, <tt>IR_ERROR</tt>, and the computer exits the loop. Otherwise if
the current state is an accept state, the loop exits and the current state signals
either a <tt>IR_INT</tt>, or a <tt>IR_REAL</tt>.<p>
If the current character is found in the state table, the next state is looked up,
and the current character is saved to a buffer. The process repeats until the
current character moves the FSA to an error state.<p>
The numeric portion of the SAL scanner utilizes a temporary buffer where the valid
characters for the number are stored. As a character is consumed, it is placed in
this buffer. Upon exiting from the loop, we terminate the buffer with a null
character. If the final state is an accept-state, the data in the buffer is
converted into a binary form. This process will be discussed shortly.<p><!-------------------------------------------------------------------------------->
<B>Converting ASCII integers to binary.</B> The easiest method of performing the
task of converting integers to binary is to employ a leftshift-or (actually it's a
multiply-add) method. The process is straightforward.
<OL>
<LI>Initialize the final number to zero.
<LI>Initialize the buffer position to zero or the left most character or most significant character (Remember not to include the prefix notation
e.g "0x").
<LI><B>while</B> there are characters in the buffer, iterate the buffer
position, traversing in left to right order.
<OL>
<LI>Multiply our final number by its base. In other words if we are
converting to binary, we multiply our final number by two. For octal
its eight, and so on.
<LI>convert the current character to its binary (not ASCII) equivalent.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -