📄 chapter 3 lexical analysis.htm
字号:
{"for", forsy}, {"to", tosy},
{"by", bysy},
{"while", whilesy}, {"until", untilsy},
{"do", dosy}, {"loop", loopsy},
{"continue", continuesy}, {"break", breaksy},
{"select", selectsy}, {"from", fromsy},
{"case", casesy},
{"try", trysy}, {"catch", catchsy},
{"finally", finallysy}, {"throw", throwsy},
{"read", readsy}, {"write", writesy},
{"istream", istreamsy}, {"ostream", ostreamsy},
{"tron", tronsy}, {"troff", troffsy},
{"func", funcsy}, {"proc", procsy},
{"return", returnsy}, {"asm", asmsy},
{"", badsy}
};
<B>Listing {SALKLST}.</B> A string list of all the SAL keywords. This is
a portion of the listing from SAL-Scanner.CPP.
for(i=0; kwdsy[i].ksy!=badsy; i++) {
if(!strcmp(token.data.id, kwdsy[i].key)) {
token.sy=kwdsy[i].ksy;
break;
}
}
<B>Listing {SALKEY}.</B> Recognizing a keyword from the list.
</PRE>
We now have a pretty good idea of what comprises a SAL scanner. We have only
two more very basic things to cover.<P>
<H3>3.3.4 Whitespace</H3><!-------------------------------------------------------------------------------->
Whitespace is ignored by the compiler, for the most part. However, it does play
a subtle part as a separater. In some older languages, notably BASIC and Fortran,
whenever a keyword appeared, it was recognized regardless of its proximity to the
neighboring token. Reading a BASIC program, one might see lines like
<PRE> 1250 FORI=1TO100:PRINT"Hello":NEXTI
</PRE>
All modern languages of today make some subtle use of whitespace as a delimiter,
and such monstrosities are seldom seen anymore.<P>
Whitespace characters are listed in table {WHITSP}.
<PRE> ASCII Esc Name Description
========================================================================
9 \t Tab character Usually 8 spaces.
10 \n Linefeed Does not bring the cursor to the left.
13 \r Carriage Return Brings the cursor to the left, same line.
32 ' ' Blank Space A space character.
<B>Table {WHITSP}.</B>
</PRE>
There may be others, but they are insignificant.
<H3>3.3.5 Comments</H3><!-------------------------------------------------------------------------------->
In addition to whitespace, the scanner also has to deal with comments. One might
think that it is easy enough to just eat tokens until a terminating comment is
found. However, in SAL, there are two concerns that need to be addressed. In
short, comments need to be <I>correctly ignored</I> by the scanner.
<OL>
<LI>Multiline comments can be nested. When skipping over comments, the
scanner needs to keep track of the level of nesting.
<LI>Whenever a single line comment appears on the same
line as the starter or ender for a multiline comment, the issue of precidence
arises. This will be discussed shourly
</LI></OL>
For the most part both styles of comments in SAL look and behave similarly to
comments in C++ and Java.<P>
<MENU>
<B>Single Line Comments.</B> Single line comments take on the form of a
double-slash, "<TT>//</TT>", and behave identically to single line comments in
C++. Everything appearing on a line after a "<TT>//</TT>" is ignored.<P>
<PRE> x:= 6; // Assign 6 to x.
</PRE>
<B>Multiline Comments.</B> These come directly from C, with one noted exception.
<PRE> /* some comment...
that spans multiple lines...*/
</PRE>
These can be nested within each other.<P>
<PRE> /* this is
/* A nested multiple line comment */
*/
</PRE>
Comments can be nested, and programmers have the ability to comment out
large sections of code without having to worry about whether or not
there are other "<TT>/*...*/</TT>" pairs within a section.<P></P>
</MENU>
Since there are two styles of commenting, the subject of priority comes up whenever
the two are used together. This issue is solved by letting the first comment take
priority. Here is an example of a single line comment inside a multiline comment,
all on the same line. In this case the double slash is ignored.
<PRE> /* some comment // <-- ignore this */
</PRE>
Here is a multiline comment that spans three lines. The last line has an imbedded
single line comment.<PRE> /*
* A comment
// <-- Ignore this */
</PRE>
In each of these cases, the scanner is in the mode of scanning multiline comments.
In each of these cases, the double slashes are ignored. Although the double
slashes are ignored, they still need to be recognized. Here is an example:
<PRE> /* comment out this section...
for i:=1 to 10 do //*** Initialize the table
table[i]:= 0;
loop
*/
</PRE>
Notice the comment on the for loop. It is a double slash followed by three stars.
If the compiler just ignored everything and only payed attention to multiline
starters and enders, it might mistakenly interpret the comment on the for loop. The
second slash preceeds a star, and looks like a starter for another multiline comment.
This listing shows the correct and incorrect way of handling this (very subtle) bug.
<PRE> <B>INCORRECT</B>
<B>/*</B> comment out this section...
for i:=1 to 10 do /<B>/*</B>** Initialize the table
table[i]:= 0;
loop
<B>*/</B>
<B>CORRECT</B>
<B>/*</B> comment out this section...
for i:=1 to 10 do <B>//</B>*** Initialize the table
table[i]:= 0;
loop
<B>*/</B>
</PRE>
On the other hand, there are times when single line comments take precidence. When
this happens, it can be mess up a multiline comment.
<PRE> // /*
* This comment is illegal
*/
</PRE>
This causes a problem because the <TT>/*</TT> gets commented out by the preceeding
<TT>//</TT>. This is the only case where the combination of the two styles is
illegal. In this case, both the starter and ender must appear on the same line.<P>
<PRE> // /* This is okay */
</PRE>
The instructor may provide utility for the scanner called <TT>Comment()</TT>. It
takes no parameters and returns nothing. It is used in <TT>GetNextToken()</TT> like so:
<PRE> if(ch=='/' && (nch=='*' || nch=='/'))
Comment();
</PRE>
If <TT>Comment()</TT> is not provided, its implementation will be left as an
exercise for the student.
<P>
<H2>3.4 Writing a SAL Scanner</H2><!-------------------------------------------------------------------------------->
We are now ready to fully implement a scanner in SAL. We will not have to do the
entire thing, since this course is intended to fit into one semester; SAL is a
large language. There are a number of utility procedures designed to help the
student along.<P>
<H3>3.4.1 Lookahead buffer.</H3><!-------------------------------------------------------------------------------->
This is taken care of for you in NextToken, but you should still know about this.
One of the many things that the scanner has to work with is the token queue. The
scanner has access to it through two functions.
<MENU>
<B><TT>UseQueue()</TT></B> This function returns a Boolean result, and is true
when the scanner should look in the queue for the next token, and is false when
it should not. There are some complex subtelties as to when the scanner should
really go to the queue, and this function hides those.<P>
<PRE> bool UseQueue()
</PRE>
<B><TT>DequeueNexttoken()</TT></B> This function will dequeue the next token,
and copy it into the global variable, <TT>token</TT>. It should only be used
when <TT>UseQueue()</TT> returns a true.
<PRE> void DequeueNextToken()
</PRE>
</MENU>
These two functions should be used like so:
<PRE> //*** Check look-ahead queue first
if(UseQueue()) {
DequeueNextToken();
return;
}
</PRE>
Using these two functions, <TT>GetNextToken()</TT> never has to know the details of
the token queue. We know from having taken a course in computational theory that
parsers often like to look ahead. Generally, a parser should be able to look
ahead any number of tokens, though in practice it would only need to look ahead
just a few. It is the Scanner's job to provide this look-ahead service, to not
loose track of any peeked-at tokens, and to continue to deliver all tokens to the
parser in the proper order each time it calls for the next token.<P>
This is done by having the scanner queue up tokens as a look-ahead is called for.
When the parser calls for the next token, the scanner should first check the queue
to see if it contains any pre-scanned tokens. If it does, it removes one of them
from the front of the queue and returns it to the parser. Otherwise, it scans
from the input stream, as normal. As the parser calls to look ahead, the scanner
will go to the input stream, scan out the next token, enqueue it, and return it to
the parser. In this manner, the parser can look ahead an arbitrary distance
before proceeding.<P>
One significant issue to this process is that any call to get the next token
resets the look-ahead pointer. Consider figure {LKAHD}.
<CENTER><IMG src="Chapter 3 Lexical Analysis.files/LKAHD.gif"></CENTER>
The current token being <I>yellow</I>, the parser looks ahead and finds that the
next three tokens are <I>red</I>, <I>green</I>, and <I>blue</I>. The parser then
calls for the next token. <I>Yellow</I> is discarded, and replaced by <I>red</I>.
The look ahead pointer is reset, and successive calls to look ahead will yield
<I>green</I>, <I>blue</I>, and <I>white</I>. Notice also that nothing is appended
to the token queue by the call to get the next token. That will not happen
unless the scanner it told to look ahead another two consecutive times, beyond
<I>blue</I>.<P>
<H3>3.4.2 Internal helper procedures.</H3><!-------------------------------------------------------------------------------->
<B><TT>NextToken()</TT></B> This function is provided as a wrapper to the GetNextToken() function.
It is used for Token queue management and other functions in the compiler. When you write the parser, you will
be calling this function, instead of GetNextToken.
<PRE> void NextToken(void)
</PRE>
<B><TT>EscapeCh()</TT></B> This function is provided to facilitate the conversion
of escape sequences. This is a common task when converting string and character
constants.<P>
<PRE> char EscapeCh(void)
</PRE>
<B><TT>Comment()</TT></B> This function will scan over comments. As soon as the
scanner detects a <TT>/*</TT> or a <TT>//</TT>, we can call this functi
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -