📄 chapter 4 the parserw.htm
字号:
possible, and resolve the correct rule once a differentiating terminal is found.
In the case of stage 2, we don't really know that we want T as opposed to T + E
or T - E simply because the differentiating token ended up being the end of the
string. Had the expression had a trailing "+ x" or a "- x", then we could have
used T + E or T - E, respectively.
<P> This is not intended to make a lot of sense, and in resursive descent
compilers, we use a far simpler method: syntax diagrams. <BR> <BR>
<H2>4.3 Using Syntax Diagrams in Recursive Descent Parsing</H2><!-------------------------------------------------------------------------------->Syntax
diagrams simplify life one hundred percent, because they lend themselves so
easily to being coded. The production rules of the previous section can be
expressed as follows:
<CENTER><IMG src="Chapter 4 The Parserw.files/SYNEXP.gif"></CENTER>Starting at
rule 3 we can trace the recognition of the string "x * (x - x)" in a very
straightforward way. Trace along the paths of the syntax diagrams as we go
along.
<OL>
<LI>E calls T. It has no other choice.
<LI>T calls F.
<LI>At this point, F can make a sure decision. It sees that the first item in
the input string is an <B>x</B>. The <B>x</B> is consumed, and the remaining
string is "* (x - x)". F then returns back to T.
<LI>At this point, T can surrender to E, or eat a terminal. Since the next
character in the string (a *) matches one of its choices, it eats the star,
and then loops, calling F one more time. The input string is now "(x - x)".
<LI>F has an easy choice. The left parenthesis takes it down the lower branch,
and F eats the "(" and then calls E. Notice that in this case E is being
called recursively. The input string is now "x - x)".
<LI>E calls T.
<LI>T calls F (again).
<LI>This instance of F eats an <B>x</B> and then returns back to T. The input
string is now "- x)".
<LI>T does not recognize either a * or a /, and so returns back to E.
<LI>E sees the -, eats it, and then loops back, calling T one last time. We
are left with "x)".
<LI>T calls F.
<LI>F sees the last <B>x</B>, and eats it, and returns. Our string is now ")".
<LI>T sees no matching * or /, and so returns.
<LI>E sees no matching + or -, and so returns.
<LI>We are now back at F at the lower branch where we were left off in step 5.
Having just returned from calling E, the only thing we can do is eat the
remaining ")" in our input string. We are left with an empty string. At this
point we are effectively through, but we are still a couple calls deep. F
returns to T.
<LI>T sees no matching * or /, since this we are now at the end of the string.
It returns to E.
<LI>E sees no matching + or -, since this we are now at the end of the string.
At this point, it returns, and effectively completes the parsing process.
</LI></OL>Since there were no errors (the example <I>was</I> contrived that
way), we have successfully parsed the expression, "x * (x - x)". In the process,
we have generated a neat little parse tree, that looks like this: <PRE> *
/ \
/ \
x -
/ \
/ \
x x</PRE>In
a parse tree, items in the lowest branches have priority. Thus, the minus in the
tree will be processed before the *. Parse trees are useful for many reasons.
One of their uses is for doing optimization. Parse trees are not used in this
text.
<P> Notice that during the parsing process with syntax diagrams there was
no point at which we had to guess. There was no point like in section 4.2 at
which we simply "knew" which rule to use. Syntax diagrams are very useful
because they are so clear and explicit. Because of this they can be used as a
direct template for coding a recursive descent parser for any language. We
discuss this in the next section.
<H2>4.4 Coding a Recursive Descent Parser</H2><!-------------------------------------------------------------------------------->We
will discuss the implementation of a recursive descent parser using syntax
diagrams as a template. We will discuss implementation of a parser for the set
of productions given in section 4.2, and we will add an additional set of rules
in order to make use of another concept introduced in that section: the set
called FIRST. <BR> <BR>
<H3>4.4.1 Using the Set, FIRST</H3><!-------------------------------------------------------------------------------->Recall
that the set, FIRST is the set of all first possible terminals that may
eventually appear when the calling of two rules comes into question. A standard
definition for FIRST is as follows: Let <I>a</I> be the right side of an
arbitrary production, <I>A</I>. FIRST-<I>A</I> is defined as the set of all
possible first terminals generated by <I>a</I>. To compute FIRST-<I>A</I> for
all grammar symbols <I>A</I>, apply the following rules until no more terminals
or the the null-terminal can be added to any FIRST set. This is a recursive
algorithm.
<OL>
<LI>If <I>a</I> begins with a terminal, then FIRST-<I>A</I> is the starting
terminal of <I>a</I>.
<LI>If <I>A</I> -> <I>(nothing)</I> is a production, then add the
null-terminal to FIRST-<I>A</I>.
<LI>For any production <I>A</I> -> <I>B</I>, add FIRST-<I>B</I> to
FIRST-<I>A</I>.
<LI>For any production of the form <I>A</I> -> <I>BC..M</I>N<I>..Z</I>,
where the null-terminal is in all sets FIRST-<I>B</I> through FIRST-<I>M</I>,
add FIRST-N to FIRST-<I>A</I>. </LI></OL>For example, take these three
productions:
<MENU>S -> AS | BS <BR>A -> x <BR>B -> y</MENU>These productions
generate arbitrary strings of Xs and Ys. When starting out, S has to decide
which alternative paths to call. It decided by keeping a set for FIRSTs for A
and for B. FIRST-A contains {x}, and FIRST-B contains {y}. Upon comparing the
current character to find a match in either FIRST-A or FIRST-B, S can know which
rule to call. For the string,
<MENU>xxy</MENU>S can know to invoke production A twice and then production B.
<P> Using this knowledge, we can add three rules to our expression
productions.
<MENU>A -> x := E ";" <BR>I -> if E then S endif <BR>S -> AS | IS |
<I>(nothing)</I></MENU>These three rules allow series of programming statements
consisting of simple if-constructs and assignments. Additionally, The
if-constructs can be nested. An example would be: <PRE> x := x - x;
if x then
x := x * x + x;
x := x / x;
endif</PRE>In order to know whether or not S
should invoke A or I at any given time, it must have FIRST-A and FIRST-I.
FIRST-A is {X}, and FIRST-I is {if}. When starting out on the example above, S
can know to call A first, and then I.
<P> The set of FIRST is not always easily found, and if needs be, we might
have to look several levels deep. Recall that FIRST-E was {x, "("}. We got this
by examing T (our only starting non-terminal), and then examing F (T's only
starting non-terminal, and there finding <B>x</B> and "(". For any given rule,
the FIRSTs only need to be found when the rule has more than one option, and the
rewrite sequence for one of those options begins with a non-terminal. In that
case, the rule will need to know the FIRST for that non-terminal. <BR>
<BR>
<H3>4.4.2 A Simple Parser</H3><!-------------------------------------------------------------------------------->Before
writing a recursive descent parser, we need an existing infrastructure
consisting a list of recognizable tokens, A procedure to invoke the scanner and
get the next token, some sort of error-handling routine, and a few other items.
Before we proceed further, we need to draw a line of distinction between two
items that a parser deals with a token and a symbol. A symbol is a part of a
token. Recall that a token may often have additional data. For instance, if the
scanner encountered the text, "123", it would construct a token for an integer,
with the data, "123". The symbol for this token as recognized by the SAL
compiler would be intconsy, an integer constant. The complete token would be
intconsy(123). This distinction between tokens and symbols is often blurred, in
that in some cases we can use the terms interchangeably. Keep in mind, however
that every token consists of two parts, a symbol, and a piece of (optional)
additional data.
<P>Before getting into this, let's rename our rules so that they are more human
readable.
<MENU>
<TABLE cellPadding=3 width=300 border=3>
<TBODY>
<TR>
<TD><B>Old Rule Name</B></TD>
<TD><B>New Rule Name</B></TD></TR>
<TR>
<TD>F</TD>
<TD>Factor</TD></TR>
<TR>
<TD>T</TD>
<TD>Term</TD></TR>
<TR>
<TD>E</TD>
<TD>Expression</TD></TR>
<TR>
<TD>A</TD>
<TD>AssignmentStatement</TD></TR>
<TR>
<TD>I</TD>
<TD>IfStatement</TD></TR>
<TR>
<TD>S</TD>
<TD>Statements</TD></TR></TBODY></TABLE></MENU>We will use the longer names to
refer to our rules from now on.
<P> <B>4.4.2.1 Scanner Interface</B> <BR><!-------------------------------------------------------------------------------->The
scanner interface for the purposes of the projects within this text will consist
of the function <TT>NextToken()</TT>, which retrieves the next token from the
source, and a global token struct, aptly named <TT>token</TT>. For the purposes
of this example, token will be a variable of enumerated type, declared like so: <PRE> enum _token {
badsy = -1,
idconsy, ifsy, thensy, endifsy, assignsy,
timessy, idivsy, plussy, minussy, lparentsy,
rparentsy, semicolonsy, eopsy
};</PRE>Note the blurred distinction between
tokens and symbols in this example. This is possible in our case since there is
no additional data per token.
<P> Each call to <TT>NextToken()</TT> will change the value of
<TT>token</TT>, to one of the values of the enumerated type listed above. An
explanation of these tokens and their corresponding textual strings is as
follows:
<MENU>
<TABLE cellPadding=3 width=400 border=3>
<TBODY>
<TR vAlign=top>
<TH width=100>Enum Value</TH>
<TH>Text string</TH></TR>
<TR vAlign=top>
<TD><TT>badsy</TT></TD>
<TD>Scanner failed to recognize the current token.</TD></TR>
<TR vAlign=top>
<TD><TT>idconsy</TT></TD>
<TD>x</TD></TR>
<TR vAlign=top>
<TD><TT>ifsy</TT></TD>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -