📄 chapter 4 the parserw.htm

📁 英文版编译器设计：里面详细介绍啦C编译器的设计
💻 HTM
📖 第 1 页 / 共 5 页
字号:
possible, and resolve the correct rule once a differentiating terminal is found. 
In the case of stage 2, we don't really know that we want T as opposed to T + E 
or T - E simply because the differentiating token ended up being the end of the 
string. Had the expression had a trailing "+ x" or a "- x", then we could have 
used T + E or T - E, respectively. 
<P>&nbsp;This is not intended to make a lot of sense, and in resursive descent 
compilers, we use a far simpler method: syntax diagrams. <BR>&nbsp; <BR>&nbsp; 
<H2>4.3 Using Syntax Diagrams in Recursive Descent Parsing</H2><!-------------------------------------------------------------------------------->Syntax 
diagrams simplify life one hundred percent, because they lend themselves so 
easily to being coded. The production rules of the previous section can be 
expressed as follows: 
<CENTER><IMG src="Chapter 4 The Parserw.files/SYNEXP.gif"></CENTER>Starting at 
rule 3 we can trace the recognition of the string "x * (x - x)" in a very 
straightforward way. Trace along the paths of the syntax diagrams as we go 
along. 
<OL>
  <LI>E calls T. It has no other choice. 
  <LI>T calls F. 
  <LI>At this point, F can make a sure decision. It sees that the first item in 
  the input string is an <B>x</B>. The <B>x</B> is consumed, and the remaining 
  string is "* (x - x)". F then returns back to T. 
  <LI>At this point, T can surrender to E, or eat a terminal. Since the next 
  character in the string (a *) matches one of its choices, it eats the star, 
  and then loops, calling F one more time. The input string is now "(x - x)". 
  <LI>F has an easy choice. The left parenthesis takes it down the lower branch, 
  and F eats the "(" and then calls E. Notice that in this case E is being 
  called recursively. The input string is now "x - x)". 
  <LI>E calls T. 
  <LI>T calls F (again). 
  <LI>This instance of F eats an <B>x</B> and then returns back to T. The input 
  string is now "- x)". 
  <LI>T does not recognize either a * or a /, and so returns back to E. 
  <LI>E sees the -, eats it, and then loops back, calling T one last time. We 
  are left with "x)". 
  <LI>T calls F. 
  <LI>F sees the last <B>x</B>, and eats it, and returns. Our string is now ")". 

  <LI>T sees no matching * or /, and so returns. 
  <LI>E sees no matching + or -, and so returns. 
  <LI>We are now back at F at the lower branch where we were left off in step 5. 
  Having just returned from calling E, the only thing we can do is eat the 
  remaining ")" in our input string. We are left with an empty string. At this 
  point we are effectively through, but we are still a couple calls deep. F 
  returns to T. 
  <LI>T sees no matching * or /, since this we are now at the end of the string. 
  It returns to E. 
  <LI>E sees no matching + or -, since this we are now at the end of the string. 
  At this point, it returns, and effectively completes the parsing process. 
</LI></OL>Since there were no errors (the example <I>was</I> contrived that 
way), we have successfully parsed the expression, "x * (x - x)". In the process, 
we have generated a neat little parse tree, that looks like this: <PRE>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; *
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; / \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /&nbsp;&nbsp; \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; x&nbsp;&nbsp;&nbsp;&nbsp; -
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; / \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /&nbsp;&nbsp; \
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; x&nbsp;&nbsp;&nbsp;&nbsp; x</PRE>In 
a parse tree, items in the lowest branches have priority. Thus, the minus in the 
tree will be processed before the *. Parse trees are useful for many reasons. 
One of their uses is for doing optimization. Parse trees are not used in this 
text. 
<P>&nbsp;Notice that during the parsing process with syntax diagrams there was 
no point at which we had to guess. There was no point like in section 4.2 at 
which we simply "knew" which rule to use. Syntax diagrams are very useful 
because they are so clear and explicit. Because of this they can be used as a 
direct template for coding a recursive descent parser for any language. We 
discuss this in the next section. 
<H2>4.4 Coding a Recursive Descent Parser</H2><!-------------------------------------------------------------------------------->We 
will discuss the implementation of a recursive descent parser using syntax 
diagrams as a template. We will discuss implementation of a parser for the set 
of productions given in section 4.2, and we will add an additional set of rules 
in order to make use of another concept introduced in that section: the set 
called FIRST. <BR>&nbsp; <BR>&nbsp; 
<H3>4.4.1 Using the Set, FIRST</H3><!-------------------------------------------------------------------------------->Recall 
that the set, FIRST is the set of all first possible terminals that may 
eventually appear when the calling of two rules comes into question. A standard 
definition for FIRST is as follows: Let <I>a</I> be the right side of an 
arbitrary production, <I>A</I>. FIRST-<I>A</I> is defined as the set of all 
possible first terminals generated by <I>a</I>. To compute FIRST-<I>A</I> for 
all grammar symbols <I>A</I>, apply the following rules until no more terminals 
or the the null-terminal can be added to any FIRST set. This is a recursive 
algorithm. 
<OL>
  <LI>If <I>a</I> begins with a terminal, then FIRST-<I>A</I> is the starting 
  terminal of <I>a</I>. 
  <LI>If <I>A</I> -&gt; <I>(nothing)</I> is a production, then add the 
  null-terminal to FIRST-<I>A</I>. 
  <LI>For any production <I>A</I> -&gt; <I>B</I>, add FIRST-<I>B</I> to 
  FIRST-<I>A</I>. 
  <LI>For any production of the form <I>A</I> -&gt; <I>BC..M</I>N<I>..Z</I>, 
  where the null-terminal is in all sets FIRST-<I>B</I> through FIRST-<I>M</I>, 
  add FIRST-N to FIRST-<I>A</I>. </LI></OL>For example, take these three 
productions: 
<MENU>S -&gt; AS | BS <BR>A -&gt; x <BR>B -&gt; y</MENU>These productions 
generate arbitrary strings of Xs and Ys. When starting out, S has to decide 
which alternative paths to call. It decided by keeping a set for FIRSTs for A 
and for B. FIRST-A contains {x}, and FIRST-B contains {y}. Upon comparing the 
current character to find a match in either FIRST-A or FIRST-B, S can know which 
rule to call. For the string, 
<MENU>xxy</MENU>S can know to invoke production A twice and then production B. 
<P>&nbsp;Using this knowledge, we can add three rules to our expression 
productions. 
<MENU>A -&gt; x := E ";" <BR>I -&gt; if E then S endif <BR>S -&gt; AS | IS | 
  <I>(nothing)</I></MENU>These three rules allow series of programming statements 
consisting of simple if-constructs and assignments. Additionally, The 
if-constructs can be nested. An example would be: <PRE>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; x := x - x;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if x then&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; x := x * x + x;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; x := x / x;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; endif</PRE>In order to know whether or not S 
should invoke A or I at any given time, it must have FIRST-A and FIRST-I. 
FIRST-A is {X}, and FIRST-I is {if}. When starting out on the example above, S 
can know to call A first, and then I. 
<P>&nbsp;The set of FIRST is not always easily found, and if needs be, we might 
have to look several levels deep. Recall that FIRST-E was {x, "("}. We got this 
by examing T (our only starting non-terminal), and then examing F (T's only 
starting non-terminal, and there finding <B>x</B> and "(". For any given rule, 
the FIRSTs only need to be found when the rule has more than one option, and the 
rewrite sequence for one of those options begins with a non-terminal. In that 
case, the rule will need to know the FIRST for that non-terminal. <BR>&nbsp; 
<BR>&nbsp; 
<H3>4.4.2 A Simple Parser</H3><!-------------------------------------------------------------------------------->Before 
writing a recursive descent parser, we need an existing infrastructure 
consisting a list of recognizable tokens, A procedure to invoke the scanner and 
get the next token, some sort of error-handling routine, and a few other items. 
Before we proceed further, we need to draw a line of distinction between two 
items that a parser deals with a token and a symbol. A symbol is a part of a 
token. Recall that a token may often have additional data. For instance, if the 
scanner encountered the text, "123", it would construct a token for an integer, 
with the data, "123". The symbol for this token as recognized by the SAL 
compiler would be intconsy, an integer constant. The complete token would be 
intconsy(123). This distinction between tokens and symbols is often blurred, in 
that in some cases we can use the terms interchangeably. Keep in mind, however 
that every token consists of two parts, a symbol, and a piece of (optional) 
additional data. 
<P>Before getting into this, let's rename our rules so that they are more human 
readable. 
<MENU>&nbsp; 
  <TABLE cellPadding=3 width=300 border=3>
    <TBODY>
    <TR>
      <TD><B>Old Rule Name</B></TD>
      <TD><B>New Rule Name</B></TD></TR>
    <TR>
      <TD>F</TD>
      <TD>Factor</TD></TR>
    <TR>
      <TD>T</TD>
      <TD>Term</TD></TR>
    <TR>
      <TD>E</TD>
      <TD>Expression</TD></TR>
    <TR>
      <TD>A</TD>
      <TD>AssignmentStatement</TD></TR>
    <TR>
      <TD>I</TD>
      <TD>IfStatement</TD></TR>
    <TR>
      <TD>S</TD>
      <TD>Statements</TD></TR></TBODY></TABLE></MENU>We will use the longer names to 
refer to our rules from now on. 
<P>&nbsp;<B>4.4.2.1 Scanner Interface</B> <BR><!-------------------------------------------------------------------------------->The 
scanner interface for the purposes of the projects within this text will consist 
of the function <TT>NextToken()</TT>, which retrieves the next token from the 
source, and a global token struct, aptly named <TT>token</TT>. For the purposes 
of this example, token will be a variable of enumerated type, declared like so: <PRE>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; enum _token {
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; badsy = -1,
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; idconsy, ifsy, thensy, endifsy, assignsy,&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; timessy, idivsy, plussy, minussy, lparentsy,&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; rparentsy, semicolonsy, eopsy
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; };</PRE>Note the blurred distinction between 
tokens and symbols in this example. This is possible in our case since there is 
no additional data per token. 
<P>&nbsp;Each call to <TT>NextToken()</TT> will change the value of 
<TT>token</TT>, to one of the values of the enumerated type listed above. An 
explanation of these tokens and their corresponding textual strings is as 
follows: 
<MENU>&nbsp; 
  <TABLE cellPadding=3 width=400 border=3>
    <TBODY>
    <TR vAlign=top>
      <TH width=100>Enum Value</TH>
      <TH>Text string</TH></TR>
    <TR vAlign=top>
      <TD><TT>badsy</TT></TD>
      <TD>Scanner failed to recognize the current token.</TD></TR>
    <TR vAlign=top>
      <TD><TT>idconsy</TT></TD>
      <TD>x</TD></TR>
    <TR vAlign=top>
      <TD><TT>ifsy</TT></TD>
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -