📄 chapter 4 the parserw.htm

📁 英文版编译器设计：里面详细介绍啦C编译器的设计
💻 HTM
📖 第 1 页 / 共 5 页
字号:
12 3 4 5 下一页
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0062)http://topaz.cs.byu.edu/text/html/Textbook/Chapter4/index.html -->
<HTML><HEAD><TITLE>Chapter 4: The Parser</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1458" name=GENERATOR></HEAD>
<BODY>
<CENTER>
<H1>Chapter 4<BR>The Parser</H1></CENTER>
<HR>
<!-------------------------------------------------------------------------------->In 
this chapter we will discuss the creation of a top-down recursive descent 
parser. We will introduce the notation for the syntax diagrams that we will be 
using throughout the remainder of the text. We will also cover error checking, 
error handling, and error recovery. 
<P>&nbsp;We will cover the implementation of top-down recursive descent parsing 
in this text because it is perhaps the easiest method of compiler implementation 
to code by hand. Other methods are certainly more powerful, yet they are also 
more difficult to understand at the student level. <BR>&nbsp; <BR>&nbsp; 
<H2>4.1 Productions and Syntax Diagrams</H2><!-------------------------------------------------------------------------------->All 
languages are divided into rules. These rules govern the use of words, and at a 
higher semantical level they also govern their meaning. The rules of a language 
can be represented in many ways. One very simple way of representing them is 
through a simple set of productions. At this level we concern ourselves with 
only context-free grammars: those which on the left side of the production have 
a single non-terminal, and on the right side have any series of terminals and 
non terminals. These grammars are significant because they allow us the 
capability to generate nested matching sets of patterns. The following set of 
productions generates sets of matching nested parentheses. <BR>&nbsp; <BR>&nbsp; 

<OL>S -&gt; AS <BR>S -&gt; <I>(nothing)</I> <BR>A -&gt; (S)</OL>This grammar 
generates strings like so: 
<MENU>((())) <BR>(()(()))()</MENU>Every right parenthesis in this string has a 
matching left parenthesis. Productions may also be expressed in a more consise 
form, using a vertical bar: |. The vertical bar is placed between two right-hand 
side sequences. This effectively merges two productions into one. The previous 
set of rules can be more consisely written as 
<MENU>S -&gt; AS | (nothing) <BR>A -&gt; (S)</MENU>Just as regular grammar rules 
have a corresponding state machine, context-sensitive grammars have a 
corresponding set of <I>syntax diagrams</I> (also called railroad diagrams for 
their track-like appearance). Just as regular grammars generate strings that 
FSAs can recognize, context-sensitive rules generate strings that syntax 
diagrams can recognize. Figure {SYNLEG} explains the different elements of these 
syntax diagrams. <BR>&nbsp; 
<CENTER>
<P><IMG src="Chapter 4 The Parserw.files/SYNLEG.gif"></CENTER>When we build 
syntax diagrams, there is only one diagram per non-terminal. This is an 
important distinction from state machines, where one state machine could be used 
to represent the entire grammar. With syntax diagrams, we only combine rules 
when they have the same non-terminal on the left side. For converting 
productions to diagrams, we have a few basic patterns to follow that make the 
job easier. Some of these are presented in figure {SYNPAT}. 
<CENTER><IMG src="Chapter 4 The Parserw.files/SYNPAT.gif"></CENTER>In all cases, 
syntax diagrams represent a transition of state in the same way that finite 
state automata did. All syntax diagrams are read in a left-to-right, starting 
from the upper left, and following the arrows along an arbitrary path. Diagrams 
may "call" other diagrams, allowing a form of recursive procedure call. 
<P>&nbsp;<B>A -&gt; <I>(nothing)</I></B> <BR>This is the simplest of productions 
to convert. It represents an empty transition, and effectively gets rid of a 
non-terminal. 
<P>&nbsp;<B>A -&gt; <I>B</I></B> <BR>In this rule, A is replaced by B. In the 
matching syntax diagram, the diagram for A calls the diagram for B. 
<P>&nbsp;<B>A -&gt; <I>a</I></B> <BR>In this production, the non-terminal A is 
replaced by the terminal <B>a</B>. In the syntax diagram, a token for <B>a</B> 
is consumed. 
<P>&nbsp;<B>A -&gt; <I>BA</I></B> <BR><B>A -&gt; <I>B</I></B> <BR>In this 
example, we merge two productions. The effect of these two productions is to 
generate an arbitrary amount of Bs. This is expressed in a syntax diagram by 
employing a path that loops back to the beginning. Rule A may invoke B an 
arbitrary amount of times. 
<P>&nbsp;<B>A -&gt; <I>B</I></B> <BR><B>A -&gt; <I>C</I></B> <BR>For these two 
productions, we are given the choice of rewriting A as either a B or a C. The 
syntax diagram represents this choice by a branching path. 
<P>&nbsp;For an example of how these components are put together, we can build a 
syntax diagram representing our three rules for generating strings of matching 
parentheses. 
<CENTER><IMG src="Chapter 4 The Parserw.files/SYNEX1.gif"></CENTER>In all 
grammars, there is a root rule. The root rule is the starting rule: the rule 
that calls all the other rules. By convention, many texts name this rule 
<B><I>S</I></B>. 
<H2>4.2 Top-down Recursive Descent Parsing</H2><!-------------------------------------------------------------------------------->The 
process of parsing is one of matching grammar rules for a language to an input 
stream consisting of tokens. It is a pattern-matching problem, and as 
pattern-matching problems go, it is actually one of the simpler ones. In 
recursive descent parsing, we take an arbitrary input stream and match it to the 
grammar rules that we have been given in a general-to-specific pattern. Starting 
with the most general rule (the root rule), we compare the input stream to any 
known starting tokens. We call other rules successively, as we determine their 
appropriateness, and eventually the entire string of input is recognized. 
<P>&nbsp;Given the productions: 
<MENU>E -&gt; T | T "+" E | T "-" E <BR>T -&gt; F | F "*" T | F "/" T <BR>F 
  -&gt; x | "(" E ")"</MENU>The terminal <B>x</B> will be used to represent an 
arbitrary identifier or constant. We can generate strings of mathemetical 
expressions using this set of productions. For instance: 
<MENU>x * (x - x)</MENU>The goal is to match the string of given tokens to the 
root production (which is E). This is done in several stages. 
<TABLE>
  <TBODY>
  <TR vAlign=top>
    <TD><B>(1)</B></TD>
    <TD>We begin with the stated goal of matching production E to the input 
      stream of x * (x - x). The grey area represents the uncompleted portion of 
      the tree.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step01.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(2)</B></TD>
    <TD>The first step is to substitute E with T. We non-deterministically 
      "know" to do this because at this level there is no rule to match to T + E 
      or T - E.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step02.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(3)</B></TD>
    <TD>Since we "know" that * is part of F, we can expand the variant of the 
      T production into F * T.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step03.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(4)</B></TD>
    <TD>At this point, we work down the branches of the tree, expanding them 
      in depth-first fashion and working from left to right. At this stage, F is 
      matched directly with the first x.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step04.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(5)</B></TD>
    <TD>Since the next branch is a terminal (the *), we can match it directly 
      to our input string. Then we expand the next branch of the tree, a T into 
      an F.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step05.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(6)</B></TD>
    <TD>From F we choose the production that gives us an ( E ). This is 
      obvious, since the left parenthesis terminal is the first character in the 
      rule.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step06.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(7)</B></TD>
    <TD>Now, we match the left parenthesis to the one in the imput string, and 
      then move on to expand the next branch. We choose a T - E, since the minus 
      matches the string.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step07.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(8)</B></TD>
    <TD>Working down the leftmost branch again, we replace the T with an 
    F.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step08.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(9)</B></TD>
    <TD>The F then matches directly with the next <B>x</B>.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step09.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(10)</B></TD>
    <TD>After matching the minus to the input string, we then go to the final 
      branch at this level, and substitute the E with T.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step10.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(11)</B></TD>
    <TD>Again, T goes to F.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step11.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(12)</B></TD>
    <TD>Then, F matches the third and final <B>x</B>.&nbsp;</TD>
    <TD><IMG src="Chapter 4 The Parserw.files/Step12.gif"></TD></TR>
  <TR vAlign=top>
    <TD><B>(13)</B></TD>
    <TD>Since this level is done, we recurse back four levels or so, and match 
      the remaining parenthesis to the input string, and that completes the 
      parsing!&nbsp;</TD>
    <TD><IMG 
src="Chapter 4 The Parserw.files/Step13.gif"></TD></TR></TBODY></TABLE>So the 
first question will undoubtedly be, how on earth did we simply "know" to match E 
with T in stage 2? The answer is that we just <I>knew</I>. That's what 
non-determinism is. It eventually becomes obvious at later stages of parsing 
which rule ought to be called, but at the outset it is always anyone's guess. 
Needless to say, this algorithm does not lend itself to being implemented in 
code. As it turns out, there are better ways. 
<P>&nbsp;There certainly does exist a method for determining which of all the 
rules ought to be used to expand the tree. For starters, we make sure that our 
productions are not of the class that is known as "left-recursive". In other 
words, we never have productions that look like 
<MENU>S -&gt; S<I>A</I></MENU>where <I>A</I> represents any sequence of 
terminals and non-terminals. This is very <I>bad</I>. If such a thing were coded 
into a computer, the computer would enter an infinite loop. This is the type of 
production that is trying to say, "We can make an arbitrarily long sequence of 
these things". If instead we represent such rules in a non left-recursive form, 
<MENU>S -&gt; <I>A</I>S</MENU>we can avoid such pitfalls. This essentially means 
that if the production ever calls itself, it is never first (left-most) in the 
rewrite sequence. 
<P>&nbsp;The next clue is to build a series of sets for each production that 
defines the set of first terminals for each of the rewrite alternatives. In our 
infamous case of matching E to T as opposed to T + E or T - E, we have the set 
{x, "("}. This set, usually known as <I>FIRST</I> will tell us in certain cases 
which rule we ought to take. However, in our case, it still does not help our 
cause. 
<P>&nbsp;The answer at this point is that we assume that all three rules are
12 3 4 5 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -