📄 chapter 4 the parserw.htm
字号:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0062)http://topaz.cs.byu.edu/text/html/Textbook/Chapter4/index.html -->
<HTML><HEAD><TITLE>Chapter 4: The Parser</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1458" name=GENERATOR></HEAD>
<BODY>
<CENTER>
<H1>Chapter 4<BR>The Parser</H1></CENTER>
<HR>
<!-------------------------------------------------------------------------------->In
this chapter we will discuss the creation of a top-down recursive descent
parser. We will introduce the notation for the syntax diagrams that we will be
using throughout the remainder of the text. We will also cover error checking,
error handling, and error recovery.
<P> We will cover the implementation of top-down recursive descent parsing
in this text because it is perhaps the easiest method of compiler implementation
to code by hand. Other methods are certainly more powerful, yet they are also
more difficult to understand at the student level. <BR> <BR>
<H2>4.1 Productions and Syntax Diagrams</H2><!-------------------------------------------------------------------------------->All
languages are divided into rules. These rules govern the use of words, and at a
higher semantical level they also govern their meaning. The rules of a language
can be represented in many ways. One very simple way of representing them is
through a simple set of productions. At this level we concern ourselves with
only context-free grammars: those which on the left side of the production have
a single non-terminal, and on the right side have any series of terminals and
non terminals. These grammars are significant because they allow us the
capability to generate nested matching sets of patterns. The following set of
productions generates sets of matching nested parentheses. <BR> <BR>
<OL>S -> AS <BR>S -> <I>(nothing)</I> <BR>A -> (S)</OL>This grammar
generates strings like so:
<MENU>((())) <BR>(()(()))()</MENU>Every right parenthesis in this string has a
matching left parenthesis. Productions may also be expressed in a more consise
form, using a vertical bar: |. The vertical bar is placed between two right-hand
side sequences. This effectively merges two productions into one. The previous
set of rules can be more consisely written as
<MENU>S -> AS | (nothing) <BR>A -> (S)</MENU>Just as regular grammar rules
have a corresponding state machine, context-sensitive grammars have a
corresponding set of <I>syntax diagrams</I> (also called railroad diagrams for
their track-like appearance). Just as regular grammars generate strings that
FSAs can recognize, context-sensitive rules generate strings that syntax
diagrams can recognize. Figure {SYNLEG} explains the different elements of these
syntax diagrams. <BR>
<CENTER>
<P><IMG src="Chapter 4 The Parserw.files/SYNLEG.gif"></CENTER>When we build
syntax diagrams, there is only one diagram per non-terminal. This is an
important distinction from state machines, where one state machine could be used
to represent the entire grammar. With syntax diagrams, we only combine rules
when they have the same non-terminal on the left side. For converting
productions to diagrams, we have a few basic patterns to follow that make the
job easier. Some of these are presented in figure {SYNPAT}.
<CENTER><IMG src="Chapter 4 The Parserw.files/SYNPAT.gif"></CENTER>In all cases,
syntax diagrams represent a transition of state in the same way that finite
state automata did. All syntax diagrams are read in a left-to-right, starting
from the upper left, and following the arrows along an arbitrary path. Diagrams
may "call" other diagrams, allowing a form of recursive procedure call.
<P> <B>A -> <I>(nothing)</I></B> <BR>This is the simplest of productions
to convert. It represents an empty transition, and effectively gets rid of a
non-terminal.
<P> <B>A -> <I>B</I></B> <BR>In this rule, A is replaced by B. In the
matching syntax diagram, the diagram for A calls the diagram for B.
<P> <B>A -> <I>a</I></B> <BR>In this production, the non-terminal A is
replaced by the terminal <B>a</B>. In the syntax diagram, a token for <B>a</B>
is consumed.
<P> <B>A -> <I>BA</I></B> <BR><B>A -> <I>B</I></B> <BR>In this
example, we merge two productions. The effect of these two productions is to
generate an arbitrary amount of Bs. This is expressed in a syntax diagram by
employing a path that loops back to the beginning. Rule A may invoke B an
arbitrary amount of times.
<P> <B>A -> <I>B</I></B> <BR><B>A -> <I>C</I></B> <BR>For these two
productions, we are given the choice of rewriting A as either a B or a C. The
syntax diagram represents this choice by a branching path.
<P> For an example of how these components are put together, we can build a
syntax diagram representing our three rules for generating strings of matching
parentheses.
<CENTER><IMG src="Chapter 4 The Parserw.files/SYNEX1.gif"></CENTER>In all
grammars, there is a root rule. The root rule is the starting rule: the rule
that calls all the other rules. By convention, many texts name this rule
<B><I>S</I></B>.
<H2>4.2 Top-down Recursive Descent Parsing</H2><!-------------------------------------------------------------------------------->The
process of parsing is one of matching grammar rules for a language to an input
stream consisting of tokens. It is a pattern-matching problem, and as
pattern-matching problems go, it is actually one of the simpler ones. In
recursive descent parsing, we take an arbitrary input stream and match it to the
grammar rules that we have been given in a general-to-specific pattern. Starting
with the most general rule (the root rule), we compare the input stream to any
known starting tokens. We call other rules successively, as we determine their
appropriateness, and eventually the entire string of input is recognized.
<P> Given the productions:
<MENU>E -> T | T "+" E | T "-" E <BR>T -> F | F "*" T | F "/" T <BR>F
-> x | "(" E ")"</MENU>The terminal <B>x</B> will be used to represent an
arbitrary identifier or constant. We can generate strings of mathemetical
expressions using this set of productions. For instance:
<MENU>x * (x - x)</MENU>The goal is to match the string of given tokens to the
root production (which is E). This is done in several stages.
<TABLE>
<TBODY>
<TR vAlign=top>
<TD><B>(1)</B></TD>
<TD>We begin with the stated goal of matching production E to the input
stream of x * (x - x). The grey area represents the uncompleted portion of
the tree. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step01.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(2)</B></TD>
<TD>The first step is to substitute E with T. We non-deterministically
"know" to do this because at this level there is no rule to match to T + E
or T - E. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step02.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(3)</B></TD>
<TD>Since we "know" that * is part of F, we can expand the variant of the
T production into F * T. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step03.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(4)</B></TD>
<TD>At this point, we work down the branches of the tree, expanding them
in depth-first fashion and working from left to right. At this stage, F is
matched directly with the first x. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step04.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(5)</B></TD>
<TD>Since the next branch is a terminal (the *), we can match it directly
to our input string. Then we expand the next branch of the tree, a T into
an F. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step05.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(6)</B></TD>
<TD>From F we choose the production that gives us an ( E ). This is
obvious, since the left parenthesis terminal is the first character in the
rule. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step06.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(7)</B></TD>
<TD>Now, we match the left parenthesis to the one in the imput string, and
then move on to expand the next branch. We choose a T - E, since the minus
matches the string. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step07.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(8)</B></TD>
<TD>Working down the leftmost branch again, we replace the T with an
F. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step08.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(9)</B></TD>
<TD>The F then matches directly with the next <B>x</B>. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step09.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(10)</B></TD>
<TD>After matching the minus to the input string, we then go to the final
branch at this level, and substitute the E with T. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step10.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(11)</B></TD>
<TD>Again, T goes to F. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step11.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(12)</B></TD>
<TD>Then, F matches the third and final <B>x</B>. </TD>
<TD><IMG src="Chapter 4 The Parserw.files/Step12.gif"></TD></TR>
<TR vAlign=top>
<TD><B>(13)</B></TD>
<TD>Since this level is done, we recurse back four levels or so, and match
the remaining parenthesis to the input string, and that completes the
parsing! </TD>
<TD><IMG
src="Chapter 4 The Parserw.files/Step13.gif"></TD></TR></TBODY></TABLE>So the
first question will undoubtedly be, how on earth did we simply "know" to match E
with T in stage 2? The answer is that we just <I>knew</I>. That's what
non-determinism is. It eventually becomes obvious at later stages of parsing
which rule ought to be called, but at the outset it is always anyone's guess.
Needless to say, this algorithm does not lend itself to being implemented in
code. As it turns out, there are better ways.
<P> There certainly does exist a method for determining which of all the
rules ought to be used to expand the tree. For starters, we make sure that our
productions are not of the class that is known as "left-recursive". In other
words, we never have productions that look like
<MENU>S -> S<I>A</I></MENU>where <I>A</I> represents any sequence of
terminals and non-terminals. This is very <I>bad</I>. If such a thing were coded
into a computer, the computer would enter an infinite loop. This is the type of
production that is trying to say, "We can make an arbitrarily long sequence of
these things". If instead we represent such rules in a non left-recursive form,
<MENU>S -> <I>A</I>S</MENU>we can avoid such pitfalls. This essentially means
that if the production ever calls itself, it is never first (left-most) in the
rewrite sequence.
<P> The next clue is to build a series of sets for each production that
defines the set of first terminals for each of the rewrite alternatives. In our
infamous case of matching E to T as opposed to T + E or T - E, we have the set
{x, "("}. This set, usually known as <I>FIRST</I> will tell us in certain cases
which rule we ought to take. However, in our case, it still does not help our
cause.
<P> The answer at this point is that we assume that all three rules are
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -