📄 tply.doc

📁 Yacc例子代码
💻 DOC
📖 第 1 页 / 共 4 页
字号:
There are some situations in which TP Yacc cannot easily determine the
type of values (when a typed parser is used). This is true, in particular,
for values in enclosing rules and for the $$ value in an action inside a
rule. In such cases you may use a type cast to explicitly specify the type
of a value. The format for such type casts is $<name>$ (for left-hand side
values) and $<name>i (for right-hand side values) where name is a type
identifier (which must occur in a %token, precedence or %type definition).


Auxiliary Procedures
--------------------

The third section of a TP Yacc program is optional. If it is present, it
may contain any Turbo Pascal code (such as supporting routines or a main
program) which is tacked on to the end of the output file.


Lexical Analysis
----------------

For any TP Yacc-generated parser, the programmer must supply a lexical
analyzer routine named yylex which performs the lexical analysis for
the parser. This routine must be declared as

   function yylex : Integer;

The yylex routine may either be prepared by hand, or by using the lexical
analyzer generator TP Lex (see Section `TP Lex').

The lexical analyzer must be included in your main program behind the
parser subroutine (the yyparse code template includes a forward
definition of the yylex routine such that the parser can access the
lexical analyzer). For instance, you may put the lexical analyzer
routine into the auxiliary procedures section of your TP Yacc grammar,
either directly, or by using the the Turbo Pascal include directive
($I).

The parser repeatedly calls the yylex routine to tokenize the input
stream and obtain the individual lexical items in the input. For any
literal character, the yylex routine has to return the corresponding
character code. For the other, symbolic, terminals of the input language,
the lexical analyzer must return corresponding Integer codes. These are
assigned automatically by TP Yacc in the order in which token definitions
appear in the definitions section of the source grammar. The lexical
analyzer can access these values through corresponding Integer constants
which are declared by TP Yacc in the output file.

For instance, if

   %token NUM

is the first definition in the Yacc grammar, then TP Yacc will create
a corresponding constant declaration

   const NUM = 257;

in the output file (TP Yacc automatically assigns symbolic token numbers
starting at 257; 1 thru 255 are reserved for character literals, 0 denotes
end-of-file, and 256 is reserved for the special error token which will be
discussed in Section `Error Handling'). This definition may then be used,
e.g., in a corresponding TP Lex program as follows:

   [0-9]+   return(NUM);

You can also explicitly assign token numbers in the grammar. For this
purpose, the first occurrence of a token identifier in the definitions
section may be followed by an unsigned integer. E.g. you may write:

   %token NUM 299

Besides the return value of yylex, the lexical analyzer routine may also
return an additional semantic value for the recognized token. This value
is assigned to a variable named "yylval" and may then be accessed in actions
through the $i notation (see above, Section `Grammar Rules and Actions').
The yylval variable is of type YYSType (the semantic value type, Integer
by default); its declaration may be found in the yyparse.cod file.

For instance, to assign an Integer value to a NUM token in the above
example, we may write:

   [0-9]+   begin
              val(yytext, yylval, code);
              return(NUM);
            end;

This assigns yylval the value of the NUM token (using the Turbo Pascal
standard procedure val).

If a parser uses tokens of different types (via a %token <name> definition),
then the yylval variable will not be of type Integer, but instead of a
corresponding variant record type which is capable of holding all the
different value types declared in the TP Yacc grammar. In this case, the
lexical analyzer must assign a semantic value to the corresponding record
component which is named yy<name> (where <name> stands for the corresponding
type identifier).

E.g., if token NUM is declared Real:

   %token <Real> NUM

then the value for token NUM must be assigned to yylval.yyReal.


How The Parser Works
--------------------

TP Yacc uses the LALR(1) technique developed by Donald E. Knuth and F.
DeRemer to construct a simple, efficient, non-backtracking bottom-up
parser for the source grammar. The LALR parsing technique is described
in detail in Aho/Sethi/Ullman (1986). It is quite instructive to take a
look at the parser description TP Yacc generates from a small sample
grammar, to get an idea of how the LALR parsing algorithm works. We
consider the following simplified version of the arithmetic expression
grammar:

%token NUM
%left '+'
%left '*'
%%
expr : expr '+' expr
     | expr '*' expr
     | '(' expr ')'
     | NUM
     ;

When run with the -v option on the above grammar, TP Yacc generates the
parser description listed below.

state 0:

	$accept : _ expr $end

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 1

state 1:

	$accept : expr _ $end
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	$end	accept
	'*'	shift 4
	'+'	shift 5
	.	error

state 2:

	expr : '(' _ expr ')'

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 6

state 3:

	expr : NUM _	(4)

	.	reduce 4

state 4:

	expr : expr '*' _ expr

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 7

state 5:

	expr : expr '+' _ expr

	'('	shift 2
	NUM	shift 3
	.	error

	expr	goto 8

state 6:

	expr : '(' expr _ ')'
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	')'	shift 9
	'*'	shift 4
	'+'	shift 5
	.	error

state 7:

	expr : expr '*' expr _	(2)
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	.	reduce 2

state 8:

	expr : expr '+' expr _	(1)
	expr : expr _ '+' expr
	expr : expr _ '*' expr

	'*'	shift 4
	$end	reduce 1
	')'	reduce 1
	'+'	reduce 1
	.	error

state 9:

	expr : '(' expr ')' _	(3)

	.	reduce 3


Each state of the parser corresponds to a certain prefix of the input
which has already been seen. The parser description lists the grammar
rules wich are parsed in each state, and indicates the portion of each
rule which has already been parsed by an underscore. In state 0, the
start state of the parser, the parsed rule is

	$accept : expr $end

This is not an actual grammar rule, but a starting rule automatically
added by TP Yacc. In general, it has the format

	$accept : X $end

where X is the start nonterminal of the grammar, and $end is a pseudo
token denoting end-of-input (the $end symbol is used by the parser to
determine when it has successfully parsed the input).

The description of the start rule in state 0,

	$accept : _ expr $end

with the underscore positioned before the expr symbol, indicates that
we are at the beginning of the parse and are ready to parse an expression
(nonterminal expr).

The parser maintains a stack to keep track of states visited during the
parse. There are two basic kinds of actions in each state: "shift", which
reads an input symbol and pushes the corresponding next state on top of
the stack, and "reduce" which pops a number of states from the stack
(corresponding to the number of right-hand side symbols of the rule used
in the reduction) and consults the "goto" entries of the uncovered state
to find the transition corresponding to the left-hand side symbol of the
reduced rule.

In each step of the parse, the parser is in a given state (the state on
top of its stack) and may consult the current "lookahead symbol", the
next symbol in the input, to determine the parse action - shift or reduce -
to perform. The parser terminates as soon as it reaches state 1 and reads
in the endmarker, indicated by the "accept" action on $end in state 1.

Sometimes the parser may also carry out an action without inspecting the
current lookahead token. This is the case, e.g., in state 3 where the
only action is reduction by rule 4:

	.	reduce 4

The default action in a state can also be "error" indicating that any
other input represents a syntax error. (In case of such an error the
parser will start syntactic error recovery, as described in Section
`Error Handling'.)

Now let us see how the parser responds to a given input. We consider the
input string 2+5*3 which is presented to the parser as the token sequence:

   NUM + NUM * NUM

The following table traces the corresponding actions of the parser. We also
show the current state in each move, and the remaining states on the stack.

State  Stack         Lookahead  Action
-----  ------------  ---------  --------------------------------------------

0                    NUM        shift state 3

3      0                        reduce rule 4 (pop 1 state, uncovering state
                                0, then goto state 1 on symbol expr)

1      0             +          shift state 5

5      1 0           NUM        shift state 3

3      5 1 0                    reduce rule 4 (pop 1 state, uncovering state
                                5, then goto state 8 on symbol expr)

8      5 1 0         *          shift 4

4      8 5 1 0       NUM        shift 3

3      4 8 5 1 0                reduce rule 4 (pop 1 state, uncovering state
                                4, then goto state 7 on symbol expr)

7      4 8 5 1 0                reduce rule 2 (pop 3 states, uncovering state
                                5, then goto state 8 on symbol expr)

8      5 1 0         $end       reduce rule 1 (pop 3 states, uncovering state
                                0, then goto state 1 on symbol expr)

1      0             $end       accept

It is also instructive to see how the parser responds to illegal inputs.
E.g., you may try to figure out what the parser does when confronted with:

   NUM + )

or:

   ( NUM * NUM

You will find that the parser, sooner or later, will always run into an
error action when confronted with errorneous inputs. An LALR parser will
never shift an invalid symbol and thus will always find syntax errors as
soon as it is possible during a left-to-right scan of the input.

TP Yacc provides a debugging option (-d) that may be used to trace the
actions performed by the parser. When a grammar is compiled with the
-d option, the generated parser will print out the actions as it parses
its input.


Ambigious Grammars
------------------

There are situations in which TP Yacc will not produce a valid parser for
a given input language. LALR(1) parsers are restricted to one-symbol
lookahead on which they have to base their parsing decisions. If a
grammar is ambigious, or cannot be parsed unambigiously using one-symbol
lookahead, TP Yacc will generate parsing conflicts when constructing the
parse table. There are two types of such conflicts: shift/reduce conflicts
(when there is both a shift and a reduce action for a given input symbol
in a given state), and reduce/reduce conflicts (if there is more than
one reduce action for a given input symbol in a given state). Note that
there never will be a shift/shift conflict.

When a grammar generates parsing conflicts, TP Yacc prints out the number
of shift/reduce and reduce/reduce conflicts it encountered when constructing
the parse table. However, TP Yacc will still generate the output code for the
parser. To resolve parsing conflicts, TP Yacc uses the following built-in
disambiguating rules:

- in a shift/reduce conflict, TP Yacc chooses the shift action.

- in a reduce/reduce conflict, TP Yacc chooses reduction of the first
  grammar rule.

The shift/reduce disambiguating rule correctly resolves a type of
ambiguity known as the "dangling-else ambiguity" which arises in the
syntax of conditional statements of many programming languages (as in
Pascal):

%token IF THEN ELSE
%%
stmt : IF expr THEN stmt
     | IF expr THEN stmt ELSE stmt
     ;

This grammar is ambigious, because a nested construct like

   IF expr-1 THEN IF expr-2 THEN stmt-1 ELSE stmt-2

can be parsed two ways, either as:

   IF expr-1 THEN ( IF expr-2 THEN stmt-1 ELSE stmt-2 )

or as:

   IF expr-1 THEN ( IF expr-2 THEN stmt-1 ) ELSE stmt-2

The first interpretation makes an ELSE belong to the last unmatched
IF which also is the interpretation chosen in most programming languages.
This is also the way that a TP Yacc-generated parser will parse the construct
since the shift/reduce disambiguating rule has the effect of neglecting the
reduction of IF expr-2 THEN stmt-1; instead, the parser will shift the ELSE
symbol which eventually leads to the reduction of IF expr-2 THEN stmt-1 ELSE
stmt-2.
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -