marker is seen; it accepts the input. If the endmarker is seen
in any other context, it is an error.
It is the job of the user-supplied lexical analyzer to
return the endmarker when appropriate; see section 3, below.
Usually the endmarker represents some reasonably obvious I/O
status, such as ``end-of-file'' or ``end-of-record''.
2: Actions
With each grammar rule, the user may associate actions to be
performed each time the rule is recognized in the input process.
These actions may return values, and may obtain the values
returned by previous actions. Moreover, the lexical analyzer can
return values for tokens, if desired.
An action is an arbitrary C statement, and as such can do
input and output, call subprograms, and alter external vectors
and variables. An action is specified by one or more statements,
enclosed in curly braces ``{'' and ``}''. For example,
    A   :   '(' B ')'
            { hello( 1, "abc" ); }

and

    XXX :   YYY ZZZ
            { printf("a message\n");
              flag = 25; }

are grammar rules with actions.
To facilitate easy communication between the actions and the
parser, the action statements are altered slightly. The symbol
``dollar sign'' ``$'' is used as a signal to Yacc in this con-
text.
To return a value, the action normally sets the pseudo-
variable ``$$'' to some value. For example, an action that does
nothing but return the value 1 is
{ $$ = 1; }
To obtain the values returned by previous actions and the
lexical analyzer, the action may use the pseudo-variables $1, $2,
. . ., which refer to the values returned by the components of
the right side of a rule, reading from left to right. Thus, if
the rule is
A : B C D ;
for example, then $2 has the value returned by C, and $3 the
value returned by D.
As a more concrete example, consider the rule
expr : '(' expr ')' ;
The value returned by this rule is usually the value of the expr
in parentheses. This can be indicated by
expr : '(' expr ')' { $$ = $2 ; }
By default, the value of a rule is the value of the first
element in it ($1). Thus, grammar rules of the form
A : B ;
frequently need not have an explicit action.
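For instance, the rule above behaves exactly as if the copying
action had been written out; the expanded form below is an
illustration added here, not part of the original specification:

    A   :   B
            { $$ = $1; }    /* what Yacc does by default */
        ;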
In the examples above, all the actions came at the end of
their rules. Sometimes, it is desirable to get control before a
rule is fully parsed. Yacc permits an action to be written in
the middle of a rule as well as at the end. This rule is assumed
to return a value, accessible through the usual mechanism by the
actions to the right of it. In turn, it may access the values
returned by the symbols to its left. Thus, in the rule
    A   :   B
            { $$ = 1; }
        C
            { x = $2; y = $3; }
        ;
the effect is to set x to 1, and y to the value returned by C.
Actions that do not terminate a rule are actually handled by
Yacc by manufacturing a new nonterminal symbol name, and a new
rule matching this name to the empty string. The interior action
is the action triggered off by recognizing this added rule. Yacc
actually treats the above example as if it had been written:
    $ACT  :  /* empty */
             { $$ = 1; }
          ;

    A     :  B  $ACT  C
             { x = $2; y = $3; }
          ;
In many applications, output is not done directly by the
actions; rather, a data structure, such as a parse tree, is con-
structed in memory, and transformations are applied to it before
output is generated. Parse trees are particularly easy to con-
struct, given routines to build and maintain the tree structure
desired. For example, suppose there is a C function node, writ-
ten so that the call
node( L, n1, n2 )
creates a node with label L, and descendants n1 and n2, and
returns the index of the newly created node. Then a parse tree can
be built by supplying actions such as:
    expr  :  expr '+' expr
             { $$ = node( '+', $1, $3 ); }
in the specification.
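As a purely illustrative sketch (the array size, field names, and
overflow handling below are assumptions, not part of any Yacc
distribution), such a node routine could store the tree in a static
array and hand back array indices:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAXNODES 2000              /* assumed limit on tree size */

    struct tnode {
        int label;                     /* node label, e.g. '+' */
        int left, right;               /* indices of the descendants */
    };

    static struct tnode tree[MAXNODES];
    static int ntree = 0;              /* nodes allocated so far */

    int node( int L, int n1, int n2 )
    {
        if ( ntree >= MAXNODES ) {
            fprintf( stderr, "parse tree overflow\n" );
            exit( 1 );
        }
        tree[ntree].label = L;
        tree[ntree].left  = n1;
        tree[ntree].right = n2;
        return ntree++;                /* index of the newly created node */
    }

A routine along these lines would normally be placed in the programs
section of the specification file, together with the lexical
analyzer.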
The user may define other variables to be used by the
actions. Declarations and definitions can appear in the declara-
tions section, enclosed in the marks ``%{'' and ``%}''. These
declarations and definitions have global scope, so they are known
to the action statements and the lexical analyzer. For example,
%{ int variable = 0; %}
could be placed in the declarations section, making variable
accessible to all of the actions. The Yacc parser uses only
names beginning in ``yy''; the user should avoid such names.
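As a small added illustration, such a variable can be read or set
from any action; the following (hypothetical) action counts
parenthesized expressions as they are recognized:

    expr  :  '(' expr ')'
             { variable++;  $$ = $2; }
          ;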
In these examples, all the values are integers: a discussion
of values of other types will be found in Section 10.
3: Lexical Analysis
The user must supply a lexical analyzer to read the input
stream and communicate tokens (with values, if desired) to the
parser. The lexical analyzer is an integer-valued function
called yylex. The function returns an integer, the token number,
representing the kind of token read. If there is a value associ-
ated with that token, it should be assigned to the external vari-
able yylval.
The parser and the lexical analyzer must agree on these
token numbers in order for communication between them to take
place. The numbers may be chosen by Yacc, or chosen by the user.
In either case, the ``# define'' mechanism of C is used to allow
the lexical analyzer to return these numbers symbolically. For
example, suppose that the token name DIGIT has been defined in
the declarations section of the Yacc specification file. The
relevant portion of the lexical analyzer might look like:
    yylex(){
        extern int yylval;
        int c;
        . . .
        c = getchar();
        . . .
        switch( c ) {
        . . .
        case '0':
        case '1':
        . . .
        case '9':
            yylval = c - '0';
            return( DIGIT );
        . . .
        }
        . . .
    }
The intent is to return a token number of DIGIT, and a value
equal to the numerical value of the digit. Provided that the
lexical analyzer code is placed in the programs section of the
specification file, the identifier DIGIT will be defined as the
token number associated with the token DIGIT.
This mechanism leads to clear, easily modified lexical
analyzers; the only pitfall is the need to avoid using any token
names in the grammar that are reserved or significant in C or the
parser; for example, the use of token names if or while will
almost certainly cause severe difficulties when the lexical
analyzer is compiled. The token name error is reserved for error
handling, and should not be used naively (see Section 7).
As mentioned above, the token numbers may be chosen by Yacc
or by the user. In the default situation, the numbers are chosen
by Yacc. The default token number for a literal character is the
numerical value of the character in the local character set.
Other names are assigned token numbers starting at 257.
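Thus, if DIGIT is the first token name declared in the
specification, the output of Yacc will contain a definition along
the lines of

    # define DIGIT 257

(the exact number depends on the order of the declarations), and it
is this definition that allows the lexical analyzer shown earlier to
return DIGIT symbolically.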
To assign a token number to a token (including literals),
the first appearance of the token name or literal in the declara-
tions section can be immediately followed by a nonnegative
integer. This integer is taken to be the token number of the
name or literal. Names and literals not defined by this mechan-
ism retain their default definition. It is important that all
token numbers be distinct.
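For example, the declarations below (an illustrative fragment;
LETTER is a hypothetical second token name) assign token number 300
to DIGIT and 301 to LETTER, while any other tokens keep the defaults
described above:

    %token  DIGIT   300
    %token  LETTER  301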
For historical reasons, the endmarker must have token number
0 or negative. This token number cannot be redefined by the
user; thus, all lexical analyzers should be prepared to return 0
or negative as a token number upon reaching the end of their
input.
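In a lexical analyzer built on getchar, as in the sketch above, this
might be handled by a fragment such as the following (illustrative
only):

    if ( c == EOF )
        return( 0 );    /* endmarker: end of input */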
A very useful tool for constructing lexical analyzers is the
Lex program developed by Mike Lesk.[8] These lexical analyzers
are designed to work in close harmony with Yacc parsers. The
specifications for these lexical analyzers use regular expres-
sions instead of grammar rules. Lex can be easily used to pro-
duce quite complicated lexical analyzers, but there remain some
languages (such as FORTRAN) which do not fit any theoretical
framework, and whose lexical analyzers must be crafted by hand.
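As an added sketch (assuming the common arrangement in which Yacc
writes its token definitions to a header file, conventionally
y.tab.h, that the Lex specification includes), a Lex analyzer
returning the DIGIT token of the earlier example might read:

    %{
    # include "y.tab.h"     /* token numbers defined by Yacc (assumed) */
    extern int yylval;
    %}
    %%
    [0-9]       { yylval = yytext[0] - '0';  return( DIGIT ); }
    [ \t\n]     ;           /* skip white space */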
4: How the Parser Works
Yacc turns the specification file into a C program, which
parses the input according to the specification given. The algo-
rithm used to go from the specification to the parser is complex,
and will not be discussed here (see the references for more
information). The parser itself, however, is relatively simple,
and understanding how it works, while not strictly necessary,
will nevertheless make treatment of error recovery and ambigui-
ties much more comprehensible.
The parser produced by Yacc consists of a finite state
machine with a stack. The parser is also capable of reading and
remembering the next input token (called the lookahead token).
The current state is always the one on the top of the stack. The
states of the finite state machine are given small integer
labels; initially, the machine is in state 0, the stack contains
only state 0, and no lookahead token has been read.
The machine has only four actions available to it, called
shift, reduce, accept, and error. A move of the parser is done
as follows:
1. Based on its current state, the parser decides whether it
needs a lookahead token to decide what action should be
done; if it needs one, and does not have one, it calls yylex
to obtain the next token.
2. Using the current state, and the lookahead token if needed,
the parser decides on its next action, and carries it out.
This may result in states being pushed onto the stack, or
popped off of the stack, and in the lookahead token being
processed or left alone.
The shift action is the most common action the parser takes.
Whenever a shift action is taken, there is always a lookahead
token. For example, in state 56 there may be an action:
IF shift 34
which says, in state 56, if the lookahead token is IF, the
current state (56) is pushed down on the stack, and state 34
becomes the current state (on the top of the stack). The looka-
head token is cleared.
The reduce action keeps the stack from growing without
bounds. Reduce actions are appropriate when the parser has seen
the right hand side of a grammar rule, and is prepared to
announce that it has seen an instance of the rule, replacing the
right hand side by the left hand side. It may be necessary to
consult the lookahead token to decide whether to reduce, but usu-
ally it is not; in fact, the default action (represented by a
``.'') is often a reduce action.
Reduce actions are associated with individual grammar rules.
Grammar rules are also given small integer numbers, leading to
some confusion. The action
. reduce 18
refers to grammar rule 18, while the action
IF shift 34
refers to state 34.
Suppose the rule being reduced is
A : x y z ;
The reduce action depends on the left hand symbol (A in this
case), and the number of symbols on the right hand side (three in
this case). To reduce, the parser first pops the top three states
off the stack (in general, the number of states popped equals the
number of symbols on the right side of the rule). In effect,
these states were the ones put on the stack while recognizing x,
y, and z, and no longer serve any useful purpose. After popping
these states, a state is uncovered which was the state the parser
was in before beginning to process the rule. Using this
uncovered state, and the symbol on the left side of the rule,
perform what is in effect a shift of A. A new state is obtained,
pushed onto the stack, and parsing continues. There are signifi-
cant differences between the processing of the left hand symbol
and an ordinary shift of a token, however, so this action is
called a goto action. In particular, the lookahead token is
cleared by a shift, and is not affected by a goto. In any case,
the uncovered state contains an entry such as:
A goto 20
causing state 20 to be pushed onto the stack, and become the
current state.
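Schematically (this illustration is added here; it is not output
produced by Yacc), the reduction by A : x y z described above looks
like this, where s1, s2, and s3 are the states pushed while
recognizing x, y, and z, and s is the uncovered state containing the
entry ``A goto 20'':

    before the reduce:         . . .  s  s1  s2  s3
    after popping 3 states:    . . .  s
    after the goto on A:       . . .  s  20        (20 is the new current state)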