marker is seen; it accepts the input. If the endmarker is seen
in any other context, it is an error.
It is the job of the user-supplied lexical analyzer to
return the endmarker when appropriate; see section 3, below.
Usually the endmarker represents some reasonably obvious I/O
status, such as ``end-of-file'' or ``end-of-record''.
2: Actions
With each grammar rule, the user may associate actions to be
performed each time the rule is recognized in the input process.
These actions may return values, and may obtain the values
returned by previous actions. Moreover, the lexical analyzer can
return values for tokens, if desired.
An action is an arbitrary C statement, and as such can do
input and output, call subprograms, and alter external vectors
and variables. An action is specified by one or more statements,
enclosed in curly braces ``{'' and ``}''. For example,
    A   :   '(' B ')'
            { hello( 1, "abc" ); }

and

    XXX :   YYY ZZZ
            { printf("a message\n");
              flag = 25; }

are grammar rules with actions.
To facilitate easy communication between the actions and the
parser, the action statements are altered slightly. The symbol
``dollar sign'' ``$'' is used as a signal to Yacc in this con-
text.
To return a value, the action normally sets the pseudo-
variable ``$$'' to some value. For example, an action that does
nothing but return the value 1 is
{ $$ = 1; }
To obtain the values returned by previous actions and the
lexical analyzer, the action may use the pseudo-variables $1, $2,
. . ., which refer to the values returned by the components of
the right side of a rule, reading from left to right. Thus, if
the rule is
A : B C D ;
for example, then $2 has the value returned by C, and $3 the
value returned by D.
As a more concrete example, consider the rule
expr : '(' expr ')' ;
The value returned by this rule is usually the value of the expr
in parentheses. This can be indicated by
expr : '(' expr ')' { $$ = $2 ; }
By default, the value of a rule is the value of the first
element in it ($1). Thus, grammar rules of the form
A : B ;
frequently need not have an explicit action.
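For instance, the rule above behaves exactly as if the copying
action had been written out; the expanded form below is an
illustration added here, not part of the original specification:

    A   :   B
            { $$ = $1; }    /* what Yacc does by default */
        ;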
In the examples above, all the actions came at the end of
their rules. Sometimes, it is desirable to get control before a
rule is fully parsed. Yacc permits an action to be written in
the middle of a rule as well as at the end. This rule is assumed
to return a value, accessible through the usual mechanism by the
actions to the right of it. In turn, it may access the values
returned by the symbols to its left. Thus, in the rule
    A   :   B
            { $$ = 1; }
        C
            { x = $2; y = $3; }
        ;
the effect is to set x to 1, and y to the value returned by C.
Actions that do not terminate a rule are actually handled by
Yacc by manufacturing a new nonterminal symbol name, and a new
rule matching this name to the empty string. The interior action
is the action triggered off by recognizing this added rule. Yacc
actually treats the above example as if it had been written:
    $ACT  :  /* empty */
             { $$ = 1; }
          ;

    A     :  B  $ACT  C
             { x = $2; y = $3; }
          ;
In many applications, output is not done directly by the
actions; rather, a data structure, such as a parse tree, is con-
structed in memory, and transformations are applied to it before
output is generated. Parse trees are particularly easy to con-
struct, given routines to build and maintain the tree structure
desired. For example, suppose there is a C function node, writ-
ten so that the call
node( L, n1, n2 )
creates a node with label L, and descendants n1 and n2, and
returns the index of the newly created node. Then a parse tree can
be built by supplying actions such as:
    expr  :  expr '+' expr
             { $$ = node( '+', $1, $3 ); }
in the specification.
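As a purely illustrative sketch (the array size, field names, and
overflow handling below are assumptions, not part of any Yacc
distribution), such a node routine could store the tree in a static
array and hand back array indices:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAXNODES 2000              /* assumed limit on tree size */

    struct tnode {
        int label;                     /* node label, e.g. '+' */
        int left, right;               /* indices of the descendants */
    };

    static struct tnode tree[MAXNODES];
    static int ntree = 0;              /* nodes allocated so far */

    int node( int L, int n1, int n2 )
    {
        if ( ntree >= MAXNODES ) {
            fprintf( stderr, "parse tree overflow\n" );
            exit( 1 );
        }
        tree[ntree].label = L;
        tree[ntree].left  = n1;
        tree[ntree].right = n2;
        return ntree++;                /* index of the newly created node */
    }

A routine along these lines would normally be placed in the programs
section of the specification file, together with the lexical
analyzer.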
The user may define other variables to be used by the
actions. Declarations and definitions can appear in the declara-
tions section, enclosed in the marks ``%{'' and ``%}''. These
declarations and definitions have global scope, so they are known
to the action statements and the lexical analyzer. For example,
%{ int variable = 0; %}
could be placed in the declarations section, making variable
accessible to all of the actions. The Yacc parser uses only
names beginning in ``yy''; the user should avoid such names.
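As a small added illustration, such a variable can be read or set
from any action; the following (hypothetical) action counts
parenthesized expressions as they are recognized:

    expr  :  '(' expr ')'
             { variable++;  $$ = $2; }
          ;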
In these examples, all the values are integers: a discussion
of values of other types will be found in Section 10.
3: Lexical Analysis
The user must supply a lexical analyzer to read the input
stream and communicate tokens (with values, if desired) to the
parser. The lexical analyzer is an integer-valued function
called yylex. The function returns an integer, the token number,
representing the kind of token read. If there is a value associ-
ated with that token, it should be assigned to the external vari-
able yylval.
The parser and the lexical analyzer must agree on these
token numbers in order for communication between them to take
place. The numbers may be chosen by Yacc, or chosen by the user.
In either case, the ``# define'' mechanism of C is used to allow
the lexical analyzer to return these numbers symbolically. For
example, suppose that the token name DIGIT has been defined in
the declarations section of the Yacc specification file. The
relevant portion of the lexical analyzer might look like:
    yylex(){
        extern int yylval;
        int c;
        . . .
        c = getchar();
        . . .
        switch( c ) {
        . . .
        case '0':
        case '1':
        . . .
        case '9':
            yylval = c - '0';
            return( DIGIT );
        . . .
        }
        . . .
    }
The intent is to return a token number of DIGIT, and a value
equal to the numerical value of the digit. Provided that the
lexical analyzer code is placed in the programs section of the
specification file, the identifier DIGIT will be defined as the
token number associated with the token DIGIT.
This mechanism leads to clear, easily modified lexical
analyzers; the only pitfall is the need to avoid using any token
names in the grammar that are reserved or significant in C or the
parser; for example, the use of token names if or while will
almost certainly cause severe difficulties when the lexical
analyzer is compiled. The token name error is reserved for error
handling, and should not be used naively (see Section 7).
As mentioned above, the token numbers may be chosen by Yacc
or by the user. In the default situation, the numbers are chosen
by Yacc. The default token number for a literal character is the
numerical value of the character in the local character set.
Other names are assigned token numbers starting at 257.
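Thus, if DIGIT is the first token name declared in the
specification, the output of Yacc will contain a definition along
the lines of

    # define DIGIT 257

(the exact number depends on the order of the declarations), and it
is this definition that allows the lexical analyzer shown earlier to
return DIGIT symbolically.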
To assign a token number to a token (including literals),
the first appearance of the token name or literal in the declara-
tions section can be immediately followed by a nonnegative
integer. This integer is taken to be the token number of the
name or literal. Names and literals not defined by this mechan-
ism retain their default definition. It is important that all
token numbers be distinct.
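For example, the declarations below (an illustrative fragment;
LETTER is a hypothetical second token name) assign token number 300
to DIGIT and 301 to LETTER, while any other tokens keep the defaults
described above:

    %token  DIGIT   300
    %token  LETTER  301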
For historical reasons, the endmarker must have token number
0 or negative. This token number cannot be redefined by the
user; thus, all lexical analyzers should be prepared to return 0
or negative as a token number upon reaching the end of their
input.
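In a lexical analyzer built on getchar, as in the sketch above, this
might be handled by a fragment such as the following (illustrative
only):

    if ( c == EOF )
        return( 0 );    /* endmarker: end of input */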
A very useful tool for constructing lexical analyzers is the
Lex program developed by Mike Lesk.[8] These lexical analyzers
are designed to work in close harmony with Yacc parsers. The
specifications for these lexical analyzers use regular expres-
sions instead of grammar rules. Lex can be easily used to pro-
duce quite complicated lexical analyzers, but there remain some
languages (such as FORTRAN) which do not fit any theoretical
framework, and whose lexical analyzers must be crafted by hand.
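As an added sketch (assuming the common arrangement in which Yacc
writes its token definitions to a header file, conventionally
y.tab.h, that the Lex specification includes), a Lex analyzer
returning the DIGIT token of the earlier example might read:

    %{
    # include "y.tab.h"     /* token numbers defined by Yacc (assumed) */
    extern int yylval;
    %}
    %%
    [0-9]       { yylval = yytext[0] - '0';  return( DIGIT ); }
    [ \t\n]     ;           /* skip white space */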
4: How the Parser Works
Yacc turns the specification file into a C program, which
parses the input according to the specification given. The algo-
rithm used to go from the specification to the parser is complex,
and will not be discussed here (see the references for more
information). The parser itself, however, is relatively simple,
and understanding how it works, while not strictly necessary,
will nevertheless make treatment of error recovery and ambigui-
ties much more comprehensible.
The parser produced by Yacc consists of a finite state
machine with a stack. The parser is also capable of reading and
remembering the next input token (called the lookahead token).
The current state is always the one on the top of the stack. The
states of the finite state machine are given small integer
labels; initially, the machine is in state 0, the stack contains
only state 0, and no lookahead token has been read.
The machine has only four actions available to it, called
shift, reduce, accept, and error. A move of the parser is done
as follows:
1. Based on its current state, the parser decides whether it
needs a lookahead token to decide what action should be
done; if it needs one, and does not have one, it calls yylex
to obtain the next token.
2. Using the current state, and the lookahead token if needed,
the parser decides on its next action, and carries it out.
This may result in states being pushed onto the stack, or
popped off of the stack, and in the lookahead token being
processed or left alone.
The shift action is the most common action the parser takes.
Whenever a shift action is taken, there is always a lookahead
token. For example, in state 56 there may be an action:
IF shift 34
which says, in state 56, if the lookahead token is IF, the
current state (56) is pushed down on the stack, and state 34
becomes the current state (on the top of the stack). The looka-
head token is cleared.
The reduce action keeps the stack from growing without
bounds. Reduce actions are appropriate when the parser has seen
the right hand side of a grammar rule, and is prepared to
announce that it has seen an instance of the rule, replacing the
right hand side by the left hand side. It may be necessary to
consult the lookahead token to decide whether to reduce, but usu-
ally it is not; in fact, the default action (represented by a
``.'') is often a reduce action.
Reduce actions are associated with individual grammar rules.
Grammar rules are also given small integer numbers, leading to
some confusion. The action
. reduce 18
refers to grammar rule 18, while the action
IF shift 34
refers to state 34.
Suppose the rule being reduced is
A : x y z ;
The reduce action depends on the left hand symbol (A in this
case), and the number of symbols on the right hand side (three in
this case). To reduce, the parser first pops the top three states
off the stack (in general, the number of states popped equals the
number of symbols on the right side of the rule). In effect,
these states were the ones put on the stack while recognizing x,
y, and z, and no longer serve any useful purpose. After popping
these states, a state is uncovered which was the state the parser
was in before beginning to process the rule. Using this
uncovered state, and the symbol on the left side of the rule,
perform what is in effect a shift of A. A new state is obtained,
pushed onto the stack, and parsing continues. There are signifi-
cant differences between the processing of the left hand symbol
and an ordinary shift of a token, however, so this action is
called a goto action. In particular, the lookahead token is
cleared by a shift, and is not affected by a goto. In any case,
the uncovered state contains an entry such as:
A goto 20
causing state 20 to be pushed onto the stack, and become the
current state.
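Schematically (this illustration is added here; it is not output
produced by Yacc), the reduction by A : x y z described above looks
like this, where s1, s2, and s3 are the states pushed while
recognizing x, y, and z, and s is the uncovered state containing the
entry ``A goto 20'':

    before the reduce:         . . .  s  s1  s2  s3
    after popping 3 states:    . . .  s
    after the goto on A:       . . .  s  20        (20 is the new current state)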