📄 yacc-docs.txt

📁 句法分析器。一般用文法(grammar)来刻画.常见的是短语结构文法(chomsky hierarchy),其中最常用的是上下文无关文法(CFG)。
💻 TXT
📖 第 1 页 / 共 5 页
字号:
12 3 4 5 下一页
http://www.csc.calpoly.edu/~gfisher/450/doc/yacc/paper.txt


Yacc: Yet Another Compiler-Compiler                      PS1:15-1





               Yacc: Yet Another Compiler-Compiler


                       Stephen C. Johnson

                     AT&T Bell Laboratories
                  Murray Hill, New Jersey 07974


                            ABSTRACT


          Computer program input generally has  some  struc-
     ture;  in  fact, every computer program that does input
     can be thought of as  defining  an  ``input  language''
     which  it accepts.  An input language may be as complex
     as a programming language, or as simple as  a  sequence
     of  numbers.  Unfortunately, usual input facilities are
     limited, difficult to use,  and  often  are  lax  about
     checking their inputs for validity.

          Yacc provides a general tool  for  describing  the
     input  to  a computer program.  The Yacc user specifies
     the structures of his input, together with code  to  be
     invoked  as  each  such  structure is recognized.  Yacc
     turns such a specification into a subroutine that  han-
     dles  the  input  process; frequently, it is convenient
     and appropriate to have most of the flow of control  in
     the user's application handled by this subroutine.

          The input subroutine  produced  by  Yacc  calls  a
     user-supplied  routine  to  return the next basic input
     item.  Thus, the user can specify his input in terms of
     individual  input  characters,  or  in  terms of higher
     level constructs such as names and numbers.  The  user-
     supplied  routine  may  also  handle idiomatic features
     such as comment  and  continuation  conventions,  which
     typically defy easy grammatical specification.

          Yacc is written  in  portable  C.   The  class  of
     specifications  accepted is a very general one: LALR(1)
     grammars with disambiguating rules.

          In addition to compilers for C, APL, Pascal,  RAT-
     FOR,  etc.,  Yacc  has  also been used for less conven-
     tional languages, including a phototypesetter language,
     several desk calculator languages, a document retrieval
     system, and a Fortran debugging system.












PS1:15-2                      Yacc: Yet Another Compiler-Compiler


0: Introduction

     Yacc provides a general tool for imposing structure  on  the
input to a computer program.  The Yacc user prepares a specifica-
tion of the input process; this  includes  rules  describing  the
input  structure,  code to be invoked when these rules are recog-
nized, and a low-level routine to do the basic input.  Yacc  then
generates  a  function  to control the input process.  This func-
tion, called a parser, calls the  user-supplied  low-level  input
routine (the lexical analyzer) to pick up the basic items (called
tokens) from  the  input  stream.   These  tokens  are  organized
according  to  the  input  structure rules, called grammar rules;
when one of these rules has been recognized, then user code  sup-
plied  for  this  rule,  an  action, is invoked; actions have the
ability to return values and make use  of  the  values  of  other
actions.

     Yacc is written in  a  portable  dialect  of  C[1]  and  the
actions, and output subroutine, are in C as well.  Moreover, many
of the syntactic conventions of Yacc follow C.

     The heart of the input  specification  is  a  collection  of
grammar  rules.   Each  rule describes an allowable structure and
gives it a name.  For example, one grammar rule might be

        date  :  month_name  day  ','  year   ;

Here, date, month_name, day, and  year  represent  structures  of
interest  in  the input process; presumably, month_name, day, and
year are defined elsewhere.  The comma ``,'' is enclosed in  sin-
gle quotes; this implies that the comma is to appear literally in
the input.  The colon and semicolon merely serve  as  punctuation
in  the  rule, and have no significance in controlling the input.
Thus, with proper definitions, the input

        July  4, 1776

might be matched by the above rule.

     An important part of the input process is carried out by the
lexical  analyzer.   This  user  routine  reads the input stream,
recognizing the lower level structures,  and  communicates  these
tokens to the parser.  For historical reasons, a structure recog-
nized by the lexical analyzer is called a terminal symbol,  while
the  structure  recognized  by the parser is called a nonterminal
symbol.  To avoid confusion, terminal  symbols  will  usually  be
referred to as tokens.

     There is considerable leeway in deciding whether  to  recog-
nize structures using the lexical analyzer or grammar rules.  For
example, the rules












Yacc: Yet Another Compiler-Compiler                      PS1:15-3



        month_name  :  'J' 'a' 'n'   ;
        month_name  :  'F' 'e' 'b'   ;

                 . . .

        month_name  :  'D' 'e' 'c'   ;

might be used in the above example.  The lexical  analyzer  would
only  need  to recognize individual letters, and month_name would
be a nonterminal symbol.  Such low-level rules tend to waste time
and  space,  and  may  complicate the specification beyond Yacc's
ability to deal with it.  Usually,  the  lexical  analyzer  would
recognize  the  month  names,  and  return  an  indication that a
month_name was seen; in this case, month_name would be a token.

     Literal characters such as ``,'' must also be passed through
the lexical analyzer, and are also considered tokens.

     Specification files are very flexible.  It is realively easy
to add to the above example the rule

        date  :  month '/' day '/' year   ;

allowing

        7 / 4 / 1776

as a synonym for

        July 4, 1776

In most cases, this new rule could be ``slipped in'' to a working
system  with  minimal  effort,  and  little  danger of disrupting
existing input.

     The input being read may not conform to the  specifications.
These input errors are detected as early as is theoretically pos-
sible with a left-to-right scan; thus, not only is the chance  of
reading  and computing with bad input data substantially reduced,
but the bad data can usually be quickly found.   Error  handling,
provided as part of the input specifications, permits the reentry
of bad data, or the continuation of the input process after skip-
ping over the bad data.

     In some cases, Yacc fails to produce a parser when  given  a
set  of  specifications.   For example, the specifications may be
self contradictory, or they may require a more powerful  recogni-
tion  mechanism  than  that  available to Yacc.  The former cases
represent design errors; the latter cases can often be  corrected
by  making  the  lexical  analyzer more powerful, or by rewriting
some of the grammar rules.  While Yacc cannot handle all possible
specifications,  its  power  compares favorably with similar sys-
tems; moreover, the constructions which are difficult for Yacc to









PS1:15-4                      Yacc: Yet Another Compiler-Compiler


handle  are also frequently difficult for human beings to handle.
Some users have reported that the discipline of formulating valid
Yacc specifications for their input revealed errors of conception
or design early in the program development.

     The theory underlying Yacc has been described  elsewhere.[2,
3, 4] Yacc has been extensively used in numerous practical appli-
cations, including lint,[5] the Portable  C  Compiler,[6]  and  a
system for typesetting mathematics.[7]

     The next several sections  describe  the  basic  process  of
preparing  a Yacc specification; Section 1 describes the prepara-
tion of grammar rules, Section 2 the preparation of the user sup-
plied  actions  associated  with  these  rules, and Section 3 the
preparation of lexical analyzers.  Section 4 describes the opera-
tion of the parser.  Section 5 discusses various reasons why Yacc
may be unable to produce a parser from a specification, and  what
to  do about it.  Section 6 describes a simple mechanism for han-
dling operator precedences in arithmetic expressions.  Section  7
discusses  error detection and recovery.  Section 8 discusses the
operating environment and special features of  the  parsers  Yacc
produces.   Section 9 gives some suggestions which should improve
the style and  efficiency  of  the  specifications.   Section  10
discusses some advanced topics, and Section 11 gives acknowledge-
ments.  Appendix A has a brief example, and Appendix  B  gives  a
summary  of  the  Yacc input syntax.  Appendix C gives an example
using some of the more advanced features of Yacc,  and,  finally,
Appendix  D  describes  mechanisms  and syntax no longer actively
supported, but provided for historical continuity with older ver-
sions of Yacc.

1: Basic Specifications

     Names refer to either tokens or nonterminal  symbols.   Yacc
requires  token  names  to be declared as such.  In addition, for
reasons discussed in Section 3, it is often desirable to  include
the lexical analyzer as part of the specification file; it may be
useful to include other programs as well.  Thus, every specifica-
tion file consists of three sections: the declarations, (grammar)
rules, and programs.  The sections are separated by  double  per-
cent  ``%%'' marks.  (The percent ``%'' is generally used in Yacc
specifications as an escape character.)

     In other words, a full specification file looks like

        declarations
        %%
        rules
        %%
        programs


     The declaration section may be empty.  Moreover, if the pro-
grams section is omitted, the second %% mark may be omitted also;









Yacc: Yet Another Compiler-Compiler                      PS1:15-5


thus, the smallest legal Yacc specification is

        %%
        rules


     Blanks, tabs, and newlines are ignored except that they  may
not  appear  in  names or multi-character reserved symbols.  Com-
ments may appear wherever a name is legal; they are  enclosed  in
/* . . . */, as in C and PL/I.

     The rules section is made up of one or more  grammar  rules.
A grammar rule has the form:

        A  :  BODY  ;

A represents a nonterminal name, and BODY represents  a  sequence
of  zero or more names and literals.  The colon and the semicolon
are Yacc punctuation.

     Names may be of arbitrary length, and  may  be  made  up  of
letters,  dot  ``.'',  underscore  ``_'', and non-initial digits.
Upper and lower case letters are distinct.  The names used in the
body  of  a grammar rule may represent tokens or nonterminal sym-
bols.

     A literal consists of a character enclosed in single  quotes
``'''.   As  in  C,  the  backslash  ``\'' is an escape character
within literals, and all the C escapes are recognized.  Thus

        '\n'    newline
        '\r'    return
        '\''    single quote ``'''
        '\\'    backslash ``\''
        '\t'    tab
        '\b'    backspace
        '\f'    form feed
        '\xxx'  ``xxx'' in octal

For a number of technical reasons, the NUL character ('\0' or  0)
should never be used in grammar rules.

     If there are several grammar rules with the same  left  hand
side,  the  vertical bar ``|'' can be used to avoid rewriting the
left hand side.  In addition, the semicolon at the end of a  rule
can be dropped before a vertical bar.  Thus the grammar rules

        A       :       B  C  D   ;
        A       :       E  F   ;
        A       :       G   ;

can be given to Yacc as











PS1:15-6                      Yacc: Yet Another Compiler-Compiler



        A       :       B  C  D
                |       E  F
                |       G
                ;

It is not necessary that all grammar rules  with  the  same  left
side  appear  together  in the grammar rules section, although it
makes the input much more readable, and easier to change.

     If a nonterminal symbol matches the empty string,  this  can
be indicated in the obvious way:

        empty :   ;


     Names representing tokens must be  declared;  this  is  most
simply done by writing

        %token   name1  name2 . . .

in the declarations section.  (See Sections 3 , 5, and 6 for much
more  discussion).   Every  name  not defined in the declarations
section is assumed to represent a nonterminal symbol.  Every non-
terminal  symbol  must  appear  on  the left side of at least one
rule.

     Of all the nonterminal symbols, one, called the  start  sym-
bol, has particular importance.  The parser is designed to recog-
nize the start symbol; thus, this symbol represents the  largest,
most  general  structure  described  by  the  grammar  rules.  By
default, the start symbol is taken to be the left  hand  side  of
the first grammar rule in the rules section.  It is possible, and
in fact desirable, to declare the start symbol explicitly in  the
declarations section using the %start keyword:

        %start   symbol


     The end of the input to the parser is signaled by a  special
token,  called  the  endmarker.   If  the  tokens  up to, but not
including, the endmarker form a structure which matches the start
symbol,  the parser function returns to its caller after the end-
12 3 4 5 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -