📄 grammar5.txt

📁 用于lex和yacc的c++及c语言的文法,可用来构造C++语言的编译器
💻 TXT
📖 第 1 页 / 共 5 页
字号:
parser doesn't mis-parse a function-like-cast  as  a  declaration  and 
induce  a  syntax  error".   The  first  is  a commitment to LR parser 
technology, and an existing grammar (which could be cleaned up).   The 
second  is  a commitment to NOT use an LR parser, and to the use of an 
existing implementation.

It is my belief that LR parsers are well understood, and the  addition 
of a "smart lexer" destroys all structure in a parser.  The result can 
be anticipated to become a quagmire of code and hacks.  With this firm 
conviction,  I  have  provided my grammar in the hopes that a standard 
can emerge that IS well defined, and is implementable, and is readable 
by humans.


SOURCE OF CONFLICTS IN C++ (MIXING TYPES AND EXPRESSIONS)

One fundamental strength in C is the similarity  between  declarations 
and  expressions.   The  syntax  of  the  two  is  intended to be very 
similar, and the result is a clean declaration and expression  syntax. 
(It  takes  some  getting  used  to,  but  it  is in my opinion good).  
Unfortunately, there are some slight distinctions  between  types  and 
expressions,  which  Ritchie  et.  al.  apparently noticed.  It is for 
this reason (I am guessing) that the C cast operator REQUIRES that the 
type be enclosed in parenthesis.  Moreover,  there  is  also  a  clear 
separator in a declaration between the declarator and the initializing 
expression  (the  '=') (as some of you know, there is some interesting 
history  in  this  area.).   The  bottom  line  (as  seen  with  20-20 
hindsight)  is:  "keep  declarations  and expressions separate".  Each 
violation of this basic rule has induced conflicts.


To be concrete about the differences between  types  and  expressions, 
the following two distinctions are apparent:

    1)  Abstract declarators are permitted.  No analogy is provided in 
    expressions.  The notable distinction is that abstract declarators 
    include the possibility of trailing '*' tokens.
        
    2) The binding of elements in a declaration is very different from 
    any expression.  Notably,  the  declaration-specifiers  are  bound 
    separately  to  each  declarator  in  the  comma separated list of 
    declarators (example:  int  a,  b,  c;).   With  (most  forms  of) 
    expressions,   a   comma   provides   a  major  isolation  between 
    expressions.

C also used  reserved  names  to  GREATLY  reduce  the  complexity  of 
parsing.   The  introduction of typedef names increased the complexity 
(it made the language context sensitive), but a  simple  hack  between 
lex and YACC overcame the problem.  An example is the statement:

        name (*b)[4];

Note  that  this  is  ambiguous,  EVEN  in  ANSI  C,  IF  there  is no 
distinction between type-names and function names!  (i.e.,  "b"  could 
be getting redeclared to be of type "pointer to array of name", OR the 
function  "name"  could  be  called  with argument "*b", the result of 
which is indexed to the 4th element). In C, the  two  kinds  of  names 
(TYPEDEFnames and function names (a.k.a.: declared identifiers)) share 
a  name  space,  and  at  every  point  in a source program the (hack) 
contextual distinction can be made by the tokenizer.  Hacks  are  neat 
things in that the less you use them, the more likely they are to work 
when you REALLY need them (i.e., you don't have to fight with existing 
hacks).   Having  worked on designing and implementing a C compiler, I 
was pleasantly amazed at how the constructs all fell together.

The major violations of this approach (i.e., keep declaration separate 
from expressions) that come to mind with C++ are: 

        function-like-casts,
        freestore expressions without parens around the type,
        conversion function names using arbitrary type specifiers,
        parenthesized initializers that drive constructors.

The last problem, parenthesized initializers, provides  two  areas  of 
conflicts.  The first is really an interference issue with old style C 
function  definitions,  which  only bothers folks at file scope (GNU's 
G++ compiler considered this to be too great  an  obstacle,  and  they 
don't currently support old style C definitions!).  The second part of 
this   conflict   involves   a  more  subtle  separation  between  the 
declarator, and the initializer. (K&R  eventually  provided  an  equal 
sign  as  an unequivocal separator, but parens used in C++ are already 
TOO overloaded to separate anything).  The significance of  this  lack 
of  a  clear  separator  is  that  it  is difficult to decide that the 
"declarator" is complete, and that the declared name should  be  added 
to  the scope.  The last problem does interact in a nasty way with the 
function-like cast vs declaration conflicts  (the  problem  slows  the 
feedback  loop  to  the  symbol  table, which is critical to continued 
lexing).  The parened initializers also provide another context  where 
it  is  difficult  to distinguish between expressions (a true argument 
list for the constructor) and a declaration continuation (a  parameter 
type list).  

The  second  problem  listed falls out of the "new-expression" with an 
unparenthesized type.  This form of freestore (such as "new int *  *") 
allows  types  to  be placed adjacent to expressions, and the trailing 
'*' ambiguity rears its head.  I can easily prove  that  this  is  the 
culprit  in  terms  of  specific  ambiguities,  in that removing these 
(unnecessary?) forms significantly disambiguates the grammar.  (It  is 
rather  nice  to  use  YACC  as  a  tool  to  prove  that a grammar is 
unambiguous!).  It is interesting to note that if only the  derivation 
of  a  freestore  expression  were  limited to (using the non-terminal 
names of the form that the C++ Reference manual uses):

        new placement-opt ( type-name )  parened-arg-list-opt

then all the LR(1)  reduce  conflicts  based  on  this  problem  would 
vanish.  Indeed, the culprit can clearly be shown to be:

        new placement-opt  restricted-type-name parened-arg-list-opt

The  characters  which  excite  these reduction conflicts are '*', and 
'&'.

The    third    problem    that    I    indicated     involves     the 
conversion-function-name.   Here  again, if the syntax were restricted 
to ONLY:

        operator simple-type-name

then the LR(1) conflicts would vanish.  It is interesting to note that 
the  keyword  "operator"  serves  as  the  left  separator,  and   the 
restriction   to  "simple-type-name"  results  in  an  implicit  right 
separator  (simple-type-names  are  exactly  one  token  long).    The 
conflicts  appear  when  multiple tokens are allowed for a declaration 
specifier, and an optional pointer-modifier list may  be  added  as  a 
postfix.   The  conflicts  that  result  from  this lack of separation 
include all those provided by the freestore example.

Here again (as with the unambiguous version of freestore)  the  syntax 
could be extended to:

operator_function_name :
        OPERATOR any_operator
        | OPERATOR basic_type_name
        | OPERATOR TYPEDEFname
        | OPERATOR type_qualifier
        | OPERATOR '(' type_name ')'
        ;

instead of:

operator_function_name :
        OPERATOR any_operator
        | OPERATOR type_qualifier_list            operator_function_ptr_opt
        | OPERATOR non_elaborating_type_specifier operator_function_ptr_opt
        ;

and  the  ambiguities  would vanish (and the expressivity would not be 
diminished).


FUNCTION LIKE CAST vs DECLARATION AMBIGUITIES

The real big culprit (i.e., my anti-favorite) in this whole  ambiguity 
set   (re:   keeping   types   and   expressions   separate)   is  the 
function-like-cast.  The reason why it is so  significant  (to  an  LR 
parser)   is  that  the  binding  of  a  type-name,  when  used  in  a 
function-like-cast,  is  directly  to  the   following   parenthesized 
argument list.  In contrast, the binding of a type-name when used in a 
declaration  is to all the "stuff" that follows, up until a declarator 
termination mark like a ',', ';' or '='. Life really gets tough for LR 
folks when the parse stack MUST be reduced, but the parser just  can't 
tell  how  yet.  With this problem, the hacks began to appear (re: the 
"smart lexer").  Note that these new style casts are much more than  a 
notational  convenience  in  C++.   The necessity of the function like 
cast lies in the fact that such a cast  can  take  several  arguments, 
whereas the old style cast is ALWAYS a unary operator.

I was (past tense) actually working towards resolving this problem via 
some  standard  methods that I have developed (re: inline expansion of 
rules to provide deferred reduction).  I was (past tense)  also  using 
one  more  sneaky  piece  of  work  to  defer the reductions, as I was 
carefully making use of right recursion (instead of the standard  left 
recursion)  in  order  to  give  the  parser a chance to build up more 
context.  I can demonstrate the usefulness of right recursive grammars 
in disambiguating difficult toy grammars.  Unfortunately,  I  realized 
at  some  point  that I NEEDED to perform certain actions (such as add 
identifiers to the symbol table) in order  to  complete  the  parse!?!  
This  was  my catch 22.  I could POSSIBLY parse using an LALR grammar, 
if  I  could  only  defer  actions  until  I  had  enough  context  to 
disambiguate.   Unfortunately,  I  HAD to perform certain actions (re: 
modify the symbol table, which changed the action  of  the  tokenizer) 
BEFORE  I  could  continue to examine tokens!  In some terrible sense, 
the typedef hack had come back to haunt me.

I backed off a bit in my grammar after reaching this wall, and now  my 
grammar  only  waits  until  it reaches the identifier in the would be 
declarator.  I really  didn't  want  to  parse  the  stuff  after  the 
identifier  name  (sour  grapes?),  because  I  knew  I would not (for 
example) be able to identify  a  "constant  expression"  in  an  array 
subscript  (most of the time, if it isn't constant, then it can't be a 
declaration).  I don't believe that a compiler  should  compete  in  a 
battle  of  wits  with  the  programmer,  and  the  parser was already 
beginning to outwit me (i.e., I was having a hard time  parsing  stuff 
mentally that is provided as examples in the 2.0 Reference Manual.  In 
fact, many reviewers as well as the author had difficulty, and that is 
why errors appeared there originally, as well as in the ARM).


FUNCTION LIKE CAST vs DECLARATION : THE TOUGH EXAMPLE:        
        
The  following  is about the nastiest example that I have been able to 
construct for this ambiguity group.  I am presenting it here  just  in 
case someone is left with a thought that there is "an easy way out".

The   fact  that  identifiers  can  remain  ambiguous  SO  LONG  after 
encountering them can cause no end of  trouble  to  the  parser.   The 
following  example  does  not  succumb to static (re: no change to the 
symbol table)  anticipatory  lexing  of  a  statement.   As  such,  it 
demonstrates  the  futility  of  attempting  to use a "smart lexer" to 
support the philosophy: "If it can be interpreted  as  a  declaration, 
then so be it; otherwise it is an expression".  This ambiguous example 
exploits  the  fact that declarators MUST be added to the symbol table 
as  soon  as  they  are  complete  (and  hence  they   mask   external 
declarations).

First I will present the example without comments:

class Type1 {
            Type1 operator()(int);
            } ;
class wasType2 { ...}; 
int (*c2)(Type1 dummy);

main ()
    {
    const int    a1  = 0, new_var  (4), (*c1)(int     (a1));
    Type1       (a2) = 3, wasType2 (4), (*c2)(wasType2(a1));
    }




Now to repeat the example with comments:

class Type1 {....
            Type1 operator()(int);
            } ;
class wasType2 { ....}; /* we will almost redeclare this typename!?! */
int (*c2)(Type1 dummy); /* we will NOT redeclare Type1 */

main ()
    {
    /* The first line is indeed simple.  It is simply placed here
    to hint at how the second line MIGHT analogously be parsed. */
    
    const int    a1  = 0, new_var  (4), (*c1)(int     (a1));

    /*  As  a review, "a1" is declared to be a constant with value 0. 
        "new_var" is declared to be another constant, but with  value 
        4.  Finally,  "c1"  is  declared  to  be a pointer to a const 
        integer, and the initial value of this pointer is  "int(a1)", 
        which  is  the  same  as "int(0)", or simply "0" (a.k.a., the 
        null pointer). It is significant that "a1" entered the symbol 
        table  quickly  so  that  it  could  be  used  later  in  the 
        declaration. */

    /* Static lexing of what follows will suggest that the  following 
    is  also  a  declaration.   This  statement  is  actually 3 comma 
    separated expressions!! The explanation that follows  shows  that 
    assuming   the   2nd  statement  is  a  declaration  leads  to  a 
    contradiction. */
    
    Type1      (a2) = 3, wasType2 (4), (*c2)(wasType2(a1));
    
    /* Assume this second statement is a declaration.  Note  that  by 
    the  time  "c2" is parsed, "wasType2" has been redeclared to be a 
    variable of type "Type1".  Hence  "wasType2(a1)"  is  actually  a 
    function  call  to  "wasType2.operator()(a1)",  and  it  is not a 
    function    prototype    arg    list.     It     follows     that 
    "(*c2)(wasType2(a1))"  is  an  expression,  and NOT a declarator!  
    Since this last entry is not a declarator, the  entire  statement 
    must  be an expression (ugh! it is time to backtrack). After much 
    work on my part, I think it might even be  a  semantically  valid 
    expression.   Once this backtracking is complete, we see that the 
    first expression "Type1 (a2) = 3" is  an  assignment  to  a  cast 
    expression.  The second expression "wasType2 (4)", is a cast of a 
    constant.  The  third  expression  "(*c2)(wasType2(a1))",  is  an 
    indirect  function  call.  The argument of the call is the result 
    of a cast.  Note that "wasType2" is  actually  never  redeclared, 
    but it was close! */
    
    /*  For  those of you who can parse this stuff in your sleep, and 
    noticed the slight error  in  the  above  argument,  I  have  the 
    following "fix".  The error is that the 

        "(*c2)(wasType2(a1))" 

    could actually be a declaration with a parenthesized initializer.  
    I could have changed this token subsequence to:
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -