📄 grammar5.txt

📁 用于lex和yacc的c++及c语言的文法,可用来构造C++语言的编译器
💻 TXT
📖 第 1 页 / 共 5 页
字号:
	state: 738

3 RR caused by function-like cast vs identifier declaration ambiguity
	state: 739

3 RR caused by destructor declaration vs destructor call 
	state: 740

3 RR caused by parened initializer vs  prototype/typename
	state: 758

5 SR caused by redundant parened TYPEDEFname redeclaration vs old style cast
	states: 395, 621, 1038, 1102, 1103

In  each  case  the  conflicts are resolved in favor of a declaration.  
Note  that  this  decision  is  made  when  the  "declarator"  *seems* 
complete.  Typically this takes place when a character (such as a ')', 
',', or '=') that terminates a declarator or abstractor declarator  is 
seen.  At  such  a  point,  an  my LR(1) grammar makes the decision to 
support a declaration/type-name assuming it is possible.   (The  added 
complexity of the LALR-only conflicts discussed in the next section). 

Note  that  this disambiguation (provided by my grammar) is *NOT* what 
is suggested in the ARM.   The  ARM  suggests  what  some  folks  have 
referred  to  as "advanced parsing techniques" be used to disambiguate 
based on first "pre-parsing" arbitrarily far ahead.  This approach  is 
actually  not  "advanced  parsing", rather it is a through back to the 
days before parsing was understood, and exponential search  algorithms 
were  proposed  as  "parsing techniques." Currently, cfront supports a 
hacked partial look-ahead technique, that alternately  "disambiguates" 
hard  examples  and  "core-dumps"  for  others.  To date, I know of no 
implementation  that  actually   comes   near   matching   the   ARM's 
specification,   and   I  suspect  that  attempts  to  implement  such 
conformance will lead to many buggy compilers,  that  support  varying 
degrees  of  compliance  with  such ad-hoc requests for parsing.  Only 
time will tell whether  the  ANSI  C++  committee  will  support  this 
unimplemented  disambiguation technique.  Most users of cfront *think* 
that the ARM is implemented in cfront, and that cfront implements  the 
ARM.   This  area  of ambiguity not only demonstrates cfront's lack of 
compliance  with  the  ARM,  but  also   exemplifies   some   of   the 
unimplemented technology that is proposed in the ARM.


LALR-only CONFLICTS IN THE GRAMMAR

As  can  be seen in the analysis of the grammar in y.output, there are 
some  LALR-only  conflict  contexts  in  some  of  the   reduce/reduce 
conflicts.   Since yacc disambiguates in favor of lower numbered rules 
during  R/R  conflict,  most  of   these   LALR-only   conflicts   are 
insignificant  (see  autodoc5.txt  for  discussion  of  this concept). 
Specifically, when the left context *requires* unambiguously that that 
a certain reduction take place, and the parser was going  to  do  that 
reduction   anyway   (by  default),  then  the  LALR-only  problem  is 
insignificant.  Based on the above reasoning, the LALR-only  conflicts 
in the following states are insignificant:

	states: 738(mostly), 739, 740, 1105

The  states where there are real LALR-only conflicts with significance 
are:
	
	state 738 on ',' and '='

	state 758 on ',' and '='

The significant subsets of the unambiguous left context trees  (copied 
from the y.output file) are:


LALR-only conflict contexts leading to state 738 and lookahead symbol <','>
	Possible reductions rules include (61,73)
		type_qualifier_list_opt : (61)
		postfix_expression : TYPEDEFname '(' ')' (73)

--738+-1014--952+-837(73)    $start IDENTIFIER '{' ! CLCL TYPEDEFname '(' TYPEDEFname '(' ')'
     |          --835(73)    $start IDENTIFIER '{' TYPEDEFname ! '(' TYPEDEFname '(' ')'
     |
     --748(73)    $start IDENTIFIER '(' '(' TYPEDEFname ! '(' ')'


LALR-only conflict contexts leading to state 738 and lookahead symbol <'='>
	Possible reductions rules include (61,73)
		type_qualifier_list_opt : (61)
		postfix_expression : TYPEDEFname '(' ')' (73)

--738+-1014--952(73)    $start IDENTIFIER '{' TYPEDEFname '(' ! TYPEDEFname '(' ')'
     |
     --748(73)    $start IDENTIFIER '(' '(' TYPEDEFname ! '(' ')'


LALR-only conflict contexts leading to state 758 and lookahead symbol <','>
	Possible reductions rules include (61,74)
		type_qualifier_list_opt : (61)
		postfix_expression : global_or_scoped_typedefname '(' ')' (74)

--758--754(74)    $start IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')'


LALR-only conflict contexts leading to state 758 and lookahead symbol <'='>
	Possible reductions rules include (61,74)
		type_qualifier_list_opt : (61)
		postfix_expression : global_or_scoped_typedefname '(' ')' (74)

--758--754(74)    $start IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')'


As explained in the description of the "sample sentence" that is given 
for  each  context  (see  autodoc5.txt),  it  is not unique, and it is 
*especially* non-unique to the left of  the  '!'  symbol.   Never  the 
less, it provides (most of the time) rapid insight into the conflicts, 
which  makes  it  unnecessary to look at the details of the associated 
demonstrations (also provided in the y.output file).

Looking at these examples, it  can  be  seen  that  in  each  case,  a 
parenthesized  type  name  is  created,  and  a trailing ',' or '=' is 
found.  Since the type-name used in an old style cast  cannot  contain 
such  symbols  as  ',' or '=' (unprotected by parens), the presence of 
such a token  precludes  the  possibility  of  the  sequence  being  a 
typename.  For example, looking at the left context for state 738 that 
includes state 952, consider the fragment:

	struct  TYPE1 { ...... } ;
	struct  TYPE2 { ...... } ;
	main(){
		TYPE1 ( ( TYPE2()


There  is  a chance that TYPE2 is about to be redeclared at this inner 
scope.  In such a case the fragment might continue:


	struct  TYPE1 { ...... } ;
	struct  TYPE2 { ...... } ;
	main(){
		TYPE1 ( ( TYPE2()));
		}

with lots of redundant parents around the redeclared "TYPE2".   As  an 
alternative, the fragment might continue:

	struct  TYPE1 { ...... } ;
	struct  TYPE2 { ...... } ;
	main(){
		TYPE1 ( ( TYPE2() + 10));

and  become an expression.  The problematic situation is provided when 
(for example) a ',' follows:

	struct  TYPE1 { ...... } ;
	struct  TYPE2 { ...... } ;
	main(){
		TYPE1 ( ( TYPE2() , 

Clearly, as provided in the analysis, this must be a case  where  only 
an  expression  can  result,  as a ',' could not be placed as shown if 
TYPE2 were being redeclared.

Careful consideration of each of the other LALR-only conflicts reveals 
an identical fundamental problem.

As is the basis for all  LALR-only  contexts,  there  are  some  other 
contexts  where this same LALR states 738 is used with a trailing ',', 
and it is possible for the item to  the  left  of  the  ','  to  be  a 
declarator  (specifically,  when the declarator is the first type in a 
prototype list, a comma may follow it).  Interestingly,  this  precise 
problem  is also the reason for the '=' look-ahead problem, as '=' may 
be used to specify an initializer in a prototype list, but  can  never 
be  found  in  an abstract declarator (as is required for an old style 
cast typename), or  adjacent  to  a  simple  declaration's  declarator 
*while* nested within parenthesis.

With  this  insight, it is not difficult to change the grammar so that 
the critical rules are repeated for these two distinct contexts.  As I 
mentioned in my autodoc5.txt file,  I  am  working  on  a  subtle  and 
automated  method  for  removing  these  LALR-only conflicts.  I would 
prefer (for now) to leave these conflicts in place and remove them via 
subtle  means,  rather  than  by  brute   replications   of   critical 
reductions.   Since  the nesting complexities of these scenarios seems 
large (re: function like cast of function like cast  etc.),  I  expect 
that  leaving  these  conflicts  in  the  grammar will have no adverse 
effects on users with real code.   Recall  that  the  worst  that  can 
happen  with  such  conflicts  remaining  is that an expression can be 
signaled as a syntax error, when in fact it is "unambiguous", based on 
one token of look-ahead.


SAMPLE RESOLUTIONS OF AMBIGUITIES BY MY C++ GRAMMAR

Since my grammar tends to disambiguate "prematurely" (i.e., the parser 
does not use infinite lookahead) when compared with the ARM  standard, 
it tends to force non-declarations into being declarations.  Later on, 
when   the   tokens  appear  that  require  an  interpretation  as  an 
expression, a syntax error results.  Hence, the  "hard  examples"  are 
those   in   which  the  disambiguating  tokens  appear  late  in  the 
expressions.

Of the "hard examples" given in the ARM (r6.8), my  grammar  can  only 
"properly" detect a "statement-expression" for the stream:

        T(a,5)>>c;

All  the  other  examples  default  to  a declarator after the closing 
parenthesis  following  the  identifier.   (See  my  comments  in  the 
conclusion section of this paper).

I  actually  am  not sure I agree with all the examples in the C++ 2.0 
Reference Manual. Specifically, the example in section 6.8:

        T (*d) (double(3));  // expression statement

In the example "T"  is  specified  to  be  a  simple-type-name,  which 
includes  all  the  basic  types  as  well  as  class-names, and more. 
Considering the following are valid declarations:

        void *a (0);
        void *b (int(0));
        void (*c)(int(0));

I am unable to see the "syntactic" difference between this last  token 
stream  and  the  example  just  cited  in  the  reference manual.  My 
simplistic parser gives me the result that  I  at  least  expect.   It 
concludes  (prematurely, but seemingly correctly) that the stream is a 
declaration (with a new style initializer).

As a positive note, my grammar is able to parse the  example  given  a 
while back in comp.lang.c++, that Zortech C++ 1.07 could not parse:

        a = (double(a)/double(b))...;

Apparently,   upon   seeing   "(double"   some  parsers  commit  to  a 
parenthesized type-name for a cast expression, and cannot  proceed  to 
parse  a  parenthesized  expression.   No  mention  of this problem is 
listed in my conflict list, as resolution of this problem is simply  a 
matter of letting the LR parser wait long enough before committing. My 
grammar commits when it sees "a" in "(a)" to providing an expression.


DIFFICULT AMBIGUITIES FOR A "C++ 2.0" COMPATIBLE PARSER TO TRY

Having seen the above contexts, I would be curious to see if other C++ 
front  ends  with  "smart  lexers"  (such  as  cfront)  can handle the 
following.   These  examples  are  not  guaranteed  to  be   evaluated 
correctly  by  my grammar, but I expect them to demonstrate weaknesses 
in many other parsers.  The interpretation of these examples  per  C++ 
2.0 definitions requires massive lookahead.  In addition, the examples 
are  generally unreadable by humans, and rarely parsed the same way by 
any two implementations.

    main()
      {
      class T 
        { 
        /* ... */
        } a;
      typedef T T1,T2,T3,T4,T5,T7,T8,T9,Ta,Tb,Tc,Td;
        { /* start inner scope */
        T((T1)         ); // declaration
        T((T2)    a    ); // Statement expression
        T((T3)(       )); // declaration of T3
        T((T4)(T      )); // declaration of T4
        T((T5)(T  a   )); // declaration of T5
        T((T6)(T((a) ))); // declaration of T6
        T((T7)(T((T) ))); // declaration of T7
        T((T8)(T((T)b))); // statement expression

        T(b[5]); // declaration
        T(c());  // declaration
        T(d()[5]);  // statement expression ? (function returning array 
                      // is semantically illegal, but syntactically proper)
        T(e[5]());  // statement expression ? (No array of functions)
        T(f)[5]();  // statement expression ?  "                   "
        
        T(*Ta() )[5] [4];  //declaration
        T(*Tb() [5]) [4];  //statement expression ? (function returning array)

        T(*Tc()) [3 +2];  //declaration
        T(*Td()) [3 ]+2;  //statement expression

        }
      }        


COMMENTARY ON C++ 2.0 DISAMBIGUATING RULES

There are two distinct thrusts in conflict disambiguation as  provided 
by  AT&T's  efforts to define a standard for C++.  The first thrust is 
"parse tokens into the longest possible declarator, and  identify  the 
syntax  errors  that  result".   The  second thrust is to "use massive 
external technology ("smart lexer", a.k.a.: "recursive  decent  parser 
that  helps  the  lexer",  a.k.a.   LALEX)  to look ahead, so that the
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -