📄 grammar5.txt
字号:
state: 738
3 RR caused by function-like cast vs identifier declaration ambiguity
state: 739
3 RR caused by destructor declaration vs destructor call
state: 740
3 RR caused by parened initializer vs prototype/typename
state: 758
5 SR caused by redundant parened TYPEDEFname redeclaration vs old style cast
states: 395, 621, 1038, 1102, 1103
In each case the conflicts are resolved in favor of a declaration.
Note that this decision is made when the "declarator" *seems*
complete. Typically this takes place when a character (such as a ')',
',', or '=') that terminates a declarator or abstractor declarator is
seen. At such a point, an my LR(1) grammar makes the decision to
support a declaration/type-name assuming it is possible. (The added
complexity of the LALR-only conflicts discussed in the next section).
Note that this disambiguation (provided by my grammar) is *NOT* what
is suggested in the ARM. The ARM suggests what some folks have
referred to as "advanced parsing techniques" be used to disambiguate
based on first "pre-parsing" arbitrarily far ahead. This approach is
actually not "advanced parsing", rather it is a through back to the
days before parsing was understood, and exponential search algorithms
were proposed as "parsing techniques." Currently, cfront supports a
hacked partial look-ahead technique, that alternately "disambiguates"
hard examples and "core-dumps" for others. To date, I know of no
implementation that actually comes near matching the ARM's
specification, and I suspect that attempts to implement such
conformance will lead to many buggy compilers, that support varying
degrees of compliance with such ad-hoc requests for parsing. Only
time will tell whether the ANSI C++ committee will support this
unimplemented disambiguation technique. Most users of cfront *think*
that the ARM is implemented in cfront, and that cfront implements the
ARM. This area of ambiguity not only demonstrates cfront's lack of
compliance with the ARM, but also exemplifies some of the
unimplemented technology that is proposed in the ARM.
LALR-only CONFLICTS IN THE GRAMMAR
As can be seen in the analysis of the grammar in y.output, there are
some LALR-only conflict contexts in some of the reduce/reduce
conflicts. Since yacc disambiguates in favor of lower numbered rules
during R/R conflict, most of these LALR-only conflicts are
insignificant (see autodoc5.txt for discussion of this concept).
Specifically, when the left context *requires* unambiguously that that
a certain reduction take place, and the parser was going to do that
reduction anyway (by default), then the LALR-only problem is
insignificant. Based on the above reasoning, the LALR-only conflicts
in the following states are insignificant:
states: 738(mostly), 739, 740, 1105
The states where there are real LALR-only conflicts with significance
are:
state 738 on ',' and '='
state 758 on ',' and '='
The significant subsets of the unambiguous left context trees (copied
from the y.output file) are:
LALR-only conflict contexts leading to state 738 and lookahead symbol <','>
Possible reductions rules include (61,73)
type_qualifier_list_opt : (61)
postfix_expression : TYPEDEFname '(' ')' (73)
--738+-1014--952+-837(73) $start IDENTIFIER '{' ! CLCL TYPEDEFname '(' TYPEDEFname '(' ')'
| --835(73) $start IDENTIFIER '{' TYPEDEFname ! '(' TYPEDEFname '(' ')'
|
--748(73) $start IDENTIFIER '(' '(' TYPEDEFname ! '(' ')'
LALR-only conflict contexts leading to state 738 and lookahead symbol <'='>
Possible reductions rules include (61,73)
type_qualifier_list_opt : (61)
postfix_expression : TYPEDEFname '(' ')' (73)
--738+-1014--952(73) $start IDENTIFIER '{' TYPEDEFname '(' ! TYPEDEFname '(' ')'
|
--748(73) $start IDENTIFIER '(' '(' TYPEDEFname ! '(' ')'
LALR-only conflict contexts leading to state 758 and lookahead symbol <','>
Possible reductions rules include (61,74)
type_qualifier_list_opt : (61)
postfix_expression : global_or_scoped_typedefname '(' ')' (74)
--758--754(74) $start IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')'
LALR-only conflict contexts leading to state 758 and lookahead symbol <'='>
Possible reductions rules include (61,74)
type_qualifier_list_opt : (61)
postfix_expression : global_or_scoped_typedefname '(' ')' (74)
--758--754(74) $start IDENTIFIER '(' '(' ! CLCL TYPEDEFname '(' ')'
As explained in the description of the "sample sentence" that is given
for each context (see autodoc5.txt), it is not unique, and it is
*especially* non-unique to the left of the '!' symbol. Never the
less, it provides (most of the time) rapid insight into the conflicts,
which makes it unnecessary to look at the details of the associated
demonstrations (also provided in the y.output file).
Looking at these examples, it can be seen that in each case, a
parenthesized type name is created, and a trailing ',' or '=' is
found. Since the type-name used in an old style cast cannot contain
such symbols as ',' or '=' (unprotected by parens), the presence of
such a token precludes the possibility of the sequence being a
typename. For example, looking at the left context for state 738 that
includes state 952, consider the fragment:
struct TYPE1 { ...... } ;
struct TYPE2 { ...... } ;
main(){
TYPE1 ( ( TYPE2()
There is a chance that TYPE2 is about to be redeclared at this inner
scope. In such a case the fragment might continue:
struct TYPE1 { ...... } ;
struct TYPE2 { ...... } ;
main(){
TYPE1 ( ( TYPE2()));
}
with lots of redundant parents around the redeclared "TYPE2". As an
alternative, the fragment might continue:
struct TYPE1 { ...... } ;
struct TYPE2 { ...... } ;
main(){
TYPE1 ( ( TYPE2() + 10));
and become an expression. The problematic situation is provided when
(for example) a ',' follows:
struct TYPE1 { ...... } ;
struct TYPE2 { ...... } ;
main(){
TYPE1 ( ( TYPE2() ,
Clearly, as provided in the analysis, this must be a case where only
an expression can result, as a ',' could not be placed as shown if
TYPE2 were being redeclared.
Careful consideration of each of the other LALR-only conflicts reveals
an identical fundamental problem.
As is the basis for all LALR-only contexts, there are some other
contexts where this same LALR states 738 is used with a trailing ',',
and it is possible for the item to the left of the ',' to be a
declarator (specifically, when the declarator is the first type in a
prototype list, a comma may follow it). Interestingly, this precise
problem is also the reason for the '=' look-ahead problem, as '=' may
be used to specify an initializer in a prototype list, but can never
be found in an abstract declarator (as is required for an old style
cast typename), or adjacent to a simple declaration's declarator
*while* nested within parenthesis.
With this insight, it is not difficult to change the grammar so that
the critical rules are repeated for these two distinct contexts. As I
mentioned in my autodoc5.txt file, I am working on a subtle and
automated method for removing these LALR-only conflicts. I would
prefer (for now) to leave these conflicts in place and remove them via
subtle means, rather than by brute replications of critical
reductions. Since the nesting complexities of these scenarios seems
large (re: function like cast of function like cast etc.), I expect
that leaving these conflicts in the grammar will have no adverse
effects on users with real code. Recall that the worst that can
happen with such conflicts remaining is that an expression can be
signaled as a syntax error, when in fact it is "unambiguous", based on
one token of look-ahead.
SAMPLE RESOLUTIONS OF AMBIGUITIES BY MY C++ GRAMMAR
Since my grammar tends to disambiguate "prematurely" (i.e., the parser
does not use infinite lookahead) when compared with the ARM standard,
it tends to force non-declarations into being declarations. Later on,
when the tokens appear that require an interpretation as an
expression, a syntax error results. Hence, the "hard examples" are
those in which the disambiguating tokens appear late in the
expressions.
Of the "hard examples" given in the ARM (r6.8), my grammar can only
"properly" detect a "statement-expression" for the stream:
T(a,5)>>c;
All the other examples default to a declarator after the closing
parenthesis following the identifier. (See my comments in the
conclusion section of this paper).
I actually am not sure I agree with all the examples in the C++ 2.0
Reference Manual. Specifically, the example in section 6.8:
T (*d) (double(3)); // expression statement
In the example "T" is specified to be a simple-type-name, which
includes all the basic types as well as class-names, and more.
Considering the following are valid declarations:
void *a (0);
void *b (int(0));
void (*c)(int(0));
I am unable to see the "syntactic" difference between this last token
stream and the example just cited in the reference manual. My
simplistic parser gives me the result that I at least expect. It
concludes (prematurely, but seemingly correctly) that the stream is a
declaration (with a new style initializer).
As a positive note, my grammar is able to parse the example given a
while back in comp.lang.c++, that Zortech C++ 1.07 could not parse:
a = (double(a)/double(b))...;
Apparently, upon seeing "(double" some parsers commit to a
parenthesized type-name for a cast expression, and cannot proceed to
parse a parenthesized expression. No mention of this problem is
listed in my conflict list, as resolution of this problem is simply a
matter of letting the LR parser wait long enough before committing. My
grammar commits when it sees "a" in "(a)" to providing an expression.
DIFFICULT AMBIGUITIES FOR A "C++ 2.0" COMPATIBLE PARSER TO TRY
Having seen the above contexts, I would be curious to see if other C++
front ends with "smart lexers" (such as cfront) can handle the
following. These examples are not guaranteed to be evaluated
correctly by my grammar, but I expect them to demonstrate weaknesses
in many other parsers. The interpretation of these examples per C++
2.0 definitions requires massive lookahead. In addition, the examples
are generally unreadable by humans, and rarely parsed the same way by
any two implementations.
main()
{
class T
{
/* ... */
} a;
typedef T T1,T2,T3,T4,T5,T7,T8,T9,Ta,Tb,Tc,Td;
{ /* start inner scope */
T((T1) ); // declaration
T((T2) a ); // Statement expression
T((T3)( )); // declaration of T3
T((T4)(T )); // declaration of T4
T((T5)(T a )); // declaration of T5
T((T6)(T((a) ))); // declaration of T6
T((T7)(T((T) ))); // declaration of T7
T((T8)(T((T)b))); // statement expression
T(b[5]); // declaration
T(c()); // declaration
T(d()[5]); // statement expression ? (function returning array
// is semantically illegal, but syntactically proper)
T(e[5]()); // statement expression ? (No array of functions)
T(f)[5](); // statement expression ? " "
T(*Ta() )[5] [4]; //declaration
T(*Tb() [5]) [4]; //statement expression ? (function returning array)
T(*Tc()) [3 +2]; //declaration
T(*Td()) [3 ]+2; //statement expression
}
}
COMMENTARY ON C++ 2.0 DISAMBIGUATING RULES
There are two distinct thrusts in conflict disambiguation as provided
by AT&T's efforts to define a standard for C++. The first thrust is
"parse tokens into the longest possible declarator, and identify the
syntax errors that result". The second thrust is to "use massive
external technology ("smart lexer", a.k.a.: "recursive decent parser
that helps the lexer", a.k.a. LALEX) to look ahead, so that the
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -