📄 grammar5.txt
字号:
parser doesn't mis-parse a function-like-cast as a declaration and
induce a syntax error". The first is a commitment to LR parser
technology, and an existing grammar (which could be cleaned up). The
second is a commitment to NOT use an LR parser, and to the use of an
existing implementation.
It is my belief that LR parsers are well understood, and the addition
of a "smart lexer" destroys all structure in a parser. The result can
be anticipated to become a quagmire of code and hacks. With this firm
conviction, I have provided my grammar in the hopes that a standard
can emerge that IS well defined, and is implementable, and is readable
by humans.
SOURCE OF CONFLICTS IN C++ (MIXING TYPES AND EXPRESSIONS)
One fundamental strength in C is the similarity between declarations
and expressions. The syntax of the two is intended to be very
similar, and the result is a clean declaration and expression syntax.
(It takes some getting used to, but it is in my opinion good).
Unfortunately, there are some slight distinctions between types and
expressions, which Ritchie et. al. apparently noticed. It is for
this reason (I am guessing) that the C cast operator REQUIRES that the
type be enclosed in parenthesis. Moreover, there is also a clear
separator in a declaration between the declarator and the initializing
expression (the '=') (as some of you know, there is some interesting
history in this area.). The bottom line (as seen with 20-20
hindsight) is: "keep declarations and expressions separate". Each
violation of this basic rule has induced conflicts.
To be concrete about the differences between types and expressions,
the following two distinctions are apparent:
1) Abstract declarators are permitted. No analogy is provided in
expressions. The notable distinction is that abstract declarators
include the possibility of trailing '*' tokens.
2) The binding of elements in a declaration is very different from
any expression. Notably, the declaration-specifiers are bound
separately to each declarator in the comma separated list of
declarators (example: int a, b, c;). With (most forms of)
expressions, a comma provides a major isolation between
expressions.
C also used reserved names to GREATLY reduce the complexity of
parsing. The introduction of typedef names increased the complexity
(it made the language context sensitive), but a simple hack between
lex and YACC overcame the problem. An example is the statement:
name (*b)[4];
Note that this is ambiguous, EVEN in ANSI C, IF there is no
distinction between type-names and function names! (i.e., "b" could
be getting redeclared to be of type "pointer to array of name", OR the
function "name" could be called with argument "*b", the result of
which is indexed to the 4th element). In C, the two kinds of names
(TYPEDEFnames and function names (a.k.a.: declared identifiers)) share
a name space, and at every point in a source program the (hack)
contextual distinction can be made by the tokenizer. Hacks are neat
things in that the less you use them, the more likely they are to work
when you REALLY need them (i.e., you don't have to fight with existing
hacks). Having worked on designing and implementing a C compiler, I
was pleasantly amazed at how the constructs all fell together.
The major violations of this approach (i.e., keep declaration separate
from expressions) that come to mind with C++ are:
function-like-casts,
freestore expressions without parens around the type,
conversion function names using arbitrary type specifiers,
parenthesized initializers that drive constructors.
The last problem, parenthesized initializers, provides two areas of
conflicts. The first is really an interference issue with old style C
function definitions, which only bothers folks at file scope (GNU's
G++ compiler considered this to be too great an obstacle, and they
don't currently support old style C definitions!). The second part of
this conflict involves a more subtle separation between the
declarator, and the initializer. (K&R eventually provided an equal
sign as an unequivocal separator, but parens used in C++ are already
TOO overloaded to separate anything). The significance of this lack
of a clear separator is that it is difficult to decide that the
"declarator" is complete, and that the declared name should be added
to the scope. The last problem does interact in a nasty way with the
function-like cast vs declaration conflicts (the problem slows the
feedback loop to the symbol table, which is critical to continued
lexing). The parened initializers also provide another context where
it is difficult to distinguish between expressions (a true argument
list for the constructor) and a declaration continuation (a parameter
type list).
The second problem listed falls out of the "new-expression" with an
unparenthesized type. This form of freestore (such as "new int * *")
allows types to be placed adjacent to expressions, and the trailing
'*' ambiguity rears its head. I can easily prove that this is the
culprit in terms of specific ambiguities, in that removing these
(unnecessary?) forms significantly disambiguates the grammar. (It is
rather nice to use YACC as a tool to prove that a grammar is
unambiguous!). It is interesting to note that if only the derivation
of a freestore expression were limited to (using the non-terminal
names of the form that the C++ Reference manual uses):
new placement-opt ( type-name ) parened-arg-list-opt
then all the LR(1) reduce conflicts based on this problem would
vanish. Indeed, the culprit can clearly be shown to be:
new placement-opt restricted-type-name parened-arg-list-opt
The characters which excite these reduction conflicts are '*', and
'&'.
The third problem that I indicated involves the
conversion-function-name. Here again, if the syntax were restricted
to ONLY:
operator simple-type-name
then the LR(1) conflicts would vanish. It is interesting to note that
the keyword "operator" serves as the left separator, and the
restriction to "simple-type-name" results in an implicit right
separator (simple-type-names are exactly one token long). The
conflicts appear when multiple tokens are allowed for a declaration
specifier, and an optional pointer-modifier list may be added as a
postfix. The conflicts that result from this lack of separation
include all those provided by the freestore example.
Here again (as with the unambiguous version of freestore) the syntax
could be extended to:
operator_function_name :
OPERATOR any_operator
| OPERATOR basic_type_name
| OPERATOR TYPEDEFname
| OPERATOR type_qualifier
| OPERATOR '(' type_name ')'
;
instead of:
operator_function_name :
OPERATOR any_operator
| OPERATOR type_qualifier_list operator_function_ptr_opt
| OPERATOR non_elaborating_type_specifier operator_function_ptr_opt
;
and the ambiguities would vanish (and the expressivity would not be
diminished).
FUNCTION LIKE CAST vs DECLARATION AMBIGUITIES
The real big culprit (i.e., my anti-favorite) in this whole ambiguity
set (re: keeping types and expressions separate) is the
function-like-cast. The reason why it is so significant (to an LR
parser) is that the binding of a type-name, when used in a
function-like-cast, is directly to the following parenthesized
argument list. In contrast, the binding of a type-name when used in a
declaration is to all the "stuff" that follows, up until a declarator
termination mark like a ',', ';' or '='. Life really gets tough for LR
folks when the parse stack MUST be reduced, but the parser just can't
tell how yet. With this problem, the hacks began to appear (re: the
"smart lexer"). Note that these new style casts are much more than a
notational convenience in C++. The necessity of the function like
cast lies in the fact that such a cast can take several arguments,
whereas the old style cast is ALWAYS a unary operator.
I was (past tense) actually working towards resolving this problem via
some standard methods that I have developed (re: inline expansion of
rules to provide deferred reduction). I was (past tense) also using
one more sneaky piece of work to defer the reductions, as I was
carefully making use of right recursion (instead of the standard left
recursion) in order to give the parser a chance to build up more
context. I can demonstrate the usefulness of right recursive grammars
in disambiguating difficult toy grammars. Unfortunately, I realized
at some point that I NEEDED to perform certain actions (such as add
identifiers to the symbol table) in order to complete the parse!?!
This was my catch 22. I could POSSIBLY parse using an LALR grammar,
if I could only defer actions until I had enough context to
disambiguate. Unfortunately, I HAD to perform certain actions (re:
modify the symbol table, which changed the action of the tokenizer)
BEFORE I could continue to examine tokens! In some terrible sense,
the typedef hack had come back to haunt me.
I backed off a bit in my grammar after reaching this wall, and now my
grammar only waits until it reaches the identifier in the would be
declarator. I really didn't want to parse the stuff after the
identifier name (sour grapes?), because I knew I would not (for
example) be able to identify a "constant expression" in an array
subscript (most of the time, if it isn't constant, then it can't be a
declaration). I don't believe that a compiler should compete in a
battle of wits with the programmer, and the parser was already
beginning to outwit me (i.e., I was having a hard time parsing stuff
mentally that is provided as examples in the 2.0 Reference Manual. In
fact, many reviewers as well as the author had difficulty, and that is
why errors appeared there originally, as well as in the ARM).
FUNCTION LIKE CAST vs DECLARATION : THE TOUGH EXAMPLE:
The following is about the nastiest example that I have been able to
construct for this ambiguity group. I am presenting it here just in
case someone is left with a thought that there is "an easy way out".
The fact that identifiers can remain ambiguous SO LONG after
encountering them can cause no end of trouble to the parser. The
following example does not succumb to static (re: no change to the
symbol table) anticipatory lexing of a statement. As such, it
demonstrates the futility of attempting to use a "smart lexer" to
support the philosophy: "If it can be interpreted as a declaration,
then so be it; otherwise it is an expression". This ambiguous example
exploits the fact that declarators MUST be added to the symbol table
as soon as they are complete (and hence they mask external
declarations).
First I will present the example without comments:
class Type1 {
Type1 operator()(int);
} ;
class wasType2 { ...};
int (*c2)(Type1 dummy);
main ()
{
const int a1 = 0, new_var (4), (*c1)(int (a1));
Type1 (a2) = 3, wasType2 (4), (*c2)(wasType2(a1));
}
Now to repeat the example with comments:
class Type1 {....
Type1 operator()(int);
} ;
class wasType2 { ....}; /* we will almost redeclare this typename!?! */
int (*c2)(Type1 dummy); /* we will NOT redeclare Type1 */
main ()
{
/* The first line is indeed simple. It is simply placed here
to hint at how the second line MIGHT analogously be parsed. */
const int a1 = 0, new_var (4), (*c1)(int (a1));
/* As a review, "a1" is declared to be a constant with value 0.
"new_var" is declared to be another constant, but with value
4. Finally, "c1" is declared to be a pointer to a const
integer, and the initial value of this pointer is "int(a1)",
which is the same as "int(0)", or simply "0" (a.k.a., the
null pointer). It is significant that "a1" entered the symbol
table quickly so that it could be used later in the
declaration. */
/* Static lexing of what follows will suggest that the following
is also a declaration. This statement is actually 3 comma
separated expressions!! The explanation that follows shows that
assuming the 2nd statement is a declaration leads to a
contradiction. */
Type1 (a2) = 3, wasType2 (4), (*c2)(wasType2(a1));
/* Assume this second statement is a declaration. Note that by
the time "c2" is parsed, "wasType2" has been redeclared to be a
variable of type "Type1". Hence "wasType2(a1)" is actually a
function call to "wasType2.operator()(a1)", and it is not a
function prototype arg list. It follows that
"(*c2)(wasType2(a1))" is an expression, and NOT a declarator!
Since this last entry is not a declarator, the entire statement
must be an expression (ugh! it is time to backtrack). After much
work on my part, I think it might even be a semantically valid
expression. Once this backtracking is complete, we see that the
first expression "Type1 (a2) = 3" is an assignment to a cast
expression. The second expression "wasType2 (4)", is a cast of a
constant. The third expression "(*c2)(wasType2(a1))", is an
indirect function call. The argument of the call is the result
of a cast. Note that "wasType2" is actually never redeclared,
but it was close! */
/* For those of you who can parse this stuff in your sleep, and
noticed the slight error in the above argument, I have the
following "fix". The error is that the
"(*c2)(wasType2(a1))"
could actually be a declaration with a parenthesized initializer.
I could have changed this token subsequence to:
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -