📄 grammar5.txt
字号:
Copyright (C) 1989.1990, James A. Roskind, All rights reserved.
Abstracting with credit is permitted. The contents of this file may
be reproduced electronically, or in printed form, IN ITS ENTIRETY
with no changes, providing that this copyright notice is intact and
applicable in the copy. No other copying is permitted without the
expressed written consent of the author.
FILENAME: GRAMMAR5.TXT
AUTHOR: Jim Roskind
Independent Consultant
516 Latania Palm Drive
Indialantic FL 32903
(407)729-4348
jar@hq.ileaf.com
or ...!uunet!leafusa!jar
A YACC-able C++ 2.1 GRAMMAR, AND THE RESULTING AMBIGUITIES
(Release 2.0 Updated 7/11/91)
ABSTRACT
This paper describes ambiguous aspects of the C++ language that have
been exposed by the construction of a YACC-able grammar for C++. The
grammar is provided in a separate file, but there are extensive
references to the actual grammar. This release of the grammar
includes the output from a yacc cross-reference tool that I have
written. The discussion in this file will make many references to
that verbose output (typically provided in a file called y.output),
but the file is only critical if you are trying to "check my work",
rather than read my results. The format of the machine generated
documentation provided in y.output is defined in autodoc5.txt.
This release of the grammar provides what I believe is complete
support for nested types, at least syntactically. It does not support
either templates, or exception handling, as the discussion of the
syntax for these elements is not finalized.
Theoretical Note (skip if you hate theory): My C++ grammar does not
make use of the %prec or %assoc features of YACC, and hence conflicts
are never hidden within my grammar. The wondrous result is that, to
the extent to which my grammar can be seen to span the syntax of the
language, in all places where YACC reports no conflicts, the grammar
is a totally unambiguous statement of the syntax of the language.
This result is remarkable (I think), because the "general case" of
deciding whether a grammar is ambiguous or not, is equivalent to (the
unsolvable) halting problem.
Although this paper is terse, and at least hastily formed, I believe
that its content is so significant to my results that it had to be
included. I am sorry that I do not have the time at this point to
better arrange its contents.
CONTENTS
CONTENTS
INTRODUCTION
REVIEW: STANDARD LEXICAL ANALYSIS HACK: TYPEDEFname vs IDENTIFIER
STATUS OF MY "DISAMBIGUATED" GRAMMAR
SUMMARY OF CONFLICTS
17 EASY CONFLICTS, WITH HAPPY ENDINGS
1 SR CONFLICT WITH AN ALMOST HAPPY ENDING
6 NOVEL CONFLICT THAT YIELD TO SEMANTIC DISAMBIGUATION
1 CONFLICT THAT CONSTRAINTS SUPPORT THE RESOLUTION FOR
THE TOUGH AMBIGUITIES: FUNCTION LIKE CASTS AND COMPANY (17 CONFLICTS)
THE TOUGH AMBIGUITIES: FUNCTION LIKE CASTS AND COMPANY
LALR-only CONFLICTS IN THE GRAMMAR
SAMPLE RESOLUTIONS OF AMBIGUITIES BY MY C++ GRAMMAR
DIFFICULT AMBIGUITIES FOR C++ 2.0 PARSER TO TRY
COMMENTARY ON CURRENT C++ DISAMBIGUATING RULES
SOURCE OF CONFLICTS IN C++ (MIXING TYPES AND EXPRESSIONS)
FUNCTION LIKE CAST vs DECLARATION AMBIGUITIES
FUNCTION LIKE CAST vs DECLARATION : THE TOUGH EXAMPLE:
CONCLUSION
APPENDIX A:
PROPOSED GRAMMAR MODIFICATIONS (fixing '*', and '&' conflicts)
APPENDIX B:
CANONICAL DESCRIPTION OF CONFLICTS and STATES
INTRODUCTION
This paper is intended to go into significant detail about the
ambiguities that I have seen in the C++ 2.1 language (proposed in the
"C++ Annotated Reference Manual", a.k.a. ARM, by Ellis and Stroustrup)
and exposed by attempting to develop a YACC compatible (i.e., LALR(1))
grammar. I must point out that this is NOT meant as an attack on the
language or the originators thereof, but rather an R&D effort to
disambiguate some details. (I personally believe that Stroustrup et.
al. are doing a GREAT job). I have the vague hope that the extensive
hacking that I have done with YACC on the C++ grammar has given me a
rather novel vantage point. (I have expressed my observations to
Bjarne Stroustrup, and in doing so verified that other folks had not
previously identified several of my points). In light of my
activities, this paper includes a fair amount of philosophizing. I
hope that none of this paper assumes too greatly that I have insight
that is beyond that of the readers, and certainly no insults are
intended.
If you like investigating grammars directly (rather than reading what
I have to say), I would strongly suggest you read the ANSI C grammar
before looking at the C++ grammar. Additionally, if you want to
really understand the grammar, as suggested in autodoc5.txt, the
grammar cross-reference provided in y.output is also very helpful. The
C++ grammar is pretty large, and you can expect the following
statistics if you have a similar YACC to that I am using:
Berkeley YACC (1.8 01/02/91)
103 terminals, 186 nonterminals
665 grammar rules, 1256 states
24 shift/reduce conflicts, 18 reduce/reduce conflicts.
Sadly, many standard yacc implementations have static limits on the
grammars that they can handle. If you find this to be a problem
(i.e., you yacc says "too many states", and terminates), I would
suggest acquiring Berkeley yacc (which is free), or bison (which is
mostly free), or purchasing a commercial yacc implementation. A
simple port of these publicly available sources on machines with 16
bit "int"s (e.g., most PCs) will also be unable to process the
grammar. PCYACC (from Abraxas Software) and ydb (from Bloomsbury
Software), are examples of commercial products that run on the PC and
Sun respectively, that I know from experience can yacc these grammars
(I currently have no financial affiliation with either vendor).
REVIEW: STANDARD LEXICAL ANALYSIS HACK: TYPEDEFname vs IDENTIFIER
This section is targeted at readers that are not familiar with parsing
C with a context free grammar. The reader should note that there are
two distinct forms of `identifiers' gathered during lexical analysis,
and identified as terminal tokens in both of my grammars. The
terminal tokens are called IDENTIFIER and TYPEDEFname respectively.
This distinction is a required by a fundamental element of the C
language. The definition of a "TYPEDEFname" is a lexeme that looks
like a standard identifier, but is currently defined in the symbol
table as being declared a typedef (or class, struct, enum tag) name.
All other lexemes that appear as identifiers (and are not keywords)
are tokenized as IDENTIFIERs. The remainder of this section will
review the motivation for this approach.
Consider the following sample code, which uses the C language:
...
int main() { f (*a) [4] ; ...
Most readers will intuitively parse the above sequence as a function
call to "f", with an argument "*a", the return value of which is
(probably) a pointer, and hence can be indexed by "4". Such an
interpretation presumes that the prior context was something like:
...
extern float *a;
char * f(float);
int main() { f (*a) [4] ; ...
However, if the prior context was **INSTEAD**:
...
typedef int f;
extern float a;
int main() { f (*a) [4] ; ...
then the interpretation of the code is entirely different. The
fragment in question actually redeclares a local variable "a", such
that the type of "a" is "pointer to array of 4 int's"! So we see in
this example that the statement "f(*a)[4];" cannot be parsed
independent of context.
The standard solution to the above ambiguity is to allow the lexical
analyzer to form different tokens based on contextual information. The
contextual information that is used is the answer to the question: "Is
a given identifier defined as a typedef name (or class name) at the
current point in the parse"? I will refer to this feedback loop (from
the parser that stores information in a symbol table, wherein the
lexer extracts the information) as the "lex hack". With this lex hack
in place the code fragment "f(*a)[4];" would be provided by the lexer
as either:
IDENTIFIER '(' '*' IDENTIFIER ')' '[' INTEGERconstant ']' ';'
or
TYPEDEFname '(' '*' IDENTIFIER ')' '[' INTEGERconstant ']' ';'
The two case are very easy for a context free grammar to distinguish,
and the ambiguity vanishes. Note that the fact that such a hack is
used (out of necessity) demonstrates that C is not a context free
language, but the hack allows us to continue to use an LR(1) parser,
and a context free grammar.
Note that this hack is, of necessity, also made use of in the C++
grammar, but no additional feedback (a.k.a.: hack) is (at least
currently) required. The interested reader should also note that this
feedback loop (re: updating the symbol table) must be swift, as the
lexical analyzer cannot proceed very far without this contextual
information. This constraint on the feedback time often prevents the
parser from "deferring" actions, and hence increases the pressure on
the parser to rapidly disambiguate token sequences.
STATUS OF MY "DISAMBIGUATED" GRAMMAR
Several independent reviews, which provided a complete front end
lexical analyzers, and parsed existing C++ code, have verified that
the grammars span of the C++ language appears complete. I have
neither incorporated parametric types (a.k.a. templates) nor
exception handling into the grammars at this point, as they continue
to be in a state of flux.
The grammar does (I believe) support all the features provided in C++
2.1, including multiple inheritance and the enhanced operator "new"
syntax (includes placement expression). I believe that, except for the
minor change involving not permitting parentheses around bit field
names during a declaration, my C++ grammar supports a superset of my
ANSI C grammar. Note that I haven't inline expanded all the rules in
the C grammar that were required for C++ disambiguation (re: deferring
reductions), and hence a `diff' of the two grammars will not provide a
trivial comparison. The resulting major advantage of this grammar
over every other current C++ parser (that I know of) is that it
supports old style function definitions AS WELL AS all the standard
C++. (It is my personal belief that such support was dropped by many
compilers and translators in order to resolve the many syntax problems
that appear within C++. I believe this grammar shows that such a
decision was premature).
My list of shift-reduce and reduce-reduce conflicts is currently: 24
shift/reduce, 18 reduce/reduce conflicts reported. I have chosen to
leave so many conflicts in the grammar because I hope to see changes
to the syntax that will remove them, rather than making changes to my
grammar that will firmly accept and disambiguate them. (Considering
the detailed analysis presented here, such changes would only add
unnecessary complications to an already large grammar).
SUMMARY OF CONFLICTS
The following summarizes the conflicts based on a simple description
of the nature of the conflict:
8 SR caused by operator function name with trailing * or &
states: 131, 138, 281, 282
8 SR caused by freestore with trailing * or &
states: 571, 572, 778, 779
1 SR caused by dangling else and my laziness
state: 1100
1 SR caused by member declaration of sub-structure, with trailing :
state: 64
6 RR caused by constructor declaration vs member declaration
states: 1105, 1152, 1175
1 SR caused by explicit call to destructor, without explicit scope
state: 536
3 RR caused by function-like cast vs typedef redeclaration ambiguity
state: 738
3 RR caused by function-like cast vs identifier declaration ambiguity
state: 739
3 RR caused by destructor declaration vs destructor call
state: 740
3 RR caused by parened initializer vs prototype/typename
state: 758
5 SR caused by redundant parened TYPEDEFname redeclaration vs old style cast
states: 395, 621, 1038, 1102, 1103
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -