⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 grammar5.txt

📁 用于lex和yacc的c++及c语言的文法,可用来构造C++语言的编译器
💻 TXT
📖 第 1 页 / 共 5 页
字号:
Copyright (C) 1989.1990, James  A.   Roskind,  All  rights  reserved. 
Abstracting  with credit is permitted.  The contents of this file may 
be reproduced electronically, or in printed  form,  IN  ITS  ENTIRETY 
with  no  changes, providing that this copyright notice is intact and 
applicable in the copy.  No other copying is  permitted  without  the 
expressed written consent of the author.

FILENAME: GRAMMAR5.TXT
AUTHOR: Jim Roskind
        Independent Consultant
        516 Latania Palm Drive
        Indialantic FL 32903
        (407)729-4348
        jar@hq.ileaf.com
        or ...!uunet!leafusa!jar


    A YACC-able C++ 2.1 GRAMMAR, AND THE RESULTING AMBIGUITIES
                                        (Release 2.0 Updated 7/11/91)

ABSTRACT

This paper describes ambiguous aspects of the C++ language  that  have 
been  exposed by the construction of a YACC-able grammar for C++.  The 
grammar is provided in  a  separate  file,  but  there  are  extensive 
references  to  the  actual  grammar.   This  release  of  the grammar 
includes the output from a  yacc  cross-reference  tool  that  I  have 
written.   The  discussion  in  this file will make many references to 
that verbose output (typically provided in a  file  called  y.output), 
but  the  file  is only critical if you are trying to "check my work", 
rather than read my results.  The  format  of  the  machine  generated 
documentation provided in y.output is defined in autodoc5.txt.

This  release  of  the  grammar  provides  what  I believe is complete 
support for nested types, at least syntactically.  It does not support 
either templates, or exception handling,  as  the  discussion  of  the 
syntax for these elements is not finalized.  

Theoretical  Note  (skip  if you hate theory): My C++ grammar does not 
make use of the %prec or %assoc features of YACC, and hence  conflicts 
are  never  hidden within my grammar.  The wondrous result is that, to 
the extent to which my grammar can be seen to span the syntax  of  the 
language,  in  all places where YACC reports no conflicts, the grammar 
is a totally unambiguous statement of  the  syntax  of  the  language.  
This  result  is  remarkable  (I think), because the "general case" of 
deciding whether a grammar is ambiguous or not, is equivalent to  (the 
unsolvable) halting problem.

Although  this  paper is terse, and at least hastily formed, I believe 
that its content is so significant to my results that  it  had  to  be 
included.   I  am  sorry  that I do not have the time at this point to 
better arrange its contents.




CONTENTS

	CONTENTS
        INTRODUCTION
        REVIEW: STANDARD LEXICAL ANALYSIS HACK: TYPEDEFname vs IDENTIFIER
        STATUS OF MY "DISAMBIGUATED" GRAMMAR
	SUMMARY OF CONFLICTS
        
	17 EASY CONFLICTS, WITH HAPPY ENDINGS
	1 SR CONFLICT WITH AN ALMOST HAPPY ENDING
        6 NOVEL CONFLICT THAT YIELD TO SEMANTIC DISAMBIGUATION
	1 CONFLICT THAT CONSTRAINTS SUPPORT THE RESOLUTION FOR
	
	THE TOUGH AMBIGUITIES: FUNCTION LIKE CASTS AND COMPANY (17 CONFLICTS)
        THE TOUGH AMBIGUITIES: FUNCTION LIKE CASTS AND COMPANY
	LALR-only CONFLICTS IN THE GRAMMAR
        SAMPLE RESOLUTIONS OF AMBIGUITIES BY MY C++ GRAMMAR
        DIFFICULT AMBIGUITIES FOR C++ 2.0 PARSER TO TRY
        COMMENTARY ON CURRENT C++ DISAMBIGUATING RULES
        SOURCE OF CONFLICTS IN C++ (MIXING TYPES AND EXPRESSIONS)
	FUNCTION LIKE CAST vs DECLARATION AMBIGUITIES
	FUNCTION LIKE CAST vs DECLARATION : THE TOUGH EXAMPLE:        
        
	CONCLUSION

	APPENDIX A:
	  PROPOSED GRAMMAR MODIFICATIONS (fixing '*', and '&' conflicts)

	APPENDIX B:
	  CANONICAL DESCRIPTION OF CONFLICTS and STATES





INTRODUCTION

This paper is  intended  to  go  into  significant  detail  about  the 
ambiguities  that I have seen in the C++ 2.1 language (proposed in the 
"C++ Annotated Reference Manual", a.k.a. ARM, by Ellis and Stroustrup) 
and exposed by attempting to develop a YACC compatible (i.e., LALR(1)) 
grammar.  I must point out that this is NOT meant as an attack on  the 
language  or  the  originators  thereof,  but  rather an R&D effort to 
disambiguate some details.  (I personally believe that Stroustrup  et. 
al.  are doing a GREAT job).  I have the vague hope that the extensive 
hacking  that  I have done with YACC on the C++ grammar has given me a 
rather novel vantage point.  (I  have  expressed  my  observations  to 
Bjarne  Stroustrup,  and in doing so verified that other folks had not 
previously  identified  several  of  my  points).   In  light  of   my 
activities,  this  paper  includes  a fair amount of philosophizing. I 
hope that none of this paper assumes too greatly that I  have  insight 
that  is  beyond  that  of  the  readers, and certainly no insults are 
intended.

If you like investigating grammars directly (rather than reading  what 
I  have  to say), I would strongly suggest you read the ANSI C grammar 
before looking at the C++  grammar.   Additionally,  if  you  want  to 
really  understand  the  grammar,  as  suggested  in autodoc5.txt, the 
grammar cross-reference provided in y.output is also very helpful. The 
C++ grammar  is  pretty  large,  and  you  can  expect  the  following 
statistics if you have a similar YACC to that I am using:

Berkeley YACC (1.8  01/02/91)

103 terminals, 186 nonterminals
665 grammar rules, 1256 states
24 shift/reduce conflicts, 18 reduce/reduce conflicts.

Sadly,  many  standard  yacc implementations have static limits on the 
grammars that they can handle.  If you  find  this  to  be  a  problem 
(i.e.,  you  yacc  says  "too  many  states", and terminates), I would 
suggest acquiring Berkeley yacc (which is free), or  bison  (which  is 
mostly  free),  or  purchasing  a  commercial  yacc implementation.  A 
simple port of these publicly available sources on  machines  with  16 
bit  "int"s  (e.g.,  most  PCs)  will  also  be  unable to process the 
grammar.  PCYACC (from Abraxas  Software)  and  ydb  (from  Bloomsbury 
Software),  are examples of commercial products that run on the PC and 
Sun respectively, that I know from experience can yacc these  grammars 
(I currently have no financial affiliation with either vendor).


REVIEW: STANDARD LEXICAL ANALYSIS HACK: TYPEDEFname vs IDENTIFIER

This section is targeted at readers that are not familiar with parsing 
C  with a context free grammar.  The reader should note that there are 
two distinct forms of `identifiers' gathered during lexical  analysis, 
and  identified  as  terminal  tokens  in  both  of  my grammars.  The 
terminal tokens are called IDENTIFIER  and  TYPEDEFname  respectively. 
This  distinction  is  a  required  by  a fundamental element of the C 
language.  The definition of a "TYPEDEFname" is a  lexeme  that  looks 
like  a  standard  identifier,  but is currently defined in the symbol 
table as being declared a typedef (or class, struct, enum  tag)  name. 
All  other  lexemes  that appear as identifiers (and are not keywords) 
are tokenized as IDENTIFIERs.  The  remainder  of  this  section  will 
review the motivation for this approach.

Consider the following sample code, which uses the C language:

        ...
        int main() { f (*a) [4] ; ...

Most  readers  will intuitively parse the above sequence as a function 
call to "f", with an argument "*a",  the  return  value  of  which  is 
(probably)  a  pointer,  and  hence  can  be  indexed by "4".  Such an 
interpretation presumes that the prior context was something like:

        ...
        extern float *a;
        char * f(float);
        int main() { f (*a) [4] ; ...

However, if the prior context was **INSTEAD**:

        ...
        typedef int f;
        extern float a;
        int main() { f (*a) [4] ; ...

then the interpretation  of  the  code  is  entirely  different.   The 
fragment  in  question  actually redeclares a local variable "a", such 
that the type of "a" is "pointer to array of 4 int's"!  So we  see  in 
this   example   that  the  statement  "f(*a)[4];"  cannot  be  parsed 
independent of context.

The standard solution to the above ambiguity is to allow  the  lexical 
analyzer to form different tokens based on contextual information. The 
contextual information that is used is the answer to the question: "Is 
a  given  identifier  defined as a typedef name (or class name) at the 
current point in the parse"?  I will refer to this feedback loop (from 
the parser that stores information in  a  symbol  table,  wherein  the 
lexer extracts the information) as the "lex hack".  With this lex hack 
in  place the code fragment "f(*a)[4];" would be provided by the lexer 
as either:

        IDENTIFIER '(' '*' IDENTIFIER ')' '[' INTEGERconstant ']' ';'

or

        TYPEDEFname '(' '*' IDENTIFIER ')' '[' INTEGERconstant ']' ';'

The two case are very easy for a context free grammar to  distinguish, 
and  the  ambiguity  vanishes.  Note that the fact that such a hack is 
used (out of necessity) demonstrates that C  is  not  a  context  free 
language,  but  the hack allows us to continue to use an LR(1) parser, 
and a context free grammar.

Note that this hack is, of necessity, also made  use  of  in  the  C++ 
grammar,  but  no  additional  feedback  (a.k.a.:  hack)  is (at least 
currently) required.  The interested reader should also note that this 
feedback loop (re: updating the symbol table) must be  swift,  as  the 
lexical  analyzer  cannot  proceed  very  far  without this contextual 
information.  This constraint on the feedback time often prevents  the 
parser  from  "deferring" actions, and hence increases the pressure on 
the parser to rapidly disambiguate token sequences.



STATUS OF MY "DISAMBIGUATED" GRAMMAR

Several independent reviews,  which  provided  a  complete  front  end 
lexical  analyzers,  and  parsed existing C++ code, have verified that 
the grammars span of  the  C++  language  appears  complete.   I  have 
neither   incorporated   parametric   types  (a.k.a.   templates)  nor 
exception handling into the grammars at this point, as  they  continue 
to be in a state of flux.

The  grammar does (I believe) support all the features provided in C++ 
2.1, including multiple inheritance and the  enhanced  operator  "new" 
syntax (includes placement expression). I believe that, except for the 
minor  change  involving  not  permitting parentheses around bit field 
names during a declaration, my C++ grammar supports a superset  of  my 
ANSI  C  grammar. Note that I haven't inline expanded all the rules in 
the C grammar that were required for C++ disambiguation (re: deferring 
reductions), and hence a `diff' of the two grammars will not provide a 
trivial comparison.  The resulting major  advantage  of  this  grammar 
over  every  other  current  C++  parser  (that  I know of) is that it 
supports old style function definitions AS WELL AS  all  the  standard 
C++.   (It is my personal belief that such support was dropped by many 
compilers and translators in order to resolve the many syntax problems 
that appear within C++.  I believe this  grammar  shows  that  such  a 
decision was premature).

My  list  of shift-reduce and reduce-reduce conflicts is currently: 24 
shift/reduce, 18 reduce/reduce conflicts reported.  I have  chosen  to 
leave  so  many conflicts in the grammar because I hope to see changes 
to the syntax that will remove them, rather than making changes to  my 
grammar  that  will firmly accept and disambiguate them.  (Considering 
the detailed analysis presented here,  such  changes  would  only  add 
unnecessary complications to an already large grammar).


SUMMARY OF CONFLICTS

The  following  summarizes the conflicts based on a simple description 
of the nature of the conflict:

8 SR caused by operator function name with trailing * or &
	states: 131, 138, 281, 282

8 SR caused by freestore              with trailing * or &
	states: 571, 572, 778, 779

1 SR caused by dangling else and my laziness
	state: 1100

1 SR caused by member declaration of sub-structure, with trailing :
	state: 64

6 RR caused by constructor declaration vs member declaration
	states: 1105, 1152, 1175

1 SR caused by explicit call to destructor, without explicit scope
	state: 536




3 RR caused by function-like cast vs typedef redeclaration ambiguity
	state: 738

3 RR caused by function-like cast vs identifier declaration ambiguity
	state: 739

3 RR caused by destructor declaration vs destructor call 
	state: 740

3 RR caused by parened initializer vs  prototype/typename
	state: 758

5 SR caused by redundant parened TYPEDEFname redeclaration vs old style cast
	states: 395, 621, 1038, 1102, 1103



⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -