📄 yacc-docs.txt
字号:
http://www.csc.calpoly.edu/~gfisher/450/doc/yacc/paper.txtYacc: Yet Another Compiler-Compiler PS1:15-1 Yacc: Yet Another Compiler-Compiler Stephen C. Johnson AT&T Bell Laboratories Murray Hill, New Jersey 07974 ABSTRACT Computer program input generally has some struc- ture; in fact, every computer program that does input can be thought of as defining an ``input language'' which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers. Unfortunately, usual input facilities are limited, difficult to use, and often are lax about checking their inputs for validity. Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that han- dles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine. The input subroutine produced by Yacc calls a user-supplied routine to return the next basic input item. Thus, the user can specify his input in terms of individual input characters, or in terms of higher level constructs such as names and numbers. The user- supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification. Yacc is written in portable C. The class of specifications accepted is a very general one: LALR(1) grammars with disambiguating rules. In addition to compilers for C, APL, Pascal, RAT- FOR, etc., Yacc has also been used for less conven- tional languages, including a phototypesetter language, several desk calculator languages, a document retrieval system, and a Fortran debugging system.PS1:15-2 Yacc: Yet Another Compiler-Compiler0: Introduction Yacc provides a general tool for imposing structure on theinput to a computer program. The Yacc user prepares a specifica-tion of the input process; this includes rules describing theinput structure, code to be invoked when these rules are recog-nized, and a low-level routine to do the basic input. Yacc thengenerates a function to control the input process. This func-tion, called a parser, calls the user-supplied low-level inputroutine (the lexical analyzer) to pick up the basic items (calledtokens) from the input stream. These tokens are organizedaccording to the input structure rules, called grammar rules;when one of these rules has been recognized, then user code sup-plied for this rule, an action, is invoked; actions have theability to return values and make use of the values of otheractions. Yacc is written in a portable dialect of C[1] and theactions, and output subroutine, are in C as well. Moreover, manyof the syntactic conventions of Yacc follow C. The heart of the input specification is a collection ofgrammar rules. Each rule describes an allowable structure andgives it a name. For example, one grammar rule might be date : month_name day ',' year ;Here, date, month_name, day, and year represent structures ofinterest in the input process; presumably, month_name, day, andyear are defined elsewhere. The comma ``,'' is enclosed in sin-gle quotes; this implies that the comma is to appear literally inthe input. The colon and semicolon merely serve as punctuationin the rule, and have no significance in controlling the input.Thus, with proper definitions, the input July 4, 1776might be matched by the above rule. An important part of the input process is carried out by thelexical analyzer. This user routine reads the input stream,recognizing the lower level structures, and communicates thesetokens to the parser. For historical reasons, a structure recog-nized by the lexical analyzer is called a terminal symbol, whilethe structure recognized by the parser is called a nonterminalsymbol. To avoid confusion, terminal symbols will usually bereferred to as tokens. There is considerable leeway in deciding whether to recog-nize structures using the lexical analyzer or grammar rules. Forexample, the rulesYacc: Yet Another Compiler-Compiler PS1:15-3 month_name : 'J' 'a' 'n' ; month_name : 'F' 'e' 'b' ; . . . month_name : 'D' 'e' 'c' ;might be used in the above example. The lexical analyzer wouldonly need to recognize individual letters, and month_name wouldbe a nonterminal symbol. Such low-level rules tend to waste timeand space, and may complicate the specification beyond Yacc'sability to deal with it. Usually, the lexical analyzer wouldrecognize the month names, and return an indication that amonth_name was seen; in this case, month_name would be a token. Literal characters such as ``,'' must also be passed throughthe lexical analyzer, and are also considered tokens. Specification files are very flexible. It is realively easyto add to the above example the rule date : month '/' day '/' year ;allowing 7 / 4 / 1776as a synonym for July 4, 1776In most cases, this new rule could be ``slipped in'' to a workingsystem with minimal effort, and little danger of disruptingexisting input. The input being read may not conform to the specifications.These input errors are detected as early as is theoretically pos-sible with a left-to-right scan; thus, not only is the chance ofreading and computing with bad input data substantially reduced,but the bad data can usually be quickly found. Error handling,provided as part of the input specifications, permits the reentryof bad data, or the continuation of the input process after skip-ping over the bad data. In some cases, Yacc fails to produce a parser when given aset of specifications. For example, the specifications may beself contradictory, or they may require a more powerful recogni-tion mechanism than that available to Yacc. The former casesrepresent design errors; the latter cases can often be correctedby making the lexical analyzer more powerful, or by rewritingsome of the grammar rules. While Yacc cannot handle all possiblespecifications, its power compares favorably with similar sys-tems; moreover, the constructions which are difficult for Yacc toPS1:15-4 Yacc: Yet Another Compiler-Compilerhandle are also frequently difficult for human beings to handle.Some users have reported that the discipline of formulating validYacc specifications for their input revealed errors of conceptionor design early in the program development. The theory underlying Yacc has been described elsewhere.[2,3, 4] Yacc has been extensively used in numerous practical appli-cations, including lint,[5] the Portable C Compiler,[6] and asystem for typesetting mathematics.[7] The next several sections describe the basic process ofpreparing a Yacc specification; Section 1 describes the prepara-tion of grammar rules, Section 2 the preparation of the user sup-plied actions associated with these rules, and Section 3 thepreparation of lexical analyzers. Section 4 describes the opera-tion of the parser. Section 5 discusses various reasons why Yaccmay be unable to produce a parser from a specification, and whatto do about it. Section 6 describes a simple mechanism for han-dling operator precedences in arithmetic expressions. Section 7discusses error detection and recovery. Section 8 discusses theoperating environment and special features of the parsers Yaccproduces. Section 9 gives some suggestions which should improvethe style and efficiency of the specifications. Section 10discusses some advanced topics, and Section 11 gives acknowledge-ments. Appendix A has a brief example, and Appendix B gives asummary of the Yacc input syntax. Appendix C gives an exampleusing some of the more advanced features of Yacc, and, finally,Appendix D describes mechanisms and syntax no longer activelysupported, but provided for historical continuity with older ver-sions of Yacc.1: Basic Specifications Names refer to either tokens or nonterminal symbols. Yaccrequires token names to be declared as such. In addition, forreasons discussed in Section 3, it is often desirable to includethe lexical analyzer as part of the specification file; it may beuseful to include other programs as well. Thus, every specifica-tion file consists of three sections: the declarations, (grammar)rules, and programs. The sections are separated by double per-cent ``%%'' marks. (The percent ``%'' is generally used in Yaccspecifications as an escape character.) In other words, a full specification file looks like declarations %% rules %% programs The declaration section may be empty. Moreover, if the pro-grams section is omitted, the second %% mark may be omitted also;Yacc: Yet Another Compiler-Compiler PS1:15-5thus, the smallest legal Yacc specification is %% rules Blanks, tabs, and newlines are ignored except that they maynot appear in names or multi-character reserved symbols. Com-ments may appear wherever a name is legal; they are enclosed in/* . . . */, as in C and PL/I. The rules section is made up of one or more grammar rules.A grammar rule has the form: A : BODY ;A represents a nonterminal name, and BODY represents a sequenceof zero or more names and literals. The colon and the semicolonare Yacc punctuation. Names may be of arbitrary length, and may be made up ofletters, dot ``.'', underscore ``_'', and non-initial digits.Upper and lower case letters are distinct. The names used in thebody of a grammar rule may represent tokens or nonterminal sym-bols. A literal consists of a character enclosed in single quotes``'''. As in C, the backslash ``\'' is an escape characterwithin literals, and all the C escapes are recognized. Thus '\n' newline '\r' return '\'' single quote ``''' '\\' backslash ``\'' '\t' tab '\b' backspace '\f' form feed '\xxx' ``xxx'' in octalFor a number of technical reasons, the NUL character ('\0' or 0)should never be used in grammar rules. If there are several grammar rules with the same left handside, the vertical bar ``|'' can be used to avoid rewriting theleft hand side. In addition, the semicolon at the end of a rulecan be dropped before a vertical bar. Thus the grammar rules A : B C D ; A : E F ; A : G ;can be given to Yacc asPS1:15-6 Yacc: Yet Another Compiler-Compiler A : B C D | E F | G ;It is not necessary that all grammar rules with the same leftside appear together in the grammar rules section, although itmakes the input much more readable, and easier to change. If a nonterminal symbol matches the empty string, this canbe indicated in the obvious way: empty : ; Names representing tokens must be declared; this is mostsimply done by writing %token name1 name2 . . .in the declarations section. (See Sections 3 , 5, and 6 for muchmore discussion). Every name not defined in the declarationssection is assumed to represent a nonterminal symbol. Every non-terminal symbol must appear on the left side of at least onerule. Of all the nonterminal symbols, one, called the start sym-bol, has particular importance. The parser is designed to recog-nize the start symbol; thus, this symbol represents the largest,most general structure described by the grammar rules. Bydefault, the start symbol is taken to be the left hand side ofthe first grammar rule in the rules section. It is possible, andin fact desirable, to declare the start symbol explicitly in thedeclarations section using the %start keyword: %start symbol The end of the input to the parser is signaled by a specialtoken, called the endmarker. If the tokens up to, but notincluding, the endmarker form a structure which matches the startsymbol, the parser function returns to its caller after the end-marker is seen; it accepts the input. If the endmarker is seenin any other context, it is an error. It is the job of the user-supplied lexical analyzer toreturn the endmarker when appropriate; see section 3, below.Usually the endmarker represents some reasonably obvious I/Ostatus, such as ``end-of-file'' or ``end-of-record''.2: Actions
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -