📄 bison_4.htm
字号:
<HTML><HEAD><!-- This HTML file has been created by texi2html 1.44 from /opt/src/gnu/bison-1.25/bison.texinfo on 30 June 1997 --><TITLE>Bison 1.25 - The Concepts of Bison</TITLE></HEAD><BODY>Go to the <A HREF="bison_1.html">first</A>, <A HREF="bison_3.html">previous</A>, <A HREF="bison_5.html">next</A>, <A HREF="bison_15.html">last</A> section, <A HREF="index.html">table of contents</A>.<HR><H1><A NAME="SEC7" HREF="index.html#SEC7">The Concepts of Bison</A></H1><P>This chapter introduces many of the basic concepts without which thedetails of Bison will not make sense. If you do not already know how touse Bison or Yacc, we suggest you start by reading this chapter carefully.</P><H2><A NAME="SEC8" HREF="index.html#SEC8">Languages and Context-Free Grammars</A></H2><P><A NAME="IDX2"></A><A NAME="IDX3"></A>In order for Bison to parse a language, it must be described by a<STRONG>context-free grammar</STRONG>. This means that you specify one or more<STRONG>syntactic groupings</STRONG> and give rules for constructing them from theirparts. For example, in the C language, one kind of grouping is called an`expression'. One rule for making an expression might be, "An expressioncan be made of a minus sign and another expression". Another would be,"An expression can be an integer". As you can see, rules are oftenrecursive, but there must be at least one rule which leads out of therecursion.</P><P><A NAME="IDX4"></A><A NAME="IDX5"></A>The most common formal system for presenting such rules for humans to readis <STRONG>Backus-Naur Form</STRONG> or "BNF", which was developed in order tospecify the language Algol 60. Any grammar expressed in BNF is acontext-free grammar. The input to Bison is essentially machine-readableBNF.</P><P>Not all context-free languages can be handled by Bison, only thosethat are LALR(1). In brief, this means that it must be possible totell how to parse any portion of an input string with just a singletoken of look-ahead. Strictly speaking, that is a description of anLR(1) grammar, and LALR(1) involves additional restrictions that arehard to explain simply; but it is rare in actual practice to find anLR(1) grammar that fails to be LALR(1). See section <A HREF="bison_8.html#SEC79">Mysterious Reduce/Reduce Conflicts</A>, for more information on this.</P><P><A NAME="IDX6"></A><A NAME="IDX7"></A><A NAME="IDX8"></A><A NAME="IDX9"></A>In the formal grammatical rules for a language, each kind of syntactic unitor grouping is named by a <STRONG>symbol</STRONG>. Those which are built by groupingsmaller constructs according to grammatical rules are called<STRONG>nonterminal symbols</STRONG>; those which can't be subdivided are called<STRONG>terminal symbols</STRONG> or <STRONG>token types</STRONG>. We call a piece of inputcorresponding to a single terminal symbol a <STRONG>token</STRONG>, and a piececorresponding to a single nonterminal symbol a <STRONG>grouping</STRONG>.</P><P>We can use the C language as an example of what symbols, terminal andnonterminal, mean. The tokens of C are identifiers, constants (numeric andstring), and the various keywords, arithmetic operators and punctuationmarks. So the terminal symbols of a grammar for C include `identifier',`number', `string', plus one symbol for each keyword, operator orpunctuation mark: `if', `return', `const', `static', `int', `char',`plus-sign', `open-brace', `close-brace', `comma' and many more. (Thesetokens can be subdivided into characters, but that is a matter oflexicography, not grammar.)</P><P>Here is a simple C function subdivided into tokens:</P><PRE>int /* keyword `int' */square (x) /* identifier, open-paren, */ /* identifier, close-paren */ int x; /* keyword `int', identifier, semicolon */{ /* open-brace */ return x * x; /* keyword `return', identifier, */ /* asterisk, identifier, semicolon */} /* close-brace */</PRE><P>The syntactic groupings of C include the expression, the statement, thedeclaration, and the function definition. These are represented in thegrammar of C by nonterminal symbols `expression', `statement',`declaration' and `function definition'. The full grammar uses dozens ofadditional language constructs, each with its own nonterminal symbol, inorder to express the meanings of these four. The example above is afunction definition; it contains one declaration, and one statement. Inthe statement, each <SAMP>`x'</SAMP> is an expression and so is <SAMP>`x * x'</SAMP>.</P><P>Each nonterminal symbol must have grammatical rules showing how it is madeout of simpler constructs. For example, one kind of C statement is the<CODE>return</CODE> statement; this would be described with a grammar rule whichreads informally as follows:</P><BLOCKQUOTE><P>A `statement' can be made of a `return' keyword, an `expression' and a`semicolon'.</BLOCKQUOTE><P>There would be many other rules for `statement', one for each kind ofstatement in C.</P><P><A NAME="IDX10"></A>One nonterminal symbol must be distinguished as the special one whichdefines a complete utterance in the language. It is called the <STRONG>startsymbol</STRONG>. In a compiler, this means a complete input program. In the Clanguage, the nonterminal symbol `sequence of definitions and declarations'plays this role.</P><P>For example, <SAMP>`1 + 2'</SAMP> is a valid C expression--a valid part of a Cprogram--but it is not valid as an <EM>entire</EM> C program. In thecontext-free grammar of C, this follows from the fact that `expression' isnot the start symbol.</P><P>The Bison parser reads a sequence of tokens as its input, and groups thetokens using the grammar rules. If the input is valid, the end result isthat the entire token sequence reduces to a single grouping whose symbol isthe grammar's start symbol. If we use a grammar for C, the entire inputmust be a `sequence of definitions and declarations'. If not, the parserreports a syntax error.</P><H2><A NAME="SEC9" HREF="index.html#SEC9">From Formal Rules to Bison Input</A></H2><P><A NAME="IDX11"></A><A NAME="IDX12"></A><A NAME="IDX13"></A></P><P>A formal grammar is a mathematical construct. To define the languagefor Bison, you must write a file expressing the grammar in Bison syntax:a <STRONG>Bison grammar</STRONG> file. See section <A HREF="bison_6.html#SEC34">Bison Grammar Files</A>.</P><P>A nonterminal symbol in the formal grammar is represented in Bison inputas an identifier, like an identifier in C. By convention, it should bein lower case, such as <CODE>expr</CODE>, <CODE>stmt</CODE> or <CODE>declaration</CODE>.</P><P>The Bison representation for a terminal symbol is also called a <STRONG>tokentype</STRONG>. Token types as well can be represented as C-like identifiers. Byconvention, these identifiers should be upper case to distinguish them fromnonterminals: for example, <CODE>INTEGER</CODE>, <CODE>IDENTIFIER</CODE>, <CODE>IF</CODE> or<CODE>RETURN</CODE>. A terminal symbol that stands for a particular keyword inthe language should be named after that keyword converted to upper case.The terminal symbol <CODE>error</CODE> is reserved for error recovery.See section <A HREF="bison_6.html#SEC40">Symbols, Terminal and Nonterminal</A>.</P><P>A terminal symbol can also be represented as a character literal, just likea C character constant. You should do this whenever a token is just asingle character (parenthesis, plus-sign, etc.): use that same character ina literal as the terminal symbol for that token.</P><P>A third way to represent a terminal symbol is with a C string constantcontaining several characters. See section <A HREF="bison_6.html#SEC40">Symbols, Terminal and Nonterminal</A>, for more information.</P><P>The grammar rules also have an expression in Bison syntax. For example,here is the Bison rule for a C <CODE>return</CODE> statement. The semicolon inquotes is a literal character token, representing part of the C syntax forthe statement; the naked semicolon, and the colon, are Bison punctuationused in every rule.</P><PRE>stmt: RETURN expr ';' ;</PRE><P>See section <A HREF="bison_6.html#SEC41">Syntax of Grammar Rules</A>.</P><H2><A NAME="SEC10" HREF="index.html#SEC10">Semantic Values</A></H2><P><A NAME="IDX14"></A><A NAME="IDX15"></A></P><P>A formal grammar selects tokens only by their classifications: for example,if a rule mentions the terminal symbol `integer constant', it means that<EM>any</EM> integer constant is grammatically valid in that position. Theprecise value of the constant is irrelevant to how to parse the input: if<SAMP>`x+4'</SAMP> is grammatical then <SAMP>`x+1'</SAMP> or <SAMP>`x+3989'</SAMP> is equallygrammatical.</P><P>But the precise value is very important for what the input means once it isparsed. A compiler is useless if it fails to distinguish between 4, 1 and3989 as constants in the program! Therefore, each token in a Bison grammarhas both a token type and a <STRONG>semantic value</STRONG>. See section <A HREF="bison_6.html#SEC43">Defining Language Semantics</A>,for details.</P><P>The token type is a terminal symbol defined in the grammar, such as<CODE>INTEGER</CODE>, <CODE>IDENTIFIER</CODE> or <CODE>','</CODE>. It tells everythingyou need to know to decide where the token may validly appear and how togroup it with other tokens. The grammar rules know nothing about tokens
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -