% grammar.tex
% Practical Compiler Construction
% Chapter: Grammar

\chapter{Grammar}

\section{Introduction}

% "syntax errors"
This chapter will introduce the concepts of \emph{language} and
\emph{grammar} in both informal and formal terms. After we have
established exactly what a grammar is, we offer several example
grammars with documentation. This introductory section discusses the
value of the material that follows in writing a compiler.

A compiler can be thought of as a sequence of actions, performed on
some code (formulated in the source language), that transform that
code into the desired output. For example, a Pascal compiler
transforms Pascal code to assembly code, and a Java compiler
transforms Java code to its corresponding Java bytecode.

If you have used a compiler in the past, you may be familiar with
``syntax errors''. These occur when the input code does not conform
to the rules laid down by the language specification. You may have
forgotten to terminate a statement with a semicolon, or you may have
used the \verb|THEN| keyword in a C program (the C language defines
no \verb|THEN| keyword). One of the things that a compiler does when
transforming source code to target code is check the structure of
the source code. This is a required step before the compiler can
move on to the rest of the translation.

The first thing we must do when writing a compiler is write a
grammar for the source language. This chapter explains what a
grammar is and how to create one. Furthermore, it introduces several
common ways of writing down a grammar.

\section{Languages}

% Natural vs formal languages.
In this section we will try to formalize the concept of a
\emph{language}. When thinking of languages, the first languages
that usually come to mind are \emph{natural languages} like English
or French. This is a class of languages that we will only consider
in passing here, since they are very difficult for a computer to
understand. There is another class of languages, the \emph{computer}
or \emph{formal} languages, which are far easier to parse since they
obey a rigid set of rules. This is in contrast with natural
languages, whose lenient rules allow the speaker a great deal of
freedom in expressing himself.

% Definition of alphabet.
All languages draw the words that they allow to be used from a pool
of words, called the \emph{alphabet}. This name may be confusing,
because we tend to think of the alphabet as the 26 Latin letters, A
through Z. However, the definition of a language is not concerned
with how its most basic elements, the words, are constructed from
individual letters, but with how these words are strung together. In
definitions, an alphabet is denoted as $\Sigma$.

% L is a part of S*.
A language is a collection of sentences or \emph{strings}. From all
the words that a language allows, many sentences can be built, but
only some of these sentences are valid for the language under
consideration. The set of all sentences that may be constructed from
an alphabet $\Sigma$ is denoted $\Sigma^{*}$. Also, there exists a
special sentence: the sentence with no words in it. This sentence is
denoted $\lambda$. In definitions, we refer to words using lowercase
letters at the beginning of our alphabet ($a, b, c$\ldots), while we
refer to sentences using letters near the end of our alphabet
($u, v, w, x$\ldots). We will now define how sentences may be built
from words.

% What is S, what is S*?
\begin{definition}
\label{definition:alphabet}
Let $\Sigma$ be an alphabet. $\Sigma^{*}$, the set of strings over
$\Sigma$, is defined recursively as follows:
\begin{enumerate}
  \item Basis: $\lambda \in \Sigma^{*}$.
  \item Recursive step: if $w \in \Sigma^{*}$ and $a \in \Sigma$,
        then $wa \in \Sigma^{*}$.
  \item Closure: $w \in \Sigma^{*}$ only if it can be obtained from
        $\lambda$ by a finite number of applications of the
        recursive step.
\end{enumerate}
\end{definition}

This definition may need some explanation. It is stated using
\emph{induction}; what this means will become clear in a moment. In
the \emph{basis} (line 1 of the definition), we state that the empty
string ($\lambda$) is a sentence over $\Sigma$. This is a statement,
not a proof. We just state that for any alphabet $\Sigma$, the empty
string $\lambda$ is among the sentences that may be constructed from
it.

In the \emph{recursive step} (line 2 of the definition), we state
that given a string $w$ that is part of $\Sigma^{*}$, the string
$wa$ is also part of $\Sigma^{*}$. Note that $w$ denotes a string,
and $a$ denotes a single word. Therefore what we mean is that given
a string generated from the alphabet, we may append any word from
that alphabet to it and the resulting string will still be part of
the set of strings that can be generated from the alphabet.

Finally, in the \emph{closure} (line 3 of the definition), we add
that all the strings that can be built using the basis and recursive
step are part of the set of strings over $\Sigma$, and all other
strings are not. You can think of this as a sort of safeguard for
the definition. In most inductive definitions, we will leave the
closure line out.
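The inductive definition can be made a little more concrete with a
small program. The following C fragment is only a sketch (the
four-word alphabet and the length bound \verb|MAX_WORDS| are
arbitrary, illustrative choices): it starts from the empty string
$\lambda$ (the basis) and repeatedly appends a word from the
alphabet (the recursive step), printing every string in
$\Sigma^{*}$ of at most two words.
\begin{quote}\begin{verbatim}
#include <stdio.h>

static const char *alphabet[] = { "dog", "bone", "the", "eats" };
#define ALPHABET_SIZE 4
#define MAX_WORDS     2   /* bound the length so the output is finite */

/* Print 'sentence' (a sequence of word indices), then extend it by
 * one more word, as long as the length bound allows it. */
static void enumerate(int sentence[], int length)
{
    int i;

    if (length == 0) {
        printf("(lambda)\n");         /* the empty string */
    } else {
        for (i = 0; i < length; i++)
            printf("%s ", alphabet[sentence[i]]);
        printf("\n");
    }

    if (length == MAX_WORDS) return;  /* stop extending */

    for (i = 0; i < ALPHABET_SIZE; i++) {
        sentence[length] = i;         /* recursive step: w -> wa */
        enumerate(sentence, length + 1);
    }
}

int main(void)
{
    int sentence[MAX_WORDS];
    enumerate(sentence, 0);           /* basis: start from lambda */
    return 0;
}
\end{verbatim}\end{quote}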
Is $\Sigma^{*}$, then, a language? The answer is no. $\Sigma^{*}$ is
the set of all possible strings that may be built using the alphabet
$\Sigma$. Only some of these strings are actually valid for a
language. Therefore a language over an alphabet $\Sigma$ is a subset
of $\Sigma^{*}$.

As an example, consider a small part of the English language, with
the alphabet \{ 'dog', 'bone', 'the', 'eats' \} (we cannot consider
the actual English language, as it has far too many words to list
here). From this alphabet, we can derive strings using definition
\ref{definition:alphabet}:
\begin{quote}\emph{
$\lambda$ \\
dog \\
dog dog dog \\
bone dog the \\
the dog eats the bone \\
the bone eats the dog
}\end{quote}
Many more strings are possible, but we can at least see that most of
the strings above are not valid for the English language: their
structure does not obey the rules of English grammar. Thus we may
conclude that a language over an alphabet $\Sigma$ is a subset of
$\Sigma^{*}$ \emph{that follows certain grammar rules}.

If you are wondering how all this relates to compiler construction,
you should realize that one of the things that a compiler does is
check the structure of its input by applying grammar rules. If the
structure is off, the compiler prints a syntax error.

% \includegraphics[width=10cm,height=2cm]{syntax.wmf}
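The following C fragment sketches such a structure check on the tiny
alphabet above, assuming one hypothetical grammar rule: a valid
sentence must have the form \emph{the noun eats the noun}, where
\emph{noun} is either 'dog' or 'bone'. Everything else is rejected
with a syntax error. A real compiler performs essentially the same
kind of check, only against a much richer set of rules; the rule and
the helper functions here are purely illustrative.
\begin{quote}\begin{verbatim}
#include <stdio.h>
#include <string.h>

static int is_noun(const char *w)
{
    return strcmp(w, "dog") == 0 || strcmp(w, "bone") == 0;
}

/* Check a sentence against the rule: the <noun> eats the <noun>. */
static int follows_rule(const char *words[], int count)
{
    return count == 5
        && strcmp(words[0], "the") == 0
        && is_noun(words[1])
        && strcmp(words[2], "eats") == 0
        && strcmp(words[3], "the") == 0
        && is_noun(words[4]);
}

int main(void)
{
    const char *good[] = { "the", "dog", "eats", "the", "bone" };
    const char *bad[]  = { "bone", "dog", "the" };

    printf("%s\n", follows_rule(good, 5) ? "accepted" : "syntax error");
    printf("%s\n", follows_rule(bad, 3)  ? "accepted" : "syntax error");
    return 0;
}
\end{verbatim}\end{quote}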
\section{Syntax and Semantics}

To illustrate the concept of grammar, let us examine the following
line of text:
\begin{quote}\begin{verbatim}
jumps the fox the dog over
\end{verbatim}\end{quote}
Since it obviously does not obey the rules of English grammar, this
sentence is meaningless. It is said to be \emph{syntactically
incorrect}. The \emph{syntax} of a sentence is its form or
structure. Every sentence in a language must obey that language's
syntax for it to have meaning.

Here is another example of a sentence, whose meaning is unclear:
\begin{quote}\begin{verbatim}
the fox drinks the color red
\end{verbatim}\end{quote}
Though possibly considered a wondrous statement in Vogon poetry,
this statement has no meaning in the English language. We know that
the color red cannot be drunk, so although the sentence is
syntactically correct, it conveys no useful information and is
therefore considered incorrect. Sentences whose structure conforms
to a language's syntax but whose meaning cannot be understood are
said to be \emph{semantically} incorrect.

The purpose of a grammar is to give a set of rules which all the
sentences of a certain language must follow. It should be noted that
speakers of a natural language (like English or French) will
generally understand sentences that deviate from these rules, but
this is not so for the formal languages used by computers. All
sentences are required to adhere to the grammar rules without
deviation for a computer to be able to understand them.

% TODO: Which chapter are we pointing to below?
Because language semantics are hard to express in a set of rules
(although we will show a way to deal with semantics in a subsequent
chapter), grammars deal with syntax only: a grammar defines the
structure of sentences in a language.

\section{Words}

In a grammar, we are not usually interested in the individual
letters that make up a word, but in the words themselves. We can
give these words names so that we can refer to them in a grammar.
For example, there are very many words that can be the subject or
object of an English sentence ('fox', 'dog', 'chair' and so on) and
it would not be feasible to list them all. Therefore we simply refer
to them as 'noun'. In the same way we can give all words that
precede nouns to add extra information (like 'brown', 'lazy' and
'small') a name too: 'adjective'. We call the set of all articles
('the', 'a', 'an') 'article'. Finally, we call all verbs 'verb'.
Each of these names represents a set of many words, with the
exception of 'article', which has all its members already listed.

Armed with this new terminology, we are now in a position to
describe the form of a very simple English sentence:
\begin{quote}
\emph{sentence ::= article adjective noun verb adjective noun.}
\end{quote}
From this lone \emph{grammar rule}, we can generate English
sentences. We can replace every set name to the right of \emph{::=}
with one of its elements. For example, we can replace \emph{article}
with 'the', \emph{adjective} with 'quick', \emph{noun} with 'fox'
and so on. This way we can build sentences such as
\begin{quote}\begin{verbatim}
the quick fox eats a delicious banana
the delicious banana thinks the quick fox
a quick banana outruns a delicious fox
\end{verbatim}\end{quote}
The structure of these sentences matches the preceding rule, which
means that they conform to the syntax we specified. Incidentally,
some of these sentences have no real meaning, thus illustrating that
semantic rules are not included in the grammar rules we discuss
here. We have just defined a grammar, even though it contains only
one rule that allows only one type of sentence. Note that our
grammar is a so-called \emph{abstract grammar}, since it does not
specify the actual words that we may use to replace the word classes
(article, noun, verb) that we introduced.
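The rule can be read as a small recipe for building sentences, which
the following C sketch mimics: every word class on the right-hand
side of the rule is replaced by one of its members, chosen at
random. The word lists are only illustrative examples of members of
each class; they are not part of the grammar itself.
\begin{quote}\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative members of each word class. */
static const char *article[]   = { "the", "a" };
static const char *adjective[] = { "quick", "delicious" };
static const char *noun[]      = { "fox", "banana" };
static const char *verb[]      = { "eats", "thinks", "outruns" };

/* Pick a random member of a word class. */
static const char *pick(const char *members[], int size)
{
    return members[rand() % size];
}

int main(void)
{
    int i;

    srand((unsigned) time(NULL));

    /* sentence ::= article adjective noun verb adjective noun. */
    for (i = 0; i < 5; i++) {
        printf("%s %s %s %s %s %s\n",
               pick(article, 2), pick(adjective, 2), pick(noun, 2),
               pick(verb, 3),    pick(adjective, 2), pick(noun, 2));
    }
    return 0;
}
\end{verbatim}\end{quote}
Every sentence this program prints is syntactically correct with
respect to the rule, even though some of them (such as ``the
delicious banana thinks the quick fox'') make little sense, which
again shows that the rule captures syntax only.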
So far we have given names to classes of individual words. We can
also assign names to common combinations of words. This requires
multiple rules, making the individual rules simpler:
\begin{quote}
\emph{sentence ::= object verb object. \\
object ::= article adjective noun.}
\end{quote}
This grammar generates the same sentences as the previous one, but
is somewhat easier to read.

Now we will also limit the choices that we can make when replacing
word classes by introducing some more rules:
\begin{quote}
\emph{hello}
\end{quote}

word classes, constructions, auxiliary symbols and finally:
nonterminal vs. terminal.

\section{Theory}

(V,N,S,P)

\section{Notation}

syntax diagrams, BNF, EBNF

\section{Syntax Trees}

graphics!

\section{Common Pitfalls}

\section{Example Grammars}

Calculation grammar, Logic grammar

\chapter{Parsing}

\section{Parsing theory}

(Meijer).