% grammar.tex
% Practical Compiler Construction
% Chapter: Grammar

\chapter{Grammar}

\section{Introduction}

% "syntax errors"
This chapter will introduce the concepts of \emph{language} and
\emph{grammar} in both informal and formal terms. After we have
established exactly what a grammar is, we offer several example
grammars with documentation. This introductory section discusses the
value of the material that follows in writing a compiler.

A compiler can be thought of as a sequence of actions, performed on
some code (formulated in the source language), that transform that
code into the desired output. For example, a Pascal compiler
transforms Pascal code to assembly code, and a Java compiler
transforms Java code to its corresponding Java bytecode.

If you have used a compiler in the past, you may be familiar with
``syntax errors''. These occur when the input code does not conform
to the rules laid down by the language specification. You may have
forgotten to terminate a statement with a semicolon, or you may have
used the \verb|THEN| keyword in a C program (the C language defines
no \verb|THEN| keyword). One of the things that a compiler does when
transforming source code to target code is check the structure of
the source code. This is a required step before the compiler can
move on to the rest of the translation.

The first thing we must do when writing a compiler is write a
grammar for the source language. This chapter explains what a
grammar is and how to create one. Furthermore, it introduces several
common ways of writing down a grammar.

\section{Languages}

% Natural vs formal languages.
In this section we will try to formalize the concept of a
\emph{language}. When thinking of languages, the first languages
that usually come to mind are \emph{natural languages} like English
or French. This is a class of languages that we will only consider
in passing here, since they are very difficult for a computer to
understand. There is another class of languages, the \emph{computer}
or \emph{formal} languages, which are far easier to parse since they
obey a rigid set of rules. This is in contrast with natural
languages, whose lenient rules allow the speaker a great deal of
freedom in expressing himself.

% Definition of alphabet.
All languages draw the words that they allow to be used from a pool
of words, called the \emph{alphabet}. This name may be confusing,
because we tend to think of the alphabet as the 26 Latin letters, A
through Z. However, the definition of a language is not concerned
with how its most basic elements, the words, are constructed from
individual letters, but with how these words are strung together. In
definitions, an alphabet is denoted as $\Sigma$.

% L is a part of S*.
A language is a collection of sentences or \emph{strings}. From all
the words that a language allows, many sentences can be built, but
only some of these sentences are valid for the language under
consideration. The set of all sentences that may be constructed from
an alphabet $\Sigma$ is denoted $\Sigma^{*}$. Also, there exists a
special sentence: the sentence with no words in it. This sentence is
denoted $\lambda$. In definitions, we refer to words using lowercase
letters at the beginning of our alphabet ($a, b, c$\ldots), while we
refer to sentences using letters near the end of our alphabet
($u, v, w, x$\ldots). We will now define how sentences may be built
from words.

% What is S, what is S*?
\begin{definition}
\label{definition:alphabet}
Let $\Sigma$ be an alphabet. $\Sigma^{*}$, the set of strings over
$\Sigma$, is defined recursively as follows:
\begin{enumerate}
  \item Basis: $\lambda \in \Sigma^{*}$.
  \item Recursive step: if $w \in \Sigma^{*}$ and $a \in \Sigma$,
        then $wa \in \Sigma^{*}$.
  \item Closure: $w \in \Sigma^{*}$ only if it can be obtained from
        $\lambda$ by a finite number of applications of the
        recursive step.
\end{enumerate}
\end{definition}

This definition may need some explanation. It is stated using
\emph{induction}; what this means will become clear in a moment. In
the \emph{basis} (line 1 of the definition), we state that the empty
string ($\lambda$) is a sentence over $\Sigma$. This is a statement,
not a proof. We just state that for any alphabet $\Sigma$, the empty
string $\lambda$ is among the sentences that may be constructed from
it.

In the \emph{recursive step} (line 2 of the definition), we state
that given a string $w$ that is part of $\Sigma^{*}$, the string
$wa$ is also part of $\Sigma^{*}$. Note that $w$ denotes a string,
and $a$ denotes a single word. Therefore what we mean is that given
a string generated from the alphabet, we may append any word from
that alphabet to it and the resulting string will still be part of
the set of strings that can be generated from the alphabet.

Finally, in the \emph{closure} (line 3 of the definition), we add
that all the strings that can be built using the basis and recursive
step are part of the set of strings over $\Sigma$, and all other
strings are not. You can think of this as a sort of safeguard for
the definition. In most inductive definitions, we will leave the
closure line out.
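The inductive definition can be made a little more concrete with a
small program. The following C fragment is only a sketch (the
four-word alphabet and the length bound \verb|MAX_WORDS| are
arbitrary, illustrative choices): it starts from the empty string
$\lambda$ (the basis) and repeatedly appends a word from the
alphabet (the recursive step), printing every string in
$\Sigma^{*}$ of at most two words.
\begin{quote}\begin{verbatim}
#include <stdio.h>

static const char *alphabet[] = { "dog", "bone", "the", "eats" };
#define ALPHABET_SIZE 4
#define MAX_WORDS     2   /* bound the length so the output is finite */

/* Print 'sentence' (a sequence of word indices), then extend it by
 * one more word, as long as the length bound allows it. */
static void enumerate(int sentence[], int length)
{
    int i;

    if (length == 0) {
        printf("(lambda)\n");         /* the empty string */
    } else {
        for (i = 0; i < length; i++)
            printf("%s ", alphabet[sentence[i]]);
        printf("\n");
    }

    if (length == MAX_WORDS) return;  /* stop extending */

    for (i = 0; i < ALPHABET_SIZE; i++) {
        sentence[length] = i;         /* recursive step: w -> wa */
        enumerate(sentence, length + 1);
    }
}

int main(void)
{
    int sentence[MAX_WORDS];
    enumerate(sentence, 0);           /* basis: start from lambda */
    return 0;
}
\end{verbatim}\end{quote}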
Is $\Sigma^{*}$, then, a language? The answer is no. $\Sigma^{*}$ is
the set of all possible strings that may be built using the alphabet
$\Sigma$. Only some of these strings are actually valid for a
language. Therefore a language over an alphabet $\Sigma$ is a subset
of $\Sigma^{*}$.

As an example, consider a small part of the English language, with
the alphabet \{ 'dog', 'bone', 'the', 'eats' \} (we cannot consider
the actual English language, as it has far too many words to list
here). From this alphabet, we can derive strings using definition
\ref{definition:alphabet}:
\begin{quote}\emph{
$\lambda$ \\
dog \\
dog dog dog \\
bone dog the \\
the dog eats the bone \\
the bone eats the dog
}\end{quote}
Many more strings are possible, but we can at least see that most of
the strings above are not valid for the English language: their
structure does not obey the rules of English grammar. Thus we may
conclude that a language over an alphabet $\Sigma$ is a subset of
$\Sigma^{*}$ \emph{that follows certain grammar rules}.

If you are wondering how all this relates to compiler construction,
you should realize that one of the things that a compiler does is
check the structure of its input by applying grammar rules. If the
structure is off, the compiler prints a syntax error.

% \includegraphics[width=10cm,height=2cm]{syntax.wmf}
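The following C fragment sketches such a structure check on the tiny
alphabet above, assuming one hypothetical grammar rule: a valid
sentence must have the form \emph{the noun eats the noun}, where
\emph{noun} is either 'dog' or 'bone'. Everything else is rejected
with a syntax error. A real compiler performs essentially the same
kind of check, only against a much richer set of rules; the rule and
the helper functions here are purely illustrative.
\begin{quote}\begin{verbatim}
#include <stdio.h>
#include <string.h>

static int is_noun(const char *w)
{
    return strcmp(w, "dog") == 0 || strcmp(w, "bone") == 0;
}

/* Check a sentence against the rule: the <noun> eats the <noun>. */
static int follows_rule(const char *words[], int count)
{
    return count == 5
        && strcmp(words[0], "the") == 0
        && is_noun(words[1])
        && strcmp(words[2], "eats") == 0
        && strcmp(words[3], "the") == 0
        && is_noun(words[4]);
}

int main(void)
{
    const char *good[] = { "the", "dog", "eats", "the", "bone" };
    const char *bad[]  = { "bone", "dog", "the" };

    printf("%s\n", follows_rule(good, 5) ? "accepted" : "syntax error");
    printf("%s\n", follows_rule(bad, 3)  ? "accepted" : "syntax error");
    return 0;
}
\end{verbatim}\end{quote}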
\section{Syntax and Semantics}

To illustrate the concept of grammar, let us examine the following
line of text:
\begin{quote}\begin{verbatim}
jumps the fox the dog over
\end{verbatim}\end{quote}
Since it obviously does not obey the rules of English grammar, this
sentence is meaningless. It is said to be \emph{syntactically
incorrect}. The \emph{syntax} of a sentence is its form or
structure. Every sentence in a language must obey that language's
syntax for it to have meaning.

Here is another example of a sentence, whose meaning is unclear:
\begin{quote}\begin{verbatim}
the fox drinks the color red
\end{verbatim}\end{quote}
Though possibly considered a wondrous statement in Vogon poetry,
this statement has no meaning in the English language. We know that
the color red cannot be drunk, so although the sentence is
syntactically correct, it conveys no useful information and is
therefore considered incorrect. Sentences whose structure conforms
to a language's syntax but whose meaning cannot be understood are
said to be \emph{semantically} incorrect.

The purpose of a grammar is to give a set of rules which all the
sentences of a certain language must follow. It should be noted that
speakers of a natural language (like English or French) will
generally understand sentences that deviate from these rules, but
this is not so for the formal languages used by computers. All
sentences are required to adhere to the grammar rules without
deviation for a computer to be able to understand them.

% TODO: Which chapter are we pointing to below?
Because language semantics are hard to express in a set of rules
(although we will show a way to deal with semantics in a subsequent
chapter), grammars deal with syntax only: a grammar defines the
structure of sentences in a language.

\section{Words}

In a grammar, we are not usually interested in the individual
letters that make up a word, but in the words themselves. We can
give these words names so that we can refer to them in a grammar.
For example, there are very many words that can be the subject or
object of an English sentence ('fox', 'dog', 'chair' and so on) and
it would not be feasible to list them all. Therefore we simply refer
to them as 'noun'. In the same way we can give all words that
precede nouns to add extra information (like 'brown', 'lazy' and
'small') a name too: 'adjective'. We call the set of all articles
('the', 'a', 'an') 'article'. Finally, we call all verbs 'verb'.
Each of these names represents a set of many words, with the
exception of 'article', which has all its members already listed.

Armed with this new terminology, we are now in a position to
describe the form of a very simple English sentence:
\begin{quote}
\emph{sentence ::= article adjective noun verb adjective noun.}
\end{quote}
From this lone \emph{grammar rule}, we can generate English
sentences. We can replace every set name to the right of \emph{::=}
with one of its elements. For example, we can replace \emph{article}
with 'the', \emph{adjective} with 'quick', \emph{noun} with 'fox'
and so on. This way we can build sentences such as
\begin{quote}\begin{verbatim}
the quick fox eats a delicious banana
the delicious banana thinks the quick fox
a quick banana outruns a delicious fox
\end{verbatim}\end{quote}
The structure of these sentences matches the preceding rule, which
means that they conform to the syntax we specified. Incidentally,
some of these sentences have no real meaning, thus illustrating that
semantic rules are not included in the grammar rules we discuss
here. We have just defined a grammar, even though it contains only
one rule that allows only one type of sentence. Note that our
grammar is a so-called \emph{abstract grammar}, since it does not
specify the actual words that we may use to replace the word classes
(article, noun, verb) that we introduced.
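The rule can be read as a small recipe for building sentences, which
the following C sketch mimics: every word class on the right-hand
side of the rule is replaced by one of its members, chosen at
random. The word lists are only illustrative examples of members of
each class; they are not part of the grammar itself.
\begin{quote}\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative members of each word class. */
static const char *article[]   = { "the", "a" };
static const char *adjective[] = { "quick", "delicious" };
static const char *noun[]      = { "fox", "banana" };
static const char *verb[]      = { "eats", "thinks", "outruns" };

/* Pick a random member of a word class. */
static const char *pick(const char *members[], int size)
{
    return members[rand() % size];
}

int main(void)
{
    int i;

    srand((unsigned) time(NULL));

    /* sentence ::= article adjective noun verb adjective noun. */
    for (i = 0; i < 5; i++) {
        printf("%s %s %s %s %s %s\n",
               pick(article, 2), pick(adjective, 2), pick(noun, 2),
               pick(verb, 3),    pick(adjective, 2), pick(noun, 2));
    }
    return 0;
}
\end{verbatim}\end{quote}
Every sentence this program prints is syntactically correct with
respect to the rule, even though some of them (such as ``the
delicious banana thinks the quick fox'') make little sense, which
again shows that the rule captures syntax only.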
So far we have given names to classes of individual words. We can
also assign names to common combinations of words. This requires
multiple rules, making the individual rules simpler:
\begin{quote}
\emph{sentence ::= object verb object. \\
object ::= article adjective noun.}
\end{quote}
This grammar generates the same sentences as the previous one, but
is somewhat easier to read.

Now we will also limit the choices that we can make when replacing
word classes by introducing some more rules:
\begin{quote}
\emph{hello}
\end{quote}

word classes, constructions, auxiliary symbols and finally:
nonterminal vs. terminal.

\section{Theory}

(V,N,S,P)

\section{Notation}

syntax diagrams, BNF, EBNF

\section{Syntax Trees}

graphics!

\section{Common Pitfalls}

\section{Example Grammars}

Calculation grammar, Logic grammar

\chapter{Parsing}

\section{Parsing theory}

(Meijer).