<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Chapter 2: The Components of a Compiler</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY>
<CENTER>
<H1>Chapter 2<BR>The Components of a Compiler </H1></CENTER>
<HR>
This book is intended to introduce the reader to compiler construction. But
first, what is a compiler? A compiler is a program that translates sentences
expressed in a source language, let's call it language A, into a target language,
let's call it language B. The translation process performed by a compiler has
important similarities to the translation process performed by a human
translator, so let us begin by looking at how a human translator works.
<P>Consider the translation from French to English. The translator must
comprehend each French sentence and produce a corresponding sentence in English.
The translator receives a text written in the source language and decomposes it
into sentences and words using his knowledge of the language. This "knowledge"
is stored in his "internal memory," that is, his brain, and it consists of the
vocabulary of the language--used to recognize the words--and the grammar of the
language--used to analyze the correctness of the sentences. Once the translator
recognizes all the words and understands the meaning of the sentence, he starts
the process of translating.
<P>Now he shifts his attention from French to English, and in the process of
producing the English sentence he will use his knowledge of the English
language. This knowledge, which is also stored in his memory, consists of the
vocabulary, grammar, and semantics of the words. In short, the translator must
generate a sequence of English words. For any translator, the most difficult
part comes from the fact that he or she cannot translate word by word: the two
languages do not have the same sentence patterns, and yet the meaning of the
sentence must be the same in both.
<P>A compiler follows the same sequence of operations that a human translator
performs. Fortunately for the writer of a compiler, all computer languages are
far simpler than any spoken language, so the task of writing a program that
translates from the source language to the target language is far simpler than
translating from French to English. If the source is a high-level programming
language, like FORTRAN, COBOL, Pascal or Modula, and the target language is a
machine language that is understood and executed by the hardware of a computer,
then the translator is called a compiler. The output of a compiler is sometimes
called object code. If the source is a low-level computer language that is close
to the hardware of a computer and the target is a machine language, then the
translator is called an assembler. The output of an assembler is also called
object code.
<P>One difference between a compiler and an assembler is the level of the source
language: for a compiler the source language is considered high level, while for
an assembler it is considered low level. A rough quantitative measure of the
level of a source language is the amount of object code generated for a given
sentence: a compiler typically generates many machine instructions for a single
sentence, while an assembler generates roughly one machine instruction per
source line. In conclusion, the higher the level of a computer language, the
more object code has to be generated for a given sentence, and as a consequence
the task of translating becomes more complicated.
<P><!---------------------------------------------------------------------------->
<H2>2.1 The Components of a Compiler</H2><!---------------------------------------------------------------------------->A
compiler is composed of several components, each performing one specific task.
Each of these components operates at a slightly higher language level than the
previous one. The job of a compiler is monumental: it must take an arbitrarily
formed body of text and translate it into a binary stream that a computer can
understand.
<P>For starters, let's explain the text source. The text source is a program
written by a programmer or by a programming tool, and it consists of a string of
characters. It can come from a variety of places, such as a disk file, a
terminal, or a magnetic tape. The way a particular source is accessed varies
from case to case, but from the point of view of the person writing the compiler
it makes no difference: in almost all cases the operating system provides the
functions that deliver the text source as input to the compiler.
<P>
<H3>2.1.1 Scanner (Lexical Analyzer)</H3><!---------------------------------------------------------------------------->The
most basic part of the compiler is known by many names. In addition to scanner,
it is also known as a lexical analyzer, tokenizer, and more. Throughout this
book, we will call it by its shortest name: the scanner. This module has the
task of separating the continuous string of characters into distinctive groups
that make sense. Such a group is called a token. A token may be composed of a
single character or a sequence of characters. Examples of tokens are identifiers
(names of variables defined by the programmer), words that have a special
meaning in the language such as begin, end, etc. (the keywords of the language),
numbers, and special character sequences (such as := in SAL or Pascal). The
scanner also eliminates from the text source any comments that appear in the
program, as well as whitespace (tabs, blanks, linefeeds, etc.). For an example,
see figure 2.1. The output of the scanner becomes the input to the next module,
the parser (also called the syntactical analyzer).
<P>
<MENU><IMG src="Chapter 2 The Components of a Compiler.files/TOKENS.gif">
<P><FONT face=arial size=-1><B>Figure 2.1</B> A list of some of the
tokens as a scanner might recognize and classify them. The scanner
extracts the tokens in the order in which they appear, classifying each one.
</FONT></P></MENU>
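<P>To make the idea concrete, the following sketch shows how a very small
scanner might be written. It is given here in Python purely for illustration;
the token categories, the keyword list, and the scan_tokens function are our own
choices for this example, not part of any particular compiler described in this
book.
<P><PRE>import re

# Token categories, tried in order; keywords are recognized after an
# identifier has been matched.  This is an illustrative subset only.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("ASSIGN",     r":="),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"[+\-*/;]"),
    ("SKIP",       r"[ \t\r\n]+"),          # whitespace is discarded
]
KEYWORDS = {"begin", "end", "for", "to", "do", "loop"}

def scan_tokens(source):
    """Split the source string into (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos != len(source):
        for kind, pattern in TOKEN_SPEC:
            match = re.compile(pattern).match(source, pos)
            if match:
                text = match.group()
                pos = match.end()
                if kind == "IDENTIFIER" and text in KEYWORDS:
                    kind = "KEYWORD"
                if kind != "SKIP":           # drop whitespace
                    tokens.append((kind, text))
                break
        else:
            raise SyntaxError("illegal character: " + source[pos])
    return tokens

print(scan_tokens("x := y + 3;"))
# [('IDENTIFIER', 'x'), ('ASSIGN', ':='), ('IDENTIFIER', 'y'),
#  ('OPERATOR', '+'), ('NUMBER', '3'), ('OPERATOR', ';')]
</PRE>A scanner for a real compiler would of course recognize many more token
classes and would also strip comments, but the structure (match a pattern,
classify the text, discard whitespace) remains the same.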
<H3>2.1.2 The Parser</H3><!---------------------------------------------------------------------------->Again,
this component of the compiler is known by many names; its most common name is
the parser. It was the scanner's duty to recognize the individual words, or
tokens, of the language. The scanner does not, however, determine whether these
words have been used correctly. The main task of the parser is to group the
tokens into sentences, that is, to determine whether the sequences of tokens
extracted by the scanner are in the correct order. In other words, until the
parser is reached the tokens have been collected with no regard to the context
of the program as a whole. The parser analyzes the context of each token and
groups the tokens into declarations, statements, and control statements. In the
process of analyzing each sentence, the parser builds abstract tree structures.
A tree is a useful representation of a program segment because it facilitates
transformations of the program that may reduce the number of machine
instructions needed to carry out the required operation, i.e., optimization.
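<P>As a concrete illustration, here is a small sketch of how a recursive-descent
parser might build such a tree for an assignment like x := y + 3;. It is again
written in Python, it reuses the scan_tokens sketch from the previous section,
and the nested-tuple tree layout and the helper names are invented for this
example only.
<P><PRE>def parse_assignment(tokens):
    """Build a nested-tuple tree for: IDENTIFIER ':=' expression ';'"""
    name, pos = expect(tokens, 0, "IDENTIFIER")
    _, pos    = expect(tokens, pos, "ASSIGN")
    expr, pos = parse_expression(tokens, pos)
    _, pos    = expect(tokens, pos, "OPERATOR")    # the terminating ';'
    return ("assign", name, expr)

def parse_expression(tokens, pos):
    """expression = term { '+' term } , built left-associatively."""
    left, pos = parse_term(tokens, pos)
    while pos != len(tokens) and tokens[pos] == ("OPERATOR", "+"):
        right, pos = parse_term(tokens, pos + 1)
        left = ("+", left, right)
    return left, pos

def parse_term(tokens, pos):
    kind, text = tokens[pos]
    if kind in ("IDENTIFIER", "NUMBER"):
        return (kind.lower(), text), pos + 1
    raise SyntaxError("unexpected token: " + text)

def expect(tokens, pos, kind):
    if tokens[pos][0] == kind:
        return tokens[pos][1], pos + 1
    raise SyntaxError("expected " + kind + ", found " + tokens[pos][1])

print(parse_assignment(scan_tokens("x := y + 3;")))
# ('assign', 'x', ('+', ('identifier', 'y'), ('number', '3')))
</PRE>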
<P>
<H3>2.1.3 Symbol Tables</H3><!---------------------------------------------------------------------------->Though
the parser determined the correct usage of tokens and whether or not they
appeared in the correct order, it still did not determine whether the program
said anything that made sense. This type of checking occurs at the semantic
level. In order to perform this task, the compiler makes use of a detailed
system of lists, known as the symbol tables.
<P>In addition to the structural interpretation of the program, the parser also
extracts information about each token. This information refers to the token as
an element--for instance, the name of an identifier, or the value of a
constant--and also to the role of the token in the general context of the
program: for example, whether the token is the name of a procedure, a record, an
array, and so on. All this information is stored in tables and is later used in
the process of generating the object code.
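<P>A minimal sketch of such a table, again in Python and with the field names
chosen only for this illustration, might map each identifier to the attributes
that the later phases will need:
<P><PRE>class SymbolTable:
    """One table per scope; lookups fall back to the enclosing scope."""
    def __init__(self, parent=None):
        self.entries = {}            # identifier name -> attribute dictionary
        self.parent  = parent

    def declare(self, name, kind, type_name):
        if name in self.entries:
            raise NameError("duplicate declaration of " + name)
        self.entries[name] = {"kind": kind, "type": type_name}

    def lookup(self, name):
        if name in self.entries:
            return self.entries[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise NameError("undeclared identifier " + name)

table = SymbolTable()
table.declare("Position", "variable", "real")
table.declare("rate",     "variable", "real")
print(table.lookup("rate"))          # {'kind': 'variable', 'type': 'real'}
</PRE>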
<P>
<H3>2.1.4 Error Messages</H3><!---------------------------------------------------------------------------->It
is only at the three levels of language discussed so far that error checking can
occur; beyond that, it becomes the program developer's duty to detect errors.
Past this point the program will compile correctly, though whether it actually
executes correctly is another matter. As far as compiler-detectable errors go,
there are mainly three types, each corresponding to one of the three levels of
language discussed so far.
<P>The first of these are scanner errors. The most common are illegal or
unrecognized characters, mainly caused by typing errors. A common way for this
to happen is for the programmer to type a character that is illegal in any
context in the language and is simply never used. For instance, the
"!" character is never used in SAL: any occurrence of the exclamation point
other than within a string or character constant is illegal. Another way for
this type of error to happen is to mistype an operator, such as accidentally
typing ";=" instead of ":=". Finally, and quite commonly, the scanner may detect
an unterminated character or string constant. This happens whenever the
programmer types something in quotes and forgets the trailing quote. Again,
these are mostly typing errors.
<P>The second class of errors is syntactic in nature and is caught by the
parser. These errors are among the most common. The really difficult part is
deciding where to continue the syntactical analysis after an error has been
found. If the parser is not carefully written, or if the error detection and
recovery scheme is sloppy, the parser will hit one error, "mess up" after that,
and cascade spurious error messages throughout the rest of the program. In the
case of an error, what one would like to see is the compiler skipping any
improper tokens and continuing to detect errors without generating messages that
do not report real errors but are merely consequences of the first one. This
aspect is so important that some compilers are judged by how good their error
detection and recovery system is. Anyone familiar with programming has certainly
encountered this problem in his or her own experience.
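<P>One widely used way of limiting this cascading, usually called panic-mode
recovery, is to report the first error and then silently skip tokens until a
"synchronization" token is reached, at which point normal parsing resumes. The
sketch below, again in Python, shows the idea; the particular set of
synchronization tokens is our own illustrative choice.
<P><PRE># Tokens at which the parser can safely resume after a syntax error.
# The set below is illustrative; real compilers pick statement
# boundaries, keywords such as begin/end, and similar landmarks.
SYNC_TEXTS = {";", "end", "begin", "for"}

def recover(tokens, pos):
    """Report one error, then skip tokens up to a synchronization point."""
    print("syntax error near:", tokens[pos][1])
    while pos != len(tokens) and tokens[pos][1] not in SYNC_TEXTS:
        pos = pos + 1            # offending tokens are discarded silently
    return pos                   # parsing resumes here, no further messages
</PRE>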
<P>The third type of error is semantic in nature. The semantics used in computer
languages are far simpler than the semantics used in spoken languages, because
in computer languages everything is exactly defined; there are no implied
nuances. The semantic errors that may occur in a program arise because a
statement may be correct from the syntactical point of view and yet make no
sense, so that no code can be generated to carry out its meaning. For example,
the statement A:= B; is correct from a syntactic point of view, but if A is of
type bool and B is of type array, the statement does not make sense and cannot
be executed.
<P>Quite often, semantic and syntactic errors become heavily intermingled. For
instance, suppose A is an int: the programmer might mistakenly call A as if it
were a procedure, A(1, 2, 3), or accidentally treat it like an array, as in
A[i+1]. These errors can be very tricky for a compiler to recover from.
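<P>Detecting such misuse requires walking the tree and consulting the symbol
tables. The following Python fragment is only a sketch of that idea; it reuses
the tree and table shapes from the earlier sketches, and the type names are
purely illustrative.
<P><PRE>def check_assignment(node, table):
    """Reject A := B when the declared types of A and B do not agree."""
    _, target, expr = node                    # ('assign', name, expression)
    target_type = table.lookup(target)["type"]
    expr_type   = type_of(expr, table)
    if target_type != expr_type:
        raise TypeError("cannot assign " + expr_type + " to " + target_type)

def type_of(expr, table):
    kind = expr[0]
    if kind == "number":
        return "integer"
    if kind == "identifier":
        entry = table.lookup(expr[1])
        if entry["kind"] != "variable":       # e.g. an array or a procedure
            raise TypeError(expr[1] + " used where a value is required")
        return entry["type"]
    left_type  = type_of(expr[1], table)      # binary operator node
    right_type = type_of(expr[2], table)
    if left_type != right_type:
        raise TypeError("mixed types in expression")
    return left_type

table = SymbolTable()
table.declare("A", "variable", "bool")
table.declare("B", "array",    "array of integer")
check_assignment(("assign", "A", ("identifier", "B")), table)
# raises TypeError: B used where a value is required
</PRE>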
<P>
<H3>2.1.5 Code Optimization</H3><!---------------------------------------------------------------------------->The
structure of the tree generated by the parser can be rearranged to suit the
needs of the machine architecture, or the tree can be restructured in order to
produce object code that runs faster. What results is a simplified tree that may
be used to generate the object code. An example of such an optimization is the
following FOR loop:
<P><PRE> for a:=1 to 2000 do
   x:= y + 3;
   q:= a * i;
 loop;
</PRE>In this loop there are instructions that do not depend at all on the
variable a of the for statement, such as x:= y + 3. In the program above this
instruction is executed unnecessarily 2000 times. The object code can be
generated so that the machine code for that line is moved outside the loop body
(either before or after the for statement). The resulting object code is more
efficient at run time because the loop has been significantly tightened, and the
instruction in question is executed only once instead of 2000 times.
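<P>This transformation is usually called loop-invariant code motion. Written out
in Python-like form, with arbitrary values supplied so the fragment runs, the
rewrite looks like this; the before and after versions are purely illustrative
and are not the output of any particular compiler.
<P><PRE>y, i = 4, 7          # arbitrary values so the fragment runs as written

# Before optimization: the assignment to x does not depend on the loop
# variable a, yet it is evaluated on every one of the 2000 iterations.
for a in range(1, 2001):
    x = y + 3
    q = a * i

# After loop-invariant code motion: the invariant assignment has been
# hoisted out of the loop and is evaluated exactly once.
x = y + 3
for a in range(1, 2001):
    q = a * i
</PRE>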
<P>In the process of generating the code, the program may go through several
stages of transformation. For example, consider the expression:
<P><PRE> Position:= initial + rate*60;
</PRE>After the compiler has broken the stream of text down into tokens, parsed
the tokens and created the tree, and finally performed the necessary type
checking, it might create a sequence of three-address instructions. A
three-address instruction contains no more than three operands (registers) and,
in addition to an assignment, only one other operator [Aho85]. Three-address
code also assumes an unlimited supply of registers. The above statement rendered
in three-address code might be:
<P><PRE> temp1:= inttoreal(60)
temp2:= id3 * temp1
temp3:= id2 + temp2
id1:= temp3
</PRE>The next phase would be to optimize the code by applying algebraic rules
to it. A subsequent optimization of the above code might be:
<P><PRE> temp1:= id3 * 60.0
id1:= id2 + temp1
</PRE>Once the intermediate code has been optimized, the compiler can generate
the final object code.
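<P>To give one last concrete illustration, the constant-folding step of the
optimization above, replacing inttoreal(60) with the real constant 60.0, can be
sketched as a small pass over a list of three-address instructions. The tuple
encoding below is our own and is used only for illustration; the further step of
propagating copies through the temporaries, which yields the two-instruction
form shown above, would be a separate pass.
<P><PRE>def fold_constants(instructions):
    """Fold inttoreal(c) on an integer constant c at compile time."""
    known = {}                                  # temporary name -> constant
    folded = []
    for dest, op, args in instructions:
        if op == "inttoreal" and isinstance(args[0], int):
            known[dest] = float(args[0])        # the instruction disappears
            continue
        args = tuple(known.get(a, a) for a in args)
        folded.append((dest, op, args))
    return folded

code = [
    ("temp1", "inttoreal", (60,)),
    ("temp2", "*",         ("id3", "temp1")),
    ("temp3", "+",         ("id2", "temp2")),
    ("id1",   ":=",        ("temp3",)),
]
for instr in fold_constants(code):
    print(instr)
# ('temp2', '*', ('id3', 60.0))
# ('temp3', '+', ('id2', 'temp2'))
# ('id1', ':=', ('temp3',))
</PRE>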