<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Chapter 2: The Components of a Compiler</TITLE>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1"></HEAD>
<BODY>
<CENTER>
<H1>Chapter 2<BR>The Components of a Compiler </H1></CENTER>
<HR>
This book is intended to introduce the reader to compiler construction. But
first, what is a compiler? A compiler is a program that translates sentences
expressed in a source language, let's call it language A, into a target language,
let's call it language B. The translation process performed by a compiler has
important similarities to the translation process performed by a human
translator, so let us begin by looking at how a human translator works.
<P>Consider the translation from French to English. The translator must
comprehend each French sentence and produce a corresponding sentence in English.
The translator receives a text written in the source language and decomposes it
into sentences and words using his knowledge of the language. This "knowledge"
is stored in his "internal memory," that is, his brain, and it consists of the
vocabulary of the language--used to recognize the words--and the grammar of the
language--used to analyze the correctness of the sentences. Once the translator
recognizes all the words and understands the meaning of the sentence, he starts
the process of translating.
<P>Now he shifts his attention from French to English, and in the process of
producing the English sentence he will use his knowledge of the English
language. This knowledge, which is also stored in his memory, consists of the
vocabulary, grammar, and semantics of the words. In short, the translator must
generate a sequence of English words. For any translator, the most difficult
part comes from the fact that he or she cannot translate word by word: the two
languages do not have the same sentence patterns, and yet the meaning of the
sentence must be the same in both.
<P>A compiler follows the same sequence of operations that a human translator
performs. Fortunately for the writer of a compiler, all computer languages are
far simpler than any spoken language, so the task of writing a program that
translates from the source language to the target language is far simpler than
translating from French to English. If the source is a high-level programming
language, like FORTRAN, COBOL, Pascal or Modula, and the target language is a
machine language that is understood and executed by the hardware of a computer,
then the translator is called a compiler. The output of a compiler is sometimes
called object code. If the source is a low-level computer language that is close
to the hardware of a computer and the target is a machine language, then the
translator is called an assembler. The output of an assembler is also called
object code.
<P>One difference between a compiler and an assembler is the level of the source
language: for a compiler the source language is considered high level, while for
an assembler it is considered low level. A rough quantitative measure of the
level of a source language is the amount of object code generated for a given
sentence: a compiler typically generates many machine instructions for a single
sentence, while an assembler generates roughly one machine instruction per
source line. In conclusion, the higher the level of a computer language, the
more object code has to be generated for a given sentence, and as a consequence
the task of translating becomes more complicated.
<P><!---------------------------------------------------------------------------->
<H2>2.1 The Components of a Compiler</H2><!---------------------------------------------------------------------------->A
compiler is composed of several components, each performing one specific task.
Each of these components operates at a slightly higher language level than the
previous one. The job of a compiler is monumental: it must take an arbitrarily
formed body of text and translate it into a binary stream that a computer can
understand.
<P>For starters, let's explain the text source. The text source is a program
written by a programmer or by a programming tool, and it consists of a string of
characters. It can come from a variety of places, such as a disk file, a
terminal, or a magnetic tape. The way a particular source is accessed varies
from case to case, but from the point of view of the person writing the compiler
it makes no difference: in almost all cases the operating system provides the
functions that deliver the text source as input to the compiler.
<P>
<H3>2.1.1 Scanner (Lexical Analyzer)</H3><!---------------------------------------------------------------------------->The
most basic part of the compiler is known by many names. In addition to scanner,
it is also known as a lexical analyzer, tokenizer, and more. Throughout this
book, we will call it by its shortest name: the scanner. This module has the
task of separating the continuous string of characters into distinctive groups
that make sense. Such a group is called a token. A token may be composed of a
single character or a sequence of characters. Examples of tokens are identifiers
(names of variables defined by the programmer), words that have a special
meaning in the language such as begin, end, etc. (the keywords of the language),
numbers, and special character sequences (such as := in SAL or Pascal). The
scanner also eliminates from the text source any comments that appear in the
program, as well as whitespace (tabs, blanks, linefeeds, etc.). For an example,
see figure 2.1. The output of the scanner becomes the input to the next module,
the parser (also called the syntactical analyzer).
<P>
<MENU><IMG src="Chapter 2 The Components of a Compiler.files/TOKENS.gif">
<P><FONT face=arial size=-1><B>Figure 2.1</B> A list of some of the
tokens as a scanner might recognize and classify them. The scanner
extracts the tokens in the order in which they appear, classifying each one.
</FONT></P></MENU>
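<P>To make the idea concrete, the following sketch shows how a very small
scanner might be written. It is given here in Python purely for illustration;
the token categories, the keyword list, and the scan_tokens function are our own
choices for this example, not part of any particular compiler described in this
book.
<P><PRE>import re

# Token categories, tried in order; keywords are recognized after an
# identifier has been matched.  This is an illustrative subset only.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("ASSIGN",     r":="),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"[+\-*/;]"),
    ("SKIP",       r"[ \t\r\n]+"),          # whitespace is discarded
]
KEYWORDS = {"begin", "end", "for", "to", "do", "loop"}

def scan_tokens(source):
    """Split the source string into (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos != len(source):
        for kind, pattern in TOKEN_SPEC:
            match = re.compile(pattern).match(source, pos)
            if match:
                text = match.group()
                pos = match.end()
                if kind == "IDENTIFIER" and text in KEYWORDS:
                    kind = "KEYWORD"
                if kind != "SKIP":           # drop whitespace
                    tokens.append((kind, text))
                break
        else:
            raise SyntaxError("illegal character: " + source[pos])
    return tokens

print(scan_tokens("x := y + 3;"))
# [('IDENTIFIER', 'x'), ('ASSIGN', ':='), ('IDENTIFIER', 'y'),
#  ('OPERATOR', '+'), ('NUMBER', '3'), ('OPERATOR', ';')]
</PRE>A scanner for a real compiler would of course recognize many more token
classes and would also strip comments, but the structure (match a pattern,
classify the text, discard whitespace) remains the same.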
<H3>2.1.2 The Parser</H3><!---------------------------------------------------------------------------->Again,
this component of the compiler is known by many names; its most common name is
the parser. It was the scanner's duty to recognize the individual words, or
tokens, of the language. The scanner does not, however, determine whether these
words have been used correctly. The main task of the parser is to group the
tokens into sentences, that is, to determine whether the sequences of tokens
extracted by the scanner are in the correct order. In other words, until the
parser is reached the tokens have been collected with no regard to the context
of the program as a whole. The parser analyzes the context of each token and
groups the tokens into declarations, statements, and control statements. In the
process of analyzing each sentence, the parser builds abstract tree structures.
A tree is a useful representation of a program segment because it facilitates
transformations of the program that may reduce the number of machine
instructions needed to carry out the required operation, i.e., optimization.
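<P>As a concrete illustration, here is a small sketch of how a recursive-descent
parser might build such a tree for an assignment like x := y + 3;. It is again
written in Python, it reuses the scan_tokens sketch from the previous section,
and the nested-tuple tree layout and the helper names are invented for this
example only.
<P><PRE>def parse_assignment(tokens):
    """Build a nested-tuple tree for: IDENTIFIER ':=' expression ';'"""
    name, pos = expect(tokens, 0, "IDENTIFIER")
    _, pos    = expect(tokens, pos, "ASSIGN")
    expr, pos = parse_expression(tokens, pos)
    _, pos    = expect(tokens, pos, "OPERATOR")    # the terminating ';'
    return ("assign", name, expr)

def parse_expression(tokens, pos):
    """expression = term { '+' term } , built left-associatively."""
    left, pos = parse_term(tokens, pos)
    while pos != len(tokens) and tokens[pos] == ("OPERATOR", "+"):
        right, pos = parse_term(tokens, pos + 1)
        left = ("+", left, right)
    return left, pos

def parse_term(tokens, pos):
    kind, text = tokens[pos]
    if kind in ("IDENTIFIER", "NUMBER"):
        return (kind.lower(), text), pos + 1
    raise SyntaxError("unexpected token: " + text)

def expect(tokens, pos, kind):
    if tokens[pos][0] == kind:
        return tokens[pos][1], pos + 1
    raise SyntaxError("expected " + kind + ", found " + tokens[pos][1])

print(parse_assignment(scan_tokens("x := y + 3;")))
# ('assign', 'x', ('+', ('identifier', 'y'), ('number', '3')))
</PRE>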
<P>
<H3>2.1.3 Symbol Tables</H3><!---------------------------------------------------------------------------->Though
the parser determined the correct usage of tokens and whether or not they
appeared in the correct order, it still did not determine whether the program
said anything that made sense. This type of checking occurs at the semantic
level. In order to perform this task, the compiler makes use of a detailed
system of lists, known as the symbol tables.
<P>In addition to the structural interpretation of the program, the parser also
extracts information about each token. This information refers to the token as
an element--for instance, the name of an identifier, or the value of a
constant--and also to the role of the token in the general context of the
program: for example, whether the token is the name of a procedure, a record, an
array, and so on. All this information is stored in tables and is later used in
the process of generating the object code.
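<P>A minimal sketch of such a table, again in Python and with the field names
chosen only for this illustration, might map each identifier to the attributes
that the later phases will need:
<P><PRE>class SymbolTable:
    """One table per scope; lookups fall back to the enclosing scope."""
    def __init__(self, parent=None):
        self.entries = {}            # identifier name -> attribute dictionary
        self.parent  = parent

    def declare(self, name, kind, type_name):
        if name in self.entries:
            raise NameError("duplicate declaration of " + name)
        self.entries[name] = {"kind": kind, "type": type_name}

    def lookup(self, name):
        if name in self.entries:
            return self.entries[name]
        if self.parent is not None:
            return self.parent.lookup(name)
        raise NameError("undeclared identifier " + name)

table = SymbolTable()
table.declare("Position", "variable", "real")
table.declare("rate",     "variable", "real")
print(table.lookup("rate"))          # {'kind': 'variable', 'type': 'real'}
</PRE>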
<P>
<H3>2.1.4 Error Messages</H3><!---------------------------------------------------------------------------->It
is only at the three levels of language discussed so far that error checking can
occur; beyond that, it becomes the program developer's duty to detect errors.
Past this point the program will compile correctly, though whether it actually
executes correctly is another matter. As far as compiler-detectable errors go,
there are mainly three types, each corresponding to one of the three levels of
language discussed so far.
<P>The first of these are scanner errors. The most common are illegal or
unrecognized characters, mainly caused by typing errors. A common way for this
to happen is for the programmer to type a character that is illegal in any
context in the language and is simply never used. For instance, the
"!" character is never used in SAL: any occurrence of the exclamation point
other than within a string or character constant is illegal. Another way for
this type of error to happen is to mistype an operator, such as accidentally
typing ";=" instead of ":=". Finally, and quite commonly, the scanner may detect
an unterminated character or string constant. This happens whenever the
programmer types something in quotes and forgets the trailing quote. Again,
these are mostly typing errors.
<P>The second class of errors is syntactic in nature and is caught by the
parser. These errors are among the most common. The really difficult part is
deciding where to continue the syntactical analysis after an error has been
found. If the parser is not carefully written, or if the error detection and
recovery scheme is sloppy, the parser will hit one error, "mess up" after that,
and cascade spurious error messages throughout the rest of the program. In the
case of an error, what one would like to see is the compiler skipping any
improper tokens and continuing to detect errors without generating messages that
do not report real errors but are merely consequences of the first one. This
aspect is so important that some compilers are judged by how good their error
detection and recovery system is. Anyone familiar with programming has certainly
encountered this problem in his or her own experience.
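<P>One widely used way of limiting this cascading, usually called panic-mode
recovery, is to report the first error and then silently skip tokens until a
"synchronization" token is reached, at which point normal parsing resumes. The
sketch below, again in Python, shows the idea; the particular set of
synchronization tokens is our own illustrative choice.
<P><PRE># Tokens at which the parser can safely resume after a syntax error.
# The set below is illustrative; real compilers pick statement
# boundaries, keywords such as begin/end, and similar landmarks.
SYNC_TEXTS = {";", "end", "begin", "for"}

def recover(tokens, pos):
    """Report one error, then skip tokens up to a synchronization point."""
    print("syntax error near:", tokens[pos][1])
    while pos != len(tokens) and tokens[pos][1] not in SYNC_TEXTS:
        pos = pos + 1            # offending tokens are discarded silently
    return pos                   # parsing resumes here, no further messages
</PRE>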
<P>The third type of error is semantic in nature. The semantics used in computer
languages are far simpler than the semantics used in spoken languages, because
in computer languages everything is exactly defined; there are no implied
nuances. The semantic errors that may occur in a program arise because a
statement may be correct from the syntactical point of view and yet make no
sense, so that no code can be generated to carry out its meaning. For example,
the statement A:= B; is correct from a syntactic point of view, but if A is of
type bool and B is of type array, the statement does not make sense and cannot
be executed.
<P>Quite often, semantic and syntactic errors become heavily intermingled. For
instance, suppose A is an int: the programmer might mistakenly call A as if it
were a procedure, A(1, 2, 3), or accidentally treat it like an array, as in
A[i+1]. These errors can be very tricky for a compiler to recover from.
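<P>Detecting such misuse requires walking the tree and consulting the symbol
tables. The following Python fragment is only a sketch of that idea; it reuses
the tree and table shapes from the earlier sketches, and the type names are
purely illustrative.
<P><PRE>def check_assignment(node, table):
    """Reject A := B when the declared types of A and B do not agree."""
    _, target, expr = node                    # ('assign', name, expression)
    target_type = table.lookup(target)["type"]
    expr_type   = type_of(expr, table)
    if target_type != expr_type:
        raise TypeError("cannot assign " + expr_type + " to " + target_type)

def type_of(expr, table):
    kind = expr[0]
    if kind == "number":
        return "integer"
    if kind == "identifier":
        entry = table.lookup(expr[1])
        if entry["kind"] != "variable":       # e.g. an array or a procedure
            raise TypeError(expr[1] + " used where a value is required")
        return entry["type"]
    left_type  = type_of(expr[1], table)      # binary operator node
    right_type = type_of(expr[2], table)
    if left_type != right_type:
        raise TypeError("mixed types in expression")
    return left_type

table = SymbolTable()
table.declare("A", "variable", "bool")
table.declare("B", "array",    "array of integer")
check_assignment(("assign", "A", ("identifier", "B")), table)
# raises TypeError: B used where a value is required
</PRE>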
<P>
<H3>2.1.5 Code Optimization</H3><!---------------------------------------------------------------------------->The
structure of the tree generated by the parser can be rearranged to suit the
needs of the machine architecture, or the tree can be restructured in order to
produce object code that runs faster. What results is a simplified tree that may
be used to generate the object code. An example of such an optimization is the
following FOR loop:
<P><PRE> for a:=1 to 2000 do
   x:= y + 3;
   q:= a * i;
 loop;
</PRE>In this loop there are instructions that do not depend at all on the
variable a of the for statement, such as x:= y + 3. In the program above this
instruction is executed unnecessarily 2000 times. The object code can be
generated so that the machine code for that line is moved outside the loop body
(either before or after the for statement). The resulting object code is more
efficient at run time because the loop has been significantly tightened, and the
instruction in question is executed only once instead of 2000 times.
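<P>This transformation is usually called loop-invariant code motion. Written out
in Python-like form, with arbitrary values supplied so the fragment runs, the
rewrite looks like this; the before and after versions are purely illustrative
and are not the output of any particular compiler.
<P><PRE>y, i = 4, 7          # arbitrary values so the fragment runs as written

# Before optimization: the assignment to x does not depend on the loop
# variable a, yet it is evaluated on every one of the 2000 iterations.
for a in range(1, 2001):
    x = y + 3
    q = a * i

# After loop-invariant code motion: the invariant assignment has been
# hoisted out of the loop and is evaluated exactly once.
x = y + 3
for a in range(1, 2001):
    q = a * i
</PRE>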
<P>In the process of generating the code, the program may go through several
stages of transformation. For example, consider the expression:
<P><PRE> Position:= initial + rate*60;
</PRE>After the compiler has broken the stream of text down into tokens, parsed
the tokens and created the tree, and finally performed the necessary type
checking, it might create a sequence of three-address instructions. A
three-address instruction contains no more than three operands (registers) and,
in addition to an assignment, only one other operator [Aho85]. Three-address
code also assumes an unlimited supply of registers. The above statement rendered
in three-address code might be:
<P><PRE> temp1:= inttoreal(60)
temp2:= id3 * temp1
temp3:= id2 + temp2
id1:= temp3
</PRE>The next phase would be to optimize the code by applying algebraic rules
to it. A subsequent optimization of the above code might be:
<P><PRE> temp1:= id3 * 60.0
id1:= id2 + temp1
</PRE>Once the intermediate code has been optimized, the compiler can generate
the final object code.
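<P>To give one last concrete illustration, the constant-folding step of the
optimization above, replacing inttoreal(60) with the real constant 60.0, can be
sketched as a small pass over a list of three-address instructions. The tuple
encoding below is our own and is used only for illustration; the further step of
propagating copies through the temporaries, which yields the two-instruction
form shown above, would be a separate pass.
<P><PRE>def fold_constants(instructions):
    """Fold inttoreal(c) on an integer constant c at compile time."""
    known = {}                                  # temporary name -> constant
    folded = []
    for dest, op, args in instructions:
        if op == "inttoreal" and isinstance(args[0], int):
            known[dest] = float(args[0])        # the instruction disappears
            continue
        args = tuple(known.get(a, a) for a in args)
        folded.append((dest, op, args))
    return folded

code = [
    ("temp1", "inttoreal", (60,)),
    ("temp2", "*",         ("id3", "temp1")),
    ("temp3", "+",         ("id2", "temp2")),
    ("id1",   ":=",        ("temp3",)),
]
for instr in fold_constants(code):
    print(instr)
# ('temp2', '*', ('id3', 60.0))
# ('temp3', '+', ('id2', 'temp2'))
# ('id1', ':=', ('temp3',))
</PRE>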