📄 guide.tex
字号:
\documentstyle[guide,sortbib,11pt]{article}\thispagestyle{empty}\begin{document}\title{A Guide to the Rochester PL/0 Compiler \thanks{Copyright \copyright\ 1994--1999 Michael L. Scott}}\author{Michael L. Scott\\[2ex]Computer Science Department\\University of Rochester}\date{Revised July 1999}\maketitleRochester's PL/0 compiler attempts to illustrate thecharacteristics of a ``real'' compiler on a modest scale.It runs on Unix systems and generates MIPS~I assembler.Unlike many pedagogicalcompilers, it has multiple passes, employs a table-driven scanner andparser, performs high-quality syntax error repair, and issueswell-formatted diagnostic messages. It does notinclude any significant code improvement (optimization), though thisis planned for the future. For now, it generates very poor code.\section{The PL/0 Language}The PL/0 compiler is small mainly because it compiles a very smalllanguage. PL/0 was defined by Niklaus Wirth (the inventor of Pascaland Modula) as part of an extended example in his 1976 book, {\it Algorithms + Data Structures = Languages\/} (pp.\ 307--347). It isessentially a subset of Pascal, with nested procedures, but withoutparameters; functions; case, for, or repeat statements; else clauses;or any types other than integer. The PL/0 compiler implements thelanguage exactly as defined by Wirth, with the addition of primitiveI/O.\subsection{Tokens}Case is insignificant in PL/0. The PL/0 tokens are:\begin{itemize}\item keywords \\ {\tt begin}, {\tt call}, {\tt const}, {\tt do}, {\tt end}, {\tt if}, {\tt odd}, {\tt procedure}, {\tt then}, {\tt var}, {\tt while}.\item identifiers \\Any other string of letters and digits beginning with a letter.\item numbers \\ A string of (decimal) digits.\item operators, relations, and punctuation marks \\ {\tt +}, {\tt -}, {\tt *}, {\tt /}, {\tt =}, {\tt \#}, {\tt <}, {\tt >}, {\tt <=}, {\tt >=}, {\tt (}, {\tt )}, {\tt ,}, {\tt ;}, {\tt .}, {\tt :=}.\end{itemize}\subsection{Syntax}In the EBNF below,{\tt typewriter font} is used for tokens and {\em italicized font\/}is used for non-terminals. Bold parentheses areused for grouping. The vertical bar is used for alternation (``or'').Bold curly brackets indicate an optional item. A Kleene star (\kstar)indicates that the preceding item can be repeated zero or more times.Epsilon (\nothing) denotes the empty string.White space (spaces, tabs, and newlines) is required in PL/0 programs toseparate characters that could otherwise be considered part of the sametoken. Otherwise, input is ``free format'': indentation and additionalwhite space are irrelevant.\begin{grammar}\item[program] block \lit{.}\item[block] \lcurly \lit{CONST} \lit{identifier} \lit{=} \lit{number} \lparen \lit{,} \lit{identifier} \lit{=} \lit{number} \rparen\kstar\ \lit{;} \rcurly\\ \lcurly \lit{VAR} \lit{identifier} \lparen \lit{,} \lit{identifier} \rparen\kstar\ \lit{;} \rcurly\\ \lparen \lit{PROCEDURE} \lit{identifier} \lit{;} block \lit{;} \rparen\kstar\\ statement\item[statement] \lit{identifier} \lit{:=} expression\\ \alt \lit{CALL} \lit{identifier}\\ \alt \lit{BEGIN} statement \lparen \lit{;} statement \rparen\kstar\ \lit{END}\\ \alt \lit{IF} condition \lit{THEN} statement\\ \alt \lit{WHILE} condition \lit{DO} statement\\ \alt \nothing\item[expression] fragment \lparen \lparen \lit{+} \alt \lit{-} \alt \lit{*} \alt \lit{/} \rparen fragment \rparen\kstar\item[fragment] \lit{identifier} \alt \lit{number} \alt \lparen {\tt +} \alt {\tt -} \rparen fragment \alt \lit{(} expression \lit{)}\item[condition] \lit{ODD} expression \alt expression \lparen \lit{=} \alt \lit{\#} \alt \lit{<} \alt \lit{>} \alt \lit{<=} \alt \lit{>=} \rparen expression\end{grammar}\subsection{Semantics}PL/0 is simple enough that semantic issues generally amount to commonsense. The checks performed by the PL/0 compiler are listed insection~\ref{sec-semantic-checks} below. Constants, variables, andprocedures declared within a block are local to that block, and cannotbe referenced outside it. The arithmetic operators associateleft-to-right. Precedence is as follows: unary \code{+} and \code{-}group mosttightly, followed by binary \code{*} and \code{/},and then binary \code{+} and \code{-}.Conditions (comparisons) are above the level of expressions in thegrammar, effectively giving them the lowest precedence of all. Thesymbol \code{\#} means ``not equal''.Two special ``variables'' are pre-defined: The variable \code{input},when read, yields the next integer from the standard input stream.The variable \code{output}, when written, produces an integer (withtrailing linefeed) on the standard output stream.\section{Tools}Construction and operation of the PL/0 compiler are assisted by anumber of tools, as illustrated in figure~\ref{fig-structure}.\begin{figure}\centerline{%\epsfysize 6in\epsfbox{structure.eps}}\caption{Construction and operation of the PL/O compiler.}\label{fig-structure}\end{figure}\subsection{Scanner and Parser Generators}{\tt Scangen} and {\tt fmq} are scanner and parser generators,respectively. They were written by students of Charles Fischer at theUniversity of Wisconsin around about 1980, and are described in detail inappendices B, C, and E of Fischer and LeBlanc's book, {\it Crafting aCompiler}.\footnote{Note that {\tt fmq} isreferred to as {\tt LLgen} in the book. In the standard tool distribution,{\tt llgen} is a version of {\tt fmq} without error repair.}{\tt Scangen} takes as input a set of regular expressions, and produces DFAtables (file {\tt tables}) to drive a scanner for the tokens described bythe regular expressions. {\tt Fmq} takes as input a context free grammar,and produces tables (file {\tt ptablebin}) to drive an LL(1) parser forthat grammar. {\tt Fmq} takes its name from the Fischer/Milton/Quiringsyntax error repair algorithm, for which it also generates tables (file{\tt etablebin}).Both {\tt scangen} and {\tt fmq} produce numeric tables, in contrast to theC code produced by Unix's standard {\tt lex} and {\tt yacc}. ({\ttScangen} produces tables in {\sc ascii}.{\tt Fmq} can produce tables in either {\sc ascii} or binary; the PL/0compiler uses the binary option.) These tablesare turned into initialized data structures by the tools {\tt makescan} and{\tt makeparse}. The C++ output files, {\tt scantab.cc} and {\ttparsetab.cc} are then compiled by the GNU C++ compiler, {\tt g++}, with theother C++ source files. The files {\tt scantab.h} and {\tt parsetab.h}contain definitions for variables exported by {\tt scantab.cc} and {\ttparsetab.cc} respectively.\subsection{Mungegrammar}Several source files for the PL/0 compiler contain redundant informationabout the PL/0 syntax. These include the input to {\tt scangen} and {\ttfmq} ({\tt scangen.in} and {\tt fmq.in}, respectively), semantic actionroutine numbers (switch statement labels in {\tt actions.cc}), anddeclarations of token numbers used in semantic analysis ({\tt tokens.h}).To free the programmer from the need to keep these files mutuallyconsistent, they are all generated automatically by a preprocessor named{\tt mungegrammar}. The input to {\tt mungegrammar} is roughly a merger ofthe input expected by {\tt scangen} and {\tt fmq}, with some syntacticchanges necessary to make the merger work and to simplify the specificationof long productions and action routines. The {\tt grammar} filedistributed with the PL/0 compiler contains extensive self-descriptivecomments. Read them carefully before making any changes. Principalfeatures include the following:\begin{itemize}\item Ada-style (double dash through end-of-line) comments are allowed anywhere in the {\tt grammar} file.\item Tokens are declared in a single list that includes alphabetic token names, optional character string images (for error messages), insertion and deletion costs (for syntax error repair), and lexical definitions (regular expressions). A token may have {\em variants}, which will all appear the same to the parser, but may be distinguished by semantic action routines. Variants are output as tokens with matching major token numbers in {\tt scangen.in}. Note the following limitations: the semicolon that terminates a token definition must be the last thing on its input line (other than white space or a comment). Similarly, the comma that separates token variants must be the last thing (ther than white space or a comment) on its line, and a comma within a regular expression must {\em not\/} be the last thing (other than white space or a comment) on its line.\item Two special token names {\em must\/} appear: \lit{IDENT} and \lit{SPACE}. Reserve words must match the lexical definition of \lit{IDENT}, but appear in the token list with a double-quoted character string definition, rather than a regular expression. \lit{SPACE} is the only token recognized by the scanner but not passed to the parser.\item Semantic action routines can be embedded directly in productions, delimited by double square brackets (\verb|[[|\ldots\verb|]]|). The code in {\tt actions.cc} will look better if you leave the square brackets on lines by themselves, but this is not required.\item Second and subsequent lines of multi-line productions do not begin with ellipsis (\lit{\dots}), as they do in {\tt fmq.in}. First lines of productions, however, must begin in column 1, and second and subsequent lines must not do so.\end{itemize}In addition to generating {\tt scangen.in} and {\tt fmq.in}, {\ttmungegrammar} collects action routines into the arms of a large switchstatement, which it then embeds in a procedure named {\tt do\_action} infile {\tt actions.cc}. The parser calls {\tt do\_action} when it reachesthe appropriate point in the right-hand side of a production. Finally,{\tt mungegrammar} generates file {\tt tokens.h}, containing a definition({\tt \#define}) for every major and minor token number. Major tokennumbers distinguish non-terminals in the context-free grammar understood bythe parser. Minor token numbers distinguish token variants. For example,in PL/0 the multiplication and division operators play identical roles inthe grammar, and are therefore defined as variants of a single token namedMULOP. {\tt Tokens.h} defines the symbol {\tt MAJ\_MULOP} to contain themajor token number for MULOP. The two variants of MULOP are named TIMESand SLASH, with minor token numbers {\tt MIN\_TIMES} and {\tt MIN\_SLASH},respectively in {\tt tokens.h}. The character string image for MULOP is``*'', which the syntax error repair routines will use in error messageswhen they insert a MULOP. If an image is not provided, the repair routineswill use the token name.With the exceptions noted above, the token definitions (regular expressions)and productions in the {\tt grammar} file conform to the syntax expected by{\tt scangen} and {\tt fmq}. One potentially confusing feature is the``\{TOSS\}'' annotation in regular expressions: it indicates that thepreceding character will not be important to the rest of the compiler, andmay be discarded by the scanner. In general, characters of tokens thatrepresent only a single possible string (e.g.\ \lit{:=} or \lit{IF}) cansafely be tossed; characters of tokens for which there are many possiblestrings (e.g.\ \lit{IDENT} or \lit{NUMBER}) should not be tossed.Sometimes (e.g.\ in character strings, which are not present in PL/0 asoriginally defined) it makes sense to toss some characters (the quotemarks) and save the rest.\subsection{RCS}I strongly recommend that you use RCS (the Revision Control System, developedby Walter Tichy of Purdue University) to manage your compiler source.As you make changes over time, RCS can be used to keep track of all the oldversions of all your files, in an organized and highly space-efficient way.Read the man pages for {\tt ci}, {\tt co}, {\tt rcs}, {\tt rlog}, and {\tt rcsdiff}, in that order.\subsection{The {\tt Makefile}}Creation of the PL/0 compiler is automated by a {\tt Makefile}.You should read the {\tt Makefile} and figure out how it works.You should also learn to use the extra rules it provides, in addition to{\tt make pl0}.In particular, you will want to use the following:\begin{description}\item[{\tt make depend}]This rule modifies the {\tt Makefile} itself to incorporate knowledge ofwhich files {\tt \#include} which others.If you add or alter inclusions anywhere in the compiler, you will want tore-run {\tt make depend}.\item [{\tt make tags}]This rule creates cross-reference indices (files {\tt tags} and {\ttTAGS}), which are used by editors such as {\tt vi} and {\tt emacs} to assistin source perusal.Typing the appropriate command to the editorwill cause the cursor to move to the file andline at which the identifier is declared.In {\tt vi}, the command is {\tt :ta tag\_name},or control-\verb|]| when the cursor is positioned on the identifier.In {\tt emacs}, the command is meta-{\tt x find-tag} {\em tag-name},or meta-{\tt .} {\em tag-name}.If the cursor is on or near an identifier, {\tt emacs} will supply itas a default when prompting for {\em tag-name}.To pop back to the previous tag in {\tt vi}, type control-{\tt t}.To pop back in {\tt emacs}, type meta-{\tt-} meta-{\tt x find-tag},or meta-{\tt -} meta-{\tt .}.Note that the algorithm used to build the tag cross-reference database isheuristic; tags work most of the time but not always.{\tt Vi} keeps track of a single location for each tag.{\tt Emacs} keeps track of several possibilities, and jumps to the most``likely'' one first. If you don't like that one and want to try thenext, type control-{\tt u} meta-{\tt .}.In addition to creating cross-references for identifiers,{\tt make tags} creates cross-ref\-er\-ences for action routines, grammarsymbols, and production numbers.Tag search for a grammar symbol moves the cursor to a production inthe grammar in which that symbol appears on the left hand side.Tag search for {\tt R37} (and similarly {\tt R$n$} for any appropriate $n$) moves to the 37th ($n$th) action routine.This is useful for finding syntax errors in action routines, since the C++compiler produces error messages for the file {\tt actions.cc}, not for{\tt grammar}.Tag search for {\tt P37} (in general P$n$) moves to the 37th ($n$th)production inthe grammar. This is useful for debugging the parser, and shouldn't benecessary unless you break something.\item[{\tt make sources}]This rule simply prints the names of all the source files required to buildthe compiler.It is useful when embedded in backward quotes in the argument lists of otherprograms.For example, \begin{verbatim} grep foobar `make sources`\end{verbatim}will find all occurrences of {\tt foobar} in the source ofthe compiler. Similarly, \begin{verbatim} ls -l `make sources` | grep "^\\-rw"\end{verbatim}will list all source files that are currently writable.If you use RCS with strict locking, these will be the files that you havechecked out to make modifications.If you don't want the long version of the listing,\begin{verbatim} ls -l `make sources` | grep "^\\-rw" | awk '{print $8}'\end{verbatim}
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -