Finally, the auxiliary procedures section may contain arbitrary Turbo
Pascal code (such as supporting routines or a main program) which is
simply tacked on to the end of the output file. The auxiliary procedures
section is optional.
\subsection{Regular Expressions}
Table \ref{tab1} summarizes the format of the regular expressions
recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig.\ 3.48).
$c$ stands for a single character, $s$ for a string, $r$ for a regular
expression, and $n,m$ for nonnegative integers.
\begin{table*}\centering
\begin{tabular}{c|c|c}
\hline\hline
{\sc Expression}& {\sc Matches}& {\sc Example}\\
\hline
$c$& any non-operator character $c$& \verb"a"\\
\verb"\"$c$& character $c$ literally& \verb"\*"\\
\verb'"'$s$\verb'"'& string $s$ literally& \verb'"**"'\\
\verb"."& any character but newline& \verb"a.*b"\\
\verb"^"& beginning of line& \verb"^abc"\\
\verb"$"& end of line& \verb"abc$"\\
\verb"["$s$\verb"]"& any character in $s$& \verb"[abc]"\\
\verb"[^"$s$\verb"]"& any character not in $s$& \verb"[^abc]"\\
$r$\verb"*"& zero or more $r$'s& \verb"a*"\\
$r$\verb"+"& one or more $r$'s& \verb"a+"\\
$r$\verb"?"& zero or one $r$& \verb"a?"\\
$r$\verb"{"$m$\verb","$n$\verb"}"& $m$ to $n$ occurrences of $r$& \verb"a{1,5}"\\
$r$\verb"{"$m$\verb"}"& $m$ occurrences of $r$& \verb"a{5}"\\
$r_1r_2$& $r_1$ then $r_2$& \verb"ab"\\
$r_1$\verb"|"$r_2$& $r_1$ or $r_2$& \verb"a|b"\\
\verb"("$r$\verb")"& $r$& \verb"(a|b)"\\
$r_1$\verb"/"$r_2$& $r_1$ when followed by $r_2$& \verb"a/b"\\
\verb"<"$x$\verb">"$r$& $r$ when in start condition $x$& \verb"<x>abc"\\
\hline
\end{tabular}
\caption{Regular expressions.}
\label{tab1}
\end{table*}
The operators \verb"*", \verb"+", \verb"?" and \verb"{}" have highest
precedence, followed by concatenation. The \verb"|" operator has lowest
precedence. Parentheses \verb"()" may be used to group expressions and
override the default precedences. The \verb"<>" and \verb"/" operators may
occur at most once in an expression.
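As an illustration of these precedences, consider the pattern
\begin{quote}\begin{verbatim}
ab*|cd
\end{verbatim}\end{quote}
which groups as \verb"(a(b*))|(cd)": it matches \verb"a", \verb"ab",
\verb"abb", \dots\ or \verb"cd". To match one or more repetitions of the
string \verb"ab" you have to write \verb"(ab)+".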
The usual C-like escapes are recognized:
\begin{itemize}
\item \verb"\n" denotes newline
\item \verb"\r" denotes carriage return
\item \verb"\t" denotes tab
\item \verb"\b" denotes backspace
\item \verb"\f" denotes form feed
\item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base
\end{itemize}
You can also use the \verb"\" character to quote characters which would
otherwise be interpreted as operator symbols. In character classes, you may
use the \verb"-" character to denote ranges of characters. For instance,
\verb"[a-z]" denotes the class of all lowercase letters.
The expressions in a TP Lex program may be ambiguous, i.e.\ there may be inputs
which match more than one rule. In such a case, the lexical analyzer prefers
the longest match; if several rules match the same longest prefix, it picks
the one listed first. If no rule matches, the lexical analyzer
executes a default action which consists of copying the input character
to the output unchanged. Thus, if the purpose of a lexical analyzer is
to translate some parts of the input, and leave the rest unchanged, you
only have to specify the patterns which have to be treated specially. If,
however, the lexical analyzer has to absorb its whole input, you will have
to provide rules that match everything. E.g., you might use the rules
\begin{quote}\begin{verbatim}
. |
\n ;
\end{verbatim}\end{quote}
which match ``any other character'' (and ignore it).
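The disambiguation rules can be sketched with a pair of rules distinguishing
a keyword from ordinary identifiers (the token names are illustrative):
\begin{quote}\begin{verbatim}
begin       return(BEGINSYM);
[a-z]+      return(IDENT);
\end{verbatim}\end{quote}
Here the input \verb"begin" matches both rules with the same length, so the
first rule wins and the keyword token is returned; the input \verb"beginning",
however, is matched as an identifier, because the longest match is preferred.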
Sometimes certain patterns have to be analyzed differently depending on some
amount of context in which the pattern appears. In such a case the \verb"/"
operator is useful. For instance, the expression \verb"a/b" matches \verb"a",
but only if followed by \verb"b". Note that the \verb"b" does not belong to
the match; rather, the lexical analyzer, when matching an \verb"a", will look
ahead in the input to see whether it is followed by a \verb"b", before it
declares that it has matched an \verb"a". Such lookahead may be arbitrarily
complex (up to the size of the \verb"LexLib" input buffer). E.g., the pattern
\verb"a/.*b" matches an \verb"a" which is followed by a \verb"b" somewhere on
the same input line. TP Lex also has a means to specify left context which is
described in the next section.
\subsection{Start Conditions}
TP Lex provides some features which make it possible to handle left context.
The \verb"^" character at the beginning of a regular expression may be used
to denote the beginning of the line. More distant left context can be described
conveniently by using start conditions on rules.
Any rule which is prefixed with the \verb"<>" construct is only valid if the
lexical analyzer is in the denoted start state. For instance, the expression
\verb"<x>a" can only be matched if the lexical analyzer is in start state
\verb"x". You can have multiple start states in a rule; e.g., \verb"<x,y>a"
can be matched in start states \verb"x" or \verb"y".
Start states have to be declared in the definitions section by means of
one or more start state definitions (see above). The lexical analyzer enters
a start state through a call to the \verb"LexLib" routine \verb"start". E.g.,
you may write:
\begin{quote}\begin{verbatim}
%start x y
%%
<x>a start(y);
<y>b start(x);
%%
begin
start(x); if yylex=0 then ;
end.
\end{verbatim}\end{quote}
Upon initialization, the lexical analyzer is put into state \verb"x". It then
proceeds in state \verb"x" until it matches an \verb"a" which puts it into
state \verb"y". In state \verb"y" it may match a \verb"b" which puts it into
state \verb"x" again, etc.
Start conditions are useful when certain constructs have to be analyzed
differently depending on some left context (such as a special character
at the beginning of the line), and if multiple lexical analyzers have to
work in concert. If a rule is not prefixed with a start condition, it is
valid in all user-defined start states, as well as in the lexical analyzer's
default start state.
\subsection{Lex Library}
The TP Lex library (\verb"LexLib") unit provides various variables and
routines which are used by Lex-generated lexical analyzers and application
programs. It provides the input and output streams and other internal data
structures used by the lexical analyzer routine, and supplies some variables
and utility routines which may be used by actions and application programs.
Refer to the file \verb"lexlib.pas" for a detailed description.
You can also modify the Lex library unit (and/or the code template in the
\verb"yylex.cod" file) to customize TP Lex to your target applications. E.g.,
you might wish to optimize the code of the lexical analyzer for some
special application, make the analyzer read from/write to memory instead
of files, etc.
\subsection{Implementation Restrictions}
Internal table sizes and the main memory available limit the complexity of
source grammars that TP Lex can handle. There is currently no way to
change the internal table sizes (apart from modifying the sources of TP Lex
itself), but the maximum table sizes provided by TP Lex seem to be large
enough to handle most realistic applications. The actual table sizes depend on
the particular implementation (they are much larger than the defaults if TP
Lex has been compiled with one of the 32 bit compilers such as Delphi 2 or
Free Pascal), and are shown in the statistics printed by TP Lex when a
compilation is finished. The units given there are ``p'' (positions, i.e.
items in the position table used to construct the DFA), ``s'' (DFA states) and
``t'' (transitions of the generated DFA).
As implemented, the generated DFA table is stored as a typed array constant
which is inserted into the \verb"yylex.cod" code template. The transitions in
each state are stored in order. Of course it would have been more efficient to
generate a big \verb"CASE" statement instead, but I found that this may cause
problems with the encoding of large DFA tables because Turbo Pascal has
a quite rigid limit on the code size of individual procedures. I decided to
use a scheme in which transitions on different symbols to the same state are
merged into one single transition (specifying a character set and the
corresponding next state). This keeps the number of transitions in each state
quite small and still allows a fairly efficient access to the transition
table.
The TP Lex program has an option (\verb"-o") to optimize DFA tables. This
causes a minimal DFA to be generated, using the algorithm described in Aho,
Sethi, Ullman (1986). Although the absolute limit on the number of DFA states
that TP Lex can handle is at least 300, TP Lex poses an additional restriction
(100) on the number of states in the initial partition of the DFA optimization
algorithm. Thus, you may get a fatal \verb"integer set overflow" message when
using the \verb"-o" option even when TP Lex is able to generate an unoptimized
DFA. In such cases you will just have to be content with the unoptimized DFA.
(Hopefully, this will be fixed in a future version. Anyhow, using the merged
transitions scheme described above, TP Lex usually constructs unoptimized
DFA's which are not far from being optimal, and thus in most cases DFA
optimization won't have a great impact on DFA table sizes.)
\subsection{Differences from UNIX Lex}
Major differences between TP Lex and UNIX Lex are listed below.
\begin{itemize}
\item
TP Lex produces output code for Turbo Pascal, rather than for C.
\item
Character tables (\verb"%T") are not supported; neither are any
directives to determine internal table sizes (\verb"%p", \verb"%n",
etc.).
\item
Library routines are named differently from the UNIX version (e.g.,
the \verb"start" routine takes the place of the \verb"BEGIN" macro of
UNIX Lex), and, of course, all macros of UNIX Lex (\verb"ECHO",
\verb"REJECT", etc.) had to be implemented as procedures.
\item
The TP Lex library unit starts counting line numbers at 0, incrementing
the count {\em before\/} a line is read (in contrast, UNIX Lex
initializes \verb"yylineno" to 1 and increments it {\em after\/} the
line end has been read). This is motivated by the way in which TP Lex
maintains the current line, and will not affect your programs unless you
explicitly reset the \verb"yylineno" value (e.g., when opening a new
input file). In such a case you should set \verb"yylineno" to 0 rather
than 1.
\end{itemize}
\section{TP Yacc}
This section describes the TP Yacc compiler compiler.
\subsection{Usage}
\begin{quote}\begin{verbatim}
yacc [options] yacc-file[.y] [output-file[.pas]]
\end{verbatim}\end{quote}
\subsection{Options}
\begin{description}
\item[\verb"-v"]
``Verbose:'' TP Yacc generates a readable description of the generated
parser, written to \verb"yacc-file" with new extension \verb".lst".
\item[\verb"-d"]
``Debug:'' TP Yacc generates a parser with debugging output.
\end{description}
\subsection{Description}
TP Yacc is a program that lets you prepare parsers from the description
of input languages by BNF-like grammars. You simply specify the grammar
for your target language, augmented with the Turbo Pascal code necessary
to process the syntactic constructs, and TP Yacc translates your grammar
into the Turbo Pascal code for a corresponding parser subroutine named
\verb"yyparse".
TP Yacc parses the source grammar contained in \verb"yacc-file" (with default
suffix \verb".y") and writes the constructed parser subroutine to the
specified \verb"output-file" (with default suffix \verb".pas"); if no output
file is specified, output goes to \verb"yacc-file" with new suffix
\verb".pas". If any errors are found during compilation, error messages are
written to the list file (\verb"yacc-file" with new suffix \verb".lst").
The generated parser routine, \verb"yyparse", is declared as:
\begin{quote}\begin{verbatim}
function yyparse : Integer;
\end{verbatim}\end{quote}
This routine may be called by your main program to execute the parser.
The return value of the \verb"yyparse" routine denotes success or failure of
the parser (possible return values: 0 = success, 1 = unrecoverable syntax
error or parse stack overflow).
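A minimal host program might thus look as follows (a sketch only; it assumes
that the generated \verb"yylex" and \verb"yyparse" routines have been
incorporated into the same program):
\begin{quote}\begin{verbatim}
program Parse;
uses YaccLib, LexLib;

{ generated yylex and yyparse routines go here }

begin
  if yyparse = 0 then
    writeln('parse succeeded')
  else
    writeln('parse failed')
end.
\end{verbatim}\end{quote}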
Similar to TP Lex, the code template for the \verb"yyparse" routine may be
found in the \verb"yyparse.cod" file. The rules for locating this file are
analogous to those of TP Lex (see Section {\em TP Lex\/}).
The TP Yacc library (\verb"YaccLib") unit is required by programs using Yacc-
generated parsers; you will therefore have to put an appropriate \verb"uses"
clause into your program or unit that contains the parser routine. The
\verb"YaccLib" unit also provides some routines which may be used to control
the actions of the parser. See the file \verb"yacclib.pas" for further
information.
\subsection{Yacc Source}
A TP Yacc program consists of three sections separated with the \verb"%%"
delimiter:
\begin{quote}\begin{verbatim}
definitions
%%
rules
%%
auxiliary procedures
\end{verbatim}\end{quote}
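As a minimal sketch, a grammar for sums of numbers might be written like this
(the token \verb"NUM" would have to be declared in the definitions section and
supplied by a corresponding lexical analyzer):
\begin{quote}\begin{verbatim}
%token NUM
%%
expr : expr '+' NUM
     | NUM
     ;
\end{verbatim}\end{quote}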
The TP Yacc language is free-format: whitespace (blanks, tabs and newlines)
is ignored, except if it serves as a delimiter. Comments have the C-like
format \verb"/* ... */". They are treated as whitespace. Grammar symbols are
denoted by identifiers of the usual form (a letter, including the underscore,
followed by a sequence of letters and digits; upper- and lowercase are
distinct). The TP Yacc language also has some keywords which
always start with the \verb"%" character. Literals are denoted by characters
enclosed in single quotes. The usual C-like escapes are recognized:
\begin{itemize}
\item \verb"\n" denotes newline
\item \verb"\r" denotes carriage return
\item \verb"\t" denotes tab
\item \verb"\b" denotes backspace
\item \verb"\f" denotes form feed
\item \verb"\"$nnn$ denotes character no.\ $nnn$ in octal base
\end{itemize}
\subsection{Definitions}
The first section of a TP Yacc grammar serves to define the symbols used in
the grammar. It may contain the following types of definitions:
\begin{itemize}
\item
start symbol definition: A definition of the form
\begin{quote}\begin{verbatim}
%start symbol
\end{verbatim}\end{quote}
declares the start nonterminal of the grammar (if this definition is
omitted, TP Yacc assumes the left-hand side nonterminal of the first
grammar rule as the start symbol of the grammar).
\item