   function yylex : Integer;

This routine has to be called by your main program to execute the lexical
analyzer. The return value of the yylex routine usually denotes the number
of a token recognized by the lexical analyzer (see the return routine in the
LexLib unit). At end-of-file the yylex routine normally returns 0.
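
For instance, a main program (typically placed in the auxiliary procedures
section of the Lex file, see below) might drive the analyzer as follows.
This is only a minimal sketch; the tok variable is arbitrary, and the token
numbers are whatever your rules pass to the LexLib return routine:

   var tok : Integer;
   begin
     tok := yylex;                  { fetch the first token }
     while tok<>0 do                { 0 signals end-of-file }
     begin
       writeln('token number: ', tok);
       tok := yylex                 { fetch the next token }
     end
   end.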

The code template for the yylex routine may be found in the yylex.cod
file. This file is needed by TP Lex when it constructs the output file. It
must be present either in the current directory or in the directory from
which TP Lex was executed (TP Lex searches these directories in the
indicated order). (NB: for the Linux/Free Pascal version, the code template
is searched for in a directory defined at compile time, usually
/usr/lib/fpc/lexyacc, instead of along the execution path.)

The TP Lex library (LexLib) unit is required by programs using Lex-generated
lexical analyzers; you will therefore have to put an appropriate uses clause
into your program or unit that contains the lexical analyzer routine. The
LexLib unit also provides various useful utility routines; see the file
lexlib.pas for further information.


Lex Source
----------

A TP Lex program consists of three sections separated by the %% delimiter:

   definitions
   %%
   rules
   %%
   auxiliary procedures

All sections may be empty. The TP Lex language is line-oriented; definitions
and rules are separated by line breaks. There is no special notation for
comments, but (Turbo Pascal style) comments may be included as Turbo Pascal
fragments (see below).
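
As a complete (if tiny) example, the following Lex program counts the
numbers in its input. It is only a sketch (the count variable is arbitrary),
but it shows all three sections in place:

   %{
   var count : Integer;
   %}
   %%
   [0-9]+   inc(count);
   %%
   begin
     count := 0;
     if yylex=0 then ;          { run the analyzer over the whole input }
     writeln(count, ' number(s) seen')
   end.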

The definitions section may contain the following elements (a short example
follows the list):

- regular definitions in the format:

     name   substitution

  which serve to abbreviate common subexpressions. The {name} notation
  causes the corresponding substitution from the definitions section to
  be inserted into a regular expression. The name must be a legal
  identifier (a letter followed by a sequence of letters and digits;
  the underscore counts as a letter; upper- and lowercase are distinct).
  Regular definitions must be non-recursive.

- start state definitions in the format:

     %start name ...

  which are used in specifying start conditions on rules (described
  below). The %start keyword may also be abbreviated as %s or %S.

- Turbo Pascal declarations enclosed between %{ and %}. These will be
  inserted into the output file (at global scope). Also, any line that
  does not look like a Lex definition (e.g., one that starts with a blank
  or tab) will be treated as Turbo Pascal code. (In particular, this also
  allows you to include Turbo Pascal comments in your Lex program.)
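
For instance, a definitions section combining all three kinds of elements
might read as follows (a sketch; all the names here are arbitrary):

   %{
   var nIdents : Integer;     { global Turbo Pascal declaration }
   %}
   %start COMMENT
   letter   [A-Za-z_]
   digit    [0-9]

A rule could then use {letter}({letter}|{digit})* to match identifiers.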

The rules section of a TP Lex program contains the actual specification of
the lexical analyzer routine. It may be thought of as a big CASE statement
discriminating over the different patterns to be matched and listing the
corresponding statements (actions) to be executed. Each rule consists of a
regular expression describing the strings to be matched in the input, and a
corresponding action, a Turbo Pascal statement to be executed when the
expression matches. Expression and statement are delimited with whitespace
(blanks and/or tabs). Thus the format of a Lex grammar rule is:

   expression      statement;

Note that the action must be a single Turbo Pascal statement terminated
with a semicolon (use begin ... end for compound statements). The statement
may span multiple lines if the successor lines are indented with at least
one blank or tab. The action may also be replaced by the | character,
indicating that the action for this rule is the same as that for the next
one.
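
For example, the following rules treat the two spellings of a keyword alike
by means of the | action (a sketch; T_BEGIN is a hypothetical token constant
you would declare yourself):

   "begin"   |
   "BEGIN"   return(T_BEGIN);

The return routine from LexLib passes the token number back as the result
of yylex.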

The TP Lex library unit provides various variables and routines which are
useful in the programming of actions. In particular, the yytext string
variable holds the text of the matched string, and the yyleng Byte variable
its length.
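
For instance, an action may inspect the match through these variables:

   [A-Za-z]+   writeln('word "', yytext, '" has length ', yyleng);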

Regular expressions are used to describe the strings to be matched in a
grammar rule. They are built from the usual constructs describing character
classes and sequences, and operators specifying repetitions and alternatives.
The precise format of regular expressions is described in the next section.

The rules section may also start with some Turbo Pascal declarations
(enclosed in %{ %}) which are treated as local declarations of the
actions routine.

Finally, the auxiliary procedures section may contain arbitrary Turbo
Pascal code (such as supporting routines or a main program) which is
simply tacked on to the end of the output file. The auxiliary procedures
section is optional.


Regular Expressions
-------------------

The following table summarizes the format of the regular expressions
recognized by TP Lex (also compare Aho, Sethi, Ullman 1986, fig. 3.48).
c stands for a single character, s for a string, r for a regular expression,
and m,n for nonnegative integers.

expression   matches                        example
----------   ----------------------------   -------
c            any non-operator character c   a
\c           character c literally          \*
"s"          string s literally             "**"
.            any character but newline      a.*b
^            beginning of line              ^abc
$            end of line                    abc$
[s]          any character in s             [abc]
[^s]         any character not in s         [^abc]
r*           zero or more r's               a*
r+           one or more r's                a+
r?           zero or one r                  a?
r{m,n}       m to n occurrences of r        a{1,5}
r{m}         m occurrences of r             a{5}
r1r2         r1 then r2                     ab
r1|r2        r1 or r2                       a|b
(r)          r                              (a|b)
r1/r2        r1 when followed by r2         a/b
<x>r         r when in start condition x    <x>abc
---------------------------------------------------

The operators *, +, ? and {} have highest precedence, followed by
concatenation. The | operator has lowest precedence. Parentheses ()
may be used to group expressions and override the default precedences.
The <> and / operators may only occur once in an expression.
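
A few expressions illustrate how the precedences combine:

   ab*     matches a, ab, abb, ...    (repetition binds tighter than concatenation)
   (ab)*   matches the empty string, ab, abab, ...
   a|bc    matches a or bc            (concatenation binds tighter than |)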

The usual C-like escapes are recognized:

\n     denotes newline
\r     denotes carriage return
\t     denotes tab
\b     denotes backspace
\f     denotes form feed
\NNN   denotes character no. NNN in octal base

You can also use the \ character to quote characters which would otherwise
be interpreted as operator symbols. In character classes, you may use
the - character to denote ranges of characters. For instance, [a-z]
denotes the class of all lowercase letters.

The expressions in a TP Lex program may be ambiguous, i.e. there may be
inputs which match more than one rule. In such a case, the lexical analyzer
prefers the longest match and, if it still has the choice between different
rules, it picks the first of these. If no rule matches, the lexical analyzer
executes a default action which consists of copying the input character
to the output unchanged. Thus, if the purpose of a lexical analyzer is
to translate some parts of the input, and leave the rest unchanged, you
only have to specify the patterns which have to be treated specially. If,
however, the lexical analyzer has to absorb its whole input, you will have
to provide rules that match everything. E.g., you might use the rules

   .   |
   \n  ;

which match "any other character" (and ignore it).
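
The longest-match and first-rule preferences are what make keyword rules
work. For example (a sketch with hypothetical token constants T_IF and
T_IDENT):

   if        return(T_IF);
   [a-z]+    return(T_IDENT);

On the input ifdef the second rule wins because it yields the longer match;
on the input if both rules match the same length, so the first rule is
chosen and the keyword is recognized.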

Sometimes certain patterns have to be analyzed differently depending on some
amount of context in which the pattern appears. In such a case the / operator
is useful. For instance, the expression a/b matches a, but only if followed
by b. Note that the b does not belong to the match; rather, the lexical
analyzer, when matching an a, will look ahead in the input to see whether
it is followed by a b, before it declares that it has matched an a. Such
lookahead may be arbitrarily complex (up to the size of the LexLib input
buffer). E.g., the pattern a/.*b matches an a which is followed by a b
somewhere on the same input line. TP Lex also has a means to specify left
context which is described in the next section.
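
A typical use of trailing context is to recognize a token only when the
right context fits. For instance (again a sketch, with T_IF a hypothetical
token constant):

   if/[ \t]   return(T_IF);

matches the keyword if only when it is immediately followed by a blank or
tab; the blank or tab itself is not consumed and remains in the input.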


Start Conditions
----------------

TP Lex provides some features which make it possible to handle left context.
The ^ character at the beginning of a regular expression may be used to
denote the beginning of the line. More distant left context can be described
conveniently by using start conditions on rules.

Any rule which is prefixed with the <> construct is only valid if the lexical
analyzer is in the denoted start state. For instance, the expression <x>a
can only be matched if the lexical analyzer is in start state x. You can have
multiple start states in a rule; e.g., <x,y>a can be matched in start states
x or y.

Start states have to be declared in the definitions section by means of
one or more start state definitions (see above). The lexical analyzer enters
a start state through a call to the LexLib routine start. E.g., you may
write:

%start x y
%%
<x>a    start(y);
<y>b    start(x);
%%
begin
  start(x);            { enter start state x }
  if yylex=0 then ;    { call yylex, ignoring its result }
end.

Upon initialization, the lexical analyzer is put into state x. It then
proceeds in state x until it matches an a which puts it into state y.
In state y it may match a b which puts it into state x again, etc.

Start conditions are useful when certain constructs have to be analyzed
differently depending on some left context (such as a special character
at the beginning of the line), and if multiple lexical analyzers have to
work in concert. If a rule is not prefixed with a start condition, it is
valid in all user-defined start states, as well as in the lexical analyzer's
default start state.


Lex Library
-----------

The TP Lex library (LexLib) unit provides various variables and routines
which are used by Lex-generated lexical analyzers and application programs.
It provides the input and output streams and other internal data structures
used by the lexical analyzer routine, and supplies some variables and utility
routines which may be used by actions and application programs. Refer to
the file lexlib.pas for a closer description.
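
For instance, an application can redirect the analyzer's input and output
streams. The following sketch assumes that LexLib exposes standard Text
variables named yyinput and yyoutput for this purpose; check lexlib.pas to
confirm the exact names before relying on them:

   begin
     assign(yyinput, 'source.bas');  reset(yyinput);     { read from a file }
     assign(yyoutput, 'tokens.txt'); rewrite(yyoutput);  { write to a file }
     if yylex=0 then ;                                   { run the analyzer }
     close(yyinput); close(yyoutput)
   end.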

You can also modify the Lex library unit (and/or the code template in the
yylex.cod file) to customize TP Lex to your target applications. E.g.,
you might wish to optimize the code of the lexical analyzer for some
special application, make the analyzer read from/write to memory instead
of files, etc.


Implementation Restrictions
---------------------------

Internal table sizes and the main memory available limit the complexity of
source grammars that TP Lex can handle. There is currently no way to change
the internal table sizes (apart from modifying the sources of TP Lex
itself), but the maximum table sizes provided by TP Lex seem to be large
enough to handle most realistic applications. The actual table sizes depend
on the particular implementation (they are much larger than the defaults if
TP Lex has been compiled with one of the 32 bit compilers such as Delphi 2
or Free Pascal), and are shown in the statistics printed by TP Lex when a
compilation is finished. The units given there are "p" (positions, i.e.
items in the position table used to construct the DFA), "s" (DFA states)
and "t" (transitions of the generated DFA).

As implemented, the generated DFA table is stored as a typed array constant
which is inserted into the yylex.cod code template. The transitions in each
state are stored in order. Of course it would have been more efficient to
generate a big CASE statement instead, but I found that this may cause
problems with the encoding of large DFA tables because Turbo Pascal has
a quite rigid limit on the code size of individual procedures. I decided to
use a scheme in which transitions on different symbols to the same state are
merged into one single transition (specifying a character set and the
corresponding next state). This keeps the number of transitions in each
state quite small and still allows fairly efficient access to the
transition table.

The TP Lex program has an option (-o) to optimize DFA tables. This causes a
minimal DFA to be generated, using the algorithm described in Aho, Sethi,
Ullman (1986). Although the absolute limit on the number of DFA states that
TP Lex can handle is at least 300, TP Lex poses an additional restriction
(100) on the number of states in the initial partition of the DFA
optimization algorithm. Thus, you may get a fatal `integer set overflow'
message when using the -o option even when TP Lex is able to generate an
unoptimized DFA. In such cases you will just have to be content with the
unoptimized DFA. (Hopefully, this will be fixed in a future version. Anyhow,
using the merged transitions scheme described above, TP Lex usually
constructs unoptimized DFAs which are not far from being optimal, and thus
in most cases DFA optimization won't have a great impact on DFA table
sizes.)


Differences from UNIX Lex
-------------------------

Major differences between TP Lex and UNIX Lex are listed below.

- TP Lex produces output code for Turbo Pascal, rather than for C.

- Character tables (%T) are not supported; neither are any directives
  to determine internal table sizes (%p, %n, etc.).

- Library routines are named differently from the UNIX version (e.g.,
  the `start' routine takes the place of the `BEGIN' macro of UNIX
  Lex), and, of course, all macros of UNIX Lex (ECHO, REJECT, etc.) had
  to be implemented as procedures.

- The TP Lex library unit starts counting line numbers at 0, incrementing
  the count BEFORE a line is read (in contrast, UNIX Lex initializes
  yylineno to 1 and increments it AFTER the line end has been read). This
  is motivated by the way in which TP Lex maintains the current line,
  and will not affect your programs unless you explicitly reset the
  yylineno value (e.g., when opening a new input file). In such a case
  you should set yylineno to 0 rather than 1, as shown below.
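
For example, when switching the analyzer to a new input file you might
write (a sketch; yyinput is assumed to be LexLib's input stream, as in the
Lex Library example above):

   assign(yyinput, 'next.bas'); reset(yyinput);
   yylineno := 0;              { TP Lex convention: restart line count at 0 }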


TP Yacc
=======

This section describes the TP Yacc compiler compiler.


Usage
-----

yacc [options] yacc-file[.y] [output-file[.pas]]


Options
-------

-v  "Verbose:" TP Yacc generates a readable description of the generated
    parser, written to yacc-file with new extension .lst.

-d  "Debug:" TP Yacc generates a parser with debugging output.


Description
-----------

TP Yacc is a program that lets you prepare parsers from the description
of input languages by BNF-like grammars. You simply specify the grammar
for your target language, augmented with the Turbo Pascal code necessary
to process the syntactic constructs, and TP Yacc translates your grammar
into the Turbo Pascal code for a corresponding parser subroutine named
yyparse.

TP Yacc parses the source grammar contained in yacc-file (with default
suffix .y) and writes the constructed parser subroutine to the specified
output-file (with default suffix .pas); if no output file is specified,
output goes to yacc-file with new suffix .pas. If any errors are found
during compilation, error messages are written to the list file (yacc-file
with new suffix .lst).

The generated parser routine, yyparse, is declared as:

   function yyparse : Integer;

This routine may be called by your main program to execute the parser.
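
For instance, assuming the usual Yacc convention that yyparse returns 0
when the input was parsed successfully (consult the generated code to
confirm the exact return values), a main program might read:

   begin
     if yyparse=0 then
       writeln('parse succeeded')
     else
       writeln('parse failed')
   end.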
