📄 the lemon parser generator.htm
字号:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0040)http://www.hwaci.com/sw/lemon/lemon.html -->
<HTML><HEAD><TITLE>The Lemon Parser Generator</TITLE>
<META http-equiv=Content-Type content="text/html; charset=big5">
<META content="MSHTML 6.00.2800.1400" name=GENERATOR></HEAD>
<BODY bgColor=white>
<H1 align=center>The Lemon Parser Generator</H1>
<P>Lemon is an LALR(1) parser generator for C or C++. It does the same job as
``bison'' and ``yacc''. But lemon is not another bison or yacc clone. It uses a
different grammar syntax which is designed to reduce the number of coding
errors. Lemon also uses a more sophisticated parsing engine that is faster than
yacc and bison and which is both reentrant and thread-safe. Furthermore, Lemon
implements features that can be used to eliminate resource leaks, making is
suitable for use in long-running programs such as graphical user interfaces or
embedded controllers.</P>
<P>This document is an introduction to the Lemon parser generator.</P>
<H2>Theory of Operation</H2>
<P>The main goal of Lemon is to translate a context free grammar (CFG) for a
particular language into C code that implements a parser for that language. The
program has two inputs:
<UL>
<LI>The grammar specification.
<LI>A parser template file. </LI></UL>Typically, only the grammar specification
is supplied by the programmer. Lemon comes with a default parser template which
works fine for most applications. But the user is free to substitute a different
parser template if desired.
<P></P>
<P>Depending on command-line options, Lemon will generate between one and three
files of outputs.
<UL>
<LI>C code to implement the parser.
<LI>A header file defining an integer ID for each terminal symbol.
<LI>An information file that describes the states of the generated parser
automaton. </LI></UL>By default, all three of these output files are generated.
The header file is suppressed if the ``-m'' command-line option is used and the
report file is omitted when ``-q'' is selected.
<P></P>
<P>The grammar specification file uses a ``.y'' suffix, by convention. In the
examples used in this document, we'll assume the name of the grammar file is
``gram.y''. A typical use of Lemon would be the following command: <PRE> lemon gram.y
</PRE>This command will generate three output files named ``gram.c'', ``gram.h''
and ``gram.out''. The first is C code to implement the parser. The second is the
header file that defines numerical values for all terminal symbols, and the last
is the report that explains the states used by the parser automaton.
<P></P>
<H3>Command Line Options</H3>
<P>The behavior of Lemon can be modified using command-line options. You can
obtain a list of the available command-line options together with a brief
explanation of what each does by typing <PRE> lemon -?
</PRE>As of this writing, the following command-line options are supported:
<UL>
<LI><TT>-b</TT>
<LI><TT>-c</TT>
<LI><TT>-g</TT>
<LI><TT>-m</TT>
<LI><TT>-q</TT>
<LI><TT>-s</TT>
<LI><TT>-x</TT> </LI></UL>The ``-b'' option reduces the amount of text in the
report file by printing only the basis of each parser state, rather than the
full configuration. The ``-c'' option suppresses action table compression. Using
-c will make the parser a little larger and slower but it will detect syntax
errors sooner. The ``-g'' option causes no output files to be generated at all.
Instead, the input grammar file is printed on standard output but with all
comments, actions and other extraneous text deleted. This is a useful way to get
a quick summary of a grammar. The ``-m'' option causes the output C source file
to be compatible with the ``makeheaders'' program. Makeheaders is a program that
automatically generates header files from C source code. When the ``-m'' option
is used, the header file is not output since the makeheaders program will take
care of generated all header files automatically. The ``-q'' option suppresses
the report file. Using ``-s'' causes a brief summary of parser statistics to be
printed. Like this: <PRE> Parser statistics: 74 terminals, 70 nonterminals, 179 rules
340 states, 2026 parser table entries, 0 conflicts
</PRE>Finally, the ``-x'' option causes Lemon to print its version number and
copyright information and then stop without attempting to read the grammar or
generate a parser.
<P></P>
<H3>The Parser Interface</H3>
<P>Lemon doesn't generate a complete, working program. It only generates a few
subroutines that implement a parser. This section describes the interface to
those subroutines. It is up to the programmer to call these subroutines in an
appropriate way in order to produce a complete system.</P>
<P>Before a program begins using a Lemon-generated parser, the program must
first create the parser. A new parser is created as follows: <PRE> void *pParser = ParseAlloc( malloc );
</PRE>The ParseAlloc() routine allocates and initializes a new parser and
returns a pointer to it. The actual data structure used to represent a parser is
opaque -- its internal structure is not visible or usable by the calling
routine. For this reason, the ParseAlloc() routine returns a pointer to void
rather than a pointer to some particular structure. The sole argument to the
ParseAlloc() routine is a pointer to the subroutine used to allocate memory.
Typically this means ``malloc()''.
<P></P>
<P>After a program is finished using a parser, it can reclaim all memory
allocated by that parser by calling <PRE> ParseFree(pParser, free);
</PRE>The first argument is the same pointer returned by ParseAlloc(). The
second argument is a pointer to the function used to release bulk memory back to
the system.
<P></P>
<P>After a parser has been allocated using ParseAlloc(), the programmer must
supply the parser with a sequence of tokens (terminal symbols) to be parsed.
This is accomplished by calling the following function once for each token: <PRE> Parse(pParser, hTokenID, sTokenData, pArg);
</PRE>The first argument to the Parse() routine is the pointer returned by
ParseAlloc(). The second argument is a small positive integer that tells the
parse the type of the next token in the data stream. There is one token type for
each terminal symbol in the grammar. The gram.h file generated by Lemon contains
#define statements that map symbolic terminal symbol names into appropriate
integer values. (A value of 0 for the second argument is a special flag to the
parser to indicate that the end of input has been reached.) The third argument
is the value of the given token. By default, the type of the third argument is
integer, but the grammar will usually redefine this type to be some kind of
structure. Typically the second argument will be a broad category of tokens such
as ``identifier'' or ``number'' and the third argument will be the name of the
identifier or the value of the number.
<P></P>
<P>The Parse() function may have either three or four arguments, depending on
the grammar. If the grammar specification file request it, the Parse() function
will have a fourth parameter that can be of any type chosen by the programmer.
The parser doesn't do anything with this argument except to pass it through to
action routines. This is a convenient mechanism for passing state information
down to the action routines without having to use global variables.</P>
<P>A typical use of a Lemon parser might look something like the following: <PRE> 01 ParseTree *ParseFile(const char *zFilename){
02 Tokenizer *pTokenizer;
03 void *pParser;
04 Token sToken;
05 int hTokenId;
06 ParserState sState;
07
08 pTokenizer = TokenizerCreate(zFilename);
09 pParser = ParseAlloc( malloc );
10 InitParserState(&sState);
11 while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
12 Parse(pParser, hTokenId, sToken, &sState);
13 }
14 Parse(pParser, 0, sToken, &sState);
15 ParseFree(pParser, free );
16 TokenizerFree(pTokenizer);
17 return sState.treeRoot;
18 }
</PRE>This example shows a user-written routine that parses a file of text and
returns a pointer to the parse tree. (We've omitted all error-handling from this
example to keep it simple.) We assume the existence of some kind of tokenizer
which is created using TokenizerCreate() on line 8 and deleted by
TokenizerFree() on line 16. The GetNextToken() function on line 11 retrieves the
next token from the input file and puts its type in the integer variable
hTokenId. The sToken variable is assumed to be some kind of structure that
contains details about each token, such as its complete text, what line it
occurs on, etc.
<P></P>
<P>This example also assumes the existence of structure of type ParserState that
holds state information about a particular parse. An instance of such a
structure is created on line 6 and initialized on line 10. A pointer to this
structure is passed into the Parse() routine as the optional 4th argument. The
action routine specified by the grammar for the parser can use the ParserState
structure to hold whatever information is useful and appropriate. In the
example, we note that the treeRoot field of the ParserState structure is left
pointing to the root of the parse tree.</P>
<P>The core of this example as it relates to Lemon is as follows: <PRE> ParseFile(){
pParser = ParseAlloc( malloc );
while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
Parse(pParser, hTokenId, sToken);
}
Parse(pParser, 0, sToken);
ParseFree(pParser, free );
}
</PRE>Basically, what a program has to do to use a Lemon-generated parser is
first create the parser, then send it lots of tokens obtained by tokenizing an
input source. When the end of input is reached, the Parse() routine should be
called one last time with a token type of 0. This step is necessary to inform
the parser that the end of input has been reached. Finally, we reclaim memory
used by the parser by calling ParseFree().
<P></P>
<P>There is one other interface routine that should be mentioned before we move
on. The ParseTrace() function can be used to generate debugging output from the
parser. A prototype for this routine is as follows: <PRE> ParseTrace(FILE *stream, char *zPrefix);
</PRE>After this routine is called, a short (one-line) message is written to the
designated output stream every time the parser changes states or calls an action
routine. Each such message is prefaced using the text given by zPrefix. This
debugging output can be turned off by calling ParseTrace() again with a first
argument of NULL (0).
<P></P>
<H3>Differences With YACC and BISON</H3>
<P>Programmers who have previously used the yacc or bison parser generator will
notice several important differences between yacc and/or bison and Lemon.
<UL>
<LI>In yacc and bison, the parser calls the tokenizer. In Lemon, the tokenizer
calls the parser.
<LI>Lemon uses no global variables. Yacc and bison use global variables to
pass information between the tokenizer and parser.
<LI>Lemon allows multiple parsers to be running simultaneously. Yacc and bison
do not. </LI></UL>These differences may cause some initial confusion for
programmers with prior yacc and bison experience. But after years of experience
using Lemon, I firmly believe that the Lemon way of doing things is better.
<P></P>
<H2>Input File Syntax</H2>
<P>The main purpose of the grammar specification file for Lemon is to define the
grammar for the parser. But the input file also specifies additional information
Lemon requires to do its job. Most of the work in using Lemon is in writing an
appropriate grammar file.</P>
<P>The grammar file for lemon is, for the most part, free format. It does not
have sections or divisions like yacc or bison. Any declaration can occur at any
point in the file. Lemon ignores whitespace (except where it is needed to
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -