@end example

The "name" is a word beginning with a letter or an
underscore ('_') followed by zero or more letters, digits, '_',
or '-' (dash).  The definition is taken to begin at the
first non-white-space character following the name and
to continue to the end of the line.  The definition can
subsequently be referred to using "@{name@}", which will
expand to "(definition)".  For example,

@example
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
@end example

@noindent
defines "DIGIT" to be a regular expression which matches a
single digit, and "ID" to be a regular expression which
matches a letter followed by zero-or-more
letters-or-digits.  A subsequent reference to

@example
@{DIGIT@}+"."@{DIGIT@}*
@end example

@noindent
is identical to

@example
([0-9])+"."([0-9])*
@end example

@noindent
and matches one-or-more digits followed by a '.' followed
by zero-or-more digits.

The @var{rules} section of the @code{flex} input contains a series of
rules of the form:

@example
pattern   action
@end example

@noindent
where the pattern must be unindented and the action must
begin on the same line.

See below for a further description of patterns and
actions.

Finally, the user code section is simply copied to
@file{lex.yy.c} verbatim.  It is used for companion routines
which call or are called by the scanner.  The presence of
this section is optional; if it is missing, the second @samp{%%}
in the input file may be skipped, too.

In the definitions and rules sections, any @emph{indented} text or
text enclosed in @samp{%@{} and @samp{%@}} is copied verbatim to the
output (with the @samp{%@{@}}'s removed).  The @samp{%@{@}}'s must
appear unindented on lines by themselves.

In the rules section, any indented or %@{@} text appearing
before the first rule may be used to declare variables
which are local to the scanning routine and (after the
declarations) code which is to be executed whenever the
scanning routine is entered.  Other indented or %@{@} text
in the rules section is still copied to the output, but its
meaning is not well-defined and it may well cause
compile-time errors (this feature is present for @code{POSIX} compliance;
see below for other such features).

In the definitions section (but not in the rules section),
an unindented comment (i.e., a line beginning with "/*")
is also copied verbatim to the output up to the next "*/".
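
As a rough sketch of these copying rules (the variable names
@code{line_count} and @code{chars_seen} are hypothetical, chosen only
for illustration), the following input places an unindented comment and
a @samp{%@{} @samp{%@}} block in the definitions section, and an
indented declaration ahead of the first rule:

@example
/* Unindented comment: copied to the output. */
%@{
#include <stdio.h>
static int line_count = 0;   /* hypothetical file-scope counter */
%@}

%%
        int chars_seen = 0;  /* local to the scanning routine */

\n      ++line_count; ++chars_seen;
.       ++chars_seen;
@end example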

@node Patterns, Matching, Format, Top
@section Patterns

The patterns in the input are written using an extended
set of regular expressions.  These are:

@table @samp
@item x
match the character @samp{x}
@item .
any character (byte) except newline
@item [xyz]
a "character class"; in this case, the pattern
matches either an @samp{x}, a @samp{y}, or a @samp{z}
@item [abj-oZ]
a "character class" with a range in it; matches
an @samp{a}, a @samp{b}, any letter from @samp{j} through @samp{o},
or a @samp{Z}
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class.  In this case, any
character EXCEPT an uppercase letter.
@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline
@item @var{r}*
zero or more @var{r}'s, where @var{r} is any regular expression
@item @var{r}+
one or more @var{r}'s
@item @var{r}?
zero or one @var{r}'s (that is, "an optional @var{r}")
@item @var{r}@{2,5@}
anywhere from two to five @var{r}'s
@item @var{r}@{2,@}
two or more @var{r}'s
@item @var{r}@{4@}
exactly 4 @var{r}'s
@item @{@var{name}@}
the expansion of the "@var{name}" definition
(see above)
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}
@item \@var{x}
if @var{x} is an @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or @samp{v},
then the ANSI-C interpretation of \@var{x}.
Otherwise, a literal @samp{@var{x}} (used to escape
operators such as @samp{*})
@item \0
a NUL character (ASCII code 0)
@item \123
the character with octal value 123
@item \x2a
the character with hexadecimal value @code{2a}
@item (@var{r})
match an @var{r}; parentheses are used to override
precedence (see below)
@item @var{r}@var{s}
the regular expression @var{r} followed by the
regular expression @var{s}; called "concatenation"
@item @var{r}|@var{s}
either an @var{r} or an @var{s}
@item @var{r}/@var{s}
an @var{r} but only if it is followed by an @var{s}.  The text
matched by @var{s} is included when determining whether this rule is
the @dfn{longest match}, but is then returned to the input before
the action is executed.  So the action only sees the text matched
by @var{r}.  This type of pattern is called @dfn{trailing context}.
(There are some combinations of @samp{@var{r}/@var{s}} that @code{flex}
cannot match correctly; see notes in the Deficiencies / Bugs section
below regarding "dangerous trailing context".)
@item ^@var{r}
an @var{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).
@item @var{r}$
an @var{r}, but only at the end of a line (i.e., just
before a newline).  Equivalent to "@var{r}/\n".

Note that flex's notion of "newline" is exactly
whatever the C compiler used to compile flex
interprets '\n' as; in particular, on some DOS
systems you must either filter out \r's in the
input yourself, or explicitly use @var{r}/\r\n for "r$".
@item <@var{s}>@var{r}
an @var{r}, but only in start condition @var{s} (see
below for discussion of start conditions)
@item <@var{s1},@var{s2},@var{s3}>@var{r}
same, but in any of start conditions @var{s1},
@var{s2}, or @var{s3}
@item <*>@var{r}
an @var{r} in any start condition, even an exclusive one.
@item <<EOF>>
an end-of-file
@item <@var{s1},@var{s2}><<EOF>>
an end-of-file when in start condition @var{s1} or @var{s2}
@end table
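
As a small illustration of trailing context (a sketch, not taken from
the @code{flex} distribution), the rule below matches @samp{foo} only
when @samp{bar} follows.  By the description above, the action should
see just @samp{foo} in the matched text (@code{yytext}, described in
the next section), with @samp{bar} returned to the input:

@example
%%
foo/bar    printf("foo before bar: %s\n", yytext);
@end example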

Note that inside of a character class, all regular
expression operators lose their special meaning except escape
('\') and the character class operators, '-', ']', and, at
the beginning of the class, '^'.

The regular expressions listed above are grouped according
to precedence, from highest precedence at the top to
lowest at the bottom.  Those grouped together have equal
precedence.  For example,

@example
foo|bar*
@end example

@noindent
is the same as

@example
(foo)|(ba(r*))
@end example

@noindent
since the '*' operator has higher precedence than
concatenation, and concatenation higher than alternation ('|').
This pattern therefore matches @emph{either} the string "foo" @emph{or}
the string "ba" followed by zero-or-more r's.  To match
"foo" or zero-or-more "bar"'s, use:

@example
foo|(bar)*
@end example

@noindent
and to match zero-or-more "foo"'s-or-"bar"'s:

@example
(foo|bar)*
@end example

In addition to characters and ranges of characters,
character classes can also contain character class
@dfn{expressions}.  These are expressions enclosed inside @samp{[:} and @samp{:]}
delimiters (which themselves must appear between the '['
and ']' of the character class; other elements may occur
inside the character class, too).  The valid expressions
are:

@example
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
@end example

These expressions all designate a set of characters
equivalent to the corresponding standard C @samp{isXXX} function.  For
example, @samp{[:alnum:]} designates those characters for which
@samp{isalnum()} returns true, i.e., any alphabetic or numeric character.
Some systems don't provide @samp{isblank()}, so flex defines
@samp{[:blank:]} as a blank or a tab.

For example, the following character classes are all
equivalent:

@example
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
@end example

If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to @samp{[:alpha:]}.

Some notes on patterns:

@itemize -
@item
A negated character class such as the example
"[^A-Z]" above @emph{will match a newline} unless "\n" (or an
equivalent escape sequence) is one of the
characters explicitly present in the negated character
class (e.g., "[^A-Z\n]").  This is unlike how many
other regular expression tools treat negated
character classes, but unfortunately the inconsistency
is historically entrenched.  Matching newlines
means that a pattern like [^"]* can match the
entire input unless there's another quote in the
input.

@item
A rule can have at most one instance of trailing
context (the '/' operator or the '$' operator).
The start condition, '^', and "<<EOF>>" patterns
can only occur at the beginning of a pattern and,
like '/' and '$', cannot be grouped
inside parentheses.  A '^' which does not occur at
the beginning of a rule or a '$' which does not
occur at the end of a rule loses its special
properties and is treated as a normal character.

The following are illegal:

@example
foo/bar$
<sc1>foo<sc2>bar
@end example

Note that the first of these can be written
"foo/bar\n".

The following will result in '$' or '^' being
treated as a normal character:

@example
foo|(bar$)
foo|^bar
@end example

If what's wanted is a "foo" or a
bar-followed-by-a-newline, the following could be used (the special
'|' action is explained below):

@example
foo      |
bar$     /* action goes here */
@end example

A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
@end itemize
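
Following the same layout, the beginning-of-line variant might be
sketched as:

@example
foo      |
^bar     /* action goes here */
@end example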

@node Matching, Actions, Patterns, Top
@section How the input is matched

When the generated scanner is run, it analyzes its input
looking for strings which match any of its patterns.  If
it finds more than one match, it takes the one matching
the most text (for trailing context rules, this includes
the length of the trailing part, even though it will then
be returned to the input).  If it finds two or more
matches of the same length, the rule listed first in the
@code{flex} input file is chosen.

Once the match is determined, the text corresponding to
the match (called the @var{token}) is made available in the
global character pointer @code{yytext}, and its length in the
global integer @code{yyleng}.  The @var{action} corresponding to the
matched pattern is then executed (a more detailed
description of actions follows), and then the remaining input is
scanned for another match.

If no match is found, then the @dfn{default rule} is executed:
the next character in the input is considered matched and
copied to the standard output.  Thus, the simplest legal
@code{flex} input is:

@example
%%
@end example

which generates a scanner that simply copies its input
(one character at a time) to its output.
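
To illustrate the two tie-breaking rules, consider this sketch (the
patterns and messages are chosen only for illustration):

@example
%%
if        printf("keyword\n");
[a-z]+    printf("identifier\n");
@end example

@noindent
On the input @samp{if}, both patterns match two characters, so the
first rule should be chosen; on @samp{iffy}, the second pattern matches
more text and so should win.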

Note that @code{yytext} can be defined in two different ways:
either as a character @emph{pointer} or as a character @emph{array}.
You can control which definition @code{flex} uses by including
one of the special directives @samp{%pointer} or @samp{%array} in the
first (definitions) section of your flex input.  The
default is @samp{%pointer}, unless you use the @samp{-l} lex
compatibility option, in which case @code{yytext} will be an array.  The
advantage of using @samp{%pointer} is substantially faster
scanning and no buffer overflow when matching very large
tokens (unless you run out of dynamic memory).  The
disadvantage is that you are restricted in how your actions can
modify @code{yytext} (see the next section), and calls to the
@samp{unput()} function destroy the present contents of @code{yytext},
which can be a considerable porting headache when moving
between different @code{lex} versions.

The advantage of @samp{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @samp{unput()} do not
destroy @code{yytext} (see below).  Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using
declarations of the form:
@example
extern char yytext[];
@end example
This definition is erroneous when used with @samp{%pointer}, but
correct for @samp{%array}.

@samp{%array} defines @code{yytext} to be an array of @code{YYLMAX} characters,
which defaults to a fairly large value.  You can change
the size by simply #define'ing @code{YYLMAX} to a different value
in the first section of your @code{flex} input.  As mentioned
above, with @samp{%pointer} @code{yytext} grows dynamically to
accommodate large tokens.  While this means your @samp{%pointer} scanner
can accommodate very large tokens (such as matching entire
blocks of comments), bear in mind that each time the
scanner must resize @code{yytext} it also must rescan the entire
token from the beginning, so matching such tokens can
prove slow.  @code{yytext} presently does @emph{not} dynamically grow if
a call to @samp{unput()} results in too much text being pushed
back; instead, a run-time error results.

Also note that you cannot use @samp{%array} with C++ scanner
classes (the @code{c++} option; see below).
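
A minimal sketch of selecting the array form and overriding its size
(the value 16384 is an arbitrary assumption for illustration):

@example
%@{
#define YYLMAX 16384
%@}
%array
%%
.|\n    ECHO;
@end example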

@node Actions, Generated scanner, Matching, Top
@section Actions

Each pattern in a rule has a corresponding action, which
can be any arbitrary C statement.  The pattern ends at the
first non-escaped whitespace character; the remainder of
the line is its action.  If the action is empty, then when
the pattern is matched the input token is simply
discarded.  For example, here is the specification for a
program which deletes all occurrences of "zap me" from its
input:

@example
%%