-Ce directs flex to construct equivalence classes, i.e., sets of
characters which have identical lexical properties (for example, if
the only appearance of digits in the flex input is in the character
class "[0-9]" then the digits '0', '1', ..., '9' will all be put in
the same equivalence class). Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically
a factor of 2-5) and are pretty cheap performance-wise (one array
look-up per character scanned).
-Cf specifies that the full scanner tables should be generated -
flex should not compress the tables by taking advantage of similar
transition functions for different states.
-CF specifies that the alternate fast scanner representation
(described above under the -F flag) should be used.
-Cm directs flex to construct meta-equivalence classes, which are
sets of equivalence classes (or characters, if equivalence classes
are not being used) that are commonly used together. Meta-
equivalence classes are often a big win when using compressed
tables, but they have a moderate performance impact (one or two "if"
tests and one array look-up per character scanned).
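The per-character cost of these options can be sketched in C. The following is a hypothetical illustration (not flex's actual table layout) of how -Ce's equivalence classes add the single array look-up described above, shrinking the transition table from one column per character to one column per class:

```c
#include <assert.h>

/* Hypothetical sketch (not flex's real tables): with -Ce, each input
 * character is first mapped through an equivalence-class array, so the
 * transition table needs one column per class instead of one per
 * character -- at the cost of one extra look-up per character. */

enum { N_CLASSES = 3 };          /* e.g. digits, lowercase, everything else */

static int ec[256];              /* character -> equivalence class */

static void build_ec(void)
{
    int c;
    for (c = 0; c < 256; c++)
        ec[c] = 0;               /* class 0: "other" */
    for (c = '0'; c <= '9'; c++)
        ec[c] = 1;               /* all digits share one class */
    for (c = 'a'; c <= 'z'; c++)
        ec[c] = 2;               /* so do all lowercase letters */
}

/* A full table would be next[n_states][256]; with classes it is
 * next[n_states][N_CLASSES] -- a large reduction in width. */
static int next_state(const int next[][N_CLASSES], int state, int c)
{
    return next[state][ec[c]];   /* the one extra array look-up */
}
```

With -Cm, a further map of the same kind sits between the class number and the compressed-table index, which is where the extra look-up and "if" tests mentioned above come from.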
A lone -C specifies that the scanner tables should be compressed but
neither equivalence classes nor meta-equivalence classes should be
used.
The options -Cf or -CF and -Cm do not make sense together - there is
no opportunity for meta-equivalence classes if the table is not
being compressed. Otherwise the options may be freely mixed.
The default setting is -Cem, which specifies that flex should
generate equivalence classes and meta-equivalence classes. This
setting provides the highest degree of table compression. You can
trade off faster-executing scanners at the cost of larger tables
with the following generally being true:
slowest & smallest
-Cem
-Cm
-Ce
-C
-C{f,F}e
-C{f,F}
fastest & largest
26 May 1990 23
FLEX(1) Minix Programmer's Manual FLEX(1)
Note that scanners with the smallest tables are usually generated
and compiled the quickest, so during development you will usually
want to use the default, maximal compression.
-Cfe is often a good compromise between speed and size for
production scanners.
-C options are not cumulative; whenever the flag is encountered, the
previous -C settings are forgotten.
-Sskeleton_file
overrides the default skeleton file from which flex constructs its
scanners. You'll never need this option unless you are doing flex
maintenance or development.
PERFORMANCE CONSIDERATIONS
The main design goal of flex is that it generate high-performance
scanners. It has been optimized for dealing well with large sets of
rules. Aside from the effects of table compression on scanner speed
outlined above, there are a number of options/actions which degrade
performance. These are, from most expensive to least:
REJECT
pattern sets that require backtracking
arbitrary trailing context
'^' beginning-of-line operator
yymore()
with the first three all being quite expensive and the last two being
quite cheap.
REJECT should be avoided at all costs when performance is important. It
is a particularly expensive option.
Getting rid of backtracking is messy and often may be an enormous amount
of work for a complicated scanner. In principle, one begins by using the
-b flag to generate a lex.backtrack file. For example, on the input
%%
foo return TOK_KEYWORD;
foobar return TOK_KEYWORD;
the file looks like:
State #6 is non-accepting -
associated rule line numbers:
2 3
out-transitions: [ o ]
jam-transitions: EOF [ \001-n p-\177 ]
State #8 is non-accepting -
associated rule line numbers:
3
out-transitions: [ a ]
jam-transitions: EOF [ \001-` b-\177 ]
State #9 is non-accepting -
associated rule line numbers:
3
out-transitions: [ r ]
jam-transitions: EOF [ \001-q s-\177 ]
Compressed tables always backtrack.
The first few lines tell us that there's a scanner state in which it can
make a transition on an 'o' but not on any other character, and that in
that state the currently scanned text does not match any rule. The state
occurs when trying to match the rules found at lines 2 and 3 in the input
file. If the scanner is in that state and then reads something other
than an 'o', it will have to backtrack to find a rule which is matched.
With a bit of headscratching one can see that this must be the state it's
in when it has seen "fo". When this has happened, if anything other than
another 'o' is seen, the scanner will have to back up to simply match the
'f' (by the default rule).
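This back-up behaviour can be sketched in C. The following is a hypothetical illustration (not the code flex generates) of a scanner for just the "foo"/"foobar" rules above: it records the position of the last accepting state as it advances, and when it jams it falls back to that position — the re-reading of the intervening characters is what makes backtracking costly:

```c
#include <assert.h>

/* Hypothetical sketch of back-up (not code flex emits): find the longest
 * match of "foo" or "foobar" at the start of s, with the default rule
 * matching a single character.  Returns the length matched: 6, 3, or 1. */
static int longest_match(const char *s)
{
    static const char kw[] = "foobar";
    int i, last_accept = 1;           /* default rule: match one char */

    if (s[0] != 'f')
        return 1;
    for (i = 1; i < 6 && s[i] == kw[i]; i++)
        if (i == 2)                   /* "foo" seen: an accepting state */
            last_accept = 3;
    if (i == 6)
        return 6;                     /* "foobar" matched outright */
    return last_accept;               /* jammed: back up to last accept */
}
```

On input "fox" the sketch reaches the "fo" state (State #6 above), jams on the 'x', and backs up to match just the 'f' by the default rule; on "fooba " it jams after "fooba" (State #9) and backs up to accept "foo".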
The comment regarding State #8 indicates there's a problem when "foob"
has been scanned. Indeed, on any character other than a 'b', the scanner
will have to back up to accept "foo". Similarly, the comment for State
#9 concerns when "fooba" has been scanned.
The final comment reminds us that there's no point going to all the
trouble of removing backtracking from the rules unless we're using -f or
-F, since there's no performance gain doing so with compressed scanners.
The way to remove the backtracking is to add "error" rules:
%%
foo return TOK_KEYWORD;
foobar return TOK_KEYWORD;
fooba |
foob |
fo {
/* false alarm, not really a keyword */
return TOK_ID;
}
Eliminating backtracking among a list of keywords can also be done using
a "catch-all" rule:
%%
foo return TOK_KEYWORD;
foobar return TOK_KEYWORD;
[a-z]+ return TOK_ID;
This is usually the best solution when appropriate.
Backtracking messages tend to cascade. With a complicated set of rules
it's not uncommon to get hundreds of messages. If one can decipher them,
though, it often only takes a dozen or so rules to eliminate the
backtracking (though it's easy to make a mistake and have an error rule
accidentally match a valid token. A possible future flex feature will be
to automatically add rules to eliminate backtracking).
Variable trailing context (where both the leading and trailing parts do
not have a fixed length) entails almost the same performance loss as
REJECT (i.e., substantial). So when possible a rule like:
%%
mouse|rat/(cat|dog) run();
is better written:
%%
mouse/cat|dog run();
rat/cat|dog run();
or as
%%
mouse|rat/cat run();
mouse|rat/dog run();
Note that here the special '|' action does not provide any savings, and
can even make things worse (see BUGS in flex(1)).
Another area where the user can increase a scanner's performance (and one
that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run. This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g., yytext) for
the action. Recall the scanner for C comments:
%x comment
%%
int line_num = 1;
"/*" BEGIN(comment);
<comment>[^*\n]*
<comment>"*"+[^*/\n]*
<comment>\n ++line_num;
<comment>"*"+"/" BEGIN(INITIAL);
This could be sped up by writing it as:
%x comment
%%
int line_num = 1;
"/*" BEGIN(comment);
<comment>[^*\n]*
<comment>[^*\n]*\n ++line_num;
<comment>"*"+[^*/\n]*
<comment>"*"+[^*/\n]*\n ++line_num;
<comment>"*"+"/" BEGIN(INITIAL);
Now instead of each newline requiring the processing of another action,
recognizing the newlines is "distributed" over the other rules to keep
the matched text as long as possible. Note that adding rules does not
slow down the scanner! The speed of the scanner is independent of the
number of rules or (modulo the considerations given at the beginning of
this section) how complicated the rules are with regard to operators such
as '*' and '|'.
A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line and with
no other extraneous characters, and recognize all the keywords. A
natural first approach is:
%%
asm |
auto |
break |
... etc ...
volatile |
while /* it's a keyword */
.|\n /* it's not a keyword */
To eliminate the back-tracking, introduce a catch-all rule:
%%
asm |
auto |
break |
... etc ...
volatile |
while /* it's a keyword */
[a-z]+ |
.|\n /* it's not a keyword */
Now, if it's guaranteed that there's exactly one word per line, then we
can reduce the total number of matches by half by merging in the
recognition of newlines with that of the other tokens:
%%
asm\n |
auto\n |
break\n |
... etc ...
volatile\n |
while\n /* it's a keyword */
[a-z]+\n |
.|\n /* it's not a keyword */
One has to be careful here, as we have now reintroduced backtracking into
the scanner. In particular, while we know that there will never be any
characters in the input stream other than letters or newlines, flex can't
figure this out, and it will plan for possibly needing backtracking when
it has scanned a token like "auto" and then the next character is
something other than a newline or a letter. Previously it would then
just match the "auto" rule and be done, but now it has no "auto" rule,
only an "auto\n" rule. To eliminate the possibility of backtracking, we
could either duplicate all rules but without final newlines, or, since we
never expect to encounter such an input and therefore don't care how it