📄 flexdoc.1

📁 C++版词法分析、语法分析器
💻 1
📖 第 1 页 / 共 5 页
字号:
and variable trailing context (see the BUGS section in flex(1))
entails a substantial performance penalty; use of
.I yymore(),
the
.B ^
operator,
and the
.B -I
flag entail minor performance penalties.
.TP
.B -s
causes the
.I default rule
(that unmatched scanner input is echoed to
.I stdout)
to be suppressed.  If the scanner encounters input that does not
match any of its rules, it aborts with an error.  This option is
useful for finding holes in a scanner's rule set.
.TP
.B -t
instructs
.I flex
to write the scanner it generates to standard output instead
of
.B lex.yy.c.
.TP
.B -v
specifies that
.I flex
should write to
.I stderr
a summary of statistics regarding the scanner it generates.
Most of the statistics are meaningless to the casual
.I flex
user, but the
first line identifies the version of
.I flex,
which is useful for figuring
out where you stand with respect to patches and new releases,
and the next two lines give the date when the scanner was created
and a summary of the flags which were in effect.
.TP
.B -F
specifies that the
.ul
fast
scanner table representation should be used.  This representation is
about as fast as the full table representation
.ul
(-f),
and for some sets of patterns will be considerably smaller (and for
others, larger).  In general, if the pattern set contains both "keywords"
and a catch-all, "identifier" rule, such as in the set:
.nf

    "case"    return TOK_CASE;
    "switch"  return TOK_SWITCH;
    ...
    "default" return TOK_DEFAULT;
    [a-z]+    return TOK_ID;

.fi
then you're better off using the full table representation.  If only
the "identifier" rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
.ul
-F.
.IP
This option is equivalent to
.B -CF
(see below).
.TP
.B -I
instructs
.I flex
to generate an
.I interactive
scanner.  Normally, scanners generated by
.I flex
always look ahead one
character before deciding that a rule has been matched.  At the cost of
some scanning overhead,
.I flex
will generate a scanner which only looks ahead
when needed.  Such scanners are called
.I interactive
because if you want to write a scanner for an interactive system such as a
command shell, you will probably want the user's input to be terminated
with a newline, and without
.B -I
the user will have to type a character in addition to the newline in order
to have the newline recognized.  This leads to dreadful interactive
performance.
.IP
If all this seems to confusing, here's the general rule: if a human will
be typing in input to your scanner, use
.B -I,
otherwise don't; if you don't care about squeezing the utmost performance
from your scanner and you
don't want to make any assumptions about the input to your scanner,
use
.B -I.
.IP
Note,
.B -I
cannot be used in conjunction with
.I full
or
.I fast tables,
i.e., the
.B -f, -F, -Cf,
or
.B -CF
flags.
.TP
.B -L
instructs
.I flex
not to generate
.B #line
directives.  Without this option,
.I flex
peppers the generated scanner
with #line directives so error messages in the actions will be correctly
located with respect to the original
.I flex
input file, and not to
the fairly meaningless line numbers of
.B lex.yy.c.
(Unfortunately
.I flex
does not presently generate the necessary directives
to "retarget" the line numbers for those parts of
.B lex.yy.c
which it generated.  So if there is an error in the generated code,
a meaningless line number is reported.)
.TP
.B -T
makes
.I flex
run in
.I trace
mode.  It will generate a lot of messages to
.I stdout
concerning
the form of the input and the resultant non-deterministic and deterministic
finite automata.  This option is mostly for use in maintaining
.I flex.
.TP
.B -8
instructs
.I flex
to generate an 8-bit scanner, i.e., one which can recognize 8-bit
characters.  On some sites,
.I flex
is installed with this option as the default.  On others, the default
is 7-bit characters.  To see which is the case, check the verbose
.B (-v)
output for "equivalence classes created".  If the denominator of
the number shown is 128, then by default
.I flex
is generating 7-bit characters.  If it is 256, then the default is
8-bit characters and the
.B -8
flag is not required (but may be a good idea to keep the scanner
specification portable).  Feeding a 7-bit scanner 8-bit characters
will result in infinite loops, bus errors, or other such fireworks,
so when in doubt, use the flag.  Note that if equivalence classes
are used, 8-bit scanners take only slightly more table space than
7-bit scanners (128 bytes, to be exact); if equivalence classes are
not used, however, then the tables may grow up to twice their
7-bit size.
.TP 
.B -C[efmF]
controls the degree of table compression.
.IP
.B -Ce
directs
.I flex
to construct
.I equivalence classes,
i.e., sets of characters
which have identical lexical properties (for example, if the only
appearance of digits in the
.I flex
input is in the character class
"[0-9]" then the digits '0', '1', ..., '9' will all be put
in the same equivalence class).  Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically
a factor of 2-5) and are pretty cheap performance-wise (one array
look-up per character scanned).
.IP
.B -Cf
specifies that the
.I full
scanner tables should be generated -
.I flex
should not compress the
tables by taking advantages of similar transition functions for
different states.
.IP
.B -CF
specifies that the alternate fast scanner representation (described
above under the
.B -F
flag)
should be used.
.IP
.B -Cm
directs
.I flex
to construct
.I meta-equivalence classes,
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together.  Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two "if" tests and one
array look-up per character scanned).
.IP
A lone
.B -C
specifies that the scanner tables should be compressed but neither
equivalence classes nor meta-equivalence classes should be used.
.IP
The options
.B -Cf
or
.B -CF
and
.B -Cm
do not make sense together - there is no opportunity for meta-equivalence
classes if the table is not being compressed.  Otherwise the options
may be freely mixed.
.IP
The default setting is
.B -Cem,
which specifies that
.I flex
should generate equivalence classes
and meta-equivalence classes.  This setting provides the highest
degree of table compression.  You can trade off
faster-executing scanners at the cost of larger tables with
the following generally being true:
.nf

    slowest & smallest
          -Cem
          -Cm
          -Ce
          -C
          -C{f,F}e
          -C{f,F}
    fastest & largest

.fi
Note that scanners with the smallest tables are usually generated and
compiled the quickest, so
during development you will usually want to use the default, maximal
compression.
.IP
.B -Cfe
is often a good compromise between speed and size for production
scanners.
.IP
.B -C
options are not cumulative; whenever the flag is encountered, the
previous -C settings are forgotten.
.TP
.B -Sskeleton_file
overrides the default skeleton file from which
.I flex
constructs its scanners.  You'll never need this option unless you are doing
.I flex
maintenance or development.
.SH PERFORMANCE CONSIDERATIONS
The main design goal of
.I flex
is that it generate high-performance scanners.  It has been optimized
for dealing well with large sets of rules.  Aside from the effects
of table compression on scanner speed outlined above,
there are a number of options/actions which degrade performance.  These
are, from most expensive to least:
.nf

    REJECT

    pattern sets that require backtracking
    arbitrary trailing context

    '^' beginning-of-line operator
    yymore()

.fi
with the first three all being quite expensive and the last two
being quite cheap.
.LP
.B REJECT
should be avoided at all costs when performance is important.
It is a particularly expensive option.
.LP
Getting rid of backtracking is messy and often may be an enormous
amount of work for a complicated scanner.  In principal, one begins
by using the
.B -b 
flag to generate a
.I lex.backtrack
file.  For example, on the input
.nf

    %%
    foo        return TOK_KEYWORD;
    foobar     return TOK_KEYWORD;

.fi
the file looks like:
.nf

    State #6 is non-accepting -
     associated rule line numbers:
           2       3
     out-transitions: [ o ]
     jam-transitions: EOF [ \\001-n  p-\\177 ]

    State #8 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ a ]
     jam-transitions: EOF [ \\001-`  b-\\177 ]

    State #9 is non-accepting -
     associated rule line numbers:
           3
     out-transitions: [ r ]
     jam-transitions: EOF [ \\001-q  s-\\177 ]

    Compressed tables always backtrack.

.fi
The first few lines tell us that there's a scanner state in
which it can make a transition on an 'o' but not on any other
character, and that in that state the currently scanned text does not match
any rule.  The state occurs when trying to match the rules found
at lines 2 and 3 in the input file.
If the scanner is in that state and then reads
something other than an 'o', it will have to backtrack to find
a rule which is matched.  With
a bit of headscratching one can see that this must be the
state it's in when it has seen "fo".  When this has happened,
if anything other than another 'o' is seen, the scanner will
have to back up to simply match the 'f' (by the default rule).
.LP
The comment regarding State #8 indicates there's a problem
when "foob" has been scanned.  Indeed, on any character other
than a 'b', the scanner will have to back up to accept "foo".
Similarly, the comment for State #9 concerns when "fooba" has
been scanned.
.LP
The final comment reminds us that there's no point going to
all the trouble of removing backtracking from the rules unless
we're using
.B -f
or
.B -F,
since there's no performance gain doing so with compressed scanners.
.LP
The way to remove the backtracking is to add "error" rules:
.nf

    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    fooba       |
    foob        |
    fo          {
                /* false alarm, not really a keyword */
                return TOK_ID;
                }

.fi
.LP
Eliminating backtracking among a list of keywords can also be
done using a "catch-all" rule:
.nf

    %%
    foo         return TOK_KEYWORD;
    foobar      return TOK_KEYWORD;

    [a-z]+      return TOK_ID;

.fi
This is usually the best solution when appropriate.
.LP
Backtracking messages tend to cascade.
With a complicated set of rules it's not uncommon to get hundreds
of messages.  If one can decipher them, though, it often
only takes a dozen or so rules to eliminate the backtracking (though
it's easy to make a mistake and have an error rule accidentally match
a valid token.  A possible future
.I flex
feature will be to automatically add rules to eliminate backtracking).
.LP
.I Variable
trailing context (where both the leading and trailing parts do not have
a fixed length) entails almost the same performance loss as
.I REJECT
(i.e., substantial).  So when possible a rule like:
.nf

    %%
    mouse|rat/(cat|dog)   run();

.fi
is better written:
.nf

    %%
    mouse/cat|dog         run();
    rat/cat|dog           run();

.fi
or as
.nf

    %%
    mouse|rat/cat         run();
    mouse|rat/dog         run();

.fi
Note that here the special '|' action does
.I not
provide any savings, and can even make things worse (see
.B BUGS
in flex(1)).
.LP
Another area where the user can increase a scanner's performance
(and one that's easier to implement) arises from the fact that
the longer the tokens matched, the faster the scanner will run.
This is because with long tokens the processing of most input
characters takes place in the (short) inner scanning loop, and
does not often have to go through the additional work of setting up
the scanning environment (e.g.,
.B yytext)
for the action.  Recall the scanner for C comments:
.nf

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\\n]*
    <comment>"*"+[^*/\\n]*
    <comment>\\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

.fi
This could be sped up by writing it as:
.nf

    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\\n]*
    <comment>[^*\\n]*\\n      ++line_num;
    <comment>"*"+[^*/\\n]*
    <comment>"*"+[^*/\\n]*\\n ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);

.fi
Now instead of each newline requiring the processing of another
action, recognizing the newlines is "distributed" over the other rules
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -