perlreguts.pod
来自「视频监控网络部分的协议ddns,的模块的实现代码,请大家大胆指正.」· POD 代码 · 共 888 行 · 第 1/3 页
POD
888 行
=head1 NAMEperlreguts - Description of the Perl regular expression engine.=head1 DESCRIPTIONThis document is an attempt to shine some light on the guts of the regexengine and how it works. The regex engine represents a significant chunkof the perl codebase, but is relatively poorly understood. This documentis a meagre attempt at addressing this situation. It is derived from theauthor's experience, comments in the source code, other papers on theregex engine, feedback on the perl5-porters mail list, and no doubt otherplaces as well.B<NOTICE!> It should be clearly understood that the behavior andstructures discussed in this represents the state of the engine as theauthor understood it at the time of writing. It is B<NOT> an APIdefinition, it is purely an internals guide for those who want to hackthe regex engine, or understand how the regex engine works. Readers ofthis document are expected to understand perl's regex syntax and itsusage in detail. If you want to learn about the basics of Perl'sregular expressions, see L<perlre>. And if you want to replace theregex engine with your own see see L<perlreapi>.=head1 OVERVIEW=head2 A quick note on termsThere is some debate as to whether to say "regexp" or "regex". In thisdocument we will use the term "regex" unless there is a special reasonnot to, in which case we will explain why.When speaking about regexes we need to distinguish between their sourcecode form and their internal form. In this document we will use the term"pattern" when we speak of their textual, source code form, and the term"program" when we speak of their internal representation. Thesecorrespond to the terms I<S-regex> and I<B-regex> that Mark JasonDominus employs in his paper on "Rx" ([1] in L</REFERENCES>).=head2 What is a regular expression engine?A regular expression engine is a program that takes a set of constraintsspecified in a mini-language, and then applies those constraints to atarget string, and determines whether or not the string satisfies theconstraints. See L<perlre> for a full definition of the language.In less grandiose terms, the first part of the job is to turn a pattern intosomething the computer can efficiently use to find the matching point inthe string, and the second part is performing the search itself.To do this we need to produce a program by parsing the text. We thenneed to execute the program to find the point in the string thatmatches. And we need to do the whole thing efficiently.=head2 Structure of a Regexp Program=head3 High LevelAlthough it is a bit confusing and some people object to the terminology, itis worth taking a look at a comment that hasbeen in F<regexp.h> for years:I<This is essentially a linear encoding of a nondeterministicfinite-state machine (aka syntax charts or "railroad normal form" inparsing technology).>The term "railroad normal form" is a bit esoteric, with "syntaxdiagram/charts", or "railroad diagram/charts" being more common terms.Nevertheless it provides a useful mental image of a regex program: eachnode can be thought of as a unit of track, with a single entry and inmost cases a single exit point (there are pieces of track that fork, butstatistically not many), and the whole forms a layout with asingle entry and single exit point. The matching process can be thoughtof as a car that moves along the track, with the particular route throughthe system being determined by the character read at each possibleconnector point. A car can fall off the track at any point but it mayonly proceed as long as it matches the track.Thus the pattern C</foo(?:\w+|\d+|\s+)bar/> can be thought of as thefollowing chart: [start] | <foo> | +-----+-----+ | | | <\w+> <\d+> <\s+> | | | +-----+-----+ | <bar> | [end]The truth of the matter is that perl's regular expressions these days aremuch more complex than this kind of structure, but visualising it this waycan help when trying to get your bearings, and it matches thecurrent implementation pretty closely.To be more precise, we will say that a regex program is an encodingof a graph. Each node in the graph corresponds to part ofthe original regex pattern, such as a literal string or a branch,and has a pointer to the nodes representing the next componentto be matched. Since "node" and "opcode" already have other meanings in theperl source, we will call the nodes in a regex program "regops".The program is represented by an array of C<regnode> structures, one ormore of which represent a single regop of the program. StructC<regnode> is the smallest struct needed, and has a field structure which isshared with all the other larger structures.The "next" pointers of all regops except C<BRANCH> implement concatenation;a "next" pointer with a C<BRANCH> on both ends of it is connecting twoalternatives. [Here we have one of the subtle syntax dependencies: anindividual C<BRANCH> (as opposed to a collection of them) is neverconcatenated with anything because of operator precedence.]The operand of some types of regop is a literal string; for others,it is a regop leading into a sub-program. In particular, the operandof a C<BRANCH> node is the first regop of the branch.B<NOTE>: As the railroad metaphor suggests, this is B<not> a treestructure: the tail of the branch connects to the thing following theset of C<BRANCH>es. It is a like a single line of railway track thatsplits as it goes into a station or railway yard and rejoins as it comesout the other side.=head3 RegopsThe base structure of a regop is defined in F<regexp.h> as follows: struct regnode { U8 flags; /* Various purposes, sometimes overridden */ U8 type; /* Opcode value as specified by regnodes.h */ U16 next_off; /* Offset in size regnode */ };Other larger C<regnode>-like structures are defined in F<regcomp.h>. Theyare almost like subclasses in that they have the same fields asC<regnode>, with possibly additional fields following inthe structure, and in some cases the specific meaning (and name)of some of base fields are overridden. The following is a morecomplete description.=over 4=item C<regnode_1>=item C<regnode_2>C<regnode_1> structures have the same header, followed by a singlefour-byte argument; C<regnode_2> structures contain two two-bytearguments instead: regnode_1 U32 arg1; regnode_2 U16 arg1; U16 arg2;=item C<regnode_string>C<regnode_string> structures, used for literal strings, follow the headerwith a one-byte length and then the string data. Strings are padded onthe end with zero bytes so that the total length of the node is amultiple of four bytes: regnode_string char string[1]; U8 str_len; /* overrides flags */=item C<regnode_charclass>Character classes are represented by C<regnode_charclass> structures,which have a four-byte argument and then a 32-byte (256-bit) bitmapindicating which characters are included in the class. regnode_charclass U32 arg1; char bitmap[ANYOF_BITMAP_SIZE];=item C<regnode_charclass_class>There is also a larger form of a char class structure used to representPOSIX char classes called C<regnode_charclass_class> which has anadditional 4-byte (32-bit) bitmap indicating which POSIX char classeshave been included. regnode_charclass_class U32 arg1; char bitmap[ANYOF_BITMAP_SIZE]; char classflags[ANYOF_CLASSBITMAP_SIZE];=backF<regnodes.h> defines an array called C<regarglen[]> which gives the sizeof each opcode in units of C<size regnode> (4-byte). A macro is usedto calculate the size of an C<EXACT> node based on its C<str_len> field.The regops are defined in F<regnodes.h> which is generated fromF<regcomp.sym> by F<regcomp.pl>. Currently the maximum possible numberof distinct regops is restricted to 256, with about a quarter alreadyused.A set of macros makes accessing the fieldseasier and more consistent. These include C<OP()>, which is used to determinethe type of a C<regnode>-like structure; C<NEXT_OFF()>, which is the offset tothe next node (more on this later); C<ARG()>, C<ARG1()>, C<ARG2()>, C<ARG_SET()>,and equivalents for reading and setting the arguments; and C<STR_LEN()>,C<STRING()> and C<OPERAND()> for manipulating strings and regop bearingtypes.=head3 What regop is next?There are three distinct concepts of "next" in the regex engine, andit is important to keep them clear.=over 4=item *There is the "next regnode" from a given regnode, a value which israrely useful except that sometimes it matches up in terms of valuewith one of the others, and that sometimes the code assumes this toalways be so.=item *There is the "next regop" from a given regop/regnode. This is theregop physically located after the the current one, as determined bythe size of the current regop. This is often useful, such as whendumping the structure we use this order to traverse. Sometimes the codeassumes that the "next regnode" is the same as the "next regop", or inother words assumes that the sizeof a given regop type is always goingto be one regnode large.=item *There is the "regnext" from a given regop. This is the regop whichis reached by jumping forward by the value of C<NEXT_OFF()>,or in a few cases for longer jumps by the C<arg1> field of the C<regnode_1>structure. The subroutine C<regnext()> handles this transparently.This is the logical successor of the node, which in some cases, likethat of the C<BRANCH> regop, has special meaning.=back=head1 Process OverviewBroadly speaking, performing a match of a string against a patterninvolves the following steps:=over 5=item A. Compilation=over 5=item 1. Parsing for size=item 2. Parsing for construction=item 3. Peep-hole optimisation and analysis=back=item B. Execution=over 5=item 4. Start position and no-match optimisations=item 5. Program execution=back=backWhere these steps occur in the actual execution of a perl program isdetermined by whether the pattern involves interpolating any stringvariables. If interpolation occurs, then compilation happens at run time. If itdoes not, then compilation is performed at compile time. (The C</o> modifier changes this,as does C<qr//> to a certain extent.) The engine doesn't really care thatmuch.=head2 CompilationThis code resides primarily in F<regcomp.c>, along with the header filesF<regcomp.h>, F<regexp.h> and F<regnodes.h>.Compilation starts with C<pregcomp()>, which is mostly an initialisationwrapper which farms work out to two other routines for the heavy lifting: thefirst is C<reg()>, which is the start point for parsing; the second,C<study_chunk()>, is responsible for optimisation.Initialisation in C<pregcomp()> mostly involves the creation and data-fillingof a special structure, C<RExC_state_t> (defined in F<regcomp.c>).Almost all internally-used routines in F<regcomp.h> take a pointer to oneof these structures as their first argument, with the name C<pRExC_state>.This structure is used to store the compilation state and contains many
⌨️ 快捷键说明
复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?