
📄 perlreguts.1

have a somewhat incestuous relationship with overlap between their functions,
and \f(CW\*(C`pregexec()\*(C'\fR may even call \f(CW\*(C`re_intuit_start()\*(C'\fR on its own. Nevertheless,
other parts of the perl source code may call into either, or both.
.PP
Execution of the interpreter itself used to be recursive, but thanks to the
efforts of Dave Mitchell in the 5.9.x development track, that has changed: now an
internal stack is maintained on the heap and the routine is fully
iterative. This can make it tricky as the code is quite conservative
about what state it stores, with the result that two consecutive lines in the
code can actually be running in totally different contexts due to the
simulated recursion.
.PP
\fIStart position and no-match optimisations\fR
.IX Subsection "Start position and no-match optimisations"
.PP
\&\f(CW\*(C`re_intuit_start()\*(C'\fR is responsible for handling start points and no-match
optimisations as determined by the results of the analysis done by
\&\f(CW\*(C`study_chunk()\*(C'\fR (and described in \*(L"Peep-hole Optimisation and Analysis\*(R").
.PP
The basic structure of this routine is to try to find the start\- and/or
end-points of where the pattern could match, and to ensure that the string
is long enough to match the pattern.
It tries to use more efficient
methods over less efficient methods and may involve considerable
cross-checking of constraints to find the place in the string that matches.
For instance it may try to determine that a given fixed string must be
not only present but a certain number of chars before the end of the
string.
.PP
It calls several other routines, such as \f(CW\*(C`fbm_instr()\*(C'\fR which does
Fast Boyer-Moore matching and \f(CW\*(C`find_byclass()\*(C'\fR which is responsible for
finding the start using the first mandatory regop in the program.
.PP
When the optimisation criteria have been satisfied, \f(CW\*(C`regtry()\*(C'\fR is called
to perform the match.
.PP
\fIProgram execution\fR
.IX Subsection "Program execution"
.PP
\&\f(CW\*(C`pregexec()\*(C'\fR is the main entry point for running a regex. It contains
support for initialising the regex interpreter's state, running
\&\f(CW\*(C`re_intuit_start()\*(C'\fR if needed, and running the interpreter on the string
from various start positions as needed. When it is necessary to use
the regex interpreter, \f(CW\*(C`pregexec()\*(C'\fR calls \f(CW\*(C`regtry()\*(C'\fR.
.PP
\&\f(CW\*(C`regtry()\*(C'\fR is the entry point into the regex interpreter. It expects
as arguments a pointer to a \f(CW\*(C`regmatch_info\*(C'\fR structure and a pointer to
a string.  It returns an integer 1 for success and a 0 for failure.
It is basically a set-up wrapper around \f(CW\*(C`regmatch()\*(C'\fR.
.PP
\&\f(CW\*(C`regmatch()\*(C'\fR is the main \*(L"recursive loop\*(R" of the interpreter. It is
basically a giant switch statement that implements a state machine, where
the possible states are the regops themselves, plus a number of additional
intermediate and failure states.
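.PP
The general shape described here, a switch-driven state machine that keeps
its backtracking state on a heap-allocated stack instead of recursing, can
be sketched in miniature. This is not perl's code: the names
\f(CWtoy_match()\fR, \f(CWframe\fR and \f(CWtoy_op\fR, and the tiny pattern
language (literal characters, a dot wildcard and a postfix star), are all
hypothetical and exist only for illustration.
.PP
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A saved choice point: where to resume in the pattern and the string. */
typedef struct { size_t pi, si; } frame;

/* The "regops" of this toy engine (hypothetical, not perl's). */
enum toy_op { OP_END, OP_STAR, OP_CHAR };

static enum toy_op op_at(const char *pat, size_t plen, size_t pi)
{
    if (pi == plen) return OP_END;
    if (pi + 1 < plen && pat[pi + 1] == '*') return OP_STAR;
    return OP_CHAR;
}

/* Does the single-character pattern item c match at str[si]? */
static int char_ok(char c, const char *str, size_t slen, size_t si)
{
    return si < slen && (c == '.' || c == str[si]);
}

/* Returns 1 if pat matches the whole of str, 0 otherwise. */
int toy_match(const char *pat, const char *str)
{
    assert(pat != NULL && str != NULL);
    size_t plen = strlen(pat), slen = strlen(str);
    size_t pi = 0, si = 0, top = 0, cap = 8;
    int result = -1;                      /* -1 means "still running" */
    frame *stack = malloc(cap * sizeof *stack);
    if (stack == NULL) return 0;

    while (result < 0) {
        int failed = 0;
        switch (op_at(pat, plen, pi)) {
        case OP_END:
            if (si == slen) result = 1; else failed = 1;
            break;
        case OP_STAR:
            if (char_ok(pat[pi], str, slen, si)) {
                /* Greedy: consume one char, remember the "stop here" branch. */
                if (top == cap) {
                    frame *bigger = realloc(stack, 2 * cap * sizeof *stack);
                    if (bigger == NULL) { free(stack); return 0; }
                    stack = bigger;
                    cap *= 2;
                }
                stack[top].pi = pi + 2;
                stack[top].si = si;
                top++;
                si++;
            } else {
                pi += 2;                  /* no further repetitions */
            }
            break;
        case OP_CHAR:
            if (char_ok(pat[pi], str, slen, si)) { pi++; si++; }
            else failed = 1;
            break;
        }
        if (failed) {
            if (top == 0) {
                result = 0;               /* no alternatives left */
            } else {
                top--;                    /* simulated return: pop a frame */
                pi = stack[top].pi;
                si = stack[top].si;
            }
        }
    }
    free(stack);
    return result;
}
```
.PP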
A few of the states are implemented as
subroutines but the bulk are inline code.
.SH "MISCELLANEOUS"
.IX Header "MISCELLANEOUS"
.Sh "Unicode and Localisation Support"
.IX Subsection "Unicode and Localisation Support"
When dealing with strings containing characters that cannot be represented
using an eight-bit character set, perl uses an internal representation
that is a permissive version of Unicode's \s-1UTF\-8\s0 encoding[2]. This uses single
bytes to represent characters from the \s-1ASCII\s0 character set, and sequences
of two or more bytes for all other characters. (See perlunitut
for more information about the relationship between \s-1UTF\-8\s0 and perl's
encoding, utf8 \*(-- the difference isn't important for this discussion.)
.PP
No matter how you look at it, Unicode support is going to be a pain in a
regex engine. Tricks that might be fine when you have 256 possible
characters often won't scale to handle the size of the Unicode character
set.  Things you can take for granted with \s-1ASCII\s0 may not be true with
Unicode. For instance, in \s-1ASCII\s0, it is safe to assume that
\&\f(CW\*(C`sizeof(char1) == sizeof(char2)\*(C'\fR, but in \s-1UTF\-8\s0 it isn't. Unicode case folding is
vastly more complex than the simple rules of \s-1ASCII\s0, and even when not
using Unicode but only localised single byte encodings, things can get
tricky (for example, \fB\s-1LATIN\s0 \s-1SMALL\s0 \s-1LETTER\s0 \s-1SHARP\s0 S\fR (U+00DF, \*8)
should match '\s-1SS\s0' in localised case-insensitive matching).
.PP
Making things worse is that \s-1UTF\-8\s0 support was a later addition to the
regex engine (as it was to perl) and this necessarily made things a lot
more complicated. Obviously it is easier to design a regex engine with
Unicode support in mind from the beginning than it is to retrofit it to
one that wasn't.
.PP
Nearly all regops that involve looking at the input string have
two cases, one for \s-1UTF\-8\s0, and one not.
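.PP
To make the variable-width point concrete, here is a minimal sketch of how
the byte length of a character depends on its code point in standard
\s-1UTF\-8\s0. The helper names \f(CWutf8_encoded_len()\fR and
\f(CWutf8_char_count()\fR are hypothetical, not perl \s-1API\s0, and the sketch
follows strict \s-1UTF\-8\s0 rather than perl's more permissive utf8.
.PP
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode code point cp in standard UTF-8, or 0 if cp is
 * above U+10FFFF. Surrogates are not rejected in this sketch. */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)      return 1;   /* ASCII: always one byte */
    if (cp < 0x800)     return 2;   /* e.g. U+00DF, U+0390 */
    if (cp < 0x10000)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

/* Count characters in a NUL-terminated, well-formed UTF-8 string by
 * skipping continuation bytes (those of the form 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```
.PP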
In fact, it's often more complex
than that, as the pattern may be \s-1UTF\-8\s0 as well.
.PP
Care must be taken when making changes to make sure that you handle
\&\s-1UTF\-8\s0 properly, both at compile time and at execution time, including
when the string and pattern are mismatched.
.PP
The following comment in \fIregcomp.h\fR gives an example of exactly how
tricky this can be:
.PP
.Vb 1
\&    Two problematic code points in Unicode casefolding of EXACT nodes:
\&
\&    U+0390 \- GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
\&    U+03B0 \- GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
\&
\&    which casefold to
\&
\&    Unicode                      UTF\-8
\&
\&    U+03B9 U+0308 U+0301         0xCE 0xB9 0xCC 0x88 0xCC 0x81
\&    U+03C5 U+0308 U+0301         0xCF 0x85 0xCC 0x88 0xCC 0x81
\&
\&    This means that in case\-insensitive matching (or "loose matching",
\&    as Unicode calls it), an EXACTF of length six (the UTF\-8 encoded
\&    byte length of the above casefolded versions) can match a target
\&    string of length two (the byte length of UTF\-8 encoded U+0390 or
\&    U+03B0). This would rather mess up the minimum length computation.
\&
\&    What we\*(Aqll do is to look for the tail four bytes, and then peek
\&    at the preceding two bytes to see whether we need to decrease
\&    the minimum length by four (six minus two).
\&
\&    Thanks to the design of UTF\-8, there cannot be false matches:
\&    A sequence of valid UTF\-8 bytes cannot be a subsequence of
\&    another valid sequence of UTF\-8 bytes.
.Ve
.Sh "Base Structures"
.IX Subsection "Base Structures"
The \f(CW\*(C`regexp\*(C'\fR structure described in perlreapi is common to all
regex engines. Two of its fields are intended for the private use
of the regex engine that compiled the pattern. These are the
\&\f(CW\*(C`intflags\*(C'\fR and \f(CW\*(C`pprivate\*(C'\fR members. The \f(CW\*(C`pprivate\*(C'\fR is a void pointer to
an arbitrary structure whose use and management is the responsibility
of the compiling engine.
perl will never modify either of these
values. In the case of the stock engine the structure pointed to by
\&\f(CW\*(C`pprivate\*(C'\fR is called \f(CW\*(C`regexp_internal\*(C'\fR.
.PP
The \f(CW\*(C`pprivate\*(C'\fR and \f(CW\*(C`intflags\*(C'\fR fields of the \f(CW\*(C`regexp\*(C'\fR structure contain data
specific to each engine.
.PP
There are two structures used to store a compiled regular expression.
One, the \f(CW\*(C`regexp\*(C'\fR structure described in perlreapi, is populated by
the engine currently being used, and some of its fields are read by perl to
implement things such as the stringification of \f(CW\*(C`qr//\*(C'\fR.
.PP
The other structure is pointed to by the \f(CW\*(C`regexp\*(C'\fR struct's
\&\f(CW\*(C`pprivate\*(C'\fR field and, together with \f(CW\*(C`intflags\*(C'\fR in the same struct, is
considered to be the property of the regex engine which compiled the
regular expression.
.PP
The regexp structure contains all the data that perl needs to be aware of
to properly work with the regular expression. It includes data about
optimisations that perl can use to determine if the regex engine should
really be used, and various other control info that is needed to properly
execute patterns in various contexts, such as whether the pattern is anchored in
some way, what flags were used during the compile, or whether the
program contains special constructs that perl needs to be aware of.
.PP
In addition it contains two fields that are intended for the private use
of the regex engine that compiled the pattern. These are the \f(CW\*(C`intflags\*(C'\fR
and \f(CW\*(C`pprivate\*(C'\fR members. The \f(CW\*(C`pprivate\*(C'\fR is a void pointer to an arbitrary
structure whose use and management is the responsibility of the compiling
engine.
perl will never modify either of these values.
.PP
As mentioned earlier, in the case of the default engine, the \f(CW\*(C`pprivate\*(C'\fR
will be a pointer to a \f(CW\*(C`regexp_internal\*(C'\fR structure which holds the compiled
program and any additional data that is private to the regex engine
implementation.
.PP
\fIPerl's \f(CI\*(C`pprivate\*(C'\fI structure\fR
.IX Subsection "Perl's pprivate structure"
.PP
The following structure is used as the \f(CW\*(C`pprivate\*(C'\fR struct by perl's
regex engine. Since it is specific to perl it is only of curiosity
value to other engine implementations.
.PP
.Vb 10
\&    typedef struct regexp_internal {
\&            regexp_paren_ofs *swap; /* Swap copy of *startp / *endp */
\&            U32 *offsets;           /* offset annotations 20001228 MJD
\&                                       data about mapping the program to the
\&                                       string*/
\&            regnode *regstclass;    /* Optional startclass as identified or constructed
\&                                       by the optimiser */
\&            struct reg_data *data;  /* Additional miscellaneous data used by the program.
\&                                       Used to make it easier to clone and free arbitrary
\&                                       data that the regops need. Often the ARG field of
\&                                       a regop is an index into this structure */
\&            regnode program[1];     /* Unwarranted chumminess with compiler. */
\&    } regexp_internal;
.Ve
.ie n .IP """swap""" 5
.el .IP "\f(CWswap\fR" 5
.IX Item "swap"
\&\f(CW\*(C`swap\*(C'\fR is an extra set of startp/endp stored in a \f(CW\*(C`regexp_paren_ofs\*(C'\fR
struct. This is used when the last successful match was from the same pattern
as the current pattern, so that a partial match doesn't overwrite the
previous match's results. When this field is populated the matching
engine will swap buffers before every match attempt. If the match fails,
then it swaps them back.
If it's successful it leaves them. This field
is populated on demand and is by default null.
.ie n .IP """offsets""" 5
.el .IP "\f(CWoffsets\fR" 5
.IX Item "offsets"
Offsets holds a mapping of offset in the \f(CW\*(C`program\*(C'\fR
to offset in the \f(CW\*(C`precomp\*(C'\fR string. This is only used by ActiveState's
visual regex debugger.
.ie n .IP """regstclass""" 5
.el .IP "\f(CWregstclass\fR" 5
.IX Item "regstclass"
Special regop that is used by \f(CW\*(C`re_intuit_start()\*(C'\fR to check if a pattern
can match at a certain position. For instance if the regex engine knows
that the pattern must start with a 'Z' then it can scan the string until
it finds one and then launch the regex engine from there. The routine
that handles this is called \f(CW\*(C`find_byclass()\*(C'\fR. Sometimes this field
points at a regop embedded in the program, and sometimes it points at
an independent synthetic regop that has been constructed by the optimiser.
.ie n .IP """data""" 5
.el .IP "\f(CWdata\fR" 5
.IX Item "data"
This field points at a \f(CW\*(C`reg_data\*(C'\fR structure, which is defined as follows:
.Sp
.Vb 5
\&    struct reg_data {
\&        U32 count;
\&        U8 *what;
\&        void* data[1];
\&    };
.Ve
.Sp
This structure is used for handling data structures that the regex engine
needs to handle specially during a clone or free operation on the compiled
product. Each element in the \f(CW\*(C`data\*(C'\fR array has a corresponding element in the
\&\f(CW\*(C`what\*(C'\fR array. During compilation regops that need special structures stored
will add an element to each array using the \fIadd_data()\fR routine and then store
the index in the regop.
.ie n .IP """program""" 5
.el .IP "\f(CWprogram\fR" 5
.IX Item "program"
Compiled program. Inlined into the structure so the entire struct can be
treated as a single blob.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
perlreapi
.PP
perlre
.PP
perlunitut
.SH "AUTHOR"
.IX Header "AUTHOR"
by Yves Orton, 2006.
.PP
With excerpts from Perl, and contributions and suggestions from
Ronald J.
Kimball, Dave Mitchell, Dominic Dunlop, Mark Jason Dominus,
Stephen McCamant, and David Landgren.
.SH "LICENCE"
.IX Header "LICENCE"
Same terms as Perl.
.SH "REFERENCES"
.IX Header "REFERENCES"
[1] <http://perl.plover.com/Rx/paper/>
.PP
[2] <http://www.unicode.org>
