📄 tech.notes

📁 php-4.4.7学习linux时下载的源代码
💻 NOTES
📖 第 1 页 / 共 2 页
字号:
12 下一页
Technical Notes about PCRE--------------------------These are very rough technical notes that record potentially useful information about PCRE internals.Historical note 1-----------------Many years ago I implemented some regular expression functions to an algorithmsuggested by Martin Richards. These were not Unix-like in form, and were quiterestricted in what they could do by comparison with Perl. The interesting partabout the algorithm was that the amount of space required to hold the compiledform of an expression was known in advance. The code to apply an expression didnot operate by backtracking, as the original Henry Spencer code and currentPerl code does, but instead checked all possibilities simultaneously by keepinga list of current states and checking all of them as it advanced through thesubject string. In the terminology of Jeffrey Friedl's book, it was a "DFAalgorithm", though it was not a traditional Finite State Machine (FSM). Whenthe pattern was all used up, all remaining states were possible matches, andthe one matching the longest subset of the subject string was chosen. This didnot necessarily maximize the individual wild portions of the pattern, as isexpected in Unix and Perl-style regular expressions.Historical note 2-----------------By contrast, the code originally written by Henry Spencer (which wassubsequently heavily modified for Perl) compiles the expression twice: once ina dummy mode in order to find out how much store will be needed, and then forreal. (The Perl version probably doesn't do this any more; I'm talking aboutthe original library.) The execution function operates by backtracking andmaximizing (or, optionally, minimizing in Perl) the amount of the subject thatmatches individual wild portions of the pattern. This is an "NFA algorithm" inFriedl's terminology.OK, here's the real stuff-------------------------For the set of functions that form the "basic" PCRE library (which areunrelated to those mentioned above), I tried at first to invent an algorithmthat used an amount of store bounded by a multiple of the number of charactersin the pattern, to save on compiling time. However, because of the greatercomplexity in Perl regular expressions, I couldn't do this. In any case, afirst pass through the pattern is helpful for other reasons. Computing the memory requirement: how it was--------------------------------------------Up to and including release 6.7, PCRE worked by running a very degenerate firstpass to calculate a maximum store size, and then a second pass to do the realcompile - which might use a bit less than the predicted amount of memory. Theidea was that this would turn out faster than the Henry Spencer code becausethe first pass is degenerate and the second pass can just store stuff straightinto the vector, which it knows is big enough.Computing the memory requirement: how it is-------------------------------------------By the time I was working on a potential 6.8 release, the degenerate first passhad become very complicated and hard to maintain. Indeed one of the earlythings I did for 6.8 was to fix Yet Another Bug in the memory computation. ThenI had a flash of inspiration as to how I could run the real compile function ina "fake" mode that enables it to compute how much memory it would need, whileactually only ever using a few hundred bytes of working memory, and without toomany tests of the mode that might slow it down. So I re-factored the compilingfunctions to work this way. This got rid of about 600 lines of source. Itshould make future maintenance and development easier. As this was such a major change, I never released 6.8, instead upping the number to 7.0 (other quite major changes are also present in the 7.0 release).A side effect of this work is that the previous limit of 200 on the nestingdepth of parentheses was removed. However, there is a downside: pcre_compile()runs more slowly than before (30% or more, depending on the pattern) because itis doing a full analysis of the pattern. My hope is that this is not a bigissue.Traditional matching function-----------------------------The "traditional", and original, matching function is called pcre_exec(), and it implements an NFA algorithm, similar to the original Henry Spencer algorithm and the way that Perl works. Not surprising, since it is intended to be as compatible with Perl as possible. This is the function most users of PCRE will use most of the time.Supplementary matching function-------------------------------From PCRE 6.0, there is also a supplementary matching function called pcre_dfa_exec(). This implements a DFA matching algorithm that searches simultaneously for all possible matches that start at one point in the subject string. (Going back to my roots: see Historical Note 1 above.) This function intreprets the same compiled pattern data as pcre_exec(); however, not all the facilities are available, and those that are do not always work in quite the same way. See the user documentation for details.The algorithm that is used for pcre_dfa_exec() is not a traditional FSM, because it may have a number of states active at one time. More work would be needed at compile time to produce a traditional FSM where only one state is ever active at once. I believe some other regex matchers work this way.Format of compiled patterns---------------------------The compiled form of a pattern is a vector of bytes, containing items ofvariable length. The first byte in an item is an opcode, and the length of theitem is either implicit in the opcode or contained in the data bytes thatfollow it. In many cases below "two-byte" data values are specified. This is in fact justa default when the number is an offset within the compiled pattern. PCRE can becompiled to use 3-byte or 4-byte values for these offsets (impairing theperformance). This is necessary only when patterns whose compiled length isgreater than 64K are going to be processed. In this description, we assume the"normal" compilation options. "Two-byte" data values that are counts (e.g. for quantifiers) are always just two bytes.A list of all the opcodes follows:Opcodes with no following data------------------------------These items are all just one byte long  OP_END                 end of pattern  OP_ANY                 match any character  OP_ANYBYTE             match any single byte, even in UTF-8 mode  OP_SOD                 match start of data: \A  OP_SOM,                start of match (subject + offset): \G  OP_CIRC                ^ (start of data, or after \n in multiline)  OP_NOT_WORD_BOUNDARY   \W  OP_WORD_BOUNDARY       \w  OP_NOT_DIGIT           \D  OP_DIGIT               \d  OP_NOT_WHITESPACE      \S  OP_WHITESPACE          \s  OP_NOT_WORDCHAR        \W  OP_WORDCHAR            \w  OP_EODN                match end of data or \n at end: \Z  OP_EOD                 match end of data: \z  OP_DOLL                $ (end of data, or before \n in multiline)  OP_EXTUNI              match an extended Unicode character   OP_ANYNL               match any Unicode newline sequence   Repeating single characters---------------------------The common repeats (*, +, ?) when applied to a single character use thefollowing opcodes:  OP_STAR  OP_MINSTAR  OP_POSSTAR   OP_PLUS  OP_MINPLUS  OP_POSPLUS   OP_QUERY  OP_MINQUERY  OP_POSQUERY In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.Those with "MIN" in their name are the minimizing versions. Those with "POS" in their names are possessive versions. Each is followed by the character that isto be repeated. Other repeats make use of  OP_UPTO  OP_MINUPTO  OP_POSUPTO   OP_EXACTwhich are followed by a two-byte count (most significant first) and therepeated character. OP_UPTO matches from 0 to the given number. A repeat with anon-zero minimum and a fixed maximum is coded as an OP_EXACT followed by anOP_UPTO (or OP_MINUPTO or OPT_POSUPTO).Repeating character types-------------------------Repeats of things like \d are done exactly as for single characters, exceptthat instead of a character, the opcode for the type is stored in the databyte. The opcodes are:  OP_TYPESTAR  OP_TYPEMINSTAR  OP_TYPEPOSSTAR   OP_TYPEPLUS  OP_TYPEMINPLUS  OP_TYPEPOSPLUS   OP_TYPEQUERY  OP_TYPEMINQUERY  OP_TYPEPOSQUERY   OP_TYPEUPTO  OP_TYPEMINUPTO  OP_TYPEPOSUPTO   OP_TYPEEXACTMatch by Unicode property
12 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -