📄 tech.notes
字号:
Technical Notes about PCRE--------------------------These are very rough technical notes that record potentially useful information about PCRE internals.Historical note 1-----------------Many years ago I implemented some regular expression functions to an algorithmsuggested by Martin Richards. These were not Unix-like in form, and were quiterestricted in what they could do by comparison with Perl. The interesting partabout the algorithm was that the amount of space required to hold the compiledform of an expression was known in advance. The code to apply an expression didnot operate by backtracking, as the original Henry Spencer code and currentPerl code does, but instead checked all possibilities simultaneously by keepinga list of current states and checking all of them as it advanced through thesubject string. In the terminology of Jeffrey Friedl's book, it was a "DFAalgorithm", though it was not a traditional Finite State Machine (FSM). Whenthe pattern was all used up, all remaining states were possible matches, andthe one matching the longest subset of the subject string was chosen. This didnot necessarily maximize the individual wild portions of the pattern, as isexpected in Unix and Perl-style regular expressions.Historical note 2-----------------By contrast, the code originally written by Henry Spencer (which wassubsequently heavily modified for Perl) compiles the expression twice: once ina dummy mode in order to find out how much store will be needed, and then forreal. (The Perl version probably doesn't do this any more; I'm talking aboutthe original library.) The execution function operates by backtracking andmaximizing (or, optionally, minimizing in Perl) the amount of the subject thatmatches individual wild portions of the pattern. This is an "NFA algorithm" inFriedl's terminology.OK, here's the real stuff-------------------------For the set of functions that form the "basic" PCRE library (which areunrelated to those mentioned above), I tried at first to invent an algorithmthat used an amount of store bounded by a multiple of the number of charactersin the pattern, to save on compiling time. However, because of the greatercomplexity in Perl regular expressions, I couldn't do this. In any case, afirst pass through the pattern is helpful for other reasons. Computing the memory requirement: how it was--------------------------------------------Up to and including release 6.7, PCRE worked by running a very degenerate firstpass to calculate a maximum store size, and then a second pass to do the realcompile - which might use a bit less than the predicted amount of memory. Theidea was that this would turn out faster than the Henry Spencer code becausethe first pass is degenerate and the second pass can just store stuff straightinto the vector, which it knows is big enough.Computing the memory requirement: how it is-------------------------------------------By the time I was working on a potential 6.8 release, the degenerate first passhad become very complicated and hard to maintain. Indeed one of the earlythings I did for 6.8 was to fix Yet Another Bug in the memory computation. ThenI had a flash of inspiration as to how I could run the real compile function ina "fake" mode that enables it to compute how much memory it would need, whileactually only ever using a few hundred bytes of working memory, and without toomany tests of the mode that might slow it down. So I re-factored the compilingfunctions to work this way. This got rid of about 600 lines of source. Itshould make future maintenance and development easier. As this was such a major change, I never released 6.8, instead upping the number to 7.0 (other quite major changes are also present in the 7.0 release).A side effect of this work is that the previous limit of 200 on the nestingdepth of parentheses was removed. However, there is a downside: pcre_compile()runs more slowly than before (30% or more, depending on the pattern) because itis doing a full analysis of the pattern. My hope is that this is not a bigissue.Traditional matching function-----------------------------The "traditional", and original, matching function is called pcre_exec(), and it implements an NFA algorithm, similar to the original Henry Spencer algorithm and the way that Perl works. Not surprising, since it is intended to be as compatible with Perl as possible. This is the function most users of PCRE will use most of the time.Supplementary matching function-------------------------------From PCRE 6.0, there is also a supplementary matching function called pcre_dfa_exec(). This implements a DFA matching algorithm that searches simultaneously for all possible matches that start at one point in the subject string. (Going back to my roots: see Historical Note 1 above.) This function intreprets the same compiled pattern data as pcre_exec(); however, not all the facilities are available, and those that are do not always work in quite the same way. See the user documentation for details.The algorithm that is used for pcre_dfa_exec() is not a traditional FSM, because it may have a number of states active at one time. More work would be needed at compile time to produce a traditional FSM where only one state is ever active at once. I believe some other regex matchers work this way.Format of compiled patterns---------------------------The compiled form of a pattern is a vector of bytes, containing items ofvariable length. The first byte in an item is an opcode, and the length of theitem is either implicit in the opcode or contained in the data bytes thatfollow it. In many cases below "two-byte" data values are specified. This is in fact justa default when the number is an offset within the compiled pattern. PCRE can becompiled to use 3-byte or 4-byte values for these offsets (impairing theperformance). This is necessary only when patterns whose compiled length isgreater than 64K are going to be processed. In this description, we assume the"normal" compilation options. "Two-byte" data values that are counts (e.g. for quantifiers) are always just two bytes.A list of all the opcodes follows:Opcodes with no following data------------------------------These items are all just one byte long OP_END end of pattern OP_ANY match any character OP_ANYBYTE match any single byte, even in UTF-8 mode OP_SOD match start of data: \A OP_SOM, start of match (subject + offset): \G OP_CIRC ^ (start of data, or after \n in multiline) OP_NOT_WORD_BOUNDARY \W OP_WORD_BOUNDARY \w OP_NOT_DIGIT \D OP_DIGIT \d OP_NOT_WHITESPACE \S OP_WHITESPACE \s OP_NOT_WORDCHAR \W OP_WORDCHAR \w OP_EODN match end of data or \n at end: \Z OP_EOD match end of data: \z OP_DOLL $ (end of data, or before \n in multiline) OP_EXTUNI match an extended Unicode character OP_ANYNL match any Unicode newline sequence Repeating single characters---------------------------The common repeats (*, +, ?) when applied to a single character use thefollowing opcodes: OP_STAR OP_MINSTAR OP_POSSTAR OP_PLUS OP_MINPLUS OP_POSPLUS OP_QUERY OP_MINQUERY OP_POSQUERY In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.Those with "MIN" in their name are the minimizing versions. Those with "POS" in their names are possessive versions. Each is followed by the character that isto be repeated. Other repeats make use of OP_UPTO OP_MINUPTO OP_POSUPTO OP_EXACTwhich are followed by a two-byte count (most significant first) and therepeated character. OP_UPTO matches from 0 to the given number. A repeat with anon-zero minimum and a fixed maximum is coded as an OP_EXACT followed by anOP_UPTO (or OP_MINUPTO or OPT_POSUPTO).Repeating character types-------------------------Repeats of things like \d are done exactly as for single characters, exceptthat instead of a character, the opcode for the type is stored in the databyte. The opcodes are: OP_TYPESTAR OP_TYPEMINSTAR OP_TYPEPOSSTAR OP_TYPEPLUS OP_TYPEMINPLUS OP_TYPEPOSPLUS OP_TYPEQUERY OP_TYPEMINQUERY OP_TYPEPOSQUERY OP_TYPEUPTO OP_TYPEMINUPTO OP_TYPEPOSUPTO OP_TYPEEXACTMatch by Unicode property
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -