📄 pcre.txt
While running the pattern match, an unknown item was encountered in the compiled pattern. This error could be caused by a bug in PCRE or by overwriting of the compiled pattern.

  PCRE_ERROR_NOMEMORY       (-6)

If a pattern contains back references, but the ovector that is passed to pcre_exec() is not big enough to remember the referenced substrings, PCRE gets a block of memory at the start of matching to use for this purpose. If the call via pcre_malloc() fails, this error is given. The memory is freed at the end of matching.

EXTRACTING CAPTURED SUBSTRINGS

Captured substrings can be accessed directly by using the offsets returned by pcre_exec() in ovector. For convenience, the functions pcre_copy_substring(), pcre_get_substring(), and pcre_get_substring_list() are provided for extracting captured substrings as new, separate, zero-terminated strings. A substring that contains a binary zero is correctly extracted and has a further zero added on the end, but the result does not, of course, function as a C string.

The first three arguments are the same for all three functions: subject is the subject string which has just been successfully matched, ovector is a pointer to the vector of integer offsets that was passed to pcre_exec(), and stringcount is the number of substrings that were captured by the match, including the substring that matched the entire regular expression. This is the value returned by pcre_exec() if it is greater than zero. If pcre_exec() returned zero, indicating that it ran out of space in ovector, the value passed as stringcount should be the size of the vector divided by three.

The functions pcre_copy_substring() and pcre_get_substring() extract a single substring, whose number is given as stringnumber. A value of zero extracts the substring that matched the entire pattern, while higher values extract the captured substrings.
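
The direct use of ovector offsets mentioned above can be sketched without calling PCRE at all. The helper below, copy_capture(), is hypothetical and is not one of the PCRE functions; its return codes are made up for this illustration. It assumes the usual ovector layout in which substring n occupies the pair ovector[2*n] (start offset) and ovector[2*n+1] (offset of the first character after the substring), and it treats a negative start offset as "no substring".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the PCRE API.
 * Copies substring number n out of subject, using the offset pairs that a
 * successful match is assumed to have left in ovector.
 * Returns the length copied (excluding the terminating zero),
 * -1 if n is out of range or the substring has no valid offsets,
 * -2 if buffer is too small. These codes are illustrative only. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int n, char *buffer, size_t buffersize)
{
    int start, end;
    size_t len;

    if (n < 0 || n >= stringcount)
        return -1;

    start = ovector[2 * n];       /* offset of first character */
    end = ovector[2 * n + 1];     /* offset just past the last character */
    if (start < 0 || end < start)
        return -1;

    len = (size_t)(end - start);
    if (len + 1 > buffersize)     /* need room for the terminating zero */
        return -2;

    memcpy(buffer, subject + start, len);
    buffer[len] = '\0';
    return (int)len;
}
```

For instance, with the subject "key=value" and a hand-filled ovector of {0, 9, 0, 3, 4, 9} (the values a pattern with two capturing subpatterns might plausibly produce), substring 1 comes back as "key" and substring 2 as "value".
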
For pcre_copy_substring(), the string is placed in buffer, whose length is given by buffersize, while for pcre_get_substring() a new block of store is obtained via pcre_malloc, and its address is returned via stringptr. The yield of the function is the length of the string, not including the terminating zero, or one of the following error codes:

  PCRE_ERROR_NOMEMORY       (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to get memory failed for pcre_get_substring().

  PCRE_ERROR_NOSUBSTRING    (-7)

There is no substring whose number is stringnumber.

The pcre_get_substring_list() function extracts all available substrings and builds a list of pointers to them. All this is done in a single block of memory which is obtained via pcre_malloc. The address of the memory block is returned via listptr, which is also the start of the list of string pointers. The end of the list is marked by a NULL pointer. The yield of the function is zero if all went well, or PCRE_ERROR_NOMEMORY (-6) if the attempt to get the memory block failed.

When any of these functions encounter a substring that is unset, which can happen when capturing subpattern number n+1 matches some part of the subject, but subpattern n has not been used at all, they return an empty string. This can be distinguished from a genuine zero-length substring by inspecting the appropriate offset in ovector, which is negative for unset substrings.

LIMITATIONS

There are some size limitations in PCRE but it is hoped that they will never in practice be relevant. The maximum length of a compiled pattern is 65539 (sic) bytes. All values in repeating quantifiers must be less than 65536. The maximum number of capturing subpatterns is 99. The maximum number of all parenthesized subpatterns, including capturing subpatterns, assertions, and other types of subpattern, is 200.

The maximum length of a subject string is the largest positive number that an integer variable can hold.
However, PCRE uses recursion to handle subpatterns and indefinite repetition. This means that the available stack space may limit the size of a subject string that can be processed by certain patterns.

DIFFERENCES FROM PERL

The differences described here are with respect to Perl 5.005.

1. By default, a whitespace character is any character that the C library function isspace() recognizes, though it is possible to compile PCRE with alternative character type tables. Normally isspace() matches space, formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 no longer includes vertical tab in its set of whitespace characters. The \v escape that was in the Perl documentation for a long time was never in fact recognized. However, the character itself was treated as whitespace at least up to 5.002. In 5.004 and 5.005 it does not match \s.

2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits them, but they do not mean what you might think. For example, (?!a){3} does not assert that the next three characters are not "a". It just asserts that the next character is not "a" three times.

3. Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl sets its numerical variables from any such patterns that are matched before the assertion fails to match something (thereby succeeding), but only if the negative lookahead assertion contains just one branch.

4. Though binary zero characters are supported in the subject string, they are not allowed in a pattern string because it is passed as a normal C string, terminated by zero. The escape sequence "\0" can be used in the pattern to represent a binary zero.

5. The following Perl escape sequences are not supported: \l, \u, \L, \U, \E, \Q. In fact these are implemented by Perl's general string-handling and are not part of its pattern matching engine.

6. The Perl \G assertion is not supported as it is not relevant to single pattern matches.

7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) constructions. However, there is some experimental support for recursive patterns using the non-Perl item (?R).

8. There are at the time of writing some oddities in Perl 5.005_02 concerned with the settings of captured strings when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set.

In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the future Perl changes to a consistent state that is different, PCRE may change to follow.

9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.

10. PCRE provides some extensions to the Perl regular expression facilities:

(a) Although lookbehind assertions must match fixed length strings, each alternative branch of a lookbehind assertion can match a different length of string. Perl 5.005 requires them all to have the same length.

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta-character matches only at the very end of the string.

(c) If PCRE_EXTRA is set, a backslash followed by a letter with no special meaning is faulted.

(d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are.

(e) PCRE_ANCHORED can be used to force a pattern to be tried only at the start of the subject.

(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for pcre_exec() have no Perl equivalents.

(g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do this using the (?p{code}) construct, which PCRE cannot of course support.)

REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly (ISBN 1-56592-257), covers them in great detail. The description here is intended as reference documentation.

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows:

  \      general escape character with several uses
  ^      assert start of subject (or line, in multiline mode)
  $      assert end of subject (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character class".
In a character class the only meta-characters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  ]      terminates the character class

The following sections describe the use of each of the meta-characters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by a non-alphameric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.

For example, if you want to match a "*" character, you write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphameric with "\" to specify that it stands for itself. In particular, if you want to match a backslash, you write "\\".

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the pattern (other than in a character class) and characters between a "#" outside a character class and the next newline character are ignored. An escaping backslash can be used to include a whitespace or "#" character as part of the pattern.

A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents:

  \a     alarm, that is, the BEL character (hex 07)
  \cx    "control-x", where x is any character
  \e     escape (hex 1B)
  \f     formfeed (hex 0C)
  \n     newline (hex 0A)
  \r     carriage return (hex 0D)
  \t     tab (hex 09)
  \xhh   character with hex code hh
  \ddd   character with octal code ddd, or backreference

The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case.
Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
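
The "\cx" rule can be written out as a small C function. This is only an illustration of the arithmetic described above, not code taken from PCRE; the name control_escape() is made up here, and the sketch assumes an ASCII character set and the default "C" locale.

```c
#include <ctype.h>

/* Illustrative helper, not a PCRE function: computes the character
 * that "\cx" stands for, following the rule described in the text.
 * Assumes ASCII and the "C" locale. */
int control_escape(int x)
{
    /* Step 1: a lower case letter is converted to upper case. */
    if (islower((unsigned char)x))
        x = toupper((unsigned char)x);

    /* Step 2: bit 6 (hex 40) is inverted. */
    return x ^ 0x40;
}
```

With this function, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, matching the three cases listed above.
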