a letter.
If a non-zero starting offset is passed when the pattern is
anchored, one attempt to match at the given offset is tried.
This can only succeed if the pattern does not require the
match to be at the start of the subject.
In general, a pattern matches a certain portion of the sub-
ject, and in addition, further substrings from the subject
may be picked out by parts of the pattern. Following the
usage in Jeffrey Friedl's book, this is called "capturing"
in what follows, and the phrase "capturing subpattern" is
used for a fragment of a pattern that picks out a substring.
PCRE supports several other kinds of parenthesized subpat-
tern that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector
of integer offsets whose address is passed in ovector. The
number of elements in the vector is passed in ovecsize. The
first two-thirds of the vector is used to pass back captured
substrings, each substring using a pair of integers. The
remaining third of the vector is used as workspace by
pcre_exec() while matching capturing subpatterns, and is not
available for passing back information. The length passed in
ovecsize should always be a multiple of three. If it is not,
it is rounded down.
When a match has been successful, information about captured
substrings is returned in pairs of integers, starting at the
beginning of ovector, and continuing up to two-thirds of its
length at the most. The first element of a pair is set to
the offset of the first character in a substring, and the
second is set to the offset of the first character after the
end of a substring. The first pair, ovector[0] and
ovector[1], identifies the portion of the subject string matched
by the entire pattern. The next pair is used for the first
capturing subpattern, and so on. The value returned by
pcre_exec() is the number of pairs that have been set. If
there are no capturing subpatterns, the return value from a
successful match is 1, indicating that just the first pair
of offsets has been set.
Some convenience functions are provided for extracting the
captured substrings as separate strings. These are described
in the following section.
It is possible for capturing subpattern number n+1 to match
some part of the subject when subpattern n has not been used
at all. For example, if the string "abc" is matched against
the pattern (a|(z))(bc), subpatterns 1 and 3
are matched, but 2 is not. When this happens, both offset
values corresponding to the unused subpattern are set to -1.
If a capturing subpattern is matched repeatedly, it is the
last portion of the string that it matched that gets
returned.
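The pair layout and the -1 convention can be sketched by walking an
offsets vector by hand. The code below does not call PCRE: the vector
is filled in with the values pcre_exec() would be expected to return
for "abc" matched against (a|(z))(bc), and the helper names
capture_length() and print_captures() are invented for this
illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of PCRE): length of capture n, or -1
   if the capture is unset or lies beyond the pairs that were set.
   rc is the positive value returned by pcre_exec(). */
int capture_length(const int *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= rc)
        return -1;
    int start = ovector[2 * n];      /* offset of first character */
    int end = ovector[2 * n + 1];    /* offset just past the last one */
    if (start < 0 || end < 0)
        return -1;                   /* both are -1 for an unset pair */
    return end - start;
}

/* Print every pair that was set, marking unused subpatterns. */
void print_captures(const char *subject, const int *ovector, int rc)
{
    for (int i = 0; i < rc; i++) {
        int len = capture_length(ovector, rc, i);
        if (len < 0)
            printf("%d: <unset>\n", i);
        else
            printf("%d: \"%.*s\"\n", i, len, subject + ovector[2 * i]);
    }
}
```

With ovector holding {0,3, 0,1, -1,-1, 1,3} and a return value of 4,
pairs 0, 1 and 3 describe "abc", "a" and "bc", and pair 2 is reported
as unset.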
If the vector is too small to hold all the captured sub-
strings, it is used as far as possible (up to two-thirds of
its length), and the function returns a value of zero. In
particular, if the substring offsets are not of interest,
pcre_exec() may be called with ovector passed as NULL and
ovecsize as zero. However, if the pattern contains back
references and the ovector isn't big enough to remember the
related substrings, PCRE has to get additional memory for
use during matching. Thus it is usually advisable to supply
an ovector.
Note that pcre_info() can be used to find out how many cap-
turing subpatterns there are in a compiled pattern. The
smallest size for ovector that will allow for n captured
substrings in addition to the offsets of the substring
matched by the whole pattern is (n+1)*3.
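The sizing rules above reduce to two small pieces of arithmetic. The
following is a minimal sketch; the function names min_ovecsize() and
pair_capacity() are assumptions made for this example and do not
exist in PCRE.

```c
#include <assert.h>

/* Smallest ovecsize that allows for n captured substrings in
   addition to the offsets of the whole match: (n+1)*3. */
int min_ovecsize(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given
   ovecsize. The size is rounded down to a multiple of three and
   two-thirds of that holds pairs of integers, which works out to
   ovecsize / 3 using integer division. */
int pair_capacity(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For a pattern with three capturing subpatterns, min_ovecsize(3) gives
12, and a vector of 12, 13 or 14 integers can return four pairs.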
If pcre_exec() fails, it returns a negative number. The fol-
lowing are defined in the header file:
PCRE_ERROR_NOMATCH (-1)
The subject string did not match the pattern.
PCRE_ERROR_NULL (-2)
Either code or subject was passed as NULL, or ovector was
NULL and ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3)
An unrecognized bit was set in the options argument.
PCRE_ERROR_BADMAGIC (-4)
PCRE stores a 4-byte "magic number" at the start of the com-
piled code, to catch the case when it is passed a junk
pointer. This is the error it gives when the magic number
isn't present.
PCRE_ERROR_UNKNOWN_NODE (-5)
While running the pattern match, an unknown item was encoun-
tered in the compiled pattern. This error could be caused by
a bug in PCRE or by overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6)
If a pattern contains back references, but the ovector that
is passed to pcre_exec() is not big enough to remember the
referenced substrings, PCRE gets a block of memory at the
start of matching to use for this purpose. If the call via
pcre_malloc() fails, this error is given. The memory is
freed at the end of matching.
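A caller will usually want to separate the "matched" results from the
error codes listed above. The sketch below shows one way to do that.
The helper names exec_matched() and describe_exec_result() are
invented for this example, and the constants are redefined locally,
with the values given above, only so that the fragment compiles
without the PCRE header.

```c
/* Values copied from the list above. The guard lets the definitions
   in the real header take precedence when it has been included. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH       (-1)
#define PCRE_ERROR_NULL          (-2)
#define PCRE_ERROR_BADOPTION     (-3)
#define PCRE_ERROR_BADMAGIC      (-4)
#define PCRE_ERROR_UNKNOWN_NODE  (-5)
#define PCRE_ERROR_NOMEMORY      (-6)
#endif

/* Hypothetical helper: non-zero when pcre_exec() reported a match.
   A return of zero still means a match; only the ovector was too
   small to hold every captured substring. */
int exec_matched(int rc)
{
    return rc >= 0;
}

/* Hypothetical helper: a short description of a pcre_exec() result. */
const char *describe_exec_result(int rc)
{
    if (rc > 0)
        return "match";
    if (rc == 0)
        return "match, but ovector too small for all captures";
    switch (rc) {
    case PCRE_ERROR_NOMATCH:      return "no match";
    case PCRE_ERROR_NULL:         return "NULL argument";
    case PCRE_ERROR_BADOPTION:    return "unrecognized option bit";
    case PCRE_ERROR_BADMAGIC:     return "magic number missing";
    case PCRE_ERROR_UNKNOWN_NODE: return "unknown item in compiled pattern";
    case PCRE_ERROR_NOMEMORY:     return "out of memory";
    default:                      return "unrecognized error code";
    }
}
```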
EXTRACTING CAPTURED SUBSTRINGS
Captured substrings can be accessed directly by using the
offsets returned by pcre_exec() in ovector. For convenience,
the functions pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() are provided for extracting
captured substrings as new, separate, zero-terminated
strings. A substring that contains a binary zero is
correctly extracted and has a further zero added on the end,
but the result does not, of course, function as a C string.
The first three arguments are the same for all three func-
tions: subject is the subject string which has just been
successfully matched, ovector is a pointer to the vector of
integer offsets that was passed to pcre_exec(), and
stringcount is the number of substrings that were captured
by the match, including the substring that matched the
entire regular expression. This is the value returned by
pcre_exec() if it is greater than zero. If pcre_exec()
returned zero, indicating that it ran out of space in
ovector, the value passed as stringcount should be the size of
the vector divided by three.
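The rule for choosing stringcount can be written down directly. This
is a sketch only; stringcount_from_exec() is a name invented for this
example, and returning -1 for a failed match is a choice made here
rather than anything PCRE specifies.

```c
#include <assert.h>

/* Hypothetical helper: the value to pass as stringcount, given the
   result of pcre_exec() and the ovecsize that was passed to it.
   Returns -1 if pcre_exec() failed, in which case there is nothing
   to extract. */
int stringcount_from_exec(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* vector was filled as far as possible */
    return -1;                  /* negative: an error code, no match */
}
```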
The functions pcre_copy_substring() and pcre_get_substring()
extract a single substring, whose number is given as
stringnumber. A value of zero extracts the substring that matched
the entire pattern, while higher values extract the captured
substrings. For pcre_copy_substring(), the string is placed
in buffer, whose length is given by buffersize, while for
pcre_get_substring() a new block of memory is obtained via
pcre_malloc, and its address is returned via stringptr. The
yield of the function is the length of the string, not
including the terminating zero, or one of
PCRE_ERROR_NOMEMORY (-6)
The buffer was too small for pcre_copy_substring(), or the
attempt to get memory failed for pcre_get_substring().
PCRE_ERROR_NOSUBSTRING (-7)
There is no substring whose number is stringnumber.
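The behaviour described for pcre_copy_substring() can be pictured
with a stand-in written from the description above. It is not the
PCRE source: sketch_copy_substring() and the SKETCH_ constants are
names invented for this illustration, and only the two error values
are taken from the text.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_ERROR_NOMEMORY     (-6)  /* value of PCRE_ERROR_NOMEMORY */
#define SKETCH_ERROR_NOSUBSTRING  (-7)  /* value of PCRE_ERROR_NOSUBSTRING */

/* Hypothetical stand-in: copy substring number stringnumber into
   buffer as a zero-terminated string. Yields the length of the
   string, not including the terminating zero, or a negative error. */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    int start = ovector[2 * stringnumber];
    int end = ovector[2 * stringnumber + 1];
    int length = (start < 0) ? 0 : end - start;  /* unset gives "" */

    if (buffersize < length + 1)                 /* room for the zero */
        return SKETCH_ERROR_NOMEMORY;
    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

Because memcpy() is used rather than a string function, a substring
that contains a binary zero is still copied in full, and the yield of
the function gives its true length.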
The pcre_get_substring_list() function extracts all avail-
able substrings and builds a list of pointers to them. All
this is done in a single block of memory which is obtained
via pcre_malloc. The address of the memory block is returned
via listptr, which is also the start of the list of string
pointers. The end of the list is marked by a NULL pointer.
The yield of the function is zero if all went well, or
PCRE_ERROR_NOMEMORY (-6)
if the attempt to get the memory block failed.
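One way to hold both the list of pointers and the strings in a single
block is to place the strings immediately after the pointer array.
The sketch below assumes that layout and uses malloc() in place of
pcre_malloc. It is an illustration written from the description
above, not the PCRE source, and sketch_get_substring_list() is an
invented name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: build a NULL-terminated list of substring
   copies in one block. Yields 0, or -6 if no memory was available. */
int sketch_get_substring_list(const char *subject, const int *ovector,
                              int stringcount, const char ***listptr)
{
    assert(subject != NULL && ovector != NULL && listptr != NULL);
    assert(stringcount >= 0);

    /* First pass: space for the pointers, the NULL marker, and every
       string together with its terminating zero. */
    size_t size = ((size_t)stringcount + 1) * sizeof(char *);
    for (int i = 0; i < stringcount; i++) {
        int start = ovector[2 * i], end = ovector[2 * i + 1];
        size += (size_t)(start < 0 ? 0 : end - start) + 1;
    }

    char **list = malloc(size);
    if (list == NULL)
        return -6;

    /* Second pass: pack the strings right after the pointer array. */
    char *p = (char *)(list + stringcount + 1);
    for (int i = 0; i < stringcount; i++) {
        int start = ovector[2 * i], end = ovector[2 * i + 1];
        size_t len = (size_t)(start < 0 ? 0 : end - start);
        if (len > 0)
            memcpy(p, subject + start, len);
        list[i] = p;
        p += len;
        *p++ = '\0';
    }
    list[stringcount] = NULL;          /* marks the end of the list */

    *listptr = (const char **)list;
    return 0;
}
```

Since everything lives in one block, a single call to free() on the
returned address releases the pointers and all of the strings.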
When any of these functions encounters a substring that is
unset, which can happen when capturing subpattern number n+1
matches some part of the subject, but subpattern n has not
been used at all, they return an empty string. This can be
distinguished from a genuine zero-length substring by
inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.
The two convenience functions pcre_free_substring() and
pcre_free_substring_list() can be used to free the memory
returned by a previous call of pcre_get_substring() or
pcre_get_substring_list(), respectively. They do nothing
more than call the function pointed to by pcre_free, which
of course could be called directly from a C program. How-
ever, PCRE is used in some situations where it is linked via
a special interface to another programming language which
cannot use pcre_free directly; it is for these cases that
the functions are provided.
LIMITATIONS
There are some size limitations in PCRE but it is hoped that
they will never in practice be relevant. The maximum length
of a compiled pattern is 65539 (sic) bytes. All values in
repeating quantifiers must be less than 65536. The maximum
number of capturing subpatterns is 65535. There is no
limit to the number of non-capturing subpatterns, but the
maximum depth of nesting of all kinds of parenthesized sub-
pattern, including capturing subpatterns, assertions, and
other types of subpattern, is 200.
The maximum length of a subject string is the largest posi-
tive number that an integer variable can hold. However, PCRE
uses recursion to handle subpatterns and indefinite repeti-
tion. This means that the available stack space may limit
the size of a subject string that can be processed by cer-
tain patterns.
DIFFERENCES FROM PERL
The differences described here are with respect to Perl
5.005.
1. By default, a whitespace character is any character that
the C library function isspace() recognizes, though it is
possible to compile PCRE with alternative character type
tables. Normally isspace() matches space, formfeed, newline,
carriage return, horizontal tab, and vertical tab. Perl 5 no
longer includes vertical tab in its set of whitespace char-
acters. The \v escape that was in the Perl documentation for
a long time was never in fact recognized. However, the char-
acter itself was treated as whitespace at least up to 5.002.
In 5.004 and 5.005 it does not match \s.
2. PCRE does not allow repeat quantifiers on lookahead
assertions. Perl permits them, but they do not mean what you
might think. For example, (?!a){3} does not assert that the
next three characters are not "a". It just asserts that the
next character is not "a" three times.
3. Capturing subpatterns that occur inside negative looka-
head assertions are counted, but their entries in the
offsets vector are never set. Perl sets its numerical vari-
ables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but
only if the negative lookahead assertion contains just one
branch.
4. Though binary zero characters are supported in the sub-
ject string, they are not allowed in a pattern string
because it is passed as a normal C string, terminated by
zero. The escape sequence "\0" can be used in the pattern to
represent a binary zero.
5. The following Perl escape sequences are not supported:
\l, \u, \L, \U, \E, \Q. In fact these are implemented by
Perl's general string-handling and are not part of its pat-
tern matching engine.
6. The Perl \G assertion is not supported as it is not
relevant to single pattern matches.
7. Fairly obviously, PCRE does not support the (?{code}) and
(?p{code}) constructions. However, there is some experimen-
tal support for recursive patterns using the non-Perl item
(?R).
8. There are at the time of writing some oddities in Perl
5.005_02 concerned with the settings of captured strings
when part of a pattern is repeated. For example, matching
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
unset. However, if the pattern is changed to
/^(aa(b(b))?)+$/ then $2 (and $3) are set.
In Perl 5.004 $2 is set in both cases, and that is also true
of PCRE. If in the future Perl changes to a consistent state
that is different, PCRE may change to follow.
9. Another as yet unresolved discrepancy is that in Perl
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
"a", whereas in PCRE it does not. However, in both Perl and
PCRE /^(a)?a/ matched against "a" leaves $1 unset.
10. PCRE provides some extensions to the Perl regular
expression facilities:
(a) Although lookbehind assertions must match fixed length
strings, each alternative branch of a lookbehind assertion
can match a different length of string. Perl 5.005 requires
them all to have the same length.
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
set, the $ meta-character matches only at the very end of
the string.
(c) If PCRE_EXTRA is set, a backslash followed by a letter