If a capturing subpattern is matched repeatedly, it is the last portion of the
string that it matched that gets returned.

If the vector is too small to hold all the captured substrings, it is used as
far as possible (up to two-thirds of its length), and the function returns a
value of zero. In particular, if the substring offsets are not of interest,
\fBpcre_exec()\fR may be called with \fIovector\fR passed as NULL and
\fIovecsize\fR as zero. However, if the pattern contains back references and
the \fIovector\fR isn't big enough to remember the related substrings, PCRE has
to get additional memory for use during matching. Thus it is usually advisable
to supply an \fIovector\fR.

Note that \fBpcre_info()\fR can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
\fIovector\fR that will allow for \fIn\fR captured substrings in addition to
the offsets of the substring matched by the whole pattern is (\fIn\fR+1)*3.
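
The sizing rule above can be written as a one-line helper. This is only a
sketch: the name ovector_size_for is hypothetical and is not part of the PCRE
API.

```c
/* Hypothetical helper (not part of PCRE): smallest ovector length, in ints,
   that allows for n captured substrings in addition to the offsets of the
   substring matched by the whole pattern. */
int ovector_size_for(int n)
{
    return (n + 1) * 3;
}
```

A pattern with two capturing subpatterns would therefore need a vector of at
least nine integers.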

If \fBpcre_exec()\fR fails, it returns a negative number. The following are
defined in the header file:

  PCRE_ERROR_NOMATCH        (-1)

The subject string did not match the pattern.

  PCRE_ERROR_NULL           (-2)

Either \fIcode\fR or \fIsubject\fR was passed as NULL, or \fIovector\fR was
NULL and \fIovecsize\fR was not zero.

  PCRE_ERROR_BADOPTION      (-3)

An unrecognized bit was set in the \fIoptions\fR argument.

  PCRE_ERROR_BADMAGIC       (-4)

PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch
the case when it is passed a junk pointer. This is the error it gives when the
magic number isn't present.

  PCRE_ERROR_UNKNOWN_NODE   (-5)

While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by overwriting
of the compiled pattern.

  PCRE_ERROR_NOMEMORY       (-6)

If a pattern contains back references, but the \fIovector\fR that is passed to
\fBpcre_exec()\fR is not big enough to remember the referenced substrings, PCRE
gets a block of memory at the start of matching to use for this purpose. If the
call via \fBpcre_malloc()\fR fails, this error is given. The memory is freed at
the end of matching.


.SH EXTRACTING CAPTURED SUBSTRINGS
Captured substrings can be accessed directly by using the offsets returned by
\fBpcre_exec()\fR in \fIovector\fR. For convenience, the functions
\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
\fBpcre_get_substring_list()\fR are provided for extracting captured substrings
as new, separate, zero-terminated strings. A substring that contains a binary
zero is correctly extracted and has a further zero added on the end, but the
result does not, of course, function as a C string.

The first three arguments are the same for all three functions: \fIsubject\fR
is the subject string which has just been successfully matched, \fIovector\fR
is a pointer to the vector of integer offsets that was passed to
\fBpcre_exec()\fR, and \fIstringcount\fR is the number of substrings that
were captured by the match, including the substring that matched the entire
regular expression. This is the value returned by \fBpcre_exec()\fR if it
is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it
ran out of space in \fIovector\fR, the value passed as \fIstringcount\fR should
be the size of the vector divided by three.

The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR
extract a single substring, whose number is given as \fIstringnumber\fR. A
value of zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,
the string is placed in \fIbuffer\fR, whose length is given by
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is
obtained via \fBpcre_malloc\fR, and its address is returned via
\fIstringptr\fR. The yield of the function is the length of the string, not
including the terminating zero, or one of the following error codes:

  PCRE_ERROR_NOMEMORY       (-6)

The buffer was too small for \fBpcre_copy_substring()\fR, or the attempt to get
memory failed for \fBpcre_get_substring()\fR.

  PCRE_ERROR_NOSUBSTRING    (-7)

There is no substring whose number is \fIstringnumber\fR.
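
The behaviour described for a single-substring copy can be sketched without
the library. The function below is a hypothetical stand-in, not the PCRE
implementation. It assumes that the start and end offsets of substring number
n are held in ovector[2*n] and ovector[2*n+1], and it returns the two error
values listed above as plain integers.

```c
#include <string.h>

/* Hypothetical stand-in for a single-substring copy (not the PCRE source).
   Assumes ovector[2*n] and ovector[2*n+1] are the start and end offsets of
   substring n. Returns the string length, -6 if the buffer is too small,
   or -7 if there is no substring with that number. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int stringnumber, char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return -7;

    start = ovector[2 * stringnumber];
    length = ovector[2 * stringnumber + 1] - start;
    if (start < 0) {            /* unset substring: treat as empty */
        start = 0;
        length = 0;
    }
    if (buffersize < length + 1)
        return -6;

    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = 0;         /* terminating zero */
    return length;
}
```

Because the length is returned separately, a caller can still handle a
substring that contains a binary zero, even though the buffer then no longer
behaves as a C string.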

The \fBpcre_get_substring_list()\fR function extracts all available substrings
and builds a list of pointers to them. All this is done in a single block of
memory which is obtained via \fBpcre_malloc\fR. The address of the memory block
is returned via \fIlistptr\fR, which is also the start of the list of string
pointers. The end of the list is marked by a NULL pointer. The yield of the
function is zero if all went well, or

  PCRE_ERROR_NOMEMORY       (-6)

if the attempt to get the memory block failed.
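
The single-block layout described above (the pointer list first, then the
strings themselves) can be sketched as follows. This is an illustration under
stated assumptions rather than the PCRE source: build_substring_list is a
hypothetical name, malloc() stands in for \fBpcre_malloc\fR, and the offsets
of substring n are assumed to be in ovector[2*n] and ovector[2*n+1].

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a substring list built in one block of memory.
   Returns 0 on success or -6 if the block cannot be obtained. */
int build_substring_list(const char *subject, const int *ovector,
                         int stringcount, char ***listptr)
{
    size_t size = sizeof(char *) * ((size_t)stringcount + 1);
    char **list;
    char *p;
    int i;

    /* First pass: add room for each string and its terminating zero. */
    for (i = 0; i < stringcount; i++) {
        int start = ovector[2 * i];
        int end = ovector[2 * i + 1];
        size += (size_t)(start < 0 ? 0 : end - start) + 1;
    }

    list = malloc(size);
    if (list == NULL)
        return -6;

    /* The strings are stored directly after the NULL-terminated list. */
    p = (char *)(list + stringcount + 1);
    for (i = 0; i < stringcount; i++) {
        int start = ovector[2 * i];
        int length = start < 0 ? 0 : ovector[2 * i + 1] - start;

        list[i] = p;
        if (length > 0)
            memcpy(p, subject + start, (size_t)length);
        p += length;
        *p++ = 0;
    }
    list[stringcount] = NULL;

    *listptr = list;
    return 0;
}
```

Since everything lives in one block, a single call to free() on the returned
list releases both the pointers and the strings in this sketch.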

When any of these functions encounters a substring that is unset, which can
happen when capturing subpattern number \fIn+1\fR matches some part of the
subject, but subpattern \fIn\fR has not been used at all, they return an empty
string. This can be distinguished from a genuine zero-length substring by
inspecting the appropriate offset in \fIovector\fR, which is negative for unset
substrings.
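
The check described above can be sketched as a small helper. The name
capture_is_unset is hypothetical, and the code assumes that the start offset
of substring n is stored in ovector[2*n].

```c
/* Hypothetical helper (not part of PCRE): non-zero if substring n is unset,
   zero if it was set, even if it is a genuine zero-length substring.
   Assumes the start offset of substring n is in ovector[2*n]. */
int capture_is_unset(const int *ovector, int n)
{
    return ovector[2 * n] < 0;
}
```

For example, when subpattern 2 has matched but subpattern 1 was not used, the
helper returns non-zero for 1 and zero for 2.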

The two convenience functions \fBpcre_free_substring()\fR and
\fBpcre_free_substring_list()\fR can be used to free the memory returned by
a previous call of \fBpcre_get_substring()\fR or
\fBpcre_get_substring_list()\fR, respectively. They do nothing more than call
the function pointed to by \fBpcre_free\fR, which of course could be called
directly from a C program. However, PCRE is used in some situations where it is
linked via a special interface to another programming language which cannot use
\fBpcre_free\fR directly; it is for these cases that the functions are
provided.


.SH LIMITATIONS
There are some size limitations in PCRE but it is hoped that they will never in
practice be relevant.
The maximum length of a compiled pattern is 65539 (sic) bytes.
All values in repeating quantifiers must be less than 65536.
The maximum number of capturing subpatterns is 65535.
There is no limit to the number of non-capturing subpatterns, but the maximum
depth of nesting of all kinds of parenthesized subpattern, including capturing
subpatterns, assertions, and other types of subpattern, is 200.

The maximum length of a subject string is the largest positive number that an
integer variable can hold. However, PCRE uses recursion to handle subpatterns
and indefinite repetition. This means that the available stack space may limit
the size of a subject string that can be processed by certain patterns.


.SH DIFFERENCES FROM PERL
The differences described here are with respect to Perl 5.005.

1. By default, a whitespace character is any character that the C library
function \fBisspace()\fR recognizes, though it is possible to compile PCRE with
alternative character type tables. Normally \fBisspace()\fR matches space,
formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5
no longer includes vertical tab in its set of whitespace characters. The \\v
escape that was in the Perl documentation for a long time was never in fact
recognized. However, the character itself was treated as whitespace at least
up to 5.002. In 5.004 and 5.005 it does not match \\s.

2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits
them, but they do not mean what you might think. For example, (?!a){3} does
not assert that the next three characters are not "a". It just asserts that the
next character is not "a" three times.

3. Capturing subpatterns that occur inside negative lookahead assertions are
counted, but their entries in the offsets vector are never set. Perl sets its
numerical variables from any such patterns that are matched before the
assertion fails to match something (thereby succeeding), but only if the
negative lookahead assertion contains just one branch.

4. Though binary zero characters are supported in the subject string, they are
not allowed in a pattern string because it is passed as a normal C string,
terminated by zero. The escape sequence "\\0" can be used in the pattern to
represent a binary zero.

5. The following Perl escape sequences are not supported: \\l, \\u, \\L, \\U,
\\E, \\Q. In fact these are implemented by Perl's general string-handling and
are not part of its pattern matching engine.

6. The Perl \\G assertion is not supported as it is not relevant to single
pattern matches.

7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
constructions. However, there is some experimental support for recursive
patterns using the non-Perl item (?R).

8. There are at the time of writing some oddities in Perl 5.005_02 concerned
with the settings of captured strings when part of a pattern is repeated. For
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set.

In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the
future Perl changes to a consistent state that is different, PCRE may change to
follow.

9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern
/^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not.
However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.

10. PCRE provides some extensions to the Perl regular expression facilities:

(a) Although lookbehind assertions must match fixed length strings, each
alternative branch of a lookbehind assertion can match a different length of
string. Perl 5.005 requires them all to have the same length.

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.

(c) If PCRE_EXTRA is set, a backslash followed by a letter with no special
meaning is faulted.

(d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.

(e) PCRE_ANCHORED can be used to force a pattern to be tried only at the start
of the subject.

(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
\fBpcre_exec()\fR have no Perl equivalents.

(g) The (?R) construct allows for recursive pattern matching. (Perl 5.6 can do
this using the (?p{code}) construct, which PCRE cannot of course support.)


.SH REGULAR EXPRESSION DETAILS
The syntax and semantics of the regular expressions supported by PCRE are
described below. Regular expressions are also described in the Perl
documentation and in a number of other books, some of which have copious
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257), covers them in great detail.

The description here is intended as reference documentation. The basic
operation of PCRE is on strings of bytes. However, there are the beginnings of
some support for UTF-8 character strings. To use this support you must
configure PCRE to include it, and then call \fBpcre_compile()\fR with the
PCRE_UTF8 option. How this affects the pattern matching is described in the
final section of this document.

A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
corresponding characters in the subject. As a trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. The power of
regular expressions comes from the ability to include alternatives and
repetitions in the pattern. These are encoded in the pattern by the use of
\fImeta-characters\fR, which do not stand for themselves but instead are
interpreted in some special way.

There are two different sets of meta-characters: those that are recognized
anywhere in the pattern except within square brackets, and those that are
recognized in square brackets. Outside square brackets, the meta-characters are
as follows:

  \\      general escape character with several uses
  ^      assert start of subject (or line, in multiline mode)
  $      assert end of subject (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character class". In
a character class the only meta-characters are:

  \\      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  ]      terminates the character class

The following sections describe the use of each of the meta-characters.


.SH BACKSLASH
The backslash character has several uses. Firstly, if it is followed by a
