pcre.3

The third argument for \fBpcre_study()\fR is a pointer to an error message. If
studying succeeds (even if no data is returned), the variable it points to is
set to NULL. Otherwise it points to a textual error message.

This is a typical call to \fBpcre_study()\fR:

  pcre_extra *pe;
  pe = pcre_study(
    re,             /* result of pcre_compile() */
    0,              /* no options exist */
    &error);        /* set to NULL or points to a message */

At present, studying a pattern is useful only for non-anchored patterns that do
not have a single fixed starting character. A bitmap of possible starting
characters is created.
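
As an illustration only, a bitmap of possible starting characters can be
sketched in plain C as follows. This is not PCRE's internal code: the names
\fBstart_bits\fR, \fBset_start_char()\fR, \fBmay_start_with()\fR and
\fBfirst_candidate()\fR are hypothetical, and the order of the bits within each
byte is an assumption.

```c
#include <string.h>

/* 256 bits, one per possible byte value (hypothetical layout). */
typedef struct {
    unsigned char bits[32];
} start_bits;

/* Record that a match may begin with the byte c. */
static void set_start_char(start_bits *sb, unsigned char c) {
    sb->bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Return 1 if a match may begin with the byte c, otherwise 0. */
static int may_start_with(const start_bits *sb, unsigned char c) {
    return (sb->bits[c / 8] >> (c % 8)) & 1;
}

/* Return the first offset at which a match could begin, or -1 if the
   subject contains no possible starting character. */
static int first_candidate(const start_bits *sb, const char *subject,
                           int length) {
    int i;
    for (i = 0; i < length; i++) {
        if (may_start_with(sb, (unsigned char)subject[i])) return i;
    }
    return -1;
}
```

A matcher that holds such a table can skip every subject position whose
character is not in the set, which is why studying helps only when there is no
single fixed starting character.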


.SH LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables. The library contains a
default set of tables which is created in the default C locale when PCRE is
compiled. This is used when the final argument of \fBpcre_compile()\fR is NULL,
and is sufficient for many applications.

An alternative set of tables can, however, be supplied. Such tables are built
by calling the \fBpcre_maketables()\fR function, which has no arguments, in the
relevant locale. The result can then be passed to \fBpcre_compile()\fR as often
as necessary. For example, to build and use tables that are appropriate for the
French locale (where accented characters with codes greater than 128 are
treated as letters), the following code could be used:

  setlocale(LC_CTYPE, "fr");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);

The tables are built in memory that is obtained via \fBpcre_malloc()\fR. The
pointer that is passed to \fBpcre_compile()\fR is saved with the compiled
pattern, and the same tables are used via this pointer by \fBpcre_study()\fR
and \fBpcre_exec()\fR. Thus for any single pattern, compilation, studying and
matching all happen in the same locale, but different patterns can be compiled
in different locales. It is the caller's responsibility to ensure that the
memory containing the tables remains available for as long as it is needed.


.SH INFORMATION ABOUT A PATTERN
The \fBpcre_fullinfo()\fR function returns information about a compiled
pattern. It replaces the obsolete \fBpcre_info()\fR function, which is
nevertheless retained for backwards compatibility (and is documented below).

The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled
pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if
the pattern was not studied. The third argument specifies which piece of
information is required, while the fourth argument is a pointer to a variable
to receive the data. The yield of the function is zero for success, or one of
the following negative numbers:

  PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
                        the argument \fIwhere\fR was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of \fIwhat\fR was invalid

Here is a typical call of \fBpcre_fullinfo()\fR, to obtain the length of the
compiled pattern:

  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    pe,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in \fBpcre.h\fR, and are
as follows:

  PCRE_INFO_OPTIONS

Return a copy of the options with which the pattern was compiled. The fourth
argument should point to an \fBunsigned long int\fR variable. These option bits
are those specified in the call to \fBpcre_compile()\fR, modified by any
top-level option settings within the pattern itself, and with the PCRE_ANCHORED
bit forcibly set if the form of the pattern implies that it can match only at
the start of a subject string.

  PCRE_INFO_SIZE

Return the size of the compiled pattern, that is, the value that was passed as
the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to
place the compiled data. The fourth argument should point to a \fBsize_t\fR
variable.

  PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth argument
should point to an \fBint\fR variable.

  PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The fourth
argument should point to an \fBint\fR variable. Zero is returned if there are
no back references.

  PCRE_INFO_FIRSTCHAR

Return information about the first character of any matched string, for a
non-anchored pattern. If there is a fixed first character, e.g. from a pattern
such as (cat|cow|coyote), it is returned in the integer pointed to by
\fIwhere\fR. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
(if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start of a
subject string or after any "\\n" within the string. Otherwise -2 is returned.
For anchored patterns, -2 is returned.
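
The three kinds of result described above can be told apart with a small
helper. The sketch below is a hypothetical convenience, not part of PCRE: the
type \fBfirst_kind\fR and the function \fBclassify_firstchar()\fR are invented
for illustration, and the argument is assumed to be the integer that
\fBpcre_fullinfo()\fR stored for PCRE_INFO_FIRSTCHAR.

```c
/* The three kinds of PCRE_INFO_FIRSTCHAR result (hypothetical names). */
typedef enum {
    FIRST_FIXED,       /* value is the fixed first character */
    FIRST_LINE_START,  /* -1: matches only at the start of the subject
                          or after a newline */
    FIRST_UNKNOWN      /* -2: no first-character information */
} first_kind;

/* Classify the integer returned for PCRE_INFO_FIRSTCHAR. */
static first_kind classify_firstchar(int value) {
    if (value >= 0) return FIRST_FIXED;
    if (value == -1) return FIRST_LINE_START;
    return FIRST_UNKNOWN;
}
```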

  PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a 256-bit
table indicating a fixed set of characters for the first character in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an \fBunsigned char *\fR
variable.

  PCRE_INFO_LASTLITERAL

For a non-anchored pattern, return the value of the rightmost literal character
which must exist in any matched string, other than at its start. The fourth
argument should point to an \fBint\fR variable. If there is no such character,
or if the pattern is anchored, -1 is returned. For example, for the pattern
/a\\d+z\\d+/ the returned value is 'z'.

The \fBpcre_info()\fR function is now obsolete because its interface is too
restrictive to return all the available data about a compiled pattern. New
programs should use \fBpcre_fullinfo()\fR instead. The yield of
\fBpcre_info()\fR is the number of capturing subpatterns, or one of the
following negative numbers:

  PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found

If the \fIoptptr\fR argument is not NULL, a copy of the options with which the
pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).

If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,
it is used to pass back information about the first character of any matched
string (see PCRE_INFO_FIRSTCHAR above).


.SH MATCHING A PATTERN
The function \fBpcre_exec()\fR is called to match a subject string against a
pre-compiled pattern, which is passed in the \fIcode\fR argument. If the
pattern has been studied, the result of the study should be passed in the
\fIextra\fR argument. Otherwise this must be NULL.

Here is an example of a simple call to \fBpcre_exec()\fR:

  int rc;
  int ovector[30];
  rc = pcre_exec(
    re,             /* result of pcre_compile() */
    NULL,           /* we didn't study the pattern */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    ovector,        /* vector for substring information */
    30);            /* number of elements in the vector */

The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose
unused bits must be zero. However, if a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it
cannot be made unanchored at matching time.

There are also three further options that can be set only at matching time:

  PCRE_NOTBOL

The first character of the string is not the beginning of a line, so the
circumflex metacharacter should not match before it. Setting this without
PCRE_MULTILINE (at compile time) causes circumflex never to match.

  PCRE_NOTEOL

The end of the string is not the end of a line, so the dollar metacharacter
should not match it nor (except in multiline mode) a newline immediately before
it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never
to match.

  PCRE_NOTEMPTY

An empty string is not considered to be a valid match if this option is set. If
there are alternatives in the pattern, they are tried. If all the alternatives
match the empty string, the entire match fails. For example, if the pattern

  a?b?

is applied to a string not beginning with "a" or "b", it matches the empty
string at the start of the subject. With PCRE_NOTEMPTY set, this match is not
valid, so PCRE searches further into the string for occurrences of "a" or "b".

Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a special case
of a pattern match of the empty string within its \fBsplit()\fR function, and
when using the /g modifier. It is possible to emulate Perl's behaviour after
matching a null string by first trying the match again at the same offset with
PCRE_NOTEMPTY set, and then if that fails by advancing the starting offset (see
below) and trying an ordinary match again.
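
The retry-then-advance procedure can be sketched as a loop. To keep the sketch
self-contained it does not call \fBpcre_exec()\fR. Instead it takes a
hypothetical callback of type \fBmatch_fn\fR that stands in for it, and a toy
matcher for the pattern a* is supplied. One assumption goes beyond the text
above: the retry is also required to match exactly at the given offset (the
effect of adding PCRE_ANCHORED), so that it cannot skip over intervening
positions.

```c
/* Hypothetical stand-in for a pcre_exec() call. It searches
   subject[0..length) from offset start. If retry is nonzero the match
   must begin exactly at start and must not be empty. On success it
   stores the match start and end in m[0] and m[1] and returns 1;
   otherwise it returns 0. */
typedef int (*match_fn)(const char *subject, int length, int start,
                        int retry, int m[2]);

/* Toy matcher for the pattern a* so that the sketch is self-contained. */
static int match_a_star(const char *subject, int length, int start,
                        int retry, int m[2]) {
    int end = start;
    while (end < length && subject[end] == 'a') end++;
    if (retry && end == start) return 0;  /* empty match rejected */
    m[0] = start;
    m[1] = end;
    return 1;
}

/* Collect up to max matches in the manner of Perl's /g modifier.
   Returns the number of matches stored in starts[] and ends[]. */
static int find_all(match_fn match, const char *subject, int length,
                    int starts[], int ends[], int max) {
    int count = 0, offset = 0, after_empty = 0;
    int m[2];

    while (offset <= length && count < max) {
        if (after_empty) {
            /* The previous match was empty: retry at the same offset. */
            if (!match(subject, length, offset, 1, m)) {
                offset++;          /* retry failed: advance one character */
                after_empty = 0;
                continue;
            }
        } else if (!match(subject, length, offset, 0, m)) {
            break;                 /* ordinary match failed: no more matches */
        }
        starts[count] = m[0];
        ends[count] = m[1];
        count++;
        offset = m[1];
        after_empty = (m[0] == m[1]);
    }
    return count;
}
```

Applied to the four-character subject "baac", this sketch reports an empty
match at offset 0, the substring "aa" at offsets 1 to 3, and empty matches at
offsets 3 and 4.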

The subject string is passed as a pointer in \fIsubject\fR, a length in
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern
string, the subject may contain binary zero characters. When the starting
offset is zero, the search for a match starts at the beginning of the subject,
and this is by far the most common case.

A non-zero starting offset is useful when searching for another match in the
same subject by calling \fBpcre_exec()\fR again after a previous success.
Setting \fIstartoffset\fR differs from just passing over a shortened string and
setting PCRE_NOTBOL in the case of a pattern that begins with any kind of
lookbehind. For example, consider the pattern

  \\Biss\\B

which finds occurrences of "iss" in the middle of words. (\\B matches only if
the current position in the subject is not a word boundary.) When applied to
the string "Mississippi" the first call to \fBpcre_exec()\fR finds the first
occurrence. If \fBpcre_exec()\fR is called again with just the remainder of the
subject, namely "issippi", it does not match, because \\B is always false at the
start of the subject, which is deemed to be a word boundary. However, if
\fBpcre_exec()\fR is passed the entire string again, but with \fIstartoffset\fR
set to 4, it finds the second occurrence of "iss" because it is able to look
behind the starting point to discover that it is preceded by a letter.
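
The difference between the two calls comes down to what lies before the
starting point. The following sketch is not PCRE's implementation. It is a
plain C illustration of a word-boundary test, with the hypothetical names
\fBis_word_char()\fR and \fBat_word_boundary()\fR, and it assumes that word
characters are letters, digits and underscore in the default C locale.

```c
#include <ctype.h>

/* A word character is assumed to be a letter, digit, or underscore. */
static int is_word_char(unsigned char c) {
    return isalnum(c) || c == '_';
}

/* Return 1 if position pos in subject[0..length) is a word boundary.
   Whatever lies before the start or after the end of the subject is
   treated as a non-word character. */
static int at_word_boundary(const char *subject, int length, int pos) {
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

With a shortened subject, position 0 has nothing before it and so is always a
boundary when the first character is a letter. With the full subject and a
starting offset of 4, the character at offset 3 is visible, and the position is
not a boundary.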

If a non-zero starting offset is passed when the pattern is anchored, one
attempt to match at the given offset is tried. This can only succeed if the
pattern does not require the match to be at the start of the subject.

In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by parts of the
pattern. Following the usage in Jeffrey Friedl's book, this is called
"capturing" in what follows, and the phrase "capturing subpattern" is used for
a fragment of a pattern that picks out a substring. PCRE supports several other
kinds of parenthesized subpattern that do not cause substrings to be captured.

Captured substrings are returned to the caller via a vector of integer offsets
whose address is passed in \fIovector\fR. The number of elements in the vector
is passed in \fIovecsize\fR. The first two-thirds of the vector is used to pass
back captured substrings, each substring using a pair of integers. The
remaining third of the vector is used as workspace by \fBpcre_exec()\fR while
matching capturing subpatterns, and is not available for passing back
information. The length passed in \fIovecsize\fR should always be a multiple of
three. If it is not, it is rounded down.
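
The arithmetic implied by the two-thirds rule can be written down directly. The
helper names \fBusable_pairs()\fR and \fBrequired_ovecsize()\fR below are
hypothetical. The capture count passed to the second one is assumed to come
from PCRE_INFO_CAPTURECOUNT.

```c
/* Number of offset pairs that can be passed back for a given ovecsize.
   Only two-thirds of the vector holds pairs, and each pair uses two
   integers, so this is ovecsize / 3 rounded down. */
static int usable_pairs(int ovecsize) {
    return ovecsize / 3;
}

/* Smallest ovecsize that has room for the whole-match pair plus one
   pair for each of capture_count capturing subpatterns. */
static int required_ovecsize(int capture_count) {
    return (capture_count + 1) * 3;
}
```

For the vector of 30 integers used in the example call above, this gives room
for 10 pairs, that is, the whole match and up to 9 capturing subpatterns.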

When a match has been successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of \fIovector\fR, and
continuing up to two-thirds of its length at the most. The first element of a
pair is set to the offset of the first character in a substring, and the second
is set to the offset of the first character after the end of a substring. The
first pair, \fIovector[0]\fR and \fIovector[1]\fR, identify the portion of the
subject string matched by the entire pattern. The next pair is used for the
first capturing subpattern, and so on. The value returned by \fBpcre_exec()\fR
is the number of pairs that have been set. If there are no capturing
subpatterns, the return value from a successful match is 1, indicating that
just the first pair of offsets has been set.
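
The offsets can be used directly to copy a substring out of the subject. The
following is a minimal sketch, assuming that \fIrc\fR is the positive value
returned by a successful call to \fBpcre_exec()\fR. The function
\fBcopy_captured()\fR is a hypothetical helper written for this illustration
and is not one of PCRE's own functions.

```c
#include <string.h>

/* Copy captured substring number n into buf and terminate it with a
   zero byte. Pair 0 is the whole match. Returns the length of the
   substring, or -1 if n is out of range, the pair is unset, or buf is
   too small. */
static int copy_captured(const char *subject, const int *ovector, int rc,
                         int n, char *buf, int bufsize) {
    int start, len;

    if (n < 0 || n >= rc) return -1;
    start = ovector[2 * n];
    len = ovector[2 * n + 1] - start;
    if (start < 0 || len < 0 || len >= bufsize) return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = 0;
    return len;
}
```

Because the second offset of a pair points just past the end of the substring,
the length is simply the difference between the two offsets.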

Some convenience functions are provided for extracting the captured substrings
as separate strings. These are described in the following section.

It is possible for a capturing subpattern number \fIn+1\fR to match some
part of the subject when subpattern \fIn\fR has not been used at all. For
example, if the string "abc" is matched against the pattern (a|(z))(bc)
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset
values corresponding to the unused subpattern are set to -1.
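
A caller therefore has to test for -1 before using a pair. The sketch below
shows one way to do so. The name \fBgroup_is_set()\fR is hypothetical, and the
function assumes that \fIrc\fR is the positive value returned by
\fBpcre_exec()\fR, so that pairs numbered \fIrc\fR or higher are treated as
unset as well.

```c
/* Return 1 if capturing subpattern n took part in the match, else 0.
   Pair 0 is the whole match. */
static int group_is_set(const int *ovector, int rc, int n) {
    return n >= 0 && n < rc && ovector[2 * n] >= 0;
}
```

For the example above, the offsets would be 0,3 for the whole match, 0,1 for
subpattern 1, -1,-1 for subpattern 2 and 1,3 for subpattern 3, so the helper
reports subpatterns 1 and 3 as set and subpattern 2 as unset.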
