📄 pcre.3
字号:
[12[:^digit:]]
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
.SH VERTICAL BAR
Vertical bar characters are used to separate alternative patterns. For example,
the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
and an empty alternative is permitted (matching the empty string).
The matching process tries each alternative in turn, from left to right,
and the first one that succeeds is used. If the alternatives are within a
subpattern (defined below), "succeeds" means matching the rest of the main
pattern as well as the alternative in the subpattern.
.SH INTERNAL OPTION SETTING
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED
can be changed from within the pattern by a sequence of Perl option letters
enclosed between "(?" and ")". The option letters are
i for PCRE_CASELESS
m for PCRE_MULTILINE
s for PCRE_DOTALL
x for PCRE_EXTENDED
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the letter with a hyphen, and a combined
setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
unset.
The scope of these option changes depends on where in the pattern the setting
occurs. For settings that are outside any subpattern (defined below), the
effect is the same as if the options were set or unset at the start of
matching. The following patterns all behave in exactly the same way:
(?i)abc
a(?i)bc
ab(?i)c
abc(?i)
which in turn is the same as compiling the pattern abc with PCRE_CASELESS set.
In other words, such "top level" settings apply to the whole pattern (unless
there are other changes inside subpatterns). If there is more than one setting
of the same option at top level, the rightmost setting is used.
If an option change occurs inside a subpattern, the effect is different. This
is a change of behaviour in Perl 5.005. An option change inside a subpattern
affects only that part of the subpattern that follows it, so
(a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
By this means, options can be made to have different settings in different
parts of the pattern. Any changes made in one alternative do carry on
into subsequent branches within the same subpattern. For example,
(a(?i)b|c)
matches "ab", "aB", "c", and "C", even though when matching "C" the first
branch is abandoned before the option setting. This is because the effects of
option settings happen at compile time. There would be some very weird
behaviour otherwise.
The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the
same way as the Perl-compatible options by using the characters U and X
respectively. The (?X) flag setting is special in that it must always occur
earlier in the pattern than any of the additional features it turns on, even
when it is at top level. It is best put at the start.
.SH SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets), which can be nested.
Marking part of a pattern as a subpattern does two things:
1. It localizes a set of alternatives. For example, the pattern
cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpillar". Without the
parentheses, it would match "cataract", "erpillar" or the empty string.
2. It sets up the subpattern as a capturing subpattern (as defined above).
When the whole pattern matches, that portion of the subject string that matched
the subpattern is passed back to the caller via the \fIovector\fR argument of
\fBpcre_exec()\fR. Opening parentheses are counted from left to right (starting
from 1) to obtain the numbers of the capturing subpatterns.
For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3, respectively.
The fact that plain parentheses fulfil two functions is not always helpful.
There are often times when a grouping subpattern is required without a
capturing requirement. If an opening parenthesis is followed by "?:", the
subpattern does not do any capturing, and is not counted when computing the
number of any subsequent capturing subpatterns. For example, if the string "the
white queen" is matched against the pattern
the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and are numbered 1 and
2. The maximum number of captured substrings is 99, and the maximum number of
all subpatterns, both capturing and non-capturing, is 200.
As a convenient shorthand, if any option settings are required at the start of
a non-capturing subpattern, the option letters may appear between the "?" and
the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are tried
from left to right, and options are not reset until the end of the subpattern
is reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
.SH REPETITION
Repetition is specified by quantifiers, which can follow any of the following
items:
a single character, possibly escaped
the . metacharacter
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion - see below)
The general repetition quantifier specifies a minimum and maximum number of
permitted matches, by giving the two numbers in curly brackets (braces),
separated by a comma. The numbers must be less than 65536, and the first must
be less than or equal to the second. For example:
z{2,4}
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
character. If the second number is omitted, but the comma is present, there is
no upper limit; if the second number and the comma are both omitted, the
quantifier specifies an exact number of required matches. Thus
[aeiou]{3,}
matches at least 3 successive vowels, but may match many more, while
\\d{8}
matches exactly 8 digits. An opening curly bracket that appears in a position
where a quantifier is not allowed, or one that does not match the syntax of a
quantifier, is taken as a literal character. For example, {,6} is not a
quantifier, but a literal string of four characters.
The quantifier {0} is permitted, causing the expression to behave as if the
previous item and the quantifier were not present.
For convenience (and historical compatibility) the three most common
quantifiers have single-character abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
It is possible to construct infinite loops by following a subpattern that can
match no characters with a quantifier that has no upper limit, for example:
(a?)*
Earlier versions of Perl and PCRE used to give an error at compile time for
such patterns. However, because there are cases where this can be useful, such
patterns are now accepted, but if any repetition of the subpattern does in fact
match no characters, the loop is forcibly broken.
By default, the quantifiers are "greedy", that is, they match as much as
possible (up to the maximum number of permitted times), without causing the
rest of the pattern to fail. The classic example of where this gives problems
is in trying to match comments in C programs. These appear between the
sequences /* and */ and within the sequence, individual * and / characters may
appear. An attempt to match C comments by applying the pattern
/\\*.*\\*/
to the string
/* first command */ not comment /* second comment */
fails, because it matches the entire string owing to the greediness of the .*
item.
However, if a quantifier is followed by a question mark, it ceases to be
greedy, and instead matches the minimum number of times possible, so the
pattern
/\\*.*?\\*/
does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of matches.
Do not confuse this use of question mark with its use as a quantifier in its
own right. Because it has two uses, it can sometimes appear doubled, as in
\\d??\\d
which matches one digit by preference, but can match two if that is the only
way the rest of the pattern matches.
If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
the quantifiers are not greedy by default, but individual ones can be made
greedy by following them with a question mark. In other words, it inverts the
default behaviour.
When a parenthesized subpattern is quantified with a minimum repeat count that
is greater than 1 or with a limited maximum, more store is required for the
compiled pattern, in proportion to the size of the minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
implicitly anchored, because whatever follows will be tried against every
character position in the subject string, so there is no point in retrying the
overall match at any position after the first. PCRE treats such a pattern as
though it were preceded by \\A. In cases where it is known that the subject
string contains no newlines, it is worth setting PCRE_DOTALL when the pattern
begins with .* in order to obtain this optimization, or alternatively using ^
to indicate anchoring explicitly.
When a capturing subpattern is repeated, the value captured is the substring
that matched the final iteration. For example, after
(tweedle[dume]{3}\\s*)+
has matched "tweedledum tweedledee" the value of the captured substring is
"tweedledee". However, if there are nested capturing subpatterns, the
corresponding captured values may have been set in previous iterations. For
example, after
/(a|(b))+/
matches "aba" the value of the second captured substring is "b".
.SH BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a back reference to a capturing subpattern earlier
(i.e. to its left) in the pattern, provided there have been that many previous
capturing left parentheses.
However, if the decimal number following the backslash is less than 10, it is
always taken as a back reference, and causes an error only if there are not
that many capturing left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of the reference for
numbers less than 10. See the section entitled "Backslash" above for further
details of the handling of digits following a backslash.
A back reference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself. So the pattern
(sens|respons)e and \\1ibility
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If caseful matching is in force at the time of the
back reference, the case of letters is relevant. For example,
((?i)rah)\\s+\\1
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
cap
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -