pcre.txt
Note that the sequences \A, \Z, and \z can be used to match
the start and end of the subject in both modes, and if all
branches of a pattern start with \A it is always anchored,
whether PCRE_MULTILINE is set or not.
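The anchoring rule can be sketched with Python's re module, on the assumption that its \A and re.MULTILINE behave like PCRE's \A and PCRE_MULTILINE. Python's \Z corresponds to PCRE's \z rather than to PCRE's \Z, so only \A is shown.

```python
import re

subject = "first\nabc"

# With multiline matching, ^ can match just after the internal newline...
caret_hit = re.search(r"^abc", subject, re.MULTILINE)
# ...but \A still matches only at the very start of the subject.
anchor_hit = re.search(r"\Aabc", subject, re.MULTILINE)
```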
FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing char-
acter, but not (by default) newline. If the PCRE_DOTALL
option is set, dots match newlines as well. The handling of
dot is entirely independent of the handling of circumflex
and dollar, the only relationship being that they both
involve newline characters. Dot has no special meaning in a
character class.
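A minimal sketch of the dot rules, using Python's re module as a stand-in for PCRE and assuming re.DOTALL corresponds to PCRE_DOTALL:

```python
import re

subject = "a\nb"

default_hit = re.search(r"a.b", subject)             # dot does not match newline
dotall_hit = re.search(r"a.b", subject, re.DOTALL)   # dot matches newline as well

# Inside a character class the dot is an ordinary character.
class_dot_vs_x = re.fullmatch(r"[.]", "x")
class_dot_vs_dot = re.fullmatch(r"[.]", ".")
```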
SQUARE BRACKETS
An opening square bracket introduces a character class, ter-
minated by a closing square bracket. A closing square
bracket on its own is not special. If a closing square
bracket is required as a member of the class, it should be
the first data character in the class (after an initial cir-
cumflex, if present) or escaped with a backslash.
A character class matches a single character in the subject;
the character must be in the set of characters defined by
the class, unless the first character in the class is a cir-
cumflex, in which case the subject character must not be in
the set defined by the class. If a circumflex is actually
required as a member of the class, ensure it is not the
first character, or escape it with a backslash.
For example, the character class [aeiou] matches any lower
case vowel, while [^aeiou] matches any character that is not
a lower case vowel. Note that a circumflex is just a con-
venient notation for specifying the characters which are in
the class by enumerating those that are not. It is not an
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.
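The class rules described so far can be checked with a short sketch. It uses Python's re module, which is assumed to treat these class forms in the same way as PCRE.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")

vowel_e = vowel.fullmatch("e")          # matches
non_vowel_x = non_vowel.fullmatch("x")  # matches: "x" is not a vowel
non_vowel_e = non_vowel.fullmatch("e")  # no match

# A negated class is not an assertion. Starting the search at the end of
# the subject leaves no character to consume, so the match fails.
at_end = non_vowel.search("a", 1)

bracket_first = re.fullmatch(r"[]x]", "]")  # "]" as first data character
caret_later = re.fullmatch(r"[x^]", "^")    # "^" not first, so it is literal
```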
When caseless matching is set, any letters in a class
represent both their upper case and lower case versions, so
for example, a caseless [aeiou] matches "A" as well as "a",
and a caseless [^aeiou] does not match "A", whereas a case-
ful version would.
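As an illustration, with Python's re.IGNORECASE standing in for PCRE_CASELESS (an assumption of this sketch):

```python
import re

caseless_A = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)        # matches
caseless_neg_A = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)   # no match
caseful_neg_A = re.fullmatch(r"[^aeiou]", "A")                   # matches
```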
The newline character is never treated in any special way in
character classes, regardless of the PCRE_DOTALL and
PCRE_MULTILINE settings. A class such as [^a] will always
match a newline.
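A small check of this, assuming Python's re.DOTALL and re.MULTILINE mirror the PCRE options of the same names:

```python
import re

# The negated class matches a newline under every option setting.
plain = re.fullmatch(r"[^a]", "\n")
multiline = re.fullmatch(r"[^a]", "\n", re.MULTILINE)
dotall = re.fullmatch(r"[^a]", "\n", re.DOTALL)
```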
The minus (hyphen) character can be used to specify a range
of characters in a character class. For example, [d-m]
matches any letter between d and m, inclusive. If a minus
character is required in a class, it must be escaped with a
backslash or appear in a position where it cannot be inter-
preted as indicating a range, typically as the first or last
character in the class.
It is not possible to have the literal character "]" as the
end character of a range. A pattern such as [W-]46] is
interpreted as a class of two characters ("W" and "-") fol-
lowed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it
is interpreted as the end of range, so [W-\]46] is inter-
preted as a single class containing a range followed by two
separate characters. The octal or hexadecimal representation
of "]" can also be used to end a range.
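The range rules above can be sketched as follows. Python's re module is used here on the assumption that it parses these classes as PCRE does; the variable names are illustrative only.

```python
import re

d_to_m = re.compile(r"[d-m]")             # letters d through m
two_char_class = re.compile(r"[W-]46]")   # class [W-] followed by literal "46]"
range_class = re.compile(r"[W-\]46]")     # range W..] plus "4" and "6"
octal_range = re.compile(r"[W-\135]")     # \135 is the octal code of "]"

in_range = d_to_m.fullmatch("m")
out_of_range = d_to_m.fullmatch("n")
```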
Ranges operate in ASCII collating sequence. They can also be
used for characters specified numerically, for example
[\000-\037]. If a range that includes letters is used when
caseless matching is set, it matches the letters in either
case. For example, [W-c] is equivalent to [][\\^_`wxyzabc],
matched caselessly, and if character tables for the "fr"
locale are in use, [\xc8-\xcb] matches accented E characters
in both cases.
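The caseless range equivalence can be spot-checked with a sketch like the one below. It assumes Python's re.IGNORECASE handles ranges as PCRE_CASELESS does, and it does not attempt the locale-dependent "fr" example.

```python
import re

ranged = re.compile(r"[W-c]", re.IGNORECASE)
listed = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

# Probe characters inside and outside the range, in both cases.
probe = "WwXxZz[\\]^_`AaCcDdVv"
same = all(
    bool(ranged.fullmatch(ch)) == bool(listed.fullmatch(ch)) for ch in probe
)

# Numerically specified range: tab has code 9, inside \000-\037.
control = re.fullmatch(r"[\000-\037]", "\t")
```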
The character types \d, \D, \s, \S, \w, and \W may also
appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches any
hexadecimal digit. A circumflex can conveniently be used
with the upper case character types to specify a more res-
tricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit,
but not underscore.
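For illustration, the two example classes behave as follows in Python's re module, which is assumed to agree with PCRE for ASCII subjects:

```python
import re

hex_digit = re.compile(r"[\dABCDEF]")   # \d adds the digits to the class
letter_or_digit = re.compile(r"[^\W_]") # not a non-word char, not underscore

hex_results = [bool(hex_digit.fullmatch(c)) for c in "7CGc"]
word_results = [bool(letter_or_digit.fullmatch(c)) for c in "a7_-"]
```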
All non-alphanumeric characters other than \, -, ^ (at the
start) and the terminating ] are non-special in character
classes, but it does no harm if they are escaped.
POSIX CHARACTER CLASSES
Perl 5.6 (not yet released at the time of writing) is going
to support the POSIX notation for character classes, which
uses names enclosed by [: and :] within the enclosing
square brackets. PCRE supports this notation. For example,
  [01[:alpha:]%]
matches "0", "1", any alphabetic character, or "%". The sup-
ported class names are
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
The names "ascii" and "word" are Perl extensions. Another
Perl extension is negation, which is indicated by a ^ char-
acter after the colon. For example,
  [12[:^digit:]]
matches "1", "2", or any non-digit. PCRE (and Perl) also
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
"collating element", but these are not supported, and an
error is given if they are encountered.
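Python's re module does not recognize the POSIX notation, so it cannot demonstrate it directly. The sketch below instead shows what a non-negated class name stands for by expanding it into explicit ASCII ranges. The helper expand_posix and the POSIX_ASCII table are hypothetical, cover only some of the names, ignore the negated form, and assume ASCII character tables; they are not part of PCRE.

```python
import re

# ASCII-only expansions for a few class names (assumption: the real
# contents depend on the character tables in use).
POSIX_ASCII = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
    "word": "A-Za-z0-9_",
}

def expand_posix(pattern):
    """Replace each [:name:] with explicit ranges (hypothetical helper).

    Naive: it does not check that the name sits inside a bracket class.
    """
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

expanded = expand_posix("[01[:alpha:]%]")
results = [bool(re.fullmatch(expanded, c)) for c in "01q%2"]
```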
VERTICAL BAR
Vertical bar characters are used to separate alternative
patterns. For example, the pattern
  gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alter-
natives may appear, and an empty alternative is permitted
(matching the empty string). The matching process tries
each alternative in turn, from left to right, and the first
one that succeeds is used. If the alternatives are within a
subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the
subpattern.
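The left-to-right rule can be sketched with Python's re module, which is assumed to try alternatives in the same order as PCRE:

```python
import re

either = re.compile(r"gilbert|sullivan")
found = either.search("by sullivan").group()

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" is tried first, but "c" then fails, so "ab" is used instead.
needs_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(s|)", "cat")
```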
INTERNAL OPTION SETTING
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
and PCRE_EXTENDED can be changed from within the pattern by
a sequence of Perl option letters enclosed between "(?" and
")". The option letters are
  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED
For example, (?im) sets caseless, multiline matching. It is
also possible to unset these options by preceding the letter
with a hyphen, and a combined setting and unsetting such as
(?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
If a letter appears both before and after the hyphen, the
option is unset.
The scope of these option changes depends on where in the
pattern the setting occurs. For settings that are outside
any subpattern (defined below), the effect is the same as if
the options were set or unset at the start of matching. The
following patterns all behave in exactly the same way:
  (?i)abc
  a(?i)bc
  ab(?i)c
  abc(?i)
which in turn is the same as compiling the pattern abc with
PCRE_CASELESS set. In other words, such "top level" set-
tings apply to the whole pattern (unless there are other
changes inside subpatterns). If there is more than one set-
ting of the same option at top level, the rightmost setting
is used.
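A minimal sketch of a top-level setting, using Python's re module as a stand-in. Recent Python versions accept a global inline option only at the very start of the pattern, so only the first of the four forms above can be reproduced this way; re.IGNORECASE is assumed to correspond to PCRE_CASELESS.

```python
import re

inline = re.compile(r"(?i)abc")              # option set inside the pattern
flagged = re.compile(r"abc", re.IGNORECASE)  # option set at compile time

subjects = ["abc", "ABC", "aBc", "abd"]
inline_results = [bool(inline.fullmatch(s)) for s in subjects]
flagged_results = [bool(flagged.fullmatch(s)) for s in subjects]

# Several letters can be combined: caseless, multiline matching.
both = re.search(r"(?im)^abc$", "x\nABC\ny")
```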
If an option change occurs inside a subpattern, the effect
is different. This is a change of behaviour in Perl 5.005.
An option change inside a subpattern affects only that part
of the subpattern that follows it, so
  (a(?i)b)c
matches abc and aBc and no other strings (assuming
PCRE_CASELESS is not used). By this means, options can be
made to have different settings in different parts of the
pattern. Any changes made in one alternative do carry on
into subsequent branches within the same subpattern. For
example,
  (a(?i)b|c)
matches "ab", "aB", "c", and "C", even though when matching
"C" the first branch is abandoned before the option setting.
This is because the effects of option settings happen at
compile time. Otherwise the behaviour of a branch would
depend on which earlier branches had been tried at run time.
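Python's re module rejects an unscoped option change in the middle of a pattern, so the sketch below uses the scoped form (?i:...) as a stand-in that accepts the same strings as (a(?i)b)c. This substitution is an assumption of the example. The carry-over into later branches shown by (a(?i)b|c) cannot be reproduced this way, because the scoped form is confined to its own group.

```python
import re

# Only "b" is matched caselessly; "a" and "c" remain caseful.
scoped = re.compile(r"(a(?i:b))c")

candidates = ["abc", "aBc", "Abc", "abC", "ABC"]
accepted = [s for s in candidates if scoped.fullmatch(s)]
```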
The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
be changed in the same way as the Perl-compatible options by
using the characters U and X respectively. The (?X) flag
setting is special in that it must always occur earlier in
the pattern than any of the additional features it turns on,
even when it is at top level. It is best put at the start.
SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets),
which can be nested. Marking part of a pattern as a subpat-
tern does two things:
1. It localizes a set of alternatives. For example, the pat-
tern
  cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpil-
lar". Without the parentheses, it would match "cataract",
"erpillar" or the empty string.
2. It sets up the subpattern as a capturing subpattern (as
defined above). When the whole pattern matches, that por-
tion of the subject string that matched the subpattern is
passed back to the caller via the ovector argument of
pcre_exec(). Opening parentheses are counted from left to
right (starting from 1) to obtain the numbers of the captur-
ing subpatterns.
For example, if the string "the red king" is matched against
the pattern
  the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king",
and are numbered 1, 2, and 3, respectively.
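Both functions of parentheses can be sketched in Python's re module. The groups() method stands in here for reading the ovector filled in by pcre_exec(), and group numbering is assumed to follow the same left-to-right rule.

```python
import re

# 1. Parentheses localize the alternatives.
grouped = re.compile(r"cat(aract|erpillar|)")
grouped_results = [
    bool(grouped.fullmatch(w)) for w in ("cat", "cataract", "caterpillar")
]
ungrouped = re.compile(r"cataract|erpillar|")
ungrouped_erpillar = ungrouped.fullmatch("erpillar")
ungrouped_cat = ungrouped.fullmatch("cat")

# 2. Parentheses capture, numbered by opening parenthesis from the left.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()
```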
The fact that plain parentheses fulfil two functions is not
always helpful. There are often times when a grouping sub-
pattern is required without a capturing requirement. If an
opening parenthesis is followed by "?:", the subpattern does
not do any capturing, and is not counted when computing the
number of any subsequent capturing subpatterns. For example,
if the string "the white queen" is matched against the pat-
tern
  the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and
are numbered 1 and 2. The maximum number of captured sub-
strings is 99, and the maximum number of all subpatterns,
both capturing and non-capturing, is 200.
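The effect of "?:" on numbering can be seen in a short sketch, again using Python's re module on the assumption that it numbers groups as PCRE does:

```python
import re

m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# The (?:...) group is skipped when numbering, so only two substrings
# are captured.
captured = m.groups()
```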
As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
option letters may appear between the "?" and the ":". Thus
the two patterns
  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative
branches are tried from left to right, and options are not
reset until the end of the subpattern is reached, an option
setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
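The shorthand form can be tried in Python's re module, which supports (?i:...) in Python 3.6 and later. The second form, with (?i) inside the group, is not accepted by recent Python versions and is left out of this sketch.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option covers both branches of the subpattern.
subjects = ["SUNDAY", "Saturday", "sunday", "monday"]
results = [bool(weekend.fullmatch(s)) for s in subjects]
```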
REPETITION
Repetition is specified by quantifiers, w