     with no special meaning is faulted.

     (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-
     tion  quantifiers  is inverted, that is, by default they are
     not greedy, but if followed by a question mark they are.

     (e) PCRE_ANCHORED can be used to force a pattern to be tried
     only at the start of the subject.

     (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options
     for pcre_exec() have no Perl equivalents.

     (g) The (?R) construct allows for recursive pattern matching.
     (Perl 5.6 can do this using the (?p{code}) construct, which
     PCRE cannot, of course, support.)



REGULAR EXPRESSION DETAILS
     The syntax and semantics of  the  regular  expressions  sup-
     ported  by PCRE are described below. Regular expressions are
     also described in the Perl documentation and in a number  of
     other  books,  some  of which have copious examples. Jeffrey
     Friedl's  "Mastering  Regular  Expressions",  published   by
     O'Reilly (ISBN 1-56592-257), covers them in great detail.

     The description here is intended as reference documentation.
     The basic operation of PCRE is on strings of bytes. However,
     there are the beginnings of some support for UTF-8 character
     strings.  To  use  this  support  you must configure PCRE to
     include it, and then call pcre_compile() with the  PCRE_UTF8
     option.  How  this affects the pattern matching is described
     in the final section of this document.

     A regular expression is a pattern that is matched against  a
     subject string from left to right. Most characters stand for
     themselves in a pattern, and match the corresponding charac-
     ters in the subject. As a trivial example, the pattern

       The quick brown fox

     matches a portion of a subject string that is  identical  to
     itself.  The  power  of  regular  expressions comes from the
     ability to include alternatives and repetitions in the  pat-
     tern.  These  are encoded in the pattern by the use of meta-
     characters, which do not stand for  themselves  but  instead
     are interpreted in some special way.
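
     As a small illustration of alternatives and repetitions, the
     sketch below uses Python's re module as a stand-in for PCRE
     (an assumption of this example; the helper name first_match
     is hypothetical). The "|" offers two alternatives and the "?"
     makes the final group optional.

```python
import re

# Python's re module stands in for PCRE here (an assumption of this sketch).
FOX = re.compile(r"The (quick|slow) brown fox(es)?")

def first_match(subject):
    """Return the matched portion of subject, or None if nothing matches."""
    m = FOX.search(subject)
    return m.group(0) if m else None
```

     With these definitions, first_match("See: The slow brown
     foxes ran") returns only the portion "The slow brown foxes".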

     There are two different sets of meta-characters: those  that
     are  recognized anywhere in the pattern except within square
     brackets, and those that are recognized in square  brackets.
     Outside square brackets, the meta-characters are as follows:

       \      general escape character with several uses
       ^      assert start of subject (or line, in multiline mode)
       $      assert end of subject (or line, in multiline mode)
       .      match any character except newline (by default)
       [      start character class definition
       |      start of alternative branch
       (      start subpattern
       )      end subpattern
       ?      extends the meaning of (
              also 0 or 1 quantifier
              also quantifier minimizer
       *      0 or more quantifier
       +      1 or more quantifier
       {      start min/max quantifier

     Part of a pattern that is in square  brackets  is  called  a
     "character  class".  In  a  character  class  the only meta-
     characters are:

       \      general escape character
       ^      negate the class, but only if the first character
       -      indicates character range
       ]      terminates the character class

     The following sections describe  the  use  of  each  of  the
     meta-characters.



BACKSLASH
     The backslash character has several uses. Firstly, if it  is
     followed by a non-alphanumeric character, it takes away any
     special meaning that character may have. This use of
     backslash as an escape character applies both inside and
     outside character classes.

     For example, if you want to match a "*" character, you write
     "\*" in the pattern. This applies whether or not the follow-
     ing character would otherwise  be  interpreted  as  a  meta-
     character, so it is always safe to precede a non-alphanumeric
     character with "\" to specify that it stands for itself. In
     particular, if you want to match a backslash, you write "\\".
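
     The "always safe" rule can be sketched as a helper that puts
     a backslash in front of every ASCII character that is not a
     letter or digit. The names quote_literal and matches_literally
     are hypothetical, and Python's re module is assumed to treat
     such escapes the same way PCRE does.

```python
import re

def quote_literal(text):
    """Backslash-escape every ASCII character that is not a letter or digit."""
    out = []
    for ch in text:
        if ch.isascii() and not ch.isalnum():
            out.append("\\")
        out.append(ch)
    return "".join(out)

def matches_literally(text, subject):
    """True if the quoted form of text matches subject in its entirety."""
    return re.fullmatch(quote_literal(text), subject) is not None
```

     For instance, quote_literal("a*b") yields the pattern a\*b,
     which matches only the three characters "a*b".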

     If a pattern is compiled with the PCRE_EXTENDED option, whi-
     tespace in the pattern (other than in a character class) and
     characters between a "#" outside a character class  and  the
     next  newline  character  are ignored. An escaping backslash
     can be used to include a whitespace or "#" character as part
     of the pattern.
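
     A rough analogue can be tried with Python's re.VERBOSE flag,
     which is assumed here to behave like PCRE_EXTENDED for this
     simple case. Layout whitespace and "#" comments are ignored,
     while the backslashed space and "#" are matched literally.

```python
import re

# re.VERBOSE is used as a stand-in for PCRE_EXTENDED (an assumption).
EXTENDED = re.compile(r"""
    \d+    # one or more digits
    \      # an escaped, literal space
    \#     # an escaped, literal hash
    \d+    # one or more digits
""", re.VERBOSE)

def is_ticket(subject):
    """True if subject looks like '12 #34' as a whole."""
    return EXTENDED.fullmatch(subject) is not None
```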

     A second use of backslash provides a way  of  encoding  non-
     printing  characters  in patterns in a visible manner. There
     is no restriction on the appearance of non-printing  charac-
     ters,  apart from the binary zero that terminates a pattern,
     but when a pattern is being prepared by text editing, it  is
     usually  easier to use one of the following escape sequences
     than the binary character it represents:

       \a     alarm, that is, the BEL character (hex 07)
       \cx    "control-x", where x is any character
       \e     escape (hex 1B)
       \f     formfeed (hex 0C)
       \n     newline (hex 0A)
       \r     carriage return (hex 0D)
       \t     tab (hex 09)
       \xhh   character with hex code hh
       \ddd   character with octal code ddd, or backreference

     The precise effect of "\cx" is as follows: if "x" is a lower
     case  letter,  it  is converted to upper case. Then bit 6 of
     the character (hex 40) is inverted.  Thus "\cz" becomes  hex
     1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
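
     The rule for "\cx" amounts to two steps, sketched below in
     Python. The function name control_escape is hypothetical and
     the input is assumed to be a single ASCII character.

```python
def control_escape(x):
    """Return the character code produced by "\\cx" under the rule above."""
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are first converted to upper case.
        code = ord(x.upper())
    # Then bit 6 (hex 40) is inverted.
    return code ^ 0x40
```

     This reproduces the three values quoted above: hex 1A for
     "z", hex 3B for "{" and hex 7B for ";".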

     After "\x", up to two hexadecimal digits are  read  (letters
     can be in upper or lower case).

     After "\0" up to two further octal digits are read. In  both
     cases,  if  there are fewer than two digits, just those that
     are present are used. Thus the sequence "\0\x\07"  specifies
     two binary zeros followed by a BEL character.  Make sure you
     supply two digits after the initial zero  if  the  character
     that follows is itself an octal digit.
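
     The digit-reading rule for "\x" and "\0" can be sketched as a
     tiny decoder. This is an illustration only, not PCRE's own
     parser: the name decode_simple_escapes is hypothetical, the
     input is assumed to be ASCII, and every other escape is left
     untouched.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
OCTAL_DIGITS = "01234567"

def decode_simple_escapes(text):
    """Decode only \\xhh and \\0dd escapes; copy all other characters."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in "x0":
            is_hex = text[i + 1] == "x"
            digits = HEX_DIGITS if is_hex else OCTAL_DIGITS
            j = i + 2
            # Read at most two digits after "\x" or "\0".
            while j < len(text) and j < i + 4 and text[j] in digits:
                j += 1
            # With no digits at all the value is a binary zero.
            out.append(int(text[i + 2:j] or "0", 16 if is_hex else 8))
            i = j
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```

     Applied to the sequence \0\x\07 it yields two binary zeros
     followed by hex 07, as described above.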

     The handling of a backslash followed by a digit other than 0
     is  complicated.   Outside  a character class, PCRE reads it
     and any following digits as a decimal number. If the  number
     is  less  than  10, or if there have been at least that many
     previous capturing left parentheses in the  expression,  the
     entire  sequence is taken as a back reference. A description
     of how this works is given later, following  the  discussion
     of parenthesized subpatterns.

     Inside a character  class,  or  if  the  decimal  number  is
     greater  than  9 and there have not been that many capturing
     subpatterns, PCRE re-reads up to three octal digits  follow-
     ing  the  backslash,  and  generates  a single byte from the
     least significant 8 bits of the value. Any subsequent digits
     stand for themselves.  For example:

       \040   is another way of writing a space
       \40    is the same, provided there are fewer than 40
                 previous capturing subpatterns
       \7     is always a back reference
       \11    might be a back reference, or another way of
                 writing a tab
       \011   is always a tab
       \0113  is a tab followed by the character "3"
       \113   is the character with octal code 113 (since there
                 can be no more than 99 back references)
       \377   is a byte consisting entirely of 1 bits
       \81    is either a back reference, or a binary zero
                 followed by the two characters "8" and "1"

     Note that octal values of 100 or greater must not be  intro-
     duced  by  a  leading zero, because no more than three octal
     digits are ever read.
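
     The decision between a back reference and an octal byte can
     be sketched as follows. The function name and its return
     shape are hypothetical. It takes the run of digits after the
     backslash (first digit not 0) and the number of capturing
     left parentheses seen so far.

```python
def interpret_digit_escape(digits, captures_so_far, in_class=False):
    """Classify a backslash followed by digits (first digit not 0).

    Returns (kind, value, literal_rest), where kind is "backref" or "byte".
    """
    number = int(digits)
    if not in_class and (number < 10 or captures_so_far >= number):
        return ("backref", number, "")
    # Otherwise re-read up to three octal digits after the backslash.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    # Keep the least significant 8 bits; no octal digits gives a binary zero.
    value = int(digits[:n], 8) & 0xFF if n else 0
    # Any subsequent digits stand for themselves.
    return ("byte", value, digits[n:])
```

     For example, interpret_digit_escape("81", 0) gives a binary
     zero followed by the literal characters "81".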

     All the sequences that define a single  byte  value  can  be
     used both inside and outside character classes. In addition,
     inside a character class, the sequence "\b"  is  interpreted
     as  the  backspace  character  (hex 08). Outside a character
     class it has a different meaning (see below).

     The third use of backslash is for specifying generic charac-
     ter types:

       \d     any decimal digit
       \D     any character that is not a decimal digit
       \s     any whitespace character
       \S     any character that is not a whitespace character
       \w     any "word" character
       \W     any "non-word" character

     Each pair of escape sequences partitions the complete set of
     characters  into  two  disjoint  sets.  Any  given character
     matches one, and only one, of each pair.
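
     The partition property can be checked exhaustively over all
     256 byte values. The sketch below uses Python's re module
     with byte patterns as a stand-in for PCRE with its default
     character tables (an assumption); the helper name is
     hypothetical.

```python
import re

PAIRS = [(rb"\d", rb"\D"), (rb"\s", rb"\S"), (rb"\w", rb"\W")]

def partitions_all_bytes(positive, negative):
    """True if every byte value matches exactly one of the two patterns."""
    for value in range(256):
        byte = bytes([value])
        hit = re.fullmatch(positive, byte) is not None
        miss = re.fullmatch(negative, byte) is not None
        if hit == miss:
            return False
    return True
```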

     A "word" character is any letter or digit or the  underscore
     character,  that  is,  any  character which can be part of a
     Perl "word". The definition of letters and  digits  is  con-
     trolled  by PCRE's character tables, and may vary if locale-
     specific matching is  taking  place  (see  "Locale  support"
     above). For example, in the "fr" (French) locale, some char-
     acter codes greater than 128 are used for accented  letters,
     and these are matched by \w.

     These character type sequences can appear  both  inside  and
     outside  character classes. They each match one character of
     the appropriate type. If the current matching  point  is  at
     the end of the subject string, all of them fail, since there
     is no character to match.

     The fourth use of backslash is  for  certain  simple  asser-
     tions. An assertion specifies a condition that has to be met
     at a particular point in  a  match,  without  consuming  any
     characters  from  the subject string. The use of subpatterns
     for more complicated  assertions  is  described  below.  The
     backslashed assertions are

       \b     word boundary
       \B     not a word boundary
       \A     start of subject (independent of multiline mode)
       \Z     end of subject or newline at end (independent of multiline mode)
       \z     end of subject (independent of multiline mode)

     These assertions may not appear in  character  classes  (but
     note that "\b" has a different meaning, namely the backspace
     character, inside a character class).

     A word boundary is a position in the  subject  string  where
     the current character and the previous character do not both
     match \w or \W (i.e. one matches \w and  the  other  matches
     \W),  or the start or end of the string if the first or last
     character matches \w, respectively.
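
     The definition of a word boundary can be written as a direct
     test on the characters either side of a position. This is a
     sketch with hypothetical names, assuming ASCII letters,
     digits and underscore as the word characters (that is, no
     locale-specific tables).

```python
def is_word_char(ch):
    """ASCII approximation of \\w: a letter, a digit or the underscore."""
    return ch.isascii() and (ch.isalnum() or ch == "_")

def is_word_boundary(subject, pos):
    """True if position pos in subject is a word boundary."""
    # Outside the string there is no character, which counts as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after
```

     In the subject "ab cd" this reports boundaries at offsets 0,
     2, 3 and 5.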

     The \A, \Z, and \z assertions differ  from  the  traditional
     circumflex  and  dollar  (described below) in that they only
     ever match at the very start and end of the subject  string,
     whatever  options  are  set.  They  are  not affected by the
     PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
     ment  of  pcre_exec()  is  non-zero, \A can never match. The
     difference between \Z and \z is that  \Z  matches  before  a
     newline  that is the last character of the string as well as
     at the end of the string, whereas \z  matches  only  at  the
     end.
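
     The difference between \Z and \z can be expressed as two
     position tests. The function names below are hypothetical,
     and the sketch ignores the startoffset argument.

```python
def matches_upper_z(subject, pos):
    """\\Z: at the end, or just before a newline that ends the subject."""
    end = len(subject)
    return pos == end or (pos == end - 1 and subject[pos] == "\n")

def matches_lower_z(subject, pos):
    """\\z: only at the very end of the subject."""
    return pos == len(subject)
```

     For the subject "abc\n", \Z is satisfied at offsets 3 and 4,
     whereas \z is satisfied only at offset 4.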



CIRCUMFLEX AND DOLLAR
     Outside a character class, in the default matching mode, the
     circumflex  character  is an assertion which is true only if
     the current matching point is at the start  of  the  subject
     string.  If  the startoffset argument of pcre_exec() is non-
     zero, circumflex can never match. Inside a character  class,
     circumflex has an entirely different meaning (see below).

     Circumflex need not be the first character of the pattern if
     a  number of alternatives are involved, but it should be the
     first thing in each alternative in which it appears  if  the
     pattern is ever to match that branch. If all possible alter-
     natives start with a circumflex, that is, if the pattern  is
     constrained to match only at the start of the subject, it is
     said to be an "anchored" pattern. (There are also other con-
     structs that can cause a pattern to be anchored.)

     A dollar character is an assertion which is true only if the
     current  matching point is at the end of the subject string,
     or immediately before a newline character that is  the  last
     character in the string (by default). Dollar need not be the
     last character of the pattern if a  number  of  alternatives
     are  involved,  but it should be the last item in any branch
     in which it appears.  Dollar has no  special  meaning  in  a
     character class.

     The meaning of dollar can be changed so that it matches only
     at the very end of the string, by setting the
     PCRE_DOLLAR_ENDONLY option at compile time. This does not
     affect the \Z assertion.

     The meanings of the circumflex  and  dollar  characters  are
     changed  if  the  PCRE_MULTILINE option is set. When this is
     the case,  they  match  immediately  after  and  immediately
     before an internal "\n" character, respectively, in addition
     to matching at the start and end of the subject string.  For
     example,  the  pattern  /^abc$/  matches  the subject string
     "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-
     quently,  patterns  that  are  anchored  in single line mode
     because all branches start with "^" are not anchored in mul-
     tiline mode, and a match for circumflex is possible when the
     startoffset  argument  of  pcre_exec()  is   non-zero.   The
     PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is
     set.
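
     The rules for circumflex and dollar can be summarised as two
     position tests. This is a sketch with hypothetical names and
     keyword arguments standing for the PCRE_MULTILINE and
     PCRE_DOLLAR_ENDONLY options. It ignores startoffset,
     PCRE_NOTBOL and PCRE_NOTEOL, and it assumes that a newline
     which is the last character of the subject does not count as
     "internal" for circumflex.

```python
def circumflex_matches(subject, pos, multiline=False):
    """True if "^" is satisfied at position pos."""
    if pos == 0:
        return True
    # Multiline mode: also immediately after an internal newline.
    return multiline and pos < len(subject) and subject[pos - 1] == "\n"

def dollar_matches(subject, pos, multiline=False, dollar_endonly=False):
    """True if "$" is satisfied at position pos."""
    end = len(subject)
    if pos == end:
        return True
    if multiline:
        # Before any newline; dollar_endonly is ignored in this mode.
        return subject[pos] == "\n"
    if dollar_endonly:
        return False
    # Default: also just before a newline that ends the subject.
    return pos == end - 1 and subject[pos] == "\n"
```

     For the subject "def\nabc", circumflex is satisfied at offset
     4 and dollar at offset 7 only when multiline is set, which is
     why /^abc$/ matches in multiline mode but not otherwise.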