     a letter.

     If a non-zero starting offset is passed when the pattern  is
     anchored, one attempt to match at the given offset is tried.
     This can only succeed if the pattern does  not  require  the
     match to be at the start of the subject.

     In general, a pattern matches a certain portion of the  sub-
     ject,  and  in addition, further substrings from the subject
     may be picked out by parts of  the  pattern.  Following  the
     usage  in  Jeffrey Friedl's book, this is called "capturing"
     in what follows, and the phrase  "capturing  subpattern"  is
     used for a fragment of a pattern that picks out a substring.
     PCRE supports several other kinds of  parenthesized  subpat-
     tern that do not cause substrings to be captured.

     Captured substrings are returned to the caller via a  vector
     of  integer  offsets whose address is passed in ovector. The
     number of elements in the vector is passed in ovecsize.  The
     first two-thirds of the vector is used to pass back captured
     substrings, each substring using a  pair  of  integers.  The
     remaining  third  of  the  vector  is  used  as workspace by
     pcre_exec() while matching capturing subpatterns, and is not
     available for passing back information. The length passed in
     ovecsize should always be a multiple of three. If it is not,
     it is rounded down.

     When a match has been successful, information about captured
     substrings is returned in pairs of integers, starting at the
     beginning of ovector, and continuing up to two-thirds of its
     length  at  the  most. The first element of a pair is set to
     the offset of the first character in a  substring,  and  the
     second is set to the offset of the first character after the
     end of a substring. The first pair, ovector[0] and
     ovector[1], identifies the portion of the subject string
     matched by the entire pattern. The next pair is used for the
     first capturing subpattern, and so on. The value returned by
     pcre_exec() is the number of pairs that have been set. If
     there are no capturing subpatterns, the return value from a
     successful match is 1, indicating that just the first pair
     of offsets has been set.
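
     As a rough illustration of the pair layout described above,
     the following C sketch reads substrings out of an offsets
     vector. It does not call PCRE. The vector is filled in by
     hand with the values that matching "key=value" against the
     pattern (\w+)=(\w+) would be expected to produce, and the
     names copy_pair(), demo_subject and demo_ovector are
     hypothetical, introduced only for this example.

```c
#include <string.h>

/* Hypothetical helper, not part of PCRE: copy the substring described by
   pair number `pair` of an offsets vector into buf, zero-terminated.
   Returns the length, or -1 if the pair is unset or buf is too small. */
int copy_pair(const char *subject, const int *ovector, int pair,
              char *buf, int bufsize)
{
    int start = ovector[2 * pair];      /* offset of the first character */
    int end   = ovector[2 * pair + 1];  /* offset just past the last one */
    int len   = end - start;

    if (start < 0 || len < 0 || len >= bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}

/* Assumed result of a successful match with a return value of 3:
   pair 0 = whole match, pair 1 = "key", pair 2 = "value".
   The final third of the vector is workspace and carries no results. */
const char demo_subject[] = "key=value";
const int  demo_ovector[9] = { 0, 9,   0, 3,   4, 9,   0, 0, 0 };
```

     Calling copy_pair(demo_subject, demo_ovector, 2, buf, size)
     on a sufficiently large buffer would leave "value" in buf.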

     Some convenience functions are provided for  extracting  the
     captured substrings as separate strings. These are described
     in the following section.

     It is possible for capturing subpattern number n+1 to
     match some part of the subject when subpattern n has not
     been used at all. For example, if the string "abc" is
     matched against the pattern (a|(z))(bc), subpatterns 1 and 3
     are matched, but 2 is not. When this happens, both offset
     values corresponding to the unused subpattern are set to -1.

     If a capturing subpattern is matched repeatedly, it  is  the
     last  portion  of  the  string  that  it  matched  that gets
     returned.

     If the vector is too small to hold  all  the  captured  sub-
     strings,  it is used as far as possible (up to two-thirds of
     its length), and the function returns a value  of  zero.  In
     particular,  if  the  substring offsets are not of interest,
     pcre_exec() may be called with ovector passed  as  NULL  and
     ovecsize  as  zero.  However,  if  the pattern contains back
     references and the ovector isn't big enough to remember  the
     related  substrings,  PCRE  has to get additional memory for
     use during matching. Thus it is usually advisable to  supply
     an ovector.

     Note that pcre_info() can be used to find out how many  cap-
     turing  subpatterns  there  are  in  a compiled pattern. The
     smallest size for ovector that will  allow  for  n  captured
     substrings  in  addition  to  the  offsets  of the substring
     matched by the whole pattern is (n+1)*3.
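
     The sizing arithmetic can be sketched in C as follows. The
     helper names are hypothetical and are not PCRE functions.
     They only restate the rules given above: (n+1)*3 elements
     for n captured substrings, rounding down to a multiple of
     three, and one third of the vector's length as the maximum
     number of pairs that can be returned.

```c
/* Hypothetical helpers, not part of PCRE. */

/* Smallest ovector length that holds the whole match plus n captures. */
int ovector_size_for(int ncaptures)
{
    return (ncaptures + 1) * 3;
}

/* Length that is effectively used when ovecsize is not a multiple of 3. */
int usable_ovecsize(int ovecsize)
{
    return ovecsize - ovecsize % 3;
}

/* Maximum number of offset pairs that fit: two-thirds of the vector
   holds results, and each result takes two integers. */
int max_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

     For a pattern with two capturing subpatterns this gives a
     vector of 9 integers, of which the first 6 can carry results.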

     If pcre_exec() fails, it returns a negative number. The fol-
     lowing are defined in the header file:

       PCRE_ERROR_NOMATCH        (-1)

     The subject string did not match the pattern.

       PCRE_ERROR_NULL           (-2)

     Either code or subject was passed as NULL,  or  ovector  was
     NULL and ovecsize was not zero.

       PCRE_ERROR_BADOPTION      (-3)

     An unrecognized bit was set in the options argument.

       PCRE_ERROR_BADMAGIC       (-4)

     PCRE stores a 4-byte "magic number" at the start of the com-
     piled  code,  to  catch  the  case  when it is passed a junk
     pointer. This is the error it gives when  the  magic  number
     isn't present.

       PCRE_ERROR_UNKNOWN_NODE   (-5)

     While running the pattern match, an unknown item was encoun-
     tered in the compiled pattern. This error could be caused by
     a bug in PCRE or by overwriting of the compiled pattern.

       PCRE_ERROR_NOMEMORY       (-6)

     If a pattern contains back references, but the ovector  that
     is  passed  to pcre_exec() is not big enough to remember the
     referenced substrings, PCRE gets a block of  memory  at  the
     start  of  matching to use for this purpose. If the call via
     pcre_malloc() fails, this error  is  given.  The  memory  is
     freed at the end of matching.
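
     A caller will usually want to turn these codes into
     messages. The sketch below shows one way to do that. The
     function exec_error_text() is hypothetical, and the
     constants are defined locally with the values listed above
     only so that the example stands alone. A real program would
     take them from the PCRE header file.

```c
/* Values copied from the list above. A real program would include
   the PCRE header instead of defining these itself. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH       (-1)
#define PCRE_ERROR_NULL          (-2)
#define PCRE_ERROR_BADOPTION     (-3)
#define PCRE_ERROR_BADMAGIC      (-4)
#define PCRE_ERROR_UNKNOWN_NODE  (-5)
#define PCRE_ERROR_NOMEMORY      (-6)
#endif

/* Hypothetical helper, not part of PCRE: describe a pcre_exec() result. */
const char *exec_error_text(int rc)
{
    switch (rc) {
    case PCRE_ERROR_NOMATCH:      return "subject did not match the pattern";
    case PCRE_ERROR_NULL:         return "NULL argument passed";
    case PCRE_ERROR_BADOPTION:    return "unrecognized option bit";
    case PCRE_ERROR_BADMAGIC:     return "magic number missing from compiled code";
    case PCRE_ERROR_UNKNOWN_NODE: return "unknown item in compiled pattern";
    case PCRE_ERROR_NOMEMORY:     return "could not get memory for back references";
    default:
        /* Zero and positive values are not errors. */
        return rc >= 0 ? "no error" : "unrecognized error code";
    }
}
```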




EXTRACTING CAPTURED SUBSTRINGS
     Captured substrings can be accessed directly  by  using  the
     offsets returned by pcre_exec() in ovector. For convenience,
     the functions  pcre_copy_substring(),  pcre_get_substring(),
     and  pcre_get_substring_list()  are  provided for extracting
     captured  substrings  as  new,   separate,   zero-terminated
     strings.   A  substring  that  contains  a  binary  zero  is
     correctly extracted and has a further zero added on the end,
     but the result does not, of course, function as a C string.

     The first three arguments are the same for all  three  func-
     tions:  subject  is  the  subject string which has just been
     successfully matched, ovector is a pointer to the vector  of
     integer   offsets   that  was  passed  to  pcre_exec(),  and
     stringcount is the number of substrings that  were  captured
     by  the  match,  including  the  substring  that matched the
     entire regular expression. This is the value returned by
     pcre_exec() if it is greater than zero. If pcre_exec()
     returned zero, indicating that it ran out of space in
     ovector, the value passed as stringcount should be the size
     of the vector divided by three.
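
     The rule for choosing stringcount can be written as a small
     C function. This is only a sketch: stringcount_for() is a
     hypothetical name, and returning zero for a failed match is
     a choice made for this example, since the text does not
     define a stringcount in that case.

```c
/* Hypothetical helper, not part of PCRE: derive the stringcount argument
   from the value returned by pcre_exec() and the ovecsize passed to it. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;  /* vector was too small: it is full */
    return 0;                 /* match failed: nothing to extract */
}
```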

     The functions pcre_copy_substring() and pcre_get_substring()
     extract a single substring, whose number is given as string-
     number. A value of zero extracts the substring that  matched
     the entire pattern, while higher values extract the captured
     substrings. For pcre_copy_substring(), the string is  placed
     in  buffer,  whose  length is given by buffersize, while for
     pcre_get_substring() a new block of memory is  obtained  via
     pcre_malloc,  and its address is returned via stringptr. The
     yield of the function is  the  length  of  the  string,  not
     including the terminating zero, or one of

       PCRE_ERROR_NOMEMORY       (-6)

     The buffer was too small for pcre_copy_substring(),  or  the
     attempt to get memory failed for pcre_get_substring().

       PCRE_ERROR_NOSUBSTRING    (-7)

     There is no substring whose number is stringnumber.

     The pcre_get_substring_list() function extracts  all  avail-
     able  substrings  and builds a list of pointers to them. All
     this is done in a single block of memory which  is  obtained
     via pcre_malloc. The address of the memory block is returned
     via listptr, which is also the start of the list  of  string
     pointers.  The  end of the list is marked by a NULL pointer.
     The yield of the function is zero if all went well, or

       PCRE_ERROR_NOMEMORY       (-6)

     if the attempt to get the memory block failed.

     When any of these functions encounters a substring that is
     unset, which can happen when capturing subpattern number n+1
     matches some part of the subject, but subpattern n  has  not
     been  used  at all, they return an empty string. This can be
     distinguished  from  a  genuine  zero-length  substring   by
     inspecting the appropriate offset in ovector, which is nega-
     tive for unset substrings.
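
     The check described above might look like this in C. The
     enum and the function classify_capture() are hypothetical.
     The sample vector is filled in by hand with the offsets that
     matching "abc" against (a|(z))(bc) would be expected to
     give, where subpattern 2 is unset.

```c
/* Hypothetical types and helper, not part of PCRE. */
enum capture_state { CAPTURE_UNSET, CAPTURE_EMPTY, CAPTURE_NONEMPTY };

/* Tell an unset substring apart from a genuine zero-length one by
   looking at its offsets in the vector. */
enum capture_state classify_capture(const int *ovector, int stringnumber)
{
    int start = ovector[2 * stringnumber];
    int end   = ovector[2 * stringnumber + 1];

    if (start < 0)
        return CAPTURE_UNSET;   /* both offsets are -1 */
    return (end == start) ? CAPTURE_EMPTY : CAPTURE_NONEMPTY;
}

/* Assumed offsets for "abc" against (a|(z))(bc): whole match 0..3,
   subpattern 1 = "a", subpattern 2 unset, subpattern 3 = "bc".
   The last four elements are workspace. */
const int unset_demo_ovector[12] = { 0, 3,  0, 1,  -1, -1,  1, 3,  0, 0, 0, 0 };
```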

     The  two  convenience  functions  pcre_free_substring()  and
     pcre_free_substring_list()  can  be  used to free the memory
     returned by  a  previous  call  of  pcre_get_substring()  or
     pcre_get_substring_list(),  respectively.  They  do  nothing
     more than call the function pointed to by  pcre_free,  which
     of  course  could  be called directly from a C program. How-
     ever, PCRE is used in some situations where it is linked via
     a  special  interface  to another programming language which
     cannot use pcre_free directly; it is for  these  cases  that
     the functions are provided.



LIMITATIONS
     There are some size limitations in PCRE but it is hoped that
     they will never in practice be relevant.  The maximum length
     of a compiled pattern is 65539 (sic) bytes.  All  values  in
     repeating quantifiers must be less than 65536. The maximum
     number of capturing subpatterns is 65535. There is no
     limit  to  the  number of non-capturing subpatterns, but the
     maximum depth of nesting of all kinds of parenthesized  sub-
     pattern,  including  capturing  subpatterns, assertions, and
     other types of subpattern, is 200.

     The maximum length of a subject string is the largest  posi-
     tive number that an integer variable can hold. However, PCRE
     uses recursion to handle subpatterns and indefinite  repeti-
     tion.  This  means  that the available stack space may limit
     the size of a subject string that can be processed  by  cer-
     tain patterns.



DIFFERENCES FROM PERL
     The differences described here  are  with  respect  to  Perl
     5.005.

     1. By default, a whitespace character is any character  that
     the  C  library  function isspace() recognizes, though it is
     possible to compile PCRE  with  alternative  character  type
     tables. Normally isspace() matches space, formfeed, newline,
     carriage return, horizontal tab, and vertical tab. Perl 5 no
     longer  includes vertical tab in its set of whitespace char-
     acters. The \v escape that was in the Perl documentation for
     a long time was never in fact recognized. However, the char-
     acter itself was treated as whitespace at least up to 5.002.
     In 5.004 and 5.005 it does not match \s.

     2. PCRE does  not  allow  repeat  quantifiers  on  lookahead
     assertions. Perl permits them, but they do not mean what you
     might think. For example, (?!a){3} does not assert that  the
     next  three characters are not "a". It just asserts that the
     next character is not "a" three times.

     3. Capturing subpatterns that occur inside  negative  looka-
     head  assertions  are  counted,  but  their  entries  in the
     offsets vector are never set. Perl sets its numerical  vari-
     ables  from  any  such  patterns that are matched before the
     assertion fails to match something (thereby succeeding), but
     only  if  the negative lookahead assertion contains just one
     branch.

     4. Though binary zero characters are supported in  the  sub-
     ject  string,  they  are  not  allowed  in  a pattern string
     because it is passed as a normal  C  string,  terminated  by
     zero. The escape sequence "\0" can be used in the pattern to
     represent a binary zero.

     5. The following Perl escape sequences  are  not  supported:
     \l,  \u,  \L,  \U,  \E, \Q. In fact these are implemented by
     Perl's general string-handling and are not part of its  pat-
     tern matching engine.

     6. The Perl \G assertion is  not  supported  as  it  is  not
     relevant to single pattern matches.

     7. Fairly obviously, PCRE does not support the (?{code}) and
     (?p{code})  constructions. However, there is some experimen-
     tal support for recursive patterns using the  non-Perl  item
     (?R).

     8. There are at the time of writing some  oddities  in  Perl
     5.005_02  concerned  with  the  settings of captured strings
     when part of a pattern is repeated.  For  example,  matching
     "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value
     "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2
     unset.    However,    if   the   pattern   is   changed   to
     /^(aa(b(b))?)+$/ then $2 (and $3) are set.

     In Perl 5.004 $2 is set in both cases, and that is also true
     of PCRE. If in the future Perl changes to a consistent state
     that is different, PCRE may change to follow.

     9. Another as yet unresolved discrepancy  is  that  in  Perl
     5.005_02  the  pattern /^(a)?(?(1)a|b)+$/ matches the string
     "a", whereas in PCRE it does not.  However, in both Perl and
     PCRE /^(a)?a/ matched against "a" leaves $1 unset.

     10. PCRE  provides  some  extensions  to  the  Perl  regular
     expression facilities:

     (a) Although lookbehind assertions must match  fixed  length
     strings,  each  alternative branch of a lookbehind assertion
     can match a different length of string. Perl 5.005  requires
     them all to have the same length.

     (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
     set, the $ metacharacter matches only at the very end of
     the string.

     (c) If PCRE_EXTRA is set, a backslash followed by  a  letter
