📄 pcre.txt

     also described in the Perl documentation and in a number of
     other books, some of which have copious examples. Jeffrey
     Friedl's "Mastering Regular Expressions", published by
     O'Reilly (ISBN 1-56592-257), covers them in great detail.
     The description here is intended as reference documentation.

     The basic operation of PCRE is on strings of bytes. However,
     there is the beginning of some support for UTF-8 character
     strings. To use this support you must configure PCRE to
     include it, and then call pcre_compile() with the PCRE_UTF8
     option. How this affects the pattern matching is described
     in the final section of this document.

     A regular expression is a pattern that is matched against a
     subject string from left to right. Most characters stand for
     themselves in a pattern, and match the corresponding
     characters in the subject. As a trivial example, the pattern

       The quick brown fox

     matches a portion of a subject string that is identical to
     itself. The power of regular expressions comes from the
     ability to include alternatives and repetitions in the
     pattern. These are encoded in the pattern by the use of
     meta-characters, which do not stand for themselves but
     instead are interpreted in some special way.

     There are two different sets of meta-characters: those that
     are recognized anywhere in the pattern except within square
     brackets, and those that are recognized in square brackets.
     Outside square brackets, the meta-characters are as follows:

       \      general escape character with several uses
       ^      assert start of subject (or line, in multiline mode)
       $      assert end of subject (or line, in multiline mode)
       .      match any character except newline (by default)
       [      start character class definition
       |      start of alternative branch
       (      start subpattern
       )      end subpattern
       ?      extends the meaning of (
              also 0 or 1 quantifier
              also quantifier minimizer
       *      0 or more quantifier
       +      1 or more quantifier
       {      start min/max quantifier

     Part of a pattern that is in square brackets is called a
     "character class". In a character class the only
     meta-characters are:

       \      general escape character
       ^      negate the class, but only if the first character
       -      indicates character range
       ]      terminates the character class

     The following sections describe the use of each of the
     meta-characters.

BACKSLASH
     The backslash character has several uses. Firstly, if it is
     followed by a non-alphanumeric character, it takes away any
     special meaning that character may have. This use of
     backslash as an escape character applies both inside and
     outside character classes.

     For example, if you want to match a "*" character, you write
     "\*" in the pattern. This applies whether or not the
     following character would otherwise be interpreted as a
     meta-character, so it is always safe to precede a
     non-alphanumeric character with "\" to specify that it
     stands for itself. In particular, if you want to match a
     backslash, you write "\\".

     If a pattern is compiled with the PCRE_EXTENDED option,
     whitespace in the pattern (other than in a character class)
     and characters between a "#" outside a character class and
     the next newline character are ignored. An escaping
     backslash can be used to include a whitespace or "#"
     character as part of the pattern.

     A second use of backslash provides a way of encoding
     non-printing characters in patterns in a visible manner.
     There is no restriction on the appearance of non-printing
     characters, apart from the binary zero that terminates a
     pattern, but when a pattern is being prepared by text
     editing, it is usually easier to use one of the following
     escape sequences than the binary character it represents:

       \a     alarm, that is, the BEL character (hex 07)
       \cx    "control-x", where x is any character
       \e     escape (hex 1B)
       \f     formfeed (hex 0C)
       \n     newline (hex 0A)
       \r     carriage return (hex 0D)
       \t     tab (hex 09)
       \xhh   character with hex code hh
       \ddd   character with octal code ddd, or backreference

     The precise effect of "\cx" is as follows: if "x" is a lower
     case letter, it is converted to upper case. Then bit 6 of
     the character (hex 40) is inverted. Thus "\cz" becomes hex
     1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.

     After "\x", up to two hexadecimal digits are read (letters
     can be in upper or lower case).

     After "\0" up to two further octal digits are read. In both
     cases, if there are fewer than two digits, just those that
     are present are used. Thus the sequence "\0\x\07" specifies
     two binary zeros followed by a BEL character. Make sure you
     supply two digits after the initial zero if the character
     that follows is itself an octal digit.

     The handling of a backslash followed by a digit other than 0
     is complicated. Outside a character class, PCRE reads it
     and any following digits as a decimal number. If the number
     is less than 10, or if there have been at least that many
     previous capturing left parentheses in the expression, the
     entire sequence is taken as a back reference. A description
     of how this works is given later, following the discussion
     of parenthesized subpatterns.
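
     The "\cx" rule described above can be restated as a short
     Python sketch. The function name control_escape is
     hypothetical, and the restriction to a single ASCII
     character is an assumption made for this illustration; it
     is not part of PCRE.

```python
def control_escape(x: str) -> int:
    """Return the byte value that "\\cx" stands for, following the rule above."""
    # Assumption for this sketch: x is exactly one ASCII character.
    if len(x) != 1 or ord(x) > 0x7F:
        raise ValueError("expected a single ASCII character")
    if "a" <= x <= "z":
        x = x.upper()        # a lower case letter is converted to upper case
    return ord(x) ^ 0x40     # then bit 6 (hex 40) is inverted


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

     Run on the three characters used in the text, the sketch
     gives hex 1A, 3B and 7B, the same values the text lists.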

     Inside a character class, or if the decimal number is
     greater than 9 and there have not been that many capturing
     subpatterns, PCRE re-reads up to three octal digits
     following the backslash, and generates a single byte from
     the least significant 8 bits of the value. Any subsequent
     digits stand for themselves. For example:

       \040   is another way of writing a space
       \40    is the same, provided there are fewer than 40
                 previous capturing subpatterns
       \7     is always a back reference
       \11    might be a back reference, or another way of
                 writing a tab
       \011   is always a tab
       \0113  is a tab followed by the character "3"
       \113   is the character with octal code 113 (since there
                 can be no more than 99 back references)
       \377   is a byte consisting entirely of 1 bits
       \81    is either a back reference, or a binary zero
                 followed by the two characters "8" and "1"

     Note that octal values of 100 or greater must not be
     introduced by a leading zero, because no more than three
     octal digits are ever read.

     All the sequences that define a single byte value can be
     used both inside and outside character classes. In addition,
     inside a character class, the sequence "\b" is interpreted
     as the backspace character (hex 08). Outside a character
     class it has a different meaning (see below).

     The third use of backslash is for specifying generic
     character types:

       \d     any decimal digit
       \D     any character that is not a decimal digit
       \s     any whitespace character
       \S     any character that is not a whitespace character
       \w     any "word" character
       \W     any "non-word" character

     Each pair of escape sequences partitions the complete set of
     characters into two disjoint sets. Any given character
     matches one, and only one, of each pair.
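
     The rules given earlier for a backslash followed by digits
     are easier to check against the example table when written
     as a decision function. The Python sketch below only
     restates those rules; it is not PCRE's parser. The names
     interpret_digit_escape, groups_so_far and in_class are
     hypothetical.

```python
def interpret_digit_escape(digits, groups_so_far, in_class=False):
    """Classify the digits that follow a backslash.

    digits        -- the run of decimal digits after the backslash
    groups_so_far -- capturing left parentheses seen before this point
    in_class      -- True if the escape is inside a character class

    Returns ("backref", n, "") or ("byte", value, leftover_digits).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)             # read all the digits as decimal
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits after the backslash.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    value = (int(digits[:n], 8) & 0xFF) if n else 0   # keep the low 8 bits
    return ("byte", value, digits[n:])   # leftover digits stand for themselves


for text, groups in [("040", 0), ("11", 0), ("11", 11), ("0113", 0), ("81", 0)]:
    print("\\" + text, groups, interpret_digit_escape(text, groups))
```

     Under these assumptions "\11" is a tab when fewer than 11
     groups precede it and a back reference otherwise, and "\81"
     with no preceding groups yields a binary zero with "81"
     left over, as in the table.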

     A "word" character is any letter or digit or the underscore
     character, that is, any character which can be part of a
     Perl "word". The definition of letters and digits is
     controlled by PCRE's character tables, and may vary if
     locale-specific matching is taking place (see "Locale
     support" above). For example, in the "fr" (French) locale,
     some character codes greater than 128 are used for accented
     letters, and these are matched by \w.

     These character type sequences can appear both inside and
     outside character classes. They each match one character of
     the appropriate type. If the current matching point is at
     the end of the subject string, all of them fail, since there
     is no character to match.

     The fourth use of backslash is for certain simple
     assertions. An assertion specifies a condition that has to
     be met at a particular point in a match, without consuming
     any characters from the subject string. The use of
     subpatterns for more complicated assertions is described
     below. The backslashed assertions are

       \b     word boundary
       \B     not a word boundary
       \A     start of subject (independent of multiline mode)
       \Z     end of subject or newline at end (independent of
              multiline mode)
       \z     end of subject (independent of multiline mode)

     These assertions may not appear in character classes (but
     note that "\b" has a different meaning, namely the backspace
     character, inside a character class).

     A word boundary is a position in the subject string where
     the current character and the previous character do not both
     match \w or \W (i.e. one matches \w and the other matches
     \W), or the start or end of the string if the first or last
     character matches \w, respectively.
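
     The word boundary definition above can be written as a
     position test. The following Python sketch assumes plain
     ASCII character tables (letters, digits and underscore) in
     place of PCRE's locale-dependent tables; the names
     WORD_CHARS, is_word and is_word_boundary are hypothetical.

```python
import string

# Assumption: ASCII-only "word" characters, standing in for PCRE's tables.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_word(ch: str) -> bool:
    return ch in WORD_CHARS


def is_word_boundary(subject: str, pos: int) -> bool:
    """True if position pos (0..len(subject)) satisfies the \\b definition."""
    # Off either end of the string counts as "not a word character".
    before = pos > 0 and is_word(subject[pos - 1])
    after = pos < len(subject) and is_word(subject[pos])
    return before != after   # exactly one side is a word character


subject = "ab cd"
print([p for p in range(len(subject) + 1) if is_word_boundary(subject, p)])
```

     Treating positions beyond either end of the string as
     non-word characters folds the "start or end of the string"
     case into the same comparison.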

     The \A, \Z, and \z assertions differ from the traditional
     circumflex and dollar (described below) in that they only
     ever match at the very start and end of the subject string,
     whatever options are set. They are not affected by the
     PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset
     argument of pcre_exec() is non-zero, \A can never match. The
     difference between \Z and \z is that \Z matches before a
     newline that is the last character of the string as well as
     at the end of the string, whereas \z matches only at the
     end.

CIRCUMFLEX AND DOLLAR
     Outside a character class, in the default matching mode, the
     circumflex character is an assertion which is true only if
     the current matching point is at the start of the subject
     string. If the startoffset argument of pcre_exec() is
     non-zero, circumflex can never match. Inside a character
     class, circumflex has an entirely different meaning (see
     below).

     Circumflex need not be the first character of the pattern if
     a number of alternatives are involved, but it should be the
     first thing in each alternative in which it appears if the
     pattern is ever to match that branch. If all possible
     alternatives start with a circumflex, that is, if the
     pattern is constrained to match only at the start of the
     subject, it is said to be an "anchored" pattern. (There are
     also other constructs that can cause a pattern to be
     anchored.)

     A dollar character is an assertion which is true only if the
     current matching point is at the end of the subject string,
     or immediately before a newline character that is the last
     character in the string (by default). Dollar need not be the
     last character of the pattern if a number of alternatives
     are involved, but it should be the last item in any branch
     in which it appears. Dollar has no special meaning in a
     character class.
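
     The positions at which the start and end assertions hold in
     the default mode can be compared side by side. The Python
     sketch below assumes that no option flags are set and that
     startoffset is zero; the three function names are
     hypothetical and the code only restates the definitions
     above.

```python
def at_start(subject: str, pos: int) -> bool:
    """Circumflex in the default mode, and \\A."""
    return pos == 0


def at_very_end(subject: str, pos: int) -> bool:
    """\\z: only the very end of the subject."""
    return pos == len(subject)


def at_end_or_final_newline(subject: str, pos: int) -> bool:
    """Dollar in the default mode, and \\Z."""
    if pos == len(subject):
        return True
    # Also true immediately before a newline that is the last character.
    return pos == len(subject) - 1 and subject[pos] == "\n"


subject = "abc\n"
positions = range(len(subject) + 1)
print([p for p in positions if at_end_or_final_newline(subject, p)])
print([p for p in positions if at_very_end(subject, p)])
```

     For a subject that ends in a newline, the dollar and \Z test
     holds at two positions while the \z test holds at one.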

     The meaning of dollar can be changed so that it matches only
     at the very end of the string, by setting the
     PCRE_DOLLAR_ENDONLY option at compile or matching time. This
     does not affect the \Z assertion.

     The meanings of the circumflex and dollar characters are
     changed if the PCRE_MULTILINE option is set. When this is
     the case, they match immediately after and immediately
     before an internal "\n" character, respectively, in addition
     to matching at the start and end of the subject string. For
     example, the pattern /^abc$/ matches the subject string
     "def\nabc" in multiline mode, but not otherwise.
     Consequently, patterns that are anchored in single line mode
     because all branches start with "^" are not anchored in
     multiline mode, and a match for circumflex is possible when
     the startoffset argument of pcre_exec() is non-zero. The
     PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
     set.

     Note that the sequences \A, \Z, and \z can be used to match
     the start and end of the subject in both modes, and if all
     branches of a pattern start with \A it is always anchored,
     whether PCRE_MULTILINE is set or not.

FULL STOP (PERIOD, DOT)
     Outside a character class, a dot in the pattern matches any
     one character in the subject, including a non-printing
     character, but not (by default) newline. If the PCRE_DOTALL
     option is set, dots match newlines as well. The handling of
     dot is entirely independent of the handling of circumflex
     and dollar, the only relationship being that they both
     involve newline characters. Dot has no special meaning in a
     character class.

SQUARE BRACKETS
     An opening square bracket introduces a character class,
     terminated by a closing square bracket. A closing square
     bracket on its own is not special. If a closing square
     bracket is required as a member of the class, it should be
     the first data character in the class (after an initial
     circumflex, if present) or escaped with a backslash.
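
     The ways of placing a closing square bracket in a class can
     be tried out with Python's re module. This is an assumption
     of convenience: re is a different engine from PCRE, but it
     accepts the same spellings for this particular case. The
     variable names are hypothetical.

```python
import re

# "]" as the first data character of the class
first_position = re.compile(r"[]ab]+")
# "]" as the first data character after an initial circumflex
after_negation = re.compile(r"[^]ab]+")
# "]" escaped with a backslash, anywhere in the class
escaped = re.compile(r"[ab\]]+")
# a closing square bracket on its own is an ordinary character
bare = re.compile(r"b]")

subject = "ab]-c"
print(first_position.match(subject).group())
print(escaped.match(subject).group())
print(after_negation.search(subject).group())
print(bare.search(subject).group())
```

     The first two patterns match the same leading run of the
     subject, and the negated class matches the remainder.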