📄 pcre.3

can be changed from within the pattern by a sequence of Perl option letters
enclosed between "(?" and ")". The option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the letter with a hyphen, and a combined
setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
permitted. If a letter appears both before and after the hyphen, the option is
unset.

The scope of these option changes depends on where in the pattern the setting
occurs. For settings that are outside any subpattern (defined below), the
effect is the same as if the options were set or unset at the start of
matching. The following patterns all behave in exactly the same way:

  (?i)abc
  a(?i)bc
  ab(?i)c
  abc(?i)

which in turn is the same as compiling the pattern abc with PCRE_CASELESS set.
In other words, such "top level" settings apply to the whole pattern (unless
there are other changes inside subpatterns). If there is more than one setting
of the same option at top level, the rightmost setting is used.

If an option change occurs inside a subpattern, the effect is different. This
is a change of behaviour in Perl 5.005. An option change inside a subpattern
affects only that part of the subpattern that follows it, so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
By this means, options can be made to have different settings in different
parts of the pattern. Any changes made in one alternative do carry on
into subsequent branches within the same subpattern. For example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the first
branch is abandoned before the option setting. This is because the effects of
option settings happen at compile time. There would be some very weird
behaviour otherwise.

The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the
same way as the Perl-compatible options by using the characters U and X
respectively. The (?X) flag setting is special in that it must always occur
earlier in the pattern than any of the additional features it turns on, even
when it is at top level. It is best put at the start.

.SH SUBPATTERNS
Subpatterns are delimited by parentheses (round brackets), which can be nested.
Marking part of a pattern as a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern

  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without the
parentheses, it would match "cataract", "erpillar" or the empty string.

2. It sets up the subpattern as a capturing subpattern (as defined above).
When the whole pattern matches, that portion of the subject string that matched
the subpattern is passed back to the caller via the \fIovector\fR argument of
\fBpcre_exec()\fR. Opening parentheses are counted from left to right (starting
from 1) to obtain the numbers of the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3, respectively.

The fact that plain parentheses fulfil two functions is not always helpful.
There are often times when a grouping subpattern is required without a
capturing requirement. If an opening parenthesis is followed by "?:", the
subpattern does not do any capturing, and is not counted when computing the
number of any subsequent capturing subpatterns. For example, if the string "the
white queen" is matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered 1 and
2.
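
The grouping and capture-numbering rules above can be tried out with a short
sketch. It uses Python's standard re module rather than PCRE itself, on the
assumption (true for these particular patterns) that the two agree; the
variable names are illustrative only and are not part of any PCRE interface.

```python
import re

# Grouping only: the parentheses localize the three alternatives
# (the last one is empty), so "erpillar" on its own is rejected.
words = re.compile(r"cat(aract|erpillar|)")
candidates = ("cat", "cataract", "caterpillar", "erpillar")
matched_words = [w for w in candidates if words.fullmatch(w)]

# Capturing groups are numbered by the position of their opening
# parenthesis, counted from left to right starting at 1.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()

# "?:" makes the inner group non-capturing, so it is skipped when
# the following groups are numbered.
m_nc = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captured_nc = m_nc.groups()

print(matched_words)  # ['cat', 'cataract', 'caterpillar']
print(captured)       # ('red king', 'red', 'king')
print(captured_nc)    # ('white queen', 'queen')
```

With PCRE proper the same substrings would be read back through the ovector
argument of pcre_exec() instead of a groups() call.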
The maximum number of captured substrings is 99, and the maximum number of
all subpatterns, both capturing and non-capturing, is 200.

As a convenient shorthand, if any option settings are required at the start of
a non-capturing subpattern, the option letters may appear between the "?" and
the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are tried
from left to right, and options are not reset until the end of the subpattern
is reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".

.SH REPETITION
Repetition is specified by quantifiers, which can follow any of the following
items:

  a single character, possibly escaped
  the . metacharacter
  a character class
  a back reference (see next section)
  a parenthesized subpattern (unless it is an assertion - see below)

The general repetition quantifier specifies a minimum and maximum number of
permitted matches, by giving the two numbers in curly brackets (braces),
separated by a comma. The numbers must be less than 65536, and the first must
be less than or equal to the second. For example:

  z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
character. If the second number is omitted, but the comma is present, there is
no upper limit; if the second number and the comma are both omitted, the
quantifier specifies an exact number of required matches. Thus

  [aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

  \\d{8}

matches exactly 8 digits. An opening curly bracket that appears in a position
where a quantifier is not allowed, or one that does not match the syntax of a
quantifier, is taken as a literal character. For example, {,6} is not a
quantifier, but a literal string of four characters.

The quantifier {0} is permitted, causing the expression to behave as if the
previous item and the quantifier were not present.

For convenience (and historical compatibility) the three most common
quantifiers have single-character abbreviations:

  *    is equivalent to {0,}
  +    is equivalent to {1,}
  ?    is equivalent to {0,1}

It is possible to construct infinite loops by following a subpattern that can
match no characters with a quantifier that has no upper limit, for example:

  (a?)*

Earlier versions of Perl and PCRE used to give an error at compile time for
such patterns. However, because there are cases where this can be useful, such
patterns are now accepted, but if any repetition of the subpattern does in fact
match no characters, the loop is forcibly broken.

By default, the quantifiers are "greedy", that is, they match as much as
possible (up to the maximum number of permitted times), without causing the
rest of the pattern to fail. The classic example of where this gives problems
is in trying to match comments in C programs. These appear between the
sequences /* and */ and within the sequence, individual * and / characters may
appear. An attempt to match C comments by applying the pattern

  /\\*.*\\*/

to the string

  /* first comment */  not comment  /* second comment */

fails, because it matches the entire string owing to the greediness of the .*
item.

However, if a quantifier is followed by a question mark, it ceases to be
greedy, and instead matches the minimum number of times possible, so the
pattern

  /\\*.*?\\*/

does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of matches.
Do not confuse this use of question mark with its use as a quantifier in its
own right.
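
The difference between greedy and non-greedy matching of C comments, and the
bounds of a {min,max} quantifier, can be checked with a minimal sketch. It
again uses Python's re module as a stand-in for PCRE, which is an assumption
that holds for these patterns but not for every PCRE feature. Python raw
strings need only a single backslash where this page writes a doubled one.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the end of the subject and backs off only as far as
# the last "*/", so the single match is the entire string.
greedy = re.findall(r"/\*.*\*/", subject)

# Non-greedy: .*? stops at the first "*/" it can, giving one match
# per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

# z{2,4} accepts between two and four repetitions and nothing else.
accepted_lengths = [n for n in range(1, 6) if re.fullmatch(r"z{2,4}", "z" * n)]

print(greedy == [subject])  # True
print(lazy)                 # ['/* first comment */', '/* second comment */']
print(accepted_lengths)     # [2, 3, 4]
```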
Because it has two uses, it can sometimes appear doubled, as in

  \\d??\\d

which matches one digit by preference, but can match two if that is the only
way the rest of the pattern matches.

If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
the quantifiers are not greedy by default, but individual ones can be made
greedy by following them with a question mark. In other words, it inverts the
default behaviour.

When a parenthesized subpattern is quantified with a minimum repeat count that
is greater than 1 or with a limited maximum, more store is required for the
compiled pattern, in proportion to the size of the minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
implicitly anchored, because whatever follows will be tried against every
character position in the subject string, so there is no point in retrying the
overall match at any position after the first. PCRE treats such a pattern as
though it were preceded by \\A. In cases where it is known that the subject
string contains no newlines, it is worth setting PCRE_DOTALL when the pattern
begins with .* in order to obtain this optimization, or alternatively using ^
to indicate anchoring explicitly.

When a capturing subpattern is repeated, the value captured is the substring
that matched the final iteration. For example, after

  (tweedle[dume]{3}\\s*)+

has matched "tweedledum tweedledee" the value of the captured substring is
"tweedledee". However, if there are nested capturing subpatterns, the
corresponding captured values may have been set in previous iterations. For
example, after

  /(a|(b))+/

matches "aba" the value of the second captured substring is "b".

.SH BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a back reference to a capturing subpattern earlier
(i.e. to its left) in the pattern, provided there have been that many previous
capturing left parentheses.

However, if the decimal number following the backslash is less than 10, it is
always taken as a back reference, and causes an error only if there are not
that many capturing left parentheses in the entire pattern. In other words, the
parentheses that are referenced need not be to the left of the reference for
numbers less than 10. See the section entitled "Backslash" above for further
details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing subpattern in
the current subject string, rather than anything matching the subpattern
itself. So the pattern

  (sens|respons)e and \\1ibility

matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If caseful matching is in force at the time of the
back reference, the case of letters is relevant. For example,

  ((?i)rah)\\s+\\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
capturing subpattern is matched caselessly.

There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail. For example, the pattern

  (a|(bc))\\2

always fails if it starts to match "a" rather than "bc". Because there may be
up to 99 back references, all digits following the backslash are taken
as part of a potential back reference number. If the pattern continues with a
digit character, some delimiter must be used to terminate the back reference.
If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty
comment can be used.

A back reference that occurs inside the parentheses to which it refers fails
when the subpattern is first used, so, for example, (a\\1) never matches.
However, such references can be useful inside repeated subpatterns. For
example, the pattern

  (a|b\\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the subpattern, the back reference matches the character string corresponding
to the previous iteration. In order for this to work, the pattern must be such
that the first iteration does not need to match the back reference. This can be
done using alternation, as in the example above, or by a quantifier with a
minimum of zero.

.SH ASSERTIONS
An assertion is a test on the chara
