The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present. </p> <p class="para"> For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations: <table border="5"> <caption><b>Single-character quantifiers</b></caption> <tbody valign="middle" class="tbody"> <tr valign="middle"> <td colspan="1" rowspan="1" align="left"><i>*</i></td> <td colspan="1" rowspan="1" align="left">equivalent to <i>{0,}</i></td> </tr> <tr valign="middle"> <td colspan="1" rowspan="1" align="left"><i>+</i></td> <td colspan="1" rowspan="1" align="left">equivalent to <i>{1,}</i></td> </tr> <tr valign="middle"> <td colspan="1" rowspan="1" align="left"><i>?</i></td> <td colspan="1" rowspan="1" align="left">equivalent to <i>{0,1}</i></td> </tr> </tbody> </table> </p> <p class="para"> It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example: <i>(a?)*</i> </p> <p class="para"> Earlier versions of Perl and PCRE used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such patterns are now accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken. </p> <p class="para"> By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between the sequences /* and */ and within the sequence, individual * and / characters may appear. 
An attempt to match C comments by applying the pattern <i>/\*.*\*/</i> to the string <i>/* first comment */ not comment /* second comment */</i> fails, because it matches the entire string due to the greediness of the .* item. </p> <p class="para"> However, if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern <i>/\*.*?\*/</i> does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in <i>\d??\d</i> which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches. </p> <p class="para"> If the <a href="reference.pcre.pattern.modifiers.html" class="link">PCRE_UNGREEDY</a> option is set (an option which is not available in Perl) then the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour. </p> <p class="para"> Quantifiers followed by <i>+</i> are "possessive". They match as many characters as possible and never give any back to let the rest of the pattern match. Thus <i>.*abc</i> matches "aabc" but <i>.*+abc</i> does not, because <i>.*+</i> consumes the whole string and leaves nothing for <i>abc</i> to match. Possessive quantifiers are available since PHP 4.3.3 and can be used to speed up processing. </p> <p class="para"> When a parenthesized subpattern is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more memory is required for the compiled pattern, in proportion to the size of the minimum or maximum. </p> <p class="para"> If a pattern starts with .* or .{0,} and the <a href="reference.pcre.pattern.modifiers.html" class="link">PCRE_DOTALL</a> option (equivalent to Perl's /s) is set, thus allowing the . 
to match newlines, then the pattern is implicitly anchored, because whatever follows will be tried against every character position in the subject string, so there is no point in retrying the overall match at any position after the first. PCRE treats such a pattern as though it were preceded by \A. In cases where it is known that the subject string contains no newlines, it is worth setting <a href="reference.pcre.pattern.modifiers.html" class="link">PCRE_DOTALL</a> when the pattern begins with .* in order to obtain this optimization, or alternatively using ^ to indicate anchoring explicitly. </p> <p class="para"> When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after <i>(tweedle[dume]{3}\s*)+</i> has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee". However, if there are nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For example, after <i>/(a|(b))+/</i> matches "aba" the value of the second captured substring is "b". </p> </div> <div id="regexp.reference.back-references" class="section"> <h2 class="title">Back references</h2> <p class="para"> Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier (i.e. to its left) in the pattern, provided there have been that many previous capturing left parentheses. </p> <p class="para"> However, if the decimal number following the backslash is less than 10, it is always taken as a back reference, and causes an error only if there are not that many capturing left parentheses in the entire pattern. In other words, the parentheses that are referenced need not be to the left of the reference for numbers less than 10. See the section entitled "Backslash" above for further details of the handling of digits following a backslash. 
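</p> <p class="para"> As a minimal sketch of a numbered back reference to an earlier capturing subpattern, the following uses Python's <i>re</i> module as a stand-in for PCRE. This is an assumption made for illustration: the two engines are not identical, but a plain <i>\1</i> that refers to a group to its left behaves alike in both. The names <i>DOUBLED_WORD</i> and <i>find_doubled_words</i> are hypothetical. </p>

```python
import re

# Assumption: Python's re module stands in for PCRE here; a plain numbered
# back reference such as \1 to an earlier group behaves alike in both.
# \1 must match the same text that group 1 captured, not just any word.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b")


def find_doubled_words(text):
    """Return each word that is immediately followed by a copy of itself."""
    return [match.group(1) for match in DOUBLED_WORD.finditer(text)]


print(find_doubled_words("it was the the best of of times"))
```

<p class="para"> Under these assumptions the call prints ['the', 'of'], because the back reference only succeeds where the second word repeats the text captured by the first.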
</p> <p class="para"> A back reference matches whatever actually matched the capturing subpattern in the current subject string, rather than anything matching the subpattern itself. So the pattern <i>(sens|respons)e and \1ibility</i> matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If caseful matching is in force at the time of the back reference, then the case of letters is relevant. For example, <i>((?i)rah)\s+\1</i> matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly. </p> <p class="para"> There may be more than one back reference to the same subpattern. If a subpattern has not actually been used in a particular match, then any back references to it always fail. For example, the pattern <i>(a|(bc))\2</i> always fails if it starts to match "a" rather than "bc". Because there may be up to 99 back references, all digits following the backslash are taken as part of a potential back reference number. If the pattern continues with a digit character, then some delimiter must be used to terminate the back reference. If the <a href="reference.pcre.pattern.modifiers.html" class="link">PCRE_EXTENDED</a> option is set, this can be whitespace. Otherwise an empty comment can be used. </p> <p class="para"> A back reference that occurs inside the parentheses to which it refers fails when the subpattern is first used, so, for example, (a\1) never matches. However, such references can be useful inside repeated subpatterns. For example, the pattern <i>(a|b\1)+</i> matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of the subpattern, the back reference matches the character string corresponding to the previous iteration. In order for this to work, the pattern must be such that the first iteration does not need to match the back reference. 
This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero. </p> <p class="para"> Back references to the named subpatterns can be achieved by <i>(?P=name)</i> or, since PHP 5.2.4, also by <i>\k<name></i>, <i>\k'name'</i>, <i>\k{name}</i> or <i>\g{name}</i>. </p> </div> <div id="regexp.reference.assertions" class="section"> <h2 class="title">Assertions</h2> <p class="para"> An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions coded as \b, \B, \A, \Z, \z, ^ and $ are described above. More complicated assertions are coded as subpatterns. There are two kinds: those that look ahead of the current position in the subject string, and those that look behind it. </p> <p class="para"> An assertion subpattern is matched in the normal way, except that it does not cause the current matching position to be changed. Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example, <i>\w+(?=;)</i> matches a word followed by a semicolon, but does not include the semicolon in the match, and <i>foo(?!bar)</i> matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern <i>(?!foo)bar</i> does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always <b><tt>TRUE</tt></b> when the next three characters are "bar". A lookbehind assertion is needed to achieve this effect. </p> <p class="para"> Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example, <i>(?<!foo)bar</i> does find an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. 
However, if there are several alternatives, they do not all have to have the same fixed length. Thus <i>(?<=bullock|donkey)</i> is permitted, but <i>(?<!dogs?|cats?)</i> causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl 5.005, which requires all branches to match the same length of string. An assertion such as <i>(?<=ab(c|de))</i> is not permitted, because its single top-level branch can match two different lengths, but it is acceptable if rewritten to use two top-level branches: <i>(?<=abc|abde)</i>. To implement a lookbehind assertion, PCRE temporarily moves the current position back by the fixed width of each alternative and then tries to match. If there are insufficient characters before the current position, the match is deemed to fail. Lookbehinds in conjunction with once-only subpatterns can be particularly useful for matching at the ends of strings; an example is given at the end of the section on once-only subpatterns. </p> <p class="para"> Several assertions (of any sort) may occur in succession. For example, <i>(?<=\d{3})(?<!999)foo</i> matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, then there is a check that the same three characters are not "999". This pattern does not match "foo" preceded by six characters, the first three of which are digits and the last three of which are not "999". For example, it doesn't match "123abcfoo". A pattern to do that is <i>(?<=\d{3}...)(?<!999)foo</i>. </p> <p class="para"> This time the first assertion looks at the preceding six characters, checking that the first three are digits, and then the second assertion checks that the preceding three characters are not "999". 
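</p> <p class="para"> The two successive-assertion patterns above can be compared side by side with a short sketch. It uses Python's <i>re</i> module as a stand-in for PCRE, which is an assumption made for illustration; both engines accept these fixed-width lookbehind assertions. The names <i>THREE_DIGITS_NOT_999</i>, <i>DIGITS_THEN_THREE_NOT_999</i> and <i>matches</i> are hypothetical. </p>

```python
import re

# Assumption: Python's re module stands in for PCRE here; both engines
# accept these fixed-width lookbehind assertions.
# "foo" preceded by three digits that are not "999".
THREE_DIGITS_NOT_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
# "foo" preceded by three digits, then three characters that are not "999".
DIGITS_THEN_THREE_NOT_999 = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Report whether the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None


for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(
        subject,
        matches(THREE_DIGITS_NOT_999, subject),
        matches(DIGITS_THEN_THREE_NOT_999, subject),
    )
```

<p class="para"> Under these assumptions the first pattern matches only "123foo" and the second matches only "123abcfoo". Neither matches "999foo" or "123999foo", because in both cases the three characters immediately before "foo" are "999".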
</p> <p class="para"> Assertions can be nested in any combination. For example, <i>(?<=(?<!foo)bar)baz</i> matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while <i>(?<=\d{3}...(?<!999))foo</i> is another pattern which matches "foo" preceded by three digits and any three characters that are not "999". </p> <p class="para"> Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the same thing several times. If any kind of assertion contains capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring capturing is carried out only for positive assertions, because it doe