</p><p>By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between /* and */ and within the comment, individual * and / characters may appear. An attempt to match C comments by applying the pattern</p>
<pre class="programlisting">/\*.*\*/</pre>
<p>to the string</p>
<pre class="programlisting">/* first comment */ not comment /* second comment */</pre>
<p>fails, because it matches the entire string owing to the greediness of the .* item.</p>
<p>However, if a quantifier is followed by a question mark, it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern</p>
<pre class="programlisting">/\*.*?\*/</pre>
<p>does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in</p>
<pre class="programlisting">\d??\d</pre>
<p>which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.</p>
<p>If the <code class="varname">G_REGEX_UNGREEDY</code> flag is set, the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark.
In other words, it inverts the default behaviour.</p>
<p>When a parenthesized subpattern is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more memory is required for the compiled pattern, in proportion to the size of the minimum or maximum.</p>
<p>If a pattern starts with .* or .{0,} and the <code class="varname">G_REGEX_DOTALL</code> flag is set, thus allowing the dot to match newlines, the pattern is implicitly anchored, because whatever follows will be tried against every character position in the string, so there is no point in retrying the overall match at any position after the first. GRegex normally treats such a pattern as though it were preceded by \A.</p>
<p>In cases where it is known that the string contains no newlines, it is worth setting <code class="varname">G_REGEX_DOTALL</code> in order to obtain this optimization, or alternatively using ^ to indicate anchoring explicitly.</p>
<p>However, there is one situation where the optimization cannot be used. When .* is inside capturing parentheses that are the subject of a backreference elsewhere in the pattern, a match at the start may fail where a later one succeeds. Consider, for example:</p>
<pre class="programlisting">(.*)abc\1</pre>
<p>If the string is "xyz123abc123" the match point is the fourth character. For this reason, such a pattern is not implicitly anchored.</p>
<p>When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after</p>
<pre class="programlisting">(tweedle[dume]{3}\s*)+</pre>
<p>has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee".
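</p>
<p>The final-iteration rule can be checked with a minimal sketch. It uses Python's <code>re</code> module purely as a stand-in for GRegex (an assumption: the GRegex C API is not shown here, and only this particular behaviour is claimed to coincide). The helper name <code>captured_groups</code> is hypothetical.</p>

```python
import re

# Python's re module stands in for GRegex here (an assumption); the helper
# name captured_groups is hypothetical and not part of any library.
def captured_groups(pattern: str, subject: str):
    """Return the tuple of captured groups if the whole subject matches, else None."""
    match = re.fullmatch(pattern, subject)
    return match.groups() if match else None

# The group is repeated twice; only the text of the last repetition is kept.
repeated = captured_groups(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(repeated)  # ('tweedledee',)
```

<p>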
However, if there are nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For example, after</p>
<pre class="programlisting">(a|(b))+</pre>
<p>matches "aba" the value of the second captured substring is "b".</p>
</div><div class="refsect1" lang="en"><a name="id2816195"></a><h2>Atomic grouping and possessive quantifiers</h2>
<p>With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") repetition, failure of what follows normally causes the repeated item to be re-evaluated to see if a different number of repeats allows the rest of the pattern to match. Sometimes it is useful to prevent this, either to change the nature of the match, or to cause it to fail earlier than it otherwise might, when the author of the pattern knows there is no point in carrying on.</p>
<p>Consider, for example, the pattern \d+foo when applied to the string</p>
<pre class="programlisting">123456bar</pre>
<p>After matching all 6 digits and then failing to match "foo", the normal action of the matcher is to try again with only 5 digits matching the \d+ item, and then with 4, and so on, before ultimately failing. "Atomic grouping" (a term taken from Jeffrey Friedl’s book) provides the means for specifying that once a subpattern has matched, it is not to be re-evaluated in this way.</p>
<p>If we use atomic grouping for the previous example, the matcher gives up immediately on failing to match "foo" the first time. The notation is a kind of special parenthesis, starting with (?> as in this example:</p>
<pre class="programlisting">(?>\d+)foo</pre>
<p>This kind of parenthesis "locks up" the part of the pattern it contains once it has matched, and a failure further into the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as normal.</p>
<p>An alternative description is that a subpattern of this type matches the string of characters that an identical standalone pattern would match, if anchored at the current point in the string.</p>
<p>Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as the above example can be thought of as a maximizing repeat that must swallow everything it can. So, while both \d+ and \d+? are prepared to adjust the number of digits they match in order to make the rest of the pattern match, (?>\d+) can only match an entire sequence of digits.</p>
<p>Atomic groups in general can of course contain arbitrarily complicated subpatterns, and can be nested. However, when the subpattern for an atomic group is just a single repeated item, as in the example above, a simpler notation, called a "possessive quantifier", can be used. This consists of an additional + character following a quantifier. Using this notation, the previous example can be rewritten as</p>
<pre class="programlisting">\d++foo</pre>
<p>Possessive quantifiers are always greedy; the setting of the <code class="varname">G_REGEX_UNGREEDY</code> option is ignored. They are a convenient notation for the simpler forms of atomic group. However, there is no difference in the meaning of a possessive quantifier and the equivalent atomic group, though there may be a performance difference; possessive quantifiers should be slightly faster.</p>
<p>The possessive quantifier syntax is an extension to the Perl syntax. It was invented by Jeffrey Friedl in the first edition of his book and then implemented by Mike McCloskey in Sun's Java package. It ultimately found its way into Perl at release 5.10.</p>
<p>GRegex has an optimization that automatically "possessifies" certain simple pattern constructs.
For example, the sequence A+B is treated as A++B because there is no point in backtracking into a sequence of A's when B must follow.</p>
<p>When a pattern contains an unlimited repeat inside a subpattern that can itself be repeated an unlimited number of times, the use of an atomic group is the only way to avoid some failing matches taking a very long time indeed. The pattern</p>
<pre class="programlisting">(\D+|<\d+>)*[!?]</pre>
<p>matches an unlimited number of substrings that either consist of non-digits, or digits enclosed in <>, followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to</p>
<pre class="programlisting">aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</pre>
<p>it takes a long time before reporting failure. This is because the string can be divided between the internal \D+ repeat and the external * repeat in a large number of ways, and all have to be tried. (The example uses [!?] rather than a single character at the end, because GRegex has an optimization that allows for fast failure when a single character is used. It remembers the last single character that is required for a match, and fails early if it is not present in the string.)
If the pattern is changed so that it uses an atomic group, like this:</p>
<pre class="programlisting">((?>\D+)|<\d+>)*[!?]</pre>
<p>sequences of non-digits cannot be broken, and failure happens quickly.</p>
</div><div class="refsect1" lang="en"><a name="id2816369"></a><h2>Back references</h2>
<p>Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier (that is, to its left) in the pattern, provided there have been that many previous capturing left parentheses.</p>
<p>However, if the decimal number following the backslash is less than 10, it is always taken as a back reference, and causes an error only if there are not that many capturing left parentheses in the entire pattern. In other words, the parentheses that are referenced need not be to the left of the reference for numbers less than 10. A "forward back reference" of this type can make sense when a repetition is involved and the subpattern to the right has participated in an earlier iteration.</p>
<p>It is not possible to have a numerical "forward back reference" to a subpattern whose number is 10 or more using this syntax, because a sequence such as \50 is interpreted as a character defined in octal. See the subsection entitled "Non-printing characters" above for further details of the handling of digits following a backslash. There is no such problem when named parentheses are used. A back reference to any subpattern is possible using named parentheses (see below).</p>
<p>Another way of avoiding the ambiguity inherent in the use of digits following a backslash is to use the \g escape sequence (introduced in Perl 5.10). This escape must be followed by a positive or a negative number, optionally enclosed in braces.</p>
<p>A positive number specifies an absolute reference without the ambiguity that is present in the older syntax. It is also useful when literal digits follow the reference. A negative number is a relative reference.
In "(abc(def)ghi)\g{-1}", the sequence \g{-1} is a reference to the most recently started capturing subpattern before \g, that is, it is equivalent to \2. Similarly, \g{-2} would be equivalent to \1. The use of relative references can be helpful in long patterns, and also in patterns that are created by joining together fragments that contain references within themselves.</p>
<p>A back reference matches whatever actually matched the capturing subpattern in the current string, rather than anything matching the subpattern itself (see "Subpatterns as subroutines" below for a way of doing that). So the pattern</p>
<pre class="programlisting">(sens|respons)e and \1ibility</pre>
<p>matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If caseful matching is in force at the time of the back reference, the case of letters is relevant. For example,</p>
<pre class="programlisting">((?i)rah)\s+\1</pre>
<p>matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly.</p>
<p>Back references to named subpatterns use the Perl syntax \k<name> or \k'name' or the Python syntax (?P=name). We could rewrite the above example in either of the following ways:</p>
<pre class="programlisting">(?<p1>(?i)rah)\s+\k<p1>
(?P<p1>(?i)rah)\s+(?P=p1)</pre>
<p>A subpattern that is referenced by name may appear in the pattern before or after the reference.</p>
<p>There may be more than one back reference to the same subpattern. If a subpattern has not actually been used in a particular match, any back references to it always fail. For example, the pattern</p>
<pre class="programlisting">(a|(bc))\2</pre>
<p>always fails if it starts to match "a" rather than "bc". Because there may be many capturing parentheses in a pattern, all digits following the backslash are taken as part of a potential back reference number. If the pattern continues with a digit character, some delimiter must be used to terminate the back reference.
If the <code class="varname">G_REGEX_EXTENDED</code> flag is set, this can be whitespace. Otherwise an empty comment (see "Comments" below) can be used.</p>
<p>A back reference that occurs inside the parentheses to which it refers fails when the subpattern is first used, so, for example, (a\1) never matches. However, such references can be useful inside repeated subpatterns. For example, the pattern</p>
<pre class="programlisting">(a|b\1)+</pre>
<p>matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of the subpattern, the back reference matches the character string corresponding to the previous iteration. In order for this to work, the pattern must be such that the first iteration does not need to match the back reference. This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero.</p>
</div><div class="refsect1" lang="en"><a name="id2813022"></a><h2>Assertions</h2>
<p>An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described above.</p>
<p>More complicated assertions are coded as subpatterns. There are two kinds: those that look ahead of the current position in the string, and those that look behind it. An assertion subpattern is matched in the normal way, except that it does not cause the current matching position to be changed.</p>
<p>Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the same thing several times. If any kind of assertion contains capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern.
However, substring capturing is carried out only for positive assertions, because it does not make sense for negative assertions.</p>
<div class="refsect2" lang="en"><a name="id2813046"></a><h3>Lookahead assertions</h3>
<p>Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example,</p>
<pre class="programlisting">\w+(?=;)</pre>
<p>matches a word followed by a semicolon, but does not include the semicolon in the match, and</p>
<pre class="programlisting">foo(?!bar)</pre>
<p>matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern</p>
<pre class="programlisting">(?!foo)bar</pre>
<p>doe