example, the pattern</para>
<programlisting>gilbert|sullivan</programlisting>
<para>matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.</para>
</refsect1>
<refsect1>
<title>Internal option setting</title>
<para>The settings of the <varname>G_REGEX_CASELESS</varname>, <varname>G_REGEX_MULTILINE</varname>, <varname>G_REGEX_DOTALL</varname>, and <varname>G_REGEX_EXTENDED</varname> options can be changed from within the pattern by a sequence of Perl-style option letters enclosed between "(?" and ")". The option letters are</para>
<table frame="all" colsep="1" rowsep="1">
<title>Option settings</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
 <row> <entry>Option</entry> <entry>Flag</entry> </row>
</thead>
<tbody>
 <row> <entry>i</entry> <entry><varname>G_REGEX_CASELESS</varname></entry> </row>
 <row> <entry>m</entry> <entry><varname>G_REGEX_MULTILINE</varname></entry> </row>
 <row> <entry>s</entry> <entry><varname>G_REGEX_DOTALL</varname></entry> </row>
 <row> <entry>x</entry> <entry><varname>G_REGEX_EXTENDED</varname></entry> </row>
</tbody>
</tgroup>
</table>
<para>For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and a combined setting and unsetting such as (?im-sx), which sets <varname>G_REGEX_CASELESS</varname> and <varname>G_REGEX_MULTILINE</varname> while unsetting <varname>G_REGEX_DOTALL</varname> and <varname>G_REGEX_EXTENDED</varname>, is also permitted.
If a letter appears both before and after the hyphen, the option is unset.</para>
<para>When an option change occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern that follows.</para>
<para>An option change within a subpattern (see below for a description of subpatterns) affects only that part of the subpattern that follows it, so</para>
<programlisting>(a(?i)b)c</programlisting>
<para>matches abc and aBc and no other strings (assuming <varname>G_REGEX_CASELESS</varname> is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example,</para>
<programlisting>(a(?i)b|c)</programlisting>
<para>matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise.</para>
<para>The options <varname>G_REGEX_UNGREEDY</varname>, <varname>G_REGEX_EXTRA</varname>, and <varname>G_REGEX_DUPNAMES</varname> can be changed in the same way as the Perl-compatible options by using the characters U, X and J respectively.</para>
</refsect1>
<refsect1>
<title>Subpatterns</title>
<para>Subpatterns are delimited by parentheses (round brackets), which can be nested. Turning part of a pattern into a subpattern does two things:</para>
<itemizedlist>
<listitem><para>It localizes a set of alternatives. For example, the pattern cat(aract|erpillar|) matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or an empty string.</para></listitem>
<listitem><para>It sets up the subpattern as a capturing subpattern.
This means that, when the whole pattern matches, that portion of the string that matched the subpattern can be obtained using <function>g_regex_fetch()</function>. Opening parentheses are counted from left to right (starting from 1, as subpattern 0 is the whole matched string) to obtain numbers for the capturing subpatterns.</para></listitem>
</itemizedlist>
<para>For example, if the string "the red king" is matched against the pattern</para>
<programlisting>the ((red|white) (king|queen))</programlisting>
<para>the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.</para>
<para>The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by a question mark and a colon, the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern</para>
<programlisting>the ((?:red|white) (king|queen))</programlisting>
<para>the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of capturing subpatterns is 65535.</para>
<para>As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns</para>
<programlisting>(?i:saturday|sunday)
(?:(?i)saturday|sunday)</programlisting>
<para>match exactly the same set of strings.
Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".</para>
</refsect1>
<refsect1>
<title>Named subpatterns</title>
<para>Identifying capturing parentheses by number is simple, but it can be very hard to keep track of the numbers in complicated regular expressions. Furthermore, if an expression is modified, the numbers may change. To help with this difficulty, GRegex supports the naming of subpatterns. A subpattern can be named in one of three ways: (?&lt;name&gt;...) or (?'name'...) as in Perl, or (?P&lt;name&gt;...) as in Python. References to capturing parentheses from other parts of the pattern, such as backreferences, recursion, and conditions, can be made by name as well as by number.</para>
<para>Names consist of up to 32 alphanumeric characters and underscores. Named capturing parentheses are still allocated numbers as well as names, exactly as if the names were not present. By default, a name must be unique within a pattern, but it is possible to relax this constraint by setting the <varname>G_REGEX_DUPNAMES</varname> option at compile time. This can be useful for patterns where only one instance of the named parentheses can match. Suppose you want to match the name of a weekday, either as a 3-letter abbreviation or as the full name, and in both cases you want to extract the abbreviation. This pattern (ignoring the line breaks) does the job:</para>
<programlisting>(?&lt;DN&gt;Mon|Fri|Sun)(?:day)?|
(?&lt;DN&gt;Tue)(?:sday)?|
(?&lt;DN&gt;Wed)(?:nesday)?|
(?&lt;DN&gt;Thu)(?:rsday)?|
(?&lt;DN&gt;Sat)(?:urday)?</programlisting>
<para>There are five capturing substrings, but only one is ever set after a match. The function for extracting the data by name returns the substring for the first (and in this example, the only) subpattern of that name that matched. This saves searching to find which numbered subpattern it was.
If you make a reference to a non-unique named subpattern from elsewhere in the pattern, the one that corresponds to the lowest number is used.</para>
</refsect1>
<refsect1>
<title>Repetition</title>
<para>Repetition is specified by quantifiers, which can follow any of the following items:</para>
<itemizedlist>
<listitem><para>a literal data character</para></listitem>
<listitem><para>the dot metacharacter</para></listitem>
<listitem><para>the \C escape sequence</para></listitem>
<listitem><para>the \X escape sequence (in UTF-8 mode)</para></listitem>
<listitem><para>the \R escape sequence</para></listitem>
<listitem><para>an escape such as \d that matches a single character</para></listitem>
<listitem><para>a character class</para></listitem>
<listitem><para>a back reference (see next section)</para></listitem>
<listitem><para>a parenthesized subpattern (unless it is an assertion)</para></listitem>
</itemizedlist>
<para>The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example:</para>
<programlisting>z{2,4}</programlisting>
<para>matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus</para>
<programlisting>[aeiou]{3,}</programlisting>
<para>matches at least 3 successive vowels, but may match many more, while</para>
<programlisting>\d{8}</programlisting>
<para>matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character.
For example, {,6} is not a quantifier, but a literal string of four characters.</para>
<para>In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of which is represented by a two-byte sequence. Similarly, \X{3} matches three Unicode extended sequences, each of which may be several bytes long (and they may be of different lengths).</para>
<para>The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present.</para>
<para>For convenience, the three most common quantifiers have single-character abbreviations:</para>
<table frame="all" colsep="1" rowsep="1">
<title>Abbreviations for quantifiers</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
 <row> <entry>Abbreviation</entry> <entry>Meaning</entry> </row>
</thead>
<tbody>
 <row> <entry>*</entry> <entry>is equivalent to {0,}</entry> </row>
 <row> <entry>+</entry> <entry>is equivalent to {1,}</entry> </row>
 <row> <entry>?</entry> <entry>is equivalent to {0,1}</entry> </row>
</tbody>
</tgroup>
</table>
<para>It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example:</para>
<programlisting>(a?)*</programlisting>
<para>Because there are cases where this can be useful, such patterns are accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken.</para>
<para>By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between /* and */, and within the comment individual * and / characters may appear.
An attempt to match C comments by applying the pattern</para>
<programlisting>/\*.*\*/</programlisting>
<para>to the string</para>
<programlisting>/* first comment */ not comment /* second comment */</programlisting>
<para>fails, because it matches the entire string owing to the greediness of the .* item.</para>
<para>However, if a quantifier is followed by a question mark, it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern</para>
<programlisting>/\*.*?\*/</programlisting>
<para>does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in</para>
<programlisting>\d??\d</programlisting>
<para>which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.</para>
<para>If the <varname>G_REGEX_UNGREEDY</varname> flag is set, the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour.</para>
<para>When a parenthesized subpattern is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more memory is required for the compiled pattern, in proportion to the size of the minimum or maximum.</para>
<para>If a pattern starts with .* or .{0,} and the <varname>G_REGEX_DOTALL</varname> flag is set, thus allowing the dot to match newlines, the pattern is implicitly anchored, because whatever follows will be tried against every character position in the string, so there is no point in retrying the overall match at any position after the first. GRegex normally treats such a pattern as though it were preceded by \A.</para>
<para>
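The greedy and non-greedy behaviour described earlier in this section can be checked with a short script. This is only a sketch under a stated assumption: it uses Python's standard re module as a stand-in rather than GRegex itself, on the understanding that both engines treat these particular patterns (greedy .*, non-greedy .*?, and the doubled ??) in the same way. The variable names are illustrative and do not come from GLib.

```python
import re

# The subject string from the C comment example.
subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* takes as much as it can, so a single match runs from the
# first "/*" to the last "*/".
greedy = re.compile(r"/\*.*\*/")

# Non-greedy: .*? takes as little as it can, so each comment is matched
# separately.
lazy = re.compile(r"/\*.*?\*/")

greedy_matches = greedy.findall(subject)
lazy_matches = lazy.findall(subject)

# \d??\d prefers one digit, but can take two when the rest of the match
# requires it (here, fullmatch forces the whole string to be consumed).
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)

print(greedy_matches)
print(lazy_matches)
print(one_digit, two_digits)
```

With GRegex the same comparison would be made by compiling each pattern and iterating over the matches; only the pattern strings are expected to carry over unchanged.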