<tr><td align="center">Z</td><td>Separator</td></tr><tr><td align="center">Zl</td><td>Line separator</td></tr><tr><td align="center">Zp</td><td>Paragraph separator</td></tr><tr><td align="center">Zs</td><td>Space separator</td></tr></tbody></table></div></div><br class="table-break"><p>The special property L& is also supported: it matches a character that has the Lu, Ll, or Lt property, in other words, a letter that is not classified as a modifier or "other".</p><p>The long synonyms for these properties that Perl supports (such as \p{Letter}) are not supported by GRegex, nor is it permitted to prefix any of these properties with "Is".</p><p>No character that is in the Unicode table has the Cn (unassigned) property. Instead, this property is assumed for any code point that is not in the Unicode table.</p><p>Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.</p><p>The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to</p><pre class="programlisting">(?>\PM\pM*)</pre><p>That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.</p><p>Matching characters by Unicode property is not fast, because GRegex has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties.</p></div><hr><div class="refsect2" lang="en"><a name="id2814777"></a><h3>Simple assertions</h3><p>The final use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the string. 
The use of subpatterns for more complicated assertions is described below. The backslashed assertions are:</p><div class="table"><a name="id2814790"></a><p class="title"><b>Table 8. Simple assertions</b></p><div class="table-contents"><table summary="Simple assertions" border="1"><colgroup><col align="center"><col></colgroup><thead><tr><th align="center">Escape</th><th>Meaning</th></tr></thead><tbody><tr><td align="center">\b</td><td>matches at a word boundary</td></tr><tr><td align="center">\B</td><td>matches when not at a word boundary</td></tr><tr><td align="center">\A</td><td>matches at the start of the string</td></tr><tr><td align="center">\Z</td><td>matches at the end of the string or before a newline at the end of the string</td></tr><tr><td align="center">\z</td><td>matches only at the end of the string</td></tr><tr><td align="center">\G</td><td>matches at the first matching position in the string</td></tr></tbody></table></div></div><br class="table-break"><p>These assertions may not appear in character classes (but note that \b has a different meaning, namely the backspace character, inside a character class).</p><p>A word boundary is a position in the string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.</p><p>The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described in the next section) in that they only ever match at the very start and end of the string, whatever options are set. Thus, they are independent of multiline mode. 
These three assertions are not affected by the <code class="varname">G_REGEX_MATCH_NOTBOL</code> or <code class="varname">G_REGEX_MATCH_NOTEOL</code> options, which affect only the behaviour of the circumflex and dollar metacharacters. However, if the start_position argument of a matching function is non-zero, indicating that matching is to start at a point other than the beginning of the string, \A can never match. The difference between \Z and \z is that \Z matches before a newline at the end of the string as well as at the very end, whereas \z matches only at the end.</p><p>The \G assertion is true only when the current matching position is at the start point of the match, as specified by the start_position argument to the matching functions. It differs from \A when the value of start_position is non-zero.</p><p>Note, however, that the interpretation of \G, as the start of the current match, is subtly different from Perl’s, which defines it as the end of the previous match. In Perl, these can be different when the previously matched string was empty.</p><p>If all the alternatives of a pattern begin with \G, the expression is anchored to the starting match position, and the "anchored" flag is set in the compiled regular expression.</p></div></div><div class="refsect1" lang="en"><a name="id2814955"></a><h2>Circumflex and dollar</h2><p>Outside a character class, in the default matching mode, the circumflex character is an assertion that is true only if the current matching point is at the start of the string. If the start_position argument to the matching functions is non-zero, circumflex can never match if the <code class="varname">G_REGEX_MULTILINE</code> option is unset. Inside a character class, circumflex has an entirely different meaning (see below).</p><p>Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. 
If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the string, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)</p><p>A dollar character is an assertion that is true only if the current matching point is at the end of the string, or immediately before a newline at the end of the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.</p><p>The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the <code class="varname">G_REGEX_DOLLAR_ENDONLY</code> option at compile time. This does not affect the \Z assertion.</p><p>The meanings of the circumflex and dollar characters are changed if the <code class="varname">G_REGEX_MULTILINE</code> option is set. When this is the case, a circumflex matches immediately after internal newlines as well as at the start of the string. It does not match after a newline that ends the string. A dollar matches before any newlines in the string, as well as at the very end, when <code class="varname">G_REGEX_MULTILINE</code> is set. When newline is specified as the two-character sequence CRLF, isolated CR and LF characters do not indicate newlines.</p><p>For example, the pattern /^abc$/ matches the string "def\nabc" (where \n represents a newline) in multiline mode, but not otherwise. Consequently, patterns that are anchored in single line mode because all branches start with ^ are not anchored in multiline mode, and a match for circumflex is possible when the <code class="varname">start_position</code> argument of a matching function is non-zero. 
The <code class="varname">G_REGEX_DOLLAR_ENDONLY</code> option is ignored if <code class="varname">G_REGEX_MULTILINE</code> is set.</p><p>Note that the sequences \A, \Z, and \z can be used to match the start and end of the string in both modes, and if all branches of a pattern start with \A it is always anchored, whether or not <code class="varname">G_REGEX_MULTILINE</code> is set.</p></div><div class="refsect1" lang="en"><a name="id2815052"></a><h2>Full stop (period, dot)</h2><p>Outside a character class, a dot in the pattern matches any one character in the string, including a non-printing character, but not (by default) newline. In UTF-8 a character might be more than one byte long.</p><p>When a line ending is defined as a single character, dot never matches that character; when the two-character sequence CRLF is used, dot does not match CR if it is immediately followed by LF, but otherwise it matches all characters (including isolated CRs and LFs). When any Unicode line endings are being recognized, dot does not match CR or LF or any of the other line ending characters.</p><p>The behaviour of dot with regard to newlines can be changed. If the <code class="varname">G_REGEX_DOTALL</code> option is set, a dot matches any one character, without exception. If newline is defined as the two-character sequence CRLF, it takes two dots to match it.</p><p>The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newlines. 
Dot has no special meaning in a character class.</p></div><div class="refsect1" lang="en"><a name="id2815103"></a><h2>Matching a single byte</h2><p>Outside a character class, the escape sequence \C matches any one byte, both in and out of UTF-8 mode. Unlike a dot, it always matches any line ending characters. The feature is provided in Perl in order to match individual bytes in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, what remains in the string may be a malformed UTF-8 string. For this reason, the \C escape sequence is best avoided.</p><p>GRegex does not allow \C to appear in lookbehind assertions (described below), because in UTF-8 mode this would make it impossible to calculate the length of the lookbehind.</p></div><div class="refsect1" lang="en"><a name="id2815126"></a><h2>Square brackets and character classes</h2><p>An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.</p><p>A character class matches a single character in the string. A matched character must be in the set of characters defined by the class, unless the first character in the class definition is a circumflex, in which case the string character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.</p><p>For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters that are in the class by enumerating those that are not. 
A class that starts with a circumflex is not an assertion: it still consumes a character from the string, and therefore it fails if the current pointer is at the end of the string.</p><p>In UTF-8 mode, characters with values greater than 255 can be included in a class as a literal string of bytes, or by using the \x{ escaping mechanism.</p><p>When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would.</p><p>Characters that might indicate line breaks are never treated in any special way when matching character classes, whatever line-ending sequence is in use, and whatever setting of the <code class="varname">G_REGEX_DOTALL</code> and <code class="varname">G_REGEX_MULTILINE</code> options is used. A class such as [^a] always matches one of these characters.</p><p>The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.</p><p>It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by a literal string "46]", so it would match "W46]" or "-46]". However, if the "]" is escaped with a backslash it is interpreted as the end of range, so [W-\]46] is interpreted as a class containing a range followed by two other characters. The octal or hexadecimal representation of "]" can also be used to end a range.</p><p>Ranges operate in the collating sequence of character values. They can also be used for characters specified numerically, for example [\000-\037]. 
In UTF-8 mode, ranges can include characters whose values are greater than 255, for example [\x{100}-\x{2ff}].</p><p>The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore.</p><p>The only metacharacters that are recognized in character classes are backslash, hyphen (only where it can be interpreted as specifying a range), circumflex (only at the start), opening square bracket (only when it can be interpreted as introducing a POSIX class name - see the next section), and the terminating closing square bracket. However, escaping other non-alphanumeric characters does no harm.</p></div><div class="refsect1" lang="en"><a name="id2815241"></a><h2>POSIX character classes</h2><p>GRegex supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. For example,</p><pre class="programlisting">[01[:alpha:]%]</pre><p>matches "0", "1", any alphabetic character, or "%". The supported class names are</p><div class="table"><a name="id2815262"></a><p class="title"><b>Table 9. POSIX classes</b></p><div class="table-contents"><table summary="Posix classes" border="1"><colgroup><col align="center"><col></colgroup><thead><tr><th align="center">Name</th><th>Meaning</th></tr></thead><tbody><tr><td align="center">alnum</td><td>letters and digits</td></tr><tr><td align="center">alpha</td><td>letters</td></tr><tr><td align="center">ascii</td>