📄 glib-regex-syntax.html

<td>character codes 0 - 127</td></tr><tr><td align="center">blank</td><td>space or tab only</td></tr><tr><td align="center">cntrl</td><td>control characters</td></tr><tr><td align="center">digit</td><td>decimal digits (same as \d)</td></tr><tr><td align="center">graph</td><td>printing characters, excluding space</td></tr><tr><td align="center">lower</td><td>lower case letters</td></tr><tr><td align="center">print</td><td>printing characters, including space</td></tr><tr><td align="center">punct</td><td>printing characters, excluding letters and digits</td></tr><tr><td align="center">space</td><td>white space (not quite the same as \s)</td></tr><tr><td align="center">upper</td><td>upper case letters</td></tr><tr><td align="center">word</td><td>"word" characters (same as \w)</td></tr><tr><td align="center">xdigit</td><td>hexadecimal digits</td></tr></tbody></table></div></div><br class="table-break"><p>The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and space (32). Notice that this list includes the VT character (code 11). This makes "space" different from \s, which does not include VT (for Perl compatibility).</p><p>The name "word" is a Perl extension, and "blank" is a GNU extension. Another Perl extension is negation, which is indicated by a ^ character after the colon. For example,</p><pre class="programlisting">[12[:^digit:]]</pre><p>matches "1", "2", or any non-digit. GRegex also recognizes the POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not supported, and an error is given if they are encountered.</p><p>In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes.</p></div><div class="refsect1" lang="en"><a name="id2815472"></a><h2>Vertical bar</h2><p>Vertical bar characters are used to separate alternative patterns. For example, the pattern</p><pre class="programlisting">gilbert|sullivan</pre><p>matches either "gilbert" or "sullivan". 
Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.</p></div><div class="refsect1" lang="en"><a name="id2815499"></a><h2>Internal option setting</h2><p>The settings of the <code class="varname">G_REGEX_CASELESS</code>, <code class="varname">G_REGEX_MULTILINE</code>, <code class="varname">G_REGEX_DOTALL</code>, and <code class="varname">G_REGEX_EXTENDED</code> options can be changed from within the pattern by a sequence of Perl-style option letters enclosed between "(?" and ")". The option letters are</p><div class="table"><a name="id2815523"></a><p class="title"><b>Table&#160;10.&#160;Option settings</b></p><div class="table-contents"><table summary="Option settings" border="1"><colgroup><col align="center"><col></colgroup><thead><tr><th align="center">Option</th><th>Flag</th></tr></thead><tbody><tr><td align="center">i</td><td><code class="varname">G_REGEX_CASELESS</code></td></tr><tr><td align="center">m</td><td><code class="varname">G_REGEX_MULTILINE</code></td></tr><tr><td align="center">s</td><td><code class="varname">G_REGEX_DOTALL</code></td></tr><tr><td align="center">x</td><td><code class="varname">G_REGEX_EXTENDED</code></td></tr></tbody></table></div></div><br class="table-break"><p>For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and a combined setting and unsetting such as (?im-sx), which sets <code class="varname">G_REGEX_CASELESS</code> and <code class="varname">G_REGEX_MULTILINE</code> while unsetting <code class="varname">G_REGEX_DOTALL</code> and <code class="varname">G_REGEX_EXTENDED</code>, is also permitted. 
If a letter appears both before and after the hyphen, the option is unset.</p><p>When an option change occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern that follows.</p><p>An option change within a subpattern (see below for a description of subpatterns) affects only that part of the current pattern that follows it, so</p><pre class="programlisting">(a(?i)b)c</pre><p>matches abc and aBc and no other strings (assuming <code class="varname">G_REGEX_CASELESS</code> is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example,</p><pre class="programlisting">(a(?i)b|c)</pre><p>matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise.</p><p>The options <code class="varname">G_REGEX_UNGREEDY</code> and <code class="varname">G_REGEX_EXTRA</code> and <code class="varname">G_REGEX_DUPNAMES</code> can be changed in the same way as the Perl-compatible options by using the characters U, X and J respectively.</p></div><div class="refsect1" lang="en"><a name="id2815683"></a><h2>Subpatterns</h2><p>Subpatterns are delimited by parentheses (round brackets), which can be nested. Turning part of a pattern into a subpattern does two things:</p><div class="itemizedlist"><ul type="disc"><li><p>It localizes a set of alternatives. For example, the pattern cat(aract|erpillar|) matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or an empty string.</p></li><li><p>It sets up the subpattern as a capturing subpattern. 
This means that, when the whole pattern matches, that portion of the string that matched the subpattern can be obtained using <code class="function">g_regex_fetch()</code>. Opening parentheses are counted from left to right (starting from 1, as subpattern 0 is the whole matched string) to obtain numbers for the capturing subpatterns.</p></li></ul></div><p>For example, if the string "the red king" is matched against the pattern</p><pre class="programlisting">the ((red|white) (king|queen))</pre><p>the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.</p><p>The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by a question mark and a colon, the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern</p><pre class="programlisting">the ((?:red|white) (king|queen))</pre><p>the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of capturing subpatterns is 65535.</p><p>As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns</p><pre class="programlisting">(?i:saturday|sunday)
(?:(?i)saturday|sunday)</pre><p>match exactly the same set of strings. 
Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".</p></div><div class="refsect1" lang="en"><a name="id2815781"></a><h2>Named subpatterns</h2><p>Identifying capturing parentheses by number is simple, but it can be very hard to keep track of the numbers in complicated regular expressions. Furthermore, if an expression is modified, the numbers may change. To help with this difficulty, GRegex supports the naming of subpatterns. A subpattern can be named in one of three ways: (?&lt;name&gt;...) or (?'name'...) as in Perl, or (?P&lt;name&gt;...) as in Python. References to capturing parentheses from other parts of the pattern, such as backreferences, recursion, and conditions, can be made by name as well as by number.</p><p>Names consist of up to 32 alphanumeric characters and underscores. Named capturing parentheses are still allocated numbers as well as names, exactly as if the names were not present. By default, a name must be unique within a pattern, but it is possible to relax this constraint by setting the <code class="varname">G_REGEX_DUPNAMES</code> option at compile time. This can be useful for patterns where only one instance of the named parentheses can match. Suppose you want to match the name of a weekday, either as a 3-letter abbreviation or as the full name, and in both cases you want to extract the abbreviation. This pattern (ignoring the line breaks) does the job:</p><pre class="programlisting">(?&lt;DN&gt;Mon|Fri|Sun)(?:day)?|
(?&lt;DN&gt;Tue)(?:sday)?|
(?&lt;DN&gt;Wed)(?:nesday)?|
(?&lt;DN&gt;Thu)(?:rsday)?|
(?&lt;DN&gt;Sat)(?:urday)?</pre><p>There are five capturing substrings, but only one is ever set after a match. The function for extracting the data by name returns the substring for the first (and in this example, the only) subpattern of that name that matched. 
This saves searching to find which numbered subpattern it was. If you make a reference to a non-unique named subpattern from elsewhere in the pattern, the one that corresponds to the lowest number is used.</p></div><div class="refsect1" lang="en"><a name="id2815838"></a><h2>Repetition</h2><p>Repetition is specified by quantifiers, which can follow any of the following items:</p><div class="itemizedlist"><ul type="disc"><li><p>a literal data character</p></li><li><p>the dot metacharacter</p></li><li><p>the \C escape sequence</p></li><li><p>the \X escape sequence (in UTF-8 mode)</p></li><li><p>the \R escape sequence</p></li><li><p>an escape such as \d that matches a single character</p></li><li><p>a character class</p></li><li><p>a back reference (see next section)</p></li><li><p>a parenthesized subpattern (unless it is an assertion)</p></li></ul></div><p>The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example:</p><pre class="programlisting">z{2,4}</pre><p>matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus</p><pre class="programlisting">[aeiou]{3,}</pre><p>matches at least 3 successive vowels, but may match many more, while</p><pre class="programlisting">\d{8}</pre><p>matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. 
For example, {,6} is not a quantifier, but a literal string of four characters.</p><p>In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of which is represented by a two-byte sequence. Similarly, \X{3} matches three Unicode extended sequences, each of which may be several bytes long (and they may be of different lengths).</p><p>The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present.</p><p>For convenience, the three most common quantifiers have single-character abbreviations:</p><div class="table"><a name="id2815962"></a><p class="title"><b>Table&#160;11.&#160;Abbreviations for quantifiers</b></p><div class="table-contents"><table summary="Abbreviations for quantifiers" border="1"><colgroup><col align="center"><col></colgroup><thead><tr><th align="center">Abbreviation</th><th>Meaning</th></tr></thead><tbody><tr><td align="center">*</td><td>is equivalent to {0,}</td></tr><tr><td align="center">+</td><td>is equivalent to {1,}</td></tr><tr><td align="center">?</td><td>is equivalent to {0,1}</td></tr></tbody></table></div></div><br class="table-break"><p>It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example:</p><pre class="programlisting">(a?)*</pre><p>Because there are cases where this can be useful, such patterns are accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken.
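<p>The quantifier forms described in this section can be tried out with a short script. The sketch below is not GRegex: it assumes Python's built-in <code>re</code> module, a different engine, used here only because it runs without GLib installed. It is restricted to constructs that behave the same way as described above ({min,max}, {min,}, {n}, the *, + and ? abbreviations, and the (a?)* case). The helper names <code>full_match</code> and <code>same_verdict</code> are made up for this example. The {,6} case is deliberately left out, because Python treats it as a quantifier, whereas the text above says GRegex treats it as a literal string.</p>

```python
import re


def full_match(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern (Python re, not GRegex)."""
    return re.fullmatch(pattern, text) is not None


def same_verdict(short: str, long: str, samples) -> bool:
    """Return True if both patterns accept and reject exactly the same samples."""
    return all(full_match(short, s) == full_match(long, s) for s in samples)


# {min,max}: z{2,4} accepts two to four z's and nothing else.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
           if full_match(r"z{2,4}", s)]

# {min,}: no upper limit. {n}: an exact count.
many_vowels = full_match(r"[aeiou]{3,}", "aeiouaeiou")
two_vowels = full_match(r"[aeiou]{3,}", "ae")
eight_digits = full_match(r"\d{8}", "20240131")
seven_digits = full_match(r"\d{8}", "2024013")

# The single-character abbreviations are shorthand for the brace forms.
samples = ("a", "ab", "abb", "abbb")
star_ok = same_verdict(r"ab*", r"ab{0,}", samples)
plus_ok = same_verdict(r"ab+", r"ab{1,}", samples)
question_ok = same_verdict(r"ab?", r"ab{0,1}", samples)

# (a?)* repeats a group that can match nothing; matching still terminates.
empty_loop_ok = full_match(r"(a?)*", "aaa") and full_match(r"(a?)*", "")

print(bounded)
```

<p>Whole-string matching is used so that each verdict depends only on the quantifier and not on where in the string a partial match might be found.</p>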