<refentry id="glib-regex-syntax" revision="11 Jul 2006">
<refmeta>
<refentrytitle>Regular expression syntax</refentrytitle>
</refmeta>
<!--
Based on the man page for pcrepattern.
Remember to sync this document with the file docs/pcrepattern.3 in the
pcre package when upgrading to a newer version of pcre.
In sync with PCRE 7.0
-->
<refnamediv>
<refname>Regular expression syntax</refname>
<refpurpose>Syntax and semantics of the regular expressions supported by GRegex</refpurpose>
</refnamediv>
<refsect1>
<title>GRegex regular expression details</title>
<para>
A regular expression is a pattern that is matched against a
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the string. As a
trivial example, the pattern
</para>
<programlisting>The quick brown fox</programlisting>
<para>
matches a portion of a string that is identical to itself. When
caseless matching is specified (the <varname>G_REGEX_CASELESS</varname> flag), letters are
matched independently of case.
</para>
<para>
The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.
</para>
<para>
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those
that are recognized in square brackets.
Outside square brackets, the
metacharacters are as follows:
</para>
<table frame="all" colsep="1" rowsep="1">
<title>Metacharacters outside square brackets</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Character</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\</entry>
    <entry>general escape character with several uses</entry>
  </row>
  <row>
    <entry>^</entry>
    <entry>assert start of string (or line, in multiline mode)</entry>
  </row>
  <row>
    <entry>$</entry>
    <entry>assert end of string (or line, in multiline mode)</entry>
  </row>
  <row>
    <entry>.</entry>
    <entry>match any character except newline (by default)</entry>
  </row>
  <row>
    <entry>[</entry>
    <entry>start character class definition</entry>
  </row>
  <row>
    <entry>|</entry>
    <entry>start of alternative branch</entry>
  </row>
  <row>
    <entry>(</entry>
    <entry>start subpattern</entry>
  </row>
  <row>
    <entry>)</entry>
    <entry>end subpattern</entry>
  </row>
  <row>
    <entry>?</entry>
    <entry>extends the meaning of (, or 0/1 quantifier, or quantifier minimizer</entry>
  </row>
  <row>
    <entry>*</entry>
    <entry>0 or more quantifier</entry>
  </row>
  <row>
    <entry>+</entry>
    <entry>1 or more quantifier, also "possessive quantifier"</entry>
  </row>
  <row>
    <entry>{</entry>
    <entry>start min/max quantifier</entry>
  </row>
</tbody>
</tgroup>
</table>
<para>
Part of a pattern that is in square brackets is called a "character
class".
In a character class the only metacharacters are:
</para>
<table frame="all" colsep="1" rowsep="1">
<title>Metacharacters inside square brackets</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Character</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\</entry>
    <entry>general escape character</entry>
  </row>
  <row>
    <entry>^</entry>
    <entry>negate the class, but only if the first character</entry>
  </row>
  <row>
    <entry>-</entry>
    <entry>indicates character range</entry>
  </row>
  <row>
    <entry>[</entry>
    <entry>POSIX character class (only if followed by POSIX syntax)</entry>
  </row>
  <row>
    <entry>]</entry>
    <entry>terminates the character class</entry>
  </row>
</tbody>
</tgroup>
</table>
</refsect1>
<refsect1>
<title>Backslash</title>
<para>
The backslash character has several uses. Firstly, if it is followed by
a non-alphanumeric character, it takes away any special meaning that
character may have. This use of backslash as an escape character
applies both inside and outside character classes.
</para>
<para>
For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself.
In particular, if you want to match a
backslash, you write \\.
</para>
<para>
If a pattern is compiled with the <varname>G_REGEX_EXTENDED</varname>
option, whitespace in the pattern (other than in a character class) and
characters between a # outside a character class and the next newline
are ignored.
An escaping backslash can be used to include a whitespace or # character
as part of the pattern.
</para>
<para>
If you want to remove the special meaning from a sequence of characters,
you can do so by putting them between \Q and \E.
The \Q...\E sequence is recognized both inside and outside character
classes.
</para>
<refsect2>
<title>Non-printing characters</title>
<para>
A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters, apart from the binary zero that
terminates a pattern, but when a pattern is being prepared by text
editing, it is usually easier to use one of the following escape
sequences than the binary character it represents:
</para>
<table frame="all" colsep="1" rowsep="1">
<title>Non-printing characters</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\a</entry>
    <entry>alarm, that is, the BEL character (hex 07)</entry>
  </row>
  <row>
    <entry>\cx</entry>
    <entry>"control-x", where x is any character</entry>
  </row>
  <row>
    <entry>\e</entry>
    <entry>escape (hex 1B)</entry>
  </row>
  <row>
    <entry>\f</entry>
    <entry>formfeed (hex 0C)</entry>
  </row>
  <row>
    <entry>\n</entry>
    <entry>newline (hex 0A)</entry>
  </row>
  <row>
    <entry>\r</entry>
    <entry>carriage return (hex 0D)</entry>
  </row>
  <row>
    <entry>\t</entry>
    <entry>tab (hex 09)</entry>
  </row>
  <row>
    <entry>\ddd</entry>
    <entry>character with octal code ddd, or backreference</entry>
  </row>
  <row>
    <entry>\xhh</entry>
    <entry>character with hex code 
hh</entry>
  </row>
  <row>
    <entry>\x{hhh..}</entry>
    <entry>character with hex code hhh..</entry>
  </row>
</tbody>
</tgroup>
</table>
<para>
The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
becomes hex 7B.
</para>
<para>
After \x, from zero to two hexadecimal digits are read (letters can be
in upper or lower case). Any number of hexadecimal digits may appear
between \x{ and }, but the value of the character code
must be less than 2**31 (that is, the maximum hexadecimal value is
7FFFFFFF). If characters other than hexadecimal digits appear between
\x{ and }, or if there is no terminating }, this form of escape is not
recognized. Instead, the initial \x will be interpreted as a basic hexadecimal
escape, with no following digits, giving a character whose
value is zero.
</para>
<para>
Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x. There is no difference
in the way they are handled. For example, \xdc is exactly the same as
\x{dc}.
</para>
<para>
After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used.
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
character (code value 7). Make sure you supply two digits after the
initial zero if the pattern character that follows is itself an octal
digit.
</para>
<para>
The handling of a backslash followed by a digit other than 0 is complicated.
Outside a character class, GRegex reads it and any following digits as a
decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference.
A description of how this works is given later, following the discussion
of parenthesized subpatterns.
</para>
<para>
Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, GRegex re-reads
up to three octal digits following the backslash, and uses them to generate
a data character. Any subsequent digits stand for themselves. For example:
</para>
<table frame="all" colsep="1" rowsep="1">
<title>Octal escapes and back references</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\040</entry>
    <entry>is another way of writing a space</entry>
  </row>
  <row>
    <entry>\40</entry>
    <entry>is the same, provided there are fewer than 40 previous capturing subpatterns</entry>
  </row>
  <row>
    <entry>\7</entry>
    <entry>is always a back reference</entry>
  </row>
  <row>
    <entry>\11</entry>
    <entry>might be a back reference, or another way of writing a tab</entry>
  </row>
  <row>
    <entry>\011</entry>
    <entry>is always a tab</entry>
  </row>
  <row>
    <entry>\0113</entry>
    <entry>is a tab followed by the character "3"</entry>
  </row>
  <row>
    <entry>\113</entry>
    <entry>might be a back reference, otherwise the character with octal code 113</entry>
  </row>
  <row>
    <entry>\377</entry>
    <entry>might be a back reference, otherwise the byte consisting entirely of 1 bits</entry>
  </row>
  <row>
    <entry>\81</entry>
    <entry>is either a back reference, or a binary zero followed by the two characters "8" and "1"</entry>
  </row>
</tbody>
</tgroup>
</table>
<para>
Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.
</para>
<para>
All the sequences that define a single character can be used both inside
and outside character classes.
In addition, inside a character class, the
sequence \b is interpreted as the backspace character (hex 08), and the
sequences \R and \X are interpreted as the characters "R" and "X", respectively.
Outside a character class, these sequences have different meanings (see below).
</para>
</refsect2>
<refsect2>
<title>Absolute and relative back references</title>
<para>
The sequence \g followed by a positive or negative number, optionally enclosed
in braces, is an absolute or relative back reference. Back references are
discussed later, following the discussion of parenthesized subpatterns.
</para>
</refsect2>
<refsect2>
<title>Generic character types</title>
<para>
Another use of backslash is for specifying generic character types.
The following are always recognized:
</para>
<table frame="all" colsep="1" rowsep="1">
<title>Generic characters</title>
<tgroup cols="2">
<colspec colnum="1" align="center"/>
<thead>
  <row>
    <entry>Escape</entry>
    <entry>Meaning</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>\d</entry>
    <entry>any decimal digit</entry>
  </row>
  <row>
    <entry>\D</entry>
    <entry>any character that is not a decimal digit</entry>
  </row>
  <row>
    <entry>\s</entry>
    <entry>any whitespace character</entry>
  </row>
  <row>
    <entry>\S</entry>
    <entry>any character that is not a whitespace character</entry>
  </row>
  <row>
    <entry>\w</entry>