    <entry>any "word" character</entry>  </row>
  <row>    <entry>\W</entry>    <entry>any "non-word" character</entry>  </row>
</tbody></tgroup></table>
<para>Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.</para>
<para>These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the passed string, all of them fail, since there is no character to match.</para>
<para>For compatibility with Perl, \s does not match the VT character (code 11). This makes it different from the POSIX "space" class. The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).</para>
<para>A "word" character is an underscore or any character less than 256 that is a letter or digit.</para>
<para>Characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W.</para>
</refsect2>
<refsect2><title>Newline sequences</title>
<para>Outside a character class, the escape sequence \R matches any Unicode newline sequence. This particular group matches either the two-character sequence CR followed by LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), NEL (next line, U+0085), LS (line separator, U+2028), or PS (paragraph separator, U+2029). The two-character sequence is treated as a single unit that cannot be split.
Inside a character class, \R matches the letter "R".</para>
</refsect2>
<refsect2><title>Unicode character properties</title>
<para>To support generic character types there are three additional escape sequences:</para>
<table frame="all" colsep="1" rowsep="1"><title>Generic character types</title><tgroup cols="2"><colspec colnum="1" align="center"/><thead>  <row>    <entry>Escape</entry>    <entry>Meaning</entry>  </row></thead><tbody>
  <row>    <entry>\p{xx}</entry>    <entry>a character with the xx property</entry>  </row>
  <row>    <entry>\P{xx}</entry>    <entry>a character without the xx property</entry>  </row>
  <row>    <entry>\X</entry>    <entry>an extended Unicode sequence</entry>  </row>
</tbody></tgroup></table>
<para>The property names represented by xx above are limited to the Unicode script names, the general category properties, and "Any", which matches any character (including newline). Other properties such as "InMusicalSymbols" are not currently supported. Note that \P{Any} does not match any characters, so always causes a match failure.</para>
<para>Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example, \p{Greek} or \P{Han}.</para>
<para>Those that are not part of an identified script are lumped together as "Common".
The current list of scripts is:</para><itemizedlist><listitem><para>Arabic</para></listitem><listitem><para>Armenian</para></listitem><listitem><para>Balinese</para></listitem><listitem><para>Bengali</para></listitem><listitem><para>Bopomofo</para></listitem><listitem><para>Braille</para></listitem><listitem><para>Buginese</para></listitem><listitem><para>Buhid</para></listitem><listitem><para>Canadian_Aboriginal</para></listitem><listitem><para>Cherokee</para></listitem><listitem><para>Common</para></listitem><listitem><para>Coptic</para></listitem><listitem><para>Cuneiform</para></listitem><listitem><para>Cypriot</para></listitem><listitem><para>Cyrillic</para></listitem><listitem><para>Deseret</para></listitem><listitem><para>Devanagari</para></listitem><listitem><para>Ethiopic</para></listitem><listitem><para>Georgian</para></listitem><listitem><para>Glagolitic</para></listitem><listitem><para>Gothic</para></listitem><listitem><para>Greek</para></listitem><listitem><para>Gujarati</para></listitem><listitem><para>Gurmukhi</para></listitem><listitem><para>Han</para></listitem><listitem><para>Hangul</para></listitem><listitem><para>Hanunoo</para></listitem><listitem><para>Hebrew</para></listitem><listitem><para>Hiragana</para></listitem><listitem><para>Inherited</para></listitem><listitem><para>Kannada</para></listitem><listitem><para>Katakana</para></listitem><listitem><para>Kharoshthi</para></listitem><listitem><para>Khmer</para></listitem><listitem><para>Lao</para></listitem><listitem><para>Latin</para></listitem><listitem><para>Limbu</para></listitem><listitem><para>Linear_B</para></listitem><listitem><para>Malayalam</para></listitem><listitem><para>Mongolian</para></listitem><listitem><para>Myanmar</para></listitem><listitem><para>New_Tai_Lue</para></listitem><listitem><para>Nko</para></listitem><listitem><para>Ogham</para></listitem><listitem><para>Old_Italic</para></listitem><listitem><para>Old_Persian</para></listitem><listitem><para>Oriya</para></listitem>
<listitem><para>Osmanya</para></listitem><listitem><para>Phags_Pa</para></listitem><listitem><para>Phoenician</para></listitem><listitem><para>Runic</para></listitem><listitem><para>Shavian</para></listitem><listitem><para>Sinhala</para></listitem><listitem><para>Syloti_Nagri</para></listitem><listitem><para>Syriac</para></listitem><listitem><para>Tagalog</para></listitem><listitem><para>Tagbanwa</para></listitem><listitem><para>Tai_Le</para></listitem><listitem><para>Tamil</para></listitem><listitem><para>Telugu</para></listitem><listitem><para>Thaana</para></listitem><listitem><para>Thai</para></listitem><listitem><para>Tibetan</para></listitem><listitem><para>Tifinagh</para></listitem><listitem><para>Ugaritic</para></listitem><listitem><para>Yi</para></listitem></itemizedlist>
<para>Each character has exactly one general category property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.</para>
<para>If only one letter is specified with \p or \P, it includes all the general category properties that start with that letter.
In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:</para>
<programlisting>
\p{L}
\pL
</programlisting>
<para>The following general category property codes are supported:</para>
<table frame="all" colsep="1" rowsep="1"><title>Property codes</title><tgroup cols="2"><colspec colnum="1" align="center"/><thead>  <row>    <entry>Code</entry>    <entry>Meaning</entry>  </row></thead><tbody>
  <row>    <entry>C</entry>    <entry>Other</entry>  </row>
  <row>    <entry>Cc</entry>    <entry>Control</entry>  </row>
  <row>    <entry>Cf</entry>    <entry>Format</entry>  </row>
  <row>    <entry>Cn</entry>    <entry>Unassigned</entry>  </row>
  <row>    <entry>Co</entry>    <entry>Private use</entry>  </row>
  <row>    <entry>Cs</entry>    <entry>Surrogate</entry>  </row>
  <row>    <entry>L</entry>    <entry>Letter</entry>  </row>
  <row>    <entry>Ll</entry>    <entry>Lower case letter</entry>  </row>
  <row>    <entry>Lm</entry>    <entry>Modifier letter</entry>  </row>
  <row>    <entry>Lo</entry>    <entry>Other letter</entry>  </row>
  <row>    <entry>Lt</entry>    <entry>Title case letter</entry>  </row>
  <row>    <entry>Lu</entry>    <entry>Upper case letter</entry>  </row>
  <row>    <entry>M</entry>    <entry>Mark</entry>  </row>
  <row>    <entry>Mc</entry>    <entry>Spacing mark</entry>  </row>
  <row>    <entry>Me</entry>    <entry>Enclosing mark</entry>  </row>
  <row>    <entry>Mn</entry>    <entry>Non-spacing mark</entry>  </row>
  <row>    <entry>N</entry>    <entry>Number</entry>  </row>
  <row>    <entry>Nd</entry>    <entry>Decimal number</entry>  </row>
  <row>    <entry>Nl</entry>    <entry>Letter number</entry>  </row>
  <row>    <entry>No</entry>    <entry>Other number</entry>  </row>
  <row>    <entry>P</entry>    <entry>Punctuation</entry>  </row>
  <row>    <entry>Pc</entry>    <entry>Connector punctuation</entry>  </row>
  <row>    <entry>Pd</entry>    <entry>Dash punctuation</entry>  </row>
  <row>    <entry>Pe</entry>    <entry>Close punctuation</entry>  </row>
  <row>    <entry>Pf</entry>    <entry>Final punctuation</entry>  </row>
  <row>    <entry>Pi</entry>    <entry>Initial punctuation</entry>  </row>
  <row>    <entry>Po</entry>    <entry>Other punctuation</entry>  </row>
  <row>    <entry>Ps</entry>    <entry>Open punctuation</entry>  </row>
  <row>    <entry>S</entry>    <entry>Symbol</entry>  </row>
  <row>    <entry>Sc</entry>    <entry>Currency symbol</entry>  </row>
  <row>    <entry>Sk</entry>    <entry>Modifier symbol</entry>  </row>
  <row>    <entry>Sm</entry>    <entry>Mathematical symbol</entry>  </row>
  <row>    <entry>So</entry>    <entry>Other symbol</entry>  </row>
  <row>    <entry>Z</entry>    <entry>Separator</entry>  </row>
  <row>    <entry>Zl</entry>    <entry>Line separator</entry>  </row>
  <row>    <entry>Zp</entry>    <entry>Paragraph separator</entry>  </row>
  <row>    <entry>Zs</entry>    <entry>Space separator</entry>  </row>
</tbody></tgroup></table>
<para>The special property L&amp; is also supported: it matches a character that has the Lu, Ll, or Lt property, in other words, a letter that is not classified as a modifier or "other".</para>
<para>The long synonyms for these properties that Perl supports (such as \p{Letter}) are not supported by GRegex, nor is it permitted to prefix any of these properties with "Is".</para>
<para>No character that is in the Unicode table has the Cn (unassigned) property. Instead, this property is assumed for any code point that is not in the Unicode table.</para>
<para>Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.</para>
<para>The \X escape matches any number of Unicode characters that form an extended Unicode sequence.
\X is equivalent to</para>
<programlisting>(?&gt;\PM\pM*)</programlisting>
<para>That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.</para>
<para>Matching characters by Unicode property is not fast, because GRegex has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties.</para>
</refsect2>
<refsect2><title>Simple assertions</title>
<para>The final use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are:</para>
<table frame="all" colsep="1" rowsep="1"><title>Simple assertions</title><tgroup cols="2"><colspec colnum="1" align="center"/><thead>