<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link type="text/css" rel="stylesheet" href="style.css">
<!-- Generated by The Open Group's rhtm tool v1.2.1 -->
<!-- Copyright (c) 2001-2003 The Open Group, All Rights Reserved -->
<title>Regular Expressions</title>
</head>
<body bgcolor="white">
<basefont size="3"> <!--header start-->
<center><font size="2">The Open Group Base Specifications Issue 6<br>
IEEE Std 1003.1, 2003 Edition<br>
Copyright © 2001-2003 The IEEE and The Open Group, All Rights reserved.</font></center>
<!--header end-->
<hr size="2" noshade>
<h2><a name="tag_09"></a>Regular Expressions</h2>
<p>Regular Expressions (REs) provide a mechanism to select specific strings from a set of character strings.</p>
<p>Regular expressions are a context-independent syntax that can represent a wide variety of character sets and character set orderings, where these character sets are interpreted according to the current locale. While many regular expressions can be interpreted differently depending on the current locale, many features, such as character class expressions, provide for contextual invariance across locales.</p>
<p>The Basic Regular Expression (BRE) notation and construction rules in <a href="#tag_09_03">Basic Regular Expressions</a> shall apply to most utilities supporting regular expressions. Some utilities, instead, support the Extended Regular Expressions (ERE) described in <a href="#tag_09_04">Extended Regular Expressions</a>; any exceptions for both cases are noted in the descriptions of the specific utilities using regular expressions.
Both BREs and EREs are supported by the Regular Expression Matching interface in the System Interfaces volume of IEEE Std 1003.1-2001 under <a href="../functions/regcomp.html"><i>regcomp</i>()</a>, <a href="../functions/regexec.html"><i>regexec</i>()</a>, and related functions.</p>
<h3><a name="tag_09_01"></a>Regular Expression Definitions</h3>
<p>For the purposes of this section, the following definitions shall apply:</p>
<h4><a name="tag_09_01_01"></a>entire regular expression</h4>
<p>The concatenated set of one or more BREs or EREs that make up the pattern specified for string selection.</p>
<h4><a name="tag_09_01_02"></a>matched</h4>
<p>A sequence of zero or more characters shall be said to be matched by a BRE or ERE when the characters in the sequence correspond to a sequence of characters defined by the pattern.</p>
<p>Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.</p>
<p>The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where "first" is defined to mean "begins earliest in the string". If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched.
For example, the BRE <tt>"bb*"</tt> matches the second to fourth characters of the string <tt>"abbbc"</tt>, and the ERE <tt>"(wee|week)(knights|night)"</tt> matches all ten characters of the string <tt>"weeknights"</tt>.</p>
<p>Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, matching the BRE <tt>"\(.*\).*"</tt> against <tt>"abcdef"</tt>, the subexpression <tt>"(\1)"</tt> is <tt>"abcdef"</tt>, and matching the BRE <tt>"\(a*\)*"</tt> against <tt>"bc"</tt>, the subexpression <tt>"(\1)"</tt> is the null string.</p>
<p>When a multi-character collating element in a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>) is involved, the longest sequence shall be measured in characters consumed from the string to be matched; that is, the collating element counts not as one element, but as the number of characters it matches.</p>
<h4><a name="tag_09_01_03"></a>BRE (ERE) matching a single character</h4>
<p>A BRE or ERE that shall match either a single character or a single collating element.</p>
<p>Only a BRE or ERE of this type that includes a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>) can match a collating element.</p>
<h4><a name="tag_09_01_04"></a>BRE (ERE) matching multiple characters</h4>
<p>A BRE or ERE that shall match a concatenation of single characters or collating elements.</p>
<p>Such a BRE or ERE is made up from a BRE (ERE) matching a single character and BRE (ERE) special characters.</p>
<h4><a name="tag_09_01_05"></a>invalid</h4>
<p>This section uses the term "invalid" for certain constructs or conditions. Invalid REs shall cause the utility or function using the RE to generate an error condition.
When invalid is not used, violations of the specified syntax or semantics for REs produce undefined results: this may entail an error, enabling an extended syntax for that RE, or using the construct in error as literal characters to be matched. For example, the BRE construct <tt>"\{1,2,3\}"</tt> does not comply with the grammar. A conforming application cannot rely on it producing an error nor matching the literal characters <tt>"\{1,2,3\}"</tt>.</p>
<h3><a name="tag_09_02"></a>Regular Expression General Requirements</h3>
<p>The requirements in this section shall apply to both basic and extended regular expressions.</p>
<p>The use of regular expressions is generally associated with text processing. REs (BREs and EREs) operate on text strings; that is, zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions limit the processing to lines; that is, zero or more characters followed by a &lt;newline&gt;. In the regular expression processing described in IEEE Std 1003.1-2001, the &lt;newline&gt; is regarded as an ordinary character and both a period and a non-matching list can match one. The Shell and Utilities volume of IEEE Std 1003.1-2001 specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of &lt;newline&gt;s; if not stated otherwise, the use of literal &lt;newline&gt;s or any escape sequence equivalent produces undefined results. Those utilities (like <a href="../utilities/grep.html"><i>grep</i></a>) that do not allow &lt;newline&gt;s to match are responsible for eliminating any &lt;newline&gt; from strings before matching against the RE.
The <a href="../functions/regcomp.html"><i>regcomp</i>()</a> function in the System Interfaces volume of IEEE Std 1003.1-2001, however, can provide support for such processing without violating the rules of this section.</p>
<p>The interfaces specified in IEEE Std 1003.1-2001 do not permit the inclusion of a NUL character in an RE or in the string to be matched. If during the operation of a standard utility a NUL is included in the text designated to be matched, that NUL may designate the end of the text string for the purposes of matching.</p>
<p>When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without regard to the case (uppercase or lowercase) of either data or patterns, then when each character in the string is matched against the pattern, not only the character, but also its case counterpart (if any), shall be matched. This definition of case-insensitive processing is intended to allow matching of multi-character collating elements as well as characters, as each character in the string is matched using both its cases. For example, in a locale where <tt>"Ch"</tt> is a multi-character collating element and where a matching list expression matches such elements, the RE <tt>"[[.Ch.]]"</tt> when matched against the string <tt>"char"</tt> is in reality matched against <tt>"ch"</tt>, <tt>"Ch"</tt>, <tt>"cH"</tt>, and <tt>"CH"</tt>.</p>
<p>The implementation shall support any regular expression that does not exceed 256 bytes in length.</p>
<h3><a name="tag_09_03"></a>Basic Regular Expressions</h3>
<h4><a name="tag_09_03_01"></a>BREs Matching a Single Character or Collating Element</h4>
<p>A BRE ordinary character, a special character preceded by a backslash, or a period shall match a single character.
A bracket expression shall match a single character or a single collating element.</p>
<h4><a name="tag_09_03_02"></a>BRE Ordinary Characters</h4>
<p>An ordinary character is a BRE that matches itself: any character in the supported character set, except for the BRE special characters listed in <a href="#tag_09_03_03">BRE Special Characters</a>.</p>
<p>The interpretation of an ordinary character preceded by a backslash (<tt>'\'</tt>) is undefined, except for:</p>
<ul>
<li><p>The characters <tt>')'</tt>, <tt>'('</tt>, <tt>'{'</tt>, and <tt>'}'</tt></p></li>
<li><p>The digits 1 to 9 inclusive (see <a href="#tag_09_03_06">BREs Matching Multiple Characters</a>)</p></li>
<li><p>A character inside a bracket expression</p></li>
</ul>
<h4><a name="tag_09_03_03"></a>BRE Special Characters</h4>
<p>A BRE special character has special properties in certain contexts. Outside those contexts, or when preceded by a backslash, such a character is a BRE that matches the special character itself. The BRE special characters and the contexts in which they have their special meaning are as follows:</p>
<dl compact>
<dt><tt>.[\</tt></dt>
<dd>The period, left-bracket, and backslash shall be special except when used in a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>).
An expression containing a <tt>'['</tt> that is not preceded by a backslash and is not part of a bracket expression produces undefined results.</dd>
<dt><tt>*</tt></dt>
<dd>The asterisk shall be special except when used: <ul>
<li><p>In a bracket expression</p></li>
<li><p>As the first character of an entire BRE (after an initial <tt>'^'</tt>, if any)</p></li>
<li><p>As the first character of a subexpression (after an initial <tt>'^'</tt>, if any); see <a href="#tag_09_03_06">BREs Matching Multiple Characters</a></p></li>
</ul></dd>
<dt><tt>^</tt></dt>
<dd>The circumflex shall be special when used as: <ul>
<li><p>An anchor (see <a href="#tag_09_03_08">BRE Expression Anchoring</a>)</p></li>
<li><p>The first character of a bracket expression (see <a href="#tag_09_03_05">RE Bracket Expression</a>)</p></li>
</ul></dd>
<dt><tt>$</tt></dt>
<dd>The dollar sign shall be special when used as an anchor.</dd>
</dl>
<h4><a name="tag_09_03_04"></a>Periods in BREs</h4>
<p>A period (<tt>'.'</tt>), when used outside a bracket expression, is a BRE that shall match any character in the supported character set except NUL.</p>
<h4><a name="tag_09_03_05"></a>RE Bracket Expression</h4>
<p>A bracket expression (an expression enclosed in square brackets, <tt>"[]"</tt>) is an RE that shall match a single collating element contained in the non-empty set of collating elements represented by the bracket expression.</p>
<p>The following rules and definitions apply to bracket expressions:</p>
<ol>
<li><p>A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more expressions: collating elements, collating symbols, equivalence classes, character classes, or range expressions. The right-bracket (<tt>']'</tt>) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial circumflex (<tt>'^'</tt>), if any).
Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as <tt>"[.].]"</tt>) or is the ending right-bracket for a collating symbol, equivalence class, or character class. The special characters <tt>'.'</tt>, <tt>'*'</tt>, <tt>'['</tt>, and <tt>'\'</tt> (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.</p>
<p>The character sequences <tt>"[."</tt>, <tt>"[="</tt>, and <tt>"[:"</tt> (left-bracket followed by a period, equals-sign, or colon) shall be special inside a bracket expression and are used to delimit collating symbols, equivalence class expressions, and character class expressions. These symbols shall be followed by a valid expression and the matching terminating sequence <tt>".]"</tt>, <tt>"=]"</tt>, or <tt>":]"</tt>, as described in the following items.</p></li>
<li><p>A matching list expression specifies a list that shall match any single-character collating element in any of the expressions represented in the list. The first character in the list shall not be the circumflex; for example, <tt>"[abc]"</tt> is an RE that matches any of the characters <tt>'a'</tt>, <tt>'b'</tt>, or <tt>'c'</tt>. It is unspecified whether a matching list expression matches a multi-character collating element that is matched by one of the expressions.</p></li>
<li><p>A non-matching list expression begins with a circumflex (<tt>'^'</tt>), and specifies a list that shall match any single-character collating element except for the expressions represented in the list after the leading circumflex. For example, <tt>"[^abc]"</tt> is an RE that matches any character except the characters <tt>'a'</tt>, <tt>'b'</tt>, or <tt>'c'</tt>. It is unspecified whether a non-matching list expression matches a multi-character collating element that is not matched by any of the expressions.
The circumflex shall have this special meaning only when it occurs first in the list, immediately following the left-bracket.</p></li>
<li><p>A collating symbol is a collating element enclosed within bracket-period (<tt>"[."</tt> and <tt>".]"</tt>) delimiters. Collating elements are defined as described in <a href="xbd_chap07.html#tag_07_03_02_04"><i>Collation Order</i></a>. Conforming applications shall represent multi-character collating elements as collating symbols when it is necessary to distinguish them from a list of the individual characters that make up the multi-character collating element. For example, if the string <tt>"ch"</tt> is a collating element defined using the line:</p>
<blockquote><pre><tt>collating-element &lt;ch-digraph&gt; from "&lt;c&gt;&lt;h&gt;"</tt></pre></blockquote>
<p>in the locale definition, the expression <tt>"[[.ch.]]"</tt> shall be treated as an RE containing the collating symbol <tt>'ch'</tt>, while <tt>"[ch]"</tt> shall be treated as an RE matching <tt>'c'</tt> or <tt>'h'</tt>. Collating symbols are recognized only inside bracket expressions. If the string is not a collating element in the current locale, the expression is invalid.</p></li>
<li><p>An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in <a href="xbd_chap07.html#tag_07_03_02_04"><i>Collation Order</i></a>. Only primary equivalence classes shall be recognized. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal (<tt>"[="</tt> and <tt>"=]"</tt>) delimiters. For example, if <tt>'a'</tt>, <tt>'à'</tt>, and <tt>'â'</tt> belong to the same equivalence class, then <tt>"[[=a=]b]"</tt>, <tt>"[[=à=]b]"</tt>, and <tt>"[[=â=]b]"</tt> are each equivalent to <tt>"[aàâb]"</tt>.
If the collating element does not belong to an equivalence class, the equivalence class expression shall be treated as a collating symbol.</p></li>
<li><p>A character class expression shall represent the union of two sets:</p>
<ol type="a">