📄 xbd_chap09.html
字号:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><meta name="generator" content="HTML Tidy, see www.w3.org"><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link type="text/css" rel="stylesheet" href="style.css"><!-- Generated by The Open Group's rhtm tool v1.2.1 --><!-- Copyright (c) 2001-2003 The Open Group, All Rights Reserved --><title>Rationale</title></head><body><basefont size="3"> <center><font size="2">The Open Group Base Specifications Issue 6<br>IEEE Std 1003.1, 2003 Edition<br>Copyright © 2001-2003 The IEEE and The Open Group</font></center><hr size="2" noshade><h3><a name="tag_01_09"></a>Regular Expressions</h3><p>Rather than repeating the description of REs for each utility supporting REs, the standard developers preferred a common,comprehensive description of regular expressions in one place. The most common behavior is described here, and exceptions orextensions to this are documented for the respective utilities, as appropriate.</p><p>The BRE corresponds to the <a href="../utilities/ed.html"><i>ed</i></a> or historical <a href="../utilities/grep.html"><i>grep</i></a> type, and the ERE corresponds to the historical <i>egrep</i> type (now <a href="../utilities/grep.html"><i>grep</i></a> <b>-E</b>).</p><p>The text is based on the <a href="../utilities/ed.html"><i>ed</i></a> description and substantially modified, primarily to aiddevelopers and others in the understanding of the capabilities and limitations of REs. Much of this was influenced byinternationalization requirements.</p><p>It should be noted that the definitions in this section do not cover the <a href="../utilities/tr.html"><i>tr</i></a> utility;the <a href="../utilities/tr.html"><i>tr</i></a> syntax does not employ REs.</p><p>The specification of REs is particularly important to internationalization because pattern matching operations are very basicoperations in business and other operations. The syntax and rules of REs are intended to be as intuitive as possible to make themeasy to understand and use. The historical rules and behavior do not provide that capability to non-English language users, and donot provide the necessary support for commonly used characters and language constructs. It was necessary to provide extensions tothe historical RE syntax and rules to accommodate other languages.</p><p>As they are limited to bracket expressions, the rationale for these modifications is in the Base Definitions volume ofIEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_03_05">Section 9.3.5, RE Bracket Expression</a>.</p><h4><a name="tag_01_09_01"></a>Regular Expression Definitions</h4><p>It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to eachsubexpression, but only with the proviso that the overall match is leftmost longest. For example, matching<tt>"\(ac*\)c*d[ac]*\1"</tt> against <i>acdacaaa</i> matches <i>acdacaaa</i> (with \1=<i>a</i>); simply matching the longest matchfor <tt>"\(ac*\)"</tt> would yield \1=<i>ac</i>, but the overall match would be smaller (<i>acdac</i>). Conceptually, theimplementation must examine every possible match and among those that yield the leftmost longest total matches, pick the one thatdoes the longest match for the leftmost subexpression, and so on. Note that this means that matching by subexpressions iscontext-dependent: a subexpression within a larger RE may match a different string from the one it would match as an independentRE, and two instances of the same subexpression within the same larger RE may match different lengths even in similar sequences ofcharacters. For example, in the ERE <tt>"(a.*b)(a.*b)"</tt> , the two identical subexpressions would match four and six characters,respectively, of <i>accbaccccb</i>.</p><p>The definition of single character has been expanded to include also collating elements consisting of two or more characters;this expansion is applicable only when a bracket expression is included in the BRE or ERE. An example of such a collating elementmay be the Dutch <i>ij</i>, which collates as a <tt>'y'</tt> . In some encodings, a ligature "i with j" exists as a character andwould represent a single-character collating element. In another encoding, no such ligature exists, and the two-character sequence<i>ij</i> is defined as a multi-character collating element. Outside brackets, the <i>ij</i> is treated as a two-character RE andmatches the same characters in a string. Historically, a bracket expression only matched a single character. TheISO POSIX-2:1993 standard required bracket expressions like <tt>"[^[:lower:]]"</tt> to match multi-character collatingelements such as <tt>"ij"</tt> . However, this requirement led to behavior that many users did not expect and that could notfeasibly be mimicked in user code, and it was rarely if ever implemented correctly. The current standard leaves it unspecifiedwhether a bracket expression matches a multi-character collating element, allowing both historical and ISO POSIX-2:1993standard implementations to conform.</p><p>Also, in the current standard, it is unspecified whether character class expressions like <tt>"[:lower:]"</tt> can includemulti-character collating elements like <tt>"ij"</tt> ; hence <tt>"[[:lower:]]"</tt> can match <tt>"ij"</tt> , and<tt>"[^[:lower:]]"</tt> can fail to match <tt>"ij"</tt> . Common practice is for a character class expression to match a collatingelement if it matches the collating element's first character.</p><h4><a name="tag_01_09_02"></a>Regular Expression General Requirements</h4><p>The definition of which sequence is matched when several are possible is based on the leftmost-longest rule historically used bydeterministic recognizers. This rule is easier to define and describe, and arguably more useful, than the first-match rulehistorically used by non-deterministic recognizers. It is thought that dependencies on the choice of rule are rare; carefullycontrived examples are needed to demonstrate the difference.</p><p>A formal expression of the leftmost-longest rule is:</p><blockquote>The search is performed as if all possible suffixes of the string were tested for a prefix matching the pattern; thelongest suffix containing a matching prefix is chosen, and the longest possible matching prefix of the chosen suffix is identifiedas the matching sequence.</blockquote><p>Historically, most RE implementations only match lines, not strings. However, that is more an effect of the usage than of aninherent feature of REs themselves. Consequently, IEEE Std 1003.1-2001 does not regard <newline>s as special; theyare ordinary characters, and both a period and a non-matching list can match them. Those utilities (like <a href="../utilities/grep.html"><i>grep</i></a>) that do not allow <newline>s to match are responsible for eliminating any<newline> from strings before matching against the RE. The <a href="../functions/regcomp.html"><i>regcomp</i>()</a> function,however, can provide support for such processing without violating the rules of this section.</p><p>Some implementations of <i>egrep</i> have had very limited flexibility in handling complex EREs. IEEE Std 1003.1-2001does not attempt to define the complexity of a BRE or ERE, but does place a lower limit on it-any RE must be handled, as long as itcan be expressed in 256 bytes or less. (Of course, this does not place an upper limit on the implementation.) There are historicalprograms using a non-deterministic-recognizer implementation that should have no difficulty with this limit. It is possible that agood approach would be to attempt to use the faster, but more limited, deterministic recognizer for simple expressions and to fallback on the non-deterministic recognizer for those expressions requiring it. Non-deterministic implementations must be careful toobserve the rules on which match is chosen; the longest match, not the first match, starting at a given character is used.</p><p>The term "invalid" highlights a difference between this section and some others: IEEE Std 1003.1-2001 frequentlyavoids mandating of errors for syntax violations because they can be used by implementors to trigger extensions. However, theauthors of the internationalization features of REs wanted to mandate errors for certain conditions to identify usage problems ornon-portable constructs. These are identified within this rationale as appropriate. The remaining syntax violations have been leftimplicitly or explicitly undefined. For example, the BRE construct <tt>"\{1,2,3\}"</tt> does not comply with the grammar. Aconforming application cannot rely on it producing an error nor matching the literal characters <tt>"\{1,2,3\}"</tt> . The term"undefined" was used in favor of "unspecified" because many of the situations are considered errors on some implementations,and the standard developers considered that consistency throughout the section was preferable to mixing undefined andunspecified.</p><h4><a name="tag_01_09_03"></a>Basic Regular Expressions</h4><p>There is no additional rationale provided for this section.</p><h5><a name="tag_01_09_03_01"></a>BREs Matching a Single Character or Collating Element</h5><p>There is no additional rationale provided for this section.</p><h5><a name="tag_01_09_03_02"></a>BRE Ordinary Characters</h5><p>There is no additional rationale provided for this section.</p><h5><a name="tag_01_09_03_03"></a>BRE Special Characters</h5><p>There is no additional rationale provided for this section.</p><h5><a name="tag_01_09_03_04"></a>Periods in BREs</h5><p>There is no additional rationale provided for this section.</p><h5><a name="tag_01_09_03_05"></a>RE Bracket Expression</h5><p>Range expressions are, historically, an integral part of REs. However, the requirements of "natural language behavior" andportability do conflict. In the POSIX locale, ranges must be treated according to the collating sequence and include suchcharacters that fall within the range based on that collating sequence, regardless of character values. In other locales, rangeshave unspecified behavior.</p><p>Some historical implementations allow range expressions where the ending range point of one range is also the starting point ofthe next (for instance, <tt>"[a-m-o]"</tt> ). This behavior should not be permitted, but to avoid breaking historicalimplementations, it is now <i>undefined</i> whether it is a valid expression and how it should be interpreted.</p><p>Current practice in <a href="../utilities/awk.html"><i>awk</i></a> and <a href="../utilities/lex.html"><i>lex</i></a> is toaccept escape sequences in bracket expressions as per the Base Definitions volume of IEEE Std 1003.1-2001, Table 5-1,Escape Sequences and Associated Actions, while the normal ERE behavior is to regard such a sequence as consisting of twocharacters. Allowing the <a href="../utilities/awk.html"><i>awk</i></a>/ <a href="../utilities/lex.html"><i>lex</i></a> behavior inEREs would change the normal behavior in an unacceptable way; it is expected that <a href="../utilities/awk.html"><i>awk</i></a>and <a href="../utilities/lex.html"><i>lex</i></a> will decode escape sequences in EREs before passing them to <a href="../functions/regcomp.html"><i>regcomp</i>()</a> or comparable routines. Each utility describes the escape sequences it accepts asan exception to the rules in this section; the list is not the same, for historical reasons.</p><p>As noted previously, the new syntax and rules have been added to accommodate other languages than English. The remainder of thissection describes the rationale for these modifications.</p><p>In the POSIX locale, a regular expression that starts with a range expression matches a set of strings that are contiguouslysorted, but this is not necessarily true in other locales. For example, a French locale might have the following behavior:</p><pre><b>$</b> <tt>lsalpha Alpha estimé ESTIMé été eurêka</tt><b>$</b> <tt>ls [a-e]*alpha Alpha estimé eurêka</tt></pre><p>Such disagreements between matching and contiguous sorting are unavoidable because POSIX sorting cannot be implemented in termsof a deterministic finite-state automaton (DFA), but range expressions by design are implementable in terms of DFAs.</p><p>Historical implementations used native character order to interpret range expressions. The ISO POSIX-2:1993 standardinstead required collating element order (CEO): the order that collating elements were specified between the <b>order_start</b> and<b>order_end</b> keywords in the <i>LC_COLLATE</i> category of the current locale. CEO had some advantages in portability over thenative character order, but it also had some disadvantages:</p><ul><li><p>CEO could not feasibly be mimicked in user code, leading to inconsistencies between POSIX matchers and matchers in popular userprograms like Emacs, <i>ksh</i>, and Perl.</p></li><li><p>CEO caused range expressions to match accented and capitalized letters contrary to many users' expectations. For example,<tt>"[a-e]"</tt> typically matched both <tt>'E'</tt> and <tt>'á'</tt> but neither <tt>'A'</tt> nor <tt>'é'</tt> .</p></li><li><p>CEO was not consistent across implementations. In practice, CEO was often less portable than native character order. Forexample, it was common for the CEOs of two implementation-supplied locales to disagree, even if both locales were named<tt>"da_DK"</tt> .</p></li></ul><p>Because of these problems, some implementations of regular expressions continued to use native character order. Others used thecollation sequence, which is more consistent with sorting than either CEO or native order, but which departs further from thetraditional POSIX semantics because it generally requires <tt>"[a-e]"</tt> to match either <tt>'A'</tt> or <tt>'E'</tt> but notboth. As a result of this kind of implementation variation, programmers who wanted to write portable regular expressions could notrely on the ISO POSIX-2:1993 standard guarantees in practice.</p><p>While revising the standard, lengthy consideration was given to proposals to attack this problem by adding an API for queryingthe CEO to allow user-mode matchers, but none of these proposals had implementation experience and none achieved consensus. Leavingthe standard alone was also considered, but rejected due to the problems described above.</p>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -