📄 xbd_chap09.html

📁 IEEE 1003.1-2003, Single Unix Specification v3
<p>The current standard leaves unspecified the behavior of a range expression outside the POSIX locale. This makes it clearer that conforming applications should avoid range expressions outside the POSIX locale, and it allows implementations and compatible user-mode matchers to interpret range expressions using native order, CEO, collation sequence, or other, more advanced techniques. The concerns which led to this change were raised in IEEE PASC interpretation 1003.2 #43 and others, and related to ambiguities in the specification of how multi-character collating elements should be handled in range expressions. These ambiguities had led to multiple interpretations of the specification, in conflicting ways, which led to varying implementations. As noted above, efforts were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software while not invalidating existing implementations.</p>
<p>The standard developers recognize that collating elements are important, such elements being common in several European languages; for example, <tt>'ch'</tt> or <tt>'ll'</tt> in traditional Spanish; <tt>'aa'</tt> in several Scandinavian languages. Existing internationalized implementations have processed, and continue to process, these elements in range expressions. Efforts are expected to continue in the future to find a way to define the behavior of these elements precisely and portably.</p>
<p>The ISO&nbsp;POSIX-2:1993 standard required <tt>"[b-a]"</tt> to be an invalid expression in the POSIX locale, but this requirement has been relaxed in this version of the standard so that <tt>"[b-a]"</tt> can instead be treated as a valid expression that does not match any string.</p>
<h5><a name="tag_01_09_03_06"></a>BREs Matching Multiple Characters</h5>
<p>The limit of nine back-references to subexpressions in the RE is based on the use of a single-digit identifier; increasing this to multiple digits would break historical applications.
This does not imply that only nine subexpressions are allowed in REs. The following is a valid BRE with ten subexpressions:</p>
<pre><tt>\(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*</tt></pre>
<p>The standard developers regarded the common historical behavior, which supported <tt>"\n*"</tt>, but not <tt>"\n\{min,max\}"</tt>, <tt>"\(...\)*"</tt>, or <tt>"\(...\)\{min,max\}"</tt>, as a non-intentional result of a specific implementation, and they supported both duplication and interval expressions following subexpressions and back-references.</p>
<p>The changes to the processing of the back-reference expression remove an unspecified or ambiguous behavior in the Shell and Utilities volume of IEEE&nbsp;Std&nbsp;1003.1-2001, aligning it with the requirements specified for the <a href="../functions/regcomp.html"><i>regcomp</i>()</a> expression, and is the result of PASC Interpretation 1003.2-92 #43 submitted for the ISO&nbsp;POSIX-2:1993 standard.</p>
<h5><a name="tag_01_09_03_07"></a>BRE Precedence</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_03_08"></a>BRE Expression Anchoring</h5>
<p>Often, the dollar sign is viewed as matching the ending &lt;newline&gt; in text files. This is not strictly true; the &lt;newline&gt; is typically eliminated from the strings to be matched, and the dollar sign matches the terminating null character.</p>
<p>The ability of <tt>'^'</tt>, <tt>'$'</tt>, and <tt>'*'</tt> to be non-special in certain circumstances may be confusing to some programmers, but this situation was changed only in a minor way from historical practice to avoid breaking many historical scripts. Some consideration was given to making the use of the anchoring characters undefined if not escaped and not at the beginning or end of strings.
This would cause a number of historical BREs, such as <tt>"2^10"</tt>, <tt>"$HOME"</tt>, and <tt>"$1.35"</tt>, that relied on the characters being treated literally, to become invalid.</p>
<p>However, one relatively uncommon case was changed to allow an extension used on some implementations. Historically, the BREs <tt>"^foo"</tt> and <tt>"\(^foo\)"</tt> did not match the same string, despite the general rule that subexpressions and entire BREs match the same strings. To increase consensus, IEEE&nbsp;Std&nbsp;1003.1-2001 has allowed an extension on some implementations to treat these two cases in the same way by declaring that anchoring <i>may</i> occur at the beginning or end of a subexpression. Therefore, portable BREs that require a literal circumflex at the beginning or a dollar sign at the end of a subexpression must escape them. Note that a BRE such as <tt>"a\(^bc\)"</tt> will either match <tt>"a^bc"</tt> or nothing on different systems under the rules.</p>
<p>ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped anchor character has never matched its literal counterpart outside a bracket expression. Some implementations treated <tt>"foo$bar"</tt> as a valid expression that never matched anything; others treated it as invalid. IEEE&nbsp;Std&nbsp;1003.1-2001 mandates the former, valid unmatched behavior.</p>
<p>Some implementations have extended the BRE syntax to add alternation. For example, the subexpression <tt>"\(foo$\|bar\)"</tt> would match either <tt>"foo"</tt> at the end of the string or <tt>"bar"</tt> anywhere. The extension is triggered by the use of the undefined <tt>"\|"</tt> sequence. Because the BRE is undefined for portable scripts, the extending system is free to make other assumptions, such that the <tt>'$'</tt> represents the end-of-line anchor in the middle of a subexpression.
If it were not for the extension, the <tt>'$'</tt> would match a literal dollar sign under the rules.</p>
<h4><a name="tag_01_09_04"></a>Extended Regular Expressions</h4>
<p>As with BREs, the standard developers decided to make the interpretation of escaped ordinary characters undefined.</p>
<p>The right parenthesis is not listed as an ERE special character because it is only special in the context of a preceding left parenthesis. If found without a preceding left parenthesis, the right parenthesis has no special meaning.</p>
<p>The interval expression, <tt>"{m,n}"</tt>, has been added to EREs. Historically, the interval expression has only been supported in some ERE implementations. The standard developers estimated that the addition of interval expressions to EREs would not decrease consensus and would also make BREs more of a subset of EREs than in many historical implementations.</p>
<p>It was suggested that, in addition to interval expressions, back-references (<tt>'\n'</tt>) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.</p>
<p>In historical implementations, multiple duplication symbols are usually interpreted from left to right and treated as additive. As an example, <tt>"a+*b"</tt> matches zero or more instances of <tt>'a'</tt> followed by a <tt>'b'</tt>. In IEEE&nbsp;Std&nbsp;1003.1-2001, multiple duplication symbols are undefined; that is, they cannot be relied upon for conforming applications.
One reason for this is to provide some scope for future enhancements.</p>
<p>The precedence of operations differs between EREs and those in <a href="../utilities/lex.html"><i>lex</i></a>; in <a href="../utilities/lex.html"><i>lex</i></a>, for historical reasons, interval expressions have a lower precedence than concatenation.</p>
<h5><a name="tag_01_09_04_01"></a>EREs Matching a Single Character or Collating Element</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_02"></a>ERE Ordinary Characters</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_03"></a>ERE Special Characters</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_04"></a>Periods in EREs</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_05"></a>ERE Bracket Expression</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_06"></a>EREs Matching Multiple Characters</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_07"></a>ERE Alternation</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_08"></a>ERE Precedence</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_04_09"></a>ERE Expression Anchoring</h5>
<p>There is no additional rationale provided for this section.</p>
<h4><a name="tag_01_09_05"></a>Regular Expression Grammar</h4>
<p>The grammars are intended to represent the range of acceptable syntaxes available to conforming applications.
There are instances in the text where undefined constructs are described; as explained previously, these allow implementation extensions. There is no intended requirement that an implementation extension must somehow fit into the grammars shown here.</p>
<p>The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside <tt>"\("</tt> and <tt>"\)"</tt> (which implies that <tt>'^'</tt> and <tt>'$'</tt> are ordinary characters). This reflects the semantic limits on the application, as noted in the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_03_08">Section 9.3.8, BRE Expression Anchoring</a>. Implementations are permitted to extend the language to interpret <tt>'^'</tt> and <tt>'$'</tt> as anchors in these locations, and as such, conforming applications cannot use unescaped <tt>'^'</tt> and <tt>'$'</tt> in positions inside <tt>"\("</tt> and <tt>"\)"</tt> that might be interpreted as anchors.</p>
<p>The ERE grammar does not permit several constructs that the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_04_02">Section 9.4.2, ERE Ordinary Characters</a> and the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_04_03">Section 9.4.3, ERE Special Characters</a> specify as having undefined results:</p>
<ul>
<li><p>ORD_CHAR preceded by <tt>'\'</tt></p></li>
<li><p><i>ERE_dupl_symbol</i>(s) appearing first in an ERE, or immediately following <tt>'|'</tt>, <tt>'^'</tt>, or <tt>'('</tt></p></li>
<li><p><tt>'{'</tt> not part of a valid <i>ERE_dupl_symbol</i></p></li>
<li><p><tt>'|'</tt> appearing first or last in an ERE, or immediately following <tt>'|'</tt> or <tt>'('</tt>, or immediately preceding <tt>')'</tt></p></li>
</ul>
<p>Implementations are permitted to extend the language to allow these.
Conforming applications cannot use such constructs.</p>
<h5><a name="tag_01_09_05_01"></a>BRE/ERE Grammar Lexical Conventions</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_01_09_05_02"></a>RE and Bracket Expression Grammar</h5>
<p>The removal of the <i>Back_open_paren</i> <i>Back_close_paren</i> option from the <i>nondupl_RE</i> specification is the result of PASC Interpretation 1003.2-92 #43 submitted for the ISO&nbsp;POSIX-2:1993 standard. Although the grammar required support for null subexpressions, this section does not describe the meaning of, and historical practice did not support, this construct.</p>
<h5><a name="tag_01_09_05_03"></a>ERE Grammar</h5>
<p>There is no additional rationale provided for this section.</p>
<hr size="2" noshade>
<center><font size="2"><!--footer start-->UNIX &reg; is a registered Trademark of The Open Group.<br>
POSIX &reg; is a registered Trademark of The IEEE.<br>
[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href="../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a> ]</font></center>
<!--footer end-->
<hr size="2" noshade>
</body></html>
