<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><meta name="generator" content="HTML Tidy, see www.w3.org"><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link type="text/css" rel="stylesheet" href="style.css"><!-- Generated by The Open Group's rhtm tool v1.2.1 --><!-- Copyright (c) 2001-2003 The Open Group, All Rights Reserved --><title>Rationale</title></head>
<body><basefont size="3"> <center><font size="2">The Open Group Base Specifications Issue 6<br>IEEE Std 1003.1, 2003 Edition<br>Copyright &copy; 2001-2003 The IEEE and The Open Group</font></center><hr size="2" noshade>
<h3><a name="tag_01_07"></a>Locale</h3>
<h4><a name="tag_01_07_01"></a>General</h4>
<p>The description of locales is based on work performed in the UniForum Technical Committee, Subcommittee on Internationalization. Wherever appropriate, keywords are taken from the ISO&nbsp;C standard or the X/Open Portability Guide.</p>
<p>The value used to specify a locale with environment variables is the name specified as the <i>name</i> operand to the <a href="../utilities/localedef.html"><i>localedef</i></a> utility when the locale was created. This provides a verifiable method to create and invoke a locale.</p>
<p>The &quot;object&quot; definitions need not be portable, as long as &quot;source&quot; definitions are. Strictly speaking, source definitions are portable only between implementations using the same character set(s). Such source definitions, if they use symbolic names only, easily can be ported between systems using different codesets, as long as the characters in the portable character set (see the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap06.html#tag_06_01">Section 6.1, Portable Character Set</a>) have common values between the codesets; this is frequently the case in historical implementations. 
Of course, this requires that the symbolic names used for characters outside the portable character set be identical between character sets. The definition of symbolic names for characters is outside the scope of IEEE&nbsp;Std&nbsp;1003.1-2001, but is certainly within the scope of other standards organizations.</p>
<p>Applications can select the desired locale by invoking the <a href="../functions/setlocale.html"><i>setlocale</i>()</a> function (or equivalent) with the appropriate value. If the function is invoked with an empty string, the value of the corresponding environment variable is used. If the environment variable is not set or is set to the empty string, the implementation sets the appropriate environment as defined in the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap08.html">Chapter 8, Environment Variables</a>.</p>
<h4><a name="tag_01_07_02"></a>POSIX Locale</h4>
<p>The POSIX locale is equal to the C locale. To avoid being classified as a C-language function, the name has been changed to the POSIX locale; the environment variable value can be either <tt>"POSIX"</tt> or, for historical reasons, <tt>"C"</tt>.</p>
<p>The POSIX definitions mirror the historical UNIX system behavior.</p>
<p>The use of symbolic names for characters in the tables does not imply that the POSIX locale must be described using symbolic character names, but merely that it may be advantageous to do so.</p>
<h4><a name="tag_01_07_03"></a>Locale Definition</h4>
<p>The decision to separate the file format from the <a href="../utilities/localedef.html"><i>localedef</i></a> utility description was only partially editorial. Implementations may provide other interfaces than <a href="../utilities/localedef.html"><i>localedef</i></a>. 
Requirements on &quot;the utility&quot;, mostly concerning error messages, are described in this way because they are meant to affect the other interfaces implementations may provide as well as <a href="../utilities/localedef.html"><i>localedef</i></a>.</p>
<p>The text about POSIX2_LOCALEDEF does not mean that internationalization is optional; only that the functionality of the <a href="../utilities/localedef.html"><i>localedef</i></a> utility is. REs, for instance, must still be able to recognize character class expressions such as <tt>"[[:alpha:]]"</tt>. A possible analogy is with an applications development environment; while all conforming implementations must be capable of executing applications, not all need to have the development environment installed. The assumption is that the capability to modify the behavior of utilities (and applications) via locale settings must be supported. If the <a href="../utilities/localedef.html"><i>localedef</i></a> utility is not present, then the only choice is to select an existing (presumably implementation-documented) locale. An implementation could, for example, choose to support only the POSIX locale, which would in effect limit the number of changes from historical implementations quite drastically. The <a href="../utilities/localedef.html"><i>localedef</i></a> utility is still required, but would always terminate with an exit code indicating that no locale could be created. Supported locales must be documented using the syntax defined in this chapter. (This ensures that users can accurately determine what capabilities are provided. If the implementation decides to provide additional capabilities to the ones in this chapter, that is already provided for.)</p>
<p>If the option is present (that is, locales can be created), then the <a href="../utilities/localedef.html"><i>localedef</i></a> utility must be capable of creating locales based on the syntax and rules defined in this chapter. 
This does not mean that the implementation cannot also provide alternate means for creating locales.</p>
<p>The octal, decimal, and hexadecimal notations are the same as those employed by the charmap facility (see the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap06.html#tag_06_04">Section 6.4, Character Set Description File</a>). To avoid confusion between an octal constant and a back-reference, the octal, hexadecimal, and decimal constants must contain at least two digits. As single-digit constants are relatively rare, this should not impose any significant hardship. Provision is made for more digits to account for systems in which the byte size is larger than 8 bits. For example, a Unicode (see the ISO/IEC&nbsp;10646-1:2000 standard) system that has defined 16-bit bytes may require six octal, four hexadecimal, and five decimal digits. As with the charmap file, multi-byte characters are described in the locale definition file using &quot;big-endian&quot; notation for reasons of portability. There is no requirement that the internal representation in the computer memory be in this same order.</p>
<p>One of the guidelines used for the development of this volume of IEEE&nbsp;Std&nbsp;1003.1-2001 is that characters outside the invariant part of the ISO/IEC&nbsp;646:1991 standard should not be used in portable specifications. The backslash character is not in the invariant part; the number sign is, but with multiple representations: as a number sign, and as a pound sign. As far as general usage of these symbols, they are covered by the &quot;grandfather clause&quot;, but for newly defined interfaces, the WG15 POSIX working group has requested that POSIX provide alternate representations. 
Consequently, while the default escape character remains the backslash and the default comment character is the number sign, implementations are required to recognize alternative representations, identified in the applicable source file via the <b>&lt;escape_char&gt;</b> and <b>&lt;comment_char&gt;</b> keywords.</p>
<h5><a name="tag_01_07_03_01"></a>LC_CTYPE</h5>
<p>The <i>LC_CTYPE</i> category is primarily used to define the encoding-independent aspects of a character set, such as character classification. In addition, certain encoding-dependent characteristics are also defined for an application via the <i>LC_CTYPE</i> category. IEEE&nbsp;Std&nbsp;1003.1-2001 does not mandate that the encoding used in the locale is the same as the one used by the application because an implementation may decide that it is advantageous to define locales in a system-wide encoding rather than having multiple, logically identical locales in different encodings, and to convert from the application encoding to the system-wide encoding on usage. Other implementations could require encoding-dependent locales.</p>
<p>In either case, the <i>LC_CTYPE</i> attributes that are directly dependent on the encoding, such as <b>&lt;mb_cur_max&gt;</b> and the display width of characters, are not user-specifiable in a locale source and are consequently not defined as keywords.</p>
<p>Implementations may define additional keywords or extend the <i>LC_CTYPE</i> mechanism to allow application-defined keywords.</p>
<p>The text &quot;The ellipsis specification shall only be valid within a single encoded character set&quot; is present because it is possible to have a locale supported by multiple character encodings, as explained in the rationale for the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap06.html#tag_06_01">Section 6.1, Portable Character Set</a>. 
An example given there is of a possible Japanese-based locale supported by a mixture of the character sets JIS&nbsp;X&nbsp;0201 Roman, JIS&nbsp;X&nbsp;0208, and JIS&nbsp;X&nbsp;0201 Katakana. Attempting to express a range of characters across these sets is not logical and the implementation is free to reject such attempts.</p>
<p>As the <i>LC_CTYPE</i> character classes are based on the ISO&nbsp;C standard character class definition, the category does not support multi-character elements. For instance, the German character &lt;sharp-s&gt; is traditionally classified as a lowercase letter. There is no corresponding uppercase letter; in proper capitalization of German text, the &lt;sharp-s&gt; will be replaced by <tt>"SS"</tt>; that is, by two characters. This kind of conversion is outside the scope of the <b>toupper</b> and <b>tolower</b> keywords.</p>
<p>Where IEEE&nbsp;Std&nbsp;1003.1-2001 specifies that only certain characters can be specified, as for the keywords <b>digit</b> and <b>xdigit</b>, the specified characters must be from the portable character set, as shown. As an example, only the Arabic digits 0 through 9 are acceptable as digits.</p>
<p>The character classes <b>digit</b>, <b>xdigit</b>, <b>lower</b>, <b>upper</b>, and <b>space</b> have a set of automatically included characters. These only need to be specified if the character values (that is, encoding) differ from the implementation default values. It is not possible to define a locale without these automatically included characters unless some implementation extension is used to prevent their inclusion. Such a definition would not be a proper superset of the C locale, and thus, it might not be possible for the standard utilities to be implemented as programs conforming to the ISO&nbsp;C standard.</p>
<p>The definition of character class <b>digit</b> requires that only ten characters (the ones defining digits) can be specified; alternate digits (for example, Hindi or Kanji) cannot be specified here. 
However, the encoding may vary if an implementation supports more than one encoding.</p>
<p>The definition of character class <b>xdigit</b> requires that the characters included in character class <b>digit</b> are included here also and allows for different symbols for the hexadecimal digits 10 through 15.</p>
<p>The inclusion of the <b>charclass</b> keyword satisfies the following requirement from the ISO&nbsp;POSIX-2:1993 standard, Annex H.1:</p>
<dl compact><dt>(3)</dt><dd><i>The</i> LC_CTYPE (2.5.2.1) locale definition should be enhanced to allow user-specified additional character classes, similar in concept to the ISO&nbsp;C standard Multibyte Support Extension (MSE) <a href="../functions/iswctype.html"><i>iswctype</i>()</a> function.</dd></dl>
<p>This keyword was previously included in The Open Group specifications and is now mandated in the Shell and Utilities volume of IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<p>The symbolic constant {CHARCLASS_NAME_MAX} was also adopted from The Open Group specifications. Applications portability is enhanced by the use of symbolic constants.</p>
<h5><a name="tag_01_07_03_02"></a>LC_COLLATE</h5>
<p>The rules governing collation depend to some extent on the use. At least five different levels of increasingly complex collation rules can be distinguished:</p>
<ol><li><p><i>Byte/machine code order</i>: This is the historical collation order in the UNIX system and many proprietary operating systems. Collation is here performed character by character, without any regard to context. The primary virtue is that it usually is quite fast and also completely deterministic; it works well when the native machine collation sequence matches the user expectations.</p></li>
<li><p><i>Character order</i>: On this level, collation is also performed character by character, without regard to context. The order between characters is, however, not determined by the code values, but by the user's expectations of the &quot;correct&quot; order between characters. 
In addition, such a (simple) collation order can specify that certain characters collate equally (for example, uppercase and lowercase letters).</p></li>
<li><p><i>String ordering</i>: On this level, entire strings are compared based on relatively straightforward rules. Several &quot;passes&quot; may be required to determine the order between two strings. Characters may be ignored in some passes, but not in others; the strings may be compared in different directions; and simple string substitutions may be performed before strings are compared. This level is best described as &quot;dictionary&quot; ordering; it is based on the spelling, not the pronunciation, or meaning, of the words.</p></li>
<li><p><i>Text search ordering</i>: This is a further refinement of the previous level, best described as &quot;telephone book ordering&quot;; some common homonyms (words spelled differently but with the same pronunciation) are collated together; numbers are collated as if they were spelled out, and so on.</p></li>
<li><p><i>Semantic-level ordering</i>: Words and strings are collated based on their meaning; entire words (such as &quot;the&quot;) are eliminated; the ordering is not deterministic. This usually requires special software and is highly dependent on the intended use.</p></li></ol>
<p>While the historical collation order formally is at level 1, for the English language it corresponds roughly to elements at level 2. The user expects to see the output from the <a href="../utilities/ls.html"><i>ls</i></a> utility sorted very much as it would be in a dictionary. While telephone book ordering would be an optimal goal for standard collation, this was ruled out as the order would be language-dependent. Furthermore, a requirement was that the order must be determined solely from the text string and the collation rules; no external information (for example, &quot;pronunciation dictionaries&quot;) could be required.</p>
<p>As a result, the goal for the collation support is at level 3. 
This also matches the requirements for the Canadian collation order, as well as other, known collation requirements for alphabetic scripts. It specifically rules out collation based on pronunciation rules or based on semantic analysis of the text.</p>
<p>The syntax for the <i>LC_COLLATE</i> category source meets the requirements for level 3 and has been verified to produce the correct result with examples based on French, Canadian, and Danish collation order. Because it supports multi-character collating elements, it is also capable of supporting collation in codesets where a character is expressed using non-spacing characters followed by the base character (such as the ISO/IEC&nbsp;6937:1994 standard).</p>
<p>The directives that can be specified in an operand to the <b>order_start</b> keyword are based on the requirements specified in several proposed standards and in customary use. The following is a rephrasing of rules defined for &quot;lexical ordering in English and French&quot; by the Canadian Standards Association (the text in square brackets is rephrased):</p>
<ul><li><p>Once special characters [punctuation] have been removed from original strings, the ordering is determined by scanning forwards (left to right) [disregarding case and diacriticals].</p></li>
<li><p>In case of equivalence, special characters are once again removed from original strings and the ordering is determined by scanning backwards (starting from the rightmost character of the string and back), character by character [disregarding case but considering diacriticals].</p></li>
<li><p>In case of repeated equivalence, special characters are removed again from original strings and the ordering is determined by scanning forwards, character by character [considering both case and diacriticals].</p></li>
<li><p>If there is still an ordering equivalence after the first three rules have been applied, then only special characters and the position they occupy in the string are considered to determine ordering. 
The string that has a special character in the lowest position comes first. If two strings have a special character in the same position, the character [with the lowest collation value] comes first. In case of equality, the other special characters are considered until there is a difference or until all special characters have been exhausted.</p></li></ul>
<p>It is estimated that this part of IEEE&nbsp;Std&nbsp;1003.1-2001 covers the requirements for all European languages, and no particular problems are anticipated with Slavic or Middle East character sets.</p>
<p>The Far East (particularly Japanese/Chinese) collations are often based on contextual information and pronunciation rules (the same ideogram can have different meanings and different pronunciations). Such collation, in general, falls outside the desired goal of IEEE&nbsp;Std&nbsp;1003.1-2001. There are, however, several other collation rules (stroke/radical or &quot;most common pronunciation&quot;) that can be supported with the mechanism described here.</p>
<p>The character order is defined by the order in which characters and elements are specified between the <b>order_start</b> and <b>order_end</b> keywords. Weights assigned to the characters and elements define the collation sequence; in the absence of weights, the character order is also the collation sequence.</p>
<p>The <b>position</b> keyword provides the capability to consider, in a compare, the relative position of characters not subject to <b>IGNORE</b>. As an example, consider the two strings <tt>"o-ring"</tt> and <tt>"or-ing"</tt>. Assuming the hyphen is subject to <b>IGNORE</b> on the first pass, the two strings compare equal, and the position of the hyphen is immaterial. On the second pass, all characters except the hyphen are subject to <b>IGNORE</b>, and in the normal case the two strings would again compare equal. 
By taking position into account, the first collates before the second.</p>
<h5><a name="tag_01_07_03_03"></a>LC_MONETARY</h5>
<p>The currency symbol does not appear in <i>LC_MONETARY</i> because it is not defined in the C locale of the ISO&nbsp;C standard.</p>
<p>The ISO&nbsp;C standard limits the size of decimal points and thousands delimiters to single-byte values. In locales based on multi-byte coded character sets, this cannot be enforced; IEEE&nbsp;Std&nbsp;1003.1-2001 does not prohibit such characters, but makes the behavior unspecified (in the text &quot;In contexts where other standards ...&quot;).</p>
<p>The grouping specification is based on, but not identical to, the ISO&nbsp;C standard. The -1 indicates that no further grouping is performed; the equivalent of {CHAR_MAX} in the ISO&nbsp;C standard.</p>
<p>The text &quot;the value is not available in the locale&quot; is taken from the ISO&nbsp;C standard and is used instead of the &quot;unspecified&quot; text in early proposals. There is no implication that omitting these keywords or assigning them values of <tt>""</tt> or -1 produces unspecified results; such omissions or assignments eliminate the effects described for the keyword or produce zero-length strings, as appropriate.</p>
<p>The locale definition is an extension of the ISO&nbsp;C standard <a href="../functions/localeconv.html"><i>localeconv</i>()</a> specification. In particular, rules on how <b>currency_symbol</b> is treated are extended to also cover <b>int_curr_symbol</b>, and <b>p_sep_by_space</b> and <b>n_sep_by_space</b> have been augmented with the value 2, which places a &lt;space&gt; between the sign and the symbol (if they are adjacent; otherwise, it should be treated as a 0). 
The following table shows the result of various combinations:</p>
<center><table border="1" cellpadding="3" align="center"><tr valign="top"><th align="left"><p class="tent">&nbsp;</p></th><th align="left"><p class="tent">&nbsp;</p></th><th colspan="3" align="center"><p class="tent"><b>p_sep_by_space</b></p></th></tr><tr valign="top"><th align="left">