Network Working Group                                        P. Hoffman
Request for Comments: 2781                     Internet Mail Consortium
Category: Informational                                      F. Yergeau
                                                      Alis Technologies
                                                          February 2000


                    UTF-16, an encoding of ISO 10646

Status of this Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2000).  All Rights Reserved.

1. Introduction

   This document describes the UTF-16 encoding of Unicode/ISO-10646,
   addresses the issues of serializing UTF-16 as an octet stream for
   transmission over the Internet, discusses MIME charset naming as
   described in [CHARSET-REG], and contains the registration for three
   MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE
   (little-endian), and UTF-16.

1.1 Background and motivation

   The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly
   define a coded character set (CCS), hereafter referred to as Unicode,
   which encompasses most of the world's writing systems [WORKSHOP].
   UTF-16, the object of this specification, is one of the standard ways
   of encoding Unicode character data; it has the characteristics of
   encoding all currently defined characters (in plane 0, the BMP) in
   exactly two octets and of being able to encode all other characters
   likely to be defined (the next 16 planes) in exactly four octets.

   The Unicode Standard further defines additional character properties
   and other application details of great interest to implementors. Up
   to the present time, changes in Unicode and amendments to ISO/IEC
   10646 have tracked each other, so that the character repertoires and
   code point assignments have remained in sync.
   The relevant standardization committees have committed to maintain
   this very useful synchronism, as well as not to assign characters
   outside of the 17 planes accessible to UTF-16.

Hoffman & Yergeau            Informational                      [Page 1]

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000

   The IETF policy on character sets and languages [CHARPOLICY] says
   that IETF protocols MUST be able to use the UTF-8 character encoding
   scheme [UTF-8]. Some products and network standards already specify
   UTF-16, making it an important encoding for the Internet. This
   document is not an update to the [CHARPOLICY] document, only a
   description of the UTF-16 encoding.

1.2 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [MUSTSHOULD].

   Throughout this document, character values are shown in hexadecimal
   notation. For example, "0x013C" is the character assigned the
   integer value 316 (decimal) in the CCS.

2. UTF-16 definition

   UTF-16 is described in the Unicode Standard, version 3.0 [UNICODE].
   The definitive reference is Annex Q of ISO/IEC 10646-1 [ISO-10646].
   The rest of this section summarizes the definition in simple terms.

   In ISO 10646, each character is assigned a number, which Unicode
   calls the Unicode scalar value. This number is the same as the UCS-4
   value of the character, and this document will refer to it as the
   "character value" for brevity. In the UTF-16 encoding, characters are
   represented using either one or two unsigned 16-bit integers,
   depending on the character value. Serialization of these integers for
   transmission as a byte stream is discussed in Section 3.

   The rules for how characters are encoded in UTF-16 are:

   -  Characters with values less than 0x10000 are represented as a
      single 16-bit integer with a value equal to that of the character
      number.

   -  Characters with values between 0x10000 and 0x10FFFF are
      represented by a 16-bit integer with a value between 0xD800 and
      0xDBFF (within the so-called high-half zone or high surrogate
      area) followed by a 16-bit integer with a value between 0xDC00 and
      0xDFFF (within the so-called low-half zone or low surrogate area).

   -  Characters with values greater than 0x10FFFF cannot be encoded in
      UTF-16.

   Note: Values between 0xD800 and 0xDFFF are specifically reserved for
   use with UTF-16, and don't have any characters assigned to them.

2.1 Encoding UTF-16

   Encoding of a single character from an ISO 10646 character value to
   UTF-16 proceeds as follows. Let U be the character number, no greater
   than 0x10FFFF.

   1) If U < 0x10000, encode U as a 16-bit unsigned integer and
      terminate.

   2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
      U' must be less than or equal to 0xFFFFF. That is, U' can be
      represented in 20 bits.

   3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
      0xDC00, respectively. These integers each have 10 bits free to
      encode the character value, for a total of 20 bits.

   4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
      bits of W1 and the 10 low-order bits of U' to the 10 low-order
      bits of W2. Terminate.

   Graphically, steps 2 through 4 look like:

   U' = yyyyyyyyyyxxxxxxxxxx
   W1 = 110110yyyyyyyyyy
   W2 = 110111xxxxxxxxxx

2.2 Decoding UTF-16

   Decoding of a single character from UTF-16 to an ISO 10646 character
   value proceeds as follows.
   Let W1 be the next 16-bit integer in the sequence of integers
   representing the text. Let W2 be the (eventual) next integer
   following W1.

   1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
      of W1. Terminate.

   2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
      is in error and no valid character can be obtained using W1.
      Terminate.

   3) If there is no W2 (that is, the sequence ends with W1), or if W2
      is not between 0xDC00 and 0xDFFF, the sequence is in error.
      Terminate.

   4) Construct a 20-bit unsigned integer U', taking the 10 low-order
      bits of W1 as its 10 high-order bits and the 10 low-order bits of
      W2 as its 10 low-order bits.

   5) Add 0x10000 to U' to obtain the character value U. Terminate.

   Note that steps 2 and 3 indicate errors. Error recovery is not
   specified by this document. When terminating with an error in step 2
   or 3, it may be wise to set U to the value of W1 to help the caller
   diagnose the error and not lose information. Also note that a string
   decoding algorithm, as opposed to the single-character decoding
   described above, need not terminate upon detection of an error, if
   proper error reporting and/or recovery is provided.

3. Labelling UTF-16 text

   Appendix A of this specification contains registrations for three
   MIME charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets
   represent the combination of a CCS (a coded character set) and a CES
   (a character encoding scheme). Here the CCS is Unicode/ISO 10646 and
   the CES is the same in all three cases, except for the serialization
   order of the octets in each character, and the external determination
   of which serialization is used.

   This section describes which of the three labels to apply to a stream
   of text.
   Section 4 describes how to interpret the labels on a stream of text.

3.1 Definition of big-endian and little-endian

   Historically, computer hardware has processed two-octet entities such
   as 16-bit integers in one of two ways. So-called "big-endian"
   hardware handles two-octet entities with the higher-order octet
   first, that is at the lower address in memory; when written out to
   disk or to a network interface (serializing), the high-order octet
   thus appears first in the data stream. On the other hand,
   "little-endian" hardware handles two-octet entities with the
   lower-order octet first. Hardware of both kinds is common today.

   For example, the unsigned 16-bit integer that represents the decimal
   number 258 is 0x0102. The big-endian serialization of that number is
   the octet 0x01 followed by the octet 0x02. The little-endian
   serialization of that number is the octet 0x02 followed by the octet
   0x01. The following C code fragment demonstrates a way to write
   16-bit quantities to a file in big-endian order, irrespective of the
   hardware's native byte order.

   #include <stdio.h>

   void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */
   {
     putc((u >> 8) & 0xFF, f);              /* output high-order byte */
     putc(u & 0xFF, f);                     /* then low-order */
   }

   The term "network byte order" has been used in many RFCs to indicate
   big-endian serialization, although that term has yet to be formally
   defined in a standards-track document. Although ISO 10646 prefers
   big-endian serialization (section 6.3 of [ISO-10646]), little-endian
   order is also sometimes used on the Internet.

3.2 Byte order mark (BOM)

   The Unicode Standard and ISO 10646 define the character "ZERO WIDTH
   NON-BREAKING SPACE" (0xFEFF), which is also known informally as "BYTE
   ORDER MARK" (abbreviated "BOM").
   The latter name hints at a second possible usage of the character,
   in addition to its normal use as a genuine "ZERO WIDTH NON-BREAKING
   SPACE" within text. This usage, suggested by Unicode section 2.4 and
   ISO 10646 Annex F (informative), is to prepend a 0xFEFF character to
   a stream of Unicode characters as a "signature"; a receiver of such a
   serialized stream may then use the initial character both as a hint
   that the stream consists of Unicode characters and as a way to
   recognize the serialization order. In serialized UTF-16 prepended
   with such a signature, the order is big-endian if the first two
   octets are 0xFE followed by 0xFF; if they are 0xFF followed by 0xFE,
   the order is little-endian. Note that 0xFFFE is not a Unicode
   character, precisely to preserve the usefulness of 0xFEFF as a
   byte-order mark.

   It is important to understand that the character 0xFEFF appearing at
   any position other than the beginning of a stream MUST be interpreted
   with the semantics for the zero-width non-breaking space, and MUST
   NOT be interpreted as a byte-order mark. The inverse of that
   statement is not always true: the character 0xFEFF in the first
   position of a stream MAY be interpreted as a zero-width non-breaking
   space, and is not always a byte-order mark. For example, if a process
   splits a UTF-16 string into many parts, a part might begin with
   0xFEFF because there was a zero-width non-breaking space at the
   beginning of that substring.

   The Unicode standard further suggests that an initial 0xFEFF
   character may be stripped before processing the text, the rationale
   being that such a character in initial position may be an artifact of
   the encoding (an encoding signature), not a genuine intended "ZERO
   WIDTH NON-BREAKING SPACE".
   Note that such stripping might affect an external process at a
   different layer (such as a digital signature or a count of the
   characters) that is relying on the presence of all characters in the
   stream.

   In particular, in UTF-16 plain text it is likely, but not certain,
   that an initial 0xFEFF is a signature. When concatenating two
   strings, it is important to strip out those signatures, because
   otherwise the resulting string may contain an unintended "ZERO WIDTH
   NON-BREAKING SPACE" at the connection point. Also, some
   specifications mandate an initial 0xFEFF character in objects
   labelled as UTF-16 and specify that this signature is not part of the
   object.

3.3 Choosing a label for UTF-16 text

   Any labelling application that uses UTF-16 character encoding, and
   explicitly labels the text, and knows the serialization order of the
   characters in text, SHOULD label the text as either "UTF-16BE" or
   "UTF-16LE", whichever is appropriate based on the endianness of the
   text. This allows applications processing the text, but unable to
   look inside the text, to know the serialization definitively.

   Text in the "UTF-16BE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in big-endian order.
   Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

   Text in the "UTF-16LE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in little-endian order.
   Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.

   Any labelling application that uses UTF-16 character encoding, and
   puts an explicit charset label on the text, and does not know the
   serialization order of the characters in text, MUST label the text as
   "UTF-16", and SHOULD make sure the text starts with 0xFEFF.
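
   The labelling rules above can be sketched in C roughly as follows.
   This is only an illustration: the names byte_order, label_choice and
   choose_label are hypothetical and not defined by this document, and
   the sketch does not model document formats that impose their own
   requirements on the BOM.

```c
#include <assert.h>

/* Hypothetical: what the labelling application knows about the
   serialization order of the text it is about to label. */
enum byte_order { ORDER_BIG, ORDER_LITTLE, ORDER_UNKNOWN };

struct label_choice {
    const char *charset;  /* MIME charset label to apply */
    int bom_expected;     /* nonzero: text SHOULD start with 0xFEFF;
                             zero: a BOM MUST NOT be prepended */
};

static struct label_choice choose_label(enum byte_order order)
{
    struct label_choice c;

    switch (order) {
    case ORDER_BIG:            /* order known: SHOULD use "UTF-16BE" */
        c.charset = "UTF-16BE";
        c.bom_expected = 0;
        break;
    case ORDER_LITTLE:         /* order known: SHOULD use "UTF-16LE" */
        c.charset = "UTF-16LE";
        c.bom_expected = 0;
        break;
    default:                   /* order unknown: MUST use "UTF-16" */
        assert(order == ORDER_UNKNOWN);
        c.charset = "UTF-16";
        c.bom_expected = 1;
        break;
    }
    return c;
}
```

   A caller that knows it produced big-endian text would call
   choose_label(ORDER_BIG) and emit the text without a BOM.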

   An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
   would occur with document formats that mandate a BOM in UTF-16 text,
   thereby requiring the use of the "UTF-16" tag only.

4. Interpreting text labels

   When a program sees text labelled as "UTF-16BE", "UTF-16LE", or
   "UTF-16", it can make some assumptions, based on the labelling rules
   given in the previous section. These assumptions allow the program to
   then process the text.

4.1 Interpreting text labelled as UTF-16BE

   Text labelled "UTF-16BE" can always be interpreted as being
   big-endian. The detection of an initial BOM does not affect
   deserialization of text labelled as UTF-16BE. Finding 0xFF followed
   by 0xFE is an error since there is no Unicode character 0xFFFE.

4.2 Interpreting text labelled as UTF-16LE

   Text labelled "UTF-16LE" can always be interpreted as being
   little-endian. The detection of an initial BOM does not affect
   deserialization of text labelled as UTF-16LE. Finding 0xFE followed
   by 0xFF is an error since there is no Unicode character 0xFFFE, which
   would be the interpretation of those octets under little-endian
   order.

4.3 Interpreting text labelled as UTF-16

   Text labelled with the "UTF-16" charset might be serialized in either
   big-endian or little-endian order. If the first two octets of the
   text are 0xFE followed by 0xFF, then the text can be interpreted as
   being big-endian. If the first two octets of the text are 0xFF
   followed by 0xFE, then the text can be interpreted as being
   little-endian. If the first two octets of the text are neither 0xFE
   followed by 0xFF nor 0xFF followed by 0xFE, then the text SHOULD be
   interpreted as being big-endian.
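
   As an illustration, the check on the first two octets of text
   labelled "UTF-16" might be sketched in C as follows. The names
   utf16_order and utf16_detect_order are hypothetical, and treating
   text shorter than two octets as big-endian without a BOM is an
   assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum utf16_order { UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Inspect the first two octets of text labelled "UTF-16".
   Returns the serialization order to use, and sets *bom_len to the
   number of leading octets that form a BOM (2 or 0). */
static enum utf16_order utf16_detect_order(const unsigned char *octets,
                                           size_t len, size_t *bom_len)
{
    assert(octets != NULL || len == 0);
    assert(bom_len != NULL);

    *bom_len = 0;
    if (len >= 2) {
        if (octets[0] == 0xFE && octets[1] == 0xFF) {
            *bom_len = 2;               /* big-endian BOM */
            return UTF16_BIG_ENDIAN;
        }
        if (octets[0] == 0xFF && octets[1] == 0xFE) {
            *bom_len = 2;               /* little-endian BOM */
            return UTF16_LITTLE_ENDIAN;
        }
    }
    /* Not a BOM: the text SHOULD be interpreted as big-endian. */
    return UTF16_BIG_ENDIAN;
}
```

   Reporting the BOM length separately leaves it to the caller to decide
   whether to skip the signature or keep it in the character stream.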

   All applications that process text with the "UTF-16" charset label
   MUST be able to read at least the first two octets of the text and be
   able to process those octets in order to determine the serialization
   order of the text. Applications that process text with the "UTF-16"
   charset label MUST NOT assume the serialization without first
   checking the first two octets to see if they are a big-endian BOM, a
   little-endian BOM, or not a BOM. All applications that process text
   with the "UTF-16" charset label MUST be able to interpret both
   big-endian and little-endian text.

5. Examples

   For the sake of example, let's suppose that there is a hieroglyphic
   character representing the Egyptian god Ra with character value
   0x12345 (this character does not exist at present in Unicode).

   The examples here all evaluate to the phrase:

   *=Ra

   where the "*" represents the Ra hieroglyph (0x12345).

   Text labelled with UTF-16BE, without a BOM:
   D8 08 DF 45 00 3D 00 52 00 61

   Text labelled with UTF-16LE, without a BOM:
   08 D8 45 DF 3D 00 52 00 61 00
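
   The octets in these examples can be reproduced with a short C sketch
   of the encoding and decoding steps from Sections 2.1 and 2.2. The
   function names utf16_encode, utf16_decode and utf16_serialize are
   hypothetical, and returning 0 to signal an error is a choice made for
   this sketch, since error handling is not specified by this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Section 2.1: encode character value u into out[0..1].
   Returns the number of 16-bit integers written (1 or 2), or 0 if
   u is greater than 0x10FFFF and cannot be encoded. */
static size_t utf16_encode(uint32_t u, uint16_t out[2])
{
    assert(out != NULL);
    if (u > 0x10FFFF)
        return 0;
    if (u < 0x10000) {                            /* step 1 */
        out[0] = (uint16_t)u;
        return 1;
    }
    uint32_t v = u - 0x10000;                     /* step 2: U' */
    out[0] = (uint16_t)(0xD800 | (v >> 10));      /* W1: high 10 bits */
    out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));    /* W2: low 10 bits */
    return 2;
}

/* Section 2.2: decode one character from w[0..n-1] into *u.
   Returns the number of 16-bit integers consumed (1 or 2), or 0 if
   the sequence is in error. */
static size_t utf16_decode(const uint16_t *w, size_t n, uint32_t *u)
{
    assert(w != NULL && u != NULL);
    if (n == 0)
        return 0;
    if (w[0] < 0xD800 || w[0] > 0xDFFF) {         /* step 1 */
        *u = w[0];
        return 1;
    }
    if (w[0] > 0xDBFF)                            /* step 2 */
        return 0;
    if (n < 2 || w[1] < 0xDC00 || w[1] > 0xDFFF)  /* step 3 */
        return 0;
    *u = 0x10000 + (((uint32_t)(w[0] & 0x3FF) << 10)  /* steps 4-5 */
                    | (uint32_t)(w[1] & 0x3FF));
    return 2;
}

/* Serialize n 16-bit integers into 2*n octets in the requested order.
   Returns the number of octets written. */
static size_t utf16_serialize(const uint16_t *w, size_t n,
                              int big_endian, unsigned char *out)
{
    assert(w != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned char hi = (unsigned char)(w[i] >> 8);
        unsigned char lo = (unsigned char)(w[i] & 0xFF);
        out[2 * i]     = big_endian ? hi : lo;
        out[2 * i + 1] = big_endian ? lo : hi;
    }
    return 2 * n;
}
```

   Encoding 0x12345 with utf16_encode yields the pair 0xD808, 0xDF45;
   serializing that pair gives D8 08 DF 45 in big-endian order and
   08 D8 45 DF in little-endian order, matching the first four octets of
   the examples.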
