rfc2781.txt

来自「RFC 的详细文档!」· 文本 代码 · 共 788 行 · 第 1/2 页

TXT
788
字号

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


   Big-endian text labelled with UTF-16, with a BOM:
   FE FF D8 08 DF 45 00 3D 00 52 00 61

   Little-endian text labelled with UTF-16, with a BOM:
   FF FE 08 D8 45 DF 3D 00 52 00 61 00

6. Versions of the standards

   ISO/IEC 10646 is updated from time to time by published amendments;
   similarly, different versions of the Unicode standard exist: 1.0,
   1.1, 2.0, 2.1, and 3.0 as of this writing. Each new version replaces
   the previous one, but implementations, and more significantly data,
   are not updated instantly.

   In general, the changes amount to adding new characters, which does
   not pose particular problems with old data. Amendment 5 to ISO/IEC
   10646, however, has moved and expanded the Korean Hangul block,
   thereby making any previous data containing Hangul characters invalid
   under the new version. Unicode 2.0 has the same difference from
   Unicode 1.1. The official justification for allowing such an
   incompatible change was that no significant implementations and data
   containing Hangul existed, a statement that is likely to be true but
   remains unprovable. The incident has been dubbed the "Korean mess",
   and the relevant committees have pledged to never, ever again make
   such an incompatible change.

   New versions, and in particular any incompatible changes, have
   consequences regarding MIME character encoding labels, to be
   discussed in Appendix A.

7. IANA Considerations

   IANA is to register the character sets found in Appendixes A.1, A.2,
   and A.3 according to RFC 2278, using registration templates found in
   those appendixes.

8. Security Considerations

   UTF-16 is based on the ISO 10646 character set, which is frequently
   being added to, as described in Section 6 and Appendix A of this
   document. Processors must be able to handle characters that are not
   defined at the time that the processor was created in such a way as
   to not allow an attacker to harm a recipient by including unknown
   characters.

   Processors that handle any type of text, including text encoded as
   UTF-16, must be vigilant in checking for control characters that
   might reprogram a display terminal or keyboard. Similarly, processors



Hoffman & Yergeau            Informational                      [Page 8]

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


   that interpret text entities (such as looking for embedded
   programming code), must be careful not to execute the code without
   first alerting the recipient.

   Text in UTF-16 may contain special characters, such as the OBJECT
   REPLACEMENT CHARACTER (0xFFFC), that might cause external processing,
   depending on the interpretation of the processing program and the
   availability of an external data stream that would be executed. This
   external processing may have side-effects that allow the sender of a
   message to attack the receiving system.

   Implementors of UTF-16 need to consider the security aspects of how
   they handle illegal UTF-16 sequences (that is, sequences involving
   surrogate pairs that have illegal values or unpaired surrogates). It
   is conceivable that in some circumstances an attacker would be able
   to exploit an incautious UTF-16 parser by sending it an octet
   sequence that is not permitted by the UTF-16 syntax, causing it to
   behave in some anomalous fashion.

9. References

   [CHARPOLICY]  Alvestrand, H., "IETF Policy on Character Sets and
                 Languages", BCP 18, RFC 2277, January 1998.

   [CHARSET-REG] Freed, N. and J. Postel, "IANA Charset Registration
                 Procedures", BCP 19, RFC 2278, January 1998.

   [HTTP-1.1]    Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
                 Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext
                 Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

   [ISO-10646]   ISO/IEC 10646-1:1993. International Standard --
                 Information technology -- Universal Multiple-Octet
                 Coded Character Set (UCS) -- Part 1: Architecture and
                 Basic Multilingual Plane. 22 amendments and two
                 technical corrigenda have been published up to now.
                 UTF-16 is described in Annex Q, published as Amendment
                 1. Many other amendments are currently at various
                 stages of standardization. A second edition is in
                 preparation, probably to be published in 2000; in this
                 new edition, UTF-16 will probably be described in Annex
                 C.

   [MUSTSHOULD]  Bradner, S., "Key words for use in RFCs to Indicate
                 Requirement Levels", BCP 14, RFC 2119, March 1997.

   [UNICODE]     The Unicode Consortium, "The Unicode Standard --
                 Version 3.0", ISBN 0-201-61633-5. Described at



Hoffman & Yergeau            Informational                      [Page 9]

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


   <http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

   [UTF-8]       Yergeau, F., "UTF-8, a transformation format of ISO
                 10646", RFC 2279, January 1998.

   [WORKSHOP]    Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
                 Atkinson, R., Crispin., M. and P. Svanberg, "Report of
                 the IAB Character Set Workshop", RFC 2130, April 1997.

10. Acknowledgments

   Deborah Goldsmith wrote a great deal of the initial wording for this
   specification. Martin Duerst proposed numerous significant changes.
   Other significant contributors include:

   Mati Allouche
   Walt Daniels
   Mark Davis
   Ned Freed
   Asmus Freytag
   Lloyd Honomichl
   Dan Kegel
   Murata Makoto
   Larry Masinter
   Markus Scherer
   Keld Simonsen
   Ken Whistler

   Some of the text in this specification was copied from [UTF-8], and
   that document was worked on by many people. Please see the
   acknowledgments section in that document for more people who may have
   contributed indirectly to this document.



















Hoffman & Yergeau            Informational                     [Page 10]

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


A. Charset registrations

   This memo is meant to serve as the basis for registration of three
   MIME charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE",
   "UTF-16LE", and "UTF-16". These strings label objects containing text
   consisting of characters from the repertoire of ISO/IEC 10646
   including all amendments at least up to amendment 5 (Korean block),
   encoded to a sequence of octets using the encoding and serialization
   schemes outlined above.

   Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for
   use in media types under the "text" top-level type, because they do
   not encode line endings in the way required for MIME "text" media
   types. An exception to this is HTTP, which uses a MIME-like
   mechanism, but is exempt from the restrictions on the text top-level
   type (see section 19.4.2 of HTTP 1.1 [HTTP-1.1]).

   It is noteworthy that the labels described here do not contain a
   version identification, referring generically to ISO/IEC 10646. This
   is intentional, the rationale being as follows:

   A MIME charset is designed to give just the information needed to
   interpret a sequence of bytes received on the wire into a sequence of
   characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As
   long as a character set standard does not change incompatibly,
   version numbers serve no purpose, because one gains nothing by
   learning from the tag that newly assigned characters may be received
   that one doesn't know about. The tag itself doesn't teach anything
   about the new characters, which are going to be received anyway.

   Hence, as long as the standards evolve compatibly, the apparent
   advantage of having labels that identify the versions is only that,
   apparent. But there is a disadvantage to such version-dependent
   labels: when an older application receives data accompanied by a
   newer, unknown label, it may fail to recognize the label and be
   completely unable to deal with the data, whereas a generic, known
   label would have triggered mostly correct processing of the data,
   which may well not contain any new characters.

   The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
   change, in principle contradicting the appropriateness of a version
   independent MIME charset as described above. But the compatibility
   problem can only appear with data containing Korean Hangul characters
   encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646
   before amendment 5), and there is arguably no such data to worry
   about, this being the very reason the incompatible change was deemed
   acceptable.




Hoffman & Yergeau            Informational                     [Page 11]

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


   In practice, then, a version-independent label is warranted, provided
   the label is understood to refer to all versions after Amendment 5,
   and provided no incompatible change actually occurs. Should
   incompatible changes occur in a later version of ISO/IEC 10646, the
   MIME charsets defined here will stay aligned with the previous
   version until and unless the IETF specifically decides otherwise.

A.1 Registration for UTF-16BE

   To: ietf-charsets@iana.org
   Subject: Registration of new charset

   Charset name(s): UTF-16BE

   Published specification(s): This specification

   Suitable for use in MIME content types under the
   "text" top-level type: No

   Person & email address to contact for further information:
   Paul Hoffman <phoffman@imc.org>
   Francois Yergeau <fyergeau@alis.com>

A.2 Registration for UTF-16LE

   To: ietf-charsets@iana.org
   Subject: Registration of new charset

   Charset name(s): UTF-16LE

   Published specification(s): This specification

   Suitable for use in MIME content types under the
   "text" top-level type: No

   Person & email address to contact for further information:
   Paul Hoffman <phoffman@imc.org>
   Francois Yergeau <fyergeau@alis.com>

A.3 Registration for UTF-16

   To: ietf-charsets@iana.org
   Subject: Registration of new charset

   Charset name(s): UTF-16

   Published specification(s): This specification




Hoffman & Yergeau            Informational                     [Page 12]

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


   Suitable for use in MIME content types under the
   "text" top-level type: No

   Person & email address to contact for further information:
   Paul Hoffman <phoffman@imc.org>
   Francois Yergeau <fyergeau@alis.com>

Authors' Addresses

   Paul Hoffman
   Internet Mail Consortium
   127 Segre Place
   Santa Cruz, CA  95060 USA

   EMail: phoffman@imc.org


   Francois Yergeau
   Alis Technologies
   100, boul. Alexis-Nihon, Suite 600
   Montreal  QC  H4M 2P2 Canada

   EMail: fyergeau@alis.com




























Hoffman & Yergeau            Informational                     [Page 13]

RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


Full Copyright Statement

   Copyright (C) The Internet Society (2000).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.



















Hoffman & Yergeau            Informational                     [Page 14]


⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?