Network Working Group                                            HF. Zhu
Request for Comments: 1922                                    Tsinghua U
Category: Informational                                           DY. Hu
                                                              Tsinghua U
                                                                ZG. Wang
                                                                    CITS
                                                                 TC. Kao
                                                                     III
                                                              WCH. Chang
                                                                     III
                                                              M. Crispin
                                                            U Washington
                                                              March 1996


            Chinese Character Encoding for Internet Messages

Status of this Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard.  Distribution of this memo is
   unlimited.

Abstract

   This memo describes methods of transporting Chinese characters in
   Internet services which transport text, such as electronic mail
   [RFC-822], network news [RFC-1036], telnet [RFC-854] and the World
   Wide Web [RFC-1866].

Introduction

   As the Internet reaches more and more Chinese people around the
   world, the need to send documents containing Chinese characters over
   the Internet has grown.  The methods described in this document
   provide a means of transporting existing Chinese character sets while
   leaving room for future extension.

   This document describes two encodings, ISO-2022-CN and
   ISO-2022-CN-EXT.  These are designed with interoperability in mind
   and are encouraged in this document for current Chinese interchange;
   they are 7-bit, support both simplified and traditional characters
   using both GB and CNS/Big5, and do not impose any unusual quoting
   requirements on ASCII characters.

   As important related issues, this document gives detailed
   descriptions of the two encodings CN-GB and CN-Big5, and a brief
   description of ISO/IEC 10646 [ISO-10646].  CN-GB and CN-Big5 are



Zhu, et al                   Informational                      [Page 1]

RFC 1922               Chinese Character Encoding             March 1996


   currently used as the internal codes for Chinese documents.
   ISO-10646 is the universal multi-octet character set defined by ISO;
   we feel that in the future it may become the preferred technology for
   Chinese documents and electronic mail when it is widely available.

Specification

1.    7-bit Chinese encodings: ISO-2022-CN and ISO-2022-CN-EXT

1.1.  Description

   ISO-2022-CN is based on ISO 2022 [ISO-2022], similar to earlier work
   on ISO-2022-JP [RFC-1468] and ISO-2022-KR [RFC-1557] for the Japanese
   and Korean languages respectively.  It is 7-bit, and supports both
   simplified Chinese characters using GB 2312-80 [GB-2312] and
   traditional Chinese characters using the first two planes of CNS
   11643 [CNS-11643], as well as ASCII [ASCII] characters.

   ISO-2022-CN-EXT is a superset of ISO-2022-CN that additionally
   supports other GB character sets and planes of CNS 11643.

   Since ISO-2022-CN and ISO-2022-CN-EXT are 7-bit encodings, they do
   not require the 8-bit SMTP extensions.  ISO-2022-CN supports all the
   Chinese characters that appear in Big5 [BIG5].

1.2.  ISO-2022-CN

   The starting code of ISO-2022-CN is ASCII.  ASCII and Chinese
   characters are distinguished by designations (ESC sequences) and
   shift functions.

   Designations define the Chinese character sets used in the text.
   There are three kinds of designations: SOdesignation, SS2designation
   and SS3designation.

   The SOdesignation is in the form ESC $ ) <F>, where <F> is the "final
   character" assigned to the character set by ISO (refer to the ISO
   registry [ISOREG] for more details).  The SS2designation is in the
   form ESC $ * <F>, and the SS3designation is in the form ESC $ + <F>.
   A designation overrides any previous designation for subsequent bytes
   in the text.

   There are four kinds of shifts: SI, SO, SS2 and SS3.  Shift functions
   specify how to interpret the subsequent bytes.

   The shift SI (one byte with hexadecimal value 0F) declares that
   subsequent bytes are interpreted in ASCII.

   The shift SO (one byte with hexadecimal value 0E) declares that
   subsequent bytes are interpreted in the character set defined by
   SOdesignation.

   The shift SS2 (two bytes with hexadecimal values 1B 4E) declares that
   the subsequent TWO bytes are interpreted in the character set defined
   by SS2designation, after which the previous interpretation (from SI
   or SO) is restored.

   The shift SS3 (two bytes with hexadecimal values 1B 4F) declares that
   the subsequent TWO bytes are interpreted in the character set defined
   by SS3designation, after which the previous interpretation (from SI
   or SO) is restored.

   The escape sequences, shift functions and character sets used in an
   ISO-2022-CN text are as follows:

    Character sets                                       Shift in with
   --------------------------------------------------------------------
     ASCII                                                     SI
     GB 2312, CNS 11643-plane-1                                SO
              CNS 11643-plane-2                                SS2

      ESC $ ) A         Indicates the bytes following SO are Chinese
                        characters as defined in GB 2312-80, until
                        another SOdesignation appears

      ESC $ ) G         Indicates the bytes following SO are as defined
                        in CNS 11643-plane-1, until another
                        SOdesignation appears

      ESC $ * H         Indicates that the two bytes immediately
                        following SS2 form a Chinese character as
                        defined in CNS 11643-plane-2, until another
                        SS2designation appears

   If there are any GB or CNS characters on a line, a designation for
   the corresponding character set must be used on that line, so that
   each line carries its own character set information and the text can
   be displayed correctly when scrolling back in a window.  Also, there
   must be a shift to ASCII (SI) before the end of the line (i.e.,
   before the CRLF).  In other words, each line starts in ASCII, and
   ends in ASCII.
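
   The per-line rule above can be sketched as a small encoder.  This is
   only an illustration, not part of the specification: the function
   name encode_gb_line is hypothetical, it handles only ASCII and GB
   2312 text, and it assumes that Python's "gb2312" codec produces
   EUC-CN bytes, whose high bits are cleared to obtain the 7-bit form.

```python
def encode_gb_line(text: str) -> bytes:
    """Encode one line (without CRLF) of ASCII and GB 2312 text."""
    out = bytearray()
    designated = False   # SOdesignation already emitted on this line?
    shifted = False      # currently in SO (two-byte) mode?
    for ch in text:
        if ord(ch) < 0x80:
            if shifted:
                out.append(0x0F)      # SI: return to ASCII
                shifted = False
            out.append(ord(ch))
        else:
            if not designated:
                out += b"\x1b$)A"     # ESC $ ) A designates GB 2312
                designated = True
            if not shifted:
                out.append(0x0E)      # SO: switch to the designated set
                shifted = True
            # Raises UnicodeEncodeError for characters outside GB 2312.
            euc = ch.encode("gb2312")
            out += bytes(b & 0x7F for b in euc)   # clear the high bit
    if shifted:
        out.append(0x0F)              # every line must end in ASCII
    return bytes(out)


encoded = encode_gb_line("\u4ea4\u6362")   # jiao huan, simplified form
print(encoded.hex(" "))
```

   Because the designation is emitted again on every call, each encoded
   line is self-describing, which is the property the rule asks for.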

      Example: the hex sequence

         1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f

      represents the Chinese word for "Interchange" (jiao huan) twice;
      the first time in simplified form using GB-2312 (the 3d 3b 3b 3b
      sequence above), and the second time in traditional form using
      CNS-11643 (the 47 28 5f 50 sequence above).  The sequence 1b 24 29
      41 is the SOdesignation for GB-2312, the 0e is SO to switch to
      Chinese from ASCII, the 1b 24 29 47 is the SOdesignation for
      CNS-11643 plane 1, and finally the 0f is the SI to return to ASCII
      at the end of the line.
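
   The walk-through above can be reproduced with a short sketch that
   splits a line into character set segments.  It is an illustration
   under stated assumptions rather than a conforming decoder: the names
   split_line and gb2312_text are hypothetical, malformed input is not
   handled, and the CNS 11643 bytes are left undecoded because, as far
   as we know, Python's standard library has no CNS 11643 codec.

```python
SO_DESIGNATION = b"\x1b$)"    # ESC $ ) <F>
SS2_DESIGNATION = b"\x1b$*"   # ESC $ * <F>
SS2 = b"\x1bN"                # hexadecimal 1B 4E
SO, SI = 0x0E, 0x0F

SO_SETS = {ord("A"): "GB2312", ord("G"): "CNS11643-1"}
SS2_SETS = {ord("H"): "CNS11643-2"}


def split_line(data: bytes):
    """Split one ISO-2022-CN line into (charset, raw bytes) segments."""
    segments = []
    so_set = ss2_set = None   # sets chosen by the latest designations
    shifted = False           # True after SO, False after SI
    i = 0

    def emit(charset, chunk):
        # Merge with the previous segment when the charset is unchanged.
        if segments and segments[-1][0] == charset:
            segments[-1] = (charset, segments[-1][1] + chunk)
        else:
            segments.append((charset, chunk))

    while i < len(data):
        if data.startswith(SO_DESIGNATION, i):
            so_set = SO_SETS[data[i + 3]]
            i += 4
        elif data.startswith(SS2_DESIGNATION, i):
            ss2_set = SS2_SETS[data[i + 3]]
            i += 4
        elif data.startswith(SS2, i):
            emit(ss2_set, data[i + 2:i + 4])   # exactly two bytes
            i += 4
        elif data[i] == SO:
            shifted = True
            i += 1
        elif data[i] == SI:
            shifted = False
            i += 1
        elif shifted:
            emit(so_set, data[i:i + 2])        # one two-byte character
            i += 2
        else:
            emit("ASCII", data[i:i + 1])
            i += 1
    return segments


def gb2312_text(raw: bytes) -> str:
    # Setting the high bit of each byte yields EUC-CN, which Python's
    # "gb2312" codec is assumed to accept.
    return bytes(b | 0x80 for b in raw).decode("gb2312")


line = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f")
for charset, raw in split_line(line):
    print(charset, raw.hex(" "))
```

   Note that the second SOdesignation takes effect without a new SO,
   since the line is still shifted out when it appears.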

   The name given to this character encoding is "ISO-2022-CN". This name
   is intended to be used as the "charset" parameter in MIME [MIME-1,
   MIME-2] messages.

      Content-Type: text/plain; charset=iso-2022-cn

   The ISO-2022-CN encoding is already in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.

   Other restrictions are given in the "Formal Syntax of ISO-2022-CN"
   (Section 7.1 of this document).

1.3.  ISO-2022-CN-EXT

   ISO-2022-CN-EXT supports all characters in existing GB, Big5 and CNS
   11643 character sets.

   The escape sequences, shift functions and character sets used in an
   ISO-2022-CN-EXT text are as follows:

    Character sets                                       Shift in with
   --------------------------------------------------------------------
     ASCII                                                    SI
     GB 2312, GB 12345, CNS 11643-plane-1, ISO-IR-165         SO
     GB 7589, GB 13131, CNS 11643-plane-2                     SS2
     GB 7590, GB 13132 or other new GBs, CNS 11643-plane-3    SS3
      or higher planes of CNS 11643

      Note: Currently, there are some GB sets that have not been
      registered in ISO. Here <X7589>, <X7590>, <X12345>, <X13131> and
      <X13132> represent the final character that will be assigned by
      ISO for those sets.  These GB sets shall only be used once these
      final characters are assigned.

      ESC $ ) A         Indicates the bytes following SO are Chinese
                        characters as defined in GB 2312-80, until
                        another SOdesignation appears

      ESC $ * <X7589>   Indicates that the two bytes immediately
                        following SS2 form a Chinese character as
                        defined in GB 7589-87 [GB-7589], until another
                        SS2designation appears

      ESC $ + <X7590>   Indicates that the two bytes immediately
                        following SS3 form a Chinese character as
                        defined in GB 7590-87 [GB-7590], until another
                        SS3designation appears

      ESC $ ) <X12345>  Indicates the bytes following SO are as defined
                        in GB 12345-90 [GB-12345], until another
                        SOdesignation appears

      ESC $ * <X13131>  Indicates that the two bytes immediately
                        following SS2 form a Chinese character as
                        defined in GB 13131-91 [GB-13131], until
                        another SS2designation appears

      ESC $ + <X13132>  Indicates that the two bytes immediately
                        following SS3 form a Chinese character as
                        defined in GB 13132-91 [GB-13132], until
                        another SS3designation appears

      ESC $ ) E         Indicates the bytes following SO are as defined
                        in ISO-IR-165 (for details, see section 2.1),
                        until another SOdesignation appears

      ESC $ ) G         Indicates the bytes following SO are as defined
                        in CNS 11643-plane-1, until another
                        SOdesignation appears

      ESC $ * H         Indicates that the two bytes immediately
                        following SS2 form a Chinese character as
                        defined in CNS 11643-plane-2, until another
                        SS2designation appears

      ESC $ + I         Indicates that the two bytes immediately
                        following SS3 form a Chinese character as
                        defined in CNS 11643-plane-3, until another
                        SS3designation appears

      ESC $ + J         Indicates that the two bytes immediately
                        following SS3 form a Chinese character as
                        defined in CNS 11643-plane-4, until another
                        SS3designation appears

      ESC $ + K         Indicates that the two bytes immediately
                        following SS3 form a Chinese character as
                        defined in CNS 11643-plane-5, until another
                        SS3designation appears

      ESC $ + L         Indicates that the two bytes immediately
                        following SS3 form a Chinese character as
                        defined in CNS 11643-plane-6, until another
                        SS3designation appears

      ESC $ + M         Indicates that the two bytes immediately
                        following SS3 form a Chinese character as
                        defined in CNS 11643-plane-7, until another
                        SS3designation appears

   As in ISO-2022-CN, each line starts in ASCII, and ends in ASCII, and
   has its own designation information before any Chinese characters
   appear.
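
   For implementors, the registered escape sequences of this section
   can be collected in a lookup table.  The following is a minimal
   sketch: the names DESIGNATIONS and lookup_designation are
   hypothetical, and the GB sets shown with <X...> placeholders are
   deliberately omitted because no final character is given for them
   here.

```python
# Designations of ISO-2022-CN-EXT that have an assigned final character.
# Each key is the full 4-byte escape sequence ESC $ <I> <F>; each value
# is (shift function that invokes the set, character set name).
DESIGNATIONS = {
    b"\x1b$)A": ("SO", "GB 2312-80"),
    b"\x1b$)E": ("SO", "ISO-IR-165"),
    b"\x1b$)G": ("SO", "CNS 11643-plane-1"),
    b"\x1b$*H": ("SS2", "CNS 11643-plane-2"),
}

# CNS 11643 planes 3 to 7 use the finals I to M and are invoked by SS3.
for plane, final in zip(range(3, 8), "IJKLM"):
    DESIGNATIONS[b"\x1b$+" + final.encode("ascii")] = (
        "SS3", f"CNS 11643-plane-{plane}")


def lookup_designation(seq: bytes):
    """Return (shift, character set) for a designation, or None."""
    return DESIGNATIONS.get(seq)


print(lookup_designation(bytes.fromhex("1b 24 2b 4d")))
```

   A decoder would consult such a table whenever it meets ESC $ and
   would reject any sequence that the lookup does not recognize.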

   The name given to this character encoding is "ISO-2022-CN-EXT". This
   name is intended to be used as the "charset" parameter in MIME
   messages.

      Content-Type: text/plain; charset=ISO-2022-CN-EXT

   The ISO-2022-CN-EXT encoding is also in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.

   Other restrictions are given in the "Formal Syntax of
   ISO-2022-CN-EXT" (Section 7.2 of this document).

1.4.  How to Support Big5 or other internal codesets with ISO-2022-CN
      and ISO-2022-CN-EXT

   There are many different Chinese internal coding systems [CJKINF],
   such as EUC GB, Big5, CCCII (an encoding for library systems mainly
   used in Taiwan) and GBK (the new standard specification for Chinese
   internal code, which is also the codepage for Microsoft simplified
   Chinese Windows 95).  Because ISO-2022-CN and ISO-2022-CN-EXT are
   7-bit and do not lose information during communication among
   different codesets, they facilitate interchange between the various
   Chinese coding systems in the Internet.

   For instance, ISO-2022-CN and ISO-2022-CN-EXT can be used to support
   the popular Big5 codeset, because the first two planes of CNS-11643
   contain the same Chinese characters as Big5's "common part" except
   for two duplicate characters.  By the "common part" we mean the part
   that is not specific to any Big5 vendor, consisting of 5401 more
   frequently used characters in Big5 range 0xA440-0xC67E, 7652 less
   frequently used characters in Big5 range 0xC940-0xF9D5, and 441 other
   symbols in Big5 range 0xA140-0xA3E0, as defined in the Institute for
   Information Industry's (III) technical report C-26 (see also [BIG5]).
   The appendix of this document presents a conversion table for
   converting Big5 into CNS-11643, including specific extensions of some
   popular vendors.  For other extensions, vendors and implementors of
   Big5 products are ENCOURAGED to create detailed conversion tables, in
   order to increase interoperability between different coding systems.
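
   As a first step before table lookup, a converter can check whether a
   Big5 code falls in the "common part" ranges quoted above.  The
   sketch below is illustrative only: the function name
   big5_common_part is hypothetical, and the trail byte check (0x40 to
   0x7E or 0xA1 to 0xFE) is an assumption about Big5 layout that is not
   stated in this document.  It does not perform the Big5 to CNS-11643
   mapping itself, which requires the conversion table.

```python
# Ranges of the Big5 "common part" as quoted in the text above.
BIG5_COMMON_RANGES = [
    (0xA140, 0xA3E0, "symbol"),
    (0xA440, 0xC67E, "frequent"),
    (0xC940, 0xF9D5, "less frequent"),
]


def big5_common_part(code: int):
    """Return the common-part category of a 2-byte Big5 code, or None."""
    trail = code & 0xFF
    # Assumed Big5 trail byte layout; codes outside it are not valid.
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    for low, high, label in BIG5_COMMON_RANGES:
        if low <= code <= high:
            return label
    return None   # vendor-specific extension or unassigned area


for code in (0xA140, 0xA440, 0xC940, 0xC6A1):
    print(hex(code), big5_common_part(code))
```

   Codes for which the function returns None are the ones that need a
   vendor-specific conversion table.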

   Public domain software (binary or C source code) for conversion
   between Big5 and CNS-11643 is available on many Internet sites.  At
   the time of this writing, the following FTP sites and software are
   advertised:

   1) Beijing:
      ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/convert/big5cns.zip
      (IP address: 166.111.1.6)

   2) Xi'an:
      ftp://ftp.xanet.edu.cn
      /pub/chinese-soft/unix/convert/BeTTY-1.534.tar.gz
      (IP address: 202.112.11.131)

   3) Taiwan:
      ftp://ftp.seed.net.tw/Pub/Chinese/DOS/code-convert/chcode.zip
      (IP address: 140.92.1.65)

   4) US:
      ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz
      (IP address: 128.123.1.55)
