Network Working Group                                            HF. Zhu
Request for Comments: 1922                                    Tsinghua U
Category: Informational                                           DY. Hu
                                                              Tsinghua U
                                                                ZG. Wang
                                                                    CITS
                                                                 TC. Kao
                                                                     III
                                                              WCH. Chang
                                                                     III
                                                              M. Crispin
                                                            U Washington
                                                              March 1996


            Chinese Character Encoding for Internet Messages

Status of this Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard.  Distribution of this memo is
   unlimited.

Abstract

   This memo describes methods of transporting Chinese characters in
   Internet services which transport text, such as electronic mail
   [RFC-822], network news [RFC-1036], telnet [RFC-854] and the World
   Wide Web [RFC-1866].

Introduction

   As use of the Internet spreads among Chinese speakers around the
   world, the need to send documents containing Chinese characters
   over the Internet has grown.  The methods described in this
   document provide a means of transporting existing Chinese character
   sets while leaving room for future extension.

   This document describes two encodings, ISO-2022-CN and
   ISO-2022-CN-EXT.  These are designed with interoperability in mind
   and are encouraged in this document for current Chinese
   interchange; they are 7-bit, support both simplified and
   traditional characters using both GB and CNS/Big5, and do not
   impose any unusual quoting requirements on ASCII characters.

   As important related issues, this document gives detailed
   descriptions of the two encodings CN-GB and CN-Big5, and a brief
   description of ISO/IEC 10646 [ISO-10646].  CN-GB and CN-Big5 are
   currently used as the internal codes for Chinese documents.
   ISO-10646 is the universal multi-octet character set defined by
   ISO; we feel that in the future it may become the preferred
   technology for Chinese documents and electronic mail when it is
   widely available.

Specification

1. 7-bit Chinese encodings: ISO-2022-CN and ISO-2022-CN-EXT

1.1. Description

   ISO-2022-CN is based on ISO 2022 [ISO-2022], similar to earlier
   work on ISO-2022-JP [RFC-1468] and ISO-2022-KR [RFC-1557] for the
   Japanese and Korean languages respectively.  It is 7-bit, and
   supports both simplified Chinese characters using GB 2312-80
   [GB-2312] and traditional Chinese characters using the first two
   planes of CNS 11643 [CNS-11643], as well as ASCII [ASCII]
   characters.

   ISO-2022-CN-EXT is a superset of ISO-2022-CN that additionally
   supports other GB character sets and planes of CNS 11643.

   Since ISO-2022-CN and ISO-2022-CN-EXT are 7-bit encodings, they do
   not require the 8-bit SMTP extensions.  ISO-2022-CN supports all
   the Chinese characters that appear in Big5 [BIG5].

1.2. ISO-2022-CN

   The starting code of ISO-2022-CN is ASCII.  ASCII and Chinese
   characters are distinguished by designations (ESC sequences) and
   shift functions.

   Designations define the Chinese character sets used in the text.
   There are three kinds of designations: SOdesignation,
   SS2designation and SS3designation.  The SOdesignation is in the
   form ESC $ ) <F>, where <F> is the "final character" assigned to
   the character set by ISO (refer to the ISO registry [ISOREG] for
   more details).  The SS2designation is in the form ESC $ * <F>, and
   the SS3designation is in the form ESC $ + <F>.  A designation
   overrides any previous designation for subsequent bytes in the
   text.

   There are four kinds of shifts: SI, SO, SS2 and SS3.  Shift
   functions specify how to interpret the subsequent bytes.

   The shift SI (one byte with hexadecimal value 0F) declares that
   subsequent bytes are interpreted in ASCII.

   The shift SO (one byte with hexadecimal value 0E) declares that
   subsequent bytes are interpreted in the character set defined by
   SOdesignation.

   The shift SS2 (two bytes with hexadecimal values 1B 4E) declares
   that the subsequent TWO bytes are interpreted in the character set
   defined by SS2designation, after which the previous interpretation
   (from SI or SO) is restored.

   The shift SS3 (two bytes with hexadecimal values 1B 4F) declares
   that the subsequent TWO bytes are interpreted in the character set
   defined by SS3designation, after which the previous interpretation
   (from SI or SO) is restored.

   The escape sequences, shift functions and character sets used in an
   ISO-2022-CN text are as follows:

      Character sets                            Shift in with
      ---------------------------------------------------------------
      ASCII                                     SI
      GB 2312, CNS 11643-plane-1                SO
      CNS 11643-plane-2                         SS2

   ESC $ ) A          Indicates the bytes following SO are Chinese
                      characters as defined in GB 2312-80, until
                      another SOdesignation appears

   ESC $ ) G          Indicates the bytes following SO are as defined
                      in CNS 11643-plane-1, until another
                      SOdesignation appears

   ESC $ * H          Indicates the two bytes immediately following
                      SS2 are a Chinese character as defined in CNS
                      11643-plane-2, until another SS2designation
                      appears

   If there are any GB or CNS characters on a line, a designation for
   the corresponding character set must appear on that line, so that
   each line carries its own character set information and the text
   can be displayed correctly when scrolling back in a window.  Also,
   there must be a shift to ASCII (SI) before the end of the line
   (i.e., before the CRLF).  In other words, each line starts in
   ASCII, and ends in ASCII.

   Example: the hex sequence 1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47
   28 5f 50 0f represents the Chinese word for "Interchange" (jiao
   huan) twice; the first time in simplified form using GB-2312 (the
   3d 3b 3b 3b sequence above), and the second time in traditional
   form using CNS-11643 (the 47 28 5f 50 sequence above).

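   The designation and shift rules above can be traced with a small
   state machine.  The following Python sketch is illustrative only and
   is not part of this specification: the names split_iso2022cn_line
   and gb2312_to_text and the charset labels are hypothetical, it
   handles only the ISO-2022-CN subset (no SS3), and it does not
   validate byte ranges or CRLF handling.  GB 2312 pairs are converted
   by setting the high bit and using Python's "gb2312" codec; the
   standard library does not appear to ship a CNS 11643 codec, so
   those bytes are left undecoded.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Final characters taken from the ISO-2022-CN designation list above.
SO_SETS = {b"A": "GB2312", b"G": "CNS11643-1"}
SS2_SETS = {b"H": "CNS11643-2"}


def _append(runs, charset, chunk):
    """Add chunk to the last run if it has the same charset, else start a new run."""
    if runs and runs[-1][0] == charset:
        runs[-1] = (charset, runs[-1][1] + chunk)
    else:
        runs.append((charset, chunk))


def split_iso2022cn_line(data: bytes):
    """Split one ISO-2022-CN line into a list of (charset, bytes) runs."""
    runs = []
    so_set = ss2_set = None      # current SOdesignation / SS2designation
    shifted = False              # True after SO, False after SI
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            intro, final = data[i + 1:i + 3], data[i + 3:i + 4]
            if intro == b"$)" and final in SO_SETS:      # SOdesignation
                so_set = SO_SETS[final]
                i += 4
            elif intro == b"$*" and final in SS2_SETS:   # SS2designation
                ss2_set = SS2_SETS[final]
                i += 4
            elif data[i + 1:i + 2] == b"N" and ss2_set:  # SS2: next two bytes only
                _append(runs, ss2_set, data[i + 2:i + 4])
                i += 4
            else:
                raise ValueError(f"unsupported escape sequence at offset {i}")
        elif byte == SO:
            if so_set is None:
                raise ValueError("SO without a preceding SOdesignation")
            shifted = True
            i += 1
        elif byte == SI:
            shifted = False
            i += 1
        elif shifted:
            _append(runs, so_set, data[i:i + 2])         # two-byte character
            i += 2
        else:
            _append(runs, "ASCII", data[i:i + 1])
            i += 1
    return runs


def gb2312_to_text(chunk: bytes) -> str:
    """Decode 7-bit GB 2312 byte pairs by setting the high bit (EUC-CN form)."""
    return bytes(b | 0x80 for b in chunk).decode("gb2312")


if __name__ == "__main__":
    line = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f")
    for charset, chunk in split_iso2022cn_line(line):
        print(charset, chunk.hex(" "))
```

   Run on the example sequence, the sketch yields one GB2312 run (3d 3b
   3b 3b) and one CNS11643-1 run (47 28 5f 50).  Note that the second
   SOdesignation takes effect without a new SO, because the SO state
   persists until SI.
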
   The sequence 1b 24 29 41 is the SOdesignation for GB-2312, the 0e
   is SO to switch to Chinese from ASCII, the 1b 24 29 47 is the
   SOdesignation for CNS-11643 plane 1, and finally the 0f is the SI
   to return to ASCII at the end of the line.

   The name given to this character encoding is "ISO-2022-CN".  This
   name is intended to be used as the "charset" parameter in MIME
   [MIME-1, MIME-2] messages.

      Content-Type: text/plain; charset=iso-2022-cn

   The ISO-2022-CN encoding is already in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.

   Other restrictions are given in the "Formal Syntax of ISO-2022-CN"
   (Section 7.1 of this document).

1.3. ISO-2022-CN-EXT

   ISO-2022-CN-EXT supports all characters in existing GB, Big5 and
   CNS 11643 character sets.  The escape sequences, shift functions
   and character sets used in an ISO-2022-CN-EXT text are as follows:

      Character sets                            Shift in with
      ---------------------------------------------------------------
      ASCII                                     SI
      GB 2312, GB 12345, CNS 11643-plane-1,
      ISO-IR-165                                SO
      GB 7589, GB 13131, CNS 11643-plane-2      SS2
      GB 7590, GB 13132 or other new GBs,
      CNS 11643-plane-3 or higher planes
      of CNS 11643                              SS3

   Note: Currently, there are some GB sets that have not been
   registered in ISO.  Here <X7589>, <X7590>, <X12345>, <X13131> and
   <X13132> represent the final character that will be assigned by ISO
   for those sets.
   These GB sets shall only be used once these final characters are
   assigned.

   ESC $ ) A          Indicates the bytes following SO are Chinese
                      characters as defined in GB 2312-80, until
                      another SOdesignation appears

   ESC $ * <X7589>    Indicates the two bytes immediately following
                      SS2 are a Chinese character as defined in GB
                      7589-87 [GB-7589], until another SS2designation
                      appears

   ESC $ + <X7590>    Indicates the two bytes immediately following
                      SS3 are a Chinese character as defined in GB
                      7590-87 [GB-7590], until another SS3designation
                      appears

   ESC $ ) <X12345>   Indicates the bytes following SO are as defined
                      in GB 12345-90 [GB-12345], until another
                      SOdesignation appears

   ESC $ * <X13131>   Indicates the two bytes immediately following
                      SS2 are a Chinese character as defined in GB
                      13131-91 [GB-13131], until another
                      SS2designation appears

   ESC $ + <X13132>   Indicates the two bytes immediately following
                      SS3 are a Chinese character as defined in GB
                      13132-91 [GB-13132], until another
                      SS3designation appears

   ESC $ ) E          Indicates the bytes following SO are as defined
                      in ISO-IR-165 (for details, see section 2.1),
                      until another SOdesignation appears

   ESC $ ) G          Indicates the bytes following SO are as defined
                      in CNS 11643-plane-1, until another
                      SOdesignation appears

   ESC $ * H          Indicates the two bytes immediately following
                      SS2 are a Chinese character as defined in CNS
                      11643-plane-2, until another SS2designation
                      appears

   ESC $ + I          Indicates the two bytes immediately following
                      SS3 are a Chinese character as defined in CNS
                      11643-plane-3, until another SS3designation
                      appears

   ESC $ + J          Indicates the two bytes immediately following
                      SS3 are a Chinese character as defined in CNS
                      11643-plane-4, until another SS3designation
                      appears

   ESC $ + K          Indicates the two bytes immediately following
                      SS3 are a Chinese character as defined in CNS
                      11643-plane-5, until another SS3designation
                      appears

   ESC $ + L          Indicates the two bytes immediately following
                      SS3 are a Chinese character as defined in CNS
                      11643-plane-6, until another SS3designation
                      appears

   ESC $ + M          Indicates the two bytes immediately following
                      SS3 are a Chinese character as defined in CNS
                      11643-plane-7, until another SS3designation
                      appears

   As in ISO-2022-CN, each line starts in ASCII, and ends in ASCII,
   and has its own designation information before any Chinese
   characters appear.

   The name given to this character encoding is "ISO-2022-CN-EXT".
   This name is intended to be used as the "charset" parameter in MIME
   messages.

      Content-Type: text/plain; charset=ISO-2022-CN-EXT

   The ISO-2022-CN-EXT encoding is also in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.

   Other restrictions are given in the "Formal Syntax of
   ISO-2022-CN-EXT" (Section 7.2 of this document).

1.4. How to Support Big5 or other internal codesets with ISO-2022-CN
     and ISO-2022-CN-EXT

   There are many different Chinese internal coding systems [CJKINF],
   such as EUC GB, Big5, CCCII (an encoding for library systems mainly
   used in Taiwan) and GBK (the new standard specification for Chinese
   internal code, which is also the code page for Microsoft simplified
   Chinese Windows 95).  Because ISO-2022-CN and ISO-2022-CN-EXT are
   7-bit and do not lose information during communication among
   different codesets, they facilitate interchange between the various
   Chinese coding systems in the Internet.

   For instance, ISO-2022-CN and ISO-2022-CN-EXT can be used to
   support the popular Big5 codeset, because the first two planes of
   CNS-11643 contain the same Chinese characters as Big5's "common
   part" except for two duplicate characters.
   By the "common part" we mean the part that is not specific to any
   Big5 vendor, consisting of 5401 more frequently used characters in
   Big5 range 0xA440-0xC67E, 7652 less frequently used characters in
   Big5 range 0xC940-0xF9D5, and 441 other symbols in Big5 range
   0xA140-0xA3E0, as defined in Institute for Information Industry's
   (III) technical report C-26 (see also [BIG5]).

   The appendix of this document presents a conversion table for
   converting Big5 into CNS-11643, including specific extensions of
   some popular vendors.  For other extensions, vendors and
   implementors of Big5 products are ENCOURAGED to create detailed
   conversion tables, in order to increase interoperability between
   different coding systems.

   Public domain software (binary or C source code) for conversion
   between Big5 and CNS-11643 is available on many Internet sites.  At
   the time of this writing, the following FTP sites and software are
   advertised:

   1) Beijing:
      ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/convert/big5cns.zip
      (IP address: 166.111.1.6)

   2) Xi'an:
      ftp://ftp.xanet.edu.cn/pub/chinese-soft/unix/convert/BeTTY-1.534.tar.gz
      (IP address: 202.112.11.131)

   3) Taiwan:
      ftp://ftp.seed.net.tw/Pub/Chinese/DOS/code-convert/chcode.zip
      (IP address: 140.92.1.65)

   4) US:
      ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz
      (IP address: 128.123.1.55)
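
   The three Big5 "common part" ranges given in Section 1.4 can be
   expressed as a small lookup.  The following Python sketch is
   illustrative only: the names BIG5_COMMON_RANGES and
   big5_common_part and the range labels are hypothetical, and the
   trail-byte check (0x40-0x7E and 0xA1-0xFE) is an assumption drawn
   from common Big5 practice rather than from this document.  It
   classifies a 16-bit Big5 code; it does not perform the Big5 to
   CNS-11643 conversion, which requires the table in the appendix.

```python
# (first code, last code, label) for each common-part range in Section 1.4.
BIG5_COMMON_RANGES = [
    (0xA140, 0xA3E0, "symbols"),
    (0xA440, 0xC67E, "frequently used"),
    (0xC940, 0xF9D5, "less frequently used"),
]


def big5_common_part(code: int):
    """Return the label of the common-part range holding a Big5 code, or None."""
    trail = code & 0xFF
    # Assumption: valid Big5 trail bytes are 0x40-0x7E and 0xA1-0xFE.
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    for first, last, label in BIG5_COMMON_RANGES:
        if first <= code <= last:
            return label
    return None  # vendor-specific extension or unassigned code


if __name__ == "__main__":
    for code in (0xA140, 0xA440, 0xC940, 0xC6A1):
        print(hex(code), big5_common_part(code))
```

   A code that falls outside all three ranges, such as 0xC6A1, is
   reported as None; such codes are where vendor-specific conversion
   tables would be needed.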