Network Working Group                                            HF. Zhu
Request for Comments: 1922                                    Tsinghua U
Category: Informational                                           DY. Hu
                                                              Tsinghua U
                                                                ZG. Wang
                                                                    CITS
                                                                 TC. Kao
                                                                     III
                                                              WCH. Chang
                                                                     III
                                                              M. Crispin
                                                            U Washington
                                                              March 1996
Chinese Character Encoding for Internet Messages
Status of this Memo
This memo provides information for the Internet community. It does
not specify an Internet standard. Distribution of this memo is
unlimited.
Abstract
This memo describes methods of transporting Chinese characters in
Internet services which transport text, such as electronic mail
[RFC-822], network news [RFC-1036], telnet [RFC-854] and the World
Wide Web [RFC-1866].
Introduction
As Internet use spreads among Chinese speakers around the world, the
need to send documents containing Chinese characters over the
Internet has grown. The methods described in this document provide a
means of transporting existing Chinese character sets while leaving
room for future extension.
This document describes two encodings, ISO-2022-CN and
ISO-2022-CN-EXT. These are designed with interoperability in mind
and are encouraged in this document for current Chinese interchange;
they are 7-bit, support both simplified and traditional characters
using both GB and CNS/Big5, and do not impose any unusual quoting
requirements on ASCII characters.
As important related issues, this document gives detailed
descriptions of the two encodings CN-GB and CN-Big5, and a brief
description of ISO/IEC 10646 [ISO-10646]. CN-GB and CN-Big5 are
Zhu, et al Informational [Page 1]
RFC 1922 Chinese Character Encoding March 1996
currently used as the internal codes for Chinese documents.
ISO-10646 is the universal multi-octet character set defined by ISO;
we feel that in the future it may become the preferred technology for
Chinese documents and electronic mail when it is widely available.
Specification
1. 7-bit Chinese encodings: ISO-2022-CN and ISO-2022-CN-EXT
1.1. Description
ISO-2022-CN is based on ISO 2022 [ISO-2022], similar to earlier work
on ISO-2022-JP [RFC-1468] and ISO-2022-KR [RFC-1557] for the Japanese
and Korean languages respectively. It is 7-bit, and supports both
simplified Chinese characters using GB 2312-80 [GB-2312] and
traditional Chinese characters using the first two planes of CNS
11643 [CNS-11643], as well as ASCII [ASCII] characters.
ISO-2022-CN-EXT is a superset of ISO-2022-CN that additionally
supports other GB character sets and planes of CNS 11643.
Since ISO-2022-CN and ISO-2022-CN-EXT are 7-bit encodings, they do
not require the 8-bit SMTP extensions. ISO-2022-CN supports all the
Chinese characters that appear in Big5 [BIG5].
1.2. ISO-2022-CN
The starting code of ISO-2022-CN is ASCII. ASCII and Chinese
characters are distinguished by designations (ESC sequences) and
shift functions.
Designations define the Chinese character sets used in the text.
There are three kinds of designations: SOdesignation, SS2designation
and SS3designation.
The SOdesignation is in the form ESC $ ) <F>, where <F> is the "final
character" assigned to the character set by ISO (refer to the ISO
registry [ISOREG] for more details). The SS2designation is in the
form ESC $ * <F>, and the SS3designation is in the form ESC $ + <F>.
A designation overrides any previous designation for subsequent bytes
in the text.
There are four kinds of shifts: SI, SO, SS2 and SS3. Shift functions
specify how to interpret the subsequent bytes.
The shift SI (one byte with hexadecimal value 0F) declares that
subsequent bytes are interpreted in ASCII.
The shift SO (one byte with hexadecimal value 0E) declares that
subsequent bytes are interpreted in the character set defined by
SOdesignation.
The shift SS2 (two bytes with hexadecimal values 1B 4E) declares that
the subsequent TWO bytes are interpreted in the character set defined
by SS2designation, after which the previous interpretation (from SI
or SO) is restored.
The shift SS3 (two bytes with hexadecimal values 1B 4F) declares that
the subsequent TWO bytes are interpreted in the character set defined
by SS3designation, after which the previous interpretation (from SI
or SO) is restored.
The escape sequences, shift functions and character sets used in an
ISO-2022-CN text are as follows:
Character sets                  Shift in with
--------------------------------------------------------------------
ASCII                           SI
GB 2312, CNS 11643-plane-1      SO
CNS 11643-plane-2               SS2

ESC $ ) A       Indicates the bytes following SO are Chinese
                characters as defined in GB 2312-80, until
                another SOdesignation appears
ESC $ ) G       Indicates the bytes following SO are as defined
                in CNS 11643-plane-1, until another
                SOdesignation appears
ESC $ * H       Indicates the two bytes immediately following
                SS2 form a Chinese character as defined in CNS
                11643-plane-2, until another SS2designation
                appears
If there are any GB or CNS characters on a line, a designation for
the corresponding character set must be used so that each line has
its own character set information and the text can be displayed
correctly when scrolled back in a window. Also, there must be a shift
to ASCII (SI) before the end of the line (i.e., before the CRLF). In
other words, each line starts in ASCII, and ends in ASCII.
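
The per-line rules above can be illustrated with a minimal Python
sketch. It covers only ASCII and GB 2312 text; the function name
encode_line_gb is hypothetical, and the sketch relies on Python's
"gb2312" codec producing the EUC form (both bytes with the high bit
set), which is then reduced to 7 bits. It does not implement the full
formal syntax.

```python
ESC = b"\x1b"
SO = b"\x0e"  # shift to the set named by the SOdesignation
SI = b"\x0f"  # shift back to ASCII
SO_DESIGNATE_GB2312 = ESC + b"$)A"


def encode_line_gb(line: str) -> bytes:
    """Encode one line (without CRLF) of ASCII and GB 2312 text.

    Hypothetical helper: the designation is emitted once per line,
    before the first Chinese character, and the line always ends in
    ASCII.
    """
    out = bytearray()
    designated = False  # has this line already carried its designation?
    in_chinese = False  # are we currently shifted out with SO?
    for ch in line:
        if ord(ch) < 0x80:
            if in_chinese:
                out += SI
                in_chinese = False
            out += ch.encode("ascii")
        else:
            # Raises UnicodeEncodeError for characters outside GB 2312.
            euc = ch.encode("gb2312")
            if not designated:
                out += SO_DESIGNATE_GB2312
                designated = True
            if not in_chinese:
                out += SO
                in_chinese = True
            # Clear the high bit of each byte to obtain the 7-bit form.
            out += bytes(b & 0x7F for b in euc)
    if in_chinese:
        out += SI  # each line must end in ASCII
    return bytes(out)


print(encode_line_gb("a\u4ea4\u6362b").hex(" "))
```

Calling the function once per line, before appending CRLF, keeps the
designation and shift state from leaking across line boundaries.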
Example: the hex sequence
1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f
represents the Chinese word for "Interchange" (jiao huan) twice;
the first time in simplified form using GB-2312 (the 3d 3b 3b 3b
sequence above), and the second time in traditional form using
CNS-11643 (the 47 28 5f 50 sequence above). The sequence 1b 24 29
41 is the SOdesignation for GB-2312, the 0e is SO to switch to
Chinese from ASCII, the 1b 24 29 47 is the SOdesignation for
CNS-11643 plane 1, and finally the 0f is the SI to return to ASCII
at the end of the line.
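
To make the walk-through of the example concrete, the following
Python sketch splits one ISO-2022-CN line into runs of bytes tagged
with the character set in effect. The names split_line and gb_to_text
are hypothetical. Only the GB 2312 run is converted to text, on the
assumption that Python's "gb2312" codec is available; the CNS 11643
run is left as raw bytes because a conversion table would be needed.
Error handling is deliberately thin.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Final characters used by ISO-2022-CN designations.
SO_SETS = {ord("A"): "GB2312", ord("G"): "CNS11643-1"}
SS2_SETS = {ord("H"): "CNS11643-2"}


def split_line(data: bytes):
    """Return a list of (charset, bytes) runs for one line (no CRLF)."""
    runs = []
    so_set = ss2_set = None
    shifted = False  # True between SO and SI

    def emit(name, chunk):
        # Merge with the previous run when the charset is unchanged.
        if runs and runs[-1][0] == name:
            runs[-1] = (name, runs[-1][1] + chunk)
        else:
            runs.append((name, chunk))

    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            kind = data[i + 1:i + 3]
            if kind == b"$)":            # SOdesignation
                so_set = SO_SETS[data[i + 3]]
                i += 4
            elif kind == b"$*":          # SS2designation
                ss2_set = SS2_SETS[data[i + 3]]
                i += 4
            elif kind[:1] == b"N":       # SS2: applies to the next two bytes only
                if ss2_set is None:
                    raise ValueError("SS2 used before an SS2designation")
                emit(ss2_set, data[i + 2:i + 4])
                i += 4
            else:
                raise ValueError("unsupported escape sequence")
        elif b == SO:
            if so_set is None:
                raise ValueError("SO used before an SOdesignation")
            shifted = True
            i += 1
        elif b == SI:
            shifted = False
            i += 1
        elif shifted:
            emit(so_set, data[i:i + 2])  # two bytes per Chinese character
            i += 2
        else:
            emit("ASCII", data[i:i + 1])
            i += 1
    return runs


def gb_to_text(chunk: bytes) -> str:
    """Set the high bit of each byte (EUC form), then decode as GB 2312."""
    return bytes(b | 0x80 for b in chunk).decode("gb2312")


example = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f")
for charset, chunk in split_line(example):
    print(charset, chunk.hex(" "))
```

Applied to the example sequence, the sketch yields one GB 2312 run
(3d 3b 3b 3b) followed by one CNS 11643 plane 1 run (47 28 5f 50),
matching the description above.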
The name given to this character encoding is "ISO-2022-CN". This name
is intended to be used as the "charset" parameter in MIME [MIME-1,
MIME-2] messages.
Content-Type: text/plain; charset=iso-2022-cn
The ISO-2022-CN encoding is already in 7-bit form, so it is not
necessary to use a Content-Transfer-Encoding header.
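
As an illustration only, a message carrying such a body could be
assembled with Python's standard email package as sketched below. The
helper name build_message is hypothetical, and the body is assumed to
have been encoded to ISO-2022-CN bytes beforehand, since the sketch
does not rely on any built-in codec for this charset.

```python
from email.message import Message


def build_message(body_7bit: bytes) -> Message:
    """Wrap already-encoded ISO-2022-CN bytes in a MIME text part.

    Hypothetical helper: no Content-Transfer-Encoding header is added,
    because the body is already 7-bit.
    """
    if any(b > 0x7F for b in body_7bit):
        raise ValueError("ISO-2022-CN text must be 7-bit")
    msg = Message()
    msg["MIME-Version"] = "1.0"
    msg["Content-Type"] = "text/plain; charset=iso-2022-cn"
    # Every byte is below 0x80, so an ASCII decode is lossless here.
    msg.set_payload(body_7bit.decode("ascii"))
    return msg


body = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 0f")
msg = build_message(body)
print(msg.get_content_type(), msg.get_content_charset())
```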
Other restrictions are given in the "Formal Syntax of ISO-2022-CN"
(Section 7.1 of this document).
1.3. ISO-2022-CN-EXT
ISO-2022-CN-EXT supports all characters in existing GB, Big5 and CNS
11643 character sets.
The escape sequences, shift functions and character sets used in an
ISO-2022-CN-EXT text are as follows:
Character sets                                         Shift in with
--------------------------------------------------------------------
ASCII                                                  SI
GB 2312, GB 12345, CNS 11643-plane-1, ISO-IR-165       SO
GB 7589, GB 13131, CNS 11643-plane-2                   SS2
GB 7590, GB 13132 or other new GBs,                    SS3
   CNS 11643-plane-3 or higher planes of CNS 11643
Note: Currently, there are some GB sets that have not been
registered in ISO. Here <X7589>, <X7590>, <X12345>, <X13131> and
<X13132> represent the final character that will be assigned by
ISO for those sets. These GB sets shall only be used once these
final characters are assigned.
ESC $ ) A          Indicates the bytes following SO are Chinese
                   characters as defined in GB 2312-80, until
                   another SOdesignation appears
ESC $ * <X7589>    Indicates the two bytes immediately following
                   SS2 form a Chinese character as defined in GB
                   7589-87 [GB-7589], until another SS2designation
                   appears
ESC $ + <X7590>    Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in GB
                   7590-87 [GB-7590], until another SS3designation
                   appears
ESC $ ) <X12345>   Indicates the bytes following SO are as defined
                   in GB 12345-90 [GB-12345], until another
                   SOdesignation appears
ESC $ * <X13131>   Indicates the two bytes immediately following
                   SS2 form a Chinese character as defined in GB
                   13131-91 [GB-13131], until another
                   SS2designation appears
ESC $ + <X13132>   Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in GB
                   13132-91 [GB-13132], until another
                   SS3designation appears
ESC $ ) E          Indicates the bytes following SO are as defined
                   in ISO-IR-165 (for details, see section 2.1),
                   until another SOdesignation appears
ESC $ ) G          Indicates the bytes following SO are as defined
                   in CNS 11643-plane-1, until another
                   SOdesignation appears
ESC $ * H          Indicates the two bytes immediately following
                   SS2 form a Chinese character as defined in CNS
                   11643-plane-2, until another SS2designation
                   appears
ESC $ + I          Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in CNS
                   11643-plane-3, until another SS3designation
                   appears
ESC $ + J          Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in CNS
                   11643-plane-4, until another SS3designation
                   appears
ESC $ + K          Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in CNS
                   11643-plane-5, until another SS3designation
                   appears
ESC $ + L          Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in CNS
                   11643-plane-6, until another SS3designation
                   appears
ESC $ + M          Indicates the two bytes immediately following
                   SS3 form a Chinese character as defined in CNS
                   11643-plane-7, until another SS3designation
                   appears
As in ISO-2022-CN, each line starts in ASCII, and ends in ASCII, and
has its own designation information before any Chinese characters
appear.
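
The designations listed above lend themselves to a simple lookup
table. The Python sketch below is one possible arrangement; the names
EXT_DESIGNATIONS and lookup_designation are hypothetical. The GB sets
whose final characters are shown as placeholders (<X7589> and so on)
are intentionally left out, because no final character is given for
them.

```python
ESC_DOLLAR = b"\x1b$"

# Keyed by the two bytes that follow "ESC $": the intermediate byte
# selects the shift function, the final byte selects the character set.
EXT_DESIGNATIONS = {
    b")A": ("SO", "GB 2312-80"),
    b")E": ("SO", "ISO-IR-165"),
    b")G": ("SO", "CNS 11643-plane-1"),
    b"*H": ("SS2", "CNS 11643-plane-2"),
    b"+I": ("SS3", "CNS 11643-plane-3"),
    b"+J": ("SS3", "CNS 11643-plane-4"),
    b"+K": ("SS3", "CNS 11643-plane-5"),
    b"+L": ("SS3", "CNS 11643-plane-6"),
    b"+M": ("SS3", "CNS 11643-plane-7"),
}


def lookup_designation(seq: bytes):
    """Map a 4-byte designation to (shift function, character set).

    Returns None when the sequence is well formed but not in the table.
    """
    if len(seq) != 4 or seq[:2] != ESC_DOLLAR:
        raise ValueError("not a designation of the form ESC $ <I> <F>")
    return EXT_DESIGNATIONS.get(seq[2:])


print(lookup_designation(b"\x1b$+I"))
```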
The name given to this character encoding is "ISO-2022-CN-EXT". This
name is intended to be used as the "charset" parameter in MIME
messages.
Content-Type: text/plain; charset=ISO-2022-CN-EXT
The ISO-2022-CN-EXT encoding is also in 7-bit form, so it is not
necessary to use a Content-Transfer-Encoding header.
Other restrictions are given in the "Formal Syntax of
ISO-2022-CN-EXT" (Section 7.2 of this document).
1.4. How to Support Big5 or other internal codesets with ISO-2022-CN
and ISO-2022-CN-EXT
Many different Chinese internal coding systems exist [CJKINF], such
as EUC GB, Big5, CCCII (an encoding for library systems mainly used
in Taiwan) and GBK (the new standard specification for Chinese
internal code, which is also the code page of Microsoft Simplified
Chinese Windows 95). Because ISO-2022-CN and ISO-2022-CN-EXT are
7-bit and do not lose information when text moves between different
codesets, they facilitate interchange between the various Chinese
coding systems on the Internet.
For instance, ISO-2022-CN and ISO-2022-CN-EXT can be used to support
the popular Big5 codeset, because the first two planes of CNS-11643
contain the same Chinese characters as Big5's "common part", except
for two duplicate characters. By the "common part" we mean the part
that is not specific to any Big5 vendor, consisting of 5401 more
frequently used characters in Big5 range 0xA440-0xC67E, 7652 less
frequently used characters in Big5 range 0xC940-0xF9D5, and 441 other
symbols in Big5 range 0xA140-0xA3E0, as defined in the Institute for
Information Industry's (III) technical report C-26 (see also [BIG5]).
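
The three ranges of the "common part" can be checked mechanically.
The Python sketch below classifies a 16-bit Big5 code against them;
the function name big5_common_part and the labels are hypothetical.
The check on the second byte (0x40-0x7E or 0xA1-0xFE) reflects the
usual Big5 byte layout and is an assumption added here, not something
stated in the paragraph above. The sketch does not perform the
Big5-to-CNS conversion itself, which needs a table.

```python
# (low, high, label) for the vendor-independent part of Big5,
# using the ranges quoted in the text.
COMMON_PART = [
    (0xA140, 0xA3E0, "symbols"),
    (0xA440, 0xC67E, "frequently used"),
    (0xC940, 0xF9D5, "less frequently used"),
]


def big5_common_part(code: int):
    """Return the label of the range containing `code`, or None."""
    trail = code & 0xFF
    # Assumed Big5 layout: the second byte is 0x40-0x7E or 0xA1-0xFE.
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    for low, high, label in COMMON_PART:
        if low <= code <= high:
            return label
    return None  # vendor-specific or unassigned


for code in (0xA140, 0xA5E6, 0xC940, 0xFA40):
    print(hex(code), big5_common_part(code))
```

A converter could use such a check to decide whether a character is
covered by the first two planes of CNS-11643 or needs a
vendor-specific mapping.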
The appendix of this document presents a conversion table for
converting Big5 into CNS-11643, including specific extensions of some
popular vendors. For other extensions, vendors and implementors of
Big5 products are ENCOURAGED to create detailed conversion tables, in
order to increase interoperability between different coding systems.
Public domain software (binary or C source code) for conversion
between Big5 and CNS-11643 is available on many Internet sites. At
the time of this writing, the following FTP sites and software are
advertised:
1) Beijing:
ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/convert/big5cns.zip
(IP address: 166.111.1.6)
2) Xi'an:
ftp://ftp.xanet.edu.cn/pub/chinese-soft/unix/convert/BeTTY-1.534.tar.gz
(IP address: 202.112.11.131)
3) Taiwan:
ftp://ftp.seed.net.tw/Pub/Chinese/DOS/code-convert/chcode.zip
(IP address: 140.92.1.65)
4) US:
ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz
(IP address: 128.123.1.55)