Network Working Group                                             Y. Wei
Request for Comments: 1842                        AsiaInfo Services Inc.
Category: Informational                                         Y. Zhang
                                                           Harvard Univ.
                                                                   J. Li
                                                              Rice Univ.
                                                                 J. Ding
                                                  AsiaInfo Services Inc.
                                                                Y. Jiang
                                                       Univ. of Maryland
                                                             August 1995


      ASCII Printable Characters-Based Chinese Character Encoding
                         for Internet Messages

Status of this Memo

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.

Abstract

   This document describes the encoding used in electronic mail [RFC822]
   and network news [RFC1036] messages over the Internet.  The 7-bit
   representation of GB 2312 Chinese text was specified by Fung Fung Lee
   of Stanford University [Lee89] and implemented in various software
   packages on different platforms (see the appendix for a partial list
   of the available software packages that support this encoding
   method).  It has been further tested and used in the USENET
   newsgroups alt.chinese.text and chinese.* as well as various other
   network forums with considerable success.  Future extensions of this
   encoding method can accommodate additional GB character sets and
   other East Asian language character sets [Wei94].

   The name given to this encoding is "HZ-GB-2312", which is intended to
   be used in the "charset" parameter field of MIME headers (see [MIME1]
   and [MIME2]).

Wei, et al                   Informational                      [Page 1]
RFC 1842         ASCII/Chinese Character Encoding           August 1995

Table of Contents

   1.  Introduction
   2.  Description
   3.  Formal Syntax
   4.  MIME Considerations
   5.  Background Information
   6.  References
   7.  Acknowledgements
   8.  Security Considerations
   9.  Authors' Addresses
   10. Appendix: List of Software Implementing HZ Representation

1. Introduction

   Chinese characters (and those of other East Asian languages) are
   encoded with multiple bytes to guarantee sufficient coding space for
   the large number of glyphs these languages contain.  With the
   proliferation of internetwork traffic around the world, it becomes
   necessary to define ways to facilitate the transfer of text in
   multiple-byte character-set languages (hereafter, Chinese text) over
   the Internet.

   There are two layers of concern that need to be addressed by any
   mechanism whose purpose is to transfer Chinese text over the
   Internet.  The first is the application layer, in which the
   applications concerned should be able to recognize the encoding of
   the text and/or discern different character sets which might be mixed
   in the text, and handle it accordingly.  The second layer is the
   actual transport of Chinese text from point A to point B over the
   Internet.  Because the prevailing mail transport protocol used over
   the Internet, the Simple Mail Transfer Protocol (SMTP), was designed
   originally for the ASCII character set only, many Internet mail
   agents are not 8-bit clean and therefore introduce challenges for any
   attempt to actually implement a mechanism for the transport of
   Chinese text over the Internet.

   Here we describe a mechanism for the transmission of Chinese text
   over IP networks.  This mechanism has been implemented by various
   software packages dealing with multi-language support and has been
   tested on USENET newsgroups and other types of Internet forums over
   the last two years.  The test results show that the HZ representation
   can pass through almost all existing mail delivery agents without
   being corrupted.

   The HZ representation currently handles the GB2312-80 Chinese
   character set only.
   Further expansion to other Chinese encoding systems and to other East
   Asian languages is under consideration.

2. Description

   For an arbitrary mixed text with both Chinese coded text strings and
   ASCII text strings, we designate two distinguishable text modes,
   ASCII mode and HZ mode, as the only two states allowed in the text.
   At any given time, the text is in either one of these two modes or in
   the transition from one to the other.  In HZ mode, only printable
   ASCII characters (0x21-0x7E) are meaningful, and the basic text unit
   is two bytes long.  In ASCII mode, the basic text unit is one (1)
   byte, with the exception of '~~', which is the special sequence
   representing the ASCII character '~'.  In both ASCII mode and HZ
   mode, '~' leads an escape sequence.  However, as the basic text unit
   in HZ mode is two bytes long, only a '~' character that appears at
   the first byte of a two-byte character frame is considered the start
   of an escape sequence.

   The default mode is ASCII mode.  Each line of text starts in the
   default ASCII mode.  Therefore, all Chinese character strings are to
   be enclosed in a '~{' and '~}' pair within the same text line.

   The escape sequences defined are as follows:

      ~{      ---- escape from ASCII mode to GB2312 HZ mode
      ~}      ---- escape from HZ mode to ASCII mode
      ~~      ---- ASCII character '~' in ASCII mode
      ~\n     ---- line continuation in ASCII mode
      ~[!-z|] ---- reserved for future HZ mode character sets

   A few examples of the 7-bit representation of Chinese GB coded text,
   taken directly from [Lee89], are listed as follows:

   Example 1:  (Suppose there is no line size limit.)
   This sentence is in ASCII.
   The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.

   Example 2:  (Suppose the maximum line size is 42.)
   This sentence is in ASCII.
   The next sentence is in GB.~{<:Ky2;S{#,~}~
   ~{NpJ)l6HK!#~}Bye.

   Example 3:  (Suppose a new line is started for every mode switch.)
   This sentence is in ASCII.
   The next sentence is in GB.~
   ~{<:Ky2;S{#,NpJ)l6HK!#~}~
   Bye.

3. Formal Syntax

   The notational conventions used here are identical to those used in
   RFC 822 [RFC822].

   The * (asterisk) convention is as follows:

      l*m something

   meaning at least l and at most m somethings, with l and m taking
   default values of 0 and infinity, respectively.

   message             = headers 1*( CRLF *single-byte-char *segment
                         single-byte-seq *single-byte-char )
                                          ; see also [MIME1] "body-part"
                                          ; note: must end in ASCII

   headers             = <see [RFC822] "fields" and [MIME1] "body-part">

   segment             = single-byte-segment / double-byte-segment

   single-byte-segment = 1*single-byte-char

   double-byte-segment = double-byte-seq 1*( one-of-94 one-of-94 )

   single-byte-seq     = "~}"

   double-byte-seq     = "~{"

   CRLF                = CR LF

                                                   ; ( Octal, Decimal.)

   CR                  = <ASCII CR, carriage return>; (    15,     13.)

   LF                  = <ASCII LF, linefeed>      ; (     12,     10.)

   one-of-94           = <any one of 94 values>    ; ( 41-176, 33.-126.)

   single-byte-char    = <any 7BIT, including bare CR & bare LF, but NOT
                          including CRLF, not including "~"> / "~~"

   7BIT                = <any 7-bit value>         ; (  0-177,  0.-127.)

4. MIME Considerations

   The name given to the HZ character encoding is "HZ-GB-2312".  This
   name is intended to be used in MIME messages as follows:

      Content-Type: text/plain; charset=HZ-GB-2312

   The HZ-GB-2312 encoding is already in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.

5. Background Information

   A GB code is a two-byte character with the first byte in the range
   0x21-0x77 and the second byte in the range 0x21-0x7E.
   As the printable ASCII characters are single-byte characters in the
   range 0x21-0x7E, two printable ASCII characters can represent a
   two-byte GB coded Chinese character if the proper escape sequence is
   used to indicate the proper text mode.  This forms the basis of the
   HZ 7-bit representation method described above.

   Further, with the use of a printable ASCII character, '~', as the
   leading byte of the escape sequence, the HZ representation eliminates
   the need to reserve any non-printable ASCII characters, which are
   commonly used by application programs (as well as system
   environments) for various control functions or other special
   signaling.  Therefore, the HZ representation method described here
   poses the least probability of interfering with the host and network
   environment.  This is also a convenience for applications
   implementing the HZ coding method.

   The HZ representation method has been implemented in various Chinese
   software across computer hardware platforms.  It has also been tested
   for more than two years over the USENET newsgroups alt.chinese.text
   and chinese.* for the transmission of Chinese text over the Internet.
   The points of origin of those transferred Chinese texts are
   geographically scattered around the world and subject to the
   constraints of vastly different system and network environments.
   Therefore, such a test group may well represent a rather complete
   sample of the real Internet world.  The successful test of the HZ
   representation method therefore builds confidence that it is well
   suited for transmitting multi-byte text messages over the Internet.

   Under the HZ representation, ASCII text remains as 7-bit characters,
   and therefore the HZ representation together with the 7-bit ASCII
   character set can be viewed as forming a superset of characters.

6. References

   [ASCII]   American National Standards Institute, "Coded character
             set -- 7-bit American national standard code for
             information interchange", ANSI X3.4-1986.

   [GB 2312] Technical Administrative Bureau of P.R.China, "Coding of
             Chinese Ideogram Set for Information Interchange Basic
             Set", GB 2312-80.

   [Lee89]   Lee, F., "HZ - A Data Format for Exchanging Files of
             Arbitrarily Mixed Chinese and ASCII characters", RFC 1843,
             Stanford University, August 1995.

   [MIME1]   Borenstein N., and N. Freed, "MIME (Multipurpose Internet
             Mail Extensions) Part One: Mechanisms for Specifying and
             Describing the Format of Internet Message Bodies",
             RFC 1521, Bellcore, Innosoft, September 1993.

   [MIME2]   Moore, K., "MIME (Multipurpose Internet Mail Extensions)
             Part Two: Message Header Extensions for Non-ASCII Text",
             RFC 1522, University of Tennessee, September 1993.

   [RFC822]  Crocker, D., "Standard for the Format of ARPA Internet
             Text Messages", STD 11, RFC 822, UDEL, August 1982.

   [RFC1036] Horton M., and R. Adams, "Standard for Interchange of
             USENET Messages", RFC 1036, AT&T Bell Laboratories,
             Center for Seismic Studies, December 1987.

   [Wei94]   Wei, Yagui, "A Proposal for a Consolidated Collection of
             East Asian Language Coding Standards Using Solely ASCII
             Printable Characters", June 30, 1994.

7. Acknowledgements

   Many people have been involved in the design and specification of the
   HZ 7-bit Chinese representation system at different stages.  Most
   notable among them are Ed Lai, Chunqing Cheng, Fung Fung Lee, and
   Ricky Yeung.  This document is merely a recollection of thoughts and
   efforts made collectively by this group of people, whose devotion has
   led to the current success of the HZ Chinese representation over the
   Internet.

   Further, the authors wish to thank AsiaInfo Services Inc. for
   sponsoring the preparation of this document and for facilitating the
   communication needed to refine this document.
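
   As an illustration only, the escape sequences of Section 2 and the
   byte ranges of Section 5 can be sketched in Python as a small
   decoder.  The function name hz_to_euc_cn is hypothetical and not
   part of this memo.  The sketch assumes that the desired output is
   the common 8-bit form of GB 2312 (EUC-CN), obtained by setting the
   high bit of each byte of a two-byte unit, that lines end in a bare
   LF, and that malformed input should be rejected rather than passed
   through; none of these choices is mandated here.

```python
def hz_to_euc_cn(data: bytes) -> bytes:
    """Convert HZ-encoded bytes to EUC-CN bytes (illustrative sketch)."""
    out = bytearray()
    hz_mode = False          # the default mode is ASCII mode
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if not hz_mode:
            if b != 0x7E:    # ordinary one-byte unit, copied unchanged
                out.append(b)
                i += 1
                continue
            # '~' leads a two-byte escape sequence in ASCII mode
            if i + 1 >= n:
                raise ValueError("dangling '~' at end of input")
            nxt = data[i + 1]
            if nxt == 0x7E:      # "~~" stands for a literal '~'
                out.append(0x7E)
            elif nxt == 0x7B:    # "~{" switches to HZ mode
                hz_mode = True
            elif nxt == 0x0A:    # "~\n" is a line continuation, emits nothing
                pass
            else:                # other "~x" sequences are reserved
                raise ValueError("unsupported escape at offset %d" % i)
            i += 2
        else:
            # HZ mode: the basic text unit is two bytes long
            if i + 1 >= n:
                raise ValueError("truncated two-byte unit at offset %d" % i)
            b2 = data[i + 1]
            if b == 0x7E and b2 == 0x7D:     # "~}" switches back to ASCII
                hz_mode = False
            elif 0x21 <= b <= 0x77 and 0x21 <= b2 <= 0x7E:
                out.append(b | 0x80)         # set the high bit of each byte
                out.append(b2 | 0x80)
            else:
                raise ValueError("invalid GB byte pair at offset %d" % i)
            i += 2
    return bytes(out)
```

   The returned bytes can then be handed to any existing GB 2312
   decoder, for example hz_to_euc_cn(line).decode("gb2312") in Python.
   A newline met while still in HZ mode is rejected by this sketch,
   which follows from the rule that every Chinese string is closed by
   '~}' within the same text line.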