<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3c.org/TR/1999/REC-html401-19991224/loose.dtd">
<!-- saved from url=(0047)http://seba.studentenweb.org/thesis/rfc1842.php -->
<HTML><HEAD><TITLE>RFC 1842 - Chinese Language Processing and Chinese Computing</TITLE><LINK 
href="RFC 1842 - Chinese Language Processing and Chinese Computing.files/layout.css" 
type=text/css rel=stylesheet>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
<META content="MSHTML 6.00.2900.2912" name=GENERATOR></HEAD>
<BODY>
<TABLE cellSpacing=0 cellPadding=0 width="100%">
  <TBODY>
  <TR>
    <TD class=content vAlign=top width="80%"><PRE>
Network Working Group                                             Y. Wei
Request for Comments: 1842                        AsiaInfo Services Inc.
Category: Informational                                         Y. Zhang
                                                           Harvard Univ.
                                                                   J. Li
                                                              Rice Univ.
                                                                 J. Ding
                                                  AsiaInfo Services Inc.
                                                                Y. Jiang
                                                       Univ. of Maryland
                                                             August 1995

      ASCII Printable Characters-Based Chinese Character Encoding
                         for Internet Messages

Status of this Memo

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.

Abstract

   This document describes the encoding used in electronic mail [<A class=outsite href="http://www.faqs.org/rfcs/rfc822.html">RFC822</A>]
   and network news [<A class=outsite href="http://www.faqs.org/rfcs/rfc1036.html">RFC1036</A>] messages over the Internet. The 7-bit
   representation of GB 2312 Chinese text was specified by Fung Fung Lee
   of Stanford University [Lee89] and implemented in various software
   packages under different platforms (see appendix for a partial list
   of the available software packages that support this encoding
   method). It has been further tested and used in the Usenet newsgroups
   alt.chinese.text and chinese.* as well as various other network
   forums with considerable success. Future extensions of this encoding
   method can accommodate additional GB character sets and other East
   Asian language character sets [Wei94].

   The name given to this encoding is "HZ-GB-2312", which is intended to
   be used in the "charset" parameter field of MIME headers (see [MIME1]
   and [MIME2]).

Table of Contents

   1.     Introduction
   2.     Description
   3.     Formal Syntax
   4.     MIME Considerations
   5.     Background Information
   6.     References
   7.     Acknowledgements
   8.     Security Considerations
   9.     Authors' Addresses
   10.    Appendix: List of Software Implementing HZ Representation

1. Introduction

   Characters of Chinese (and other East Asian languages) are encoded
   with multiple bytes to guarantee sufficient coding space for the
   large number of glyphs these languages contain. With the
   proliferation of internetwork traffic around the world, it becomes
   necessary to define ways to facilitate the transfer of text in
   multiple-byte character-set languages (hereafter "Chinese text")
   over the Internet.

   Two layers of concern need to be addressed by any mechanism whose
   purpose is to transfer Chinese text over the Internet. The first is
   the application layer, in which the applications concerned should be
   able to recognize the encoding of the text and/or discern different
   character sets that might be mixed in the text, and handle it
   accordingly. The second layer is the actual transport of Chinese
   text from point A to point B over the Internet. Because the
   prevailing mail transport protocol used over the Internet, the
   Simple Mail Transfer Protocol (SMTP), was designed originally for
   the ASCII character set only, many Internet mail agents are not
   8-bit clean and therefore introduce challenges for any attempt to
   actually implement a mechanism for the transport of Chinese text
   over the Internet.

   Here we describe a mechanism for the transmission of Chinese text
   over IP networks. The mechanism has been implemented by various
   software packages dealing with multi-language support and has been
   tested on USENET newsgroups and other types of Internet forums over
   the last two years. The test results show that the HZ representation
   can pass through almost all existing mail delivery agents without
   being corrupted. The HZ representation currently handles the
   GB2312-80 Chinese character set only. Further expansion to other
   Chinese encoding systems and to other East Asian languages is under
   consideration.

2. Description

   For an arbitrary mixed text with both Chinese coded text strings and
   ASCII text strings, we designate two distinguishable text modes,
   ASCII mode and HZ mode, as the only two states allowed in the text.
   At any given time, the text is in either one of these two modes or in
   the transition from one to the other. In HZ mode, only printable
   ASCII characters (0x21-0x7E) are meaningful, and the basic text unit
   is two bytes long.

   In ASCII mode, the basic text unit is one (1) byte long, with the
   exception of '~~', which is the special sequence representing the
   ASCII character '~'. In both ASCII mode and HZ mode, '~' leads an
   escape sequence. However, as the basic text unit in HZ mode is two
   bytes long, only a '~' character that appears as the first byte of
   a two-byte character frame is considered the start of an escape
   sequence.

   The default mode is ASCII mode. Each line of text starts in the
   default ASCII mode. Therefore, all Chinese character strings are to
   be enclosed in a '~{' and '~}' pair within the same text line.

   The escape sequences defined are as follows:

        ~{       ---- escape from ASCII mode to GB2312 HZ mode
        ~}       ---- escape from HZ mode to ASCII mode
        ~~       ---- ASCII character '~' in ASCII mode
        ~\n      ---- line continuation in ASCII mode
        ~[!-z|]  ---- reserved for future HZ mode character sets
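
   The escape sequences above can be read as a two-state machine. The
   following Python sketch is one possible decoder, not part of the
   original specification. The name hz_decode is hypothetical, the
   sketch assumes LF line ends, and it relies on Python's built-in
   "gb2312" codec to map the 8-bit form of each pair to a character.
   The reserved escape sequences are rejected rather than interpreted.

```python
def hz_decode(data: bytes) -> str:
    """Decode HZ-encoded 7-bit bytes into text (sketch, LF line ends assumed)."""
    out = []
    in_hz = False                      # text starts in ASCII mode
    stream = iter(data)
    for b in stream:
        if b == 0x7E:                  # '~' leads an escape sequence
            nxt = next(stream, None)
            if nxt == 0x7B and not in_hz:
                in_hz = True           # '~{': ASCII mode to HZ mode
            elif nxt == 0x7D and in_hz:
                in_hz = False          # '~}': HZ mode to ASCII mode
            elif nxt == 0x7E and not in_hz:
                out.append("~")        # '~~': literal tilde
            elif nxt == 0x0A and not in_hz:
                pass                   # '~\n': line continuation, emits nothing
            else:
                raise ValueError("unsupported escape sequence")
        elif in_hz:
            second = next(stream, None)
            if b not in range(0x21, 0x7F) or second not in range(0x21, 0x7F):
                raise ValueError("HZ mode needs pairs of printable ASCII bytes")
            # Setting the high bit of both bytes yields the 8-bit (EUC) form
            # that Python's "gb2312" codec understands.
            out.append(bytes([b + 0x80, second + 0x80]).decode("gb2312"))
        elif b >= 0x80:
            raise ValueError("8-bit byte in ASCII mode")
        else:
            out.append(chr(b))
    if in_hz:
        raise ValueError("text must end in ASCII mode")
    return "".join(out)
```

   For instance, hz_decode(b"~{HK!#~}Bye.") returns two GB characters
   followed by "Bye.". A '~' that falls on the second byte of a pair is
   consumed as data, which matches the rule that only the first byte of
   a two-byte frame can start an escape sequence.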

   A few examples of the 7-bit representation of Chinese GB coded text,
   taken directly from [Lee89], are listed below:

   Example 1:  (Suppose there is no line size limit.)
               This sentence is in ASCII.
               The next sentence is in GB.~{&lt;:Ky2;S{#,NpJ)l6HK!#~}Bye.

   Example 2:  (Suppose the maximum line size is 42.)
               This sentence is in ASCII.
               The next sentence is in GB.~{&lt;:Ky2;S{#,~}~
               ~{NpJ)l6HK!#~}Bye.

   Example 3:  (Suppose a new line is started for every mode switch.)
               This sentence is in ASCII.
               The next sentence is in GB.~
               ~{&lt;:Ky2;S{#,NpJ)l6HK!#~}~
               Bye.
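
   Lines like those in Example 1 could be produced along the following
   lines. This is a minimal sketch under stated assumptions, not a
   reference implementation: the name hz_encode is hypothetical, the
   sketch uses Python's built-in "gb2312" codec to obtain the 8-bit
   form of each Chinese character, and it does not wrap long lines as
   Examples 2 and 3 do.

```python
def hz_encode(text: str) -> str:
    """Encode mixed ASCII and GB 2312 text as an HZ string (sketch)."""
    out = []
    in_hz = False
    for ch in text:
        if ord(ch) >= 0x80:
            # Raises UnicodeEncodeError for characters outside GB 2312.
            gb = ch.encode("gb2312")
            if not in_hz:
                out.append("~{")       # enter HZ mode
                in_hz = True
            # Clearing the high bit of each byte gives two printable
            # ASCII characters.
            out.append(chr(gb[0] - 0x80) + chr(gb[1] - 0x80))
        else:
            if in_hz:
                out.append("~}")       # return to ASCII mode
                in_hz = False
            out.append("~~" if ch == "~" else ch)
    if in_hz:
        out.append("~}")               # never leave the text in HZ mode
    return "".join(out)
```

   Because a newline is an ASCII character, the sketch closes any open
   HZ segment before it, so every output line starts and ends in ASCII
   mode as Section 2 requires.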

3. Formal Syntax

   The notational conventions used here are identical to those used in
   <A class=outsite href="http://www.faqs.org/rfcs/rfc822.html">RFC 822</A> [<A class=outsite href="http://www.faqs.org/rfcs/rfc822.html">RFC822</A>].

   The * (asterisk) convention is as follows:

       l*m something

   meaning at least l and at most m somethings, with l and m taking
   default values of 0 and infinity, respectively.

   message             = headers 1*( CRLF *single-byte-char *segment
                         single-byte-seq *single-byte-char )
                                       ; see also [MIME1] "body-part"
                                       ; note: must end in ASCII

   headers             = &lt;see [<A class=outsite href="http://www.faqs.org/rfcs/rfc822.html">RFC822</A>] "fields" and [MIME1] "body-part"&gt;

   segment             = single-byte-segment / double-byte-segment

   single-byte-segment = 1*single-byte-char

   double-byte-segment = double-byte-seq 1*( one-of-94 one-of-94 )

   single-byte-seq     = "~}"

   double-byte-seq     = "~{"

   CRLF                = CR LF
                                                    ; ( Octal, Decimal.)

   CR                  = &lt;ASCII CR, carriage return&gt;; (    15,      13.)

   LF                  = &lt;ASCII LF, linefeed&gt;       ; (    12,      10.)

   one-of-94           = &lt;any one of 94 values&gt;     ; (41-176, 33.-126.)

   single-byte-char    = &lt;any 7BIT, including bare CR &amp; bare LF, but NOT
                          including CRLF, not including "~"&gt; / "~~"

   7BIT                = &lt;any 7-bit value&gt;          ; ( 0-177,  0.-127.)
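
   As an illustration only, the body-line part of this grammar can be
   approximated with a regular expression. The sketch below is an
   assumption-laden simplification: the names HZ_LINE and
   is_valid_hz_line are hypothetical, it checks one line at a time with
   the line terminator already removed, and it restricts the first byte
   of each pair to 0x21-0x77 (the GB first-byte range given in Section
   5) so that the closing "~}" cannot be mistaken for data.

```python
import re

# One body line: ASCII-mode characters, '~~', or '~{' pairs '~}' segments,
# optionally ending with the '~' of a line continuation.
HZ_LINE = re.compile(
    r"(?:[\x00-\x09\x0b\x0c\x0e-\x7d\x7f]"   # 7-bit char except LF, CR, '~'
    r"|~~"                                    # escaped tilde
    r"|~\{(?:[!-w][!-~])+~\})*"               # double-byte segment
    r"~?"                                     # trailing '~' before the line end
)


def is_valid_hz_line(line: str) -> bool:
    """Return True if the line matches the simplified HZ line pattern."""
    return HZ_LINE.fullmatch(line) is not None
```

   Under this sketch a line with an odd number of bytes between "~{"
   and "~}", an unterminated "~{", or a reserved escape such as "~b" is
   reported as invalid.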

4. MIME Considerations

   The name given to the HZ character encoding is "HZ-GB-2312". This
   name is intended to be used in MIME messages as follows:

       Content-Type: text/plain; charset=HZ-GB-2312

   The HZ-GB-2312 encoding is already in 7-bit form, so it is not
   necessary to use a Content-Transfer-Encoding header.
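
   A message labelled this way might be assembled as in the following
   sketch, which uses Python's standard email.message.Message class.
   The helper names build_hz_message and read_hz_body are hypothetical,
   and the body is assumed to be already HZ-encoded. Reading the body
   back relies on Python accepting "hz-gb-2312" as an alias of its
   built-in "hz" codec.

```python
from email.message import Message


def build_hz_message(subject: str, hz_body: str) -> Message:
    """Wrap an already HZ-encoded (pure ASCII) body in a MIME message."""
    msg = Message()
    msg["Subject"] = subject
    msg["MIME-Version"] = "1.0"
    msg["Content-Type"] = "text/plain; charset=HZ-GB-2312"
    # The body is 7-bit already, so no Content-Transfer-Encoding is set.
    msg.set_payload(hz_body)
    return msg


def read_hz_body(msg: Message) -> str:
    """Decode the body using the charset named in Content-Type."""
    charset = msg.get_content_charset()      # lower-cased parameter value
    return msg.get_payload().encode("ascii").decode(charset)
```

   Calling read_hz_body(build_hz_message("test", "~{HK!#~}")) returns
   the decoded Chinese text, while msg.as_string() shows that the
   transmitted form contains printable ASCII only.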

5. Background Information

   A GB code is a two-byte character with the first byte in the range
   0x21-0x77 and the second byte in the range 0x21-0x7E. As the
   printable ASCII characters are single-byte characters in the range
   0x21-0x7E, two printable ASCII characters can represent a two-byte
   GB coded Chinese character if a proper escape sequence is used to
   indicate the proper text mode. This forms the basis of the HZ 7-bit
   representation method described above. Further, by using a
   printable ASCII character, '~', as the leading byte of the escape
   sequence, the HZ representation eliminates the need to reserve any
   non-printable ASCII characters, which are commonly used by
   application programs (as well as the system environment) for various
   control functions or other special signaling. Therefore, the HZ
   representation method described here poses the least probability of
   interfering with the host and network environment.  It is also
   convenient for applications to implement.

   The HZ representation method has been implemented in various Chinese
   software across computer hardware platforms. It has also been tested
   for more than two years over the USENET newsgroups alt.chinese.text
   and chinese.* for the transmission of Chinese texts over the
   Internet. The points of origin of those transferred Chinese texts
   are geographically scattered around the world and subject to the
   constraints of vastly different system and network environments.
   Therefore, such a test group may well represent a rather complete
   sample of the real Internet world. The successful test of the HZ
   representation method therefore builds confidence that it is well
   suited for transmitting multi-byte text messages over the Internet.

   Under the HZ representation, ASCII text remains as 7-bit characters,
   and therefore the HZ representation together with the 7-bit ASCII
   character set can be viewed as forming a superset of characters.

6. References

   [ASCII] American National Standards Institute, "Coded character set
   -- 7-bit American national standard code for information
   interchange", ANSI X3.4-1986.

   [GB 2312] Technical Administrative Bureau of P.R.China, "Coding of
   Chinese Ideogram Set for Information Interchange Basic Set",
   GB 2312-80.

   [Lee89] Lee, F., "HZ - A Data Format for Exchanging Files of
   Arbitrarily Mixed Chinese and ASCII characters", <A class=outsite href="http://www.faqs.org/rfcs/rfc1843.html">RFC 1843</A>,
   Stanford University, August 1995.

   [MIME1] Borenstein N., and N. Freed, "MIME (Multipurpose Internet
   Mail Extensions) Part One: Mechanisms for Specifying and Describing
   the Format of Internet Message Bodies", <A class=outsite href="http://www.faqs.org/rfcs/rfc1521.html">RFC 1521</A>, Bellcore, Innosoft,
   September 1993.

   [MIME2] Moore, K., "MIME (Multipurpose Internet Mail Extensions)
   Part Two: Message Header Extensions for Non-ASCII Text", <A class=outsite href="http://www.faqs.org/rfcs/rfc1522.html">RFC 1522</A>,
   University of Tennessee, September 1993.

   [<A class=outsite href="http://www.faqs.org/rfcs/rfc822.html">RFC822</A>] Crocker, D., "Standard for the Format of ARPA Internet
