rfc2130.txt

来自「RFC 的详细文档!」· 文本 代码 · 共 1,740 行 · 第 1/5 页

TXT
1,740
字号






Network Working Group                                        C. Weider
Request for Comments: 2130                                   Microsoft
Category: Informational                                     C. Preston
                                                       Preston & Lynch
                                                           K. Simonsen
                                                                 DKUUG
                                                         H. Alvestrand
                                                               UNINETT
                                                           R. Atkinson
                                                         Cisco Systems
                                                            M. Crispin
                                              University of Washington
                                                           P. Svanberg
                                                                   KTH
                                                            April 1997

              The Report of the IAB Character Set Workshop
                    held 29 February - 1 March, 1996

Status of this Memo

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.

Acknowledgments

   The authors would like to sincerely thank Information Sciences
   Institute (ISI), and in particular Joyce K. Reynolds for graciously
   hosting this event; Joe Kemp and Jeanine Yamazaki of ISI made sure
   the facilities met our needs.  We also wish to thank the Internet
   Society, which underwrote travel for participants who might not
   otherwise have been able to attend.  Of course, we also wish to thank
   the many experts who participated in the workshop and on the mailing
   list; a complete list of these people can be found in Appendix D.
   Bunyip Information Systems was kind enough to provide mailing list
   facilities for this work.

Table of Contents

   Abstract
   0:    Executive summary..........................................   2
   1:    Introduction...............................................   3
   2:    Character sets on the Internet -- the problem..............   3
   2.1:  Character set handling in existing protocols...............   4
   3:    Architectural model........................................   6
   3.1:  Segments defined...........................................   7
   3.2:  On the wire................................................   8



Weider, et. al.              Informational                      [Page 1]

RFC 2130             Character Set Workshop Report            April 1997


   3.3:  Determining which values of CCS, CES, and TES are used.....   9
   3.4:  Recommended Defaults.......................................  10
   3.5:  Guidelines for conversions between coded character sets....  13
   4:    Presentation issues........................................  14
   5:    Open issues................................................  14
   5.1:  Language tags..............................................  15
   5.2:  Public identifiers.........................................  16
   5.3:  Bi-directionality..........................................  16
   6:    Security Considerations....................................  16
   7:    Conclusions................................................  16
   8:    Recommendations............................................  17
   8.1:  To the IAB.................................................  17
   8.2:  For new Internet protocols.................................  18
   8.3:  For registration of new character sets.....................  18
   Appendix A: List of protocols affected by character set issues...  20
   Appendix B: Acronyms.............................................  23
   Appendix C: Glossary.............................................  24
   Appendix D: References...........................................  25
   Appendix E: Recommended reading..................................  27
   Appendix F: Workshop attendee list...............................  29
   Appendix G: Authors' Addresses...................................  30

Abstract

   This report details the conclusions of an IAB-sponsored invitational
   workshop held 29 February  - 1 March, 1996, to discuss the use of
   character sets on the Internet.  It motivates the need to have
   character set handling in Internet protocols which transmit text,
   provides a conceptual framework for specifying character sets,
   recommends the use of MIME tagging for transmitted text, recommends a
   default character set *without* stating that there is no need for
   other character sets, and makes a series of recommendations to the
   IAB, IANA, and the IESG for furthering the integration of the
   character set framework into text transmission protocols.

0: Executive summary

   The term 'Character Set' means many things to many people. Even the
   MIME registry of character sets registers items that have great
   differences in semantics and applicability. This workshop provides
   guidance to the IAB and IETF about the use of character sets on the
   Internet and provides a common framework for interoperability between
   the many characters in use there.

   The framework consists of four components: an architecture model,
   which specifies components necessary for on-the-wire transmission of
   text; recommendations for tagging transmitted (and stored) text;
   recommended defaults for each level of the model; and a set of



Weider, et. al.              Informational                      [Page 2]

RFC 2130             Character Set Workshop Report            April 1997


   recommendations to the IAB, IANA, and the IESG for furthering the
   integration of  this framework into text transmission protocols.

   The architectural model specifies 7 layers, of which only three are
   required for on-the-wire transmission. The Coded Character Set is a
   mapping from a set of abstract characters to a set of integers. The
   Character Encoding Scheme is a mapping from a Coded Character Set (or
   several) to a set of octets. The Transfer Encoding Syntax is a
   transformation applied to data which has been encoded using a
   Character Encoding Scheme to allow it to be transmitted. These layers
   should be specified in a transmitted text stream by using the MIME
   encoding mechanisms.

   This report recommends the use of ISO 10646 as the default Coded
   Character Set, and UTF-8 as the default Character Encoding Scheme in
   the creation of new protocols or new version of old protocols which
   transmit text. These defaults do not deprecate the use of other
   character sets when and where they are needed; they are simply
   intended to provide guidance and a specification for
   interoperability.

1:  Introduction

   This is the report of an IAB-sponsored invitational workshop on the
   use of Character Sets on the Internet, held 29 February - 1 March
   1996 at Information Sciences Institute (ISI) in Marina del Rey,
   California.  In addition, this report covers the discussion on the
   mailing list up to and slightly beyond the workshop itself.  The
   goals of this workshop were to provide guidance to the IAB and the
   IETF about the use of character sets on the Internet, and if possible
   a common framework for interoperability between the many character
   sets in use there.  Both goals were achieved.

2:  Character sets on the Internet - the problem

   The term 'character set' is typically applied to the contents of a
   wide variety of text transmission and display protocols used on the
   Internet.  Because the term is used to mean different things,
   confusion has arisen.  For example, the MIME registry of character
   sets [MIME] contains items that may differ greatly in their
   applicability and semantics in various Internet protocols.

   In addition, there is a vast profusion of different text encoding
   schemes in use on the Internet.  This per se is not a problem; each
   scheme has evolved to meet real needs.  However, information
   applications such as mail, directories, and the World Wide Web have
   each developed different techniques for dealing with the growing
   number of schemes.  A robust information architecture for the



Weider, et. al.              Informational                      [Page 3]

RFC 2130             Character Set Workshop Report            April 1997


   Internet requires as much interoperability between these techniques
   as possible.

2.1:  Related topics deemed out of scope for this workshop

   Successful display of plain text transmitted over the Internet
   requires a lot of information about the text itself, such as the
   underlying character set, language, and so forth.  An additional set
   of formatting information is needed if the receiving application
   wishes to use local (cultural) conventions when it presents the data
   to the user.  This formatting includes information, that provides the
   data necessary to format certain  types of textual data (dates,
   times, numbers and monetary notation) into a form which is familiar
   to the user.  The POSIX [POSIX] notation of locale encompasses
   language, coded character set and cultural conventions.

   To avoid unfruitful discussion, and to make the best use of the time
   available for the workshop, we declared the following  issues out of
   scope for the purposes of this workshop:

   -  glyphs
   -  sorting
   -  culture (e.g. do we present the American or British spelling?)
   -  user interface issues
   -  internal representation of textual data
   -  included characters (why aren't certain characters available in
          any character set?)
   -  locale (in the POSIX sense)
   -  font registration
   -  semantics
   -  user input/output issues
   -  Han unification issues

   There are some related issues which were included for discussion,
   most importantly the 'locale' components necessary for transport and
   identification of multilingual texts.

2.2:  Character Set handling in existing protocols

   One of the group's overriding concerns was that the framework
   developed for character set handling not break existing protocols.
   With that in mind, the way character sets are being used in existing
   protocols was examined.  See Appendix A for a list of those protocols
   and some recommendations for change.

2.2.1:  General comments

   The problem areas here fall into three main categories: protocols,



Weider, et. al.              Informational                      [Page 4]

RFC 2130             Character Set Workshop Report            April 1997


   identifiers, and data.

2.2.1.1:  Protocols

   The protocol machinery SHOULD NOT be changed; allowing, for instance,
   SMTP [SMTP] to use both MAIL FROM and POST FRA is dangerous to the
   protocols' stability.  However, many protocols carry error messages
   and other information that is intended for human consumption; it
   MIGHT be an advantage to allow these to be localized into a specific
   language and character set, rather than staying in English and US-
   ASCII [ASCII].  If this is done, new extensions should follow the
   framework outlined below.

2.2.1.2:  Identifiers.

   There is a strong statement of direction from the IAB, RFC 1958 [RFC
   1958],  which states:

        4.3 Public (i.e. widely visible) names should be in case
            independent ASCII.  Specifically, this refers to DNS names,
            and to protocol elements that are transmitted in text format.
            ...
        5.4 Designs should be fully international, with support for
            localization (adaptation to local character sets). In
            particular, there should be a uniform approach to character
            set tagging for information content.

   In protocols that up to now have used US-ASCII only, UTF-8 [UTF-8]
   forms a simple upgrade path; however, its use should be negotiated
   either by negotiating a protocol version or by negotiating charset
   usage, and a fallback to a US-ASCII compatible representation such as
   UTF-7 [UTF-7] MUST be available.

   The need for passing application data such as language on individual
   identifiers varies between applications; protocols SHOULD attempt to
   evaluate this need when designing mechanisms.  Applying the ASCII
   requirement for identifiers that are only used in a local context
   (such as private mailbox folder names) is both unrealistic and
   unreasonable; in such cases, methods for consistency in the handling
   of character set should be considered.

2.2.1.3:  Data

   Data that require character set handling includes text, databases,
   and HTML [HTML] pages, for example.  In these the support for
   multiple character sets and proper application information is
   absolutely vital, and MUST be supported.




Weider, et. al.              Informational                      [Page 5]

RFC 2130             Character Set Workshop Report            April 1997


2.3:  Architectural requirements

   To address the issues enumerated for this work, first an
   architectural model was created which establishes the components that
   are required to fully specify the transmission of textual data. Many
   of these components are already familiar to the users of encoding
   protocols such as MIME.  Not all of these are discussed in detail in
   this report; we restrict ourselves primarily to those components
   which are required to specify the 'on-the-wire' phase of text
   transmission.

   Mandating a single, all-encompassing character set would not fit well
   with the IETF philosophy of planning for architectural diversity.
   So, the best that can be done is to provide a common *framework* for
   identifying and using the multitude of character sets available on
   the Internet.  It would be an advantage if the total number of Coded
   Character Sets could be kept to a minimum.  This framework should
   meet the following requirements:

   -  it should not break existing protocols (because then the likelihood
        of deployment is very small),
   -  it should allow the use of character sets currently used on the
        Internet, and
   -  it should be relatively easy to build into new protocols.

3:  Architectural model

   The basic architectural model which guided our discussions is shown
   in below.  A distinction was made between those segments which were
   necessary to successfully transmit character set data on-the-wire and
   those needed to present that data to a user in a comprehensible
   manner.  The discussions were primarily restricted to those segments
   of the model which specify the 'on-the-wire' transmission of textual
   data.

   User interface issues: these are briefly discussed in Section 3.1.1.
        Layout
        Culture
        Locale
        Language
   On-the-wire: see section 3.2 for detailed discussion.
        Transfer Syntax
        Character Encoding Scheme
        Coded Character Set







Weider, et. al.              Informational                      [Page 6]

RFC 2130             Character Set Workshop Report            April 1997


3.1:  Segments defined

3.1:1:  User interface

3.1.1.1:  Layout

⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?