   Layout includes the elements needed for displaying text to the user,
   such as font selection, word-wrapping, etc.  It is similar to the
   'presentation' layer in the 7-layer ISO telecommunications model
   [ISO-7498].

3.1.1.2:  Culture

   Culture includes information about cultural preferences, which affect
   spelling, word choice, and so forth.

3.1.1.3:  Locale

   The locale component includes the information necessary to make
   choices about text manipulation which will present the text to the
   user in an expected format.  This information may include the display
   of date, time and monetary symbol preferences.  Notice that locale
   modifications are typically applied to a text stream before it is
   presented to the user, although they also are used to specify input
   formats.

3.1.1.4:  Language

   This component specifies the language of the transmitted text.  At
   times and in specific cases, language information may be required to
   achieve a particular level of quality for the purpose of displaying a
   text stream.  For example, UTF-8 encoded Han may require transmission
   of a language tag to select the specific glyphs to be displayed at a
   particular level of quality.

   Note that information other than language may be used to achieve the
   required level of quality in a display process.  In particular, a
   font tag is sufficient to produce identical results.  However, the
   association of a language with a specific block of text has
   usefulness far beyond its use in display.  In particular, as the
   amount of information available in multiple languages on the World
   Wide Web grows, it becomes critical to specify which language is in
   use in particular documents, to assist automatic indexing and
   retrieval of relevant documents.







Weider, et. al.              Informational                      [Page 7]

RFC 2130             Character Set Workshop Report            April 1997


   The term 'language tag' should be reserved for the short identifier
   of RFC 1766 [RFC-1766] that only serves to identify the language.
   While there may be other text attributes intimately associated with
   the language of the document, such as desired font or text direction,
   these should be specified with other identifiers rather than
   overloading the language tag.

3.2:  On the wire

   There are three segments of the model which are required for
   completely specifying the content of a transmitted text stream (with
   the occasional exception of the Language component, mentioned above).
   These components are:

   1)  Coded Character Set,
   2)  Character Encoding Scheme, and
   3)  Transfer Encoding Syntax.

   Each of these abstract components must be explicitly specified by the
   transmitter when the data is sent.  There may be instances of an
   implicit specification due to the protocol/standard being used (e.g.
   ANSI/NISO Z39.50).  Also, in MIME, the Coded Character Set and
   Character Encoding Scheme are specified by the charset parameter to
   the Content-Type header field, and Transfer Encoding Syntax is
   specified by the Content-Transfer-Encoding header field.
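
   The MIME labelling described above can be sketched with Python's
   standard email package.  The message below is hypothetical and is
   built only for illustration; the variable names are assumptions,
   not part of any specification.

```python
import base64
from email import message_from_string

# Hypothetical message, built here purely for illustration.
body = base64.b64encode("Gr\u00fc\u00dfe".encode("utf-8")).decode("ascii")
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + body + "\n"
)

msg = message_from_string(raw)
charset = msg.get_content_charset()      # labels CCS and CES together
tes = msg["Content-Transfer-Encoding"]   # labels the TES
octets = msg.get_payload(decode=True)    # undo the TES, giving encoded octets
text = octets.decode(charset)            # undo the CES, giving characters
print(charset, tes, text)
```

   The receiver never guesses here: both labels are read from the
   header fields before any octet of the body is interpreted.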

3.2.1:  Coded Character Set

   A Coded Character Set (CCS) is a mapping from a set of abstract
   characters to a set of integers.  Examples of coded character sets
   are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series
   [ISO-8859].
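
   As a small illustration of the character-to-integer mapping, the
   sketch below uses Python's built-in ord(), which returns the code
   point that ISO 10646 (and Unicode) assigns to a character.  The
   sample characters are arbitrary choices.

```python
# ord() gives the integer that ISO 10646 / Unicode assigns to a character.
samples = ("A", "\u00e9", "\u6f22")   # 'A', e-acute, a Han character
code_points = [ord(ch) for ch in samples]
for ch, cp in zip(samples, code_points):
    print(f"{ch!r} -> {cp} (U+{cp:04X})")

# US-ASCII and ISO-8859-1 assign the same integers to the characters
# they share with ISO 10646.
ascii_value = "A".encode("ascii")[0]
latin1_value = "\u00e9".encode("iso-8859-1")[0]
```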

3.2.2:  Character Encoding Scheme

   A Character Encoding Scheme (CES) is a mapping from a Coded Character
   Set or several coded character sets to a set of octets. Examples of
   Character Encoding Schemes are ISO 2022 [ISO-2022] and UTF-8 [UTF-8].
   A given CES is typically associated with a single CCS; for example,
   UTF-8 applies only to ISO 10646.
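
   The following sketch shows one CCS (ISO 10646) serialized to octets
   by two different encoding schemes.  UTF-16 is not discussed in this
   section and is used here only as a contrasting example.

```python
text = "A\u00e9\u6f22"   # 'A', e-acute, and a Han character

utf8_octets = text.encode("utf-8")
utf16_octets = text.encode("utf-16-be")

# Same characters and same code points, but different octet sequences.
print(utf8_octets.hex(" "))
print(utf16_octets.hex(" "))
```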

3.2.3:  Transfer Encoding Syntax

   It is frequently necessary to transform encoded text into a format
   which is transmissible by specific protocols.  The Transfer Encoding
   Syntax (TES) is a transformation applied to character data encoded
   using a CCS and possibly a CES to allow it to be transmitted.
   Examples of Transfer Encoding Syntaxes are Base64 Encoding [Base64],
   gzip encoding, and so forth.
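
   A minimal sketch of the two transfer encodings named above, using
   Python's standard base64 and gzip modules.  The sample text is an
   arbitrary choice; in both cases the transformation is applied to
   octets that have already been through the CCS and CES steps.

```python
import base64
import gzip

encoded = "Gr\u00fc\u00dfe".encode("utf-8")   # CCS and CES already applied

b64 = base64.b64encode(encoded)   # TES: Base64, yields 7-bit safe octets
gz = gzip.compress(encoded)       # TES: gzip, yields compressed octets

# Both transformations are reversible and leave the text untouched.
restored_from_b64 = base64.b64decode(b64)
restored_from_gz = gzip.decompress(gz)
print(b64)
```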

3.3:  Determining which values of CCS, CES, and TES are used

   To completely specify which CCS, CES, and TES are used in a specific
   text transmission, a consistent set of labels is needed.  Once the
   appropriate mechanisms have been selected, there are six techniques
   for attaching these labels to the data.

   The labels themselves are named and registered, either with IANA
   [IANA] or with some other registry.  Ideally, their definitions are
   retrievable from some registration authority.

   Labels may be determined in one of the following ways:

   -  Determined by guessing, where the receiver of the text has to
      guess the values of the CCS, CES, and TES.  For example: "I got
      this from Sweden so it's probably ISO-8859-1."  This is
      obviously not a reliable way to decode text.
   -  Determined by the standard, where the protocol used to transmit
      the data has made documented choices of CCS, CES, and TES in the
      standard.  Thus, the encodings used are known through the
      access protocol; for example, HTTP [HTTP] uses (but is not
      limited to) ISO-8859-1, and SMTP uses US-ASCII.
   -  Attached to the transfer envelope, where the descriptive labels are
      attached to the wrapper placed around the text for transport.
      MIME headers are a good example of this technique.
   -  Included in the data stream, where the data stream itself has
      been encoded in such a way as to signal the character set used.
      For example, ISO-2022 encodes the data with escape sequences to
      provide information on the character subset currently being used.
   -  Agreed by prior bilateral agreement, where some out-of-band
      negotiation has allowed the text transmitter and receiver to
      determine the CCS, CES, and TES for the transmitted text.
   -  Agreed to by negotiation during some phase, typically
      initialization of the protocol.
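
   The "included in the data stream" technique can be observed with
   Python's iso2022_jp codec, one member of the ISO 2022 family.  This
   is only a sketch; the sample string is an arbitrary choice.

```python
ESC = b"\x1b"

text = "abc \u65e5\u672c abc"          # ASCII, two kanji, ASCII
encoded = text.encode("iso2022_jp")

# The stream carries its own labels: one escape sequence switches to
# the kanji set and another switches back to ASCII.
escape_count = encoded.count(ESC)
is_seven_bit = all(octet < 0x80 for octet in encoded)
print(encoded)
```

   Because the signalling is in-band, a receiver that does not know
   the stream follows ISO 2022 sees the escape sequences as ordinary
   data, which is why an external label is still valuable.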

3.3.1:  Recommendations for value specification mechanisms

   While each of these techniques (with the exception of guessing) is
   useful in particular situations, interoperability requires a more
   consistent set of techniques.  Thus, we recommend that MIME
   registered values be used for all tagging of character sets and
   languages UNLESS there is an existing mechanism for determining the
   required information using one of the other techniques (except
   guessing).  This recommendation will require a fair bit of work on
   the part of protocol designers, implementors, the IETF, the IESG, and
   the IAB.

   However, it is important to point out that the MIME concept of
   'charset' in some cases cuts across several layers of components in
   our model.  While this can be accepted in existing registrations, we
   also recommend that the MIME registration procedure for character
   sets be modified to show how a proposed character set deals with the
   CCS and the CES.  Most 'charsets' have a well-defined CCS and CES;
   they merely need to be teased apart for the registration.

   There are a number of other recommendations, but these will be
   covered in the next sections.

3.4:  Recommended Defaults

   For a number of reasons, one cannot define a mandatory set of
   defaults for all Internet protocols.  There is a mass of current
   practice; future protocols are likely to have different purposes,
   which may determine their handling of text; and protocols may need
   specific variation support.  For example, in mail, text is a
   predominant data type and coded character sets then become a major
   issue for the protocol.  Also, since e-mail is ubiquitous and users
   expect to be able to send it to everyone, the mail protocols need to
   be quite adept at handling different character set encodings.  On the
   other hand, if strings are seldom used in a given protocol, there is
   no need to weigh the protocol down with a sophisticated apparatus for
   handling multiple character sets, assuming that the predefined
   character set can handle all the protocol's needs. This observation
   also applies to the specification techniques for character set
   parameters.  If only one character set encoding is needed, it can be
   made explicit in the protocol specification.  Protocols with a
   greater need for character set support will need a more elaborate
   specification technique.

3.4.1:  Clarity of specification

   We recommend that each protocol clearly specify what it is using for
   each of the layers of the transmission model.  Users (or clients)
   should never have to guess what the parameter is for a given layer.

3.4.2:  Default Coded Character Set

   The default Coded Character Set is the repertoire of ISO 10646.

3.4.3:  Default Character Encoding Scheme

   For text-oriented protocols, new protocols should use UTF-8, and
   protocols that have a backwards compatibility requirement should use
   the default of the existing protocol, e.g. US-ASCII for mail, and
   ISO-8859-1 for HTTP.  The recommended specification scheme is the
   MIME "charset" parameter, using IANA-registered "charset" values.
   The MIME specifications will need to be clarified to meet this model
   in the future.

   For other protocols, the default should be UTF-8 as this initially
   allows US-ASCII to be entered as-is, and enables the full repertoire
   of ISO 10646.
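
   Both properties claimed for UTF-8 can be checked directly.  The
   sketch below uses arbitrary sample strings.

```python
ascii_text = "Hello, world"
ascii_as_utf8 = ascii_text.encode("utf-8")
ascii_as_ascii = ascii_text.encode("ascii")

# Text outside US-ASCII is still representable, and the ASCII part of
# the string keeps its single-octet form.
mixed = "Hello, \u4e16\u754c"
mixed_octets = mixed.encode("utf-8")
print(ascii_as_utf8 == ascii_as_ascii, mixed_octets)
```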

   Some protocols, such as those descended from SGML [SGML], have other
   natural notations for characters outside their "natural" repertoire;
   for instance, HTML [HTML] allows the use of &#nnnn; to refer to any
   ISO 10646 character.  Note that this, like all other encodings that
   depend on "escape characters", redefines at least one character from
   the base character set for use as an indicator of "foreign"
   characters.  Use of this approach must be weighed very carefully.
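
   The cost of the escape-character approach can be seen with Python's
   standard html module.  This is a sketch with arbitrary sample
   values: the numeric reference reaches a character outside US-ASCII,
   but the ampersand itself can no longer be written literally.

```python
import html

decoded = html.unescape("&#28450;")       # decimal reference to U+6F22
hex_decoded = html.unescape("&#x6F22;")   # the same character in hex

# '&' has been redefined as the escape indicator, so a literal
# ampersand must itself be escaped.
escaped = html.escape("AT&T")
print(decoded, hex_decoded, escaped)
```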

3.4.4:  Default Transfer Encoding Syntax

   There is no recommended default for this level.  For plain text
   oriented protocols, the bytestream transport format should be 8-bit
   clean, possibly with normalization of end-of-line indicators.  Some
   special cases could be made for protocols that are not 8-bit clean,
   such as encoding the text for transport over 7-bit connections.  For
   binary data the same recommendation holds.  The specification
   technique should be defined either in the protocol, if only one way
   is permitted, or by use of MIME content-transfer-encoding (CTE)
   techniques, using IANA registered values.
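
   As one example of encoding text for a 7-bit connection, the sketch
   below uses quoted-printable, a registered MIME CTE value, through
   Python's standard quopri module.  The choice of quoted-printable
   and the sample text are assumptions made for illustration.

```python
import quopri

octets = "caf\u00e9".encode("utf-8")      # contains octets above 0x7F
seven_bit = quopri.encodestring(octets)   # every octet is now below 0x80

is_seven_bit = all(octet < 0x80 for octet in seven_bit)
restored = quopri.decodestring(seven_bit)
print(seven_bit)
```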

3.4.5:  Default Language

   There is no recommended default for the language level.  For human
   readable text, there should always be a way to specify the natural
   language.  The specification technique should be a MIME identifier
   with IANA-registered values for languages.  If headers are used, the
   header should be 'Content-Language'.

3.4.6:  Default Locale

   The default should be the POSIX locale.  The specification technique
   should use the Cultural register of CEN ENV 12005 [CEN] for the
   values.  If headers are used, the header should be 'Content-Locale'.

3.4.7:  Default Culture

   There is no recommended default for the Culture level.  The
   specification technique should be a MIME or MIME-like identifier
   (e.g. Content-Culture) and should use the Cultural register of CEN
   ENV 12005 for its values.

3.4.8:  Default Presentation

   There is no recommended default for the Presentation level.  The
   specification technique should be a MIME or MIME-like identifier
   (e.g.  Content-Layout) and use the glyph register of ISO 10036 and
   other registers for its values.

3.4.9:  Multiplexing

   In some cases, text transmission may require the use of a number of
   different values for a given parameter; for example, English
   annotation of Japanese text might well require shifting the
   Content-Language parameter.  The way to switch the value of
   parameters within a single body of text depends on the application.
   For instance, the HTML I18N [I18N] work defines a language attribute
   on most of its elements, including <SPAN>, <HTML>, and <BODY>, for
   the purpose of switching between different languages.  When only one
   value is needed, this value should be as general as possible, and
   specified in the protocol standard with reference to the IANA or
   other registry value.  All levels should be specified explicitly.

3.4.10:  Storage

   Because stored text may very well be stored without any of the
   additional information necessary for decoding, stored text SHOULD be
   tagged in a MIME compliant fashion.  This alleviates the problem of
   being unable to interpret text which has been stored for a long time,
   or text whose provenance is not available.

3.5:  Guidelines for conversions between coded character sets

   This section covers various algorithms to convert a source text S,
   encoded in the coded character set CCS(S), to a target text T,
   encoded in the coded character set CCS(T).

   Rep(X) is the character repertoire of coded character set X, i.e. the
   set of characters which can be represented with X.

3.5.1:  Exact conversion

   When Rep(CCS(S)) and Rep(CCS(T)) are equal or Rep(CCS(S)) is a subset
   of Rep(CCS(T)), exact conversion is possible; i.e. T is equal to S.
   The octets just need to be remapped.  The algorithm for performing
   this remapping is simple, if the IANA-registered definition tables
   for CCS(S) and CCS(T) are available.
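
   A minimal sketch of exact conversion follows.  It assumes Python's
   built-in codec tables stand in for the registered definition
   tables; exact_convert is a hypothetical helper name, not an
   established API.

```python
def exact_convert(source_octets: bytes, ccs_s: str, ccs_t: str) -> bytes:
    """Remap octets from ccs_s to ccs_t.

    Raises UnicodeError if an octet is invalid in ccs_s or a character
    is missing from the repertoire of ccs_t, i.e. when exact conversion
    is not possible.
    """
    characters = source_octets.decode(ccs_s, errors="strict")
    return characters.encode(ccs_t, errors="strict")


# Rep(ISO-8859-1) is a subset of Rep(ISO 10646), so this direction is exact.
s = b"na\xefve"                              # ISO-8859-1 octets
t = exact_convert(s, "iso-8859-1", "utf-8")

# The reverse direction fails for a character outside ISO-8859-1.
try:
    exact_convert("\u6f22".encode("utf-8"), "utf-8", "iso-8859-1")
    reverse_failed = False
except UnicodeEncodeError:
    reverse_failed = True
print(t, reverse_failed)
```

   Strict error handling matters here: substituting a replacement
   character would silently turn an exact conversion into a lossy one.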
