Layout includes the elements needed for displaying text to the user,
such as font selection, word-wrapping, etc. It is similar to the
'presentation' layer in the 7-layer ISO telecommunications model
[ISO-7498].
3.1.1.2: Culture
Culture includes information about cultural preferences, which affect
spelling, word choice, and so forth.
3.1.1.3: Locale
The locale component includes the information necessary to make
choices about text manipulation which will present the text to the
user in an expected format. This information may include the display
of date, time and monetary symbol preferences. Notice that locale
modifications are typically applied to a text stream before it is
presented to the user, although they also are used to specify input
formats.
3.1.1.4: Language
This component specifies the language of the transmitted text. At
times and in specific cases, language information may be required to
achieve a particular level of quality for the purpose of displaying a
text stream. For example, UTF-8 encoded Han may require transmission
of a language tag to select the specific glyphs to be displayed at a
particular level of quality.
Note that information other than language may be used to achieve the
required level of quality in a display process. In particular, a
font tag is sufficient to produce identical results. However, the
association of a language with a specific block of text has
usefulness far beyond its use in display. In particular, as the
amount of information available in multiple languages on the World
Wide Web grows, it becomes critical to specify which language is in
use in particular documents, to assist automatic indexing and
retrieval of relevant documents.
Weider, et. al. Informational [Page 7]
RFC 2130 Character Set Workshop Report April 1997
The term 'language tag' should be reserved for the short identifier
of RFC 1766 [RFC-1766] that only serves to identify the language.
While there may be other text attributes intimately associated with
the language of the document, such as desired font or text direction,
these should be specified with other identifiers rather than
overloading the language tag.
3.2: On the wire
There are three segments of the model which are required for
completely specifying the content of a transmitted text stream (with
the occasional exception of the Language component, mentioned above).
These components are:
1) Coded Character Set,
2) Character Encoding Scheme, and
3) Transfer Encoding Syntax.
Each of these abstract components must be explicitly specified by the
transmitter when the data is sent. There may be instances of an
implicit specification due to the protocol/standard being used (e.g.
ANSI/NISO Z39.50). Also, in MIME, the Coded Character Set and
Character Encoding Scheme are specified by the Charset parameter to
the Content-Type header field, and Transfer Encoding Syntax is
specified by the Content-Transfer-Encoding header field.
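
As an illustration of the MIME labelling described above, the following
minimal sketch reads both labels from a message and reverses the layers
in order. It assumes Python's standard `email` package; the sample
message and its body text are hypothetical and exist only for this
example.

```python
import base64
from email import message_from_bytes

# Hypothetical message built only for this illustration.
text = "Smörgås"
body = base64.b64encode(text.encode("iso-8859-1"))
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + body + b"\r\n"
)

msg = message_from_bytes(raw)
charset = msg.get_content_charset()       # labels CCS and CES together
tes = msg["Content-Transfer-Encoding"]    # labels the TES
octets = msg.get_payload(decode=True)     # undo the TES, leaving CES octets
decoded = octets.decode(charset)          # undo CES and CCS, leaving characters
print(charset, tes, decoded)
```

The receiver never has to guess: each layer is undone using a label the
transmitter supplied.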
3.2.1: Coded Character Set
A Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers. Examples of coded character sets
are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series
[ISO-8859].
3.2.2: Character Encoding Scheme
A Character Encoding Scheme (CES) is a mapping from a Coded Character
Set or several coded character sets to a set of octets. Examples of
Character Encoding Schemes are ISO 2022 [ISO-2022] and UTF-8 [UTF-8].
A given CES is typically associated with a single CCS; for example,
UTF-8 applies only to ISO 10646.
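
The distinction between the CCS and the CES can be sketched as follows.
The sketch assumes that Python's `str` type, which is defined in terms of
Unicode code points, is an acceptable stand-in for the ISO 10646 coded
character set; the sample string is arbitrary.

```python
text = "Aé漢"

# CCS step: each abstract character maps to an integer (its code point).
code_points = [ord(ch) for ch in text]

# CES step: the integers are serialized to octets. Two different CESs
# applied to the same CCS yield different octet sequences.
utf8_octets = text.encode("utf-8")
utf16_octets = text.encode("utf-16-be")

print([f"U+{cp:04X}" for cp in code_points])  # ['U+0041', 'U+00E9', 'U+6F22']
print(utf8_octets.hex(" "))                   # 41 c3 a9 e6 bc a2
print(utf16_octets.hex(" "))                  # 00 41 00 e9 6f 22
```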
3.2.3: Transfer Encoding Syntax
It is frequently necessary to transform encoded text into a format
which is transmissible by specific protocols. The Transfer Encoding
Syntax (TES) is a transformation applied to character data encoded
using a CCS and possibly a CES to allow it to be transmitted.
Examples of Transfer Encoding Syntaxes are Base64 Encoding [Base64],
gzip encoding, and so forth.
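
A minimal sketch of the TES layer, assuming Python's standard `base64`
and `gzip` modules as stand-ins for the two examples named above. The
sample text is arbitrary.

```python
import base64
import gzip

text = "Grüße"
encoded = text.encode("utf-8")          # CCS + CES: characters to octets

b64_wire = base64.b64encode(encoded)    # TES: octets to 7-bit-safe ASCII
gz_wire = gzip.compress(encoded)        # TES: octets to compressed octets

# The receiver reverses the TES first, then the CES.
from_b64 = base64.b64decode(b64_wire).decode("utf-8")
from_gz = gzip.decompress(gz_wire).decode("utf-8")
print(b64_wire.isascii(), from_b64 == text, from_gz == text)
```

Note that the TES operates on octets only; it carries no knowledge of
which characters those octets represent.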
3.3: Determining which values of CCS, CES, and TES are used
To completely specify which CCS, CES, and TES are used in a specific
text transmission, there needs to be a consistent set of labels for
specifying which CCS, CES, and TES are used. Once the appropriate
mechanisms have been selected, there are six techniques for attaching
these labels to the data.
The labels themselves are named and registered, either with IANA
[IANA] or with some other registry. Ideally, their definitions are
retrievable from some registration authority.
Labels may be determined in one of the following ways:
- Determined by guessing, where the receiver of the text has to
guess the values of the CCS, CES, and TES. For example: "I got
this from Sweden so it's probably ISO-8859-1." This is
obviously not a very foolproof way to decode text.
- Determined by the standard, where the protocol used to transmit
the data has made documented choices of CCS, CES, and TES in the
standard. Thus, the encodings used are known through the
access protocol; for example, HTTP [HTTP] uses (but is not
limited to) ISO-8859-1, and SMTP uses US-ASCII.
- Attached to the transfer envelope, where the descriptive labels are
attached to the wrapper placed around the text for transport.
MIME headers are a good example of this technique.
- Included in the data stream, where the data stream itself has
been encoded in such a way as to signal the character set used.
For example, ISO-2022 encodes the data with escape sequences to
provide information on the character subset currently being used.
- Agreed by prior bilateral agreement, where some out-of-band
negotiation has allowed the text transmitter and receiver to
determine the CCS, CES, and TES for the transmitted text.
- Agreed to by negotiation during some phase of the protocol,
typically initialization.
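
The "included in the data stream" technique can be observed directly.
The sketch below assumes Python's `iso2022_jp` codec, one member of the
ISO 2022 family, and an arbitrary sample string; it extracts the escape
sequences that signal each switch of character set.

```python
text = "ABC日本語"
data = text.encode("iso2022_jp")

# Collect the three-octet escape sequences embedded in the stream.
# Each one begins with ESC (0x1B) and announces the character set
# that applies to the octets following it.
escapes = [data[i:i + 3] for i, octet in enumerate(data) if octet == 0x1B]
print(escapes)
```

The first sequence switches from US-ASCII to a Japanese character set
and the second switches back, so the stream is self-describing without
any external label.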
3.3.1: Recommendations for value specification mechanisms
While each of these techniques (with the exception of guessing) is
useful in particular situations, interoperability requires a more
consistent set of techniques. Thus, we recommend that MIME
registered values be used for all tagging of character sets and
languages UNLESS there is an existing mechanism for determining the
required information using one of the other techniques (except
guessing). This recommendation will require a fair bit of work on
the part of protocol designers, implementors, the IETF, the IESG, and
the IAB.
However, it is important to point out that the MIME concept of
'charset' in some cases cuts across several layers of components in
our model. While this can be accepted in existing registrations, we
also recommend that the MIME registration procedure for character
sets be modified to show how a proposed character set deals with the
CCS and the CES. Most 'charsets' have a well defined CCS and CES;
they merely need to be teased apart for the registration.
There are a number of other recommendations, but these will be
covered in the next sections.
3.4: Recommended Defaults
For a number of reasons, one cannot define a mandatory set of
defaults for all Internet protocols. There is a mass of current
practice; future protocols are likely to have different purposes,
which may determine their handling of text; and protocols may need
specific variation support. For example, in mail, text is a
predominant data type and coded character sets then become a major
issue for the protocol. Also, since e-mail is ubiquitous and users
expect to be able to send it to everyone, the mail protocols need to
be quite adept at handling different character set encodings. On the
other hand, if strings are seldom used in a given protocol, there is
no need to weigh the protocol down with a sophisticated apparatus for
handling multiple character sets, assuming that the specified
character set can handle all the protocol's needs. This observation
also applies to the specification techniques for character set
parameters. If only one character set encoding is needed, it can be
made explicit in the protocol specification. Protocols with a
greater need for character set support will need a more elaborate
specification technique.
3.4.1: Clarity of specification
We recommend that each protocol clearly specify what it is using for
each of the layers of the transmission model. Users (or clients)
should never have to guess what the parameter is for a given layer.
3.4.2: Default Coded Character Set
The default Coded Character Set is the repertoire of ISO-10646.
3.4.3: Default Character Encoding Scheme
For text-oriented protocols, new protocols should use UTF-8, and
protocols that have a backwards compatibility requirement should use
the default of the existing protocol, e.g. US-ASCII for mail, and
ISO-8859-1 for HTTP. The recommended specification scheme is the
MIME "charset" specification, using the IANA "charset"
specifications. The MIME specifications will need to be clarified to
meet this model in the future.
For other protocols, the default should be UTF-8 as this initially
allows US-ASCII to be entered as-is, and enables the full repertoire
of ISO 10646.
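
The property that US-ASCII can be "entered as-is" can be checked with a
short sketch. It assumes Python's built-in codecs; the sample strings
are arbitrary.

```python
# US-ASCII text produces identical octets under both encodings.
ascii_text = "MAIL FROM:<user@example.com>"
assert ascii_text.encode("utf-8") == ascii_text.encode("us-ascii")

# Characters outside US-ASCII use multi-octet sequences whose octets all
# have the high bit set, so they never collide with ASCII octets.
extended = "naïve".encode("utf-8")
high_bit = [octet for octet in extended if octet >= 0x80]
print(len(ascii_text.encode("utf-8")), [hex(o) for o in high_bit])
```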
Some protocols, such as those descended from SGML [SGML], have other
natural notations for characters outside their "natural" repertoire;
for instance, HTML [HTML] allows the use of &#nnnn; to refer to any
ISO 10646 character. Note that this, like all other encodings that
depend on "escape characters", redefines at least one character from
the base character set for use as an indicator of "foreign"
characters. Use of this approach must be weighed very carefully.
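
The trade-off can be sketched with HTML numeric character references.
The example assumes Python's standard `html` module and the built-in
`xmlcharrefreplace` error handler; the sample strings are arbitrary.

```python
import html

# Encoding: characters outside US-ASCII become numeric character references.
octets = "price: 5 €".encode("ascii", errors="xmlcharrefreplace")

# Decoding: the reference is resolved back to the ISO 10646 character.
text = html.unescape(octets.decode("ascii"))

# The cost noted above: '&' now has a special meaning, so a literal
# ampersand must itself be escaped.
escaped = html.escape("AT&T")
print(octets, text, escaped)
```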
3.4.4: Default Transfer Encoding Syntax
There is no recommended default for this level. For plain text
oriented protocols, the bytestream transport format should be 8-bit
clean, possibly with normalization of end-of-line indicators. Some
special cases could be made for protocols that are not 8-bit clean,
such as encoding the text for transport over 7-bit connections. For
binary data, the same recommendation holds. The specification technique
should either be defined in the protocol, if only one way is
permitted, or by use of MIME content-transfer-encoding (CTE)
techniques, using IANA registered values.
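
For the 7-bit special case, a registered content-transfer-encoding such
as quoted-printable is one option. The following is a minimal sketch
assuming Python's standard `quopri` module; the sample text is arbitrary.

```python
import quopri

octets = "Grüße".encode("utf-8")       # 8-bit octets produced by the CES
wire = quopri.encodestring(octets)     # 7-bit-safe form for transport
restored = quopri.decodestring(wire)   # receiver reverses the encoding

print(max(wire) < 0x80, restored == octets)
```

On an 8-bit clean path this step is unnecessary, which is why no default
is recommended for this level.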
3.4.5: Default Language
There is no recommended default for the language level. For human
readable text, there should always be a way to specify the natural
language. The specification technique should be a MIME identifier
with IANA registered values for languages. If headers are used, the
header should be 'Content-Language'.
3.4.6: Default Locale
The default should be the POSIX locale. The specification technique
should use the Cultural register of CEN ENV 12005 [CEN] for the
values. If headers are used, the header should be 'Content-Locale'.
3.4.7: Default Culture
There is no recommended default for the Culture level. The
specification technique should be a MIME or MIME-like identifier
(e.g. Content-Culture) and should use the Cultural register of CEN
ENV 12005 for its values.
3.4.8: Default Presentation
There is no recommended default for the Presentation level. The
specification technique should be a MIME or MIME-like identifier
(e.g. Content-Layout) and use the glyph register of ISO 10036 and
other registers for its values.
3.4.9: Multiplexing
In some cases, text transmission may require the use of a number of
different values for a given parameter; for example, English
annotation of Japanese text might well require shifting the Content-
Language parameter. The way to switch the value of parameters within
a single body of text depends on the application. For instance, the
HTML I18N [I18N] work defines a language attribute on most of its
elements, including <SPAN>, <HTML>, and <BODY>, for the purpose of
switching between different languages. When only one value is
needed, this value should be as general as possible, and specified in
the protocol standard with reference to the IANA or other registry
value. All levels should be specified explicitly.
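
Language switching within one body of text can be sketched as follows.
The example assumes Python's standard `html.parser` module and a
hypothetical HTML fragment; the class name `LangRuns` is invented for
this illustration, and the sketch ignores complications such as
malformed markup.

```python
from html.parser import HTMLParser

class LangRuns(HTMLParser):
    """Collect (language, text) runs, inheriting lang from ancestors."""

    def __init__(self):
        super().__init__()
        self.open_elements = []   # stack of (tag, effective language)
        self.runs = []

    def current_lang(self):
        return self.open_elements[-1][1] if self.open_elements else None

    def handle_starttag(self, tag, attrs):
        # An element without its own lang attribute inherits its parent's.
        lang = dict(attrs).get("lang") or self.current_lang()
        self.open_elements.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching start tag.
        while self.open_elements:
            if self.open_elements.pop()[0] == tag:
                break

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.current_lang(), data.strip()))

doc = ('<html lang="ja"><body><p>日本語の本文 '
       '<span lang="en">English annotation</span></p></body></html>')
parser = LangRuns()
parser.feed(doc)
parser.close()
print(parser.runs)
```

The general value is set once on the outermost element, and only the
annotation overrides it.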
3.4.10: Storage
Because stored text may very well be stored without any of the
additional information necessary for decoding, stored text SHOULD be
tagged in a MIME compliant fashion. This alleviates the problem of
being unable to interpret text which has been stored for a long time,
or text whose provenance is not available.
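
One way to tag stored text in a MIME compliant fashion is to store it
together with its MIME headers. The sketch below assumes Python's
standard `email` package; the function names `tag_for_storage` and
`read_from_storage` are hypothetical, as is the sample text.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def tag_for_storage(text, charset="utf-8", language=None):
    """Wrap text in MIME headers so the stored octets describe themselves."""
    msg = EmailMessage()
    msg.set_content(text, charset=charset, cte="base64")
    if language:
        msg["Content-Language"] = language
    return msg.as_bytes()

def read_from_storage(stored):
    """Recover the labels and the text using only what was stored."""
    msg = message_from_bytes(stored, policy=policy.default)
    return msg.get_content_charset(), msg["Content-Language"], msg.get_content()

stored = tag_for_storage("Smörgåsbord", language="sv")
charset, language, text = read_from_storage(stored)
print(charset, language, text.rstrip("\n"))
```

Because the labels travel with the octets, a reader with no knowledge of
the text's origin can still decode it.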
3.5: Guidelines for conversions between coded character sets
This section covers various algorithms to convert a source text S,
encoded in the coded character set CCS(S), to a target text T,
encoded in the coded character set CCS(T).
Rep(X) is the character repertoire of coded character set X, i.e. the
set of characters which can be represented with X.
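
For single-octet coded character sets, Rep(X) can be enumerated
directly. The sketch below assumes Python's built-in codec tables as a
stand-in for the registered definition tables; the helper name
`repertoire` is hypothetical, and the approach does not apply to
multi-octet encodings.

```python
def repertoire(codec):
    """Rep(X) for a single-octet coded character set known to Python."""
    chars = set()
    for octet in range(256):
        try:
            chars.add(bytes([octet]).decode(codec))
        except UnicodeDecodeError:
            pass  # this octet is unassigned in the coded character set
    return chars

ascii_rep = repertoire("us-ascii")
latin1_rep = repertoire("iso-8859-1")
latin9_rep = repertoire("iso-8859-15")

print(ascii_rep <= latin1_rep)    # True: US-ASCII is a subset of ISO-8859-1
print(latin1_rep <= latin9_rep)   # False: ISO-8859-15 replaced some characters
```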
3.5.1: Exact conversion
When Rep(CCS(S)) and Rep(CCS(T)) are equal or Rep(CCS(S)) is a subset
of Rep(CCS(T)), exact conversion is possible; i.e. T is equal to S.
The octets just need to be remapped. The algorithm for performing
this remapping is simple if the IANA-registered definition tables
for CCS(S) and CCS(T) are available.
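
A minimal sketch of exact conversion, assuming Python's built-in codec
tables in place of the IANA-registered definition tables. The function
name `exact_convert` is hypothetical. Strict error handling makes the
conversion fail rather than silently substitute when a character falls
outside Rep(CCS(T)).

```python
def exact_convert(octets, source, target):
    """Convert S to T, failing loudly if a character is outside Rep(CCS(T))."""
    characters = octets.decode(source)   # look up each code in the source table
    return characters.encode(target)     # strict mode: no silent substitution

s = b"sm\xf6rg\xe5s"                      # "smörgås" in ISO-8859-1
t = exact_convert(s, "iso-8859-1", "utf-8")
print(t.decode("utf-8") == s.decode("iso-8859-1"))   # True: same characters

try:
    exact_convert("€".encode("utf-8"), "utf-8", "iso-8859-1")
except UnicodeEncodeError:
    print("not exact: the euro sign is outside Rep(ISO-8859-1)")
```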