Layout includes the elements needed for displaying text to the user, such as font selection, word-wrapping, etc. It is similar to the 'presentation' layer in the 7-layer ISO telecommunications model [ISO-7498].

3.1.1.2: Culture

Culture includes information about cultural preferences, which affect spelling, word choice, and so forth.

3.1.1.3: Locale

The locale component includes the information necessary to make choices about text manipulation which will present the text to the user in an expected format. This information may include the display of date, time and monetary symbol preferences. Notice that locale modifications are typically applied to a text stream before it is presented to the user, although they also are used to specify input formats.

3.1.1.4: Language

This component specifies the language of the transmitted text. At times and in specific cases, language information may be required to achieve a particular level of quality for the purpose of displaying a text stream. For example, UTF-8 encoded Han may require transmission of a language tag to select the specific glyphs to be displayed at a particular level of quality.

Note that information other than language may be used to achieve the required level of quality in a display process. In particular, a font tag is sufficient to produce identical results.

However, the association of a language with a specific block of text has usefulness far beyond its use in display. In particular, as the amount of information available in multiple languages on the World Wide Web grows, it becomes critical to specify which language is in use in particular documents, to assist automatic indexing and retrieval of relevant documents.

Weider, et. al.              Informational                     [Page 7]
RFC 2130          Character Set Workshop Report          April 1997

The term 'language tag' should be reserved for the short identifier of RFC 1766 [RFC-1766] that only serves to identify the language.
While there may be other text attributes intimately associated with the language of the document, such as desired font or text direction, these should be specified with other identifiers rather than overloading the language tag.

3.2: On the wire

There are three segments of the model which are required for completely specifying the content of a transmitted text stream (with the occasional exception of the Language component, mentioned above). These components are:

   1) Coded Character Set,
   2) Character Encoding Scheme, and
   3) Transfer Encoding Syntax.

Each of these abstract components must be explicitly specified by the transmitter when the data is sent. There may be instances of an implicit specification due to the protocol/standard being used (e.g. ANSI/NISO Z39.50). Also, in MIME, the Coded Character Set and Character Encoding Scheme are specified by the Charset parameter to the Content-Type header field, and Transfer Encoding Syntax is specified by the Content-Transfer-Encoding header field.

3.2.1: Coded Character Set

A Coded Character Set (CCS) is a mapping from a set of abstract characters to a set of integers. Examples of coded character sets are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series [ISO-8859].

3.2.2: Character Encoding Scheme

A Character Encoding Scheme (CES) is a mapping from a Coded Character Set or several coded character sets to a set of octets. Examples of Character Encoding Schemes are ISO 2022 [ISO-2022] and UTF-8 [UTF-8]. A given CES is typically associated with a single CCS; for example, UTF-8 applies only to ISO 10646.

3.2.3: Transfer Encoding Syntax

It is frequently necessary to transform encoded text into a format which is transmissible by specific protocols. The Transfer Encoding Syntax (TES) is a transformation applied to character data encoded using a CCS and possibly a CES to allow it to be transmitted.
Examples of Transfer Encoding Syntaxes are Base64 Encoding [Base64], gzip encoding, and so forth.

3.3: Determining which values of CCS, CES, and TES are used

To completely specify which CCS, CES, and TES are used in a specific text transmission, there needs to be a consistent set of labels for specifying which CCS, CES, and TES are used. Once the appropriate mechanisms have been selected, there are six techniques for attaching these labels to the data. The labels themselves are named and registered, either with IANA [IANA] or with some other registry. Ideally, their definitions are retrievable from some registration authority. Labels may be determined in one of the following ways:

- Determined by guessing, where the receiver of the text has to guess the values of the CCS, CES, and TES. For example: "I got this from Sweden so it's probably ISO-8859-1." This is obviously not a very foolproof way to decode text.

- Determined by the standard, where the protocol used to transmit the data has made documented choices of CCS, CES, and TES in the standard. Thus, the encodings used are known through the access protocol; for example, HTTP [HTTP] uses (but is not limited to) ISO-8859-1, and SMTP uses US-ASCII.

- Attached to the transfer envelope, where the descriptive labels are attached to the wrapper placed around the text for transport. MIME headers are a good example of this technique.

- Included in the data stream, where the data stream itself has been encoded in such a way as to signal the character set used. For example, ISO-2022 encodes the data with escape sequences to provide information on the character subset currently being used.

- Agreed by prior bilateral agreement, where some out-of-band negotiation has allowed the text transmitter and receiver to determine the CCS, CES, and TES for the transmitted text.

- Agreed to by negotiation during some phase, typically initialization of the protocol.

3.3.1: Recommendations for value specification mechanisms

While each of these techniques (with the exception of guessing) is useful in particular situations, interoperability requires a more consistent set of techniques. Thus, we recommend that MIME registered values be used for all tagging of character sets and languages UNLESS there is an existing mechanism for determining the required information using one of the other techniques (except guessing). This recommendation will require a fair bit of work on the part of protocol designers, implementors, the IETF, the IESG, and the IAB.

However, it is important to point out that the MIME concept of 'charset' in some cases cuts across several layers of components in our model. While this can be accepted in existing registrations, we also recommend that the MIME registration procedure for character sets be modified to show how a proposed character set deals with the CCS and the CES. Most 'charsets' have a well defined CCS and CES; they should merely be teased apart for the registration.

There are a number of other recommendations, but these will be covered in the next sections.

3.4: Recommended Defaults

For a number of reasons, one cannot define a mandatory set of defaults for all Internet protocols. There is a mass of current practice, future protocols are likely to have different purposes, which may determine their handling of text, and protocols may need specific variation support.

For example, in mail, text is a predominant data type and coded character sets then become a major issue for the protocol. Also, since e-mail is ubiquitous and users expect to be able to send it to everyone, the mail protocols need to be quite adept at handling different character set encodings.
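
As an illustration of the mail case, the following minimal sketch builds a message whose MIME headers label the layers described in 3.2: the charset parameter names the CCS and CES together, and Content-Transfer-Encoding names the TES. It assumes Python's standard email package, which is a choice made for this example and not something the workshop prescribes. The function name build_labeled_message and the sample text are hypothetical.

```python
import base64
from email.message import EmailMessage


def build_labeled_message(text, charset="utf-8", cte="base64"):
    """Return a text/plain message labeled with its charset and transfer encoding."""
    msg = EmailMessage()
    # set_content encodes the text with `charset` (CCS and CES), applies the
    # transfer encoding `cte` (TES), and records both labels in the headers.
    msg.set_content(text, charset=charset, cte=cte)
    return msg


msg = build_labeled_message("Smörgåsbord")

# Labels attached to the transfer envelope.
print(msg.get_content_charset())         # charset parameter of Content-Type
print(msg["Content-Transfer-Encoding"])  # TES label

# A receiver can reverse the layers using only the labels: first undo the
# TES, then decode the octets with the labeled charset.
octets = base64.b64decode(msg.get_payload())
print(octets.decode(msg.get_content_charset()))
```

Because both labels travel with the message, the receiver never has to guess, which is the property 3.3 asks of any labeling technique.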
On the other hand, if strings are seldom used in a given protocol, there is no need to weigh the protocol down with a sophisticated apparatus for handling multiple character sets, assuming that the predicated character set can handle all the protocol's needs.

This observation also applies to the specification techniques for character set parameters. If only one character set encoding is needed, it can be made explicit in the protocol specification. Protocols with a greater need for character set support will need a more elaborate specification technique.

3.4.1: Clarity of specification

We recommend that each protocol clearly specify what it is using for each of the layers of the transmission model. Users (or clients) should never have to guess what the parameter is for a given layer.

3.4.2: Default Coded Character Set

The default Coded Character Set is the repertoire of ISO-10646.

3.4.3: Default Character Encoding Scheme

For text-oriented protocols, new protocols should use UTF-8, and protocols that have a backwards compatibility requirement should use the default of the existing protocol, e.g. US-ASCII for mail, and ISO-8859-1 for HTTP. The recommended specification scheme is the MIME "charset" specification, using the IANA "charset" specifications. The MIME specifications will need to be clarified to meet this model in the future.

For other protocols, the default should be UTF-8 as this initially allows US-ASCII to be entered as-is, and enables the full repertoire of ISO 10646.

Some protocols, such as those descended from SGML [SGML], have other natural notations for characters outside their "natural" repertoire; for instance, HTML [HTML] allows the use of &#nnnn to refer to any ISO 10646 character.
Note that this, like all other encodings that depend on "escape characters", redefines at least one character from the base character set for use as an indicator of "foreign" characters. Use of this approach must be weighed very carefully.

3.4.4: Default Transport Encoding Scheme

There is no recommended default for this level.

For plain text oriented protocols, the bytestream transport format should be 8-bit clean, possibly with normalization of end-of-line indicators. Some special cases could be made for protocols that are not 8-bit clean, such as encoding it for transport over 7-bit connections.

For binary the same recommendation holds as above.

The specification technique should either be defined in the protocol, if only one way is permitted, or by use of MIME content-transfer-encoding (CTE) techniques, using IANA registered values.

3.4.5: Default Language

There is no recommended default for the language level. For human readable text, there should always be a way to specify the natural language.

The specification technique should be a MIME identifier with IANA registered values for languages. If headers are used, the header should be 'Content-Language'.

3.4.6: Default Locale

The default should be the POSIX locale.

The specification technique should use the Cultural register of CEN ENV 12005 [CEN] for the values. If headers are used, the header should be 'Content-Locale'.

3.4.7: Default Culture

There is no recommended default for the Culture level.

The specification technique should be a MIME or MIME-like identifier (e.g. Content-Culture) and should use the Cultural register of CEN ENV 12005 for its values.

3.4.8: Default Presentation

There is no recommended default for the Presentation level.

The specification technique should be a MIME or MIME-like identifier (e.g.
Content-Layout) and use the glyph register of ISO 10036 and other registers for its values.

3.4.9: Multiplexing

In some cases, text transmission may require the use of a number of different values for a given parameter; for example, English annotation of Japanese text might well require shifting the Content-Language parameter. The way to switch the value of parameters within a single body of text depends on the application. For instance, the HTML I18N [I18N] work defines a language attribute on most of its elements, including <SPAN>, <HTML>, and <BODY>, for the purpose of switching between different languages.

When only one value is needed, this value should be as general as possible, and specified in the protocol standard with reference to the IANA or other registry value. All levels should be specified explicitly.

3.4.10: Storage

Because stored text may very well be stored without any of the additional information necessary for decoding, stored text SHOULD be tagged in a MIME compliant fashion. This alleviates the problem of being unable to interpret text which has been stored for a long time, or text whose provenance is not available.

3.5: Guidelines for conversions between coded character sets

This section covers various algorithms to convert a source text S, encoded in the coded character set CCS(S), to a target text T, encoded in the coded character set CCS(T). Rep(X) is the character repertoire of coded character set X, i.e. the set of characters which can be represented with X.

3.5.1: Exact conversion

When Rep(CCS(S)) and Rep(CCS(T)) are equal or Rep(CCS(S)) is a subset of Rep(CCS(T)), exact conversion is possible; i.e. T is equal to S. The octets just need to be remapped. The algorithm for performing this remapping is simple, if the IANA-registered definition tables for CCS(S) and CCS(T) are available.
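
The remapping described in 3.5.1 can be sketched as follows. This is a minimal sketch under two assumptions: Python's built-in codecs stand in for the registered definition tables, and CCS(S) is a single-octet set. The function names are hypothetical. Note that the target is named by a Python codec, so for ISO 10646 it also fixes a CES (UTF-8 here).

```python
def decoding_table(ccs):
    """Map each assigned octet of a single-octet coded character set to its character."""
    table = {}
    for octet in range(256):
        try:
            table[octet] = bytes([octet]).decode(ccs)
        except UnicodeDecodeError:
            continue  # this octet is not assigned in the set
    return table


def remap_table(source, target):
    """Build the octet remapping from source to target.

    Raises ValueError when some character of Rep(CCS(S)) is missing from
    Rep(CCS(T)), i.e. when exact conversion is not possible.
    """
    remap = {}
    for octet, char in decoding_table(source).items():
        try:
            remap[octet] = char.encode(target)
        except UnicodeEncodeError as exc:
            raise ValueError(
                f"{char!r} is not in the repertoire of {target}"
            ) from exc
    return remap


def convert_exact(s, source, target):
    """Convert the octets of source text S into target text T by table lookup."""
    remap = remap_table(source, target)
    # An octet that is unassigned in the source set raises KeyError here.
    return b"".join(remap[octet] for octet in s)


s = "Malmö".encode("iso-8859-1")
t = convert_exact(s, "iso-8859-1", "utf-8")
print(t.decode("utf-8"))
```

Building the whole table up front is what enforces the subset condition: remap_table("iso-8859-1", "ascii") fails before any text is converted, because the ISO-8859-1 repertoire is not contained in that of US-ASCII.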