3.5.2: Approximate conversion

   In all other cases, any conversion creates a text T which differs
   from S. There are different principles for how this inevitable
   difference should be handled. A choice between them should be made,
   depending on the purpose and requirements of the conversion.

   Where possible, the client application should be given mechanisms to
   determine what has been done to the text.

3.5.2.1: Length-modifying conversion for human display

   When the length of the target text T is allowed to differ from the
   length of the source text S, one should use a conversion method in
   which each source character is converted to one or several target
   characters, using a best-resemblance criterion in the choice of the
   target character(s).

   Examples:

   LATIN CAPITAL LETTER [*]    -> AE
   COPYRIGHT SIGN [*]          -> (c)

3.5.2.2: Length-preserving conversion for human display

   Where the text T must be presented and the length of T cannot differ
   from the length of S, one should use a conversion method where each
   source character is converted to one target character, using a
   best-resemblance criterion in the choice of target character.

Weider, et. al.              Informational                     [Page 13]

RFC 2130             Character Set Workshop Report            April 1997

   Examples:

   LATIN CAPITAL LETTER [*]    -> A
   COPYRIGHT SIGN [*]          -> C

3.5.2.3: Conversion without data loss

   Where the conversion of the text S into T must be completely
   reversible, apply a Character Encoding Syntax or other reversible
   transformation method. This case is most frequently met in data
   storage requirements.

   Examples:

   LATIN CAPITAL LETTER [*]    -> &AE
   COPYRIGHT SIGN [*]          -> &(C

   An alternate method can be used if the size of Rep(CCS(T)) >= the
   size of Rep(CCS(S)): for each character in Rep(CCS(S)) which is not
   present in Rep(CCS(T)), define a mapping into a character in
   Rep(CCS(T)) which is not present in Rep(CCS(S)).
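
   As an illustration only, the alternate method just described might
   be sketched in Python as follows. The function names and the toy
   repertoires are assumptions made for this sketch and do not come
   from the workshop; the pairing of characters is arbitrary (sorted by
   code value), whereas a real mapping table would be chosen and
   published by whoever defines the conversion.

```python
def build_swap_mapping(rep_s, rep_t):
    """Pair each character found only in rep_s with one found only in rep_t.

    rep_s and rep_t are iterables of single characters standing in for
    Rep(CCS(S)) and Rep(CCS(T)). Returns (forward, backward) dictionaries.
    """
    only_s = sorted(set(rep_s) - set(rep_t))  # needs a substitute in T
    only_t = sorted(set(rep_t) - set(rep_s))  # spare characters in T
    if len(only_t) < len(only_s):
        raise ValueError("target repertoire has too few spare characters")
    forward = dict(zip(only_s, only_t))
    backward = {t: s for s, t in forward.items()}
    return forward, backward


def convert(text, mapping):
    """Replace every character that has an entry in mapping; keep the rest."""
    return "".join(mapping.get(ch, ch) for ch in text)


# Hypothetical toy repertoires: both contain A, B, C; S also has
# U+00C6 and U+00A9, T also has U+0416 and U+2202.
REP_S = set("ABC") | {"\u00c6", "\u00a9"}
REP_T = set("ABC") | {"\u0416", "\u2202"}

forward, backward = build_swap_mapping(REP_S, REP_T)
source = "CAB\u00c6\u00a9"
target = convert(source, forward)      # every character is now in REP_T
restored = convert(target, backward)   # the reverse mapping undoes it
```

   The round trip is lossless only because the source text never
   contains the substitute characters: they are, by construction,
   outside Rep(CCS(S)).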
   Examples:

   LATIN CAPITAL LETTER [*]    -> CYRILLIC CAPITAL LETTER [*]
   COPYRIGHT SIGN [*]          -> PARTIAL DIFFERENTIAL SIGN [*]

   Note that conversion without data loss requires redefining some
   member of T to indicate "the introduction of character data outside
   T". This effectively adds another level of CES on top of CES(T).

4: Presentation issues

   There are a number of considerations to make in selecting the base
   character set. One such consideration is the protocol's convenience
   to users with limited equipment (for example, only ISO 8859-1 or a
   keyboard without the ability to enter all the characters in ISO
   10646). Alternative representation should be considered for these
   users, both for input and output. Possible options for the
   representation of characters that cannot be displayed include
   transliteration (a la CEN/TC304 or ISO TC46/SC2), RFC 1345
   [RFC-1345] representative icons, or the WG2 short name (u+xxxx).

5: Open issues

   In addition to the issues declared out of scope and enumerated in
   section 2.1, the following issues are still open and will need to be
   addressed in other forums. These issues (language tags, public
   identifiers such as URL names, and bi-directionality) are briefly
   discussed below, as they repeatedly encroached on the discussion.

5.1: Language tags

   Although the workshop decided not to explicitly address the
   so-called "CJK issue", a few members felt it was necessary to have
   some mechanism to address the problem of correct Han character
   display in the ISO-10646 issue, and that saying that it was a "font
   issue" would not suffice.

   The "CJK issue" refers to the extended discussion about "Han
   unification", the use of a single ISO-10646 codepoint to represent
   multiple national variants of a Chinese (Han) character.
   ISO-10646 can map uniquely to any single CJK national character set,
   but in the absence of additional information an application cannot
   display an ISO-10646 text using the proper national variants for
   that text.

   It was agreed that language tags would be sufficient to disambiguate
   unified characters. There was not, in our opinion, a significant
   technical difference between the use of different coded character
   sets with overlapping codepoints, and a single coded character set
   with language tags. Either way, the application has sufficient
   information to display the text properly.

   It was observed that in contemporary usage of MIME charsets, the
   language is implied as well as the coded character set and the
   character encoding syntax. We agreed that this is excessive
   overloading of MIME charsets.

   To specify the language used in a particular block of text, we
   recommend that the MIME tag "Content-Language" be used. There are a
   number of questions about this approach that need to be worked out,
   however:

   - Is Content-Language: actually suitable?

   - Is there an overload between this function and the other intended
     functions of Content-Language: as described in RFC 1766?

   - What, precisely, does "Content-Language: zh-tw, ja, ko, zh-cn"
     mean in this context?

   We believe it means that, in drawing a Han character, the Taiwanese
   variant (presumably traditional Han) is preferred, followed by the
   Japanese, Korean, and mainland Chinese (presumably simplified Han)
   variants. It does *NOT* mean "mixed text containing Taiwanese,
   Japanese, Korean, and mainland Chinese text with all the national
   variants in each of these".

   Mixed CJK text that simultaneously displays different variants
   occupying the same codepoint requires language tags embedded in the
   data. Ohta and Handa propose in RFC 1554 [RFC-1554] a MIME charset
   using ISO-2022 shifts between multiple coded character sets; in
   effect this is an encoding that uses coded character sets for
   displaying the appropriate glyphs.

   There is some speculation that mixed CJK text is relatively
   infrequent, and that therefore it is acceptable to require that such
   text be represented using a rich text format that can support
   language tags. In other words, a simplifying assumption can be made
   for TEXT/PLAIN in email using ISO-10646 that will not require
   multiple display representations for the same codepoint. A mechanism
   such as RFC 1554 could address this need if it were important;
   although arguably RFC 1554 should really be identified as
   TEXT/ISO-2022.

   Note again that we recommend that support for language tagging
   SHOULD be built into new protocols, as this will become a critical
   component of the automated indexing and retrieval in information
   applications of the future.

5.2: Public identifiers

   There is a considerable demand from the user community for the
   ability to use non-ASCII characters in URL names, IMAP mailbox
   names, file names, and other public identifiers. This is still an
   open problem.

5.3: Bi-directionality

   It was realized that a consistent framework for bi-directional text
   was needed, but there was no attempt to work on it in this workshop.

6: Security Considerations

   There are no security considerations associated with character sets.

7: Conclusions

   This paper provides a conceptual framework and a set of
   recommendations which, if adopted, should provide a solid foundation
   for interoperability on the Internet. There are, however, a number
   of open issues which will need to be addressed to provide ever
   better use of text on the Internet.

8: Recommendations

8.1: To the IAB

   There were a number of recommendations to the IAB about making the
   standards process more aware of the need for character set
   interoperability, and about the framework itself.

   A: The IAB should trigger the examination of all RFCs to determine
      the way they handle character sets, and obsolete or annotate the
      RFCs where necessary.

   B: The IESG should trigger the recommendation of procedures to the
      RFC editor to encourage RFCs to specify character set handling if
      they specify the transmission of text.

   C: The IAB should trigger the production of a perspectives document
      on the character set work that has gone on in the past and relate
      it to the current framework.

   D: Full ISO 10646 has a sufficiently broad repertoire, and scope for
      further extension, that it is sufficient for use in Internet
      Protocols (without excluding the use of existing alternatives).
      There is no need for specific development of character set
      standards for the Internet.

   E: The IAB should encourage the IRTF to create a research group to
      explore the open issues of character sets on the Internet. This
      group should set its sights much higher than this workshop did.

   F: The IANA (perhaps with the help of an IETF or IRTF group) should
      develop procedures for the registration of new character sets for
      use in the Internet.

   G: Register UTF-8 as a Character Encoding Scheme for MIME.

   H: The current use of the "x-*" format for distinguishing
      experimental tags should be continued for private use among
      consenting parties. All other namespaces should be allocated by
      IANA.

   I: Application protocol RFCs SHOULD include a section on
      "multilingual Considerations".

   J: Application Protocol RFCs SHOULD indicate how to transfer 'on the
      wire' all characters in the character sets they use. They SHOULD
      also specify how to transfer other information that applications
      may need to know about the data.

   K: The IESG should trigger a set of extensions to RFC 1522 to allow
      language tagging of the free text parts of message headers.

8.2: For new Internet protocols

   New protocols do not suffer from the need to be compatible with old
   7-bit pipes. New protocol specifications SHOULD use ISO 10646 as the
   base charset unless there is an overriding need to use a different
   base character set.

   New protocols SHOULD use values from the IANA registries when
   referring to parameter values. The way these values are carried in
   the protocols is protocol dependent; if the protocol uses
   RFC-822-like headers, the header names already in use SHOULD be
   used.

   For protocols with only a single choice for each component, the
   protocol should use the most general specification and should be
   specified with reference to the registered value in the protocol
   standard.

   Protocols SHOULD tag text streams with the language of the text.

8.3: For the registration of new character sets

   Ned Freed will be releasing a new MIME registration document in
   conjunction with this paper.

8.3.1: A definition table for a coded character set

   A definition table for a coded character set A must, for each
   character C that is in the repertoire of A, give:

   a) if C is present in ISO 10646, the code value (in hexadecimal
      form) for that character.

   b) if C is not present in ISO 10646, but may be constructed using
      ISO 10646 combining characters, the series of code values (in
      hexadecimal form) used to construct that character.

   c) if C is not present in ISO 10646, a textual description of the
      character, and a reference to its origin.

8.3.2: A definition of a character encoding scheme

   A definition of a character encoding scheme consists of:

   - A description of an algorithm which transforms every possible
     sequence of octets to either a sequence of pairs <CCS, code value>
     or to the error state "illegal octet sequence"

   - Specifications, either by reference to CCS's registered by IANA or
     in text, of each CCS upon which this CES is based.
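
   To make the first requirement concrete, here is a minimal Python
   sketch of such an algorithm for a CES built on a single CCS, using
   UTF-8 as the example. It leans on Python's built-in strict UTF-8
   decoder rather than spelling out the octet-level rules. The function
   name, the "ISO-10646" label and the string used for the error state
   are assumptions of this sketch, not part of any registration
   procedure.

```python
from typing import List, Tuple, Union

ILLEGAL = "illegal octet sequence"   # the single error state
CCS_NAME = "ISO-10646"               # label chosen for this sketch


def decode_utf8_ces(octets: bytes) -> Union[List[Tuple[str, int]], str]:
    """Map any octet sequence to <CCS, code value> pairs or the error state.

    Every input produces exactly one of the two outcomes, so the
    function is total over octet sequences.
    """
    try:
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return ILLEGAL
    return [(CCS_NAME, ord(ch)) for ch in text]


# A well-formed sequence: "A" followed by the two-octet form of U+00C6.
pairs = decode_utf8_ces(b"A\xc3\x86")

# A truncated sequence: a lead octet with no continuation octet.
error = decode_utf8_ces(b"\xc3")
```

   A CES that switches between several coded character sets (for
   example by ISO-2022 shifts) would need to track the active CCS and
   emit its name in each pair; that case is not covered here.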