Network Working Group                                         F. Yergeau
Request for Comments: 2070                             Alis Technologies
Category: Standards Track                                       G. Nicol
                                            Electronic Book Technologies
                                                                G. Adams
                                                                Spyglass
                                                               M. Duerst
                                                    University of Zurich
                                                            January 1997


          Internationalization of the Hypertext Markup Language

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements. Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol. Distribution of this memo is unlimited.

Abstract

   The Hypertext Markup Language (HTML) is a markup language used to
   create hypertext documents that are platform independent. Initially,
   the application of HTML on the World Wide Web was seriously
   restricted by its reliance on the ISO-8859-1 coded character set,
   which is appropriate only for Western European languages. Despite
   this restriction, HTML has been widely used with other languages,
   using other coded character sets or character encodings, at the
   expense of interoperability.

   This document is meant to address the issue of the
   internationalization (i18n, i followed by 18 letters followed by n)
   of HTML by extending the specification of HTML and giving additional
   recommendations for proper internationalization support. A foremost
   consideration is to make sure that HTML remains a valid application
   of SGML, while enabling its use with all languages of the world.

Table of Contents

   1. Introduction ............................................  2
     1.1. Scope ...............................................  2
     1.2. Conformance .........................................  3
   2. The document character set ..............................  4
     2.1. Reference processing model ..........................  4
     2.2. The document character set ..........................  6
     2.3. Undisplayable characters ............................  8

Yergeau, et. al.            Standards Track                     [Page 1]

RFC 2070             HTML Internationalization              January 1997

   3. The LANG attribute ......................................  8
   4. Additional entities, attributes and elements ............  9
     4.1. Full Latin-1 entity set .............................  9
     4.2. Markup for language-dependent presentation .......... 10
   5. Forms ................................................... 16
     5.1. DTD additions ....................................... 16
     5.2. Form submission ..................................... 17
   6. External character encoding issues ...................... 18
   7. HTML public text ........................................ 20
     7.1. HTML DTD ............................................ 20
     7.2. SGML declaration for HTML ........................... 35
     7.3. ISO Latin 1 character entity set .................... 37
   8. Security Considerations ................................. 40
   Bibliography ............................................... 40
   Authors' Addresses ......................................... 43

1. Introduction

   The Hypertext Markup Language (HTML) is a markup language used to
   create hypertext documents that are platform independent. Initially,
   the application of HTML on the World Wide Web was seriously
   restricted by its reliance on the ISO-8859-1 coded character set,
   which is appropriate only for Western European languages. Despite
   this restriction, HTML has been widely used with other languages,
   using other coded character sets or character encodings, through
   various ad hoc extensions to the language [TAKADA].

   This document is meant to address the issue of the
   internationalization of HTML by extending the specification of HTML
   and giving additional recommendations for proper internationalization
   support. It is in good part based on a paper by one of the authors on
   multilingualism on the WWW [NICOL].
   A foremost consideration is to make sure that HTML remains a valid
   application of SGML, while enabling its use with all languages of
   the world.

   The specific issues addressed are the SGML document character set to
   be used for HTML, the proper treatment of the charset parameter
   associated with the "text/html" content type and the specification of
   some additional elements and entities.

1.1 Scope

   HTML has been in use by the World-Wide Web (WWW) global information
   initiative since 1990. This specification extends the capabilities of
   HTML 2.0 (RFC 1866), primarily by removing the restriction to the
   ISO-8859-1 coded character set [ISO-8859].

   HTML is an application of ISO Standard 8879:1986, Information
   Processing Text and Office Systems -- Standard Generalized Markup
   Language (SGML) [ISO-8879]. The HTML Document Type Definition (DTD)
   is a formal definition of the HTML syntax in terms of SGML. This
   specification amends the DTD of HTML 2.0 in order to make it
   applicable to documents encompassing a character repertoire much
   larger than that of ISO-8859-1, while still remaining SGML
   conformant.

   Both formal and actual development of HTML are advancing very fast.
   The features described in this document are designed so that they can
   (and should) be added to other forms of HTML besides that described
   in RFC 1866. Where indicated, attributes introduced here should be
   extended to the appropriate elements.

1.2 Conformance

   This specification changes slightly the conformance requirements of
   HTML documents and HTML user agents.

1.2.1 Documents

   All HTML 2.0 conforming documents remain conforming with this
   specification. However, the extensions introduced here make valid
   certain documents that would not be HTML 2.0 conforming, in
   particular those containing characters or character references
   outside of the repertoire of ISO 8859-1, and those containing markup
   introduced herein.

1.2.2. User agents

   In addition to the requirements of RFC 1866, the following
   requirements are placed on HTML user agents.

   To ensure interoperability and proper support for at least
   ISO-8859-1 in an environment where character encoding schemes other
   than ISO-8859-1 are present, user agents MUST correctly interpret the
   charset parameter accompanying an HTML document received from the
   network.

   Furthermore, conforming user-agents MUST at least parse correctly all
   numeric character references within the range of ISO 10646-1
   [ISO-10646].

   Conforming user-agents are required to apply the BIDI presentation
   algorithm if they display right-to-left characters. If there is no
   displayable right-to-left character in a document, there is no need
   to apply BIDI processing.

2. The document character set

2.1. Reference processing model

   This overview explains a reference processing model used for HTML,
   and in particular the SGML concept of a document character set. An
   actual implementation may widely differ in its internal workings from
   the model given below, but should behave as described to an outside
   observer.

   Because there are various widely differing encodings of text, SGML
   does not directly address how the sequence of characters that
   constitutes an SGML document in the abstract sense is encoded by
   means of a sequence of octets (or occasionally bit groups of another
   length than 8) in a concrete realization of the document such as a
   computer file. This encoding is called the external character
   encoding of the concrete SGML document, and it should be carefully
   distinguished from the document character set of the abstract HTML
   document.

   SGML views the characters as a single set (called a "character
   repertoire"), and a "code set" that assigns an integer number (known
   as "character number") to each character in the repertoire.
   The document character set declaration defines what each of the
   character numbers represents [GOLD90, p. 451]. In most cases, an SGML
   DTD and all documents that refer to it have a single document
   character set, and all markup and data characters are part of this
   set.

   HTML, as an application of SGML, does not directly address the
   question of the external character encoding. This is deferred to
   mechanisms external to HTML, such as MIME as used by the HTTP
   protocol or by electronic mail.

   For the HTTP protocol [RFC2068], the external character encoding is
   indicated by the "charset" parameter of the "Content-Type" field of
   the header of an HTTP response. For example, to indicate that the
   transmitted document is encoded in the "JUNET" encoding of Japanese
   [RFC1468], the header will contain the following line:

      Content-Type: text/html; charset=ISO-2022-JP

   The term "charset" in MIME is used to designate a character encoding,
   rather than merely a coded character set as the term may suggest. A
   character encoding is a mapping (possibly many-to-one) of sequences
   of octets to sequences of characters taken from one or more character
   repertoires.

   The HTTP protocol also defines a mechanism for the client to specify
   the character encodings it can accept. Clients and servers are
   strongly requested to use these mechanisms to assure correct
   transmission and interpretation of any document. Provisions that can
   be taken to help correct interpretation, even in cases where a server
   or client does not yet use these mechanisms, are described in
   section 6.

   Similarly, if HTML documents are transferred by electronic mail, the
   external character encoding is defined by the "charset" parameter of
   the "Content-Type" MIME header field [RFC2045], and defaults to
   US-ASCII in its absence.

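   As an illustration of the "charset" parameter discussed above, the
   following minimal Python sketch extracts the external character
   encoding from a Content-Type value. The function name
   external_charset and its "default" argument are hypothetical and are
   not part of this specification; the parsing itself relies on the
   standard library class email.message.Message. This is one possible
   approach under those assumptions, not a prescribed implementation.

```python
from email.message import Message
from typing import Optional


def external_charset(content_type: Optional[str],
                     default: Optional[str] = None) -> Optional[str]:
    """Return the lowercased "charset" parameter of a Content-Type value.

    `default` (a hypothetical argument) is returned when the header or
    the parameter is absent.
    """
    if content_type is None:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() unquotes and lowercases the parameter value.
    return msg.get_content_charset(failobj=default)


# HTTP response header taken from the example in the text.
print(external_charset("text/html; charset=ISO-2022-JP"))

# Mail message without a charset parameter: the caller supplies the
# US-ASCII default that the text gives for electronic mail.
print(external_charset("text/html", default="us-ascii"))
```

   The default is left to the caller because the fallback depends on
   the transport: the text above states US-ASCII for electronic mail,
   while no fallback is stated here for other transports.
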
   No mechanisms are currently standardized for indicating the external
   character encoding of HTML documents transferred by FTP or accessed
   in distributed file systems.

   In case any other way of transferring and storing HTML documents is
   defined or becomes popular, it is advised that similar provisions be
   made to clearly identify the character encoding used and/or to use a
   single/default encoding capable of representing the widest range of
   characters used in an international context.

   Whatever the external character encoding may be, the reference
   processing model translates it to the document character set
   specified in Section 2.2 before processing specific to SGML/HTML.
   The reference processing model can be depicted as follows:

    [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                           [manager]  [parser]
                               ^          |
                               |          |
                               +----------+

   The decoder is responsible for decoding the external representation
   of the resource to the document character set. The entity manager,
   the parser, and the application deal only with characters of the
   document character set. A display-oriented part of the application or
   the display machinery itself may again convert characters represented
   in the document character set to some other representation more
   suitable for their purpose. In any case, the entity manager, the
   parser, and the application, as far as character semantics are
   concerned, are using the HTML document character set only.

   An actual implementation may choose, or not, to translate the
   document into some encoding of the document character set as
   described above; the behaviour described by this reference processing
   model can be achieved otherwise. This subject is well out of the
   scope of this specification, however, and the reader is invited to
   consult the SGML standard [ISO-8879] or an SGML handbook [BRYAN88]
   [GOLD90] [VANH90] [SQ91] for further information.

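   The decoder stage of the model above can be sketched in Python as
   follows. This is only an illustration under stated assumptions:
   Python's str type (a sequence of Unicode code points) stands in for
   the document character set, and the function name decode_resource
   and the sample text are hypothetical. It shows that the same
   characters, carried in different external encodings, reach the later
   stages as the same character numbers.

```python
from typing import List


def decode_resource(octets: bytes, charset: str) -> List[int]:
    """Decoder stage: map externally encoded octets to character numbers.

    Assumption: Python's str (Unicode code points) stands in for the
    document character set.
    """
    return [ord(ch) for ch in octets.decode(charset)]


# The same two Japanese characters (U+65E5, U+672C) stored in three
# different external character encodings.
text = "\u65e5\u672c"
resources = {name: text.encode(name)
             for name in ("iso-2022-jp", "euc-jp", "shift_jis")}

for name, octets in resources.items():
    numbers = decode_resource(octets, name)
    # The octets differ per encoding; the character numbers do not.
    print(name, octets, [hex(n) for n in numbers])
```

   Everything after the decoder (entity manager, parser, application)
   would then operate on the list of character numbers only, without
   needing to know which external encoding was used.
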
   The most important consequence of this reference processing model is
   that numeric character references are always resolved with respect to
   the fixed document character set, and thus to the same characters,
   whatever the external encoding actually used. For an example, see
   Section 2.2.

2.2. The document character set

   The document character set, in the SGML sense, is the Universal
   Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended.
   Currently, this is code-by-code identical with the Unicode standard,
   version 1.1 [UNICODE].

      NOTE -- implementers should be aware that ISO 10646 is amended
      from time to time; 4 amendments have been adopted since the
      initial 1993 publication, none of which significantly affects
      this specification. A fifth amendment, now under consideration,
      will introduce incompatible changes to the standard: 6556 Korean
      Hangul syllables allocated between code positions 3400 and 4DFF
      (hexadecimal) will be moved to new positions (and 4516 new
      syllables added), thus making references to the old positions
      invalid. Since the Unicode consortium has already adopted a
      corresponding amendment for inclusion in the forthcoming Unicode
      2.0, adoption of DAM 5 is considered likely and implementers
      should probably consider the old code positions as already
      invalid. Despite this one-time change, the relevant standard
      bodies have committed themselves not to change any allocated code
      position in the future. To encode Korean Hangul irrespective of
      these changes, the conjoining Hangul Jamo in the range 1110-11F9
      can be used.

   The adoption of this document character set implies a change in the
   SGML declaration specified in the HTML 2.0 specification (section 9.5
   of [RFC1866]). The change amounts to removing the first BASESET
   specification and its accompanying DESCSET declaration, replacing
   them with the following declaration: