Network Working Group                                       F. Yergeau
Request for Comments: 2070                           Alis Technologies
Category: Standards Track                                     G. Nicol
                                          Electronic Book Technologies
                                                              G. Adams
                                                              Spyglass
                                                             M. Duerst
                                                  University of Zurich
                                                          January 1997


         Internationalization of the Hypertext Markup Language

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Abstract

   The Hypertext Markup Language (HTML) is a markup language used to
   create hypertext documents that are platform independent.  Initially,
   the application of HTML on the World Wide Web was seriously
   restricted by its reliance on the ISO-8859-1 coded character set,
   which is appropriate only for Western European languages.  Despite
   this restriction, HTML has been widely used with other languages,
   using other coded character sets or character encodings, at the
   expense of interoperability.

   This document is meant to address the issue of the
   internationalization (i18n, i followed by 18 letters followed by n)
   of HTML by extending the specification of HTML and giving additional
   recommendations for proper internationalization support.  A foremost
   consideration is to make sure that HTML remains a valid application
   of SGML, while enabling its use with all languages of the world.

Table of Contents

   1. Introduction
     1.1. Scope
     1.2. Conformance
   2. The document character set
     2.1. Reference processing model
     2.2. The document character set
     2.3. Undisplayable characters

Yergeau, et. al.            Standards Track                     [Page 1]

RFC 2070               HTML Internationalization            January 1997

   3. The LANG attribute
   4. Additional entities, attributes and elements
     4.1. Full Latin-1 entity set
     4.2. Markup for language-dependent presentation
   5. Forms
     5.1. DTD additions
     5.2. Form submission
   6. External character encoding issues
   7. HTML public text
     7.1. HTML DTD
     7.2. SGML declaration for HTML
     7.3. ISO Latin 1 character entity set
   8. Security Considerations
   Bibliography
   Authors' Addresses

1.  Introduction

   The Hypertext Markup Language (HTML) is a markup language used to
   create hypertext documents that are platform independent.
   Initially, the application of HTML on the World Wide Web was
   seriously restricted by its reliance on the ISO-8859-1 coded
   character set, which is appropriate only for Western European
   languages.  Despite this restriction, HTML has been widely used with
   other languages, using other coded character sets or character
   encodings, through various ad hoc extensions to the language
   [TAKADA].

   This document is meant to address the issue of the
   internationalization of HTML by extending the specification of HTML
   and giving additional recommendations for proper internationalization
   support.  It is in good part based on a paper by one of the authors
   on multilingualism on the WWW [NICOL].  A foremost consideration is
   to make sure that HTML remains a valid application of SGML, while
   enabling its use with all languages of the world.

   The specific issues addressed are the SGML document character set to
   be used for HTML, the proper treatment of the charset parameter
   associated with the "text/html" content type and the specification of
   some additional elements and entities.

1.1 Scope

   HTML has been in use by the World-Wide Web (WWW) global information
   initiative since 1990.  This specification extends the capabilities
   of HTML 2.0 (RFC 1866), primarily by removing the restriction to the
   ISO-8859-1 coded character set [ISO-8859].

   HTML is an application of ISO Standard 8879:1986, Information
   Processing Text and Office Systems -- Standard Generalized Markup
   Language (SGML) [ISO-8879]. The HTML Document Type Definition (DTD)
   is a formal definition of the HTML syntax in terms of SGML.  This
   specification amends the DTD of HTML 2.0 in order to make it
   applicable to documents encompassing a character repertoire much
   larger than that of ISO-8859-1, while still remaining SGML
   conformant.

   Both formal and actual development of HTML are advancing very fast.
   The features described in this document are designed so that they can
   (and should) be added to other forms of HTML besides that described
   in RFC 1866. Where indicated, attributes introduced here should be
   extended to the appropriate elements.

1.2 Conformance

   This specification changes slightly the conformance requirements of
   HTML documents and HTML user agents.

1.2.1 Documents

   All HTML 2.0 conforming documents remain conforming with this
   specification.  However, the extensions introduced here make valid
   certain documents that would not be HTML 2.0 conforming, in
   particular those containing characters or character references
   outside of the repertoire of ISO 8859-1, and those containing markup
   introduced herein.

1.2.2. User agents

   In addition to the requirements of RFC 1866, the following
   requirements are placed on HTML user agents.

      To ensure interoperability and proper support for at least
      ISO-8859-1 in an environment where character encoding schemes
      other than ISO-8859-1 are present, user agents MUST correctly
      interpret the charset parameter accompanying an HTML document
      received from the network.

      Furthermore, conforming user-agents MUST at least parse correctly
      all numeric character references within the range of ISO 10646-1
      [ISO-10646].

      Conforming user-agents are required to apply the BIDI presentation
      algorithm if they display right-to-left characters.  If there is
      no displayable right-to-left character in a document, there is no
      need to apply BIDI processing.

2. The document character set

2.1. Reference processing model

   This overview explains a reference processing model used for HTML,
   and in particular the SGML concept of a document character set.
   An actual implementation may widely differ in its internal workings
   from the model given below, but should behave as described to an
   outside observer.

   Because there are various widely differing encodings of text, SGML
   does not directly address how the sequence of characters that
   constitutes an SGML document in the abstract sense are encoded by
   means of a sequence of octets (or occasionally bit groups of another
   length than 8) in a concrete realization of the document such as a
   computer file. This encoding is called the external character
   encoding of the concrete SGML document, and it should be carefully
   distinguished from the document character set of the abstract HTML
   document.  SGML views the characters as a single set (called a
   "character repertoire"), and a "code set" that assigns an integer
   number (known as "character number") to each character in the
   repertoire.  The document character set declaration defines what each
   of the character numbers represents [GOLD90, p. 451].  In most cases,
   an SGML DTD and all documents that refer to it have a single document
   character set, and all markup and data characters are part of this
   set.

   HTML, as an application of SGML, does not directly address the
   question of the external character encoding. This is deferred to
   mechanisms external to HTML, such as MIME as used by the HTTP
   protocol or by electronic mail.

   For the HTTP protocol [RFC2068], the external character encoding is
   indicated by the "charset" parameter of the "Content-Type" field of
   the header of an HTTP response. For example, to indicate that the
   transmitted document is encoded in the "JUNET" encoding of Japanese
   [RFC1468], the header will contain the following line:

   Content-Type: text/html; charset=ISO-2022-JP

   The term "charset" in MIME is used to designate a character encoding,
   rather than merely a coded character set as the term may suggest.
   A character encoding is a mapping (possibly many-to-one) of sequences
   of octets to sequences of characters taken from one or more character
   repertoires.

   The HTTP protocol also defines a mechanism for the client to specify
   the character encodings it can accept. Clients and servers are
   strongly requested to use these mechanisms to assure correct
   transmission and interpretation of any document. Provisions that can
   be taken to help correct interpretation, even in cases where a server
   or client do not yet use these mechanisms, are described in section
   6.

   Similarly, if HTML documents are transferred by electronic mail, the
   external character encoding is defined by the "charset" parameter of
   the "Content-Type" MIME header field [RFC2045], and defaults to
   US-ASCII in its absence.

   No mechanisms are currently standardized for indicating the external
   character encoding of HTML documents transferred by FTP or accessed
   in distributed file systems.

   In the case any other way of transferring and storing HTML documents
   are defined or become popular, it is advised that similar provisions
   be made to clearly identify the character encoding used and/or to use
   a single/default encoding capable of representing the widest range of
   characters used in an international context.

   Whatever the external character encoding may be, the reference
   processing model translates it to the document character set
   specified in Section 2.2 before processing specific to SGML/HTML.

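   As an informal illustration of this translation step (not part of
   the specification), the following Python sketch reads the "charset"
   parameter from a Content-Type value and converts the received octets
   into characters.  The function names, the sample octets and the
   default passed for the mail case are assumptions made for the
   example; Python's str type stands in for the document character set,
   and Python's codec registry stands in for the decoder.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value.

    Hypothetical helper: it relies on the standard library's MIME
    parameter parsing instead of hand-written string splitting.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    value = msg.get_param("charset")
    if isinstance(value, tuple):      # RFC 2231 form: (charset, lang, value)
        value = value[2]
    return value.lower() if value else default


def decode_to_document_charset(octets, content_type, default="us-ascii"):
    """Decoder stage: octets in the external encoding -> characters.

    The default applies only when no charset parameter is present;
    "us-ascii" mirrors the electronic mail case described above.
    """
    charset = charset_from_content_type(content_type, default)
    return octets.decode(charset)


# The same two characters (U+65E5, U+672C) in two external encodings.
sample = "\u65e5\u672c"
jis_octets = sample.encode("iso2022_jp")
euc_octets = sample.encode("euc_jp")

from_jis = [ord(c) for c in decode_to_document_charset(
    jis_octets, "text/html; charset=ISO-2022-JP")]
from_euc = [ord(c) for c in decode_to_document_charset(
    euc_octets, "text/html; charset=EUC-JP")]

# Different octets on the wire, identical character numbers afterwards.
assert jis_octets != euc_octets
assert from_jis == from_euc
```

   The point of the sketch is that every stage after the decoder sees
   the same character numbers, whichever external encoding the octets
   arrived in.
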
   The reference processing model can be depicted as follows:

    [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                           [manager]  [parser]
                                ^          |
                                |          |
                                +----------+

   The decoder is responsible for decoding the external representation
   of the resource to the document character set.  The entity manager,
   the parser, and the application deal only with characters of the
   document character set.  A display-oriented part of the application
   or the display machinery itself may again convert characters
   represented in the document character set to some other
   representation more suitable for their purpose. In any case, the
   entity manager, the parser, and the application, as far as character
   semantics are concerned, are using the HTML document character set
   only.

   An actual implementation may choose, or not, to translate the
   document into some encoding of the document character set as
   described above; the behaviour described by this reference processing
   model can be achieved otherwise.  This subject is well out of the
   scope of this specification, however, and the reader is invited to
   consult the SGML standard [ISO-8879] or an SGML handbook [BRYAN88]
   [GOLD90] [VANH90] [SQ91] for further information.

   The most important consequence of this reference processing model is
   that numeric character references are always resolved with respect to
   the fixed document character set, and thus to the same characters,
   whatever the external encoding actually used. For an example, see
   Section 2.2.

2.2. The document character set

   The document character set, in the SGML sense, is the Universal
   Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended.
   Currently, this is code-by-code identical with the Unicode standard,
   version 1.1 [UNICODE].

      NOTE -- implementers should be aware that ISO 10646 is amended
      from time to time; 4 amendments have been adopted since the
      initial 1993 publication, none of which significantly affects this
      specification.  A fifth amendment, now under consideration, will
      introduce incompatible changes to the standard: 6556 Korean Hangul
      syllables allocated between code positions 3400 and 4DFF
      (hexadecimal) will be moved to new positions (and 4516 new
      syllables added), thus making references to the old positions
      invalid.  Since the Unicode consortium has already adopted a
      corresponding amendment for inclusion in the forthcoming Unicode
      2.0, adoption of DAM 5 is considered likely and implementers
      should probably consider the old code positions as already
      invalid.  Despite this one-time change, the relevant standard
      bodies have committed themselves not to change any allocated code
      position in the future.  To encode Korean Hangul irrespective of
      these changes, the conjoining Hangul Jamo in the range 1110-11F9
      can be used.

   The adoption of this document character set implies a change in the
   SGML declaration specified in the HTML 2.0 specification (section 9.5
   of [RFC1866]).  The change amounts to removing the first BASESET
   specification and its accompanying DESCSET declaration, replacing
   them with the following declaration:
