Network Working Group                                       F. Yergeau
Request for Comments: 2070                           Alis Technologies
Category: Standards Track                                     G. Nicol
                                          Electronic Book Technologies
                                                              G. Adams
                                                              Spyglass
                                                             M. Duerst
                                                  University of Zurich
                                                          January 1997


         Internationalization of the Hypertext Markup Language

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Abstract

   The Hypertext Markup Language (HTML) is a markup language used to
   create hypertext documents that are platform independent.  Initially,
   the application of HTML on the World Wide Web was seriously
   restricted by its reliance on the ISO-8859-1 coded character set,
   which is appropriate only for Western European languages.  Despite
   this restriction, HTML has been widely used with other languages,
   using other coded character sets or character encodings, at the
   expense of interoperability.

   This document is meant to address the issue of the
   internationalization (i18n, i followed by 18 letters followed by n)
   of HTML by extending the specification of HTML and giving additional
   recommendations for proper internationalization support.  A foremost
   consideration is to make sure that HTML remains a valid application
   of SGML, while enabling its use with all languages of the world.

Table of Contents

   1.  Introduction .................................................. 2
     1.1. Scope ...................................................... 2
     1.2. Conformance ................................................ 3
   2. The document character set ..................................... 4
     2.1. Reference processing model ................................. 4
     2.2. The document character set ................................. 6
     2.3. Undisplayable characters ................................... 8



Yergeau, et. al.            Standards Track                     [Page 1]

RFC 2070               HTML Internationalization            January 1997


   3. The LANG attribute.............................................. 8
   4. Additional entities, attributes and elements ................... 9
     4.1. Full Latin-1 entity set .................................... 9
     4.2. Markup for language-dependent presentation ................ 10
   5. Forms ..........................................................16
     5.1. DTD additions ..............................................16
     5.2. Form submission ............................................17
   6. External character encoding issues .............................18
   7. HTML public text ...............................................20
     7.1. HTML DTD ...................................................20
     7.2. SGML declaration for HTML ..................................35
     7.3. ISO Latin 1 character entity set ...........................37
   8. Security Considerations.........................................40
   Bibliography ......................................................40
   Authors' Addresses ................................................43

1.  Introduction

   The Hypertext Markup Language (HTML) is a markup language used to
   create hypertext documents that are platform independent.  Initially,
   the application of HTML on the World Wide Web was seriously
   restricted by its reliance on the ISO-8859-1 coded character set,
   which is appropriate only for Western European languages.  Despite
   this restriction, HTML has been widely used with other languages,
   using other coded character sets or character encodings, through
   various ad hoc extensions to the language [TAKADA].

   This document is meant to address the issue of the
   internationalization of HTML by extending the specification of HTML
   and giving additional recommendations for proper internationalization
   support.  It is in good part based on a paper by one of the authors
   on multilingualism on the WWW [NICOL].  A foremost consideration is
   to make sure that HTML remains a valid application of SGML, while
   enabling its use with all languages of the world.

   The specific issues addressed are the SGML document character set to
   be used for HTML, the proper treatment of the charset parameter
   associated with the "text/html" content type and the specification of
   some additional elements and entities.

1.1 Scope

   HTML has been in use by the World-Wide Web (WWW) global information
   initiative since 1990.  This specification extends the capabilities
   of HTML 2.0 (RFC 1866), primarily by removing the restriction to the
   ISO-8859-1 coded character set [ISO-8859].

   HTML is an application of ISO Standard 8879:1986, Information
   Processing -- Text and Office Systems -- Standard Generalized Markup
   Language (SGML) [ISO-8879]. The HTML Document Type Definition (DTD)
   is a formal definition of the HTML syntax in terms of SGML.  This
   specification amends the DTD of HTML 2.0 in order to make it
   applicable to documents encompassing a character repertoire much
   larger than that of ISO-8859-1, while still remaining SGML
   conformant.

   Both formal and actual development of HTML are advancing very fast.
   The features described in this document are designed so that they can
   (and should) be added to other forms of HTML besides that described
   in RFC 1866. Where indicated, attributes introduced here should be
   extended to the appropriate elements.

1.2 Conformance

   This specification slightly changes the conformance requirements of
   HTML documents and HTML user agents.

1.2.1 Documents

   All HTML 2.0 conforming documents remain conforming with this
   specification.  However, the extensions introduced here make valid
   certain documents that would not be HTML 2.0 conforming, in
   particular those containing characters or character references
   outside of the repertoire of ISO 8859-1, and those containing markup
   introduced herein.

1.2.2 User agents

   In addition to the requirements of RFC 1866, the following
   requirements are placed on HTML user agents.

      To ensure interoperability and proper support for at least
      ISO-8859-1 in an environment where character encoding schemes
      other than ISO-8859-1 are present, user agents MUST correctly
      interpret the charset parameter accompanying an HTML document
      received from the network.

      Furthermore, conforming user-agents MUST at least parse correctly
      all numeric character references within the range of ISO 10646-1
      [ISO-10646].

      Conforming user-agents are required to apply the BIDI presentation
      algorithm if they display right-to-left characters.  If there is
      no displayable right-to-left character in a document, there is no
      need to apply BIDI processing.
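
   As a rough illustration of the last requirement, the following
   Python sketch tests whether a text contains any right-to-left
   character, so that BIDI processing can be skipped when it does not.
   Treating the Unicode bidirectional classes "R" and "AL" as the
   definition of a right-to-left character is an assumption of this
   sketch, as is the function name needs_bidi.  Python's unicodedata
   module also reflects a much later Unicode version than the one this
   memo refers to.

```python
import unicodedata

# Assumption: characters whose bidirectional class is "R" (right-to-left)
# or "AL" (Arabic letter) count as right-to-left characters.
RIGHT_TO_LEFT_CLASSES = {"R", "AL"}


def needs_bidi(text: str) -> bool:
    """Return True if the text contains at least one right-to-left character."""
    return any(
        unicodedata.bidirectional(ch) in RIGHT_TO_LEFT_CLASSES for ch in text
    )


# Latin-only text needs no BIDI processing; Hebrew or Arabic text does.
print(needs_bidi("Hello"))
print(needs_bidi("\u05e9\u05dc\u05d5\u05dd"))          # Hebrew letters
print(needs_bidi("abc \u0645\u0631\u062d\u0628\u0627"))  # Latin plus Arabic
```

   A user agent could run such a check once per document and invoke the
   full algorithm only when it returns True.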



2. The document character set

2.1. Reference processing model

   This overview explains a reference processing model used for HTML,
   and in particular the SGML concept of a document character set. An
   actual implementation may widely differ in its internal workings from
   the model given below, but should behave as described to an outside
   observer.

   Because there are various widely differing encodings of text, SGML
   does not directly address how the sequence of characters that
   constitutes an SGML document in the abstract sense is encoded by
   means of a sequence of octets (or occasionally bit groups of a
   length other than 8) in a concrete realization of the document such
   as a computer file. This encoding is called the external character
   encoding of the concrete SGML document, and it should be carefully
   distinguished from the document character set of the abstract HTML
   document.  SGML views a document character set as a set of
   characters (called a "character repertoire") together with a "code
   set" that assigns an integer number (known as "character number") to
   each character in the repertoire.  The document character set
   declaration defines what each of the character numbers represents
   [GOLD90, p. 451].  In most cases, an SGML DTD and all documents that
   refer to it have a single document character set, and all markup and
   data characters are part of this set.

   HTML, as an application of SGML, does not directly address the
   question of the external character encoding. This is deferred to
   mechanisms external to HTML, such as MIME as used by the HTTP
   protocol or by electronic mail.

   For the HTTP protocol [RFC2068], the external character encoding is
   indicated by the "charset" parameter of the "Content-Type" field of
   the header of an HTTP response. For example, to indicate that the
   transmitted document is encoded in the "JUNET" encoding of Japanese
   [RFC1468], the header will contain the following line:

   Content-Type: text/html; charset=ISO-2022-JP

   The term "charset" in MIME is used to designate a character encoding,
   rather than merely a coded character set as the term may suggest.  A
   character encoding is a mapping (possibly many-to-one) of sequences
   of octets to sequences of characters taken from one or more character
   repertoires.
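
   To make the role of the "charset" parameter concrete, the Python
   sketch below extracts it from a Content-Type value using the
   standard library's email.message.Message class.  The function name
   external_encoding and the caller-supplied default are assumptions of
   this sketch, not part of this specification.

```python
from email.message import Message
from typing import Optional


def external_encoding(content_type: str, default: Optional[str] = None) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, lowercased.

    The default is returned when no charset parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


# The header line from the example above.
print(external_encoding("text/html; charset=ISO-2022-JP"))
# No charset parameter: the caller's fallback is used.
print(external_encoding("text/html", default="us-ascii"))
```

   The returned name identifies a character encoding, so it can be
   handed directly to whatever decoder the implementation uses.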

   The HTTP protocol also defines a mechanism for the client to specify
   the character encodings it can accept. Clients and servers are
   strongly requested to use these mechanisms to ensure correct
   transmission and interpretation of any document. Provisions that can
   be taken to help correct interpretation, even in cases where a server
   or client does not yet use these mechanisms, are described in section
   6.

   Similarly, if HTML documents are transferred by electronic mail, the
   external character encoding is defined by the "charset" parameter of
   the "Content-Type" MIME header field [RFC2045], and defaults to
   US-ASCII in its absence.

   No mechanisms are currently standardized for indicating the external
   character encoding of HTML documents transferred by FTP or accessed
   in distributed file systems.

   In case any other ways of transferring and storing HTML documents
   are defined or become popular, it is advised that similar provisions
   be made to clearly identify the character encoding used and/or to use
   a single/default encoding capable of representing the widest range of
   characters used in an international context.

   Whatever the external character encoding may be, the reference
   processing model translates it to the document character set
   specified in Section 2.2 before processing specific to SGML/HTML.
   The reference processing model can be depicted as follows:

    [resource]->[decoder]->[entity ]->[ SGML ]->[application]->[display]
                           [manager]  [parser]
                                ^          |
                                |          |
                                +----------+

   The decoder is responsible for decoding the external representation
   of the resource to the document character set.  The entity manager,
   the parser, and the application deal only with characters of the
   document character set.  A display-oriented part of the application
   or the display machinery itself may again convert characters
   represented in the document character set to some other
   representation more suitable for their purpose. In any case, the
   entity manager, the parser, and the application, as far as character
   semantics are concerned, are using the HTML document character set
   only.
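
   The decoder stage can be sketched in Python as follows.  The sketch
   assumes that a Python str, being a sequence of Unicode code points,
   stands in for text in the document character set; the function name
   decode_resource is hypothetical.  It shows that the same abstract
   text, stored under several external character encodings, reaches the
   later stages as one and the same sequence of characters.

```python
def decode_resource(octets: bytes, charset: str) -> str:
    """Decoder stage: external character encoding -> document character set."""
    return octets.decode(charset)


# Three Japanese characters, written with escapes to keep this file ASCII.
text = "\u65e5\u672c\u8a9e"

# The octet sequences differ from one external encoding to the next...
encodings = ["iso-2022-jp", "euc-jp", "shift_jis", "utf-8"]
resources = {name: text.encode(name) for name in encodings}
print(len(set(resources.values())))  # number of distinct octet sequences

# ...but after decoding, every later stage sees the same characters.
decoded = {decode_resource(octets, name) for name, octets in resources.items()}
print(decoded == {text})
```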

   An actual implementation may or may not translate the document into
   some encoding of the document character set as described above; the
   behaviour described by this reference processing model can be
   achieved by other means.  This subject is well outside the scope of
   this specification, however, and the reader is invited to consult
   the SGML standard [ISO-8879] or an SGML handbook [BRYAN88] [GOLD90]
   [VANH90] [SQ91] for further information.

   The most important consequence of this reference processing model is
   that numeric character references are always resolved with respect to
   the fixed document character set, and thus to the same characters,
   whatever the external encoding actually used. For an example, see
   Section 2.2.
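
   A minimal Python sketch of this consequence follows.  It assumes
   decimal references of the form &#N; only, and again uses a Python
   str as a stand-in for text in the document character set; the
   function names are hypothetical.  The same reference, &#233;, is
   placed in a document encoded in ISO-8859-1 and in one encoded in
   ISO-8859-5, and resolves to the same character in both.

```python
import re

# Decimal numeric character references only (an assumption of this sketch).
_DECIMAL_REFERENCE = re.compile(r"&#([0-9]{1,7});")


def resolve_numeric_references(text: str) -> str:
    """Replace each &#N; with the character numbered N in the document
    character set, leaving out-of-range references untouched."""
    def replace(match: "re.Match[str]") -> str:
        number = int(match.group(1))
        if number > 0x10FFFF:
            return match.group(0)
        return chr(number)
    return _DECIMAL_REFERENCE.sub(replace, text)


def process(octets: bytes, charset: str) -> str:
    """Decode first, then resolve references against the fixed character set."""
    return resolve_numeric_references(octets.decode(charset))


latin = "caf&#233;".encode("iso-8859-1")
cyrillic = "\u043a\u0430\u0444\u0435 &#233;".encode("iso-8859-5")

# Both references yield U+00E9, whatever the external encoding.
print(ascii(process(latin, "iso-8859-1")))
print(ascii(process(cyrillic, "iso-8859-5")))

# By contrast, the raw octet 233 (hex E9) read as ISO-8859-5 is U+0449.
print(ascii(bytes([0xE9]).decode("iso-8859-5")))
```

   Resolving the reference against the external encoding instead would
   give a different character for each encoding, which is exactly what
   the reference processing model rules out.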

2.2. The document character set

   The document character set, in the SGML sense, is the Universal
   Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended.
   Currently, this is code-by-code identical with the Unicode standard,
   version 1.1 [UNICODE].

      NOTE -- implementers should be aware that ISO 10646 is amended
      from time to time; four amendments have been adopted since the
      initial 1993 publication, none of which significantly affects this
      specification.  A fifth amendment, now under consideration, will
      introduce incompatible changes to the standard: 6656 Korean Hangul
      syllables allocated between code positions 3400 and 4DFF
      (hexadecimal) will be moved to new positions (and 4516 new
      syllables added), thus making references to the old positions
      invalid.  Since the Unicode Consortium has already adopted a
      corresponding amendment for inclusion in the forthcoming Unicode
      2.0, adoption of DAM 5 is considered likely and implementers
      should probably consider the old code positions as already
      invalid.  Despite this one-time change, the relevant standards
      bodies have committed themselves not to change any allocated code
      position in the future.  To encode Korean Hangul irrespective of
      these changes, the conjoining Hangul Jamo in the range 1100-11F9
      (hexadecimal) can be used.
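
   As an illustration of the last sentence of the note, the Python
   sketch below spells one Hangul syllable with three conjoining Jamo
   taken from the range the note names.  The choice of syllable is
   arbitrary.  The composition step uses Unicode normalization form
   NFC, which is not part of this memo and is shown only to relate the
   Jamo sequence to the precomposed syllable found in later Unicode
   versions.

```python
import unicodedata

# Leading consonant, vowel and trailing consonant as conjoining Jamo.
jamo = "\u1112\u1161\u11ab"

for ch in jamo:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Assumption: NFC from a modern Unicode database, used only to show
# which precomposed syllable the Jamo sequence corresponds to.
composed = unicodedata.normalize("NFC", jamo)
print(f"U+{ord(composed):04X} {unicodedata.name(composed)}")
```

   References to the Jamo code positions are unaffected by the move of
   the precomposed syllables described in the note.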

   The adoption of this document character set implies a change in the
   SGML declaration specified in the HTML 2.0 specification (section 9.5
   of [RFC1866]).  The change amounts to removing the first BASESET
