Yergeau, et. al.            Standards Track                    [Page 12]

RFC 2070               HTML Internationalization            January 1997


   In the DTD, the LANG and DIR attributes are grouped together in a
   parameter entity called attrs.  To parallel RFC 1942 [RFC1942], the
   ID and CLASS attributes are also included in attrs. The ID and CLASS
   attributes are required for use with style sheets, and RFC 1942
   defines them as follows:

   ID      Used to define a document-wide identifier. This can be used
           for naming positions within documents as the destination of
           a hypertext link. It may also be used by style sheets for
           rendering an element in a unique style. An ID attribute
           value is an SGML NAME token. NAME tokens are formed by an
           initial letter followed by letters, digits, "-" and "."
           characters. The letters are restricted to A-Z and a-z.

   CLASS   A space separated list of SGML NAME tokens. CLASS names
           specify that the element belongs to the corresponding named
           classes. It allows authors to distinguish different roles
           played by the same tag. The classes may be used by style
           sheets to provide different renderings as appropriate to
           these roles.

4.2.3. Cursive joining behaviour

   Markup is needed in some cases to force cursive joining behavior in
   contexts in which it would not normally occur, or to block it when it
   would normally occur.

   The zero-width joiner and non-joiner (&zwj; and &zwnj;) are used to
   control cursive joining behaviour.  For example, ARABIC LETTER HEH is
   used in isolation to abbreviate "Hijri" (the Islamic calendrical
   system); however, the initial form of the letter is desired, because
   the isolated form of HEH looks like the digit five as employed in
   Arabic script.  This is obtained by following the HEH with a
   zero-width joiner whose only effect is to provide context.
   In Persian texts, there are cases where a letter that normally would
   join a subsequent letter in a cursive connection does not.  Here a
   zero-width non-joiner is used.

4.2.4. Bidirectional text

   Many languages are written in horizontal lines from left to right,
   while others are written from right to left.  When both writing
   directions are present, one talks of bidirectional text (BIDI for
   short). BIDI text requires markup in special circumstances where
   ambiguities as to the directionality of some characters have to be
   resolved.  This markup affects the ability to render BIDI text in a
   semantically legible fashion.  That is, without this special BIDI
   markup, cases arise which would prevent *any* rendering whatsoever
   that reflected the basic meaning of the text. Plain text may contain
   BIDI markup in the form of special-purpose formatting characters.
   This is also possible in HTML, which includes the five BIDI-related
   formatting characters (202A - 202E) of ISO 10646.  As an alternative,
   HTML provides equivalent SGML markup.

   BIDI is a complex issue, and conversion of logical text sequences to
   display sequences has to be done according to the algorithm and
   character properties specified in [UNICODE]. Here, explanations are
   given only as far as they are needed to understand the necessity of
   the features introduced and to define their exact semantics.

   The Unicode BIDI algorithm is based on the individual characters of a
   text being stored in logical order, that is the order in which they
   are normally input and in which the corresponding sounds are normally
   spoken. To make rendering of logical order text possible, the
   algorithm assigns a directionality property to each character, e.g.
   Latin letters are specified to have a left-to-right direction, while
   Arabic and Hebrew characters have a right-to-left direction.

   The left-to-right and right-to-left marks (&lrm; and &rlm;) are used
   to disambiguate directionality of neutral characters. For example,
   when a double quote sits between an Arabic and a Latin letter, its
   direction is ambiguous; if a directional mark is added on one side
   such that the quotation mark is surrounded by characters of only one
   directionality, the ambiguity is removed. These characters are like
   zero width spaces which have a directional property (but no word/line
   break property).

   Nested embeddings of contra-directional text runs, due to nested
   quotations or to the pasting of text from one BIDI context to
   another, are also a case where the implicit directionality of
   characters is not sufficient, requiring markup.  Also, it is
   frequently desirable to specify the basic directionality of a block
   of text. For these purposes, the DIR attribute is used.

   On block-type elements, the DIR attribute indicates the base
   directionality of the text in the block; if omitted it is inherited
   from the parent element.  The default directionality of the overall
   HTML document is left-to-right.

   On inline elements, it makes the element start a new embedding level
   (to be explained below); if omitted the inline element does not start
   a new embedding level.

      NOTE -- the PRE, XMP and LISTING elements admit the DIR attribute.
      Their contents should not be considered as preformatted with
      respect to bidirectional layout, but the BIDI algorithm should be
      applied to each line of text.
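
   As a non-normative illustration of the directionality property
   discussed in this section, the following Python sketch looks up the
   bidirectional class that the Unicode Character Database (as shipped
   with the Python interpreter) assigns to a few characters.  The helper
   name bidi_class and the sample table are assumptions made for this
   example; the class labels come from current Unicode data, not from
   this specification.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (&lrm;)
RLM = "\u200f"  # RIGHT-TO-LEFT MARK (&rlm;)


def bidi_class(ch):
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


# Hypothetical sample set: strong, neutral and mark characters.
samples = {
    "Latin A": "A",
    "Hebrew alef": "\u05d0",
    "Arabic alef": "\u0627",
    "double quote": '"',
    "LRM": LRM,
    "RLM": RLM,
}

for label, ch in samples.items():
    print(f"{label:13} U+{ord(ch):04X} {bidi_class(ch)}")
```

   The double quote reports a neutral class, which is why its direction
   depends on its neighbours, whereas the two marks carry a strong
   direction of their own and can therefore settle the ambiguity.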

   Following is an example of a case where embedding is needed, showing
   its effect:

      Given the following Latin (upper case) and Arabic (lower case)
      letters in backing store with the specified embeddings:

      <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD </SPAN> zw
      </SPAN> EF </SPAN>

      One gets the following rendering (with [] showing the directional
      transitions):

      [ AB [ wz [ CD ] yx ] EF ]

      On the other hand, without this markup and with a base direction
      of LTR one gets the following rendering:

      [ AB [ yx ] CD [ wz ] EF ]

      Notice that yx is on the left and wz on the right unlike the above
      case where the embedding levels are used.  Without the embedding
      markup one has at most two levels: a base directional level and a
      single counterflow directional level.

   The DIR attribute on inline elements is equivalent to the formatting
   characters LEFT-TO-RIGHT EMBEDDING (202A) and RIGHT-TO-LEFT
   EMBEDDING (202B) of ISO 10646.  The end tag of the element is
   equivalent to the POP DIRECTIONAL FORMATTING (202C) character.

   Directional override, as provided by the BDO element, is needed to
   deal with unusual short pieces of text in which directionality cannot
   be resolved from context in an unambiguous fashion. For example, it
   can be used to force left-to-right (or right-to-left) display of part
   numbers composed of Latin letters, digits and Hebrew letters.

   The effect of BDO is to force the directionality of all characters
   within it to the value of DIR, irrespective of their intrinsic
   directional properties.  It is equivalent to using the LEFT-TO-RIGHT
   OVERRIDE (202D) or RIGHT-TO-LEFT OVERRIDE (202E) characters of ISO
   10646, the end tag again being equivalent to the POP DIRECTIONAL
   FORMATTING (202C) character.

      NOTE -- authors and authoring software writers should be aware
      that conflicts can arise if the DIR attribute is used on inline
      elements (including BDO) concurrently with the use of the
      corresponding ISO 10646 formatting characters.

      Preferably one or the other should be used exclusively; the markup
      method is better able to guarantee document structural integrity,
      and alleviates some problems when editing bidirectional HTML text
      with a simple text editor, but some software may be better suited
      to using the 10646 characters.  If both methods are used, great
      care should be exercised to ensure proper nesting of markup and
      directional embedding or override; otherwise, rendering results
      are undefined.

5. Forms

5.1. DTD additions

   It is natural to expect input in any language in forms, as they
   provide one of the only ways of obtaining user input. While this is
   primarily a UI issue, there are some things that should be specified
   at the HTML level to guide behavior and promote interoperability.

   To ensure full interoperability, it is necessary for the user agent
   (and the user) to have an indication of the character encoding(s)
   that the server providing a form will be able to handle upon
   submission of the filled-in form.  Such an indication is provided by
   the ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements,
   modeled on the HTTP Accept-Charset header (see [HTTP-1.1]), which
   contains a space and/or comma delimited list of character sets
   acceptable to the server.  A user agent may want to somehow advise
   the user of the contents of this attribute, or to restrict the
   user's ability to enter characters outside the repertoires of the
   listed character sets.
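
   As a non-normative sketch of how a user agent might read the
   ACCEPT-CHARSET attribute, the Python fragment below splits a value
   on spaces and/or commas and then looks for the first listed
   character set able to represent some entered text.  The function
   names are hypothetical, and the sketch assumes that the listed names
   happen to be codec names known to the Python runtime, which need not
   hold for every registered charset.

```python
import re


def parse_accept_charset(value):
    """Split an ACCEPT-CHARSET value on spaces and/or commas."""
    return [name for name in re.split(r"[,\s]+", value.strip()) if name]


def first_usable_charset(text, charsets):
    """Return the first listed charset able to encode text, else None."""
    for name in charsets:
        try:
            text.encode(name)
        except (UnicodeEncodeError, LookupError):
            # Either the text is outside the repertoire, or this Python
            # build does not know the charset name; try the next one.
            continue
        return name
    return None


accepted = parse_accept_charset("ISO-8859-1, ISO-2022-JP  UTF-8")
print(accepted)
print(first_usable_charset("caf\u00e9", accepted))
print(first_usable_charset("\u65e5\u672c\u8a9e", accepted))
```

   A result of None would be the point at which a user agent could warn
   the user or refuse the input, as suggested above.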

      NOTE -- The list of character sets is to be interpreted as an
      EXCLUSIVE-OR list; the server announces that it is ready to accept
      any ONE of these character encoding schemes for each part of a
      multipart entity.  The client may perform character encoding
      translation to satisfy the server if necessary.

      NOTE -- The default value for the ACCEPT-CHARSET attribute of an
      INPUT or TEXTAREA element is the reserved value "UNKNOWN".  A user
      agent may interpret that value as the character encoding scheme
      that was used to transmit the document containing that element.

5.2. Form submission

   The HTML 2.0 form submission mechanism, based on the
   "application/x-www-form-urlencoded" media type, is ill-equipped with
   regard to internationalization.  In fact, since URLs are restricted
   to ASCII characters, the mechanism is awkward even for ISO-8859-1
   text.  Section 2.2 of [RFC1738] specifies that octets may be encoded
   using the "%HH" notation, but text submitted from a form is composed
   of characters, not octets.  Lacking a specification of a character
   encoding scheme, the "%HH" notation has no well-defined meaning.

   The best solution is to use the "multipart/form-data" media type
   described in [RFC1867] with the POST method of form submission.  This
   mechanism encapsulates the value part of each name-value pair in a
   body-part of a multipart MIME body that is sent as the HTTP entity;
   each body part can be labeled with an appropriate Content-Type,
   including if necessary a charset parameter that specifies the
   character encoding scheme.  The changes to the DTD necessary to
   support this method of form submission have been incorporated in the
   DTD included in this specification.
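
   The contrast between the two mechanisms can be sketched in Python as
   follows.  This is an illustration only, not a complete [RFC1867]
   implementation: the function name build_multipart, the boundary
   string and the field name are assumptions, field names are assumed
   to be ASCII, and no attempt is made to check that the boundary does
   not occur in the data.

```python
from urllib.parse import quote


def build_multipart(fields, charset, boundary):
    """Encode name/value pairs as a multipart/form-data body (bytes).

    Every body part carries its own Content-Type with a charset
    parameter, so the receiver knows how to decode the value.
    """
    delimiter = b"--" + boundary.encode("ascii")
    body = b""
    for name, value in fields.items():
        headers = (
            f'Content-Disposition: form-data; name="{name}"\r\n'
            f"Content-Type: text/plain; charset={charset}\r\n\r\n"
        )
        body += delimiter + b"\r\n" + headers.encode("ascii")
        body += value.encode(charset) + b"\r\n"
    return body + delimiter + b"--\r\n"


# The same character yields different "%HH" forms under different
# encodings, which is why unlabelled URL-encoding is ambiguous.
print(quote("\u00e9", encoding="iso-8859-1"))
print(quote("\u00e9", encoding="utf-8"))

body = build_multipart({"city": "Montr\u00e9al"}, "iso-8859-1", "XyZ")
print(body.decode("iso-8859-1"))
```

   The two quote() calls produce different octet sequences for one and
   the same character, whereas the multipart body states its encoding
   explicitly in each part.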

   A less satisfactory solution is to add a MIME charset parameter to
   the "application/x-www-form-urlencoded" media type specifier sent
   along with a POST method form submission, with the understanding that
   the URL encoding of [RFC1738] is applied on top of the specified
   character encoding, as a kind of implicit Content-Transfer-Encoding.

   One problem with both solutions above is that current browsers do not
   generally allow for bookmarks to specify the POST method; this should
   be improved.  Conversely, the GET method could be used with the form
   data transmitted in the body instead of in the URL.  Nothing in the
   protocol seems to prevent it, but no implementations appear to exist
   at present.

   How the user agent determines the encoding of the text entered by the
   user is outside the scope of this specification.

      NOTE -- Designers of forms and their handling scripts should be
      aware of an important caveat: when the default value of a field
      (the VALUE attribute) is returned upon form submission (i.e. the
      user did not modify this value), it cannot be guaranteed to be
      transmitted as a sequence of octets identical to that in the
      source document -- only as a possibly different but valid encoding
      of the same sequence of text elements.  This may be true even if
      the encoding of the document containing the form and that used for
      submission are the same.

      Differences can occur when a sequence of characters can be
      represented by various sequences of octets, and also when a
      composite sequence (a base character plus one or more combining
      diacritics) can be represented by either a different but
      equivalent composite sequence or by a fully precomposed character.
      For instance, the UCS-2 sequence 00EA+0323 (LATIN SMALL LETTER E
      WITH CIRCUMFLEX ACCENT + COMBINING DOT BELOW) may be transformed
      into 1EC7 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT AND DOT
      BELOW), into 0065+0302+0323 (LATIN SMALL LETTER E + COMBINING
      CIRCUMFLEX ACCENT + COMBINING DOT BELOW), as well as into other
      equivalent composite sequences.

6. External character encoding issues

   Proper interpretation of a text document requires that the character
   encoding scheme be known.  Current HTTP servers, however, do not
   generally include an appropriate charset parameter with the
   Content-Type header.  This is bad behaviour, which is even encouraged
   by the continued existence of browsers that declare an unrecognized
   media type when they receive a charset parameter.  User agent
   implementers are strongly encouraged to make their software
   tolerant of this parameter, even if they cannot take advantage of it.

   Proper labelling is highly desirable, but some preventive measures
   can be taken to minimize the detrimental effects of its absence:

   In the case where a document is accessed from a hyperlink in an
   origin HTML document, a CHARSET attribute is added to the attribute
   list of elements with link semantics (A and LINK), specifically by
   adding it to the linkExtraAttributes entity.  The value of that
   attribute is to be considered a hint to the User Agent as to the
   character encoding scheme used by the resource pointed to by the
   hyperlink; it should be the appropriate value of the MIME charset
   parameter for that resource.
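
   One way a user agent might combine these two sources of information
   is sketched below in Python.  The function name effective_charset
   and its link_hint parameter are hypothetical, and the precedence
   shown (an explicit charset parameter first, the CHARSET hint from
   the link only as a fallback) is an assumption of this sketch rather
   than a rule quoted from the text above.

```python
from email.message import Message


def effective_charset(content_type, link_hint=None):
    """Pick a charset: the Content-Type parameter first, then the hint."""
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_content_charset()  # lower-cased, or None if absent
    if declared:
        return declared
    # No charset parameter was sent; fall back to the link's hint, if any.
    return link_hint.lower() if link_hint else None


print(effective_charset("text/html; charset=ISO-2022-JP"))
print(effective_charset("text/html", link_hint="ISO-8859-5"))
print(effective_charset("text/html"))
```

   When neither source is available the function returns None, leaving
   the decision to whatever other means the user agent has.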

   In any document, it is possible to include an indication of the
   encoding scheme like the following, as early as possible within the
   HEAD of the document:

    <META HTTP-EQUIV="Content-Type"
     CONTENT="text/html; charset=ISO-2022-JP">

   This is not foolproof, but will work if the encoding scheme is such
   that ASCII-valued octets stand for ASCII characters only at least
   until the META element is parsed.  Note that there are better ways
