Yergeau, et. al.            Standards Track                    [Page 12]

RFC 2070               HTML Internationalization            January 1997


   In the DTD, the LANG and DIR attributes are grouped together in a
   parameter entity called attrs. To parallel RFC 1942 [RFC1942], the ID
   and CLASS attributes are also included in attrs. The ID and CLASS
   attributes are required for use with style sheets, and RFC 1942
   defines them as follows:

   ID      Used to define a document-wide identifier. This can be used
           for naming positions within documents as the destination of
           a hypertext link. It may also be used by style sheets for
           rendering an element in a unique style. An ID attribute
           value is an SGML NAME token. NAME tokens are formed by an
           initial letter followed by letters, digits, "-" and "."
           characters. The letters are restricted to A-Z and a-z.

   CLASS   A space separated list of SGML NAME tokens. CLASS names
           specify that the element belongs to the corresponding named
           classes. It allows authors to distinguish different roles
           played by the same tag. The classes may be used by style
           sheets to provide different renderings as appropriate to
           these roles.

4.2.3. Cursive joining behaviour

   Markup is needed in some cases to force cursive joining behavior in
   contexts in which it would not normally occur, or to block it when it
   would normally occur.

   The zero-width joiner and non-joiner (&zwj; and &zwnj;) are used to
   control cursive joining behaviour. For example, ARABIC LETTER HEH is
   used in isolation to abbreviate "Hijri" (the Islamic calendrical
   system); however, the initial form of the letter is desired, because
   the isolated form of HEH looks like the digit five as employed in
   Arabic script. This is obtained by following the HEH with a
   zero-width joiner whose only effect is to provide context. In Persian
   texts, there are cases where a letter that normally would join a
   subsequent letter in a cursive connection does not. Here a
   zero-width non-joiner is used.

4.2.4. Bidirectional text

   Many languages are written in horizontal lines from left to right,
   while others are written from right to left. When both writing
   directions are present, one talks of bidirectional text (BIDI for
   short).

   BIDI text requires markup in special circumstances where ambiguities
   as to the directionality of some characters have to be resolved. This
   markup affects the ability to render BIDI text in a semantically
   legible fashion. That is, without this special BIDI markup, cases
   arise which would prevent *any* rendering whatsoever that reflected
   the basic meaning of the text. Plain text may contain BIDI markup in
   the form of special-purpose formatting characters. This is also
   possible in HTML, which includes the five BIDI-related formatting
   characters (202A - 202E) of ISO 10646. As an alternative, HTML
   provides equivalent SGML markup.

   BIDI is a complex issue, and conversion of logical text sequences to
   display sequences has to be done according to the algorithm and
   character properties specified in [UNICODE]. Here, explanations are
   given only as far as they are needed to understand the necessity of
   the features introduced and to define their exact semantics.

   The Unicode BIDI algorithm is based on the individual characters of a
   text being stored in logical order, that is the order in which they
   are normally input and in which the corresponding sounds are normally
   spoken. To make rendering of logical order text possible, the
   algorithm assigns a directionality property to each character, e.g.
   Latin letters are specified to have a left-to-right direction, Arabic
   and Hebrew characters have a right-to-left direction.

   The left-to-right and right-to-left marks (&lrm; and &rlm;) are used
   to disambiguate directionality of neutral characters.
   For example, when a double quote sits between an Arabic and a Latin
   letter, its direction is ambiguous; if a directional mark is added on
   one side such that the quotation mark is surrounded by characters of
   only one directionality, the ambiguity is removed. These characters
   are like zero width spaces which have a directional property (but no
   word/line break property).

   Nested embeddings of contra-directional text runs, due to nested
   quotations or to the pasting of text from one BIDI context to
   another, are also a case where the implicit directionality of
   characters is not sufficient, requiring markup. Also, it is
   frequently desirable to specify the basic directionality of a block
   of text. For these purposes, the DIR attribute is used.

   On block-type elements, the DIR attribute indicates the base
   directionality of the text in the block; if omitted it is inherited
   from the parent element. The default directionality of the overall
   HTML document is left-to-right.

   On inline elements, it makes the element start a new embedding level
   (to be explained below); if omitted the inline element does not start
   a new embedding level.

      NOTE -- the PRE, XMP and LISTING elements admit the DIR attribute.
      Their contents should not be considered as preformatted with
      respect to bidirectional layout, but the BIDI algorithm should be
      applied to each line of text.
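
   The per-character directionality property and the effect of a
   directional mark can be inspected directly. The following is a
   minimal Python sketch, not part of this specification, that uses the
   standard unicodedata module to list the bidirectional class of each
   character. The helper name bidi_classes and the sample strings are
   illustrative assumptions. The sketch only reports character
   properties; it does not implement the BIDI algorithm itself.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK (&rlm;)


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# ARABIC LETTER BEH, a double quote, then a Latin letter.
# The quote is a neutral character sitting between a right-to-left
# letter and a left-to-right letter, so its direction is ambiguous.
ambiguous = "\u0628" + '"' + "A"

# The same text with an RLM placed after the quote: the quote is now
# surrounded by right-to-left characters on both sides.
resolved = "\u0628" + '"' + RLM + "A"

print(bidi_classes(ambiguous))
print(bidi_classes(resolved))
```

   In the class names reported by unicodedata, "L" is left-to-right,
   "R" and "AL" are right-to-left (the latter for Arabic letters), and
   "ON" marks a neutral character such as the double quote.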
   Following is an example of a case where embedding is needed, showing
   its effect:

      Given the following latin (upper case) and arabic (lower case)
      letters in backing store with the specified embeddings:

      <SPAN DIR=LTR> AB <SPAN DIR=RTL> xy <SPAN DIR=LTR> CD </SPAN> zw
      </SPAN> EF </SPAN>

      One gets the following rendering (with [] showing the directional
      transitions):

      [ AB [ wz [ CD ] yx ] EF ]

      On the other hand, without this markup and with a base direction
      of LTR one gets the following rendering:

      [ AB [ yx ] CD [ wz ] EF ]

      Notice that yx is on the left and wz on the right unlike the above
      case where the embedding levels are used. Without the embedding
      markup one has at most two levels: a base directional level and a
      single counterflow directional level.

   The DIR attribute on inline elements is equivalent to the formatting
   characters LEFT-TO-RIGHT EMBEDDING (202A) and RIGHT-TO-LEFT EMBEDDING
   (202B) of ISO 10646. The end tag of the element is equivalent to the
   POP DIRECTIONAL FORMATTING (202C) character.

   Directional override, as provided by the BDO element, is needed to
   deal with unusual short pieces of text in which directionality cannot
   be resolved from context in an unambiguous fashion. For example, it
   can be used to force left-to-right (or right-to-left) display of part
   numbers composed of Latin letters, digits and Hebrew letters.

   The effect of BDO is to force the directionality of all characters
   within it to the value of DIR, irrespective of their intrinsic
   directional properties. It is equivalent to using the LEFT-TO-RIGHT
   OVERRIDE (202D) or RIGHT-TO-LEFT OVERRIDE (202E) characters of ISO
   10646, the end tag again being equivalent to the POP DIRECTIONAL
   FORMATTING (202C) character.
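
   The equivalence between the markup and the ISO 10646 formatting
   characters can be illustrated with a short Python sketch. It is not
   part of this specification: the helper name embed and its override
   parameter are hypothetical, and the code only wraps a string in the
   characters named above (202A - 202E) without rendering anything.

```python
# Formatting characters, using the code points given in the text above.
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE


def embed(text, direction, override=False):
    """Wrap text in the characters equivalent to DIR on an inline element.

    override=False mirrors an inline element carrying DIR (for example
    SPAN); override=True mirrors the BDO element. The closing PDF plays
    the role of the end tag.
    """
    direction = direction.upper()
    if direction not in ("LTR", "RTL"):
        raise ValueError("direction must be LTR or RTL")
    if override:
        opener = LRO if direction == "LTR" else RLO
    else:
        opener = LRE if direction == "LTR" else RLE
    return opener + text + PDF


# The nested SPAN example above, expressed with formatting characters.
inner = embed(" CD ", "LTR")
middle = embed(" xy " + inner + " zw ", "RTL")
nested = embed(" AB " + middle + " EF ", "LTR")

print([hex(ord(ch)) for ch in nested if ch in (LRE, RLE, PDF)])
```

   Each call contributes exactly one opening character and one PDF, so
   the output stays properly nested in the same way that well-formed
   start and end tags do.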
      NOTE -- authors and authoring software writers should be aware
      that conflicts can arise if the DIR attribute is used on inline
      elements (including BDO) concurrently with the use of the
      corresponding ISO 10646 formatting characters. Preferably one or
      the other should be used exclusively; the markup method is better
      able to guarantee document structural integrity, and alleviates
      some problems when editing bidirectional HTML text with a simple
      text editor, but some software may be more apt at using the 10646
      characters. If both methods are used, great care should be
      exercised to ensure proper nesting of markup and directional
      embedding or override; otherwise, rendering results are undefined.

5. Forms

5.1. DTD additions

   It is natural to expect input in any language in forms, as they
   provide one of the only ways of obtaining user input. While this is
   primarily a UI issue, there are some things that should be specified
   at the HTML level to guide behavior and promote interoperability.

   To ensure full interoperability, it is necessary for the user agent
   (and the user) to have an indication of the character encoding(s)
   that the server providing a form will be able to handle upon
   submission of the filled-in form. Such an indication is provided by
   the ACCEPT-CHARSET attribute of the INPUT and TEXTAREA elements,
   modeled on the HTTP Accept-Charset header (see [HTTP-1.1]), which
   contains a space and/or comma delimited list of character sets
   acceptable to the server. A user agent may want to somehow advise the
   user of the contents of this attribute, or to restrict the user's
   ability to enter characters outside the repertoires of the listed
   character sets.

      NOTE -- The list of character sets is to be interpreted as an
      EXCLUSIVE-OR list; the server announces that it is ready to accept
      any ONE of these character encoding schemes for each part of a
      multipart entity.
      The client may perform character encoding translation to satisfy
      the server if necessary.

      NOTE -- The default value for the ACCEPT-CHARSET attribute of an
      INPUT or TEXTAREA element is the reserved value "UNKNOWN". A user
      agent may interpret that value as the character encoding scheme
      that was used to transmit the document containing that element.

5.2. Form submission

   The HTML 2.0 form submission mechanism, based on the
   "application/x-www-form-urlencoded" media type, is ill-equipped with
   regard to internationalization. In fact, since URLs are restricted to
   ASCII characters, the mechanism is awkward even for ISO-8859-1 text.
   Section 2.2 of [RFC1738] specifies that octets may be encoded using
   the "%HH" notation, but text submitted from a form is composed of
   characters, not octets. Lacking a specification of a character
   encoding scheme, the "%HH" notation has no well-defined meaning.

   The best solution is to use the "multipart/form-data" media type
   described in [RFC1867] with the POST method of form submission. This
   mechanism encapsulates the value part of each name-value pair in a
   body-part of a multipart MIME body that is sent as the HTTP entity;
   each body part can be labeled with an appropriate Content-Type,
   including if necessary a charset parameter that specifies the
   character encoding scheme. The changes to the DTD necessary to
   support this method of form submission have been incorporated in the
   DTD included in this specification.

   A less satisfactory solution is to add a MIME charset parameter to
   the "application/x-www-form-urlencoded" media type specifier sent
   along with a POST method form submission, with the understanding that
   the URL encoding of [RFC1738] is applied on top of the specified
   character encoding, as a kind of implicit Content-Transfer-Encoding.
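
   The ambiguity of the "%HH" notation can be made concrete with a small
   Python sketch using the standard urllib.parse module. It is an
   illustration only, not part of this specification; the field name
   and value are hypothetical. The same form value is URL-encoded under
   two character encoding schemes and then decoded by a receiver that
   assumes the wrong one.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form field whose value contains one non-ASCII character
# (LATIN SMALL LETTER E WITH ACUTE).
fields = {"name": "caf\u00e9"}

# The "%HH" octets depend entirely on the character encoding scheme
# that is applied before URL encoding.
as_latin1 = urlencode(fields, encoding="iso-8859-1")
as_utf8 = urlencode(fields, encoding="utf-8")
print(as_latin1)
print(as_utf8)

# A receiver that is not told the encoding has to guess. Guessing
# ISO-8859-1 for data that was sent as UTF-8 yields the wrong characters.
misread = parse_qs(as_utf8, encoding="iso-8859-1")
correct = parse_qs(as_utf8, encoding="utf-8")
print(misread["name"][0] == fields["name"])
print(correct["name"][0] == fields["name"])
```

   The single character is sent as %E9 under ISO-8859-1 and as %C3%A9
   under UTF-8, which is why a charset label (or the multipart/form-data
   mechanism) is needed for the receiver to recover the original text.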

   One problem with both solutions above is that current browsers do not
   generally allow for bookmarks to specify the POST method; this should
   be improved. Conversely, the GET method could be used with the form
   data transmitted in the body instead of in the URL. Nothing in the
   protocol seems to prevent it, but no implementations appear to exist
   at present.

   How the user agent determines the encoding of the text entered by the
   user is outside the scope of this specification.

      NOTE -- Designers of forms and their handling scripts should be
      aware of an important caveat: when the default value of a field
      (the VALUE attribute) is returned upon form submission (i.e. the
      user did not modify this value), it cannot be guaranteed to be
      transmitted as a sequence of octets identical to that in the
      source document -- only as a possibly different but valid encoding
      of the same sequence of text elements. This may be true even if
      the encoding of the document containing the form and that used for
      submission are the same.

      Differences can occur when a sequence of characters can be
      represented by various sequences of octets, and also when a
      composite sequence (a base character plus one or more combining
      diacritics) can be represented by either a different but
      equivalent composite sequence or by a fully precomposed character.
      For instance, the UCS-2 sequence 00EA+0323 (LATIN SMALL LETTER E
      WITH CIRCUMFLEX ACCENT + COMBINING DOT BELOW) may be transformed
      into 1EC7 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT AND DOT
      BELOW), into 0065+0302+0323 (LATIN SMALL LETTER E + COMBINING
      CIRCUMFLEX ACCENT + COMBINING DOT BELOW), as well as into other
      equivalent composite sequences.

6. External character encoding issues

   Proper interpretation of a text document requires that the character
   encoding scheme be known.
   Current HTTP servers, however, do not generally include an
   appropriate charset parameter with the Content-Type header. This is
   bad behaviour, which is even encouraged by the continued existence of
   browsers that declare an unrecognized media type when they receive a
   charset parameter. User agent implementers are strongly encouraged to
   make their software tolerant of this parameter, even if they cannot
   take advantage of it.

   Proper labelling is highly desirable, but some preventive measures
   can be taken to minimize the detrimental effects of its absence:

      In the case where a document is accessed from a hyperlink in an
      origin HTML document, a CHARSET attribute is added to the
      attribute list of elements with link semantics (A and LINK),
      specifically by adding it to the linkExtraAttributes entity. The
      value of that attribute is to be considered a hint to the User
      Agent as to the character encoding scheme used by the resource
      pointed to by the hyperlink; it should be the appropriate value of
      the MIME charset parameter for that resource.

      In any document, it is possible to include an indication of the
      encoding scheme like the following, as early as possible within
      the HEAD of the document:

      <META HTTP-EQUIV="Content-Type"
       CONTENT="text/html; charset=ISO-2022-JP">

      This is not foolproof, but will work if the encoding scheme is
      such that ASCII-valued octets stand for ASCII characters only at
      least until the META element is parsed. Note that there are better
      ways