RFC 2482         Language Tagging in Unicode Plain Text     January 1999

         Meaning: The first character of the string matched by this
         non-terminal must be '?'

      2. A number of predicate functions are employed in semantic
         constraint rules which are not otherwise defined; their name
         is sufficient for determining their predication.

         Example: IsRFC1766LanguageIdentifier ( tag-argument )

         Meaning: tag-argument is a valid RFC1766 language identifier

      3. A lexical expander function, TAG, is employed to denote the
         tag form of an ASCII character; the argument to this function
         is either a character or a character set specified by a range
         or enumeration expression.

         Example: TAG('-')

         Meaning: TAG HYPHEN-MINUS

         Example: TAG([A-Z])

         Meaning: TAG LATIN CAPITAL LETTER A ...
                  TAG LATIN CAPITAL LETTER Z

      4. A macro is employed to denote terminal symbols that are
         character literals which can't be directly represented in
         ASCII. The argument to the macro is the UNICODE (ISO/IEC
         10646) character name.

         Example: '${TAG CANCEL}'

         Meaning: character literal whose code value is U-000E007F

      5. Occurrence indicators used are '+' (one or more) and '*' (zero
         or more); optional occurrence is indicated by enclosure in '['
         and ']'.

4.6.1 Formal Tag Syntax

   tag                     : language-tag
                           | cancel-all-tag
                           ;

   language-tag            : language-tag-introducer
                             language-tag-argument
                           ;

Whistler & Adams             Informational                      [Page 8]

   language-tag-argument   : tag-argument
                               {{ Assert (
                                    IsRFC1766LanguageIdentifier ( $$ )
                                  ); }}
                           | tag-cancel
                           ;

   cancel-all-tag          : tag-cancel
                           ;

   tag-argument            : tag-character+
                           ;

   tag-character           : { c : c in
                               TAG( { a : a in printable ASCII
                                      characters or SPACE } ) }
                           ;

   language-tag-introducer : '${TAG LANGUAGE}'
                           ;

   tag-cancel              : '${TAG CANCEL}'
                           ;

5.0 Tag Types

5.1 Language Tags

   Language tags are of general interest and should have a high degree
   of interoperability for protocol usage. To this end, a specific
   LANGUAGE TAG tag identification character is provided.
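
   The TAG expander and the productions of section 4.6.1 can be
   sketched in Python as follows. This is an illustration only, not
   part of the specification. The names TAG_LANGUAGE, is_tag_character
   and classify_tag are hypothetical, the mapping assumes that the tag
   form of an ASCII character is its code value plus 0xE0000 (which
   agrees with the code values quoted in this document), and the
   IsRFC1766LanguageIdentifier assertion is deliberately left out.

```python
TAG_LANGUAGE = "\U000E0001"  # '${TAG LANGUAGE}', the language-tag-introducer
TAG_CANCEL = "\U000E007F"    # '${TAG CANCEL}'


def TAG(chars: str) -> str:
    """Return the tag form of each printable ASCII character or SPACE."""
    for ch in chars:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"no tag form for {ch!r}")
    # Assumption: tag form = ASCII code value + 0xE0000.
    return "".join(chr(0xE0000 + ord(ch)) for ch in chars)


def is_tag_character(ch: str) -> bool:
    """tag-character: tag form of a printable ASCII character or SPACE."""
    return 0xE0020 <= ord(ch) <= 0xE007E


def classify_tag(s: str):
    """Return the name of the production that matches s, or None.

    The RFC 1766 validity check on the tag argument is not performed.
    """
    if s == TAG_CANCEL:
        return "cancel-all-tag"
    if s.startswith(TAG_LANGUAGE):
        argument = s[len(TAG_LANGUAGE):]
        if argument == TAG_CANCEL:
            return "language-tag (cancel)"
        # tag-argument : tag-character+
        if argument and all(is_tag_character(c) for c in argument):
            return "language-tag"
    return None
```

   A caller would pass one complete candidate tag string, for example
   classify_tag(TAG_LANGUAGE + TAG("ja")). Locating tag boundaries
   inside running text is outside the scope of this sketch.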

   A Plane 14 tag string prefixed by U-000E0001 LANGUAGE TAG is
   specified to constitute a language tag. Furthermore, the tag values
   for the language tag are to be spelled out as specified in RFC 1766,
   making use only of registered tag values or of user-defined language
   tags starting with the characters "x-".

   For example, to embed a language tag for Japanese, the Plane 14
   characters would be used as follows. The Japanese tag from RFC 1766
   is "ja" (composed of ISO 639 language id) or, alternatively, "ja-JP"
   (composed of ISO 639 language id plus ISO 3166 country id). Since
   RFC 1766 specifies that language tags are not case significant, it
   is recommended that for language tags, the entire tag be lowercased
   before conversion to Plane 14 tag characters. (This would not be
   required for Unicode conformance, but should be followed as general
   practice by protocols making use of RFC 1766 language tags, to
   simplify and speed up the processing for operations which need to
   identify or ignore language tags embedded in text.) Lowercasing,
   rather than uppercasing, is recommended because it follows the
   majority practice of expressing language tag values in lowercase
   letters.

   Thus the entire language tag (in its longer form) would be converted
   to Plane 14 tag characters as follows:

      U-000E0001 U-000E006A U-000E0061 U-000E002D U-000E006A U-000E0070

   The language tag (in its shorter, "ja" form) could be expressed as
   follows:

      U-000E0001 U-000E006A U-000E0061

   The value of this string is then expressed in whichever encoding
   form (UCS-4, UTF-16, UTF-8) is required and embedded in text at the
   relevant point.

5.2 Additional Tags

   Additional tag identification characters might be defined in the
   future. An example would be a CHARACTER SET SOURCE TAG, or a GENERIC
   TAG for private definition of tags.
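
   The Japanese example of section 5.1 can be reproduced with a short
   Python sketch. The function names encode_language_tag and
   code_points are hypothetical, and the sketch assumes that the
   Plane 14 tag character for an ASCII character is its code value
   plus 0xE0000, as the code values listed in section 5.1 suggest. It
   does not check that the tag is a registered RFC 1766 value.

```python
LANGUAGE_TAG = 0xE0001  # U-000E0001 LANGUAGE TAG
TAG_BASE = 0xE0000      # assumed offset between ASCII and tag characters


def encode_language_tag(tag: str) -> str:
    """Lowercase an RFC 1766 tag and spell it out in Plane 14 characters."""
    if not tag or any(not 0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tag must be non-empty printable ASCII")
    lowered = tag.lower()  # lowercasing is recommended, not required
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_BASE + ord(c)) for c in lowered)


def code_points(s: str) -> str:
    """Format a string in the U-XXXXXXXX notation used in this document."""
    return " ".join(f"U-{ord(c):08X}" for c in s)


tagged = encode_language_tag("ja-JP")
print(code_points(tagged))

# The same string in the three encoding forms named in section 5.1.
utf8 = tagged.encode("utf-8")
utf16 = tagged.encode("utf-16-be")
ucs4 = tagged.encode("utf-32-be")
```

   Each tag character occupies four octets in UTF-8, a surrogate pair
   in UTF-16, and a single four-octet unit in UCS-4.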

   In each case, when a specific tag identification character is
   encoded, a corresponding reference standard for the values of the
   tags associated with the identifier should be designated, so that
   interoperating parties which make use of the tags will know how to
   interpret the values the tags may take.

6.0 Display Issues

   All characters in the tag character block are considered to have no
   visible rendering in normal text. A process which interprets tags
   may choose to modify the rendering of text based on the tag values
   (as for example, changing font to preferred style for rendering
   Chinese versus Japanese). The tag characters themselves have no
   display; they may be considered similar to a U+200B ZERO WIDTH SPACE
   in that regard. The tag characters also do not affect breaking,
   joining, or any other format or layout properties, except insofar as
   the process interpreting the tag chooses to impose such behavior
   based on the tag value.

   For debugging or other operations which must render the tags
   themselves visible, it is advisable that the tag characters be
   rendered using the corresponding ASCII character glyphs (perhaps
   modified systematically to differentiate them from normal ASCII
   characters). But, as noted below, the tag character values are
   chosen so that even without display support, the tag characters will
   be interpretable in most debuggers.

7.0 Unicode Conformance Issues

   The basic rules for Unicode conformance for the tag characters are
   exactly the same as for any other Unicode characters. A conformant
   process is not required to interpret the tag characters. If it does
   not interpret tag characters, it should leave their values
   undisturbed and do whatever it does with any other uninterpreted
   characters. If it does interpret them, it should interpret them
   according to the standard, i.e. as spelled-out tags.
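
   The debugging display suggested in section 6.0 might be sketched as
   follows. The function name show_tags and the bracket notation are
   arbitrary choices made for this illustration; the document only
   advises that the ASCII glyphs be used and be distinguishable from
   normal ASCII text. The sketch assumes tag characters occupy
   U-000E0020 through U-000E007E.

```python
LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F


def show_tags(text: str) -> str:
    """Replace invisible Plane 14 tag characters with bracketed ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            out.append("[LANG]")
        elif cp == CANCEL_TAG:
            out.append("[CANCEL]")
        elif 0xE0020 <= cp <= 0xE007E:
            # The low seven bits of a tag character are the code value
            # of the corresponding ASCII character.
            out.append("[" + chr(cp & 0x7F) + "]")
        else:
            out.append(ch)  # everything else is left undisturbed
    return "".join(out)
```

   Because only the low seven bits are kept, the same correspondence
   is visible in a raw hexadecimal dump of the code values.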

   So for a non-TagAware Unicode application, any language tag
   characters (or any other kind of tag expressed with Plane 14 tag
   characters) encountered would be handled exactly as for
   uninterpreted Tibetan from the BMP, uninterpreted Linear B from
   Plane 1, or uninterpreted Egyptian hieroglyphics from private use
   space in Plane 15.

   A TagAware but TagPhobic Unicode application can recognize the tag
   character range in Plane 14 and choose to deliberately strip them
   out completely to produce plain text with no tags.

   The presence of a correctly formed tag cannot be taken as a
   guarantee that the data so tagged is correctly tagged. For example,
   nothing prevents an application from erroneously labelling French
   data as Spanish, or from labelling JIS-derived data as Japanese,
   even if it contains Greek or Cyrillic characters.

7.1 Note on Encoding Language Tags

   The fact that this proposal for encoding tag characters in Unicode
   includes a mechanism for specifying language tag values does not
   mean that Unicode is departing from one of its basic encoding
   principles:

      Unicode encodes scripts, not languages.

   This is still true of the Unicode encoding (and ISO/IEC 10646), even
   in the presence of a mechanism for specifying language tags in plain
   text. There is nothing obligatory about the use of Plane 14 tags,
   whether for language tags or any other kind of tags.
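
   The behavior of a TagAware but TagPhobic application, as described
   in section 7.0, might be sketched as follows. The name
   strip_plane14_tags is hypothetical. The sketch removes only the
   characters named in this document (LANGUAGE TAG, the tag characters
   and TAG CANCEL); whether other positions in the tag block should
   also be removed is left open here.

```python
import re

# U-000E0001 LANGUAGE TAG, plus U-000E0020..U-000E007F
# (the tag characters and TAG CANCEL).
_TAG_CHARS = re.compile("[\U000E0001\U000E0020-\U000E007F]")


def strip_plane14_tags(text: str) -> str:
    """Return text with all Plane 14 tag characters removed."""
    return _TAG_CHARS.sub("", text)


tagged = "\U000E0001\U000E006A\U000E0061" + "konnichiwa"
plain = strip_plane14_tags(tagged)
```

   A non-TagAware application, by contrast, would pass the tagged
   string through unchanged.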

   Language tagging in no way impacts current encoded characters or the
   encoding of future scripts.

   It is fully anticipated that implementations of Unicode which
   already make use of out-of-band mechanisms for language tagging or
   "heavy-weight" in-band mechanisms such as HTML will continue to do
   exactly what they are doing and will ignore Plane 14 tag characters
   completely.

8.0 Security Considerations

   There are no known security issues raised by this document.

References

   [ISO10646] ISO/IEC 10646-1:1993 International Organization for
              Standardization. "Information Technology -- Universal
              Multiple-Octet Coded Character Set (UCS) -- Part 1:
              Architecture and Basic Multilingual Plane", Geneva, 1993.

   [RFC1766]  Alvestrand, H., "Tags for the Identification of
              Languages", RFC 1766, March 1995.

   [RFC2070]  Yergeau, F., Nicol, G., Adams, G. and M. Duerst,
              "Internationalization of the Hypertext Markup Language",
              RFC 2070, January 1997.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
              Atkinson, R., Crispin, M. and P. Svanberg, "The Report of
              the IAB Character Set Workshop held 29 February - 1 March,
              1996", RFC 2130, April 1997.

   [UNICODE]  The Unicode Standard, Version 2.0, The Unicode
              Consortium, Addison-Wesley, July 1996.

Acknowledgements

   The following people also contributed to this document, directly or
   indirectly: Chris Newman, Mark Crispin, Rick McGowan, Joe Becker,
   John Jenkins, and Asmus Freytag. This document also was reviewed by
   the Unicode Technical Committee, and the authors wish to thank all
   of the UTC representatives for their input.

   The authors are, of course, responsible for any errors or omissions
   which may remain in the text.

Authors' Addresses

   Ken Whistler
   Sybase, Inc.
   6475 Christie Ave.
   Emeryville, CA 94608-1050

   Phone: +1 510 922 3611
   EMail: kenw@sybase.com

   Glenn Adams
   Spyglass, Inc.
   One Cambridge Center
   Cambridge, MA 02142

   Phone: +1 617 679 4652
   EMail: glenn@spyglass.com

Full Copyright Statement

   Copyright (C) The Internet Society (1999). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.