RFC 2482         Language Tagging in Unicode Plain Text     January 1999


      Meaning:   The first character of the string matched by this
                 non-terminal must be '?'

   2. A number of predicate functions are employed in semantic
      constraint rules which are not otherwise defined; their names are
      sufficient for determining their predication.

      Example:   IsRFC1766LanguageIdentifier ( tag-argument )

      Meaning:   tag-argument is a valid RFC1766 language identifier

   3. A lexical expander function, TAG, is employed to denote the tag
      form of an ASCII character; the argument to this function is
      either a character or a character set specified by a range or
      enumeration expression.

      Example:   TAG('-')

      Meaning:   TAG HYPHEN-MINUS

      Example:   TAG([A-Z])

      Meaning:   TAG LATIN CAPITAL LETTER A ...
                 TAG LATIN CAPITAL LETTER Z

   4. A macro is employed to denote terminal symbols that are character
      literals which can't be directly represented in ASCII. The
      argument to the macro is the UNICODE (ISO/IEC 10646) character
      name.

      Example:   '${TAG CANCEL}'

      Meaning:   character literal whose code value is U-000E007F

   5. Occurrence indicators used are '+' (one or more) and '*' (zero or
      more); optional occurrence is indicated by enclosure in '[' and
      ']'.
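
   The TAG expander described in item 3 can be sketched in a few lines.
   This is a minimal illustration, not part of the specification: the
   names tag, ascii_range and TAG_BASE are hypothetical, and the sketch
   assumes that each tag character sits at the ASCII code value plus
   E0000 (hex), which is consistent with the code values quoted in this
   document.

```python
TAG_BASE = 0xE0000  # assumption: tag characters mirror ASCII at this offset


def tag(chars):
    """Return the tag form of each printable ASCII character (or SPACE)."""
    out = []
    for ch in chars:
        code = ord(ch)
        if not 0x20 <= code <= 0x7E:
            raise ValueError(f"no tag form for {ch!r}")
        out.append(chr(TAG_BASE + code))
    return "".join(out)


def ascii_range(first, last):
    """Enumerate a range expression such as [A-Z] as a string."""
    return "".join(chr(c) for c in range(ord(first), ord(last) + 1))


print(f"U-{ord(tag('-')):08X}")         # code value of TAG HYPHEN-MINUS
print(len(tag(ascii_range("A", "Z"))))  # number of tag capital letters
```

   Applying tag to a range mirrors TAG([A-Z]) above: the result is the
   sequence TAG LATIN CAPITAL LETTER A through TAG LATIN CAPITAL LETTER
   Z.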

4.6.1 Formal Tag Syntax

tag                     :   language-tag
                        |   cancel-all-tag
                        ;

language-tag            :   language-tag-introducer language-tag-argument
                        ;





Whistler & Adams             Informational                      [Page 8]

language-tag-argument   :   tag-argument
              {{ Assert ( IsRFC1766LanguageIdentifier ( $$ ) ); }}
                        |   tag-cancel
                        ;

cancel-all-tag          :   tag-cancel
                        ;

tag-argument            :   tag-character+
                        ;

tag-character           :   { c : c in
              TAG( { a : a in printable ASCII characters or SPACE } ) }
                        ;

language-tag-introducer :   '${TAG LANGUAGE}'
                        ;

tag-cancel              :   '${TAG CANCEL}'
                        ;
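
   The grammar above is small enough to check by hand. The following
   sketch classifies a candidate string against it. The function name
   classify_tag and its return labels are hypothetical, and the
   IsRFC1766LanguageIdentifier predicate is approximated by a check of
   the RFC 1766 tag shape only (letters in hyphen-separated groups of 1
   to 8); it does not consult any registry of tag values.

```python
import re

TAG_LANGUAGE = "\U000E0001"  # language-tag-introducer
TAG_CANCEL = "\U000E007F"    # tag-cancel
TAG_BASE = 0xE0000

# Shape check only: 1-8 letters, then optional "-"-separated 1-8 letter subtags.
_RFC1766_SHAPE = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*\Z")


def is_tag_character(ch):
    """True for the tag forms of printable ASCII characters and SPACE."""
    return 0xE0020 <= ord(ch) <= 0xE007E


def classify_tag(s):
    """Return 'language-tag', 'language-cancel', 'cancel-all-tag', or None."""
    if s == TAG_CANCEL:
        return "cancel-all-tag"
    if not s.startswith(TAG_LANGUAGE):
        return None
    arg = s[1:]
    if arg == TAG_CANCEL:
        # language-tag whose language-tag-argument is tag-cancel
        return "language-cancel"
    if arg and all(is_tag_character(c) for c in arg):
        ascii_arg = "".join(chr(ord(c) - TAG_BASE) for c in arg)
        if _RFC1766_SHAPE.match(ascii_arg):
            return "language-tag"
    return None


print(classify_tag("\U000E0001\U000E006A\U000E0061"))
```

   A bare introducer with no argument is rejected, since tag-argument
   requires at least one tag-character.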


5.0 Tag Types

5.1 Language Tags

   Language tags are of general interest and should have a high degree
   of interoperability for protocol usage. To this end, a specific
   LANGUAGE TAG tag identification character is provided.  A Plane 14
   tag string prefixed by U-000E0001 LANGUAGE TAG is specified to
   constitute a language tag. Furthermore, the tag values for the
   language tag are to be spelled out as specified in RFC 1766, making
   use only of registered tag values or of user-defined language tags
   starting with the characters "x-".

   For example, to embed a language tag for Japanese, the Plane 14
   characters would be used as follows. The Japanese tag from RFC 1766
   is "ja" (composed of ISO 639 language id) or, alternatively, "ja-JP"
   (composed of ISO 639 language id plus ISO 3166 country id).  Since
   RFC 1766 specifies that language tags are not case significant, it is
   recommended that for language tags, the entire tag be lowercased
   before conversion to Plane 14 tag characters. (This would not be
   required for Unicode conformance, but should be followed as general
   practice by protocols making use of RFC 1766 language tags, to
   simplify and speed up the processing for operations which need to
   identify or ignore language tags embedded in text.) Lowercasing,
   rather than uppercasing, is recommended because it follows the
   majority practice of expressing language tag values in lowercase
   letters.

   Thus the entire language tag (in its longer form) would be converted
   to Plane 14 tag characters as follows:

   U-000E0001 U-000E006A U-000E0061 U-000E002D U-000E006A U-000E0070

   The language tag (in its shorter, "ja" form) could be expressed as
   follows:

   U-000E0001 U-000E006A U-000E0061

   The value of this string is then expressed in whichever encoding form
   (UCS-4, UTF-16, UTF-8) is required and embedded in text at the
   relevant point.
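
   The conversion just described can be sketched as follows. The
   function name encode_language_tag is hypothetical, and Python's
   "utf-32-be" codec is used here as a stand-in for UCS-4; byte order is
   an assumption of the sketch, not something this document specifies.

```python
TAG_LANGUAGE = 0xE0001  # U-000E0001 LANGUAGE TAG
TAG_BASE = 0xE0000      # offset from an ASCII code value to its tag form


def encode_language_tag(tag):
    """Lowercase an ASCII RFC 1766 tag and return its Plane 14 tag string."""
    lowered = tag.lower()
    if not all(0x20 <= ord(c) <= 0x7E for c in lowered):
        raise ValueError("tag must be printable ASCII")
    return chr(TAG_LANGUAGE) + "".join(chr(TAG_BASE + ord(c)) for c in lowered)


tagged = encode_language_tag("ja-JP")
print(" ".join(f"U-{ord(c):08X}" for c in tagged))

# The same string expressed in each encoding form, as hexadecimal bytes.
for form in ("utf-32-be", "utf-16-be", "utf-8"):
    print(form, tagged.encode(form).hex(" "))
```

   The first line printed reproduces the six code values listed above
   for the longer form of the tag.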

5.2 Additional Tags

   Additional tag identification characters might be defined in the
   future. An example would be a CHARACTER SET SOURCE TAG, or a GENERIC
   TAG for private definition of tags.

   In each case, when a specific tag identification character is
   encoded, a corresponding reference standard for the values of the
   tags associated with the identifier should be designated, so that
   interoperating parties which make use of the tags will know how to
   interpret the values the tags may take.

6.0 Display Issues

   All characters in the tag character block are considered to have no
   visible rendering in normal text. A process which interprets tags may
   choose to modify the rendering of text based on the tag values (for
   example, by changing the font to the style preferred for rendering
   Chinese versus Japanese). The tag characters themselves have no
   display; they may be considered similar to U+200B ZERO WIDTH SPACE
   in that regard. The tag characters also do not affect breaking,
   joining, or any other format or layout properties, except insofar as
   the process interpreting the tag chooses to impose such behavior
   based on the tag value.

   For debugging or other operations which must render the tags
   themselves visible, it is advisable that the tag characters be
   rendered using the corresponding ASCII character glyphs (perhaps
   modified systematically to differentiate them from normal ASCII
   characters). But, as noted below, the tag character values are chosen
   so that even without display support, the tag characters will be
   interpretable in most debuggers.
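
   One way a debugging tool might make tags visible is sketched below.
   The function name reveal_tags, the angle-bracket markers, and the
   "LANG:" and "CANCEL" labels are all hypothetical choices; the sketch
   also assumes the tag character block spans U-000E0000 through
   U-000E007F, a range this section does not state explicitly.

```python
def reveal_tags(text, open_mark="<", close_mark=">"):
    """Replace each run of tag characters with a bracketed ASCII rendering."""
    out = []
    in_run = False
    for ch in text:
        code = ord(ch)
        if 0xE0000 <= code <= 0xE007F:
            if not in_run:
                out.append(open_mark)
                in_run = True
            if code == 0xE0001:
                out.append("LANG:")
            elif code == 0xE007F:
                out.append("CANCEL")
            elif 0xE0020 <= code <= 0xE007E:
                # Show the ASCII glyph that corresponds to the tag character.
                out.append(chr(code - 0xE0000))
            else:
                out.append(f"U-{code:08X}")
        else:
            if in_run:
                out.append(close_mark)
                in_run = False
            out.append(ch)
    if in_run:
        out.append(close_mark)
    return "".join(out)


print(reveal_tags("\U000E0001\U000E006A\U000E0061abc"))  # prints "<LANG:ja>abc"
```

   Bracketing each run is one simple way to differentiate the rendered
   tag from the normal ASCII characters around it.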

7.0 Unicode Conformance Issues

   The basic rules for Unicode conformance for the tag characters are
   exactly the same as for any other Unicode characters. A conformant
   process is not required to interpret the tag characters. If it does
   not interpret tag characters, it should leave their values
   undisturbed and do whatever it does with any other uninterpreted
   characters. If it does interpret them, it should interpret them
   according to the standard, i.e. as spelled-out tags.

   So for a non-TagAware Unicode application, any language tag
   characters (or any other kind of tag expressed with Plane 14 tag
   characters) encountered would be handled exactly as for uninterpreted
   Tibetan from the BMP, uninterpreted Linear B from Plane 1, or
   uninterpreted Egyptian hieroglyphics from private use space in Plane
   15.

   A TagAware but TagPhobic Unicode application can recognize the tag
   character range in Plane 14 and choose to deliberately strip them out
   completely to produce plain text with no tags.
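
   A TagPhobic filter of this kind is a one-line substitution. In the
   sketch below, the name strip_tags is hypothetical, and the tag
   character range is assumed to be U-000E0000 through U-000E007F.

```python
import re

# Assumed extent of the tag character block: U-000E0000 through U-000E007F.
_TAG_CHARS = re.compile("[\U000E0000-\U000E007F]+")


def strip_tags(text):
    """Return text with every Plane 14 tag character removed."""
    return _TAG_CHARS.sub("", text)


sample = "\U000E0001\U000E006A\U000E0061abc\U000E0001\U000E007F"
print(strip_tags(sample))  # prints "abc"
```

   All other characters, including any the application does not
   interpret, pass through undisturbed.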

   The presence of a correctly formed tag cannot be taken as a guarantee
   that the data so tagged is correctly tagged. For example, nothing
   prevents an application from erroneously labelling French data as
   Spanish, or from labelling JIS-derived data as Japanese, even if it
   contains Greek or Cyrillic characters.

7.1 Note on Encoding Language Tags

   The fact that this proposal for encoding tag characters in Unicode
   includes a mechanism for specifying language tag values does not mean
   that Unicode is departing from one of its basic encoding principles:

       Unicode encodes scripts, not languages.

   This is still true of the Unicode encoding (and ISO/IEC 10646), even
   in the presence of a mechanism for specifying language tags in plain
   text.  There is nothing obligatory about the use of Plane 14 tags,
   whether for language tags or any other kind of tags.

   Language tagging in no way impacts current encoded characters or the
   encoding of future scripts.

   It is fully anticipated that implementations of Unicode which already
   make use of out-of-band mechanisms for language tagging or "heavy-
   weight" in-band mechanisms such as HTML will continue to do exactly
   what they are doing and will ignore Plane 14 tag characters
   completely.

8.0 Security Considerations

   There are no known security issues raised by this document.

References

   [ISO10646] ISO/IEC 10646-1:1993 International Organization for
              Standardization.  "Information Technology -- Universal
              Multiple-Octet Coded Character Set (UCS) -- Part 1:
              Architecture and Basic Multilingual Plane", Geneva, 1993.

   [RFC1766]  Alvestrand, H., "Tags for the Identification of
              Languages", RFC 1766, March 1995.

   [RFC2070]  Yergeau, F., Nicol, G., Adams, G. and M. Duerst,
              "Internationalization of the Hypertext Markup Language",
              RFC 2070, January 1997.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2130]  Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
              Atkinson, R., Crispin, M. and P. Svanberg, "The Report of
              the IAB Character Set Workshop held 29 February - 1 March,
              1996", RFC 2130, April 1997.

   [UNICODE]  The Unicode Standard, Version 2.0, The Unicode Consortium,
              Addison-Wesley, July 1996.

Acknowledgements

   The following people also contributed to this document, directly or
   indirectly: Chris Newman, Mark Crispin, Rick McGowan, Joe Becker,
   John Jenkins, and Asmus Freytag. This document also was reviewed by
   the Unicode Technical Committee, and the authors wish to thank all of
   the UTC representatives for their input. The authors are, of course,
   responsible for any errors or omissions which may remain in the text.

Authors' Addresses

   Ken Whistler
   Sybase, Inc.
   6475 Christie Ave.
   Emeryville, CA 94608-1050

   Phone: +1 510 922 3611
   EMail: kenw@sybase.com


   Glenn Adams
   Spyglass, Inc.
   One Cambridge Center
   Cambridge, MA 02142

   Phone: +1 617 679 4652
   EMail: glenn@spyglass.com


Full Copyright Statement

   Copyright (C) The Internet Society (1999).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.