RFC 2482 Language Tagging in Unicode Plain Text January 1999
Meaning: The first character of the string matched by this
non-terminal must be '?'
2. A number of predicate functions are employed in semantic
constraint rules which are not otherwise defined; their names are
sufficient for determining their predication.
Example: IsRFC1766LanguageIdentifier ( tag-argument )
Meaning: tag-argument is a valid RFC1766 language identifier
3. A lexical expander function, TAG, is employed to denote the tag
form of an ASCII character; the argument to this function is
either a character or a character set specified by a range or
enumeration expression.
Example: TAG('-')
Meaning: TAG HYPHEN-MINUS
Example: TAG([A-Z])
Meaning: TAG LATIN CAPITAL LETTER A ...
TAG LATIN CAPITAL LETTER Z
4. A macro is employed to denote terminal symbols that are character
literals which can't be directly represented in ASCII. The
argument to the macro is the UNICODE (ISO/IEC 10646) character
name.
Example: '${TAG CANCEL}'
Meaning: character literal whose code value is U-000E007F
5. Occurrence indicators used are '+' (one or more) and '*' (zero or
more); optional occurrence is indicated by enclosure in '[' and
']'.
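As an informal illustration of the TAG expander function described in
item 3, the following Python sketch maps an ASCII character to its tag
form by adding the fixed offset 0xE0000. That offset is inferred from
the code values given in this document (for example, TAG CANCEL at
U-000E007F). The names tag, tag_range and TAG_BASE are illustrative
assumptions and are not defined by this specification.

```python
TAG_BASE = 0xE0000  # assumed offset between an ASCII character and its tag form


def tag(ch):
    """Return the Plane 14 tag character for one printable ASCII character or SPACE."""
    if len(ch) != 1 or not (0x20 <= ord(ch) <= 0x7E):
        raise ValueError("expected one printable ASCII character or SPACE")
    return chr(TAG_BASE + ord(ch))


def tag_range(first, last):
    """Expand a range expression such as [A-Z] into a list of tag characters."""
    return [tag(chr(cp)) for cp in range(ord(first), ord(last) + 1)]


if __name__ == "__main__":
    # TAG('-') and the two ends of TAG([A-Z]), printed as code values.
    print("U-%08X" % ord(tag("-")))
    letters = tag_range("A", "Z")
    print("U-%08X .. U-%08X" % (ord(letters[0]), ord(letters[-1])))
```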
4.6.1 Formal Tag Syntax
tag : language-tag
| cancel-all-tag
;
language-tag : language-tag-introducer language-tag-argument
;
Whistler & Adams Informational [Page 8]
language-tag-argument : tag-argument
{{ Assert ( IsRFC1766LanguageIdentifier ( $$ ) ); }}
| tag-cancel
;
cancel-all-tag : tag-cancel
;
tag-argument : tag-character+
;
tag-character : { c : c in
TAG( { a : a in printable ASCII characters or SPACE } ) }
;
language-tag-introducer : '${TAG LANGUAGE}'
;
tag-cancel : '${TAG CANCEL}'
;
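The grammar above can be exercised with a small recognizer. The Python
sketch below is one possible reading of the productions, not a
normative implementation. The function names and the three result
labels are assumptions made for illustration, and the semantic
constraint IsRFC1766LanguageIdentifier is deliberately not checked.

```python
LANGUAGE_TAG = "\U000E0001"  # language-tag-introducer: TAG LANGUAGE
CANCEL_TAG = "\U000E007F"    # tag-cancel: TAG CANCEL


def is_tag_character(ch):
    """True for the tag forms of printable ASCII characters or SPACE."""
    return 0xE0020 <= ord(ch) <= 0xE007E


def parse_tag(text, pos=0):
    """Try to match one `tag` starting at text[pos].

    Returns (kind, value, end) or None if no well-formed tag starts there.
    kind is 'language', 'cancel-language' or 'cancel-all' (labels are
    hypothetical); end is the index just past the matched tag.
    """
    if pos >= len(text):
        return None
    if text[pos] == CANCEL_TAG:
        # cancel-all-tag : tag-cancel
        return ("cancel-all", None, pos + 1)
    if text[pos] != LANGUAGE_TAG:
        return None
    end = pos + 1
    if end < len(text) and text[end] == CANCEL_TAG:
        # language-tag-argument : tag-cancel
        return ("cancel-language", None, end + 1)
    while end < len(text) and is_tag_character(text[end]):
        end += 1
    if end == pos + 1:
        return None  # tag-argument needs at least one tag-character
    # Map the tag characters back to ASCII to recover the spelled-out value.
    value = "".join(chr(ord(c) - 0xE0000) for c in text[pos + 1:end])
    return ("language", value, end)
```

A caller that needs the RFC 1766 check would apply it to the returned
value before accepting the tag.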
5.0 Tag Types
5.1 Language Tags
Language tags are of general interest and should have a high degree
of interoperability for protocol usage. To this end, a specific
LANGUAGE TAG tag identification character is provided. A Plane 14
tag string prefixed by U-000E0001 LANGUAGE TAG is specified to
constitute a language tag. Furthermore, the tag values for the
language tag are to be spelled out as specified in RFC 1766, making
use only of registered tag values or of user-defined language tags
starting with the characters "x-".
For example, to embed a language tag for Japanese, the Plane 14
characters would be used as follows. The Japanese tag from RFC 1766
is "ja" (composed of ISO 639 language id) or, alternatively, "ja-JP"
(composed of ISO 639 language id plus ISO 3166 country id). Since
RFC 1766 specifies that language tags are not case significant, it is
recommended that for language tags, the entire tag be lowercased
before conversion to Plane 14 tag characters. (This would not be
required for Unicode conformance, but should be followed as general
practice by protocols making use of RFC 1766 language tags, to
simplify and speed up the processing for operations which need to
identify or ignore language tags embedded in text.) Lowercasing,
rather than uppercasing, is recommended because it follows the
majority practice of expressing language tag values in lowercase
letters.
Thus the entire language tag (in its longer form) would be converted
to Plane 14 tag characters as follows:
U-000E0001 U-000E006A U-000E0061 U-000E002D U-000E006A U-000E0070
The language tag (in its shorter, "ja" form) could be expressed as
follows:
U-000E0001 U-000E006A U-000E0061
The value of this string is then expressed in whichever encoding form
(UCS-4, UTF-16, UTF-8) is required and embedded in text at the
relevant point.
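The conversion just described can be sketched in Python as follows. The
function name encode_language_tag is an assumption made for
illustration, and Python's "utf-32-be" codec is used here as a stand-in
for UCS-4. The sketch lowercases the RFC 1766 tag, prefixes
U-000E0001, and maps each ASCII character to its tag form.

```python
def encode_language_tag(language):
    """Build a Plane 14 language tag string from an RFC 1766 tag value."""
    lowered = language.lower()  # lowercasing is recommended, not required
    if not lowered or any(not (0x20 <= ord(c) <= 0x7E) for c in lowered):
        raise ValueError("tag value must be non-empty printable ASCII")
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lowered)


if __name__ == "__main__":
    tag_string = encode_language_tag("ja-JP")
    # Code values in the notation used by this document.
    print(" ".join("U-%08X" % ord(c) for c in tag_string))
    # The same string in three encoding forms, shown as hexadecimal octets.
    for codec in ("utf-32-be", "utf-16-be", "utf-8"):
        print(codec, tag_string.encode(codec).hex())
```

Because every tag character lies outside the BMP, each one occupies
four octets in UTF-8 and a surrogate pair in UTF-16.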
5.2 Additional Tags
Additional tag identification characters might be defined in the
future. An example would be a CHARACTER SET SOURCE TAG, or a GENERIC
TAG for private definition of tags.
In each case, when a specific tag identification character is
encoded, a corresponding reference standard for the values of the
tags associated with the identifier should be designated, so that
interoperating parties which make use of the tags will know how to
interpret the values the tags may take.
6.0 Display Issues
All characters in the tag character block are considered to have no
visible rendering in normal text. A process which interprets tags may
choose to modify the rendering of text based on the tag values (for
example, by changing the font to the preferred style for rendering
Chinese versus Japanese). The tag characters themselves have no display; they
may be considered similar to a U+200B ZERO WIDTH SPACE in that
regard. The tag characters also do not affect breaking, joining, or
any other format or layout properties, except insofar as the process
interpreting the tag chooses to impose such behavior based on the tag
value.
For debugging or other operations which must render the tags
themselves visible, it is advisable that the tag characters be
rendered using the corresponding ASCII character glyphs (perhaps
modified systematically to differentiate them from normal ASCII
characters). But, as noted below, the tag character values are chosen
so that even without display support, the tag characters will be
interpretable in most debuggers.
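A debugging aid along these lines might look like the following Python
sketch. The bracket notation and the "<LANG>" and "<CANCEL>" markers
are arbitrary choices made for illustration. They are one way of
systematically differentiating tag characters from normal ASCII text,
and nothing in this document prescribes them.

```python
def show_tags(text):
    """Return text with each tag character replaced by a visible ASCII stand-in."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0xE0001:
            out.append("<LANG>")      # TAG LANGUAGE
        elif cp == 0xE007F:
            out.append("<CANCEL>")    # TAG CANCEL
        elif 0xE0020 <= cp <= 0xE007E:
            # Show the corresponding ASCII glyph, bracketed to mark it as a tag.
            out.append("[" + chr(cp - 0xE0000) + "]")
        else:
            out.append(ch)            # ordinary text passes through unchanged
    return "".join(out)
```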
7.0 Unicode Conformance Issues
The basic rules for Unicode conformance for the tag characters are
exactly the same as for any other Unicode characters. A conformant
process is not required to interpret the tag characters. If it does
not interpret tag characters, it should leave their values
undisturbed and do whatever it does with any other uninterpreted
characters. If it does interpret them, it should interpret them
according to the standard, i.e. as spelled-out tags.
So for a non-TagAware Unicode application, any language tag
characters (or any other kind of tag expressed with Plane 14 tag
characters) encountered would be handled exactly as for uninterpreted
Tibetan from the BMP, uninterpreted Linear B from Plane 1, or
uninterpreted Egyptian hieroglyphics from private use space in Plane
15.
A TagAware but TagPhobic Unicode application can recognize the tag
character range in Plane 14 and choose to deliberately strip them out
completely to produce plain text with no tags.
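A TagPhobic filter of this kind can be sketched in a few lines of
Python. The sketch assumes that the tag character block spans
U-000E0000 through U-000E007F and removes every character in that
range, whether or not it forms part of a well-formed tag. The names
TAG_BLOCK and strip_tags are illustrative.

```python
import re

# Assumed extent of the Plane 14 tag character block.
TAG_BLOCK = re.compile("[\U000E0000-\U000E007F]")


def strip_tags(text):
    """Remove all Plane 14 tag characters, leaving untagged plain text."""
    return TAG_BLOCK.sub("", text)
```

All other characters, including uninterpreted ones from elsewhere in
Plane 14, are left undisturbed.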
The presence of a correctly formed tag cannot be taken as a guarantee
that the data so tagged is correctly tagged. For example, nothing
prevents an application from erroneously labelling French data as
Spanish, or from labelling JIS-derived data as Japanese, even if it
contains Greek or Cyrillic characters.
7.1 Note on Encoding Language Tags
The fact that this proposal for encoding tag characters in Unicode
includes a mechanism for specifying language tag values does not mean
that Unicode is departing from one of its basic encoding principles:
Unicode encodes scripts, not languages.
This is still true of the Unicode encoding (and ISO/IEC 10646), even
in the presence of a mechanism for specifying language tags in plain
text. There is nothing obligatory about the use of Plane 14 tags,
whether for language tags or any other kind of tags.
Language tagging in no way impacts current encoded characters or the
encoding of future scripts.
It is fully anticipated that implementations of Unicode which already
make use of out-of-band mechanisms for language tagging or
"heavy-weight" in-band mechanisms such as HTML will continue to do exactly
what they are doing and will ignore Plane 14 tag characters
completely.
8.0 Security Considerations
There are no known security issues raised by this document.
References
[ISO10646] ISO/IEC 10646-1:1993 International Organization for
Standardization. "Information Technology -- Universal
Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane", Geneva, 1993.
[RFC1766] Alvestrand, H., "Tags for the Identification of
Languages", RFC 1766, March 1995.
[RFC2070] Yergeau, F., Nicol, G., Adams, G. and M. Duerst,
"Internationalization of the Hypertext Markup Language",
RFC 2070, January 1997.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2130] Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
Atkinson, R., Crispin, M. and P. Svanberg, "The Report of
the IAB Character Set Workshop held 29 February - 1 March,
1996", RFC 2130, April 1997.
[UNICODE] The Unicode Standard, Version 2.0, The Unicode Consortium,
Addison-Wesley, July 1996.
Acknowledgements
The following people also contributed to this document, directly or
indirectly: Chris Newman, Mark Crispin, Rick McGowan, Joe Becker,
John Jenkins, and Asmus Freytag. This document also was reviewed by
the Unicode Technical Committee, and the authors wish to thank all of
the UTC representatives for their input. The authors are, of course,
responsible for any errors or omissions which may remain in the text.
Authors' Addresses
Ken Whistler
Sybase, Inc.
6475 Christie Ave.
Emeryville, CA 94608-1050
Phone: +1 510 922 3611
EMail: kenw@sybase.com
Glenn Adams
Spyglass, Inc.
One Cambridge Center
Cambridge, MA 02142
Phone: +1 617 679 4652
EMail: glenn@spyglass.com
Full Copyright Statement
Copyright (C) The Internet Society (1999). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.