rfc2482.txt

来自「RFC 的详细文档！」· 文本代码 · 共 788 行 · 第 1/2 页
TXT
788 行






Network Working Group                                       K. Whistler
Request for Comments: 2482                                       Sybase
Category: Informational                                        G. Adams
                                                               Spyglass
                                                           January 1999


                 Language Tagging in Unicode Plain Text

Status of this Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1999).  All Rights Reserved.

IESG Note:

   This document has been accepted by ISO/IEC JTC1/SC2/WG2 in meeting
   #34 to be submitted as a recommendation from WG2 for inclusion in
   Plane 14 in part 2 of ISO/IEC 10646.

1.  Abstract

   This document proposed a mechanism for language tagging in [UNICODE]
   plain text. A set of special-use tag characters on Plane 14 of
   [ISO10646] (accessible through UTF-8, UTF-16, and UCS-4 encoding
   forms) are proposed for encoding to enable the spelling out of
   ASCII-based string tags using characters which can be strictly
   separated from ordinary text content characters in ISO10646 (or
   UNICODE).

   One tag identification character and one cancel tag character are
   also proposed. In particular, a language tag identification character
   is proposed to identify a language tag string specifically; the
   language tag itself makes use of [RFC1766] language tag strings
   spelled out using the Plane 14 tag characters. Provision of a
   specific, low-overhead mechanism for embedding language tags in plain
   text is aimed at meeting the need of Internet Protocols such as ACAP,
   which require a standard mechanism for marking language in UTF-8
   strings.

   The tagging mechanism as well the characters proposed in this
   document have been approved by the Unicode Consortium for inclusion
   in The Unicode Standard.  However, implementation of this decision



Whistler & Adams             Informational                      [Page 1]

RFC 2482         Language Tagging in Unicode Plain Text     January 1999


   awaits formal acceptance by ISO JTC1/SC2/WG2, the working group
   responsible for ISO10646. Potential implementers should be aware that
   until this formal acceptance occurs, any usage of the characters
   proposed herein is strictly experimental and not sanctioned for
   standardized character data interchange.

2.  Definitions and Notation

   No attempt is made to define all terms used in this document. In
   particular, the terminology pertaining to the subject of coded
   character systems is not explicitly specified. See [UNICODE],
   [ISO10646], and [RFC2130] for additional definitions in this area.

2.1 Requirements Notation

   This document occasionally uses terms that appear in capital letters.
   When the terms "MUST", "SHOULD", "MUST NOT", "SHOULD NOT", and "MAY"
   appear capitalized, they are being used to indicate particular
   requirements of this specification. A discussion of the meanings of
   these terms appears in [RFC2119].

2.2 Definitions

   The terms defined below are used in special senses and thus warrant
   some clarification.

2.2.1 Tagging

   The association of attributes of text with a point or range of the
   primary text. (The value of a particular tag is not generally
   considered to be a part of the "content" of the text. Typical
   examples of tagging is to mark language or font of a portion of
   text.)

2.2.2 Annotation

   The association of secondary textual content with a point or range of
   the primary text. (The value of a particular annotation *is*
   considered to be a part of the "content" of the text. Typical
   examples include glossing, citations, exemplication, Japanese yomi,
   etc.)

2.2.3 Out-of-band

   An out-of-band channel conveys a tag in such a way that the textual
   content, as encoded, is completely untouched and unmodified. This is
   typically done by metadata or hyperstructure of some sort.




Whistler & Adams             Informational                      [Page 2]

RFC 2482         Language Tagging in Unicode Plain Text     January 1999


2.2.4 In-band

   An in-band channel conveys a tag along with the textual content,
   using the same basic encoding mechanism as the text itself. This is
   done by various means, but an obvious example is SGML markup, where
   the tags are encoded in the same character set as the text and are
   interspersed with and carried along with the text data.

3.0 Background

   There has been much discussion over the last 8 years of language
   tagging and of other kinds of tagging of Unicode plain text. It is
   fair to say that there is more-or-less universal agreement that
   language tagging of Unicode plain text is required for certain
   textual processes. For example, language "hinting" of multilingual
   text is necessary for multilingual spell-checking based on multiple
   dictionaries to work well.  Language tagging provides a minimum level
   of required information for text-to-speech processes to work
   correctly.  Language tagging is regularly done on web pages, to
   enable selection of alternate content, for example.

   However, there has been a great deal of controversy regarding the
   appropriate placement of language tags. Some have held that the only
   appropriate placement of language tags (or other kinds of tags) is
   out-of-band, making use of attributed text structures or metadata.
   Others have argued that there are requirements for lower-complexity
   in-band mechanisms for language tags (or other tags) in plain text.

   The controversy has been muddied by the existence and widespread use
   of a number of in-band text markup mechanisms (HTML, text/enriched,
   etc.) which enable language tagging, but which imply the use of
   general parsing mechanisms which are deemed too "heavyweight" for
   protocol developers and a number of other applications. The
   difficulty of using general in-band text markup for simple protocols
   derives from the fact that some characters are used both for textual
   content and for the text markup; this makes it more difficult to
   write simple, fast algorithms to find only the textual content and
   ignore the tags, or vice versa. (Think of this as the algorithmic
   equivalent of the difficulty the human reader has attempting to read
   just the content of raw HTML source text without a browser
   interpreting all the markup tags.)

   The Plane 14 proposal addresses the recurrent and persistent call for
   a lighter-weight mechanism for text tagging than typical text markup
   mechanisms in Unicode. It proposes a special set of characters used
   *only* for tagging. These tag characters can be embedded into plain





Whistler & Adams             Informational                      [Page 3]

RFC 2482         Language Tagging in Unicode Plain Text     January 1999


   text and can be identified and/or ignored with trivial algorithms,
   since there is no overloading of usage for these tag characters--they
   can only express tag values and never textual content itself.

   The Plane 14 proposal is not intended for general annotation of text,
   such as textual citations, phonetic readings (e.g.  Japanese Yomi),
   etc. In its present form, its use is intended to be restriced solely
   to specifying in-line language tags.  Future extensions may widen
   this scope of intended usage.

4.0 Proposal

   This proposal suggests the use of 97 dedicated tag characters encoded
   at the start of Plane 14 of ISO/IEC 10646 consisting of a clone of
   the 94 printable 7-bit ASCII graphic characters and ASCII SPACE, as
   well as a tag identification character and a tag cancel character.

   These tag characters are to be used to spell out any ASCII-based
   tagging scheme which needs to be embedded in Unicode plain text. In
   particular, they can be used to spell out language tags in order to
   meet the expressed requirements of the ACAP protocol and the likely
   requirements of other new protocols following the guidelines of the
   IAB character workshop (RFC 2130).

   The suggested range in Plane 14 for the block reserved for tag
   characters is as follows, expressed in each of the three most
   generally used encoding schemes for ISO/IEC 10646:

   UCS-4

   U-000E0000 .. U-000E007F

   UTF-16

   U+DB40 U+DC00 .. U+DB40 U+DC7F

   UTF-8

   0xF3 0xA0 0x80 0x80 .. 0xF3 0xA0 0x81 0xBF

   Of this range, U-000E0020 .. U-000E007E is the suggested range for
   the ASCII clone tag characters themselves.

4.1 Names for the Tag Characters

   The names for the ASCII clone tag characters should be exactly the
   ISO 10646 names for 7-bit ASCII, prefixed with the word "TAG".




Whistler & Adams             Informational                      [Page 4]

RFC 2482         Language Tagging in Unicode Plain Text     January 1999


   In addition, there is one tag identification character and a CANCEL
   TAG character. The use and syntax of these characters is described in
   detail below.

   The entire encoding for the proposed Plane 14 tag characters and
   names of those characters can be derived from the following list.
   (The encoded values here and throughout this proposal are listed in
   UCS-4 form, which is easiest to interpret. It is assumed that most
   Unicode applications will, however, be making use either of UTF-16 or
   UTF-8 encoding forms for actual implementation.)

   U-000E0000  <reserved>
   U-000E0001  LANGUAGE TAG
   U-000E0002  <reserved>
   U-000E001F  <reserved>
   U-000E0020  TAG SPACE
   U-000E0021  TAG EXCLAMATION MARK
   U-000E0041  TAG LATIN CAPITAL LETTER A
   U-000E007A  TAG LATIN SMALL LETTER Z
   U-000E007E  TAG TILDE
   U-000E007F  CANCEL TAG

4.2 Range Checking for Tag Characters

   The range checks required for code testing for tag characters would
   be as follows. The same range check is expressed here in C for each
   of the three significant encoding forms for 10646.

Range check expressed in UCS-4:

if ( ( *s >= 0xE0000 ) || ( *s <= 0xE007F ) )

Range check expressed in UTF-16 (Unicode):

if ( ( *s == 0xDB40 ) && ( *(s+1) >= 0xDC00 ) && ( *(s+1) <= 0xDC7F ) )

Expressed in UTF-8:

if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) && ( *(s+2) & 0xE0 == 0x80 )

   Because of the choice of the range for the tag characters, it would
   also be possible to express the range check for UCS-4 or UTF-16 in
   terms of bitmask operations, as well.








Whistler & Adams             Informational                      [Page 5]

RFC 2482         Language Tagging in Unicode Plain Text     January 1999


4.3 Syntax for Embedding Tags

   The use of the Plane 14 tag characters is very simple. In order to
   embed any ASCII-derived tag in Unicode plain text, the tag is simply
   spelled out with the tag characters instead, prefixed with the
   relevant tag identification character. The resultant string is
   embedded directly in the text.

   The tag identification character is used as a mechanism for
   identifying tags of different types. This enables multiple types of
   tags to coexist amicably embedded in plain text and solves the
   problem of delimitation if a tag is concatenated directly onto
   another tag. Although only one type of tag is currently specified,
   namely the language tag, the encoding of other tag identification
   characters in the future would allow for distinct tag types to be
   used.

   No termination character is required for a tag. A tag terminates
   either when the first non Plane 14 Tag Character (i.e. any other
   normal Unicode value) is encountered, or when the next tag
   identification character is encountered.

   All tag arguments must be encoded only with the tag characters U-
   000E0020 .. U-000E007E. No other characters are valid for expressing
   the tag argument.

   A detailed BNF syntax for tags is listed below.

4.4   Tag Scope and Nesting

   The value of an established tag continues from the point the tag is
   embedded in text until either:

      A. The text itself goes out of scope, as defined by the
         application. (E.g. for line-oriented protocols, when reaching
         the end-of-line or end-of-string; for text streams, when
         reaching the end-of-stream; etc.)

   or

      B. The tag is explicitly cancelled by the CANCEL TAG character.

   Tags of the same type cannot be nested in any way. The appearance of
   a new embedded language tag, for example, after text which was
   already language tagged, simply changes the tagged value for
   subsequent text to that specified in the new tag.





Whistler & Adams             Informational                      [Page 6]

RFC 2482         Language Tagging in Unicode Plain Text     January 1999


   Tags of different type can have interdigitating scope, but not
   hierarchical scope. In effect, tags of different type completely
   ignore each other, so that the use of language tags can be completely
   asynchronous with the use of character set source tags (or any other
   tag type) in the same text in the future.

4.5 Cancelling Tag Values

   U-000E007F CANCEL TAG is provided to allow the specific cancelling of
   a tag value. The use of CANCEL TAG has the following syntax.  To
   cancel a tag value of a particular type, prefix the CANCEL TAG
   character with the tag identification character of the appropriate
   type. For example, the complete string to cancel a language tag is:

   U-000E0001 U-000E007F

   The value of the relevant tag type returns to the default state for
   that tag type, namely: no tag value specified, the same as untagged
   text.

   The use of CANCEL TAG without a prefixed tag identification character
   cancels *any* Plane 14 tag values which may be defined. Since only
   language tags are currently provided with an explicit tag
   identification character, only language tags are currently affected.

   The main function of CANCEL TAG is to make possible such operations
   as blind concatenation of strings in a tagged context without the
   propagation of inappropriate tag values across the string boundaries.
   For example, a string tagged with a Japanese language tag can have
   its tag value "sealed off" with a terminating CANCEL TAG before
   another string of unknown language value is concatenated to it. This
   would prevent the string of unknown language from being erroneously
   marked as being Japanese simply because of a concatenation to a
   Japanese string.

4.6 Tag Syntax Description

   An extended BNF (Backus-Naur Form) description of the tags specified
   in this proposal is found below.  Note the following BNF extensions
   used in this formalism:

   1. Semantic constraints are specified by rules in the form of an
      assertion specified between double braces; the variable $$ denotes
      the string consisting of all terminal symbols matched by the this
      non-terminal.

      Example:   {{ Assert ( $$[0] == '?' ); }}




Whistler & Adams             Informational                      [Page 7]
rfc2482.txt - 源码说明

本页面展示了「RFC 的详细文档！」中的 rfc2482.txt 源码文件，采用文本编程语言编写，共 788 行代码。您可以在线阅读完整代码内容，也可以返回资源详情页下载完整源码包进行本地学习和开发。
虫虫下载站收录了大量与RFC相关的技术资源，包括源代码、技术文档、电路图等，是电子工程师和嵌入式开发者的专业学习平台。
⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?