RFC 1502          X.400 Use of Extended Character Sets       August 1993


   When the text is not representable in one of the ISO-8859 character
   sets, the following rules may be applied:

    (1)  If any Latin characters are used, keep IA5 as the G0 set.

    (2)  If a mainstream character set is used (Greek, Cyrillic,
         Hebrew, Arabic), designate this as the G1 character set,
         and permanently shift this into the upper half of the code
         page (LS1R).
         EXCEPTION: The Japanese community has a long tradition of
         switching between the Japanese 16-bit character set
         ISO-IR-87 and US-ASCII as the G0 set; see [7] for details.
         If ISO-IR-87 is used, that technique should be used instead
         of the one recommended here.

    (3)  If occasional extensions to a character set that is
         basically Latin occur (like accents, national variants
         and so on), and these are available in a single character
         set, designate the relevant set as G2 and use single
         shift (SS2) to invoke characters from this character set.

         The ISO 8859 supplementary set, ISO-IR-154, is recommended
         for this purpose.

         This corresponds to the ISO 4873 "second level" application.

    (4)  If two non-Latin character sets are used, the second should
         be designated as G3, and shifted into the upper half of the
         code page by the use of Locking Shift 3 Right (LS3R).

         This corresponds to the ISO 4873 "third level" application.

    (5)  Character sets with floating accents, such as ISO 6937,
         should be avoided where possible.

    (6)  The shifts changing the lower half of the code table (SI/SO,
         LS2 and LS3) should NOT be used.

   RATIONALE: Keeping the G0 set reserved for US-ASCII ensures that
   text in US-ASCII always has the same bit representation.

   The use of the upper code page for other scripts ensures that both
   text in these languages and text of this type mixed with English can
   be represented without the use of shift sequences.
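
   As an illustration of rule (2), the following minimal Python sketch
   encodes mixed English/Greek text with IA5 left in G0 and the Greek
   set designated as G1 and locked into the upper half. The function
   name is hypothetical. The designation bytes come from the table in
   section 5.3; the LS1R byte sequence (ESC 7E) is taken from ISO 2022,
   not from this memo, and IA5 is assumed to be in G0 already.

```python
ESC = b"\x1b"

# Designation of ISO-IR-126 (the Greek right-hand part of ISO 8859-7) as G1,
# as listed in section 5.3: Esc 2D 46.
DESIGNATE_GREEK_G1 = ESC + b"\x2d\x46"

# Locking Shift 1 Right (LS1R). The sequence ESC 7E is an assumption taken
# from ISO 2022; this memo only names the shift function.
LS1R = ESC + b"\x7e"


def encode_latin_greek(text: str) -> bytes:
    """Encode mixed US-ASCII/Greek text with G0 = IA5 and G1 = Greek."""
    # Python's iso8859_7 codec keeps ASCII in the lower half and puts Greek
    # in 0xA0-0xFF, which matches G0 below and G1 shifted into the upper half.
    body = text.encode("iso8859_7")
    return DESIGNATE_GREEK_G1 + LS1R + body


if __name__ == "__main__":
    data = encode_latin_greek("alpha = α")
    print(data.hex(" "))
```

   After the initial designation and shift, no further shift sequences
   appear in the body, and the US-ASCII bytes are unchanged.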

   If the language and/or content of a text is completely unknown,
   section 5 gives an algorithm that may be used to decide upon the
   character sets. This might, for instance, be suitable for use at
   automatic mail gateways.

   NOTE: At the time of this writing, few applications that use ISO 4873
   level 2 and level 3 encoding exist. It has been estimated that
   implementing them in an application that already uses a rich
   repertoire of characters is a matter of programmer-days, not
   programmer-months, but this has not been proven.

4.  GUIDELINES FOR THE RENDERING OF GENERALTEXT

   As a basic rule, one should NOT assume that any of the rules above
   are followed.

   A user agent capable of rendering GeneralText should:

    (1)  ALWAYS be able to identify and render characters in IA5, no
         matter how they are designated and invoked.

    (2)  ALWAYS be able to identify and render characters in the
         "native" character sets, no matter how they are designated
         and invoked.

    (3)  ALWAYS indicate the presence of characters that cannot be
         adequately represented on the current output device.

    (4)  NEVER render a character in an unknown or unrepresentable
         character set by displaying the character in the same bit
         position in the native character set.

    (5)  PREFERABLY be able to identify and render characters that are
         the same as characters in the "native" character sets, even
         though they are designated and invoked as part of other
         character sets.  This applies in particular to the
         "invariant" part of ISO 8859, parts 1 through 6.

    (6)  PREFERABLY be able to combine the floating accents of ISO
         6937 with their base characters for suitable rendering using
         the capabilities of the current output device.

    (7)  PREFERABLY be able to display text both in a mode using
         fallbacks for nonrenderable characters and in a mode
         designating nonrenderable characters as such.

    (8)  PREFERABLY be able to save the content of a GeneralText
         message to a file or other suitable medium, saving all
         character set information, for later processing by other
         means.  It is not illegal to render the character set
         information into a different format; however, it should be
         noted that it is easy to lose vital information if the format
         chosen for representing character sets does not offer the
         possibility of referencing all character sets in the ISO
         registry of character sets.

   These requirements also apply to gateways that transform the message
   into some other format, for example a gateway that transforms a
   message into MIME using [7] for the purpose.
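
   Guidelines (3), (4) and (7) can be sketched in Python as follows.
   This is an illustration only: it assumes that the escape sequences
   have already been parsed into (character set, bytes) pairs, that
   unknown sets are single-byte, and that a small hypothetical table
   maps character set names to Python codecs. None of these names
   come from this memo.

```python
# Hypothetical registry: character set name -> Python codec that decodes it.
KNOWN_SETS = {
    "ISO-IR-6": "ascii",
    "ISO-IR-100": "iso8859_1",
    "ISO-IR-126": "iso8859_7",
}

UNRENDERABLE = "\ufffd"  # visible marker for characters that cannot be shown


def render(segments, fallbacks=None):
    """Render (charset_name, raw_bytes) pairs as a display string.

    With `fallbacks` (a dict from charset name to replacement text) the
    function works in the fallback mode of guideline (7); without it,
    every unrenderable character is marked as such.
    """
    out = []
    for charset, raw in segments:
        codec = KNOWN_SETS.get(charset)
        if codec is None:
            # Guideline (4): never reinterpret the bytes in the native set.
            if fallbacks and charset in fallbacks:
                out.append(fallbacks[charset])
            else:
                # Guideline (3): one marker per byte (single-byte assumption).
                out.append(UNRENDERABLE * len(raw))
            continue
        # Undefined positions inside a known set are marked as well.
        out.append(raw.decode(codec, errors="replace"))
    return "".join(out)
```

   The important property is that bytes from an unknown set are never
   passed through the native codec.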

5.  RECOMMENDATION FOR SELECTION OF CHARACTER SETS

5.1.  Algorithm for selection of character sets

   When one has text in which characters from several character sets
   occur, and wants to process this into a GeneralText document, it is
   often hard to guess which character sets to select.

   The following paragraphs give an algorithm that can be started at the
   beginning of a message and that, at the end of the message, returns
   a set of character sets that can be used as G0..G3 character sets,
   OR an indication that the task is impossible.

    VARIABLES:

    UsedSets
         The set of character sets that MUST be used for this message

    UsableSets
         The set of character sets that MAY be used for this message.
         Each set also contains a counter for each character position.

    PossibleSets
         The set of all the character sets known to be usable in the
         destination format.

    ALGORITHM:

    1)   Add IA5 (ISO-IR-6) to the UsedSets (as G0).

    2)   Get the next character of the text.  If the text is
         completely analyzed, go to FINISHED.

    3)   If it is in the UsedSets, go to 2).

    4)   Find the set of character sets from PossibleSets in which the
         character occurs. If it does not occur in any, report
         failure.

    5)   If it is in a single character set in PossibleSets only, add
         this set to UsedSets, and go to 2).

    6)   If it is in more than one character set, add these to
         UsableSets (if not already present), and increment the
         counter for that character in all the sets. Go to 2).

    FINISHED)

    1)   (FINAL SELECTION) Remove any character set in UsedSets from
         UsableSets.

         Zero the counters for any character in UsableSets that also
         occurs in UsedSets.
         WHILE (characters with nonzero counters remain)
           Select one character set and move it from UsableSets to
           UsedSets.
           Zero the counters for all characters in this set in the other
           UsableSets.
         END WHILE
         This step can be "tuned" any way you want, for instance by
         choosing the character sets most likely to be understood at
         the destination first, choosing the character sets covering
         the most characters first, avoiding multi-byte character sets
         as long as possible, or any other scheme suitable for the
         application.
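
   The algorithm can be sketched in Python as follows. This is one
   possible reading, not a reference implementation: it follows the
   VARIABLES definitions above (ambiguous characters are counted in
   UsableSets), lists only four sets in PossibleSets to stay short,
   treats all of 0x00-0x7F as IA5, and tunes the final selection by
   taking the first set in list order. The helper names and the use
   of Python codecs to obtain repertoires are assumptions.

```python
from collections import Counter


def repertoire(codec, low, high):
    """Characters that `codec` defines for the byte values low..high."""
    chars = set()
    for value in range(low, high + 1):
        try:
            chars.add(bytes([value]).decode(codec))
        except UnicodeDecodeError:
            pass  # position not defined in this set
    return chars


# PossibleSets in preference order (codec names are Python's, not the memo's).
POSSIBLE_SETS = {
    "ISO-IR-6": repertoire("ascii", 0x00, 0x7F),
    "ISO-IR-100": repertoire("iso8859_1", 0xA0, 0xFF),
    "ISO-IR-101": repertoire("iso8859_2", 0xA0, 0xFF),
    "ISO-IR-126": repertoire("iso8859_7", 0xA0, 0xFF),
}


def select_charsets(text, possible=POSSIBLE_SETS):
    used = ["ISO-IR-6"]                      # step 1
    usable = {}                              # set name -> Counter per character

    def covered(ch):
        return any(ch in possible[name] for name in used)

    for ch in text:                          # step 2
        if covered(ch):                      # step 3
            continue
        found = [name for name, chars in possible.items() if ch in chars]
        if not found:                        # step 4
            raise ValueError(f"no character set contains {ch!r}")
        if len(found) == 1:                  # step 5
            used.append(found[0])
        else:                                # step 6
            for name in found:
                usable.setdefault(name, Counter())[ch] += 1

    # FINISHED: final selection
    for name in used:
        usable.pop(name, None)
    for counts in usable.values():
        for ch in [c for c in counts if covered(c)]:
            del counts[ch]
    while any(usable.values()):
        # Tuning point: here the first set in preference order wins.
        pick = next(name for name in possible if usable.get(name))
        used.append(pick)
        del usable[pick]
        for counts in usable.values():
            for ch in [c for c in counts if c in possible[pick]]:
                del counts[ch]
    return used
```

   With this table, a text containing both "é" (present in two of the
   sets) and "ř" (present in only one) selects ISO-IR-101 alone in
   addition to IA5, because the unambiguous character settles the
   choice before the final selection runs.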

5.2.  WHAT TO DO ON FAILURE

   Failure will occur in this scheme if a character is found that is not
   in any of the PossibleSets. It may then be handled in one of the
   following ways:

    (1)  Replace the character with the SUB control character.

    (2)  Replace the character with Keld Simonsen Mnemonics [8].
         This is a reversible transformation as long as the
         recipient is aware that it has been used, but requires
         passing out-of-band information to indicate this.

    (3)  Replace the lost characters with any suitable fallback or
         mnemonic scheme intended for human understanding.

    (4)  Bounce the message/refuse the conversion/give up.

   The action to be taken may be different based on the percentage of
   "lost" characters.




   If the message has "controls" like "conversion with loss prohibited",
   only the last possibility may be used.
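
   Options (1) and (4), together with the percentage rule and the
   "loss prohibited" control, might be combined as in the following
   Python sketch. The function name, the exception class and the 10%
   default threshold are hypothetical; this memo does not prescribe a
   threshold.

```python
SUB = "\x1a"  # the SUB control character


class ConversionRefused(Exception):
    """Raised when the message is bounced instead of converted."""


def replace_lost(text, representable, loss_prohibited=False, max_loss=0.10):
    """Replace unrepresentable characters with SUB, or refuse to convert.

    `representable` is the set of characters covered by PossibleSets.
    `max_loss` is an illustrative threshold, not a value from this memo.
    """
    lost = sum(1 for ch in text if ch not in representable)
    if lost and loss_prohibited:
        # Only option (4) is allowed when loss is prohibited.
        raise ConversionRefused("conversion with loss prohibited")
    if text and lost / len(text) > max_loss:
        raise ConversionRefused(f"{lost} of {len(text)} characters lost")
    return "".join(ch if ch in representable else SUB for ch in text)
```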

5.3.  RECOMMENDED CHARACTER SETS

   There are two steps in the algorithm above that are left for local
   judgement:

    (1)  Selection of the sets to appear in PossibleSets.

    (2)  The algorithm for deciding which character set to select in
         the FINAL SELECTION step.

   In the context of generating X.400 GeneralText messages, the
   following is recommended:

    Sets in PossibleSets:
    ISO-IR-6        Esc 28 42 (G0)  US-ASCII, IA5, ISO646
    ISO-IR-100      Esc 2D 41 (G1)  ISO-8859-1   West Europe
    ISO-IR-101      Esc 2D 42 (G1)  ISO-8859-2   Central/Eastern Europe
    ISO-IR-144      Esc 2D 4C (G1)  ISO-8859-5   Cyrillic
    ISO-IR-127      Esc 2D 47 (G1)  ISO-8859-6   Arabic
    ISO-IR-126      Esc 2D 46 (G1)  ISO-8859-7   Greek
    ISO-IR-138      Esc 2D 48 (G1)  ISO-8859-8   Hebrew
    ISO-IR-148      Esc 2D 4D (G1)  ISO-8859-9   Turkish

   The following multi-byte character sets are recommended:

    ISO-IR-87 (Japanese JIS C6226-1983)     Esc 24 29 42 (G1)
    ISO-IR-149 (Korean KS C 5601-1989)      Esc 24 29 43 (G1)
    ISO-IR-58 (Chinese GB 2312-80)          Esc 24 29 41 (G1)

   It is a STRONG recommendation that character sets not listed above,
   which do not add any new characters to the total set of characters
   given by the character sets above, should NOT be used in X.400
   interchange.

   ISO-IR-87 is the Japanese character set that is allowed in a Teletex
   string, such as the subject field.

   NOTE: ISO-IR-87 has been "superseded" by ISO-IR-168, which allows two
   extra Kanji characters. Any application that handles ISO-IR-87 should
   also be able to handle ISO-IR-168.

   Algorithm for selecting character sets:

   Start at the top of the list above, and add each set only if it is
   needed.
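
   The single-byte part of the table above can be turned into
   designation sequences with a short Python sketch such as the one
   below. The function names are hypothetical. The table only lists
   G1 designations; the intermediate bytes used here for G2 (2E) and
   G3 (2F) are taken from ISO 2022 and are not stated in this memo.

```python
ESC = 0x1B

# Final bytes from the table above (the third byte of each "Esc 2D xx").
FINAL_BYTE = {
    "ISO-IR-100": 0x41,
    "ISO-IR-101": 0x42,
    "ISO-IR-144": 0x4C,
    "ISO-IR-127": 0x47,
    "ISO-IR-126": 0x46,
    "ISO-IR-138": 0x48,
    "ISO-IR-148": 0x4D,
}

# Intermediate bytes for designating a 96-character set. 0x2D (G1) is the
# value used in the table; 0x2E (G2) and 0x2F (G3) are assumed from ISO 2022.
INTERMEDIATE = {"G1": 0x2D, "G2": 0x2E, "G3": 0x2F}


def designation(name, slot="G1"):
    """Escape sequence that designates `name` into the given G-set."""
    if name == "ISO-IR-6":
        return bytes([ESC, 0x28, 0x42])  # IA5 always goes to G0
    return bytes([ESC, INTERMEDIATE[slot], FINAL_BYTE[name]])


def announce(selected):
    """Concatenate designations for IA5 plus up to three further sets."""
    extra = [name for name in selected if name != "ISO-IR-6"]
    if len(extra) > 3:
        raise ValueError("at most three sets fit in G1..G3")
    parts = [designation("ISO-IR-6")]
    parts += [designation(n, s) for n, s in zip(extra, ("G1", "G2", "G3"))]
    return b"".join(parts)
```

   The multi-byte sets are left out of this sketch because their
   designation sequences have a different form (Esc 24 ...).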



6.  REFERENCES

   [1]  Information technology - ISO 8-bit code for information
        interchange - Structure and rules for implementation, Third
        edition, 1991-12-15.

   [2]  Information technology - 8-bit single-byte coded graphic
        character sets (parts 1-11; the parts have different dates, the
        ones referenced here are from RFC 1345).

   [3]  Information technology - Coded graphic character set for text
        communication (parts 1 and 2; part 2 dated 1983-12-15).

   [4]  Code for the representation of names of languages. 1988 version.

   [5]  CCITT Recommendation X.209(1988): Specification of Basic
        Encoding Rules for Abstract Syntax Notation One (ASN.1).
        Technically aligned with ISO 8825 and ISO 8825/AD 1.

   [6]  Information Technology - Universal Multiple-Octet Coded
        Character Set (UCS) - ISO 10646.

   [7]  Murai, J., Crispin, M., and E. van der Poel, "Japanese Character
        Encoding for Internet Message Bodies", RFC 1468, Keio
        University, Panda Programming, June 1993.

   [8]  Simonsen, K., "Character Mnemonics & Character Sets", RFC 1345,
        Rationel Almen Planlaegning, June 1992.

   [9]  Information Technology - Text communication - Message-Oriented
        Text Interchange Systems (MOTIS) - ISO 10021 - October 1988.

   [10] ISO DIS documents describing X.400/84 with slight extensions.
        Now very hard to get copies of, since they failed to become
        ISes.

   [11] CCITT Recommendation X.420 (1988), Interpersonal Messaging
        System.

   [12] International Standard - Information Processing - ISO 7-bit and
        8-bit coded character sets - Code extension techniques, ISO
        2022:1986.

7.  Security Considerations

   Security issues are not discussed in this memo.





8.  Author's Address

   Harald Tveit Alvestrand
   SINTEF DELAB
   N-7034 Trondheim
   NORWAY

   EMail: Harald.Alvestrand@delab.sintef.no