Internet Engineering Task Force (IETF)                       Mark Welter
INTERNET-DRAFT                                        Brian W. Spolarich
draft-ietf-idn-utf6-00                                       WALID, Inc.
November 16, 2000                                   Expires May 16, 2001


         UTF-6 - Yet Another ASCII-Compatible Encoding for IDN


Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.  Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

The distribution of this document is unlimited.

Copyright (c) The Internet Society (2000).  All Rights Reserved.


Abstract

This document describes a transformation method for representing
Unicode character codepoints in host name parts in a fashion that is
completely compatible with the current Domain Name System.  It is
proposed as a potential candidate for an ASCII-Compatible Encoding (ACE)
for supporting the deployment of an internationalized Domain Name
System.

The transformation method, an extension of the UTF-5 encoding proposed
by Duerst, provides a more efficient representation of typical Unicode
sequences while preserving simplicity and readability.  This
transformation method is deployed as part of the current WALID
multilingual domain name system implementation, although that status
should not necessarily influence the evaluation of its merits as a
candidate encoding method.


Table of Contents

1. Introduction
   1.1 Terminology
2. Hostname Part Transformation
   2.1 Post-Converted Name Prefix
   2.2 Hostname Preparation
   2.3 Definitions
   2.4 UTF-6 Encoding
       2.4.1 Variable Length Hex Encoding
       2.4.2 UTF-6 Compression Algorithm
       2.4.3 Forward Transformation Algorithm
   2.5 UTF-6 Decoding
       2.5.1 Variable Length Hex Decoding
       2.5.2 UTF-6 Decompression Algorithm
       2.5.3 Reverse Transformation Algorithm
3. Examples
   3.1 'www.walid.com' (in Arabic)
4. Security Considerations
5. References


1. Introduction

UTF-6 describes an encoding scheme of the ISO/IEC 10646 [ISO10646]
character set (whose character code assignments are synchronized
with Unicode [UNICODE3]), and the procedures for using this scheme
to transform host name parts containing Unicode character sequences
into sequences that are compatible with the current DNS protocol
[STD13].  As such, it satisfies the definition of a 'charset' as
defined in [IDNREQ].

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Hexadecimal values are shown preceded by "0x".  For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5.  Binary values are
shown preceded by "0b".  For example, a nine-bit value might be
shown as "0b101101111".

Examples in this document use the notation from the Unicode Standard
[UNICODE3] as well as the ISO 10646 names.  For example, the letter "a"
may be represented as either "U+0061" or "LATIN SMALL LETTER A".

UTF-6 converts strings with internationalized characters into
strings of US-ASCII that are acceptable as host name parts in current
DNS host naming usage.  The former are called "pre-converted" and the
latter are called "post-converted".  This specification defines both
a forward and a reverse transformation algorithm.


2. Hostname Part Transformation

According to [STD13], hostname parts must be case-insensitive, start and
end with a letter or digit, and contain only letters, digits, and the
hyphen character ("-").
This, of course, excludes most characters used
by non-English speakers, as well as many other characters in the ASCII
character repertoire.  Further, domain name parts must be 63 octets or
shorter in length.

2.1 Post-Converted Name Prefix

This document defines the string 'wq--' as a prefix to identify
UTF-6-encoded sequences.  For the purposes of comparison in the IDN
Working Group activities, the 'wq--' prefix should be used solely to
identify UTF-6 sequences.  However, should this document proceed beyond
draft status, the prefix should be changed to whatever prefix, if any,
is the final consensus of the IDN working group.

Note that the prepending of a fixed identifier sequence is only one
mechanism for differentiating ASCII character encoded international
domain names from 'ordinary' domain names.  One method, as proposed in
[IDNRACE], is to include a character prefix or suffix that does not
appear in any name in any zone file.  A second method is to insert a
domain component which pushes off any international names one or more
levels deeper into the DNS hierarchy.  There are trade-offs between
these two methods which are independent of the Unicode to ASCII
transcoding method finally chosen.  We do not address the international
vs. 'ordinary' name differentiation issue in this paper.

2.2 Hostname Preparation

The hostname part is assumed to have at least one character disallowed
by [STD13], and it is assumed that it has been processed for logically
equivalent character mapping, filtering of disallowed characters (if
any), and compatibility composition/decomposition before presentation
to the UTF-6 conversion algorithm.

While it is possible to invent a transcoding mechanism that relies
on certain Unicode characters being deemed illegal within domain names
and hence available to the transcoding mechanism for improving encoding
efficiency, we feel that such a proposal would complicate matters
excessively.
We also believe that Unicode name preprocessing for
both name resolution and name registration should be considered as
separate, independent issues, which we will attempt to address in a
separate document.

2.3 Definitions

For clarity:

   'integer' is an unsigned binary quantity;
   'byte' is an 8-bit integer quantity;
   'nibble' is a 4-bit integer quantity.

2.4 UTF-6 Encoding

The idea behind this scheme was to improve on the UTF-5 transformation
algorithm described in [IDNDUERST] by providing a straightforward
compression mechanism.  UTF-6 defines a compression mechanism by
identifying identical leading byte or nibble values in the pre-converted
string, and using the length of this leading value to select a mask
which can be applied to the pre-converted string.  The resulting
post-converted string preserves the simplicity and readability of UTF-5
while enabling longer sequences to be encoded into a single host name
part.

2.4.1 Variable Length Hex Encoding

The variable length hex encoding algorithm was introduced by Duerst in
[IDNDUERST].  It encodes an integer value in a slight modification of
traditional hexadecimal notation, the difference being that the most
significant digit is represented with an alternate set of "digits":
'g' through 'v' are used to represent 0 through 15.  The result is a
variable length encoding which can efficiently represent integers of
arbitrary length.

The variable length nibble encoding of an integer, C, is defined
as follows:

   1. Skip over leading non-significant zero nibbles to find I, the
      first significant nibble of C (if C is zero, I is 0);

   2. Emit the Ith character of the set [ghijklmnopqrstuv];

   3. Continue from most to least significant, encoding each
      remaining nibble J by emitting the Jth character of the set
      [0123456789abcdef].

Examples:

   0x1f4c is encoded as "hf4c"
   0x0624 is encoded as "m24"
   0x0000 is encoded as "g"

Notation: 'n' (a single character in single quotes) stands for the
Unicode code point for that character.

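The three encoding steps above can be sketched in Python as follows.
This is an illustrative sketch only and is not part of the draft; the
names vlhex_encode, LEAD_DIGITS and HEX_DIGITS are hypothetical, and
the lead-digit alphabet is assumed to be the full run 'g' through 'v'
(16 characters), as the prose description requires.

```python
# Sketch of the variable length hex encoding (hypothetical helper names).
LEAD_DIGITS = "ghijklmnopqrstuv"  # alphabet for the first significant nibble
HEX_DIGITS = "0123456789abcdef"   # alphabet for every remaining nibble


def vlhex_encode(value):
    """Return the variable length hex encoding of a non-negative integer."""
    if value < 0:
        raise ValueError("value must be a non-negative integer")

    # Collect nibbles from least to most significant; stopping as soon as
    # the remaining value is zero drops the leading zero nibbles (step 1).
    nibbles = []
    while True:
        nibbles.append(value & 0xF)
        value >>= 4
        if value == 0:
            break
    nibbles.reverse()

    # Step 2: first significant nibble uses the 'g'..'v' alphabet.
    # Step 3: remaining nibbles use ordinary lowercase hex digits.
    return LEAD_DIGITS[nibbles[0]] + "".join(HEX_DIGITS[n] for n in nibbles[1:])


if __name__ == "__main__":
    for sample in (0x1F4C, 0x0624, 0x0000):
        print(hex(sample), "->", vlhex_encode(sample))
```

Applied to the three example values listed above, the sketch yields
"hf4c", "m24" and "g" respectively.  Because only the first digit comes
from the 'g' through 'v' alphabet, a decoder can find the start of each
encoded integer without any separator character.
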
2.4.2 UTF-6 Compression Algorithm

UTF-6 improves on the UTF-5 encoding by providing compression, which
enables encoding of a larger number of characters in each hostname
part.  The compression algorithm is defined as follows:

   1. Set the mask to 0xFFFF;

   2. If the number of non '-' characters is less than 2, proceed to
      step 5;

   3. If the most significant byte of every non '-' character is the
      same value:

      3a. Set HB to this value;
      3b. Emit 'Y';
      3c. Emit the variable length hex encoding of HB;
      3d. Set the mask to 0x00FF;
      3e. Proceed to step 5.

   4. If the most significant nibble of every non '-' character is the
      same value:

      4a. Set HN to this value;
      4b. Emit 'Z';
      4c. Emit the variable length hex encoding of HN;
      4d. Set the mask to 0x0FFF.

   5. For each input character: