INTERNET-DRAFT                                               Mark Welter
draft-ietf-idn-dude-02.txt                            Brian W. Spolarich
Expires 2001-Dec-07                                     Adam M. Costello
                                                             2001-Jun-07

              Differential Unicode Domain Encoding (DUDE)

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note that
    other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time. It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited. Please send comments to
    the authors or to the idn working group at idn@ops.ietf.org.

Abstract

    DUDE is a reversible transformation from a sequence of nonnegative
    integer values to a sequence of letters, digits, and hyphens (LDH
    characters). DUDE provides a simple and efficient ASCII-Compatible
    Encoding (ACE) of Unicode strings [UNICODE] for use with
    Internationalized Domain Names [IDN] [IDNA].

Contents

    1. Introduction
    2. Terminology
    3. Overview
    4. Base-32 characters
    5. Encoding procedure
    6. Decoding procedure
    7. Example strings
    8. Security considerations
    9. References
    A. Acknowledgements
    B. Author contact information
    C. Mixed-case annotation
    D. Differences from draft-ietf-idn-dude-01
    E. Example implementation

1. Introduction

    The IDNA draft [IDNA] describes an architecture for supporting
    internationalized domain names.
    Each label of a domain name may begin with a special prefix, in
    which case the remainder of the label is an ASCII-Compatible
    Encoding (ACE) of a Unicode string satisfying certain constraints.
    For the details of the constraints, see [IDNA] and [NAMEPREP]. The
    prefix has not yet been specified, but see http://www.i-d-n.net/ for
    prefixes to be used for testing and experimentation.

    DUDE is intended to be used as an ACE within IDNA, and has been
    designed to have the following features:

      * Completeness: Every sequence of nonnegative integers maps to an
        LDH string. Restrictions on which integers are allowed, and on
        sequence length, may be imposed by higher layers.

      * Uniqueness: Every sequence of nonnegative integers maps to at
        most one LDH string.

      * Reversibility: Any Unicode string mapped to an LDH string can
        be recovered from that LDH string.

      * Efficient encoding: The ratio of encoded size to original size
        is small. This is important in the context of domain names
        because [RFC1034] restricts the length of a domain label to 63
        characters.

      * Simplicity: The encoding and decoding algorithms are reasonably
        simple to implement. The goals of efficiency and simplicity are
        at odds; DUDE places greater emphasis on simplicity.

    An optional feature is described in appendix C "Mixed-case
    annotation".

2. Terminology

    The key words "must", "shall", "required", "should", "recommended",
    and "may" in this document are to be interpreted as described in
    RFC 2119 [RFC2119].

    LDH characters are the letters A-Z and a-z, the digits 0-9, and
    hyphen-minus.

    A quartet is a sequence of four bits (also known as a nibble or
    nybble).

    A quintet is a sequence of five bits.

    Hexadecimal values are shown preceded by "0x". For example, 0x60 is
    decimal 96.

    As in the Unicode Standard [UNICODE], Unicode code points are
    denoted by "U+" followed by four to six hexadecimal digits, while a
    range of code points is denoted by two hexadecimal numbers separated
    by "..", with no prefixes.

    XOR means bitwise exclusive or.
    Given two nonnegative integer values A and B, A XOR B is the
    nonnegative integer value whose binary representation is 1 in
    whichever places the binary representations of A and B disagree, and
    0 wherever they agree. For the purpose of applying this rule,
    recall that an integer's representation begins with an infinite
    number of unwritten zeros. In some programming languages, care may
    need to be taken that A and B are stored in variables of the same
    type and size.

3. Overview

    DUDE encodes a sequence of nonnegative integral values as a sequence
    of LDH characters, although implementations will of course need to
    represent the output characters somehow, typically as ASCII octets.
    When DUDE is used to encode Unicode characters, the input values are
    Unicode code points (integral values in the range 0..10FFFF, but not
    D800..DFFF, which are reserved for use by UTF-16).

    Each value in the input sequence is represented by one or more LDH
    characters in the encoded string. The value 0x2D is represented by
    hyphen-minus (U+002D). Each non-hyphen-minus character in the
    encoded string represents a quintet. A sequence of quintets
    represents the bitwise XOR between each non-0x2D integer and the
    previous one.

4. Base-32 characters

    "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to
    avoid transcription errors.

    A decoder must accept both the uppercase and lowercase forms of the
    base-32 characters (including mixtures of both forms). An encoder
    should output only lowercase forms or only uppercase forms (unless
    it uses the feature described in appendix C "Mixed-case
    annotation").

5. Encoding procedure

    All ordering of bits, quartets, and quintets is big-endian (most
    significant first).

    let prev = 0x60
    for each input integer n (in order) do begin
      if n == 0x2D then output hyphen-minus
      else begin
        let diff = prev XOR n
        represent diff in base 16 as a sequence of quartets,
          as few as are sufficient (but at least one)
        prepend 0 to the last quartet and 1 to each of the others
        output a base-32 character corresponding to each quintet
        let prev = n
      end
    end

    If an encoder encounters an input value larger than expected (for
    example, the largest Unicode code point is U+10FFFF, and nameprep
    [NAMEPREP03] can never output a code point larger than U+EFFFD), the
    encoder may either encode the value correctly, or may fail, but it
    must not produce incorrect output. The encoder must fail if it
    encounters a negative input value.

6. Decoding procedure

    let prev = 0x60
    while the input string is not exhausted do begin
      if the next character is hyphen-minus
      then consume it and output 0x2D
      else begin
        consume characters and convert them to quintets until
          encountering a quintet whose first bit is 0
        fail upon encountering a non-base-32 character or end-of-input
        strip the first bit of each quintet
        concatenate the resulting quartets to form diff
        let prev = prev XOR diff
        output prev
      end
    end
    encode the output sequence and compare it to the input string
    fail if they do not match (case-insensitively)

    The comparison at the end is necessary to guarantee the uniqueness
    property (there cannot be two distinct encoded strings representing
    the same sequence of integers). This check also frees the decoder
    from having to check for overflow while decoding the base-32
    characters. (If the decoder is one step of a larger decoding
    process, it may be possible to defer the re-encoding and comparison
    to the end of that larger decoding process.)

7. Example strings

    The first several examples are nonsense strings of mostly unassigned
    code points intended to exercise the corner cases of the algorithm.

    (A) u+0061
        DUDE: b

    (B) u+2C7EF u+2C7EF
        DUDE: u6z2ra

    (C) u+1752B u+1752A
        DUDE: tzxwmb

    (D) u+63AB1 u+63ABA
        DUDE: yv47bm

    (E) u+261AF u+261BF
        DUDE: uyt6rta

    (F) u+C3A31 u+C3A8C
        DUDE: 6v4xb5p

    (G) u+09F44 u+0954C
        DUDE: 39ue4si

    (H) u+8D1A3 u+8C8A3
        DUDE: 27t6dt3sa

    (I) u+6C2B6 u+CC266
        DUDE: y6u7g4ss7a

    (J) u+002D u+002D u+002D u+E848F
        DUDE: ---82w8r

    (K) u+BD08E u+002D u+002D u+002D
        DUDE: 57s8q---

    (L) u+A9A24 u+002D u+002D u+002D u+C05B7
        DUDE: 434we---y393d

    (M) u+7FFFFFFF
        DUDE: z999993r or explicit failure

    The next several examples are realistic Unicode strings that could
    be used in domain names. They exhibit single-row text, two-row
    text, ideographic text, and mixtures thereof. These examples are
    names of Japanese television programs, music artists, and songs,
    merely because one of the authors happened to have them handy.
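
    Before those examples, the procedures of sections 5 and 6 can be
    illustrated by a minimal Python sketch. It is not part of the
    specification. The names BASE32, dude_encode, and dude_decode are
    hypothetical, chosen only for this illustration. The sketch assumes
    that unexpectedly large input values are encoded rather than
    rejected (section 5 permits either), and that the encoder emits
    lowercase only.

```python
# Informal sketch of DUDE sections 5 and 6; names are illustrative only.

# Section 4 alphabet: index = quintet value ("l", "o", "0", "1" unused).
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def dude_encode(values):
    """Encode a sequence of nonnegative integers as a lowercase LDH string."""
    out = []
    prev = 0x60
    for n in values:
        if n < 0:
            raise ValueError("negative input value")
        if n == 0x2D:
            out.append("-")          # hyphen-minus passes through, prev unchanged
            continue
        diff = prev ^ n
        # Split diff into quartets, as few as suffice but at least one.
        quartets = []
        while True:
            quartets.append(diff & 0xF)
            diff >>= 4
            if diff == 0:
                break
        quartets.reverse()           # most significant quartet first
        # Prepend bit 1 to every quartet except the last, which gets bit 0.
        for q in quartets[:-1]:
            out.append(BASE32[0x10 | q])
        out.append(BASE32[quartets[-1]])
        prev = n
    return "".join(out)


def dude_decode(text):
    """Decode an LDH string to a list of integers; raise ValueError on failure."""
    out = []
    prev = 0x60
    i = 0
    while i < len(text):
        if text[i] == "-":
            out.append(0x2D)
            i += 1
            continue
        diff = 0
        while True:
            if i >= len(text):
                raise ValueError("end of input inside a quintet sequence")
            quintet = BASE32.find(text[i].lower())
            if quintet < 0:
                raise ValueError("non-base-32 character")
            i += 1
            diff = (diff << 4) | (quintet & 0xF)   # strip the first bit
            if quintet < 0x10:                     # first bit 0: last quintet
                break
        prev ^= diff
        out.append(prev)
    # Re-encode and compare case-insensitively to enforce uniqueness.
    if dude_encode(out) != text.lower():
        raise ValueError("input is not the canonical encoding")
    return out
```

    Under these assumptions, dude_encode([0x2C7EF, 0x2C7EF]) returns
    "u6z2ra", matching example (B), and dude_decode("sb") fails because
    re-encoding its result gives "b" rather than "sb".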

    (N) 3<nen>b<gumi><kinpachi><sensei>  (Latin, kanji)
        u+0033 u+5E74 u+0062 u+7D44 u+91D1 u+516B u+5148 u+751F
        DUDE: xdx8whx8tgz7ug863f6s5kuduwxh

    (O) <amuro><namie>-with-super-monkeys  (Latin, kanji, hyphens)
        u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
        u+0068 u+002D u+0073 u+0075 u+0070 u+0065 u+0072 u+002D u+006D
        u+006F u+006E u+006B u+0065 u+0079 u+0073
        DUDE: x58jupu8nuy6gt99m-yssctqtptn-tmgftfth-trcbfqtnk

    (P) maji<de>koi<suru>5<byou><mae>  (Latin, hiragana, kanji)
        u+006D u+0061 u+006A u+0069 u+3067 u+006B u+006F u+0069 u+3059
        u+308B u+0035 u+79D2 u+524D
        DUDE: pnmdvssqvssnegvsva7cvs5qz38hu53r

    (Q) <pafii>de<runba>  (Latin, katakana)
        u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
        DUDE: vs5bezgxrvs3ibvs2qtiud

    (R) <sono><supiido><de>  (hiragana, katakana)