📄 rfc3492.txt
字号:
predict the minimum number of digits needed to represent the next delta: while delta > ((base - tmin) * tmax) div 2 do let delta = delta div (base - tmin) 4. The bias is set: let bias = (base * the number of divisions performed in step 3) + (((base - tmin + 1) * delta) div (delta + skew)) The motivation for this procedure is that the current delta provides a hint about the likely size of the next delta, and so t(j) is set to tmax for the more significant digits starting with the one expected to be last, tmin for the less significant digits up through the one expected to be third-last, and somewhere between tmin and tmax for the digit expected to be second-lastCostello Standards Track [Page 7]RFC 3492 IDNA Punycode March 2003 (balancing the hope of the expected-last digit being unnecessary against the danger of it being insufficient).4. Bootstring parameters Given a set of basic code points, one needs to be designated as the delimiter. The base cannot be greater than the number of distinguishable basic code points remaining. The digit-values in the range 0 through base-1 need to be associated with distinct non- delimiter basic code points. In some cases multiple code points need to have the same digit-value; for example, uppercase and lowercase versions of the same letter need to be equivalent if basic strings are case-insensitive. The initial value of n cannot be greater than the minimum non-basic code point that could appear in extended strings. The remaining five parameters (tmin, tmax, skew, damp, and the initial value of bias) need to satisfy the following constraints: 0 <= tmin <= tmax <= base-1 skew >= 1 damp >= 2 initial_bias mod base <= base - tmin Provided the constraints are satisfied, these five parameters affect efficiency but not correctness. They are best chosen empirically. If support for mixed-case annotation is desired (see appendix A), make sure that the code points corresponding to 0 through tmax-1 all have both uppercase and lowercase forms.5. Parameter values for Punycode Punycode uses the following Bootstring parameter values: base = 36 tmin = 1 tmax = 26 skew = 38 damp = 700 initial_bias = 72 initial_n = 128 = 0x80 Although the only restriction Punycode imposes on the input integers is that they be nonnegative, these parameters are especially designed to work well with Unicode [UNICODE] code points, which are integers in the range 0..10FFFF (but not D800..DFFF, which are reserved forCostello Standards Track [Page 8]RFC 3492 IDNA Punycode March 2003 use by the UTF-16 encoding of Unicode). The basic code points are the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the delimiter, and some of the others have digit-values as follows: code points digit-values ------------ ---------------------- 41..5A (A-Z) = 0 to 25, respectively 61..7A (a-z) = 0 to 25, respectively 30..39 (0-9) = 26 to 35, respectively Using hyphen-minus as the delimiter implies that the encoded string can end with a hyphen-minus only if the Unicode string consists entirely of basic code points, but IDNA forbids such strings from being encoded. The encoded string can begin with a hyphen-minus, but IDNA prepends a prefix. Therefore IDNA using Punycode conforms to the RFC 952 rule that host name labels neither begin nor end with a hyphen-minus [RFC952]. A decoder MUST recognize the letters in both uppercase and lowercase forms (including mixtures of both forms). An encoder SHOULD output only uppercase forms or only lowercase forms, unless it uses mixed- case annotation (see appendix A). Presumably most users will not manually write or type encoded strings (as opposed to cutting and pasting them), but those who do will need to be alert to the potential visual ambiguity between the following sets of characters: G 6 I l 1 O 0 S 5 U V Z 2 Such ambiguities are usually resolved by context, but in a Punycode encoded string there is no context apparent to humans.6. Bootstring algorithms Some parts of the pseudocode can be omitted if the parameters satisfy certain conditions (for which Punycode qualifies). These parts are enclosed in {braces}, and notes immediately following the pseudocode explain the conditions under which they can be omitted.Costello Standards Track [Page 9]RFC 3492 IDNA Punycode March 2003 Formally, code points are integers, and hence the pseudocode assumes that arithmetic operations can be performed directly on code points. In some programming languages, explicit conversion between code points and integers might be necessary.6.1 Bias adaptation function function adapt(delta,numpoints,firsttime): if firsttime then let delta = delta div damp else let delta = delta div 2 let delta = delta + (delta div numpoints) let k = 0 while delta > ((base - tmin) * tmax) div 2 do begin let delta = delta div (base - tmin) let k = k + base end return k + (((base - tmin + 1) * delta) div (delta + skew)) It does not matter whether the modifications to delta and k inside adapt() affect variables of the same name inside the encoding/decoding procedures, because after calling adapt() the caller does not read those variables before overwriting them.Costello Standards Track [Page 10]RFC 3492 IDNA Punycode March 20036.2 Decoding procedure let n = initial_n let i = 0 let bias = initial_bias let output = an empty string indexed from 0 consume all code points before the last delimiter (if there is one) and copy them to output, fail on any non-basic code point if more than zero code points were consumed then consume one more (which will be the last delimiter) while the input is not exhausted do begin let oldi = i let w = 1 for k = base to infinity in steps of base do begin consume a code point, or fail if there was none to consume let digit = the code point's digit-value, fail if it has none let i = i + digit * w, fail on overflow let t = tmin if k <= bias {+ tmin}, or tmax if k >= bias + tmax, or k - bias otherwise if digit < t then break let w = w * (base - t), fail on overflow end let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?) let n = n + i div (length(output) + 1), fail on overflow let i = i mod (length(output) + 1) {if n is a basic code point then fail} insert n into output at position i increment i end The full statement enclosed in braces (checking whether n is a basic code point) can be omitted if initial_n exceeds all basic code points (which is true for Punycode), because n is never less than initial_n. In the assignment of t, where t is clamped to the range tmin through tmax, "+ tmin" can always be omitted. This makes the clamping calculation incorrect when bias < k < bias + tmin, but that cannot happen because of the way bias is computed and because of the constraints on the parameters. Because the decoder state can only advance monotonically, and there is only one representation of any delta, there is therefore only one encoded string that can represent a given sequence of integers. The only error conditions are invalid code points, unexpected end-of- input, overflow, and basic code points encoded using deltas instead of appearing literally. If the decoder fails on these errors as shown above, then it cannot produce the same output for two distinct inputs. Without this property it would have been necessary to re-Costello Standards Track [Page 11]RFC 3492 IDNA Punycode March 2003 encode the output and verify that it matches the input in order to guarantee the uniqueness of the encoding.6.3 Encoding procedure let n = initial_n let delta = 0 let bias = initial_bias let h = b = the number of basic code points in the input copy them to the output in order, followed by a delimiter if b > 0 {if the input contains a non-basic code point < n then fail} while h < length(input) do begin let m = the minimum {non-basic} code point >= n in the input let delta = delta + (m - n) * (h + 1), fail on overflow let n = m for each code point c in the input (in order) do begin if c < n {or c is basic} then increment delta, fail on overflow if c == n then begin let q = delta for k = base to infinity in steps of base do begin let t = tmin if k <= bias {+ tmin}, or tmax if k >= bias + tmax, or k - bias otherwise if q < t then break output the code point for digit t + ((q - t) mod (base - t)) let q = (q - t) div (base - t) end output the code point for digit q let bias = adapt(delta, h + 1, test h equals b?) let delta = 0 increment h end end increment delta and n end The full statement enclosed in braces (checking whether the input contains a non-basic code point less than n) can be omitted if all code points less than initial_n are basic code points (which is true for Punycode if code points are unsigned). The brace-enclosed conditions "non-basic" and "or c is basic" can be omitted if initial_n exceeds all basic code points (which is true for Punycode), because the code point being tested is never less than initial_n. In the assignment of t, where t is clamped to the range tmin through tmax, "+ tmin" can always be omitted. This makes the clamping calculation incorrect when bias < k < bias + tmin, but that cannotCostello Standards Track [Page 12]RFC 3492 IDNA Punycode March 2003 happen because of the way bias is computed and because of the constraints on the parameters. The checks for overflow are necessary to avoid producing invalid output when the input contains very large values or is very long. The increment of delta at the bottom of the outer loop cannot overflow because delta < length(input) before the increment, and length(input) is already assumed to be representable. The increment of n could overflow, but only if h == length(input), in which case the procedure is finished anyway.6.4 Overflow handling For IDNA, 26-bit unsigned integers are sufficient to handle all valid IDNA labels without overflow, because any string that needed a 27-bit delta would have to exceed either the code point limit (0..10FFFF) or the label length limit (63 characters). However, overflow handling is necessary because the inputs are not necessarily valid IDNA labels. If the programming language does not provide overflow detection, the following technique can be used. Suppose A, B, and C are representable nonnegative integers and C is nonzero. Then A + B overflows if and only if B > maxint - A, and A + (B * C) overflows if and only if B > (maxint - A) div C, where maxint is the greatest integer for which maxint + 1 cannot be represented. Refer to appendix C "Punycode sample implementation" for demonstrations of this technique in the C language. The decoding and encoding algorithms shown in sections 6.2 and 6.3 handle overflow by detecting it whenever it happens. Another approach is to enforce limits on the inputs that prevent overflow from happening. For example, if the encoder were to verify that no input code points exceed M and that the input length does not exceed L, then no delta could ever exceed (M - initial_n) * (L + 1), and hence no overflow could occur if integer variables were capable of representing values that large. This prevention approach would impose more restrictions on the input than the detection approach does, but might be considered simpler in some programming languages. In theory, the decoder could use an analogous approach, limiting the number of digits in a variable-length integer (that is, limiting the number of iterations in the innermost loop). However, the number of digits that suffice to represent a given delta can sometimes represent much larger deltas (because of the adaptation), and hence this approach would probably need integers wider than 32 bits.Costello Standards Track [Page 13]RFC 3492 IDNA Punycode March 2003 Yet another approach for the decoder is to allow overflow to occur, but to check the final output string by re-encoding it and comparing to the decoder input. If and only if they do not match (using a case-insensitive ASCII comparison) overflow has occurred. This delayed-detection approach would not impose any more restrictions on the input than the immediate-detection approach does, and might be considered simpler in some programming languages. In fact, if the decoder is used only inside the IDNA ToUnicode operation [IDNA], then it need not check for overflow at all, because
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -