INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-amc-ace-m-00.txt                              2001-Feb-12
Expires 2001-Aug-14

                        AMC-ACE-M version 0.1.0

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups. Note that
    other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other
    documents at any time. It is inappropriate to use Internet-Drafts
    as reference material or to cite them other than as "work in
    progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited. Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working group
    at idn@ops.ietf.org. A non-paginated (and possibly newer) version
    of this specification may be available at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-m

Abstract

    AMC-ACE-M is a reversible map from a sequence of Unicode [UNICODE]
    characters to a sequence of letters (A-Z, a-z), digits (0-9), and
    hyphen-minus (-), henceforth called LDH characters. Such a map
    (called an "ASCII-Compatible Encoding", or ACE) might be useful for
    internationalized domain names [IDN], because host name labels are
    currently restricted to LDH characters by [RFC952] and [RFC1123].

    AMC-ACE-M is a cross between BRACE [BRACE00] (which is efficient
    but complex) and DUDE [DUDE00] (which is simple and provides case
    preservation). AMC-ACE-M is much simpler than BRACE but similarly
    efficient, and provides case preservation like DUDE.

    Besides domain names, there might also be other contexts where it
    is useful to transform Unicode characters into "safe"
    (delimiter-free) ASCII characters.
    (If other contexts consider hyphen-minus to be unsafe, a different
    character could be used to play its role, like underscore.)

Contents

    Features
    Name
    Overview
    Base-32 characters
    Encoding procedure
    Decoding procedure
    Signature
    Case sensitivity models
    Comparison with RACE, BRACE, LACE, and DUDE
    Example strings
    Security considerations
    References
    Author
    Example implementation

Features

    Uniqueness: Every Unicode string maps to at most one LDH string.

    Completeness: Every Unicode string maps to an LDH string.
    Restrictions on which Unicode strings are allowed, and on length,
    may be imposed by higher layers.

    Efficient encoding: The ratio of encoded size to original size is
    small for all Unicode strings. This is important in the context
    of domain names because [RFC1034] restricts the length of a domain
    label to 63 characters.

    Simplicity: The encoding and decoding algorithms are reasonably
    simple to implement. The goals of efficiency and simplicity are
    at odds; AMC-ACE-M aims at a good balance between them.

    Case-preservation: If the Unicode string has been case-folded
    prior to encoding, it is possible to record the case information
    in the case of the letters in the encoding. This allows a
    mixed-case Unicode string to be recovered if desired, while a
    case-insensitive comparison of two encoded strings remains
    equivalent to a case-insensitive comparison of the Unicode
    strings. This feature is optional; see section "Case sensitivity
    models".

    Readability: The letters A-Z and a-z and the digits 0-9 appearing
    in the Unicode string are represented as themselves in the label.
    This comes for free because it is usually the most efficient
    encoding anyway.

Name

    AMC-ACE-M is a working name that should be changed if it is
    adopted. (The M merely indicates that it is the thirteenth ACE
    devised by this author. BRACE was the third. D through L did not
    deliver enough efficiency to justify their complexity.)
    Rather than waste good names on experimental proposals, let's wait
    until one proposal is chosen, then assign it a good name.
    Suggestions (assuming the primary use is in domain names):

        UniHost
        UTF-A ("A" for "ASCII" or "alphanumeric", but unfortunately
              UTF-A sounds like UTF-8)
        UTF-H ("H" for "host names", but unfortunately UTF-H sounds
              like UTF-8)
        UTF-D ("D" for "domain names")
        NUDE (Normal Unicode Domain Encoding)

Overview

    AMC-ACE-M maps characters to characters--it does not consume or
    produce code points, code units, or bytes, although the algorithm
    makes use of code points, and implementations will of course need
    to represent the input and output characters somehow, usually as
    bytes or other code units.

    Each character in the Unicode string is represented by an integral
    number of characters in the encoded string. There is no
    intermediate bit string or octet string.

    The encoded string alternates between two modes: literal mode and
    base-32 mode. LDH characters in the Unicode string are encoded
    literally, except that hyphen-minus is doubled. Non-LDH
    characters in the Unicode string are encoded using base-32, in
    which each character of the encoded string represents five bits
    (a "quintet"). A non-paired hyphen-minus in the encoded string
    indicates a mode change.

    In base-32 mode a group of one to five quintets is used to
    represent a number, which is added to an offset to yield a Unicode
    code point, which in turn represents a Unicode character.
    (Surrogates, which are code units used by UTF-16 in pairs to refer
    to code points, are not used and not allowed in AMC-ACE-M.)
    Similarities between the code points are exploited to make the
    encoding more compact.

Base-32 characters

    "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
    "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
    "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
    "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
    "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
    "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
    "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
    "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
    "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
    "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
    "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
    "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
    "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
    "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
    "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
    "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used,
    to avoid transcription errors.

    All decoders must recognize both the uppercase and lowercase forms
    of the base-32 characters. The case may or may not convey
    information, as described in section "Case sensitivity models".

Encoding procedure

    The encoder first examines the Unicode string and chooses some
    parameters. It writes these parameters into the output string,
    then proceeds to encode each Unicode character, one at a time.
    The exact sequence of steps is given below.

    All ordering of bits and quintets is big-endian (most significant
    first). The >> and << operators used below mean bit shift, as in
    C. For >> there is no question of logical versus arithmetic shift
    because AMC-ACE-M makes no use of negative numbers.

    0) Determine the Unicode code point for each non-LDH character in
       the Unicode string. Since LDH characters are encoded
       literally, their code points are not needed. Depending on how
       the Unicode string is presented to the encoder, this step may
       be a no-op.

    1) Verify that there are no invalid code points in the input;
       that is, none exceed 0x10FFFF (the highest code point in the
       Unicode code space) and none are in the range D800..DFFF
       (surrogates).

    2) Determine the most populous row: Row n is defined as the 256
       code points starting with n << 8, except that this definition
       would make rows D8..DF useless, because they would contain
       only surrogates. Therefore AMC-ACE-M defines rows D8..DF to be
       the following non-aligned blocks of 256 code points:

           row D8 = 0020..011F
           row D9 = 005B..015A
           row DA = 007B..017A
           row DB = 00A0..019F
           row DC = 00C0..01BF
           row DD = 00DF..01DE
           row DE = 0134..0233
           row DF = 0270..036F

       (Rationale: Whereas almost every small script is confined to a
       single row, the Latin script is split across a few rows, and
       the row boundaries are not especially convenient for many
       languages.)

       Determine the row containing the most non-LDH input code
       points, breaking ties in favor of smaller-numbered rows. (If a
       code point appears multiple times in the input, it counts
       multiple times. This applies to steps 3 and 4 also.) Call it
       row B. Let offsetB be the first code point of row B.

    3) Determine the most populous 16-window: For each n in 0..31 let
       offset = ((offsetB >> 3) + n) << 3 and count the number of
       code points in the range offset through offset + 0xF. Let A be
       the value of n that maximizes this count, breaking ties in
       favor of smaller values of n, and let offsetA be the
       corresponding offset.

    4) Determine the most populous 20k-window: If the input is empty,
       then let C = 0. Otherwise, for each input code point, let
       n = code_point >> 11, and count the number of non-LDH input
       code points that are not in row B and are in the range
       (n << 11) through (n << 11) + 0x4FFF. Determine the value of n
       that maximizes the count, breaking ties in favor of smaller
       values of n, and let C be that value.

    5) Choose a style: One of the base-32 codes used in step 7.3 has
       two variants, and so base-32 mode is subdivided into two
       styles, narrow and wide, depending on which variant is used.
       Compute the total number of base-32 characters that would be
       produced if narrow style were used, and the number if wide
       style were used. The easiest way to do this is to mimic the
       logic of steps 6 and 7.3. Use whichever style would produce
       fewer base-32 characters. In case of a tie, use narrow style.

    6) Encode the parameters. If narrow style is used, then let
       offsetC = (offsetB >> 12) << 12, and encode B and A as three
       or four base-32 characters:

           00bbb bbbbb aaaaa          if B <= 0xFF
           01bbb bbbbb bbbbb aaaaa    otherwise

       If wide style is used, then let offsetC = C << 11, and encode
       B and C as three or five base-32 characters:

           10bbb bbbbb ccccc                if B <= 0xFF and C <= 0x1F
           11bbb bbbbb bbbbb ccccc ccccc    otherwise

    7) Encode each input character in turn, using the first of the
       following cases that applies. The mode is initially base-32.

       7.1) The character is a hyphen-minus (U+002D). Encode it as
            two hyphen-minuses.

       7.2) The character is an LDH character. If in base-32 mode
            then output a hyphen-minus and switch to literal mode.
            Copy the character to the output.

       7.3) The character is a non-LDH character. If in literal mode
            then output a hyphen-minus and switch to base-32 mode.
            Encode the character's code point using the first of the
            following cases that applies. Square brackets enclose
            quintets that can be used to record the upper/lowercase
            attribute of the Unicode character (because the
            corresponding base-32 characters are guaranteed to be
            letters rather than digits) (see section "Case
            sensitivity models").

            7.3.1) Narrow style was chosen and the code point is in
                   the range offsetA through offsetA + 0xF. Subtract
                   offsetA and encode the difference as a single
                   base-32 character:

                       [0xxxx]

            7.3.2) The code point is in the range offsetB through
                   offsetB + 0xFF.
                   Subtract offsetB and encode the difference as two
                   base-32 characters:

                       1xxxx [0xxxx]

            7.3.3) The code point is in the range offsetC through
                   offsetC + 0xFFF. Subtract offsetC and encode the
                   difference as three base-32 characters:

                       1xxxx 1xxxx [0xxxx]

            7.3.4) Wide style was chosen and the code point is in the
                   range offsetC + 0x1000 through offsetC + 0x4FFF.
                   Subtract offsetC + 0x1000 and encode the
                   difference as three base-32 characters:

                       [0xxxx] xxxxx xxxxx

            7.3.5) The code point is in the range 0 through 0xFFFF.
                   Encode it as four base-32 characters:

                       1xxxx 1xxxx 1xxxx [0xxxx]

            7.3.6) If we've come this far, the code point must be in
                   the range 0x10000 through 0x10FFFF. Subtract
                   0x10000 and encode the difference as five base-32
                   characters:

                       1xxxx 1xxxx 1xxxx 1xxxx [0xxxx]

Decoding procedure

    The details of the decoding procedure are implied by the encoding
    procedure. The overall sequence of steps is as follows.

    1) Undo the encoder's step 6: From the first few base-32
       characters, determine whether narrow or wide style is used,
       and determine the offsets.

    2) Set the mode to base-32. For each remaining input character,
       use the first of the following cases that applies:

       2.1) The character is a hyphen-minus, and the following
            character is also a hyphen-minus. Consume them both and
            output a hyphen-minus.

       2.2) The character is a hyphen-minus. Consume it and toggle
            the mode flag.

       2.3) The current mode is literal. Consume the input character
            and output it.

       2.4) Interpret the input character and up to four of its
            successors as base-32. Consume characters until one is
            found whose value has the form 0xxxx. That is the one
            that carries the upper/lowercase information. Remember
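
    The per-character cases of encoder step 7.3, and the group
    decoding they imply for decoder step 2.4, can be sketched in
    Python as follows. This is an illustrative sketch only, not the
    draft's example implementation. The names BASE32, nibble_group,
    encode_code_point, and decode_group are hypothetical; the offsets
    are assumed to have been chosen already by steps 2 through 6;
    case annotation is ignored (output is always lowercase); and
    decode_group is only this sketch's reading of the inverse of
    step 7.3.

```python
# Value -> character, per the "Base-32 characters" table.
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"


def nibble_group(value, count):
    """Return `count` quintets of the form 1xxxx ... 1xxxx 0xxxx (big-endian)."""
    quintets = [0x10 | ((value >> (4 * i)) & 0xF)
                for i in range(count - 1, 0, -1)]
    quintets.append(value & 0xF)
    return quintets


def encode_code_point(cp, wide, offset_a, offset_b, offset_c):
    """Encode one non-LDH code point with the first matching case of 7.3."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("invalid code point")              # step 1
    if not wide and offset_a <= cp <= offset_a + 0xF:       # 7.3.1
        q = nibble_group(cp - offset_a, 1)
    elif offset_b <= cp <= offset_b + 0xFF:                 # 7.3.2
        q = nibble_group(cp - offset_b, 2)
    elif offset_c <= cp <= offset_c + 0xFFF:                # 7.3.3
        q = nibble_group(cp - offset_c, 3)
    elif wide and offset_c + 0x1000 <= cp <= offset_c + 0x4FFF:  # 7.3.4
        d = cp - (offset_c + 0x1000)
        q = [d >> 10, (d >> 5) & 0x1F, d & 0x1F]
    elif cp <= 0xFFFF:                                      # 7.3.5
        q = nibble_group(cp, 4)
    else:                                                   # 7.3.6
        q = nibble_group(cp - 0x10000, 5)
    return "".join(BASE32[x] for x in q)


def decode_group(text, wide, offset_a, offset_b, offset_c):
    """Decode one base-32 group at the start of `text`.

    Returns (code_point, characters_consumed). Assumption: this inverse
    is inferred from the encoder cases, not quoted from the draft.
    """
    first = BASE32.index(text[0].lower())
    if wide and first < 0x10:                # inverse of 7.3.4
        if len(text) < 3:
            raise ValueError("truncated base-32 group")
        d = first
        for ch in text[1:3]:
            d = (d << 5) | BASE32.index(ch.lower())
        return offset_c + 0x1000 + d, 3
    value = 0
    for n, ch in enumerate(text[:5], start=1):
        quintet = BASE32.index(ch.lower())
        value = (value << 4) | (quintet & 0xF)
        if quintet < 0x10:                   # a 0xxxx quintet ends the group
            bases = {1: offset_a, 2: offset_b, 3: offset_c, 4: 0, 5: 0x10000}
            return bases[n] + value, n
    raise ValueError("unterminated base-32 group")


# Narrow style with row B = 04, offsetA = 0x430, offsetC = 0:
print(encode_code_point(0x434, False, 0x430, 0x400, 0))    # e
print(encode_code_point(0x410, False, 0x430, 0x400, 0))    # ta
print(encode_code_point(0x4E2D, False, 0x430, 0x400, 0))   # w8up
# Wide style with C = 9, so offsetC = 9 << 11 = 0x4800:
print(encode_code_point(0x6C34, True, 0, 0x400, 0x4800))   # fbw
```

    The sketch deliberately stops at single groups. Parameter
    selection (steps 2 through 5), the parameter prefix (step 6),
    mode switching, and case annotation are left out.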