INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-amc-ace-m-00.txt                              2001-Feb-12
Expires 2001-Aug-14

                        AMC-ACE-M version 0.1.0

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited.  Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working
    group at idn@ops.ietf.org.  A non-paginated (and possibly
    newer) version of this specification may be available at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-m

Abstract

    AMC-ACE-M is a reversible map from a sequence of Unicode [UNICODE]
    characters to a sequence of letters (A-Z, a-z), digits (0-9), and
    hyphen-minus (-), henceforth called LDH characters.  Such a map
    (called an "ASCII-Compatible Encoding", or ACE) might be useful for
    internationalized domain names [IDN], because host name labels are
    currently restricted to LDH characters by [RFC952] and [RFC1123].

    AMC-ACE-M is a cross between BRACE [BRACE00] (which is efficient
    but complex) and DUDE [DUDE00] (which is simple and provides case
    preservation).
    AMC-ACE-M is much simpler than BRACE but similarly efficient, and
    provides case preservation like DUDE.

    Besides domain names, there might also be other contexts where it is
    useful to transform Unicode characters into "safe" (delimiter-free)
    ASCII characters.  (If other contexts consider hyphen-minus to be
    unsafe, a different character could be used to play its role, like
    underscore.)

Contents

    Features
    Name
    Overview
    Base-32 characters
    Encoding procedure
    Decoding procedure
    Signature
    Case sensitivity models
    Comparison with RACE, BRACE, LACE, and DUDE
    Example strings
    Security considerations
    References
    Author
    Example implementation

Features

    Uniqueness:  Every Unicode string maps to at most one LDH string.

    Completeness:  Every Unicode string maps to an LDH string.
    Restrictions on which Unicode strings are allowed, and on length,
    may be imposed by higher layers.

    Efficient encoding:  The ratio of encoded size to original size is
    small for all Unicode strings.  This is important in the context
    of domain names because [RFC1034] restricts the length of a domain
    label to 63 characters.

    Simplicity:  The encoding and decoding algorithms are reasonably
    simple to implement.  The goals of efficiency and simplicity are at
    odds; AMC-ACE-M aims at a good balance between them.

    Case-preservation:  If the Unicode string has been case-folded prior
    to encoding, it is possible to record the case information in the
    case of the letters in the encoding, allowing a mixed-case Unicode
    string to be recovered if desired, but a case-insensitive comparison
    of two encoded strings is equivalent to a case-insensitive
    comparison of the Unicode strings.  This feature is optional; see
    section "Case sensitivity models".

    Readability:  The letters A-Z and a-z and the digits 0-9 appearing
    in the Unicode string are represented as themselves in the label.
    This comes for free because it is usually the most efficient
    encoding anyway.

Name

    AMC-ACE-M is a working name that should be changed if it is adopted.
    (The M merely indicates that it is the thirteenth ACE devised by
    this author.  BRACE was the third.  D through L did not deliver
    enough efficiency to justify their complexity.)  Rather than waste
    good names on experimental proposals, let's wait until one proposal
    is chosen, then assign it a good name.  Suggestions (assuming the
    primary use is in domain names):

        UniHost
        UTF-A ("A" for "ASCII" or "alphanumeric",
               but unfortunately UTF-A sounds like UTF-8)
        UTF-H ("H" for "host names",
               but unfortunately UTF-H sounds like UTF-8)
        UTF-D ("D" for "domain names")
        NUDE (Normal Unicode Domain Encoding)

Overview

    AMC-ACE-M maps characters to characters--it does not consume or
    produce code points, code units, or bytes, although the algorithm
    makes use of code points, and implementations will of course need to
    represent the input and output characters somehow, usually as bytes
    or other code units.

    Each character in the Unicode string is represented by an
    integral number of characters in the encoded string.  There is no
    intermediate bit string or octet string.

    The encoded string alternates between two modes: literal mode and
    base-32 mode.  LDH characters in the Unicode string are encoded
    literally, except that hyphen-minus is doubled.  Non-LDH characters
    in the Unicode string are encoded using base-32, in which each
    character of the encoded string represents five bits (a "quintet").
    A non-paired hyphen-minus in the encoded string indicates a mode
    change.

    In base-32 mode a group of one to five quintets is used to
    represent a number, which is added to an offset to yield a
    Unicode code point, which in turn represents a Unicode character.
    (Surrogates, which are code units used by UTF-16 in pairs to
    refer to code points, are not used and not allowed in AMC-ACE-M.)
    Similarities between the code points are exploited to make the
    encoding more compact.

Base-32 characters

        "a" =  0 = 0x00 = 00000         "s" = 16 = 0x10 = 10000
        "b" =  1 = 0x01 = 00001         "t" = 17 = 0x11 = 10001
        "c" =  2 = 0x02 = 00010         "u" = 18 = 0x12 = 10010
        "d" =  3 = 0x03 = 00011         "v" = 19 = 0x13 = 10011
        "e" =  4 = 0x04 = 00100         "w" = 20 = 0x14 = 10100
        "f" =  5 = 0x05 = 00101         "x" = 21 = 0x15 = 10101
        "g" =  6 = 0x06 = 00110         "y" = 22 = 0x16 = 10110
        "h" =  7 = 0x07 = 00111         "z" = 23 = 0x17 = 10111
        "i" =  8 = 0x08 = 01000         "2" = 24 = 0x18 = 11000
        "j" =  9 = 0x09 = 01001         "3" = 25 = 0x19 = 11001
        "k" = 10 = 0x0A = 01010         "4" = 26 = 0x1A = 11010
        "m" = 11 = 0x0B = 01011         "5" = 27 = 0x1B = 11011
        "n" = 12 = 0x0C = 01100         "6" = 28 = 0x1C = 11100
        "p" = 13 = 0x0D = 01101         "7" = 29 = 0x1D = 11101
        "q" = 14 = 0x0E = 01110         "8" = 30 = 0x1E = 11110
        "r" = 15 = 0x0F = 01111         "9" = 31 = 0x1F = 11111

    The digits "0" and "1" and the letters "o" and "l" are not used, to
    avoid transcription errors.

    All decoders must recognize both the uppercase and lowercase
    forms of the base-32 characters.  The case may or may not convey
    information, as described in section "Case sensitivity models".

Encoding procedure

    The encoder first examines the Unicode string and chooses some
    parameters.  It writes these parameters into the output string, then
    proceeds to encode each Unicode character, one at a time.  The exact
    sequence of steps is given below.  All ordering of bits and quintets
    is big-endian (most significant first).  The >> and << operators
    used below mean bit shift, as in C.
    For >> there is no question of logical versus arithmetic shift
    because AMC-ACE-M makes no use of negative numbers.

     0) Determine the Unicode code point for each non-LDH character in
        the Unicode string.  Since LDH characters are encoded literally,
        their code points are not needed.  Depending on how the Unicode
        string is presented to the encoder, this step may be a no-op.

     1) Verify that there are no invalid code points in the input;
        that is, none exceed 0x10FFFF (the highest code point in the
        Unicode code space) and none are in the range D800..DFFF
        (surrogates).

     2) Determine the most populous row:  Row n is defined as the 256
        code points starting with n << 8, except that this definition
        would make rows D8..DF useless, because they would contain only
        surrogates.  Therefore AMC-ACE-M defines rows D8..DF to be the
        following non-aligned blocks of 256 code points:

            row D8 = 0020..011F
            row D9 = 005B..015A
            row DA = 007B..017A
            row DB = 00A0..019F
            row DC = 00C0..01BF
            row DD = 00DF..01DE
            row DE = 0134..0233
            row DF = 0270..036F

        (Rationale:  Whereas almost every small script is confined to
        a single row, the Latin script is split across a few rows,
        and the row boundaries are not especially convenient for many
        languages.)

        Determine the row containing the most non-LDH input code points,
        breaking ties in favor of smaller-numbered rows.  (If a code
        point appears multiple times in the input, it counts multiple
        times.  This applies to steps 3 and 4 also.)  Call it row B.
        Let offsetB be the first code point of row B.

     3) Determine the most populous 16-window:  For each n in 0..31 let
        offset = ((offsetB >> 3) + n) << 3 and count the number of code
        points in the range offset through offset + 0xF.
        Let A be the value of n that maximizes this count, breaking
        ties in favor of smaller values of n, and let offsetA be the
        corresponding offset.

     4) Determine the most populous 20k-window:  If the input is empty,
        then let C = 0.  Otherwise, for each input code point, let n =
        code_point >> 11, and count the number of non-LDH input code
        points that are not in row B and are in the range (n << 11)
        through (n << 11) + 0x4FFF.  Determine the value of n that
        maximizes the count, breaking ties in favor of smaller values of
        n, and let C be that value.

     5) Choose a style:  One of the base-32 codes used in step 7.3 has
        two variants, and so base-32 mode is subdivided into two styles,
        narrow and wide, depending on which variant is used.  Compute
        the total number of base-32 characters that would be produced
        if narrow style were used, and the number if wide style were
        used.  The easiest way to do this is to mimic the logic of steps
        6 and 7.3.  Use whichever style would produce fewer base-32
        characters.  In case of a tie, use narrow style.

     6) Encode the parameters.  If narrow style is used, then let
        offsetC = (offsetB >> 12) << 12, and encode B and A as three or
        four base-32 characters:

            00bbb bbbbb aaaaa        if B <= 0xFF
            01bbb bbbbb bbbbb aaaaa  otherwise

        If wide style is used, then let offsetC = C << 11, and encode B
        and C as three or five base-32 characters:

            10bbb bbbbb ccccc              if B <= 0xFF and C <= 0x1F
            11bbb bbbbb bbbbb ccccc ccccc  otherwise

     7) Encode each input character in turn, using the first of the
        following cases that applies.  The mode is initially base-32.

         7.1) The character is a hyphen-minus (U+002D).  Encode it as
              two hyphen-minuses.

         7.2) The character is an LDH character.
              If in base-32 mode then output a hyphen-minus and switch
              to literal mode.  Copy the character to the output.

         7.3) The character is a non-LDH character.  If in literal
              mode then output a hyphen-minus and switch to base-32
              mode.  Encode the character's code point using the
              first of the following cases that applies.  Square
              brackets enclose quintets that can be used to record
              the upper/lowercase attribute of the Unicode character
              (because the corresponding base-32 characters are
              guaranteed to be letters rather than digits) (see section
              "Case sensitivity models").

               7.3.1) Narrow style was chosen and the code point is in
                      the range offsetA through offsetA + 0xF.  Subtract
                      offsetA and encode the difference as a single
                      base-32 character:

                          [0xxxx]

               7.3.2) The code point is in the range offsetB through
                      offsetB + 0xFF.  Subtract offsetB and encode the
                      difference as two base-32 characters:

                          1xxxx [0xxxx]

               7.3.3) The code point is in the range offsetC through
                      offsetC + 0xFFF.  Subtract offsetC and encode the
                      difference as three base-32 characters:

                          1xxxx 1xxxx [0xxxx]

               7.3.4) Wide style was chosen and the code point is in
                      the range offsetC + 0x1000 through offsetC +
                      0x4FFF.  Subtract offsetC + 0x1000 and encode the
                      difference as three base-32 characters:

                          [0xxxx] xxxxx xxxxx

               7.3.5) The code point is in the range 0 through 0xFFFF.
                      Encode it as four base-32 characters:

                          1xxxx 1xxxx 1xxxx [0xxxx]

               7.3.6) If we've come this far, the code point must be
                      in the range 0x10000 through 0x10FFFF.  Subtract
                      0x10000 and encode the difference as five base-32
                      characters:

                          1xxxx 1xxxx 1xxxx 1xxxx [0xxxx]

Decoding procedure

    The details of the decoding procedure are implied by the encoding
    procedure.  The overall sequence of steps is as follows.

     1) Undo the encoder's step 6:  From the first few base-32
        characters, determine whether narrow or wide style is used, and
        determine the offsets.

     2) Set the mode to base-32.  For each remaining input character, use
        the first of the following cases that applies:

         2.1) The character is a hyphen-minus, and the following
              character is also a hyphen-minus.  Consume them both and
              output a hyphen-minus.

         2.2) The character is a hyphen-minus.  Consume it and toggle
              the mode flag.

         2.3) The current mode is literal.  Consume the input character
              and output it.

         2.4) Interpret the input character and up to four of its
              successors as base-32.  Consume characters until one is
              found whose value has the form 0xxxx.  That is the one
              that carries the upper/lowercase information.  Remember
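    As an illustration of encoding step 7.3, the following Python
    sketch turns one non-LDH code point into base-32 characters once
    offsetA, offsetB, offsetC, and the style are known.  The function
    and variable names are hypothetical and are not taken from the
    draft's example implementation.  The sketch always emits lowercase,
    so it ignores the optional case-preservation feature.

```python
# Hypothetical helper names; only the alphabet and the quintet layouts
# come from the draft.  Output is always lowercase (no case recording).
BASE32 = "abcdefghijkmnpqrstuvwxyz23456789"  # values 0..31, in table order


def nibble_quintets(value, count):
    """Split value into `count` 4-bit groups, most significant first.

    Every quintet except the last gets its top bit set (1xxxx); the
    last one is left as 0xxxx, matching cases 7.3.1-7.3.3, 7.3.5, 7.3.6.
    """
    quintets = []
    for i in range(count - 1, -1, -1):
        nibble = (value >> (4 * i)) & 0xF
        quintets.append((nibble | 0x10) if i > 0 else nibble)
    return quintets


def encode_code_point(cp, offset_a, offset_b, offset_c, wide):
    """Apply the first matching case of step 7.3 to one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("invalid code point (see step 1)")
    if not wide and offset_a <= cp <= offset_a + 0xF:          # 7.3.1
        quintets = nibble_quintets(cp - offset_a, 1)
    elif offset_b <= cp <= offset_b + 0xFF:                    # 7.3.2
        quintets = nibble_quintets(cp - offset_b, 2)
    elif offset_c <= cp <= offset_c + 0xFFF:                   # 7.3.3
        quintets = nibble_quintets(cp - offset_c, 3)
    elif wide and offset_c + 0x1000 <= cp <= offset_c + 0x4FFF:
        # 7.3.4: a 14-bit difference laid out as [0xxxx] xxxxx xxxxx
        diff = cp - (offset_c + 0x1000)
        quintets = [diff >> 10, (diff >> 5) & 0x1F, diff & 0x1F]
    elif cp <= 0xFFFF:                                         # 7.3.5
        quintets = nibble_quintets(cp, 4)
    else:                                                      # 7.3.6
        quintets = nibble_quintets(cp - 0x10000, 5)
    return "".join(BASE32[q] for q in quintets)
```

    For example, with offsetB = 0 and narrow style, U+00FC falls under
    case 7.3.2: the difference 0xFC splits into the quintets 11111 and
    01100, which the table maps to "9n".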

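    Encoding steps 2 and 3 (choosing row B and the 16-window A) can be
    sketched in Python as below.  The names are hypothetical.  The
    sketch assumes that step 3 counts only non-LDH input code points,
    as step 2 does; the draft's wording for step 3 just says "code
    points".  It also assumes that an input with no non-LDH characters
    selects row 0 and n = 0, which follows from the tie-breaking rules.

```python
# Rows D8..DF are redefined by the draft as non-aligned blocks of 256
# code points; every other row n starts at n << 8.
SPECIAL_ROW_OFFSETS = {
    0xD8: 0x0020, 0xD9: 0x005B, 0xDA: 0x007B, 0xDB: 0x00A0,
    0xDC: 0x00C0, 0xDD: 0x00DF, 0xDE: 0x0134, 0xDF: 0x0270,
}


def is_ldh(cp):
    """True for A-Z, a-z, 0-9, and hyphen-minus."""
    return (cp == 0x2D or 0x30 <= cp <= 0x39
            or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A)


def row_offset(row):
    """First code point of a row, honoring the D8..DF redefinition."""
    return SPECIAL_ROW_OFFSETS.get(row, row << 8)


def choose_row_and_window(code_points):
    """Return (B, offsetB, A, offsetA) for a list of valid code points."""
    non_ldh = [cp for cp in code_points if not is_ldh(cp)]

    def count_in(first, last):
        # Repeated code points count once per occurrence.
        return sum(1 for cp in non_ldh if first <= cp <= last)

    # Only rows that can hold an input code point need checking.  Row 0
    # is always a candidate so that an all-LDH input selects row 0.
    candidates = {0} | set(SPECIAL_ROW_OFFSETS) | {cp >> 8 for cp in non_ldh}
    row_b = min(
        candidates,
        key=lambda r: (-count_in(row_offset(r), row_offset(r) + 0xFF), r),
    )
    offset_b = row_offset(row_b)

    def window_offset(n):
        return ((offset_b >> 3) + n) << 3

    # max() keeps the first maximal element, so ties go to the smaller n.
    a = max(
        range(32),
        key=lambda n: count_in(window_offset(n), window_offset(n) + 0xF),
    )
    return row_b, offset_b, a, window_offset(a)
```

    Under these assumptions, the input U+00E9 U+0142 selects row D9
    (005B..015A), the smallest-numbered row that contains both code
    points, rather than row 0 or row 1, which each contain only one.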