INTERNET-DRAFT                                          Adam M. Costello
draft-ietf-idn-brace-00.txt                                  2000-Sep-14
Expires 2001-Mar-14

       BRACE: Bi-mode Row-based ASCII-Compatible Encoding for IDN

                              version 0.1.2

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited.  Please send comments
    to the author at amc@cs.berkeley.edu, or to the idn working group
    at idn@ops.ietf.org.  A newer version of this specification may be
    available at http://www.cs.berkeley.edu/~amc/charset/brace

Abstract

    BRACE is a reversible function from Unicode (UTF-16) [UNICODE]
    text strings to host name labels.  Host name labels are defined by
    [RFC952] and [RFC1123] as case-insensitive strings of ASCII letters,
    digits, and hyphens, neither beginning nor ending with a hyphen.
    [RFC1034] restricts the length of labels to 63 characters.

Contents

    Primary goals
    Secondary goals
    Overview
    Encoding procedure
    Base-32 characters
    Encoding styles
    Decoding procedure
    Comparison with RACE
    Example strings
    Security considerations
    References
    Author
    Example implementation

Primary goals

    Efficient encoding:  Small ratio of encoded size to original size,
    for all UTF-16 strings.

    Uniqueness:  Every UTF-16 string maps to at most one label.

    Completeness:  Every UTF-16 string maps to a label, provided it is
    not too long.  Restrictions on which UTF-16 strings are allowed are
    purely a matter of policy.

    Degeneration:  All valid host name labels that do not end with the
    BRACE signature "-8Q9" (or "-8q9") are the BRACE encodings of their
    own UTF-16 representations.

Secondary goals

    Conceptual simplicity:  This has been somewhat compromised for the
    sake of efficient encoding.

    Readability:  ASCII letters and digits in the original string are
    represented literally in the encoded string.  This comes for free
    because it is the most efficient encoding anyway.

Overview

    The encoded string alternates between two modes.  ASCII letters,
    digits, and hyphens in the Unicode string (which will henceforth be
    called LDH characters) are encoded literally, except that hyphens
    are doubled.  Non-LDH codes in the Unicode string are encoded
    using base-32 mode, in which each character of the encoded string
    represents five bits.  Single hyphens in the encoded string indicate
    mode changes.

    The base-32 mode uses exactly one of four styles.  Half-row style is
    used for Unicode strings in which all the non-LDH codes belong to
    a single half-row (have the same upper 9 bits).  Full-row style is
    used for Unicode strings in which all the non-LDH codes belong to a
    single row (have the same upper 8 bits) but not all the same half.
    Mixed style is used when many of the non-LDH characters (but
    not all of them) belong to the same row.  No-row style is used for
    all other strings.

Encoding procedure

    If the UTF-16 string contains more than 63 16-bit codes, it's too
    long, so abort.

    If the upper bytes are all zero, and the string formed by the lower
    bytes is a valid host name label and does not end with "-8Q9" or
    "-8q9", output the low bytes and stop.

    The encoder needs a bit queue capable of holding up to 22 bits, a
    buffer of LDH characters capable of holding up to 124 characters,
    and a 4-value encoding style indicator.  The LDH buffer is initially
    empty.  The initial contents of the bit queue, and the value of the
    style indicator, depend on which encoding style is chosen (which
    is explained below).  Bit strings are enqueued and dequeued in
    big-endian order (most significant bit first).

    After choosing the style and initializing the bit queue, perform the
    following actions:

        while the bit queue contains at least 5 bits
            dequeue 5 bits
            output the corresponding base-32 character
        endwhile

        for each 16-bit code of the UTF-16 string (in order) do
            if the code is 0x002D ("-", ASCII hyphen) then
                append two hyphens to the LDH buffer
            else if the code is an LDH character then
                if the LDH buffer contains no non-hyphens then
                    append one hyphen to the LDH buffer
                endif
                append the code to the LDH buffer
            else (the code is not an LDH character)
                if the LDH buffer contains any non-hyphens then
                    append one hyphen to the LDH buffer
                endif
                if the bit queue is empty then
                    output the LDH buffer and reset it to empty
                endif
                enqueue the bit string corresponding to the code
                (the bit string depends on the encoding style)
                dequeue 5 bits
                output the corresponding base-32 character
                output the LDH buffer and reset it to empty
                while the bit queue contains at least 5 bits
                    dequeue 5 bits
                    output the corresponding base-32 character
                endwhile
            endif
        endfor

        if the bit queue is not empty
            enqueue zero bits until it contains 5 bits
            dequeue 5 bits
            output the corresponding base-32 character
        endif

        output the LDH buffer
        output the LDH characters "-8Q9"

    If the total number of characters output was greater than 63, the
    string is too long for a host name label.

    Notice that a group of LDH characters appears in the output as soon
    as all the bits of the preceding non-LDH codes have appeared.  The
    base-32 character that appears just before the switch to literal
    mode may contain at most four bits of information from the first
    non-LDH character that comes after the LDH group.

Base-32 characters

    "2" =  0 = 00000
    "3" =  1 = 00001
    "4" =  2 = 00010
    "5" =  3 = 00011
    "6" =  4 = 00100
    "7" =  5 = 00101
    "8" =  6 = 00110
    "9" =  7 = 00111
    "A" =  8 = 01000
    "B" =  9 = 01001
    "C" = 10 = 01010
    "D" = 11 = 01011
    "E" = 12 = 01100
    "F" = 13 = 01101
    "G" = 14 = 01110
    "H" = 15 = 01111
    "I" = 16 = 10000
    "J" = 17 = 10001
    "K" = 18 = 10010
    "M" = 19 = 10011
    "N" = 20 = 10100
    "P" = 21 = 10101
    "Q" = 22 = 10110
    "R" = 23 = 10111
    "S" = 24 = 11000
    "T" = 25 = 11001
    "U" = 26 = 11010
    "V" = 27 = 11011
    "W" = 28 = 11100
    "X" = 29 = 11101
    "Y" = 30 = 11110
    "Z" = 31 = 11111

    The digits "0" and "1" and the letters "O" and "L" ("l") are not
    used, to avoid transcription errors.

    The base-32 characters, like all characters in host name labels, are
    case-insensitive, so they must be recognized in both upper and lower
    case.
    However, since existing LDH labels are usually stored in
    lower case, it is recommended that the base-32 portions of encoded
    names be stored in upper case, to help humans easily pick out the
    literal portions.

Encoding styles

    The choice of encoding style depends only on the codes in the UTF-16
    string that are not LDH characters.  It in no way depends on any LDH
    codes that may be present.

    Each code belongs to a particular half-row, which is given by its
    upper 9 bits.  If all of the non-LDH codes belong to the same
    half-row, use half-row style:  Initialize the bit queue by enqueuing
    two 0 bits followed by the designated half-row number (the 9-bit
    half-row number shared by all the codes).  During the encoding
    procedure the bit string corresponding to each code is its lower 7
    bits.

    If not all the non-LDH codes belong to the same half-row, but they
    all belong to the same row (same upper 8 bits), use full-row style:
    Initialize the bit queue by enqueuing a 0 bit, then a 1 bit, then
    the designated row number (the 8-bit row number shared by all the
    codes).  During the encoding procedure the bit string corresponding
    to each code is its lower 8 bits.

    If not all non-LDH codes belong to the same row, then consider
    using mixed style, which chooses a privileged half-row.  For each
    half-row used by the non-LDH codes, count the number of codes that
    belong to that half-row.  Then, for each half-row, calculate M, the
    number of base-32 characters that would be required if that half-row
    were chosen:

        N = total number of non-LDH codes
        H = number of non-LDH codes in the candidate half-row
        C = number of non-LDH codes in the complementary half-row (the
            one with the opposite lowest bit)

        M = (2 + 9 + 18*(N - H - C) + 8*H + 9*C + 4) / 5
          = 3 + (18*N - 10*H - 9*C) / 5

    (The division is integer division, which discards any remainder.)
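
    As an illustration only (it is not part of this specification), the
    calculation of M for every candidate half-row might be sketched in
    Python as follows.  The function name mixed_style_costs and its
    argument are hypothetical; the argument is assumed to hold only the
    16-bit values of the non-LDH codes.

```python
from collections import Counter


def mixed_style_costs(codes):
    """Map each candidate half-row number to M, the number of base-32
    characters mixed style would need if that half-row were chosen.

    `codes` is assumed to contain only the non-LDH 16-bit codes.
    """
    n = len(codes)                                       # N
    per_half_row = Counter(code >> 7 for code in codes)  # upper 9 bits
    costs = {}
    for half_row, h in per_half_row.items():             # H
        c = per_half_row.get(half_row ^ 1, 0)            # C: opposite lowest bit
        costs[half_row] = 3 + (18 * n - 10 * h - 9 * c) // 5
    return costs


# Two codes in half-row 96, one in its complement 97, one elsewhere.
costs = mixed_style_costs([0x3042, 0x3044, 0x30A2, 0x4E00])
print(costs)  # {96: 11, 97: 11, 156: 15}
```

    Python's // operator discards the remainder for the non-negative
    values that arise here, which matches the integer division above.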
    Choose the half-row with the smallest M, breaking ties in favor of
    lower-numbered half-rows.  Compare this M with M', the number of
    base-32 characters that would be required if no-row style were used:

        M' = (2 + 16*N + 4) / 5 = (6 + 16*N) / 5

    If M' <= M, use no-row style:  Initialize the bit queue by
    enqueuing two 1 bits.  There is no designated row number.  During
    the encoding procedure the bit string corresponding to each code is
    the full 16-bit code itself.

    If M < M', use mixed style:  Initialize the bit queue by enqueuing
    a 1 bit, then a 0 bit, then the designated 9-bit half-row number
    (the one chosen above).  During the encoding procedure the bit
    string corresponding to each code is:

        0 followed by the lower 7 bits if the code belongs to the chosen
        half-row;

        10 followed by the lower 7 bits if the code belongs to the
        complementary half-row;

        11 followed by the whole 16-bit code otherwise.

Decoding procedure

    The following description assumes that UTF-16 output is desired.

    If the input string does not end with "-8Q9" or "-8q9", output the
    input string (converted from ASCII to UTF-16) and stop.

    The decoder needs a bit queue capable of holding up to 22 bits.  It
    is initially empty.  It also needs a literal-mode flag, which is
    initially unset, and a 4-value style indicator.
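
    The bit queue used by both the encoder and the decoder might be
    sketched in Python as follows.  This is an illustration under stated
    assumptions, not part of this specification:  the class name
    BitQueue and its methods are hypothetical, and the 22-bit capacity
    is not enforced because Python integers are unbounded.

```python
class BitQueue:
    """FIFO of bits; values enter and leave most significant bit first."""

    def __init__(self):
        self.value = 0  # queued bits, oldest bit in the highest position
        self.count = 0  # number of bits currently queued

    def enqueue(self, bits, width):
        """Append the low `width` bits of `bits` to the queue."""
        self.value = (self.value << width) | (bits & ((1 << width) - 1))
        self.count += width

    def dequeue(self, width):
        """Remove the oldest `width` bits and return them as an integer."""
        if width > self.count:
            raise ValueError("not enough bits queued")
        self.count -= width
        result = self.value >> self.count
        self.value &= (1 << self.count) - 1
        return result


# Quintets for the base-32 characters "3", "I", "A", "A".  For brevity
# they are all enqueued up front; a decoder would read them one by one.
q = BitQueue()
for quintet in (1, 16, 8, 8):
    q.enqueue(quintet, 5)
style = q.dequeue(2)           # 0, i.e. bits 00: half-row style
half_row = q.dequeue(9)        # 96, the designated half-row number
low7 = q.dequeue(7)            # 0x42, lower 7 bits of the first code
code = (half_row << 7) | low7  # 0x3042
```

    After these steps the queue holds two 0 bits, which is the kind of
    zero padding the encoder adds when it flushes its final quintet.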
    Perform the following actions:

        read the first character and enqueue its base-32 quintet
        dequeue two bits and use them to set the style indicator
        if the style uses a designated full/half row number then
            while the queue does not contain enough bits to represent it
                read the next character and enqueue its base-32 quintet
            endwhile
            dequeue enough bits to set the designated row (or half-row)
        endif

        for each remaining input character except the last four do
            if the character is an ASCII hyphen then
                if the next character is also an ASCII hyphen then
                    skip it
                    output an ASCII hyphen (converted to UTF-16)
                else
                    toggle the literal-mode flag
                endif
            else if the literal-mode flag is set then
                output the character (converted to UTF-16)
            else (the literal-mode flag is clear)
                enqueue the character's base-32 quintet
                if the bit queue contains enough bits to represent a
                    UTF-16 code (which depends on the style indicator)
                then
                    dequeue just enough bits to represent one code
                    output the code
                endif
            endif
        endfor

    At the end the bit queue may contain up to four 0 bits.  If it
    contains anything else, the input was invalid.

Comparison with RACE

    BRACE is an extension of RACE [RACE01].  For Unicode strings
    that contain no LDH characters and use the full-row or no-row
    encoding styles, BRACE is virtually identical to RACE.  For other
    strings, BRACE produces a smaller encoding than RACE.  For example,
    the encoding is substantially more compact for Unicode strings
    containing a substantial number of LDH characters, or containing
    many Japanese kana with some kanji.

    Unlike RACE, any LDH characters present in the Unicode string are
    represented literally in the BRACE-encoded string.  This may or may
    not be useful, but it happens to be the most compact way to encode
    LDH characters.

    Whereas RACE uses a signature prefix, BRACE uses a signature suffix.
    This makes it easy to guarantee that the encoded label never ends
    with a hyphen, even if the original UTF-16 string does.  (Whether
    such a UTF-16 string should be allowed is a matter of policy, not
    technical capability.)

    The main drawback of BRACE is its greater complexity.