Internet Draft                                              Paul Hoffman
draft-ietf-idn-race-03.txt                                    IMC & VPNC
November 22, 2000
Expires in six months

          RACE: Row-based ASCII Compatible Encoding for IDN

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes a transformation method for representing
non-ASCII characters in host name parts in a fashion that is completely
compatible with the current DNS. It is a potential candidate for an
ASCII-Compatible Encoding (ACE) for internationalized host names, as
described in the comparison document from the IETF IDN Working Group.
This method is based on the observation that many internationalized
host name parts will have all their characters in one row of the ISO
10646 repertoire.

1. Introduction

There is a strong world-wide desire to use characters other than plain
ASCII in host names. Host names have become the equivalent of business
or product names for many services on the Internet, so there is a need
to make them usable by people whose native scripts are not representable
by ASCII.
The requirements for internationalizing host names are described in
the IDN WG's requirements document, [IDNReq].

The IDN WG's comparison document [IDNComp] describes three potential
main architectures for IDN: arch-1 (just send binary), arch-2 (send
binary or ACE), and arch-3 (just send ACE). This document describes an
ACE, called Row-based ACE or RACE, that can be used with protocols that
match arch-2 or arch-3. RACE specifies an ACE format as specified in
ace-1 in [IDNComp]. Further, it specifies an identifying mechanism for
ace-2 in [IDNComp], namely ace-2.1.1 (add hopefully-unique legal tag to
the beginning of the name part).

Author's note: although earlier drafts of this document supported the
ideas in arch-3, I no longer support that idea and instead only support
arch-2. Of course, someone else might write an IDN proposal that matches
arch-3 and use RACE as the protocol.

In formal terms, RACE describes a character encoding scheme of the
ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
characters is synchronized with Unicode [Unicode3]) and the rules for
using that scheme in the DNS. As such, it could also be called a
"charset" as defined in [IDNReq].

The RACE protocol has the following features:

- There is exactly one way to convert internationalized host parts to
and from RACE parts. Host name part uniqueness is preserved.

- Host parts that have no international characters are not changed.

- Names using RACE can include more internationalized characters than
with other ACE protocols that have been suggested to date. Names in the
Han, Yi, Hangul syllables, or Ethiopic scripts can have up to 17
characters, and names in most other scripts can have up to 35
characters. Further, a name that consists of characters from one
non-Latin script but also contains some Latin characters such as digits
or hyphens can have close to 33 characters.

It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT".
Any implementation that does not follow these statements exactly is
likely to cause damage to the Internet by creating non-unique
representations of host names.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111".

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 names. For example, the letter "a"
may be represented as either "U+0061" or "LATIN SMALL LETTER A".

RACE converts strings with internationalized characters into
strings of US-ASCII that are acceptable as host name parts in current
DNS host naming usage. The former are called "pre-converted" and the
latter are called "post-converted".

1.2 IDN summary

Using the terminology in [IDNComp], RACE specifies an ACE format as
specified in ace-1. Further, it specifies an identifying mechanism for
ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
of the name part).

RACE has the following length characteristics. In this list, "row" means
a row from ISO 10646.

- If the characters in the input all come from the same row, up to 35
characters per name part are allowed.

- If the characters in the input come from two or more rows, none of
which is row 0, up to 17 characters per name part are allowed.

- If the characters in the input come from two rows, one of which is row
0, between 17 and 33 characters per name part are allowed.

2. Host Part Transformation

According to [STD13], host parts must be case-insensitive, start and
end with a letter or digit, and contain only letters, digits, and the
hyphen character ("-"). This, of course, excludes any internationalized
characters, as well as many other characters in the ASCII character
repertoire.
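
The 35- and 17-character figures in section 1.2 can be reproduced with a
little arithmetic. The sketch below is illustrative only. It assumes
that the Base32 step emits one ASCII character per 5 bits with no
padding (the Base32 details belong to section 2.5, so this is an
assumption here), and it uses the 63-octet name part limit from [STD13],
the four-character "bq--" tag, and the one-octet header of the
compressed string. All function and constant names are hypothetical.

```python
MAX_PART_OCTETS = 63       # name part limit from [STD13]
TAG = "bq--"               # tag prepended to every RACE name part
BITS_PER_BASE32_CHAR = 5   # assumption: unpadded Base32, 5 bits per character


def race_length_limits():
    """Derive the per-mode character limits from the name part limit."""
    # Characters left for the Base32 text once the tag is prepended.
    base32_chars = MAX_PART_OCTETS - len(TAG)
    # Whole octets that fit into that many 5-bit characters.
    compressed_octets = base32_chars * BITS_PER_BASE32_CHAR // 8
    # One octet of the compressed string is always the header.
    payload_octets = compressed_octets - 1
    return {
        "compressed_octets": compressed_octets,
        "one_octet_mode_chars": payload_octets,
        "two_octet_mode_chars": payload_octets // 2,
    }


def fits_in_name_part(compressed_octets):
    """True if a compressed string of this size still fits in 63 octets."""
    bits = compressed_octets * 8
    base32_chars = (bits + BITS_PER_BASE32_CHAR - 1) // BITS_PER_BASE32_CHAR
    return len(TAG) + base32_chars <= MAX_PART_OCTETS


if __name__ == "__main__":
    print(race_length_limits())
    print(fits_in_name_part(36), fits_in_name_part(37))
```

Under these assumptions a 36-octet compressed string yields a
62-character name part, while a 37-octet string would need 64 characters
and no longer fit.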
Further, domain name parts must be 63 octets or shorter in length.

2.1 Name tagging

All post-converted name parts that contain internationalized characters
begin with the string "bq--". (Of course, because host name parts are
case-insensitive, this might also be represented as "Bq--" or "bQ--" or
"BQ--".) The string "bq--" was chosen because it is extremely unlikely
to exist in host parts before this specification was produced. As a
historical note, in late August 2000, none of the second-level host name
parts in any of the .com, .edu, .net, and .org top-level domains began
with "bq--"; there are many tens of thousands of other strings of three
characters followed by a hyphen that have this property and could be
used instead. The string "bq--" will change to other strings with the
same properties in future versions of this draft.

Note that a zone administrator might still choose to use "bq--" at the
beginning of a host name part even if that part does not contain
internationalized characters. Zone administrators SHOULD NOT create host
part names that begin with "bq--" unless those names are post-converted
names. Creating host part names that begin with "bq--" but that are not
post-converted names may cause two distinct problems. Some display
systems, after converting the post-converted name part back to an
internationalized name part, might display the name parts in a
possibly-confusing fashion to users. More seriously, some resolvers,
after converting the post-converted name part back to an
internationalized name part, might reject the host name if it contains
illegal characters.

2.2 Converting an internationalized name to an ACE name part

To convert a string of internationalized characters into an ACE name
part, the following steps MUST be performed in the exact order of the
subsections given here.

If a name part consists exclusively of characters that conform to the
host name requirements in [STD13], the name MUST NOT be converted to
RACE.
That is, a name part that can be represented without RACE MUST NOT
be encoded using RACE. This absolute requirement ensures that a single
DNS host name never has two different encodings.

If any preprocessing of name parts (such as checking for prohibited
characters, case-folding, or canonicalization) is to be done, it MUST be
done before doing the conversion to an ACE name part.

Characters outside the first plane of characters (those with codepoints
above U+FFFF) MUST be represented using surrogates, as described in the
UTF-16 description in ISO 10646.

The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the pre-converted
string.

2.2.1 Check the input string for disallowed names

If the input string consists only of characters that conform to the host
name requirements in [STD13], the conversion MUST stop with an error.

2.2.2 Compress the pre-converted string

The entire pre-converted string MUST be compressed using the compression
algorithm specified in section 2.4. The result of this step is the
compressed string.

2.2.3 Check the length of the compressed string

The compressed string MUST be 36 octets or shorter. If the compressed
string is 37 octets or longer, the conversion MUST stop with an error.

2.2.4 Encode the compressed string with Base32

The compressed string MUST be converted using the Base32 encoding
described in section 2.5. The result of this step is the encoded string.

2.2.5 Prepend "bq--" to the encoded string and finish

Prepend the characters "bq--" to the encoded string. This is the host
name part that can be used in DNS resolution.

2.3 Converting a host name part to an internationalized name

The input string for conversion is a valid host name part.
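
What counts as a valid host name part under the [STD13] rules quoted in
section 2 can be sketched as a small predicate. This is a minimal
illustration, not a normative test; the name is_std13_host_part is
hypothetical, and the same kind of check could serve for the "consists
only of characters that conform to [STD13]" tests in this document.

```python
import re

# Letters, digits and hyphens only; the first and last character must be
# a letter or digit. Explicit ASCII ranges keep non-ASCII digits out.
_STD13_PART = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_std13_host_part(part: str) -> bool:
    """Return True if `part` meets the [STD13] rules quoted in section 2."""
    # A matching string is pure ASCII, so its length in characters equals
    # its length in octets; 63 is the name part limit from [STD13].
    return len(part) <= 63 and _STD13_PART.fullmatch(part) is not None


if __name__ == "__main__":
    for candidate in ("example", "bq--abc", "-abc", "abc-", "caf\u00e9"):
        print(repr(candidate), is_std13_host_part(candidate))
```

Matching is done on the characters as given, so upper-case and
lower-case forms are accepted alike, in line with host parts being
case-insensitive.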
Note that if any checking for prohibited name parts (such as prohibited
characters, case-folding, or canonicalization) is to be done, it MUST be
done after doing the conversion from an ACE name part.

If a decoded name part consists exclusively of characters that conform
to the host name requirements in [STD13], the conversion from RACE MUST
fail. Because a name part that can be represented without RACE MUST NOT
be encoded using RACE, the decoding process MUST check for name parts
that consist exclusively of characters that conform to the host name
requirements in [STD13] and, if such a name part is found, it MUST be
considered an error (and possibly a security violation).

2.3.1 Strip the "bq--"

The input string MUST begin with the characters "bq--". If it does not,
the conversion MUST stop with an error. Otherwise, remove the characters
"bq--" from the input string. The result of this step is the stripped
string.

2.3.2 Decode the stripped string with Base32

The entire stripped string MUST be checked to see if it is valid Base32
output. The entire stripped string MUST be changed to all lower-case
letters and digits. If any resulting characters are not in Table 1, the
conversion MUST stop with an error; the input string is the
post-converted string. Otherwise, the entire resulting string MUST be
converted to a binary format using the Base32 decoding described in
section 2.5. The result of this step is the decoded string.

2.3.3 Decompress the decoded string

The entire decoded string MUST be converted to ISO 10646 characters
using the decompression algorithm described in section 2.4.
The result of this is the internationalized string.

2.3.4 Check the internationalized string for disallowed names

If the internationalized string consists only of characters that conform
to the host name requirements in [STD13], the conversion MUST stop with
an error.

2.4 Compression algorithm

The basic method for compression is to reduce a full string that
consists of characters all from a single row of the ISO 10646
repertoire, or all from a single row plus from row 0, to as few octets
as possible. Any full string that has characters that come from two
rows, neither of which is row 0, or three or more rows, has all the
octets of the input string in the output string.

If the string comes from only one row, compression is to one octet per
character in the string. If the string has characters from exactly one
row other than row 0 and also has characters from row 0, compression is
to one octet for the characters from the non-0 row and two octets for
the characters from row 0. Otherwise, there is no compression and the
output is a string that has two octets per input character.

The compressed string always has a one-octet header. If the string comes
from only one row, the header octet is the upper octet of the
characters. If the string has characters from exactly one row other than
row 0 and also has characters from row 0, the header octet is the upper
octet of the characters from the non-0 row. Otherwise, the header octet
is 0xD8, which is the upper octet of a surrogate pair. Design note: It
is impossible to have a legal stream of UTF-16 characters that has all
the upper octets being 0xD8 because a character whose upper octet is
0xD8 must be followed by one whose upper octet is in the range 0xDC
through 0xDF.

Although the two-octet mode limits the number of characters in a RACE
name part to 17, this is still generally enough for almost all names in
almost all scripts. Also, this limit is close to the limits set by other
encoding proposals.

Note that the compression and decompression rules MUST be followed
exactly.
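
The choice of header octet described above depends only on which rows
occur in the input, which can be sketched as follows. This is a minimal
illustration under stated assumptions, not the full compression
algorithm: it selects the header only and does not produce the
compressed octets. The names utf16_units and race_header are
hypothetical, and the input is taken as a Python string that is first
turned into big-endian UTF-16 code units.

```python
def utf16_units(text):
    """Split text into 16-bit big-endian UTF-16 code units."""
    data = text.encode("utf-16-be")
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]


def race_header(units):
    """Pick the one-octet header for a sequence of UTF-16 code units."""
    if not units:
        raise ValueError("empty name part")
    rows = {unit >> 8 for unit in units}   # upper octet of every code unit
    non_zero_rows = rows - {0x00}
    if len(rows) == 1:
        # All characters come from a single row (which may be row 0).
        return next(iter(rows))
    if len(non_zero_rows) == 1:
        # Exactly one non-0 row, mixed with characters from row 0.
        return next(iter(non_zero_rows))
    # Two non-0 rows, or three or more rows: no compression.
    return 0xD8


if __name__ == "__main__":
    for sample in ("\u3042\u3044", "\u30421", "\u3042\u4e00"):
        print(hex(race_header(utf16_units(sample))))
```

Because the header is a pure function of the set of rows in the input,
the same input always selects the same mode.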
This requirement prevents a single host name part from having
two encodings. Thus, for any input to the algorithm, there is only one
possible output. An implementation cannot choose to use one-octet mode or