Internet Draft                                                Mark Davis
draft-ietf-idn-lace-01.txt                                           IBM
January 5, 2001                                             Paul Hoffman
Expires July 5, 2001                                          IMC & VPNC

         LACE: Length-based ASCII Compatible Encoding for IDN

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes a transformation method for representing
non-ASCII characters in host name parts in a fashion that is completely
compatible with the current DNS. It is a potential candidate for an
ASCII-Compatible Encoding (ACE) for internationalized host names, as
described in the comparison document from the IETF IDN Working Group.
This method is based on the observation that many internationalized host
name parts will have a few substrings from a small number of rows of the
ISO 10646 repertoire. Run-length encoding for these types of
host names will be fairly compact, and is fairly easy to describe.

1. Introduction

There is a strong world-wide desire to use characters other than plain
ASCII in host names. Host names have become the equivalent of business
or product names for many services on the Internet, so there is a need
to make them usable by people whose native scripts are not representable
by ASCII.
The requirements for internationalizing host names are
described in the IDN WG's requirements document, [IDNReq].

The IDN WG's comparison document [IDNComp] describes three potential
main architectures for IDN: arch-1 (just send binary), arch-2 (send
binary or ACE), and arch-3 (just send ACE). LACE (Length-based ACE) is
an ACE that can be used with protocols that match arch-2 or arch-3.
LACE specifies an ACE format as specified in ace-1 in [IDNComp].
Further, it specifies an identifying mechanism for ace-2 in [IDNComp],
namely ace-2.1.1 (add hopefully-unique legal tag to the beginning of
the name part).

In formal terms, LACE describes a character encoding scheme of the
ISO/IEC 10646 [ISO10646] coded character set (whose assignment of
characters is synchronized with Unicode [Unicode3]) and the rules for
using that scheme in the DNS. As such, it could also be called a
"charset" as defined in [IDNReq]. It can also be viewed as a specialized
UTF (transformation format), designed to work within the restrictions of
the DNS.

The LACE protocol has the following features:

- There is exactly one way to convert internationalized host parts to
  and from LACE parts. Host name part uniqueness is preserved.

- Host parts that have no international characters are not changed.

- Names using LACE can include more internationalized characters than
  with other ACE protocols that have been suggested to date. LACE-encoded
  names are variable length, depending on the number of transitions
  between rows in the ISO 10646 repertoire that appear in the name part.
  Name parts that cannot be compressed using run-length encoding can have
  up to 17 characters, and names that can be compressed can have up to 35
  characters. Further, a name that has just a few row transitions
  typically can have over 30 characters.

It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT".
Any implementation that
does not follow these statements exactly is likely to cause damage to
the Internet by creating non-unique representations of host names.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111".

Examples in this document use the notation for code points and names
from the Unicode Standard [Unicode3] and ISO 10646. For example, the
letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
A".

LACE converts strings with internationalized characters into
strings of US-ASCII that are acceptable as host name parts in current
DNS host naming usage. The former are called "pre-converted" and the
latter are called "post-converted".

1.2 IDN summary

Using the terminology in [IDNComp], LACE specifies an ACE format as
specified in ace-1. Further, it specifies an identifying mechanism for
ace-2, namely ace-2.1.1 (add hopefully-unique legal tag to the beginning
of the name part).

LACE has the following length characteristics.

- LACE-encoded names are variable length, depending on the number of
  transitions between rows that appear in the name part.

- Name parts that cannot be compressed using run-length encoding can
  have up to 17 characters.

- Names that can be compressed can have up to 35 characters.

- A name that has just a few row transitions typically can have over 30
  characters.

2. Host Part Transformation

According to [STD13], host parts must be case-insensitive, start and
end with a letter or digit, and contain only letters, digits, and the
hyphen character ("-"). This, of course, excludes any internationalized
characters, as well as many other characters in the ASCII character
repertoire.
Further, domain name parts must be 63 octets or shorter in
length.

2.1 Name tagging

All post-converted name parts that contain internationalized characters
begin with the string "lq--". (Of course, because host name parts are
case-insensitive, this might also be represented as "Lq--" or "lQ--" or
"LQ--".) The string "lq--" was chosen because it is extremely unlikely
to exist in host parts before this specification was produced. As a
historical note, in late October 2000, none of the second-level host
name parts in any of the .com, .edu, .net, and .org top-level domains
began with "lq--"; there are many tens of thousands of other strings of
three characters followed by a hyphen that have this property and could
be used instead. The string "lq--" will change to other strings with the
same properties in future versions of this draft.

Note that a zone administrator might still choose to use "lq--" at the
beginning of a host name part even if that part does not contain
internationalized characters. Zone administrators SHOULD NOT create host
part names that begin with "lq--" unless those names are post-converted
names. Creating host part names that begin with "lq--" but that are not
post-converted names may cause two distinct problems. Some display
systems, after converting the post-converted name part back to an
internationalized name part, might display the name parts in a
possibly-confusing fashion to users. More seriously, some resolvers,
after converting the post-converted name part back to an
internationalized name part, might reject the host name if it contains
illegal characters.

2.2 Converting an internationalized name to an ACE name part

To convert a string of internationalized characters into an ACE name
part, the following steps MUST be performed in the exact order of the
subsections given here.

If a name part consists exclusively of characters that conform to the
host name requirements in [STD13], the name MUST NOT be converted to
LACE.
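
As an illustration only, the [STD13] host part rules, the case-insensitive
"lq--" tag from section 2.1, and the precondition just stated can be
sketched in Python as follows. The function names (conforms_to_std13,
has_lace_tag, needs_lace) are hypothetical and are not defined by this
draft; the sketch covers only these checks, not the conversion itself.

```python
import re

LACE_TAG = "lq--"  # tag string given in section 2.1

# Letters, digits and hyphen only; the first and last characters must be
# a letter or digit (the [STD13] rules summarized in section 2).
_STD13_PART = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?")


def conforms_to_std13(part: str) -> bool:
    """Return True if `part` is already a legal host name part."""
    # For a string that passes the ASCII-only pattern, the character
    # count equals the octet count, so len() checks the 63-octet limit.
    return len(part) <= 63 and _STD13_PART.fullmatch(part) is not None


def has_lace_tag(part: str) -> bool:
    """Return True if `part` begins with the tag, in any letter case."""
    return part[:len(LACE_TAG)].lower() == LACE_TAG


def needs_lace(part: str) -> bool:
    """A part is a candidate for LACE only if it is not plain [STD13]."""
    return not conforms_to_std13(part)
```

Under these assumptions, needs_lace("ibm") is False, so such a part is
left unchanged, while has_lace_tag("LQ--abc") is True because the tag
comparison ignores case.
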
That is, a name part that can be represented without LACE MUST NOT
be encoded using LACE. This absolute requirement prevents there from
being two different encodings for a single DNS host name.

If any checking for prohibited name parts (such as checks for
prohibited characters, case-folding, or canonicalization) is to be done,
it MUST be done before doing the conversion to an ACE name part.

Characters outside the first plane of characters (those with codepoints
above U+FFFF) MUST be represented using surrogates, as described in
RFC 2781 [RFC2781].

The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the pre-converted
string.

2.2.1 Check the input string for disallowed names

If the input string consists only of characters that conform to the host
name requirements in [STD13], the conversion MUST stop with an error.

2.2.2 Compress the pre-converted string

The entire pre-converted string MUST be compressed using the compression
algorithm specified in section 2.4. The result of this step is the
compressed string.

2.2.3 Check the length of the compressed string

The compressed string MUST be 36 octets or shorter. If the compressed
string is 37 octets or longer, the conversion MUST stop with an error.

2.2.4 Encode the compressed string with Base32

The compressed string MUST be converted using the Base32 encoding
described in section 2.5. The result of this step is the encoded string.

2.2.5 Prepend "lq--" to the encoded string and finish

Prepend the characters "lq--" to the encoded string. This is the host
name part that can be used in DNS resolution.

2.3 Converting a host name part to an internationalized name

The input string for conversion is a valid host name part.
Note that if
any checking for prohibited name parts (such as prohibited characters,
case-folding, or canonicalization) is to be done, it MUST be done after
doing the conversion from an ACE name part.

If a decoded name part consists exclusively of characters that conform
to the host name requirements in [STD13], the conversion from LACE MUST
fail. Because a name part that can be represented without LACE MUST NOT
be encoded using LACE, the decoding process MUST check for name parts
that consist exclusively of characters that conform to the host name
requirements in [STD13] and, if such a name part is found, it MUST be
considered an error (and possibly a security violation).

2.3.1 Strip the "lq--"

The input string MUST begin with the characters "lq--". If it does not,
the conversion MUST stop with an error. Otherwise, remove the characters
"lq--" from the input string. The result of this step is the stripped
string.

2.3.2 Decode the stripped string with Base32

The entire stripped string MUST be checked to see if it is valid Base32
output. The entire stripped string MUST be changed to all lower-case
letters and digits. If any resulting characters are not in Table 1, the
conversion MUST stop with an error; the input string is the
post-converted string. Otherwise, the entire resulting string MUST be
converted to a binary format using the Base32 decoding described in
section 2.5. The result of this step is the decoded string.

2.3.3 Decompress the decoded string

The entire decoded string MUST be converted to ISO 10646 characters
using the decompression algorithm described in section 2.4.
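
As an illustration only, the checks in 2.3.1 and the validation part of
2.3.2 might be sketched in Python as follows. Table 1 is not reproduced
in this part of the draft, so the alphabet below (the letters a-z
followed by the digits 2-7) is an assumption, as are the names LaceError
and strip_and_normalize. Matching the tag case-insensitively is one
reading of section 2.1. The Base32 decoding and the decompression
themselves are not shown.

```python
LACE_TAG = "lq--"  # tag string given in section 2.1

# ASSUMPTION: stands in for Table 1, which is not shown in this part of
# the draft. Replace with the actual table before relying on this.
BASE32_ALPHABET = "abcdefghijklmnopqrstuvwxyz234567"


class LaceError(ValueError):
    """Hypothetical error type for 'the conversion MUST stop'."""


def strip_and_normalize(part: str) -> str:
    """Steps 2.3.1 and 2.3.2 up to, but not including, Base32 decoding."""
    # 2.3.1: the input must begin with the tag (any letter case here).
    if part[:len(LACE_TAG)].lower() != LACE_TAG:
        raise LaceError("name part does not begin with 'lq--'")
    # 2.3.2: lower-case the stripped string, then check every character
    # against the (assumed) Base32 alphabet.
    stripped = part[len(LACE_TAG):].lower()
    for ch in stripped:
        if ch not in BASE32_ALPHABET:
            raise LaceError("character %r is not valid Base32" % ch)
    return stripped
```

A caller would pass the returned string to the Base32 decoder of
section 2.5 and treat any LaceError as meaning that the input is not a
post-converted name part.
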
The result
of this is the internationalized string.

2.3.4 Check the internationalized string for disallowed names

If the internationalized string consists only of characters that conform
to the host name requirements in [STD13], the conversion MUST stop with
an error.

2.4 Compression algorithm

The basic method for compression is to reduce a substring that consists
of characters all from a single row of the ISO 10646 repertoire to a
count octet followed by the row header followed by the lower octets of
the characters. If this ends up being longer than the input, the string
is not compressed, but instead has a unique one-octet header attached.

Although the uncompressed mode limits the number of characters in a LACE
name part to 17, this is still generally enough for all names in almost
all scripts. Also, this limit is close to the limits set by other
encoding proposals.

Note that the compression and decompression rules MUST be followed
exactly. This requirement prevents a single host name part from having
two encodings. Thus, for any input to the algorithm, there is only one
possible output. An implementation cannot choose to use one-octet mode or
two-octet mode using anything other than the logic given in this