Internet Draft                                              Paul Hoffman
draft-ietf-idn-nameprep-03.txt                                IMC & VPNC
February 24, 2001                                          Marc Blanchet
Expires in six months                                           ViaGenie

              Preparation of Internationalized Host Names

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.

Abstract

This document describes how to prepare internationalized host names for
use in the DNS. The steps include:
 - mapping characters to other characters, such as to change their case
 - normalizing the characters
 - excluding characters that are prohibited from appearing in
   internationalized host names

This document does not specify a wire protocol. This preparation should
be done before the DNS request.

1. Introduction

When expanding today's DNS to include internationalized host names,
those new names will be handled in many parts of the DNS. The
Internationalized Domain Name (IDN) Working Group's requirements
document [IDNReq] describes a framework for domain name handling as well
as requirements for the new names.

A user can enter a domain name into an application program in a myriad
of fashions. Depending on the input method, the characters entered in
the domain name may or may not be those that are allowed in
internationalized host names.
Thus, there must be a way to normalize the user's input before the name
is resolved in the DNS.

It is a design goal of this document to allow users to enter host names
in applications and have the highest chance of getting the name correct.
Another, often conflicting, design goal is to allow as wide a range of
characters as possible in host names. The user should not be limited to
entering exactly the characters that might have been used, but should
instead be able to enter characters that unambiguously normalize to
characters in the desired host name. Although it would be
easy to use the process in this step to "correct" perceived mis-features
or bugs in the current character standards, this document expressly does
not do so.

This document describes the steps needed to convert a name part from one
that is entered by the user to one that can be used in the DNS.

Within a fully-qualified domain name, some labels may be
internationalized, while others are not. This specification should be
applied to all internationalized labels. An application must be able to
recognize which part is internationalized; the method for such
recognition is outside of the scope of this document. Note that this
specification is harmless to the non-internationalized labels: when the
steps described here are applied to non-internationalized labels, the
labels will not change.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Examples in this document use the notation for code points and names
from the Unicode Standard [Unicode3] and ISO/IEC 10646 [ISO10646]. For
example, the letter "a" may be represented as either "U+0061" or "LATIN
SMALL LETTER A". In the lists of prohibited characters, the "U+" is left
off to make the lists easier to read.
The names of character ranges are
shown in square brackets (such as "[SYMBOLS]") and do not come from the
standards.

Note: A glossary of terms used in Unicode and ISO/IEC 10646 can be found
in [Glossary]. Information on the 10646/Unicode character model can be
found in [CharModel].

2. Preparation Overview

The steps for preparing names are:

1) Input from the application service interface -- This can be done in
many ways and is not specified in this document.

2) Map -- For each character in the input, check if it has a mapping
and, if so, replace it with its mapping. The mappings are a combination
of folding uppercase characters to lowercase and hyphen mapping. This is
described in Section 3.

3) Normalize -- Normalize the characters. This is described in Section
4.

4) Look for prohibited output -- Check for any characters that are not
allowed in the output. If any are found, return an error to the
application service interface. This is described in Section 5.

5) Resolution of the prepared name -- This must be specified in a
different IDN document.

The above steps MUST be performed in the order given in order to comply
with this specification.

The steps in this document have associated tables in the document. The
tables are derived from outside sources, and the derivation is briefly
described in the document. Although a great deal of effort has gone into
preparing the tables, there is a chance that the tables do not correctly
reflect the outside sources. Regardless of whether or not the tables
differ from the sources, implementations MUST use the tables in this
document for their processing. That is, if there is an error in the
tables, the tables must still be used. Future versions of this document
may include corrections and additions to the tables.

3. Mapping

Each character in the input stream is checked against the mapping table.
The mapping table can be found in Appendix E of this document.
That table includes all the steps described in the subsections below.

The mappings can be one-to-none, one-to-one, or one-to-many. That is,
some characters may be eliminated or replaced by more than one
character, and the output of this step might be shorter or longer than
the input. Because of this, an application MUST be prepared to receive a
longer or shorter string than the one input to the nameprep algorithm.

Rationale: Characters that are not wanted in internationalized name
parts can either be mapped to nothing in the mapping step, or cause an
error in the prohibition step. The general guideline used to pick
between the two outcomes was that removing alphabetic, non-protocol
characters be done in the mapping step, but all other removals be done
in the prohibition step. This allows simple linguistic errors on the
part of an input mechanism to be caught in the mapping step, without
hiding from the user serious errors such as entering protocol characters
or invisible characters.

3.1 Case mapping

The input string is case folded according to [UTR21]. For most
characters, this is the same thing as changing the input character to a
lowercase character. For some characters, however, more complex
transformations occur. The mapping table in Appendix E is derived by
applying the rules for equivalence classes from [UTR21].

Rationale: This step could have been "change all lowercase characters
into uppercase characters". However, the upper-to-lower folding was
chosen because most users of the Internet today enter host names in
lowercase.

3.2 Additional folding mappings

There are some characters that do not have mappings in [UTR21] but still
need processing. These characters include a few Greek characters and
many symbols that contain Latin characters.
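
As a rough illustration of the one-to-one and one-to-many cases described
in Section 3.1, the sketch below uses Python's built-in str.casefold().
That method applies the case folding of whatever Unicode version ships
with the interpreter, which is an assumption made for illustration only:
it is not the Appendix E table that implementations MUST use, and the
helper name fold_preview is hypothetical.

```python
def fold_preview(text):
    """Pair each input character with its case-folded form.

    Illustration only: str.casefold() follows the interpreter's Unicode
    data, not the mapping table in Appendix E of this document.
    """
    return [(ch, ch.casefold()) for ch in text]


if __name__ == "__main__":
    # LATIN CAPITAL LETTER A, LATIN SMALL LETTER SHARP S,
    # GREEK SMALL LETTER FINAL SIGMA
    for original, folded in fold_preview("A\u00df\u03c2"):
        kind = "one-to-one" if len(folded) == 1 else "one-to-many"
        targets = " ".join(f"U+{ord(c):04X}" for c in folded)
        print(f"U+{ord(original):04X} -> {targets} ({kind})")
```

U+00DF folds to two characters here, which is why an application has to
be prepared for output that is longer than its input.
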
The list of characters to
add to the mapping table was determined by the following algorithm:

b = NormalizeWithKC(Fold(a));
c = NormalizeWithKC(Fold(b));
if c is not the same as b, add a mapping for "a to c".

Because NormalizeWithKC(Fold(c)) always equals c, the table is stable
from that point on.

3.3 Mapped out

The following characters are simply deleted from the input (that is,
they are mapped to nothing) because their presence or absence should not
make two domain names different.

Some characters are only useful in line-based text, and are otherwise
invisible and ignored.

00AD; SOFT HYPHEN
1806; MONGOLIAN TODO SOFT HYPHEN
200B; ZERO WIDTH SPACE
FEFF; ZERO WIDTH NO-BREAK SPACE

Variation selectors and cursive connectors select different glyphs, but
do not bear semantics.

180B; MONGOLIAN FREE VARIATION SELECTOR ONE
180C; MONGOLIAN FREE VARIATION SELECTOR TWO
180D; MONGOLIAN FREE VARIATION SELECTOR THREE
200C; ZERO WIDTH NON-JOINER
200D; ZERO WIDTH JOINER

4. Normalization

The output of the mapping step is normalized using form KC, as described
in [UTR15]. Using form KC instead of form C causes many characters that
are identical or near-identical to be converted into a single character.

Note that this specification refers to a specific version of [UTR15]. If
a later version of [UTR15] changes the algorithm used for normalizing,
that later version MUST NOT be used with this specification. Note that
it is likely that this specification will be revised if [UTR15] is
changed, but until that happens, only the specified version of [UTR15]
must be used.

5. Prohibited Output

Before the text can be emitted, it must be checked for prohibited code
points. There is a variety of prohibited code points, as described in
this section.

One of the goals of IDN is to allow the widest possible set of host
names as long as those host names do not cause other problems, such as
conflict with other standards.
Specifically, experience with current DNS
names has shown that there is a desire for host names that include
personal names, company names, and spoken phrases. A goal of this
section is to prohibit as few characters that might be used in these
contexts as possible.

The collected list of prohibited code points can be found in Appendix F
of this document. The list in Appendix F MUST be used by implementations
of this specification. If there are any discrepancies between the list
in Appendix F and the subsections below, the list in Appendix F always
takes precedence.

Some code points listed in one section would also appear in other
sections. Each code point is only listed once in the table in Appendix
F.

5.1 Currently-prohibited ASCII characters

Some of the ASCII characters that are currently prohibited in host names
by [STD13] are also used in protocol elements such as URIs [URI]. The other
characters in the range U+0000 to U+007F that are not currently allowed
are also prohibited in host name parts to reserve them for future use in
protocol elements.

0000-002C; [ASCII]
002E-002F; [ASCII]
003A-0040; [ASCII]
005B-0060; [ASCII]
007B-007F; [ASCII]

5.2 Space characters

Space characters would make visual transcription of URLs nearly
impossible and could lead to user entry errors in many ways.

0020; SPACE
00A0; NO-BREAK SPACE
2000; EN QUAD
2001; EM QUAD
2002; EN SPACE
2003; EM SPACE
2004; THREE-PER-EM SPACE
2005; FOUR-PER-EM SPACE
2006; SIX-PER-EM SPACE
2007; FIGURE SPACE
2008; PUNCTUATION SPACE
2009; THIN SPACE
200A; HAIR SPACE
202F; NARROW NO-BREAK SPACE
3000; IDEOGRAPHIC SPACE
1680; OGHAM SPACE MARK
200B; ZERO WIDTH SPACE

5.3 Control characters

Control characters cannot be seen and can cause unpredictable results
when displayed.

0000-001F; [CONTROL CHARACTERS]
007F; DELETE
0080-009F; [CONTROL CHARACTERS]
2028; LINE SEPARATOR
2029; PARAGRAPH SEPARATOR

5.4 Private use and replacement characters

Because private-use characters do not have defined meanings, they are
prohibited.
The private-use characters are:

E000-F8FF; [PRIVATE USE, PLANE 0]
F0000-FFFFD; [PRIVATE USE, PLANE 15]
100000-10FFFD; [PRIVATE USE, PLANE 16]

The replacement character (U+FFFD) has no known semantic definition in a
name, and is often displayed by renderers to indicate "there would be some
character here, but it cannot be rendered". For example, on a computer
with no Asian fonts, a name with three katakana characters might be
rendered with three replacement characters.

FFFD; REPLACEMENT CHARACTER

5.5 Non-character code points

Non-character code points are code points that have been assigned in
ISO/IEC 10646 but are not characters. Because they are already assigned,
they are guaranteed not to later change into characters.

FFFE-FFFF; [NONCHARACTER CODE POINTS]
1FFFE-1FFFF; [NONCHARACTER CODE POINTS]
2FFFE-2FFFF; [NONCHARACTER CODE POINTS]
3FFFE-3FFFF; [NONCHARACTER CODE POINTS]
4FFFE-4FFFF; [NONCHARACTER CODE POINTS]
5FFFE-5FFFF; [NONCHARACTER CODE POINTS]
6FFFE-6FFFF; [NONCHARACTER CODE POINTS]
7FFFE-7FFFF; [NONCHARACTER CODE POINTS]
8FFFE-8FFFF; [NONCHARACTER CODE POINTS]
9FFFE-9FFFF; [NONCHARACTER CODE POINTS]
AFFFE-AFFFF; [NONCHARACTER CODE POINTS]
BFFFE-BFFFF; [NONCHARACTER CODE POINTS]
CFFFE-CFFFF; [NONCHARACTER CODE POINTS]
DFFFE-DFFFF; [NONCHARACTER CODE POINTS]
EFFFE-EFFFF; [NONCHARACTER CODE POINTS]
FFFFE-FFFFF; [NONCHARACTER CODE POINTS]
10FFFE-10FFFF; [NONCHARACTER CODE POINTS]

5.6 Surrogate codes

The following code points are permanently reserved for use as surrogate
code values in the UTF-16 encoding, will never be assigned to
characters, and are therefore prohibited:

D800-DFFF; [SURROGATE CODES]

5.7 Inappropriate for plain text

The following characters should not appear in regular text.

FFF9; INTERLINEAR ANNOTATION ANCHOR
FFFA; INTERLINEAR ANNOTATION SEPARATOR
FFFB; INTERLINEAR ANNOTATION TERMINATOR
FFFC; OBJECT REPLACEMENT CHARACTER

5.8 Inappropriate for domain names

The ideographic description characters allow different sequences of
characters to be rendered the same way, which makes them
inappropriate for host names that must have a single canonical order.

2FF0-2FFF; [IDEOGRAPHIC DESCRIPTION CHARACTERS]

5.9 Change display properties

The following characters, some of which are deprecated in ISO/IEC 10646,
can cause changes in display or the order in which characters appear
when rendered.

200E; LEFT-TO-RIGHT MARK
200F; RIGHT-TO-LEFT MARK
202A; LEFT-TO-RIGHT EMBEDDING
202B; RIGHT-TO-LEFT EMBEDDING
202C; POP DIRECTIONAL FORMATTING
202D; LEFT-TO-RIGHT OVERRIDE
202E; RIGHT-TO-LEFT OVERRIDE
206A; INHIBIT SYMMETRIC SWAPPING
206B; ACTIVATE SYMMETRIC SWAPPING
206C; INHIBIT ARABIC FORM SHAPING
206D; ACTIVATE ARABIC FORM SHAPING
206E; NATIONAL DIGIT SHAPES
206F; NOMINAL DIGIT SHAPES

5.10 Inappropriate characters from common input mechanisms

U+3002 is used as if it were U+002E in many input mechanisms,
particularly in Asia. This prohibition allows input mechanisms to safely
map U+3002 to U+002E before doing nameprep without worrying about
preventing users from accessing legitimate host name parts.

3002; IDEOGRAPHIC FULL STOP

6. Unassigned Code Points

All code points not assigned in ISO/IEC 10646 are called "unassigned
code points". Authoritative name servers MUST NOT have internationalized
name parts that contain any unassigned code points. DNS requests MAY
contain name parts that contain unassigned code points. Note that this
is the only part of this document where the requirements for queries
differ from the requirements for names in DNS zones.

Using two different policies for where unassigned code points can appear
in the DNS avoids the need for versioning the IDN protocol [IDNrev].
This is very useful since it makes the overall processing simpler and
does not impose a "protocol" to handle versioning. It is expected that
ISO/IEC 10646 will be updated fairly frequently; recently, it has
happened approximately once a year. Each time a new version of ISO/IEC
10646 appears, a new version of this document can be created.
Some end users will want
to use the new code points as soon as they are defined.

The list of unassigned code points can be found in Appendix G of this
document. The list in Appendix G MUST be used by implementations of this
specification. If there are any discrepancies between the list in
Appendix G and the ISO/IEC 10646 specification, the list in Appendix G
always takes precedence.

Due to the way that versioning is handled in this section, host names
that are embedded in structures that cannot be changed (such as the
signed parts of digital certificates) MUST NOT have internationalized
name parts that contain any unassigned code points.

6.1 Categories of code points

Each code point in ISO/IEC 10646 can be categorized by how it acts in the
process described in earlier sections of this document:

AO   Code points that may be in the output

MN   Code points that cannot be in the output because they are mapped
     to nothing or never appear as output from normalization

D    Code points that cannot be in the output because they are
     disallowed in the prohibition step

U    Unassigned code points

A subsequent version of this document that references a newer version of
ISO/IEC 10646 with new code points will inherently have some code points
move from category U to either D, MN, or AO. For backwards
compatibility, no future version of this document will move code points
from any other category. That is, no current AO, MN, or D code points
will ever change to a different category.

Authoritative name servers MUST NOT contain any name that has code
points outside of AO for the latest version of this document. That is,
they are forbidden to contain any IDN names containing code points from
the MN, D, or U categories.

Applications creating name queries MUST treat U code points as if they
were AO when preparing the name parts according to this document.
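
The difference between the two policies can be sketched in a few lines of
Python. Everything here is an assumption made for illustration:
str.casefold() and unicodedata stand in for the Appendix E mapping table
and the [UTR15] version this document pins, the "Cn" general category
stands in for the Appendix G list, the prohibited ranges are only a
partial copy of Section 5, and the names prepare and allow_unassigned are
hypothetical. It is not a conforming implementation.

```python
import unicodedata

# Section 3.3 "mapped out" code points, copied from this document.
MAPPED_OUT = {0x00AD, 0x1806, 0x200B, 0xFEFF,
              0x180B, 0x180C, 0x180D, 0x200C, 0x200D}

# A PARTIAL copy of the Section 5 ranges; Appendix F is the real list.
PROHIBITED_RANGES = [
    (0x0000, 0x002C), (0x002E, 0x002F), (0x003A, 0x0040),
    (0x005B, 0x0060), (0x007B, 0x009F), (0x00A0, 0x00A0),
    (0x2000, 0x200A), (0x3002, 0x3002), (0xD800, 0xDFFF),
    (0xE000, 0xF8FF), (0xFFF9, 0xFFFF),
]


def is_prohibited(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PROHIBITED_RANGES)


def is_unassigned(ch):
    # Assumption: "Cn" in the interpreter's Unicode data approximates
    # category U; the Appendix G list is what implementations MUST use.
    return unicodedata.category(ch) == "Cn"


def prepare(label, allow_unassigned):
    """Map, normalize, then check one label (illustrative sketch)."""
    mapped = "".join(ch.casefold() for ch in label
                     if ord(ch) not in MAPPED_OUT)
    normalized = unicodedata.normalize("NFKC", mapped)
    for ch in normalized:
        if is_prohibited(ch):
            raise ValueError(f"prohibited code point U+{ord(ch):04X}")
        if is_unassigned(ch) and not allow_unassigned:
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")
    return normalized
```

An application building a query would call prepare(label,
allow_unassigned=True), so unassigned code points pass through untouched;
a check on names entering a zone would pass allow_unassigned=False.
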
Those
applications MAY have a preprocessing step that provides stricter
checks: treating unassigned code points in the input as errors, or
warning the user about the fact that the code point is unassigned in the
version of this document that the software is based on; such a choice is
a local matter for the software.

Non-authoritative DNS servers MAY reject names that contain code points
that are in categories MN or D for the version of this document that
they implement, but MUST NOT reject names because they contain name
parts with code points from category U.

6.2 Reasons for difference between authoritative servers and requests

Different software using different versions of this document needs to
interoperate with maximal compatibility. The scheme described in this
section (authoritative name servers MUST NOT use unassigned code points,
requests MAY include unassigned code points) allows that compatibility
without introducing any known security or interoperability issues.

The list below shows what happens if a request contains a code point
from category U that is allowed in a newer version of this document. The
request either resolves to the domain name that was intended, or
resolves to no domain at all. In this list, the request comes from an
application using version "oldVersion" of this document, the
authoritative name server is using version "newVersion" of this
document, and the code point X was in category U on oldVersion, and has
changed category to AO, MN, or D. There are 3 possible scenarios:

1. X becomes AO -- In newVersion, X is in category AO. Because the
application passed X through, it gets back correct data from the
authoritative name server. There is one exceptional case, where X is a
combining mark.

The order of combining marks is normalized, so if another combining mark
Y has a lower combining class than X then XY will be put in the
canonical order YX. (Unassigned code points are never reordered, so this
doesn't happen in oldVersion).
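
The reordering just described can be observed with Python's unicodedata
module. This is only a sketch under stated assumptions: the module uses
the Unicode data bundled with the interpreter rather than the [UTR15]
version this document pins, U+0301 and U+0323 are arbitrary stand-ins for
X and Y, U+0378 is used as an example of a code point that is unassigned
in that data, and the helper name prepared is hypothetical.

```python
import unicodedata

X = "\u0301"           # COMBINING ACUTE ACCENT (stands in for X)
Y = "\u0323"           # COMBINING DOT BELOW (lower combining class)
UNASSIGNED = "\u0378"  # no character assigned; combining class 0


def prepared(text):
    """Normalize with form KC using the interpreter's Unicode data."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Y has the lower combining class, so XY is reordered to YX.
    print(unicodedata.combining(Y) < unicodedata.combining(X))
    print(prepared("q" + X + Y) == "q" + Y + X)
    # YX is already in canonical order and is left alone.
    print(prepared("q" + Y + X) == "q" + Y + X)
    # An unassigned code point followed by Y is not reordered.
    print(prepared("q" + UNASSIGNED + Y) == "q" + UNASSIGNED + Y)
```

The base letter "q" is chosen because it has no precomposed forms with
these marks, so composition does not obscure the reordering.
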
If the request contains YX, the request
will get correct data from the authoritative name server. However, no
domain name can be registered with XY, so a request with XY will get a
"no such host" error.

2. X becomes MN -- In newVersion, X is normalized to code point "nX" and
therefore X is now put in category MN. This cannot exist in any domain
name, so any request containing X will get back a "no such host" error.
Note, however, if the request had contained the letter nX, it would have
gotten back correct data.

3. X becomes D -- In newVersion, X is in category D. This cannot exist
in any domain name, so any request containing X will get back a "no such
host" error.

In none of the cases does the request get data for a host name other
than the one it actually wanted.

The processing in this document is always stable. If a string S is the
result of processing on newVersion, then it will remain the same when
processed on oldVersion.

There is always a way for the application to get the correct data from
the authoritative name server. For example, suppose that <ALPHA> was
unassigned in oldVersion, and that it is assigned in newVersion, but
case-folded to <alpha>. As long as the application supplies strings
containing <alpha> instead of <ALPHA>, the correct data will be
returned. Because the processing is stable, a different application
running newVersion can pass a processed host name to the application
running oldVersion. It will only contain <alpha>, and will return the