ISO 10646 basically defines only code points, and not rules for using or comparing the characters. This is part of a long-standing tradition with the work of what is now ISO/IEC JTC1/SC2: they have performed code point assignments and have typically treated the ways in which characters are used as beyond their scope. Consequently, they have not dealt effectively with the broader range of internationalization issues. By contrast, the Unicode Technical Committee (UTC) has defined, in annexes and technical reports (see, e.g., [UTR15]), some additional rules for canonicalization and comparison. Many of those rules and conventions have been factored into the "stringprep" and "nameprep" work, but it is not straightforward to define them in a fashion that is sufficiently precise and permanent to be relied on by the DNS.

Perhaps more important, the discussions leading to nameprep also identified several areas in which the UTC definitions are inadequate, at least without additional information, to make matching precise and unambiguous. In some of these cases, the Unicode Standard permits several alternate approaches, none of which is an exact and obvious match to DNS needs. That has left these sensitive choices up to the IETF, which lacks sufficient in-depth expertise, much less any mechanism for deciding to optimize one language at the expense of another. For example, it is tempting to define some rules on the basis of membership in particular scripts, or for punctuation characters, but there is no precise definition of which characters belong to which script or which ones are, or are not, punctuation. The existence of these areas of vagueness raises two issues: whether trying to do precise matching at the character set level is actually possible (addressed below) and whether driving toward more precision could create issues that cause instability in the implementation and resolution models for the DNS.

The Unicode definition also evolves. Version 3.2 appeared shortly after work on this document was initiated. It added some characters and functionality and included a few minor incompatible code point changes. The IETF has secured an agreement about constraints on future changes, but it remains to be seen how that agreement will work out in practice. The prognosis actually appears poor at this stage, since the UTC chose to ballot a recent possible change which should have been prohibited by the agreement (the outcome of the ballot is not relevant, only that the ballot was issued rather than having the result be a foregone conclusion). However, some members of the community consider some of the changes between Unicode 3.0 and 3.1 and between 3.1 and 3.2, as well as this recent ballot, to be evidence of instability, and believe that these instabilities are better handled in a system that can be more flexible about handling of characters, scripts, and ancillary information than the DNS.

In addition, because the systems implications of internationalization are considered out of scope in SC2, ISO/IEC JTC1 has assigned some of those issues to its SC22/WG20 (the internationalization working group within the subcommittee that deals with programming languages, systems, and environments). WG20 has historically dealt with internationalization issues thoughtfully and in depth, but its status has several times been in doubt in recent years. However, assignment of these matters to WG20 increases the risk of eventual ISO internationalization standards that specify different behavior than the UTC specifications.
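To make the canonicalization and comparison issue concrete, the sketch below (in Python; it is illustrative only and is not taken from nameprep or from this document) shows the general kind of preprocessing that stringprep/nameprep profiles apply before matching: map away case distinctions and fold alternate code point sequences into a single form. The helper name and sample strings are assumptions made for illustration, and a simple case fold plus NFKC normalization is only an approximation of the real nameprep tables.

   # Illustrative sketch of a stringprep/nameprep-style mapping step,
   # using Unicode normalization (NFKC) plus case folding.  This is an
   # approximation for illustration, not the actual nameprep profile.
   import unicodedata

   def canonical_form(label: str) -> str:
       # NFKC folds compatibility variants (e.g., full-width letters)
       # and composes base characters with combining marks.
       return unicodedata.normalize("NFKC", label).casefold()

   # Case distinctions are mapped away ...
   assert canonical_form("Example") == canonical_form("EXAMPLE")

   # ... and a full-width "e" (U+FF45) is folded to the ordinary "e".
   assert canonical_form("\uff45") == canonical_form("e")

Even with such a step, the questions discussed above -- script membership, what counts as punctuation, and language-sensitive equivalences -- are not answered by normalization alone.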
4.5 Audiences, End Users, and the User Interface Problem

Part of what has "caused" the DNS internationalization problem, as well as the DNS trademark problem and several others, is that we have stopped thinking about "identifiers for objects" -- which normal people are not expected to see -- and started thinking about "names" -- strings that are expected not only to be readable, but to have linguistically-sensible and culturally-dependent meaning to non-specialist users.

Within the IETF, the IDN-WG and sometimes other groups avoided addressing the implications of that transition by taking "outside our scope -- someone else's problem" approaches or by suggesting that people will just become accustomed to whatever conventions are adopted. The realities of user and vendor behavior suggest that these approaches will not serve the Internet community well in the long term:

   o  If we want to make it a problem in a different part of the user interface structure, we need to figure out where it goes in order to have a proof of concept of our solution. Unlike vendors whose sole [business] model is the selling or registering of names, the IETF must produce solutions that actually work, in the applications context as seen by the end user.

   o  The principle that "they will get used to our conventions and adapt" is fine if we are writing rules for programming languages or an API. But the conventions under discussion are not part of a semi-mathematical system; they are deeply ingrained in culture. No matter how often an English-speaking American is told that the Internet requires that the correct spelling of "colour" be used, he or she isn't going to be convinced. Getting a French-speaker in Lyon to use exactly the same lexical conventions as a French-speaker in Quebec in order to accommodate the decisions of the IETF or of a registrar or registry is just not likely. "Montreal" is either a misspelling or an anglicization of a similar word with an acute accent mark over the "e" (i.e., using the Unicode character U+00E9 or one of its equivalents). But global agreement on a rule that will determine whether the two forms should match -- and that won't astonish end users and speakers of one language or the other -- is as unlikely as agreement on whether "misspelling" or "anglicization" is the greater travesty.

More generally, it is not clear that the outcome of any conceivable nameprep-like process is going to be good enough for practical, user-level use. In the use of human languages by humans, there are many cases in which things that do not match are nonetheless interpreted as matching. The Norwegian/Danish character at code point U+00F8 (visually, a lower case 'o' overstruck with a forward slash) and the German "o-umlaut" character at code point U+00F6 (visually, a lower case 'o' with a diaeresis, or umlaut) are clearly different, and no matching program should yield an "equal" comparison. But they are more similar to each other than either of them is to, e.g., "e". Humans are able to make the correction mentally in context, and do so easily, and they can be surprised if computers cannot do so. Worse, there is a Swedish character whose appearance is identical to the German o-umlaut, and which shares code point U+00F6, but which, if the languages are known and the sounds of the letters or the meanings of words including the character are considered, actually should match the Norwegian/Danish use of U+00F8.
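The matching question raised by the "Montreal" example can be split into a part that character-level rules do settle and a part that they cannot. The short sketch below (Python; the strings are invented examples and the mapping function is a stand-in for a nameprep-style rule, not the real profile) illustrates that split.

   # Illustrative only: a stand-in for a nameprep-style mapping.
   import unicodedata

   def nameprep_like(label: str) -> str:
       return unicodedata.normalize("NFKC", label).casefold()

   # Two encodings of the accented form -- precomposed U+00E9 versus
   # "e" followed by U+0301 (combining acute) -- are canonically
   # equivalent, so normalization makes them compare equal.  That
   # much is well defined.
   assert nameprep_like("montr\u00e9al") == nameprep_like("montre\u0301al")

   # But no normalization form equates the accented and unaccented
   # spellings, or U+00F8 with U+00F6; whether such strings "should"
   # match is exactly the language- and culture-dependent judgement
   # that cannot be made at the code point level.
   assert nameprep_like("montreal") != nameprep_like("montr\u00e9al")
   assert nameprep_like("\u00f8") != nameprep_like("\u00f6")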
This text uses examples in Roman scripts because it is being written in English and those examples are relatively easy to render. But one of the important lessons of the discussions about domain name internationalization in recent years is that problems similar to those described above exist in almost every language and script. Each one has its idiosyncrasies, and each set of idiosyncrasies is tied to common usage and cultural issues that are very familiar in the relevant group, and often deeply held as cultural values.

As long as a schoolchild in the US can get a bad grade on a spelling test for using a perfectly valid British spelling, or one in France or Germany can get a poor grade for leaving off a diacritical mark, there are issues with the relevant language. Similarly, if children in Egypt or Israel are taught that it is acceptable to write a word with or without vowels or stress marks, but that, if those marks are included, they must be the correct ones, or if a user in Korea is potentially offended or astonished by out-of-order sequences of Jamo, then systems based on character-at-a-time processing and simplistic matching, with no contextual information, are not going to satisfy user needs.

Users are demanding solutions that deal with language and culture. Systems of identifier symbol-strings that serve specialists or computers are, at best, a solution to a rather different (and, at the time this document was written, somewhat ill-defined) problem. The recent efforts have made it ever more clear that, if we ignore the distinction between the user requirements and narrowly-defined identifiers, we are solving an insufficient problem. And, conversely, the approaches that have been proposed to approximate solutions to the user requirement may be far more complex than simple identifiers require.

4.6 Business Cards and Other Natural Uses of Natural Languages

Over the last few centuries, local conventions have been established in various parts of the world for dealing with multilingual situations. It may be helpful to examine some of these. For example, if one visits a country where the language is different from one's own, business cards are often printed on two sides, one side in each language. The conventions are not completely consistent and the technique assumes that recipients will be tolerant. Translations of names or places are attempted in some situations and transliterations in others. Since it is widely understood that exact translations or transliterations are often not possible, people typically smile at errors, appreciate the effort, and move on.

The DNS situation differs from these practices in at least two ways. Since a global solution is required, the business card would need a number of sides approximating the number of languages in the world, which is probably impossible without violating the laws of physics. More important, the opportunities for tolerance don't exist: the DNS requires an exact match or the lookup fails.
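The "exact match or the lookup fails" point can be made concrete. In the base DNS protocol, name comparison is an octet-by-octet match in which only the ASCII upper/lower case distinction is ignored; there is no notion of a near miss. The sketch below (Python; the labels are invented and the function is a simplified illustration, not code from any DNS implementation or from this document) shows why the tolerance available to readers of business cards is structurally absent here.

   # Simplified illustration of DNS label comparison: octet-by-octet,
   # case-insensitive only for ASCII letters.  Anything else is simply
   # a non-match.
   def dns_labels_match(a: bytes, b: bytes) -> bool:
       if len(a) != len(b):
           return False
       for x, y in zip(a, b):
           # Fold ASCII A-Z to a-z; all other octets must be identical.
           if 0x41 <= x <= 0x5a:
               x += 0x20
           if 0x41 <= y <= 0x5a:
               y += 0x20
           if x != y:
               return False
       return True

   # ASCII case differences are tolerated ...
   assert dns_labels_match(b"Example", b"eXAMPLE")

   # ... but a missing accent (or a different encoding of the "same"
   # word) is a complete failure to match, not a forgivable error.
   assert not dns_labels_match("montr\u00e9al".encode("utf-8"), b"montreal")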
4.7 ASCII Encodings and the Roman Keyboard Assumption

Part of the argument for ACE-based solutions is that they provide an escape for multilingual environments when applications have not been upgraded. When an older application encounters an ACE-based name, the assumption is that the (admittedly ugly) ASCII-coded string will be displayed and can be typed in. This argument is reasonable for mixtures of Roman-based alphabets, but it may not be relevant if user-level systems and devices are involved that do not support the entry of Roman-based characters or that cannot conveniently render such characters. Such systems are few in the world today, but their number can reasonably be expected to rise as the Internet is increasingly used by populations whose primary concern is with local issues, local information, and local languages. It is, for example, fairly easy to imagine populations who use Arabic or Thai scripts and who do not have routine access to scripts or input devices based on Roman-derived alphabets.

4.8 Intra-DNS Approaches for "Multilingual Names"

It appears, from the cases above and others, that none of the intra-DNS-based solutions for "multilingual names" are workable. They rest on too many assumptions that do not appear to be feasible -- that people will adapt deeply-entrenched language habits to conventions laid down to make the lives of computers easy; that we can make "freeze it now, no need for changes in these areas" decisions about Unicode and nameprep; that ACE will smooth over applications problems, even in environments without the ability to key or render Roman-based glyphs (or where user experience is such that such glyphs cannot easily be distinguished from each other); that the Unicode Consortium will never decide to repair an error in a way that creates a risk of DNS incompatibility; that we can either deploy EDNS [RFC2671] or that long names are not really important; that Japanese and Chinese computer users (and others) will either give up their local or IS 2022-based character coding solutions (for which the addition of a large fraction of a million new code points to Unicode is almost certainly a necessary, but probably not sufficient, condition) or build leakproof and completely accurate boundary conversion mechanisms; that out-of-band or contextual information will always be sufficient for the "map glyph onto script" problem; and so on. In each case, it is likely that about 80% or 90% of cases will work satisfactorily