Internet Draft                                              James SENG
<draft-ietf-idn-cjk-01.txt>                             Yoshiro YONEYA
11th Apr 2001                                              Kenny HUANG
Expires 11 Oct 2001                                       KIM Kyongsok

        Han Ideograph (CJK) for Internationalized Domain Names

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

During the development of Internationalized Domain Names (IDN), it
was discovered that there is a substantial lack of information, and
considerable misunderstanding, about Han ideographs and their folding
mechanism.

This document attempts to address some of the issues involved in Han
folding with respect to IDN. It aims to dispel some common
misunderstandings of the problem and to discuss some of the issues
with Han ideographs and their folding mechanism.

This document addresses a problem very specific to IDN and thus is not
meant as a reference for generic Han folding. Generic Han folding is
much more complicated and certainly beyond the scope of this document.
However, this document may be applicable to other areas that deal with
names, e.g. the Common Name Resolution Protocol [CNRP].

1. Definition and convention

Characters mentioned in this document are identified by their position
or code point in the Unicode character set [UCS].
The notation U+12AB, for example, indicates the character at position
12AB (hexadecimal) in the [UCS]. It is strongly recommended that a
[UCS] table be available for reference for the ideographs described.

Han ideographs are defined here as the Chinese ideographs in the range
U+3400 to U+9FFF, commonly known as CJK Unified Ideographs. This
covers Chinese 'hanzi' {U+6F22 U+5B57/U+6C49 U+5B57}, Japanese 'kanji'
{U+6F22 U+5B57} and Korean 'hanja' {U+6F22 U+5B57/U+D55C U+C790}.
Additional Han ideographs will appear in other locations (not
necessarily in plane 0) in the future.

Conversion between ideographs can be done using four different
approaches: code-based substitution, character-based substitution,
lexicon-based substitution and context-based substitution. Han folding
refers only to code-based substitution, similar to case mapping of
alphabetic characters.

2. Introduction

Traditionally, domain names have been case insensitive (as defined in
[RFC1035] Section 2.3.3). While this is not a problem when domain
names are restricted to English letters and digits, it becomes a
serious problem for IDN. An important criterion for a robust IDN is to
have good normalization and canonicalization forms, so that domain
name duplication is kept to a minimum.

Fortunately, the Unicode Consortium is developing technical reports on
canonicalization [UTR21] and normalization [UTR15]. Hence, it becomes
simple for IDN to build upon the work of Unicode and use these
references.

Unfortunately, both [UTR15] and [UTR21] are limited in scope and do
not address many other scripts.
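The gap is easy to observe for Han ideographs. The following is a
minimal Python sketch, not part of [UTR15] or [UTR21]. It assumes
Python's standard unicodedata module as a stand-in for the four
normalization forms, and the helper name normalization_unifies is
hypothetical. It checks whether any form maps the traditional and
simplified 'dragon' ideographs to the same string, and contrasts this
with lower-casing an ASCII name.

```python
import unicodedata

# Traditional and simplified forms of 'dragon'.
TC_DRAGON = "\u9f8d"
SC_DRAGON = "\u9f99"

# The four normalization forms exposed by the unicodedata module.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalization_unifies(a: str, b: str) -> bool:
    """Return True if any normalization form maps a and b to the same string."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


# Case mapping makes two ASCII spellings of a name compare equal.
ascii_names_match = "TAIWAN".lower() == "taiwan".lower()

# Normalization alone does not do the same for the TC/SC pair.
han_forms_match = normalization_unifies(TC_DRAGON, SC_DRAGON)
```

Here ascii_names_match is True while han_forms_match is False, so any
TC/SC equivalence has to come from a separate folding step.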
In particular, Han ideographs are not discussed in detail in these
documents, and most experts are quick to point out that this problem
is technically impossible to solve.

2.1 Han ideographs

While there are many forms or writing styles for Chinese characters,
the most commonly used, 'zhengti' {U+6B63 U+4F53/U+6B63 U+9AD4},
represents Chinese ideographs by radicals (U+2E80-U+2FDF) that are
composed of simple strokes.

When the Unicode Consortium started work on the Universal Character
Set, it was suggested that hanzi, kanji and hanja ideographs should be
unified into a single code space. This resulted in the CJK
Unification, whereby 27,786 Han ideographs are allocated in the
U+3400-U+9FFF and U+F900-U+FAFF ranges. Another 41,000 Han ideographs
will be added to Plane 2.

Ideographs are common in China, Korea and Japan, but as ideographs
spread and evolve, their forms sometimes differ slightly from country
to country. For example, the word 'villa' is 'zhuang' {U+838A} in
Chinese but 'sou' {U+8358} in Japanese. These are given different code
points in Unicode.

3. Chinese (Hanzi)

Chinese ideographs, or hanzi {U+6F22 U+5B57/U+6C49 U+5B57}, originated
from pictographs. They are 'pictures' that evolved into ideographs
over several thousand years. For instance, the ideograph for "hill"
{U+5C71} still bears some resemblance to the three peaks of a hill.

Not all ideographs are pictographs. There are other classifications,
such as compound ideographs, phonetic ideographs, etc. For example,
'endurance' {U+5FCD} is a pierced 'knife' {U+5200} above the 'heart'
{U+5FC3}, or, as a Chinese saying goes, 'endurance is like having a
pierced knife in your heart'.

Hence, almost every Han ideograph carries some meaning by itself,
which is very different from most other scripts. This leads to the
mistaken belief that Han folding is a form of lexicon-based
substitution.

Chinese ideographs underwent a major change in the 1950s after the
establishment of the People's Republic of China.
A committee on Language Reform was established in China whose
activities included the simplification of Chinese ideographs.
Simplified Chinese (SC) is used in China and Singapore, and
Traditional Chinese (TC) in Taiwan, Hong Kong PRC, Macau PRC, and most
other overseas Chinese communities.

The process is to take complex ideographs and simplify them. The main
purpose is to make them easier to remember and write, and thus to
raise the literacy of the population.

For example, 'lightning' TC {U+96FB} becomes SC {U+7535} (the 'rain'
{U+96E8} part of the TC is dropped). In many cases, the SC bears no
resemblance to the original traditional form, e.g. 'dragon' TC
{U+9F8D} SC {U+9F99}. Two different TC may also have the same SC,
since this means fewer ideographs to learn, e.g. SC {U+53D1} can be TC
{U+767C} or TC {U+9AEE} depending on semantics. The official
'Comprehensive List of Simplified Characters', most recently published
in 1986, lists 2244 SC [ZONGBIAO].

Therefore, SC-to-TC conversion is very complicated. It is not possible
to do it accurately without considering the semantics of the phrase.

On the other hand, TC-to-SC conversion is much simpler, although
different TC may map to one single SC. While Unicode does not handle
TC and SC, the informal [UNIHAN] document lists 2145 TC and their
equivalent SC mappings. However, because that document is informal and
not part of the Unicode standard, it is incomplete and has mistakes in
the code points. Hence, precise tables for TC-to-SC conversion have
not been fully laid out.

In domain names, we are particularly interested in equivalence
comparison of names, not in converting SC to TC. For this purpose,
equivalence matching can be done by TC-to-SC folding prior to
comparison, similar to lower-casing English strings before comparing
them, e.g. 'taiwan' SC {U+53F0 U+6E7E} will match TC {U+81FA U+7063}
or TC {U+53F0 U+7063}.

The side effect of this method is that comparing SC {U+53D1} to TC
{U+767C} or to TC {U+9AEE} will both be positive.
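The fold-then-compare approach, and its side effect, can be sketched
in a few lines of Python. This is only an illustration under stated
assumptions: the five-entry table TC_TO_SC and the function names
fold_tc_to_sc and labels_match are hypothetical, and the table is not
a vetted mapping. A real table would have to be derived from sources
such as [ZONGBIAO] or [UNIHAN].

```python
# Hypothetical, deliberately tiny TC-to-SC folding table, keyed by
# code point. Each entry maps one traditional ideograph to its
# simplified form.
TC_TO_SC = {
    0x81FA: 0x53F0,  # first ideograph of 'taiwan', traditional variant
    0x7063: 0x6E7E,  # second ideograph of 'taiwan'
    0x982D: 0x5934,  # 'head', first ideograph of 'hair'
    0x767C: 0x53D1,  # one of the two TC that share SC U+53D1
    0x9AEE: 0x53D1,  # the other TC sharing SC U+53D1 (as in 'hair')
}


def fold_tc_to_sc(label: str) -> str:
    """Replace every TC code point found in the table by its SC form."""
    return label.translate(TC_TO_SC)


def labels_match(a: str, b: str) -> bool:
    """Compare two labels after folding, like lower-casing ASCII names."""
    return fold_tc_to_sc(a) == fold_tc_to_sc(b)


# 'taiwan' in SC matches both TC spellings.
taiwan_matches = (
    labels_match("\u53f0\u6e7e", "\u81fa\u7063")
    and labels_match("\u53f0\u6e7e", "\u53f0\u7063")
)

# Side effect: two distinct TC ideographs become indistinguishable.
distinct_tc_collide = labels_match("\u767c", "\u9aee")
```

Both taiwan_matches and distinct_tc_collide are True. Folding is
one-way (many TC to one SC), so no table is needed for the reverse
direction, at the cost of treating some unrelated TC strings as equal.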
This implies that 'hair' SC {U+5934 U+53D1} will match TC {U+982D
U+9AEE}. It will also match TC {U+982D U+767C}, which does not have
any meaning in Chinese.

It should also be noted that SC are not used together with TC. Hence,
'hair' is written either as SC {U+5934 U+53D1} or as TC {U+982D
U+9AEE} but (almost) never as {U+5934 U+9AEE} or {U+982D U+53D1}. So
the problem of SC and TC may not be too serious for IDN.

Unfortunately, when it comes to names in Chinese, in places where SC
are used (i.e. Singapore and China), traditional and simplified
ideographs are sometimes mixed within a single name for artistic
reasons. Some people even 'create' ideographs for their names.

[Need to add a section on Bopomofo U+3118 to U+312A in future draft]

4. Korean (Hanja and Hangeul)

Korea was one of the first cultures to import Chinese ideographs into
its language as a written form. These Korean ideographs are known as
'hanja' {U+6F22 U+5B57/U+D55C U+C790} and were widely used until
recently, when 'hangeul' {U+D55C U+AE00} became more popular.

Hangeul {U+D55C U+AE00} is a systematic script designed by a 15th
century ruler and linguistic expert, King Sejong {U+4E16 U+5B97}. It
is based on the pronunciation of the Korean language, hanmal. A Korean
syllable is composed of 'jamo' {U+5B57 U+6BCD/U+C790 U+BAA8} elements
that represent different sounds. Hence, unlike a Han ideograph, a
hangeul syllable does not have any meaning by itself.

Each hanja ideograph can be represented by a hangeul syllable. For
example, 'samsung' is hanja {U+4E09 U+661F} and hangeul {U+C0BC
U+C131}. Note that {U+4E09} is pronounced 'sa-ah-am', or in jamo
{U+3145} {U+314F} {U+3141}, which gives hangeul {U+C0BC}. While jamo
decompositions are described in [UTR15] under Form D decomposition,
this document also suggests another hangeul canonical decomposition in
Appendix A to accommodate both modern and old hangeul.

[Need to fill up Appendix A when information is more complete]

Most hanja characters have only one pronunciation.
However, the pronunciation of some hanja differs according to
orthography (the same holds for Chinese and Japanese) or position in a
word, which makes this more complex. And of course, conversion of
hangeul back to hanja is impossible by code substitution without
consideration of semantics.

Koreans also invented their own ideographs, which are called 'gugja'
{U+56FD U+5B57/U+AD6D U+C790}.

5. Japanese (Kanji, Hiragana, Katakana)

Japanese has adopted Chinese ideographs from the Koreans and the
Chinese since the 5th century. Chinese ideographs in Japanese are
known as 'kanji' {U+6F22 U+5B57}. The Japanese also developed their
own syllabaries, hiragana {U+5E73 U+4EEE U+540D} (U+3040-U+309F) and
katakana {U+7247 U+4EEE U+540D} (U+30A0-U+30FF); both are derived from
kanji that have the same pronunciation. Hiragana is a simplified
cursive form; for example, 'a' {U+3042} was derived from 'an'
{U+5B89}. Katakana is a simplified partial form; for example, 'a'
{U+30A2} was derived from 'a' {U+963F}. However, kanji remain very
much integrated within the Japanese language.

Japanese also invented ideographs known as 'kokuji' {U+56FD U+5B57}.
For example, 'iwashi' {U+9C2F} is a Japanese kokuji ideograph. Kokuji
are invented according to Han ligature rules. For example, 'touge'
"mountain pass" {U+5CE0} is a conjunction of meanings: 'yama'
"mountain" {U+5C71} + 'ue' "up" {U+4E0A} + 'shita' "down" {U+4E0B}.

Japanese is also a vocal language, i.e. the script itself is based on
pronunciation. Each hiragana corresponds to one pronunciation, and 48
hiragana form the basis of the Japanese language, including the less
commonly used 'we' {U+3091}. Furthermore, hiragana has 35 more forms
to represent voiced sounds, P-sounds and double consonants. For
example, 'ga' {U+304C} is the voiced form of 'ka' {U+304B}. Katakana
mirrors hiragana with a few more forms and is used to integrate
foreign words or phrases into Japanese, to emphasize words or phrases
even in Japanese, or to represent onomatopoeia.
For example, 'hamburger', pronounced 'han-baa-gaa' in Japanese, is
written as {U+30CF U+30F3 U+30D0 U+30FC U+30AC U+30FC} instead of
{U+306F U+3093 U+3070 U+3041 U+304C U+3041} because it is a foreign
word.

If Japanese used hiragana and katakana only, written Japanese would
clearly be very long. Hence, kanji are used for nouns and verbs. Each
kanji corresponds to one or more hiragana characters. For example,
'japan', pronounced 'nippon' {U+306B U+3063 U+307D U+3093}, is written
as {U+65E5 U+672C} instead.

Hiragana, like Korean jamo, has no meaning by itself. Also, a kanji
can take on different pronunciations (which means different hiragana)
depending on where and how it is used in the sentence. For example,
'sky' {U+7A7A} can be pronounced as {U+305D U+3089} or {U+30BD
U+30E9}. Hence, code substitution between hiragana and kanji is
impractical.

On the other hand, there are kanji that have the same meaning and the
same pronunciation and are equivalent. For example, 'river' "kawa" can
be either {U+5DDD} or {U+6CB3}. The only difference between the two
ideographs is that they signify the 'size of the river' (the latter is
a bigger river).

Japanese also reduced complex Chinese ideographs to simplified forms.
For example, 'both' {U+5169} was simplified to {U+4E21}. Note that
Chinese simplified it to {U+4E24} instead. However, traditional
Japanese kanji are seldom used nowadays beyond documenting old
historical text, where they are treated differently from the more
commonly used simplified forms, or expressing proper nouns such as a
person's name or a trademark. Hence, Han folding here is not
recommended.

6. Vietnamese

While Vietnamese also adopted Chinese ideographs ('chu han') and
created its own ideographs ('chu nom'), these have now been replaced
by romanized 'quoc ngu'. Hence, this document does not attempt to
address any issues with 'chu han' or 'chu nom'.

7. zVariant

Unicode has a three-dimensional conceptual model for ideograph
unification.
The three dimensions are semantic (X-axis - meaning, function),
abstract shape (Y-axis - general form) and actual shape (Z-axis