First we start with which characters to include.  We call this
collection of characters a I<character repertoire>.

=item *

Then we have to give each character a unique ID so your computer can
tell the difference between 'a' and 'A'.  This itemized character
repertoire is now a I<character set>.

=item *

If your computer can handle the character set without further
processing, you can go ahead and use it.  This is called a I<coded
character set> (CCS) or I<raw character encoding>.  ASCII is used this
way in most cases.

=item *

But in many cases, especially with multi-byte CJK encodings, you have to
tweak a little more.  Your network connection may not accept any data
with the Most Significant Bit set, and your computer may not be able to
tell if a given byte is a whole character or just half of one.  So you
have to I<encode> the character set to use it.

A I<character encoding scheme> (CES) determines how to encode a given
character set, or a set of multiple character sets.  7-bit ISO-2022 is
an example of a CES.  You switch between character sets via I<escape
sequences>.

=back

Technically, or mathematically, speaking, a character set encoded in
a CES that maps character by character may form a CCS.  EUC is one
such example.  The CES of EUC is as follows:

=over 2

=item *

Map ASCII unchanged.

=item *

Map a character set that consists of 94 or 96 to the power of N
members by adding 0x80 to each byte.

=item *

You can also use 0x8e and 0x8f to indicate that the following sequence of
characters belongs to yet another character set.  The value 0x80 is
added to each following byte.

=back

By looking carefully at the encoded byte sequence, you can see that
each byte sequence corresponds to a unique number.  In that sense, EUC
is a CCS generated by the CES above from up to four CCSs (complicated?).
UTF-8 falls into this category.  See L<perlunicode/"UTF-8"> to find out
how UTF-8 maps Unicode to a byte sequence.

You may also have found out by now why 7-bit ISO-2022 cannot form
a CCS.
If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<" ">.

=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)

This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.

=over 2

=item *

To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.

=back

Encoding names

  US-ASCII    UTF-8    ISO-8859-*  KOI8-R
  Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
  EUC-KR      Big5     GB2312

are registered with IANA as preferred MIME names and may
be used over the Internet.

C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.

C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.

The C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.

  EUC-CN
  KOI8-U        [RFC2319]

have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.

  KS_C_5601-1987

is heavily misused.
See L<Microsoft-related naming mess> for details.

The C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware, however, that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

=item *

C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.)
while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)

=item *

it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!

=back

The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.

  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)

are totally valid encodings but not registered at IANA.
The names under which they are listed here are probably the
most widely-known names for these encodings and are recommended
names.

  BIG5PLUS (**)

is a proprietary name.

=head2 Microsoft-related naming mess

Microsoft products misuse the following names:

=over 2

=item KS_C_5601-1987

Microsoft extension to C<EUC-KR>.

Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).

See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.

Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect this common
misuse. I<Raw> C<KS_C_5601-1987> encoding is available as
C<kcs5601-raw>.

See L<Encode::KR> for details.

=item GB2312

Microsoft extension to C<EUC-CN>.

Proper names: C<CP936>, C<GBK>.

C<GB2312> has been registered in the C<EUC-CN> meaning at
IANA. This has partially repaired the situation: Microsoft's
C<GB2312> has become a superset of the official C<GB2312>.

Encode aliases C<GB2312> to C<euc-cn> in full agreement with
IANA registration.
C<cp936> is supported separately.
I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.

See L<Encode::CN> for details.

=item Big5

Microsoft extension to C<Big5>.

Proper name: C<CP950>.

Encode separately supports C<Big5> and C<cp950>.

=item Shift_JIS

Microsoft's understanding of C<Shift_JIS>.

JIS has not endorsed the full Microsoft standard, however.
The official C<Shift_JIS> includes only the JIS X 0201 and JIS X 0208
character sets, while Microsoft has always used C<Shift_JIS>
to encode a wider character repertoire. See the C<IANA> registration for
C<Windows-31J>.

As a historical predecessor, Microsoft's variant
probably has a stronger claim to the name, though it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

Unambiguous name: C<CP932>. C<IANA> name (also used by Mozilla, and
provided as an alias by Encode): C<Windows-31J>.

Encode separately supports C<Shift_JIS> and C<cp932>.

=back

=head1 Glossary

=over 2

=item character repertoire

A collection of unique characters.  A I<character> set in the strictest
sense. At this stage, characters are not numbered.

=item coded character set (CCS)

A character set that is mapped in a way computers can use directly.
Many character encodings, including EUC, fall in this category.

=item character encoding scheme (CES)

An algorithm to map a character set to a byte sequence.  You don't
have to be able to tell which character set a given byte sequence
belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
example of being both a CCS and a CES.

=item charset (in MIME context)

Has long been used to mean C<encoding>, that is, a CES.

While the word combination C<character set> has lost this meaning
in MIME context since [RFC 2130], the C<charset> abbreviation has
retained it.
This is how [RFC 2277] and [RFC 2278] bless C<charset>:

 This document uses the term "charset" to mean a set of rules for
 mapping from a sequence of octets to a sequence of characters, such
 as the combination of a coded character set and a character encoding
 scheme; this is also what is used as an identifier in MIME "charset="
 parameters, and registered in the IANA charset registry ...  (Note
 that this is NOT a term used by other standards bodies, such as ISO).
 [RFC 2277]

=item EUC

Extended Unix Code.  See ISO-2022.

=item ISO-2022

A CES that was carefully designed to coexist with ASCII.  There is a
7-bit version and an 8-bit version.

The 7-bit version switches character sets via escape sequences so it
cannot form a CCS.  Since this is more difficult to handle in programs
than the 8-bit version, the 7-bit version is not very popular except for
iso-2022-jp, the I<de facto> standard CES for e-mails.

The 8-bit version can form a CCS.  EUC and ISO-8859 are two examples
thereof.  Pre-5.6 perl could use them as string literals.

=item UCS

Short for I<Universal Character Set>.  When you say just UCS, it means
I<Unicode>.

=item UCS-2

ISO/IEC 10646 encoding form: Universal Character Set coded in two
octets.

=item Unicode

A character set that aims to include all character repertoires of the
world.  Many character sets in various national as well as industrial
standards have become, in a way, just subsets of Unicode.

=item UTF

Short for I<Unicode Transformation Format>.  Determines how to map a
Unicode character into a byte sequence.

=item UTF-16

A UTF in 16-bit encoding.  Can be either big endian or little
endian.
The big endian version is called UTF-16BE (equal to UCS-2 +
surrogate support) and the little endian version is called UTF-16LE.

=back

=head1 See Also

L<Encode>,
L<Encode::Byte>,
L<Encode::CN>, L<Encode::JP>, L<Encode::KR>, L<Encode::TW>,
L<Encode::EBCDIC>, L<Encode::Symbol>,
L<Encode::MIME::Header>, L<Encode::Guess>

=head1 References

=over 2

=item ECMA

European Computer Manufacturers Association
L<http://www.ecma.ch>

=over 2

=item ECMA-035 (eq C<ISO-2022>)

L<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>

The specification of ISO-2022 is available from the link above.

=back

=item IANA

Internet Assigned Numbers Authority
L<http://www.iana.org/>

=over 2

=item Assigned Charset Names by IANA

L<http://www.iana.org/assignments/character-sets>

Most of the C<canonical names> in Encode derive from this list,
so you can directly apply the string you have extracted from the MIME
headers of mail messages and web pages.

=back

=item ISO

International Organization for Standardization
L<http://www.iso.ch/>

=item RFC

Request For Comments -- need I say more?
L<http://www.rfc-editor.org/>, L<http://www.rfc.net/>,
L<http://www.faqs.org/rfcs/>

=item UC

Unicode Consortium
L<http://www.unicode.org/>

=over 2

=item Unicode Glossary

L<http://www.unicode.org/glossary/>

The glossary of this document is based upon this site.

=back

=back

=head2 Other Notable Sites

=over 2

=item czyborra.com

L<http://czyborra.com/>

Contains a lot of useful information, especially gory details of ISO
vs. vendor mappings.

=item CJK.inf

L<http://www.oreilly.com/people/authors/lunde/cjk_inf.html>

Somewhat obsolete (last update in 1996), but still useful.
Also try

L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

You will find brief info on C<EUC-CN> and C<GBK>, and mostly on C<GB 18030>.

=item Jungshik Shin's Hangul FAQ

L<http://jshin.net/faq>

And especially its subject 8.

L<http://jshin.net/faq/qa8.html>

A comprehensive overview of the Korean (C<KS *>) standards.

=item debian.org: "Introduction to i18n"

A brief description of most of the mentioned CJK encodings is
contained in
L<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>

=back

=head2 Offline sources

=over 2

=item C<CJKV Information Processing> by Ken Lunde

CJKV Information Processing
1999 O'Reilly & Associates, ISBN: 1-56592-224-7

The modern successor of C<CJK.inf>.

Features comprehensive coverage of CJKV character sets and
encodings along with many other issues faced by anyone trying
to better support CJKV languages/scripts in all the areas of
information processing.

To purchase this book, visit
L<http://www.oreilly.com/catalog/cjkvinfo/>
or your favourite bookstore.

=back

=cut