📄 notes

libiconv is a very good character set conversion library, and its programming interface is simple.
   Simplified Chinese
     * EUC-CN = GB2312
       We implement this because it is the widely used representation
       of simplified Chinese.
     * GBK
       We implement this because it appears to be used on Solaris and Windows.
     * GB18030
       We implement this because it is an official requirement in the
       People's Republic of China.
     * ISO-2022-CN
       We implement this because it is in the RFCs, but I have no idea
       whether it is really used.
     * ISO-2022-CN-EXT
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * HZ = HZ-GB-2312
       We implement this because the RFCs recommend it for Usenet postings,
       and because MSIE4 supports it.

   Traditional Chinese
     * EUC-TW
       We implement it because it appears to be used on Unix.
     * BIG5
       We implement it because it is the de-facto standard for traditional
       Chinese.
     * CP950
       We implement this because it is the Microsoft variant of BIG5, used
       on Windows.
     * BIG5+
       We DON'T implement this because it doesn't appear to be in wide use.
       Only the CWEX fonts use this encoding. Furthermore, the conversion
       tables in the big5p package are not coherent: If you convert directly,
       you get different results than when you convert via GBK.
     * BIG5-HKSCS
       We implement it because it is the de-facto standard for traditional
       Chinese in Hong Kong.

   Korean
     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
     * CP949
       We implement this because it is the Microsoft variant of EUC-KR, used
       on Windows.
     * ISO-2022-KR
       We implement it because it is in the RFCs and because MSIE4 supports
       it, but I have no idea whether it's really used.
     * JOHAB
       We implement this because it is apparently used on Windows as a locale
       encoding (codepage 1361).
     * ISO-646-KR
       We DON'T implement this because, although it is an old ASCII variant,
       its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV Information Processing" says
       it's an overline. And it is not ISO-IR registered.

   Armenian
     * ARMSCII-8
       We implement it because XFree86 supports it.

   Georgian
     * Georgian-Academy, Georgian-PS
       We implement these because they both appear to be used for Georgian;
       XFree86 supports them.

   Thai
     * TIS-620
       We implement this because it seems to be standard for Thai.
     * CP874
       We implement this because MSIE4 supports it.
     * MacThai
       We implement this because the Sun JDK does, and because Mac users
       don't deserve to be punished.

   Laotian
     * MuleLao-1, CP1133
       We implement these because XFree86 supports them. I have no idea which
       one is used more widely.

   Vietnamese
     * VISCII, TCVN
       We implement these because XFree86 supports them.
     * CP1258
       We implement this because MSIE4 supports it.

   Other languages
     * NUNACOM-8 (Inuktitut)
       We DON'T implement this because it isn't part of Unicode yet, and
       therefore doesn't convert to anything except itself.

   Platform specifics
     * HP-ROMAN8, NEXTSTEP
       We implement these because they were the native character set on HPs
       and NeXTs for a long time, and libiconv is intended to be usable on
       these old machines.

   Full Unicode
     * UTF-8, UCS-2, UCS-4
       We implement these. Obviously.
     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
       We implement these because they are the preferred internal
       representation of strings in Unicode-aware applications. These are
       non-ambiguous names, known to glibc. (glibc doesn't have
       UCS-2-INTERNAL and UCS-4-INTERNAL.)
     * UTF-16, UTF-16BE, UTF-16LE
       We implement these, because UTF-16 is still the favourite encoding of
       the president of the Unicode Consortium (for political reasons), and
       because they appear in RFC 2781.
     * UTF-32, UTF-32BE, UTF-32LE
       We implement these because they are part of Unicode 3.1.
     * UTF-7
       We implement this because it is essential functionality for mail
       applications.
     * C99
       We implement it because it's used for C and C++ programs and because
       it's a nice encoding for debugging.
     * JAVA
       We implement it because it's used for Java programs and because it's
       a nice encoding for debugging.
     * UNICODE (big endian), UNICODEFEFF (little endian)
       We DON'T implement these because they are stupid and not standardized.

   Full Unicode, in terms of `uint16_t' or `uint32_t'
   (with machine dependent endianness and alignment)
     * UCS-2-INTERNAL, UCS-4-INTERNAL
       We implement these because they are the preferred internal
       representation of strings in Unicode-aware applications.

Q: Support encodings mentioned in RFC 1345?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
   since ISO-8859-* have been adopted.

Q: Support EBCDIC?
A: No!

Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
   2. You need to have a conversion table from/to Unicode. Transform it into
   the format used by the mapping tables found on ftp.unicode.org: each line
   contains the character code, in hex, with 0x prefix, then whitespace,
   then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
   counts as a comment delimiter until end of line.
   Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
   can include it in his collection.
   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
   tools directory to generate the C code for the conversion.
   You may tweak the resulting C code if you are not satisfied with its
   quality, but this is rarely needed.
   If it's a two-dimensional character set (with rows and columns), use the
   'cjk_tab_to_h' program in the tools directory to generate the C code for
   the conversion. You will need to modify the main() function to recognize
   the new character set name, with the proper dimensions, but that shouldn't
   be too hard. This yields the CCS. The CES you have to write by hand.
   4. Store the resulting C code file in the lib directory. Add a #include
   directive to converters.h, and add an entry to the encodings.def file.
   5. Compile the package, and test your new encoding using a program like
   iconv(1) or clisp(1).
   6. Augment the testsuite: Add a line to each of tests/Makefile.in,
   tests/Makefile.msvc and tests/Makefile.os2. For a stateless encoding,
   create the complete table as a TXT file. For a stateful encoding,
   provide a text snippet encoded using your new encoding and its UTF-8
   equivalent.
   7. Update the README and man/iconv_open.3, to mention the new encoding.
   Add a note in the NEWS file.

Q: What about bidirectional text? Should it be tagged or reversed when
   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
   ISO-8859-8-E remains to be implemented.
   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
   the same as ISO-8859-8-I.
   I'm confused.

Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"

Other aliases not implemented (and not implemented in glibc-2.1 either):
  From MSIE4:
    ISO-8859-1: alias ISO8859-1
    ISO-8859-2: alias ISO8859-2
    KSC_5601: alias KS_C_5601
    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8

Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
   At configuration time, call libiconv's configure script with the
   appropriate --srcdir option and maybe --enable-static or --disable-shared.
   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
   'install-lib' is a special (not GNU standardized) target which installs
   only the include file - in $(includedir) - and the library - in $(libdir) -
   and does not use other directory variables. After "installing" libiconv
   in your package's build directory, building of your package can proceed.

Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
   If you don't feel like using the testsuite, you can simply remove the
   tests/ directory.