@tab
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
@tab
us_ascii / (ASCII)
@tab
7-bit ASCII.
@item
utf_16
@tab
utf16
@tab
utf_16 / (UCS)
@tab
RFC 2781 UTF-16. The very first ZWNBSP code (U+FEFF) in the stream is interpreted as the BOM.
@item
utf_16be
@tab
utf16be
@tab
utf_16 / (UCS)
@tab
Big Endian version of RFC 2781 UTF-16.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
@item
utf_16le
@tab
utf16le
@tab
utf_16 / (UCS)
@tab
Little Endian version of RFC 2781 UTF-16.
ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
@item
utf_8
@tab
utf8
@tab
utf_8 / (UCS)
@tab
RFC 3629 UTF-8.
@item
win_1250
@tab
cp1250
@tab
@tab
Win-1250 Croatian.
@item
win_1251
@tab
cp1251
@tab
table / win_1251
@tab
Win-1251 - Cyrillic.
@item
win_1252
@tab
cp1252
@tab
table / win_1252
@tab
Win-1252 - Latin 1.
@item
win_1253
@tab
cp1253
@tab
table / win_1253
@tab
Win-1253 - Greek.
@item
win_1254
@tab
cp1254
@tab
table / win_1254
@tab
Win-1254 - Turkish.
@item
win_1255
@tab
cp1255
@tab
table / win_1255
@tab
Win-1255 - Hebrew.
@item
win_1256
@tab
cp1256
@tab
table / win_1256
@tab
Win-1256 - Arabic.
@item
win_1257
@tab
cp1257
@tab
table / win_1257
@tab
Win-1257 - Baltic.
@item
win_1258
@tab
cp1258
@tab
table / win_1258
@tab
Win-1258 - Vietnamese.
@end multitable

@page
@node iconv design decisions
@section iconv design decisions
@findex CCS table
@findex CES converter
@findex Speed-optimized tables
@findex Size-optimized tables
@*
The first iconv library design issue arises when considering the
following two design approaches:
@enumerate
@item
Have modules which implement conversion from the encoding A to the encoding B
and vice versa, i.e., each conversion module handles one pair of encodings.
@item
Have modules which implement conversion from the encoding A to the fixed
encoding C and vice versa, i.e., each conversion module handles
one encoding A and the fixed encoding C.
In this case, to convert from
the encoding A to the encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
@end enumerate
@*
There is an obvious tradeoff between generality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is needed
for each encoding pair.
@*
The Newlib iconv model uses the second method and always converts through the 32-bit
UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.
@*
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. It also decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.
@*
As an example, let's consider conversions between the big5 encoding and
the EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES.
EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.
@*
The euc_tw -> big5 conversion is performed as follows:
@enumerate
@item
The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
(the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
@item
The obtained CCS codes are transformed to UCS codes using the CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
@item
The resulting UCS codes are transformed to ASCII and BIG5 codes using
the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the big5 encoding using the corresponding
CES converter.
@end enumerate
@*
Analogously, the backward (big5 -> euc_tw) conversion is performed as follows:
@enumerate
@item
The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
(the ASCII and BIG5 CCS-es);
@item
The obtained CCS codes are transformed to UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
CES converter.
@end enumerate
@*
Note that the above is just an example; the real names of the CES converters and
the CCS tables implemented in the Newlib iconv library are slightly different.
@*
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables into the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic on "small" embedded systems.
@*
Since the CCS tables are just data, it is possible to load them
dynamically from external files.
The CES converters, on the other hand,
are algorithms implemented in code, so loading them dynamically requires dynamic library support.
@*
Apart from possible restrictions imposed by embedded systems (small
RAM, for example), Newlib itself has no dynamic library support and
therefore all the CES converters which will ever be used must be linked into
the library. However, dynamic loading of the CCS tables is possible and is
implemented in the Newlib iconv library. It may be enabled via the Newlib
configure script options.
@*
The next design issue is fine-tuning the iconv library
configuration. One important ability is for iconv not to link all its
converters and tables (if dynamic loading is not enabled) but instead to
enable only those encodings which are specified at configuration
time (see the section about the configure script options).
@*
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only are the supported encodings
selectable, the conversion direction is as well. For example, if a user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, they can enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
excluding one half of a CCS table from linking, and that half may be quite large).
@*
One more design aspect is the choice between speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones.
On the
other hand, conversion with speed tables is several times faster.
@*
It's worth stressing that support for a new encoding can't be
added dynamically to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
external files with CCS tables (this isn't a fundamental restriction,
and the ability to add support for a new table-based encoding dynamically,
just by adding a new .cct file, could easily be added).
@*
Theoretically, the compiled-in CCS tables should be more appropriate for
embedded systems than dynamically loaded CCS tables. This is because the
compiled-in tables are read-only and can be placed in ROM,
whereas dynamic loading requires RAM. Moreover, in the current iconv
implementation, a distinct copy of the dynamic CCS file is loaded for each
opened iconv descriptor, even in case of the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files). On the other hand, in the case of compiled-in CCS tables,
there will always be only one copy.

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex --enable-newlib-iconv-encodings
@findex --enable-newlib-iconv-from-encodings
@findex --enable-newlib-iconv-to-encodings
@findex --enable-newlib-iconv-external-ccs
@findex NLSPATH
@*
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled.
The option enables each encoding in both
("to" and "from") directions.
@*
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
"from" support for each encoding that was passed to it.
@*
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
"to" support for each encoding that was passed to it.
@*
Example: if a user plans to use only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (the minimum of iconv
code and data will be linked) is to configure Newlib with the following
options:
@*
@code{--enable-newlib-iconv-encodings=UTF-8
--enable-newlib-iconv-from-encodings=KOI8-R
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
@*
which is the same as
@*
@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
@*
The user may also just use the
@*
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
@*
configure script option, but this is less optimal since some unneeded
data and code will be linked.
@*
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
ability to work with external CCS files.
@*
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the size-optimized
CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.
@*
Note: @code{iconv_open} searches for .cct files in the $NLSPATH/iconv_data/ directory.
Thus, the NLSPATH environment variable should be set.

@page
@node Encoding names
@section Encoding names
@findex encoding name
@findex encoding alias
@findex normalized name
@*
Each encoding has one @dfn{name} and a number of @dfn{aliases}.
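@*
As a rough illustration of the normalized form that this section describes
(all letters in lower case and @kbd{-} replaced by @kbd{_}), the following C sketch
normalizes a name or alias. The function @code{normalize_encoding_name} is
hypothetical and is not part of the Newlib API; the sketch does not perform the
alias-to-name lookup.

@verbatim
```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (an assumption, not a Newlib function): writes the
   normalized form of NAME into OUT, which holds OUT_SIZE bytes including
   the terminating NUL. Returns 0 on success, -1 if OUT is too small. */
static int normalize_encoding_name(const char *name, char *out, size_t out_size)
{
    size_t i;

    if (out_size == 0)
        return -1;

    for (i = 0; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];

        if (i + 1 >= out_size)      /* keep room for the terminator */
            return -1;

        /* '-' and '_' are equivalent; letter case is ignored. */
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    out[i] = '\0';
    return 0;
}
```
@end verbatim

Under these assumptions, "KOI8-R", "koi8-r" and "Koi8_R" all normalize to the
same string, "koi8_r".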
When
a user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same holds when encoding
names are used in configure script options.
@*
Names and aliases may be specified in any case (small or capital
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
@*
Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.

@page
@node CCS tables
@section CCS tables
@findex Size-optimized CCS table
@findex Speed-optimized CCS table
@findex mktbl.pl Perl script
@findex .cct files
@findex The CCT tables source files
@findex CCS source files
@*
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms: in the binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked into the Newlib library if the corresponding
encoding is enabled.
@*
As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.
@*
CCS tables may be 8 or 16 bits wide. 8-bit CCS tables map 8-bit CCS to
16-bit UCS-2 and vice versa, while 16-bit CCS tables map
16-bit CCS to 16-bit UCS-2 and vice versa.
8-bit tables are small, while 16-bit tables may be quite large.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized.
Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
the speed-optimized variant.
Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.
@*
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
a lot of gaps exist.

@subsection Speed-optimized tables format
@*
In case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
trivial: it is just an array of 256 16-bit UCS codes. Therefore, a
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.
@*
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use one huge array indexed by the 16-bit code, as in the case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code mappings, and this fact may be exploited to reduce
the size of the CCS tables.
@*
This subsection describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.
@*
In case of the 8-bit speed-optimized table the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:
@*
from_ucs array:
@*
-------------------------------------
@*
0xFF mapping (2 bytes) (only for
8-bit table).
@*
-------------------------------------
@*
Heading block
@*
-------------------------------------
@*
Block 1
@*
-------------------------------------
@*
Block 2
@*
-------------------------------------
@*
 ...
@*
-------------------------------------
@*
Block N
@*
-------------------------------------
@*
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges.
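@*
To make this split concrete, the following minimal C sketch shows one way to
derive the subrange number (the high byte) and the position inside the subrange
(the low byte) from a 16-bit code. The function names are hypothetical and are
not taken from the Newlib sources.

@verbatim
```c
#include <stdint.h>

/* Hypothetical helpers illustrating the 256 x 256 split of the
   0x0000-0xFFFF code range; not Newlib functions. */

/* Number (0..255) of the 256-code subrange that CODE falls into. */
static unsigned subrange_number(uint16_t code)
{
    return (code & 0xFF00u) >> 8;
}

/* Position (0..255) of CODE inside its subrange. */
static unsigned subrange_offset(uint16_t code)
{
    return code & 0x00FFu;
}
```
@end verbatim

For the UCS-2 code 0x0416, @code{subrange_number} yields 0x04 and
@code{subrange_offset} yields 0x16.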
Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in case of a 16-bit CCS table) with
elements which are equivalent to the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.
@*
Element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.
@*
If there are gaps in a block, the corresponding block elements have
the 0xFF value. If the 0xFF code is present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.
@*
With this table format, the algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.
@*
@enumerate
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.
@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.
@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch the
"from_ucs" array index of the @emph{BlkN}-th block from that heading block
element; in the following steps, @emph{BlkN} denotes this index.
@item Calculate the offset of the @emph{X} code in its block: @emph{Xindex = Y & 0xFF}
@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkN+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data).
Else, @emph{X = from_ucs[BlkN+Xindex]}.
@end enumerate

@subsection Size-optimized tables format
@*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in case of 8-bit CCS-es.
@*
The formats of the "to_ucs" and "from_ucs" subtables are equivalent in case of
size-optimized tables.
This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.
The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a run of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.
@*
The following is the layout of the size-optimized table array:
@*
size_arr array:
@*
-------------------------------------
@*
Ranges number (2 bytes)
@*
-------------------------------------
@*
Unranged codes number (2 bytes)
@*
-------------------------------------
@*
Unranged codes array index (2 bytes)
@*
-------------------------------------
@*
Ranges indexes (triads)