              Atkinson, R., Crispin, M. and P. Svanberg, "Character Set
              Workshop Report", RFC 2130, April 1997.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", RFC 2277, January 1998.

   [RFC2279]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", RFC 2279, January 1998.

   [RFC2389]  Elz, R. and P. Hethmon, "Feature Negotiation Mechanism for
              the File Transfer Protocol", RFC 2389, August 1998.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard - Version
              2.0", Addison-Wesley Developers Press, July 1996.

   [UTF-8]    ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
              Transformation Format 8 (UTF-8).

Curtin                      Proposed Standard                  [Page 14]

RFC 2640                  FTP Internalization                  July 1999

9 Author's Address

   Bill Curtin
   JIEO
   Attn: JEBBD
   Ft. Monmouth, N.J. 07703-5613

   EMail: curtinw@ftm.disa.mil

Annex A - Implementation Considerations

A.1 General Considerations

   - Implementers should ensure that their code accounts for potential
     problems, such as using a NULL character to terminate a string or
     no longer being able to steal the high order bit for internal use,
     when supporting the extended character set.

   - Implementers should be aware that there is a chance that pathnames
     that are not UTF-8 may be parsed as valid UTF-8. The probability is
     low for some encodings and statistically zero, or exactly zero, for
     others. A recent non-scientific analysis found that EUC encoded
     Japanese words had a 2.7% false reading; SJIS had a 0.0005% false
     reading; other encodings such as ASCII or KOI-8 have a 0% false
     reading. This probability is highest for short pathnames and
     decreases as pathname size increases. Implementers may want to look
     for signs that pathnames which parse as UTF-8 are not in fact
     UTF-8, such as the existence of multiple local character sets in
     short pathnames. Hopefully, as more implementations conform to
     UTF-8 transfer encoding there will be a smaller need to guess at
     the encoding.
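
     The false reading described in the consideration above can be
     illustrated with a short sketch. It is not part of this
     specification: the helper name parses_as_utf8 and both sample
     names are assumptions chosen for illustration, and ISO 8859-1 is
     used only because its byte values are easy to verify by hand. The
     helper checks byte tagging only, which is enough to show how a
     name stored in a local character set can also read as UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only. Returns 1 if every byte
 * in buf is tagged as a 1-byte character, a lead byte of a 2- to
 * 4-byte sequence, or a continuation byte in the expected position.
 * No range checks are made, so this is weaker than a full validator. */
int parses_as_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i++];
        size_t need;                  /* continuation bytes expected */

        if (c < 0x80)                need = 0;
        else if ((c & 0xE0) == 0xC0) need = 1;
        else if ((c & 0xF0) == 0xE0) need = 2;
        else if ((c & 0xF8) == 0xF0) need = 3;
        else return 0;                /* stray continuation or bad lead */

        for (; need > 0; need--) {
            if (i >= len || (buf[i] & 0xC0) != 0x80)
                return 0;             /* truncated or untagged byte */
            i++;
        }
    }
    return 1;
}

/* An ISO 8859-1 name whose first two characters are 0xC3 (A with
 * tilde) and 0xA9 (copyright sign). The same two bytes are the UTF-8
 * encoding of U+00E9, so this name also parses as UTF-8. */
static const unsigned char latin1_ambiguous[] =
    { 0xC3, 0xA9, '.', 't', 'x', 't' };

/* An ISO 8859-1 name starting with 0xE9 (e with acute). 0xE9 is tagged
 * as a 3-byte lead, but '.' is not a continuation byte, so this name
 * does not parse as UTF-8. */
static const unsigned char latin1_plain[] =
    { 0xE9, '.', 't', 'x', 't' };
```

     Under these assumptions the first name passes the check and the
     second fails it, which matches the observation that only some
     byte patterns of a local character set produce a false reading.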

   - Client developers should be aware that it will be possible for
     pathnames to contain mixed characters (e.g.
     //Latin1DirectoryName/HebrewFileName). They should be prepared to
     handle the Bi-directional (BIDI) display of these character sets
     (i.e. left to right display for the directory and right to left
     display for the filename). While bi-directional display is outside
     the scope of this document and more complicated than the above
     example, an algorithm for bi-directional display can be found in
     the UNICODE 2.0 [UNICODE] standard. Also note that pathnames can
     have different byte ordering yet be logically and display-wise
     equivalent due to the insertion of BIDI control characters at
     different points during composition. Mixed character sets may also
     present problems with font swapping.

   - A server that copies pathnames transparently from a local
     filesystem may continue to do so. It is then up to the local file
     creators to use UTF-8 pathnames.

   - Servers can support charset labeling of files and/or directories,
     such that different pathnames may have different charsets. The
     server should attempt to convert all pathnames to UTF-8, but if it
     can't then it should leave that name in its raw form.

   - Some server operating systems do not mandate a character set, but
     allow administrators to configure one in the FTP server. These
     servers should be configured to use a particular mapping table
     (either external or built-in). This will allow the flexibility of
     defining different charsets for different directories.

   - If the server's OS does not mandate the character set and the FTP
     server cannot be configured, the server should simply use the raw
     bytes in the file name. They might be ASCII or UTF-8.
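
     The charset labeling and raw-form considerations above might be
     sketched as follows. This is an illustration under stated
     assumptions, not a prescribed implementation: the function name
     export_pathname, the label string "ISO-8859-1", and the choice to
     support only that one charset are all hypothetical. ISO 8859-1 is
     used because its byte values equal their UCS code points, so the
     sketch needs no mapping table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not defined by this specification. Converts a
 * pathname to UTF-8 when its charset label is "ISO-8859-1"; for any
 * other label, or no label, the raw bytes are copied unchanged.
 * Returns the number of bytes written to out, or 0 if out is too
 * small (an empty name also yields 0). */
size_t export_pathname(const char *charset,
                       const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_cap)
{
    size_t i, n = 0;

    if (charset == NULL || strcmp(charset, "ISO-8859-1") != 0) {
        /* Unknown or unlabeled charset: leave the name in raw form. */
        if (in_len > out_cap)
            return 0;
        memcpy(out, in, in_len);
        return in_len;
    }

    for (i = 0; i < in_len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {                 /* ASCII: one byte, unchanged */
            if (n + 1 > out_cap)
                return 0;
            out[n++] = c;
        } else {                        /* U+0080..U+00FF: two bytes */
            if (n + 2 > out_cap)
                return 0;
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}
```

     With these assumptions, a labeled name containing the byte 0xE9
     is sent as the two bytes 0xC3 0xA9, while an unlabeled name is
     sent exactly as stored. A real server would replace the single
     hard-coded charset with its configured mapping tables.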

   - If the server is a mirror, and wants to look just like the site it
     is mirroring, it should store the exact file name bytes that it
     received from the main server.

A.2 Transition Considerations

   - Servers which support this specification, when presented a
     pathname from an old client (one which does not support this
     specification), can nearly always tell whether the pathname is in
     UTF-8 (see B.1) or in some other code set. In order to support
     these older clients, servers may wish to default to a non UTF-8
     code set. However, how a server supports non UTF-8 is outside the
     scope of this specification.

   - Clients which support this specification will be able to determine
     if the server can support UTF-8 (i.e. supports this specification)
     by the ability of the server to support the FEAT command and the
     UTF8 feature (defined in 3.2). If a newer client determines that
     the server does not support UTF-8, it may wish to default to a
     different code set. Client developers should take into
     consideration that pathnames, associated with older servers, might
     be stored in UTF-8. However, how a client supports non UTF-8 is
     outside the scope of this specification.

   - Clients and servers can transition to UTF-8 by either converting
     to/from the local encoding, or the users can store UTF-8 filenames.
     The former approach is easier on tightly controlled file systems
     (e.g. PCs and Macs). The latter approach is easier on more free
     form file systems (e.g. Unix).

   - For interactive use, attention should be focused on user interface
     and ease of use. Non-interactive use requires a consistent and
     controlled behavior.

   - There may be many applications which reference files under their
     old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8
     will cause access to the old URL to fail. A solution may be for the
     server to act as if there were two different pathnames associated
     with the file.
     This might be done internally by the server on controlled file
     systems or by using symbolic links on free form systems. While this
     approach may work for single file transfer non-interactive use, a
     non-interactive transfer of all of the files in a directory will
     produce duplicates. Interactive users may be presented with lists
     of files which are double the actual number of files.

Annex B - Sample Code and Examples

B.1 Valid UTF-8 check

   The following routine checks if a byte sequence is valid UTF-8. This
   is done by checking for the proper tagging of the first and
   following bytes to make sure they conform to the UTF-8 format. It
   then checks to ensure that the data part of the UTF-8 sequence
   conforms to the proper range allowed by the encoding. Note: This
   routine will not detect characters that have not been assigned and
   therefore do not exist.

   int utf8_valid(const unsigned char *buf, unsigned int len)
   {
     const unsigned char *endbuf = buf + len;
     unsigned char byte2mask = 0x00, c;
     int trailing = 0;  // trailing (continuation) bytes to follow

     while (buf != endbuf)
     {
       c = *buf++;
       if (trailing)
       {
         if ((c & 0xC0) != 0x80)  // Does trailing byte follow
           return 0;              // UTF-8 format?
         if (byte2mask)           // Need to check 2nd byte for
         {                        // proper range?
           if (!(c & byte2mask))  // Are appropriate bits set?
             return 0;
           byte2mask = 0x00;
         }
         trailing--;
       }
       else if ((c & 0x80) == 0x00)  // valid 1 byte UTF-8
         continue;
       else if ((c & 0xE0) == 0xC0)  // valid 2 byte UTF-8
       {
         if (!(c & 0x1E))            // Is UTF-8 byte in proper range?
           return 0;
         trailing = 1;
       }
       else if ((c & 0xF0) == 0xE0)  // valid 3 byte UTF-8
       {
         if (!(c & 0x0F))            // Is UTF-8 byte in proper range?
           byte2mask = 0x20;         // If not, set mask to check
         trailing = 2;               // next byte
       }
       else if ((c & 0xF8) == 0xF0)  // valid 4 byte UTF-8
       {
         if (!(c & 0x07))            // Is UTF-8 byte in proper range?
           byte2mask = 0x30;         // If not, set mask to check
         trailing = 3;               // next byte
       }
       else if ((c & 0xFC) == 0xF8)  // valid 5 byte UTF-8
       {
         if (!(c & 0x03))            // Is UTF-8 byte in proper range?
           byte2mask = 0x38;         // If not, set mask to check
         trailing = 4;               // next byte
       }
       else if ((c & 0xFE) == 0xFC)  // valid 6 byte UTF-8
       {
         if (!(c & 0x01))            // Is UTF-8 byte in proper range?
           byte2mask = 0x3C;         // If not, set mask to check
         trailing = 5;               // next byte
       }
       else
         return 0;
     }
     return trailing == 0;
   }

B.2 Conversions

   The code examples in this section closely reflect the algorithm in
   ISO 10646 and may not present the most efficient solution for
   converting to / from UTF-8 encoding. If efficiency is an issue,
   implementers should use the appropriate bitwise operators.

   Additional code examples and numerous mapping tables can be found at
   the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.

   Note that the conversion examples below assume that the local
   character set supported in the operating system is something other
   than UCS2/UTF-16. There are some operating systems that already
   support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
   conversion will be necessary from the local character set to the
   UCS.

B.2.1 Conversion from Local Character Set to UTF-8

   Conversion from the local filesystem character set to UTF-8 will
   normally involve a two step process. First convert the local
   character set to the UCS; then convert the UCS to UTF-8.

   The first step in the process can be performed by maintaining a
   mapping table that includes the local character set code and the
   corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
   code for the Hebrew letter "VAV" is 0xE5.
   The corresponding 4 byte ISO/IEC 10646 code is 0x000005D5.

   The next step is to convert the UCS character code to the UTF-8
   encoding. The following routine can be used to determine and encode
   the correct number of bytes based on the UCS-4 character code:

   unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf,
                              unsigned int ucs4_len,
                              unsigned char *utf8_buf)
   {
     const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
     unsigned int utf8_len = 0;             // return value for UTF8 size
     unsigned char *t_utf8_buf = utf8_buf;  // Temporary pointer