                Atkinson, R., Crispin, M. and P. Svanberg, "Character
                Set Workshop Report", RFC 2130, April 1997.

   [RFC2277]    Alvestrand, H., "IETF Policy on Character Sets and
                Languages", RFC 2277, January 1998.

   [RFC2279]    Yergeau, F., "UTF-8, a transformation format of ISO
                10646", RFC 2279, January 1998.

   [RFC2389]    Elz, R. and P. Hethmon, "Feature Negotiation Mechanism
                for the File Transfer Protocol", RFC 2389, August 1998.

   [UNICODE]    The Unicode Consortium, "The Unicode Standard - Version
                2.0", Addison-Wesley Developers Press, July 1996.

   [UTF-8]      ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
                Transformation Format 8 (UTF-8).

Curtin                     Proposed Standard                   [Page 14]

RFC 2640                  FTP Internalization                  July 1999

9 Author's Address

   Bill Curtin
   JIEO
   Attn: JEBBD
   Ft. Monmouth, N.J. 07703-5613
   EMail: curtinw@ftm.disa.mil

Annex A - Implementation Considerations

A.1 General Considerations

   - Implementers should ensure that their code accounts for potential
     problems, such as using a NULL character to terminate a string or
     no longer being able to steal the high order bit for internal use,
     when supporting the extended character set.

   - Implementers should be aware that there is a chance that pathnames
     that are not UTF-8 may be parsed as valid UTF-8. The probability
     is low for some encodings and statistically zero for others. A
     recent non-scientific analysis found that EUC encoded Japanese
     words had a 2.7% false reading; SJIS had a 0.0005% false reading;
     other encodings such as ASCII or KOI-8 have a 0% false reading.
     This probability is highest for short pathnames and decreases as
     pathname size increases.
     Implementers may want to look for signs that pathnames which parse
     as UTF-8 were not actually encoded as UTF-8, such as the existence
     of multiple local character sets in short pathnames. Hopefully, as
     more implementations conform to UTF-8 transfer encoding there will
     be a smaller need to guess at the encoding.

   - Client developers should be aware that it will be possible for
     pathnames to contain mixed characters (e.g.
     //Latin1DirectoryName/HebrewFileName). They should be prepared to
     handle the Bi-directional (BIDI) display of these character sets
     (i.e. left to right display for the directory and right to left
     display for the filename). While bi-directional display is outside
     the scope of this document and more complicated than the above
     example, an algorithm for bi-directional display can be found in
     the Unicode 2.0 [UNICODE] standard. Note that pathnames can have
     different byte ordering yet be logically and display-wise
     equivalent due to the insertion of BIDI control characters at
     different points during composition. Mixed character sets may
     also present problems with font swapping.

   - A server that copies pathnames transparently from a local
     filesystem may continue to do so. It is then up to the local file
     creators to use UTF-8 pathnames.

   - Servers can support charset labeling of files and/or directories,
     such that different pathnames may have different charsets. The
     server should attempt to convert all pathnames to UTF-8, but if it
     can't then it should leave that name in its raw form.

   - Some servers' operating systems do not mandate a character set,
     but allow administrators to configure one in the FTP server.
     These servers should be configured to use a particular mapping
     table (either external or built-in).
     This will allow the flexibility of defining different charsets
     for different directories.

   - If the server's OS does not mandate the character set and the FTP
     server cannot be configured, the server should simply use the raw
     bytes in the file name.  They might be ASCII or UTF-8.

   - If the server is a mirror, and wants to look just like the site it
     is mirroring, it should store the exact file name bytes that it
     received from the main server.

A.2 Transition Considerations

   - Servers which support this specification, when presented a pathname
     from an old client (one which does not support this specification),
     can nearly always tell whether the pathname is in UTF-8 (see B.1)
     or in some other code set. In order to support these older clients,
     servers may wish to default to a non UTF-8 code set. However, how a
     server supports non UTF-8 is outside the scope of this
     specification.

   - Clients which support this specification will be able to determine
     if the server can support UTF-8 (i.e. supports this specification)
     by the ability of the server to support the FEAT command and the
     UTF8 feature (defined in 3.2). If a newer client determines that
     the server does not support UTF-8, it may wish to default to a
     different code set. Client developers should take into
     consideration that pathnames associated with older servers might
     be stored in UTF-8. However, how a client supports non UTF-8 is
     outside the scope of this specification.

   - Clients and servers can transition to UTF-8 by either converting
     to/from the local encoding, or the users can store UTF-8 filenames.
     The former approach is easier on tightly controlled file systems
     (e.g. PCs and Macs). The latter approach is easier on more free
     form file systems (e.g. Unix).

   - For interactive use, attention should be focused on user interface
     and ease of use. Non-interactive use requires consistent and
     controlled behavior.

   - There may be many applications which reference files under their
     old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8
     will cause access to the old URL to fail. A solution may be for the
     server to act as if there were two different pathnames associated
     with the file. This might be done internal to the server on
     controlled file systems or by using symbolic links on free form
     systems. While this approach may work for single file transfer
     non-interactive use, a non-interactive transfer of all of the
     files in a directory will produce duplicates. Interactive users
     may be presented with lists of files which are double the actual
     number of files.

Annex B - Sample Code and Examples

B.1 Valid UTF-8 check

   The following routine checks if a byte sequence is valid UTF-8. This
   is done by checking for the proper tagging of the first and following
   bytes to make sure they conform to the UTF-8 format. It then checks
   to assure that the data part of the UTF-8 sequence conforms to the
   proper range allowed by the encoding. Note: This routine will not
   detect characters that have not been assigned and therefore do not
   exist.

int utf8_valid(const unsigned char *buf, unsigned int len)
{
 const unsigned char *endbuf = buf + len;
 unsigned char byte2mask=0x00, c;
 int trailing = 0;  // trailing (continuation) bytes to follow

 while (buf != endbuf)
 {
   c = *buf++;
   if (trailing)
    if ((c&0xC0) == 0x80)  // Does trailing byte follow UTF-8 format?
    {if (byte2mask)        // Need to check 2nd byte for proper range?
      if (c&byte2mask)     // Are appropriate bits set?
       byte2mask=0x00;
      else
       return 0;
     trailing--; }
    else
     return 0;
   else
    if ((c&0x80) == 0x00)  continue;      // valid 1 byte UTF-8
    else if ((c&0xE0) == 0xC0)            // valid 2 byte UTF-8
          if (c&0x1E)                     // Is UTF-8 byte in
                                          // proper range?
           trailing =1;
          else
           return 0;
    else if ((c&0xF0) == 0xE0)           // valid 3 byte UTF-8
          {if (!(c&0x0F))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x20;              // If not set mask
                                         // to check next byte
            trailing = 2;}
    else if ((c&0xF8) == 0xF0)           // valid 4 byte UTF-8
          {if (!(c&0x07))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x30;              // If not set mask
                                         // to check next byte
            trailing = 3;}
    else if ((c&0xFC) == 0xF8)           // valid 5 byte UTF-8
          {if (!(c&0x03))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x38;              // If not set mask
                                         // to check next byte
            trailing = 4;}
    else if ((c&0xFE) == 0xFC)           // valid 6 byte UTF-8
          {if (!(c&0x01))                // Is UTF-8 byte in
                                         // proper range?
            byte2mask=0x3C;              // If not set mask
                                         // to check next byte
            trailing = 5;}
    else  return 0;
 }
  return trailing == 0;
}

B.2 Conversions

   The code examples in this section closely reflect the algorithm in
   ISO 10646 and may not present the most efficient solution for
   converting to / from UTF-8 encoding. If efficiency is an issue,
   implementers should use the appropriate bitwise operators.

   Additional code examples and numerous mapping tables can be found at
   the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.

   Note that the conversion examples below assume that the local
   character set supported in the operating system is something other
   than UCS2/UTF-16. There are some operating systems that already
   support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
   conversion will be necessary from the local character set to the UCS.

B.2.1 Conversion from Local Character Set to UTF-8

   Conversion from the local filesystem character set to UTF-8 will
   normally involve a two step process. First convert the local
   character set to the UCS; then convert the UCS to UTF-8.

   The first step in the process can be performed by maintaining a
   mapping table that includes the local character set code and the
   corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
   code for the Hebrew letter "VAV" is 0xE5. The corresponding 4 byte
   ISO/IEC 10646 code is 0x000005D5.

   The next step is to convert the UCS character code to the UTF-8
   encoding.
   The following routine can be used to determine and encode the
   correct number of bytes based on the UCS-4 character code:

   unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
                              ucs4_len, unsigned char *utf8_buf)

   {
    const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
    unsigned int utf8_len = 0;        // return value for UTF8 size
    unsigned char *t_utf8_buf = utf8_buf; // Temporary pointer