                Atkinson, R., Crispin, M. and P. Svanberg, "Character
                Set Workshop Report", RFC 2130, April 1997.

   [RFC2277]    Alvestrand, H., "IETF Policy on Character Sets and
                Languages", RFC 2277, January 1998.

   [RFC2279]    Yergeau, F., "UTF-8, a transformation format of ISO
                10646", RFC 2279, January 1998.

   [RFC2389]    Elz, R. and P. Hethmon, "Feature Negotiation Mechanism
                for the File Transfer Protocol", RFC 2389, August 1998.

   [UNICODE]    The Unicode Consortium, "The Unicode Standard - Version
                2.0", Addison-Wesley Developers Press, July 1996.

   [UTF-8]      ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
                Transformation Format 8 (UTF-8).










Curtin                     Proposed Standard                   [Page 14]

RFC 2640                  FTP Internalization                  July 1999


9 Author's Address

   Bill Curtin
   JIEO
   Attn: JEBBD
   Ft. Monmouth, N.J. 07703-5613

   EMail: curtinw@ftm.disa.mil


Annex A - Implementation Considerations

A.1 General Considerations

   - Implementers should ensure that their code accounts for potential
     problems, such as using a NULL character to terminate a string or
     no longer being able to steal the high order bit for internal use,
     when supporting the extended character set.

   - Implementers should be aware that there is a chance that pathnames
     that are not UTF-8 may be parsed as valid UTF-8. The probability is
     low for some encodings and statistically close to zero for others.
     A recent non-scientific analysis found that EUC encoded Japanese
     words had a 2.7% false reading rate; SJIS had a 0.0005% false
     reading rate; other encodings such as ASCII or KOI-8 had a 0% false
     reading rate. This probability is highest for short pathnames and
     decreases as pathname size increases. Implementers may want to look
     for signs that pathnames which parse as UTF-8 were not intended as
     UTF-8, such as the existence of multiple local character sets in
     short pathnames. Hopefully, as more implementations conform to
     UTF-8 transfer encoding there will be less need to guess at the
     encoding.

   - Client developers should be aware that it will be possible for
     pathnames to contain mixed characters (e.g.
     //Latin1DirectoryName/HebrewFileName). They should be prepared to
     handle the Bi-directional (BIDI) display of these character sets
     (i.e. left to right display for the directory and right to left
     display for the filename). While bi-directional display is outside
     the scope of this document and more complicated than the above
     example, an algorithm for bi-directional display can be found in
     the UNICODE 2.0 [UNICODE] standard. Note that pathnames can have
     different byte ordering yet be logically and display-wise
     equivalent due to the insertion of BIDI control characters at
     different points during composition. Mixed character sets may also
     present problems with font swapping.

   - A server that copies pathnames transparently from a local
     filesystem may continue to do so. It is then up to the local file
     creators to use UTF-8 pathnames.

   - Servers can support charset labeling of files and/or directories,
     such that different pathnames may have different charsets. The
     server should attempt to convert all pathnames to UTF-8, but if it
     can't then it should leave that name in its raw form.

   - Some server operating systems do not mandate a character set, but
     allow administrators to configure one in the FTP server. These
     servers should be configured to use a particular mapping table
     (either external or built-in). This will allow the flexibility of
     defining different charsets for different directories.

   - If the server's OS does not mandate the character set and the FTP
     server cannot be configured, the server should simply use the raw
     bytes in the file name.  They might be ASCII or UTF-8.

   - If the server is a mirror, and wants to look just like the site it
     is mirroring, it should store the exact file name bytes that it
     received from the main server.
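
   The mixed character set check mentioned in the list above could be
   sketched as follows. This is only an illustration: the function
   names, the script groups, the block ranges selected and the length
   threshold are assumptions of this sketch and are not defined by this
   specification. The input is assumed to have already passed a
   validity check such as the one in B.1.

```c
#include <stddef.h>

/* Hypothetical script groups; the grouping is an assumption made for
 * this sketch, not something defined by the specification. */
enum script { SC_NONE, SC_LATIN, SC_GREEK, SC_CYRILLIC, SC_HEBREW,
              SC_ARABIC, SC_CJK, SC_OTHER };

/* Classify a code point by Unicode block. ASCII carries no signal. */
static enum script script_of(unsigned long cp)
{
    if (cp < 0x80)                     return SC_NONE;
    if (cp <= 0x024F)                  return SC_LATIN;
    if (cp >= 0x0370 && cp <= 0x03FF)  return SC_GREEK;
    if (cp >= 0x0400 && cp <= 0x04FF)  return SC_CYRILLIC;
    if (cp >= 0x0590 && cp <= 0x05FF)  return SC_HEBREW;
    if (cp >= 0x0600 && cp <= 0x06FF)  return SC_ARABIC;
    if ((cp >= 0x3040 && cp <= 0x30FF) ||  /* Hiragana, Katakana */
        (cp >= 0x4E00 && cp <= 0x9FFF))    /* CJK ideographs */
        return SC_CJK;
    return SC_OTHER;
}

/* Count distinct non-ASCII script groups in a buffer that is assumed
 * to be well-formed UTF-8 made of 1 to 4 byte sequences. */
int utf8_script_count(const unsigned char *buf, size_t len)
{
    unsigned seen = 0;   /* bit i is set once script group i was seen */
    int count = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i++];
        unsigned long cp;
        int n;           /* continuation bytes still to read */

        if (c < 0x80)      { cp = c;        n = 0; }
        else if (c < 0xE0) { cp = c & 0x1F; n = 1; }
        else if (c < 0xF0) { cp = c & 0x0F; n = 2; }
        else               { cp = c & 0x07; n = 3; }
        while (n-- > 0 && i < len)
            cp = (cp << 6) | (buf[i++] & 0x3F);

        enum script s = script_of(cp);
        if (s != SC_NONE && !(seen & (1u << s))) {
            seen |= 1u << s;
            count++;
        }
    }
    return count;
}

/* Heuristic: a short pathname that mixes several scripts may not be
 * UTF-8 at all. The 16 byte limit is an arbitrary illustrative value. */
int utf8_looks_suspicious(const unsigned char *buf, size_t len)
{
    return len <= 16 && utf8_script_count(buf, len) > 1;
}
```

   A positive result is only a hint. A pathname such as
   //Latin1DirectoryName/HebrewFileName legitimately mixes scripts,
   which is why the sketch limits itself to short pathnames.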


A.2 Transition Considerations

   - Servers which support this specification, when presented a pathname
     from an old client (one which does not support this specification),
     can nearly always tell whether the pathname is in UTF-8 (see B.1)
     or in some other code set. In order to support these older clients,
     servers may wish to default to a non UTF-8 code set. However, how a
     server supports non UTF-8 is outside the scope of this
     specification.

   - Clients which support this specification will be able to determine
     if the server can support UTF-8 (i.e. supports this specification)
     by the ability of the server to support the FEAT command and the
     UTF8 feature (defined in 3.2). If a newer client determines that
     the server does not support UTF-8 it may wish to default to a
     different code set. Client developers should take into
     consideration that pathnames, associated with older servers, might
     be stored in UTF-8. However, how a client supports non UTF-8 is
     outside the scope of this specification.

   - Clients and servers can transition to UTF-8 either by converting
     to/from the local encoding, or by having users store UTF-8
     filenames. The former approach is easier on tightly controlled
     file systems (e.g. PCs and Macs). The latter approach is easier on
     more free form file systems (e.g. Unix).

   - For interactive use attention should be focused on user interface
     and ease of use. Non-interactive use requires a consistent and
     controlled behavior.

   - There may be many applications which reference files under their
     old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8
     will cause access to the old URL to fail. A solution may be for the
     server to act as if there were two different pathnames associated
     with the file. This might be done internal to the server on
     controlled file systems or by using symbolic links on free form
     systems. While this approach may work for single file transfer
     non-interactive use, a non-interactive transfer of all of the files
     in a directory will produce duplicates. Interactive users may be
     presented with lists of files which are double the actual number of
     files.
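
   A client side test for the UTF8 feature could look like the
   following sketch. The function name is hypothetical, and the reply
   is assumed to have been read completely into one NUL terminated
   string. Feature lines are recognized by their single leading space,
   as in the FEAT reply format of [RFC2389].

```c
#include <ctype.h>
#include <stddef.h>

/* Return 1 if a complete FEAT reply (lines ending in CRLF or LF)
 * lists the UTF8 feature, 0 otherwise. The comparison ignores case,
 * which is a lenient choice made for this sketch. */
int feat_reply_has_utf8(const char *reply)
{
    static const char want[] = "UTF8";
    const char *line = reply;

    while (*line != '\0') {
        if (line[0] == ' ') {           /* feature lines start with SP */
            const char *p = line + 1;
            size_t i = 0;

            while (want[i] != '\0' &&
                   toupper((unsigned char)p[i]) == want[i])
                i++;
            /* Accept only a whole label: it must be followed by SP,
             * CR, LF or the end of the reply. */
            if (want[i] == '\0' &&
                (p[i] == ' ' || p[i] == '\r' ||
                 p[i] == '\n' || p[i] == '\0'))
                return 1;
        }
        /* Advance to the start of the next line. */
        while (*line != '\0' && *line != '\n')
            line++;
        if (*line == '\n')
            line++;
    }
    return 0;
}
```

   A client for which this returns 0, or which receives an error reply
   to FEAT, could then fall back to a locally configured code set.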


Annex B - Sample Code and Examples

B.1 Valid UTF-8 check

   The following routine checks if a byte sequence is valid UTF-8. This
   is done by checking for the proper tagging of the first and following
   bytes to make sure they conform to the UTF-8 format. It then checks
   to assure that the data part of the UTF-8 sequence conforms to the
   proper range allowed by the encoding. Note: This routine will not
   detect characters that have not been assigned and therefore do not
   exist.

int utf8_valid(const unsigned char *buf, unsigned int len)
{
  const unsigned char *endbuf = buf + len;
  unsigned char byte2mask = 0x00, c;
  int trailing = 0;  // trailing (continuation) bytes to follow

  while (buf != endbuf)
  {
    c = *buf++;
    if (trailing)
    {
      if ((c & 0xC0) != 0x80)  // Trailing byte not tagged 10xxxxxx?
        return 0;
      if (byte2mask)           // Need to check 2nd byte for range?
      {
        if (!(c & byte2mask))  // Appropriate bits not set?
          return 0;
        byte2mask = 0x00;
      }
      trailing--;
    }
    else if ((c & 0x80) == 0x00)  // valid 1 byte UTF-8
      continue;
    else if ((c & 0xE0) == 0xC0)  // valid 2 byte UTF-8
    {
      if (!(c & 0x1E))            // Value too small for 2 bytes?
        return 0;
      trailing = 1;
    }
    else if ((c & 0xF0) == 0xE0)  // valid 3 byte UTF-8
    {
      if (!(c & 0x0F))            // Lead byte data bits all zero?
        byte2mask = 0x20;         // Then check range in next byte
      trailing = 2;
    }
    else if ((c & 0xF8) == 0xF0)  // valid 4 byte UTF-8
    {
      if (!(c & 0x07))            // Lead byte data bits all zero?
        byte2mask = 0x30;         // Then check range in next byte
      trailing = 3;
    }
    else if ((c & 0xFC) == 0xF8)  // valid 5 byte UTF-8
    {
      if (!(c & 0x03))            // Lead byte data bits all zero?
        byte2mask = 0x38;         // Then check range in next byte
      trailing = 4;
    }
    else if ((c & 0xFE) == 0xFC)  // valid 6 byte UTF-8
    {
      if (!(c & 0x01))            // Lead byte data bits all zero?
        byte2mask = 0x3C;         // Then check range in next byte
      trailing = 5;
    }
    else
      return 0;
  }
  return trailing == 0;
}
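
   The range checks in the routine above reject sequences that use more
   bytes than the value needs. The following sketch shows the same rule
   in a more direct form, by decoding a single sequence and comparing
   its length with the shortest length for the decoded value. The
   function names are hypothetical, and the sequence passed in is
   assumed to be correctly tagged already.

```c
/* Smallest number of bytes the UTF-8 format of [RFC2279] needs for a
 * UCS-4 value. */
int utf8_min_len(unsigned long ucs4)
{
    if (ucs4 < 0x80UL)       return 1;
    if (ucs4 < 0x800UL)      return 2;
    if (ucs4 < 0x10000UL)    return 3;
    if (ucs4 < 0x200000UL)   return 4;
    if (ucs4 < 0x4000000UL)  return 5;
    return 6;
}

/* Decode one sequence of len bytes (1 to 6). The lead byte and the
 * continuation bytes are assumed to be correctly tagged. */
unsigned long utf8_decode_one(const unsigned char *seq, int len)
{
    /* Mask selecting the data bits of the lead byte, by length. */
    static const unsigned char lead_mask[7] =
        { 0x00, 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
    unsigned long ucs4 = seq[0] & lead_mask[len];
    int i;

    for (i = 1; i < len; i++)
        ucs4 = (ucs4 << 6) | (seq[i] & 0x3F);
    return ucs4;
}

/* 1 if the sequence uses more bytes than its value requires. */
int utf8_is_overlong(const unsigned char *seq, int len)
{
    return utf8_min_len(utf8_decode_one(seq, len)) < len;
}
```

   For a 3 byte sequence whose lead byte is 0xE0, the value reaches
   0x800 only when bit 0x20 of the second byte is set, which is the
   condition the 0x20 mask in the routine above tests.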

B.2 Conversions

   The code examples in this section closely reflect the algorithm in
   ISO 10646 and may not present the most efficient solution for
   converting to / from UTF-8 encoding. If efficiency is an issue,
   implementers should use the appropriate bitwise operators.

   Additional code examples and numerous mapping tables can be found at
   the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.

   Note that the conversion examples below assume that the local
   character set supported in the operating system is something other
   than UCS2/UTF-16. There are some operating systems that already
   support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
   conversion will be necessary from the local character set to the UCS.

B.2.1 Conversion from Local Character Set to UTF-8

   Conversion from the local filesystem character set to UTF-8 will
   normally involve a two step process. First convert the local
   character set to the UCS; then convert the UCS to UTF-8.

   The first step in the process can be performed by maintaining a
   mapping table that includes the local character set code and the
   corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
   code for the Hebrew letter "VAV" is 0xE5. The corresponding 4 byte
   ISO/IEC 10646 code is 0x000005D5.
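
   A mapping table for the first step could be as simple as the
   following sketch for ISO/IEC 8859-8. It is deliberately partial:
   only ASCII and the Hebrew letters are covered, for example ALEF
   (0xE0, 0x000005D0) and TAV (0xFA, 0x000005EA). The function names
   and the UCS4_UNMAPPED marker are hypothetical.

```c
/* Value returned for bytes this partial table does not cover. */
#define UCS4_UNMAPPED 0xFFFFFFFFUL

/* Map one ISO/IEC 8859-8 byte to its UCS-4 code. Only two ranges are
 * handled: ASCII, which maps to itself, and the Hebrew letters
 * 0xE0..0xFA, which map to 0x05D0..0x05EA in the same order.
 * Everything else would need a full mapping table. */
unsigned long iso8859_8_to_ucs4(unsigned char local)
{
    if (local < 0x80)
        return local;
    if (local >= 0xE0 && local <= 0xFA)
        return 0x05D0UL + (unsigned long)(local - 0xE0);
    return UCS4_UNMAPPED;
}

/* Convert a buffer of local bytes. Returns the number of UCS-4 codes
 * written; conversion stops at the first unmapped byte. ucs4_buf must
 * have room for local_len entries. */
unsigned int iso8859_8_buf_to_ucs4(const unsigned char *local_buf,
                                   unsigned int local_len,
                                   unsigned long *ucs4_buf)
{
    unsigned int i;

    for (i = 0; i < local_len; i++) {
        unsigned long ucs4 = iso8859_8_to_ucs4(local_buf[i]);

        if (ucs4 == UCS4_UNMAPPED)
            break;              /* stop at the first unmapped byte */
        ucs4_buf[i] = ucs4;
    }
    return i;
}
```

   A return value smaller than local_len tells the caller that the
   name contains a byte outside the partial table, in which case a
   server might leave the pathname in its raw form.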


   The next step is to convert the UCS character code to the UTF-8
   encoding. The following routine can be used to determine and encode
   the correct number of bytes based on the UCS-4 character code:

   unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int
                              ucs4_len, unsigned char *utf8_buf)

   {
    const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
    unsigned int utf8_len = 0;        // return value for UTF8 size
    unsigned char *t_utf8_buf = utf8_buf; // Temporary pointer