Atkinson, R., Crispin, M. and P. Svanberg, "Character
Set Workshop Report", RFC 2130, April 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", RFC 2277, January 1998.
[RFC2279] Yergeau, F., "UTF-8, a transformation format of ISO
10646", RFC 2279, January 1998.
[RFC2389] Elz, R. and P. Hethmon, "Feature Negotiation Mechanism
for the File Transfer Protocol", RFC 2389, August 1998.
[UNICODE] The Unicode Consortium, "The Unicode Standard - Version
2.0", Addison-Wesley Developers Press, July 1996.
[UTF-8] ISO/IEC 10646-1:1993 AMENDMENT 2 (1996). UCS
Transformation Format 8 (UTF-8).
Curtin Proposed Standard [Page 14]
RFC 2640 FTP Internalization July 1999
9 Author's Address
Bill Curtin
JIEO
Attn: JEBBD
Ft. Monmouth, N.J. 07703-5613
EMail: curtinw@ftm.disa.mil
Annex A - Implementation Considerations
A.1 General Considerations
- Implementers should ensure that their code accounts for potential
problems, such as using a NULL character to terminate a string or
no longer being able to steal the high order bit for internal use,
when supporting the extended character set.
- Implementers should be aware that there is a chance that pathnames
that are non UTF-8 may be parsed as valid UTF-8. The probabilities
are low for some encodings and statistically zero for others.
A recent non-scientific analysis found that EUC encoded Japanese
words had a 2.7% false reading; SJIS had a 0.0005% false reading;
other encodings such as ASCII or KOI-8 have a 0% false reading. This
probability is highest for short pathnames and decreases as
pathname size increases. Implementers may want to look for signs
that pathnames which parse as UTF-8 are not valid UTF-8, such as
the existence of multiple local character sets in short pathnames.
Hopefully, as more implementations conform to UTF-8 transfer
encoding there will be a smaller need to guess at the encoding.
- Client developers should be aware that it will be possible for
pathnames to contain mixed characters (e.g.
//Latin1DirectoryName/HebrewFileName). They should be prepared to
handle the Bi-directional (BIDI) display of these character sets
(i.e. left to right display for the directory and right to left
display for the filename). While bi-directional display is outside
the scope of this document and more complicated than the above
example, an algorithm for bi-directional display can be found in
the UNICODE 2.0 [UNICODE] standard. Also note that pathnames can
have different byte ordering yet be logically and display-wise
equivalent due to the insertion of BIDI control characters at
different points during composition. Mixed character sets may also
present problems with font swapping.
- A server that copies pathnames transparently from a local
filesystem may continue to do so. It is then up to the local file
creators to use UTF-8 pathnames.
- Servers can support charset labeling of files and/or directories,
such that different pathnames may have different charsets. The
server should attempt to convert all pathnames to UTF-8, but if it
can't then it should leave that name in its raw form.
- Some servers' operating systems do not mandate a character set, but
allow administrators to configure one in the FTP server. These
servers should be configured to use a particular mapping table
(either external or built-in). This will allow the flexibility of
defining different charsets for different directories.
- If the server's OS does not mandate the character set and the FTP
server cannot be configured, the server should simply use the raw
bytes in the file name. They might be ASCII or UTF-8.
- If the server is a mirror, and wants to look just like the site it
is mirroring, it should store the exact file name bytes that it
received from the main server.
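The "convert if possible, otherwise leave raw" policy from the list
above can be sketched as follows. This is a minimal illustration, not
part of the specification: the enum, the function name name_for_wire,
and the choice of ISO 8859-1 as the only labeled charset are all
assumptions made for the example. ISO 8859-1 is used because its byte
values are identical to the corresponding UCS codes, so no mapping
table is needed. Bytes are handled as unsigned char so that the high
order bit is always treated as data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical charset labels; a real server would define its own. */
enum name_charset { NAME_CHARSET_UNKNOWN, NAME_CHARSET_LATIN1 };

/*
 * Produce the form of a stored file name that the server would send.
 * Returns the number of bytes written to out, or 0 if out_cap is too
 * small. A converted ISO 8859-1 name can need up to 2 * len bytes.
 */
size_t name_for_wire(const unsigned char *raw, size_t len,
                     enum name_charset cs,
                     unsigned char *out, size_t out_cap)
{
    size_t i, n = 0;

    if (cs == NAME_CHARSET_LATIN1) {
        if (out_cap < 2 * len)
            return 0;
        for (i = 0; i < len; i++) {
            unsigned char b = raw[i];   /* unsigned: high bit is data */
            if (b < 0x80) {
                out[n++] = b;           /* ASCII passes through */
            } else {
                /* 0x80..0xFF become a 2 byte UTF-8 sequence */
                out[n++] = (unsigned char)(0xC0 | (b >> 6));
                out[n++] = (unsigned char)(0x80 | (b & 0x3F));
            }
        }
        return n;
    }

    /* No known conversion: leave the name in its raw form. */
    if (out_cap < len)
        return 0;
    memcpy(out, raw, len);
    return len;
}
```

With these assumptions, the four raw bytes "caf" 0xE9 labeled as
ISO 8859-1 come out as the five bytes "caf" 0xC3 0xA9, while the same
bytes with no charset label are copied through unchanged.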
A.2 Transition Considerations
- Servers which support this specification, when presented a pathname
from an old client (one which does not support this specification),
can nearly always tell whether the pathname is in UTF-8 (see B.1)
or in some other code set. In order to support these older clients,
servers may wish to default to a non UTF-8 code set. However, how a
server supports non UTF-8 is outside the scope of this
specification.
- Clients which support this specification will be able to determine
if the server can support UTF-8 (i.e. supports this specification)
by the ability of the server to support the FEAT command and the
UTF8 feature (defined in 3.2). If a newer client determines that
the server does not support UTF-8, it may wish to default to a
different code set. Client developers should take into
consideration that pathnames, associated with older servers, might
be stored in UTF-8. However, how a client supports non UTF-8 is
outside the scope of this specification.
- Clients and servers can transition to UTF-8 by either converting
to/from the local encoding, or the users can store UTF-8 filenames.
The former approach is easier on tightly controlled file systems
(e.g. PCs and Macs). The latter approach is easier on more free
form file systems (e.g. Unix).
- For interactive use attention should be focused on user interface
and ease of use. Non-interactive use requires a consistent and
controlled behavior.
- There may be many applications which reference files under their
old raw pathname (e.g. linked URLs). Changing the pathname to UTF-8
will cause access to the old URL to fail. A solution may be for the
server to act as if there were two different pathnames associated with
the file. This might be done internal to the server on controlled
file systems or by using symbolic links on free form systems. While
this approach may work for single file transfer non-interactive
use, a non-interactive transfer of all of the files in a directory
will produce duplicates. Interactive users may be presented with
lists of files which are double the actual number of files.
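The client-side check described above (does the server's FEAT reply
list the UTF8 feature?) might look like the following sketch. It
assumes the reply layout of [RFC2389], in which each feature line
begins with a single space, and it assumes the caller has already
collected the complete multi-line reply into one string. The function
names are hypothetical. Letters are compared without regard to case,
which is a conservative choice made for this example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the n byte line is exactly " UTF8", ignoring case. */
static int is_utf8_feature_line(const char *line, size_t n)
{
    static const char want[] = " UTF8";
    size_t i;

    if (n != 5)
        return 0;
    for (i = 0; i < 5; i++)
        if (toupper((unsigned char)line[i]) != want[i])
            return 0;
    return 1;
}

/* Return 1 if a complete FEAT reply lists the UTF8 feature, else 0. */
int feat_lists_utf8(const char *reply)
{
    const char *line = reply;

    while (*line != '\0') {
        size_t n = strcspn(line, "\r\n");   /* length of this line */

        if (is_utf8_feature_line(line, n))
            return 1;
        line += n;
        line += strspn(line, "\r\n");       /* skip the line ending */
    }
    return 0;
}
```

A client that gets 0 back, for instance because the server answered
FEAT with an error reply, would then fall back to whatever non UTF-8
handling it has chosen.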
Annex B - Sample Code and Examples
B.1 Valid UTF-8 check
The following routine checks if a byte sequence is valid UTF-8. This
is done by checking for the proper tagging of the first and following
bytes to make sure they conform to the UTF-8 format. It then checks
to ensure that the data part of the UTF-8 sequence conforms to the
proper range allowed by the encoding. Note: This routine will not
detect characters that have not been assigned and therefore do not
exist.
int utf8_valid(const unsigned char *buf, unsigned int len)
{
    const unsigned char *endbuf = buf + len;
    unsigned char byte2mask = 0x00, c;
    int trailing = 0;   // trailing (continuation) bytes to follow

    while (buf != endbuf)
    {
        c = *buf++;
        if (trailing)
        {
            if ((c & 0xC0) != 0x80)    // Does trailing byte follow UTF-8 format?
                return 0;
            if (byte2mask)             // Need to check 2nd byte for proper range?
            {
                if (!(c & byte2mask))  // Are appropriate bits set?
                    return 0;
                byte2mask = 0x00;
            }
            trailing--;
        }
        else if ((c & 0x80) == 0x00)   // valid 1 byte UTF-8
            continue;
        else if ((c & 0xE0) == 0xC0)   // valid 2 byte UTF-8
        {
            if (!(c & 0x1E))           // Is UTF-8 byte in proper range?
                return 0;
            trailing = 1;
        }
        else if ((c & 0xF0) == 0xE0)   // valid 3 byte UTF-8
        {
            if (!(c & 0x0F))           // Is UTF-8 byte in proper range?
                byte2mask = 0x20;      // If not, set mask to check next byte
            trailing = 2;
        }
        else if ((c & 0xF8) == 0xF0)   // valid 4 byte UTF-8
        {
            if (!(c & 0x07))           // Is UTF-8 byte in proper range?
                byte2mask = 0x30;      // If not, set mask to check next byte
            trailing = 3;
        }
        else if ((c & 0xFC) == 0xF8)   // valid 5 byte UTF-8
        {
            if (!(c & 0x03))           // Is UTF-8 byte in proper range?
                byte2mask = 0x38;      // If not, set mask to check next byte
            trailing = 4;
        }
        else if ((c & 0xFE) == 0xFC)   // valid 6 byte UTF-8
        {
            if (!(c & 0x01))           // Is UTF-8 byte in proper range?
                byte2mask = 0x3C;      // If not, set mask to check next byte
            trailing = 5;
        }
        else
            return 0;
    }
    return trailing == 0;
}
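The two checks described in B.1 (byte tagging, then the range of the
data part) can also be made visible by decoding a sequence and
comparing the result with the smallest value that needs that many
bytes. The sketch below takes that decode-and-compare approach, which
is a different technique from the mask test used in the routine above.
The name utf8_decode_one and the min_value table are assumptions of
this example. Like the routine above, it accepts sequences of up to 6
bytes and does not detect unassigned characters.

```c
#include <stddef.h>

/*
 * Decode one UTF-8 sequence (1 to 6 bytes) from buf.
 * Returns the number of bytes consumed and stores the character code
 * in *ucs4, or returns 0 if the sequence is truncated, mis-tagged,
 * or longer than necessary for the value it carries.
 */
size_t utf8_decode_one(const unsigned char *buf, size_t len,
                       unsigned long *ucs4)
{
    /* Smallest value that needs a sequence of each length. */
    static const unsigned long min_value[7] =
        { 0, 0x0UL, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL };
    unsigned long value;
    size_t need, i;

    if (len == 0)
        return 0;
    if      ((buf[0] & 0x80) == 0x00) { need = 1; value = buf[0]; }
    else if ((buf[0] & 0xE0) == 0xC0) { need = 2; value = buf[0] & 0x1F; }
    else if ((buf[0] & 0xF0) == 0xE0) { need = 3; value = buf[0] & 0x0F; }
    else if ((buf[0] & 0xF8) == 0xF0) { need = 4; value = buf[0] & 0x07; }
    else if ((buf[0] & 0xFC) == 0xF8) { need = 5; value = buf[0] & 0x03; }
    else if ((buf[0] & 0xFE) == 0xFC) { need = 6; value = buf[0] & 0x01; }
    else
        return 0;               /* stray trailing byte, 0xFE or 0xFF */

    if (len < need)
        return 0;               /* sequence is cut short */
    for (i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;           /* trailing byte is not tagged 10xxxxxx */
        value = (value << 6) | (buf[i] & 0x3F);
    }
    if (value < min_value[need])
        return 0;               /* longer than necessary for this value */
    *ucs4 = value;
    return need;
}
```

For example, the bytes 0xD7 0x95 decode to 0x05D5, while the two byte
sequence 0xC0 0x80 is rejected because the value 0 fits in one byte.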
B.2 Conversions
The code examples in this section closely reflect the algorithm in
ISO 10646 and may not present the most efficient solution for
converting to / from UTF-8 encoding. If efficiency is an issue,
implementers should use the appropriate bitwise operators.
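As an illustration of the bitwise approach, the following sketch
encodes a single UCS-4 character code using only shifts and masks. It
is one possible arrangement under stated assumptions, not the routine
given in this annex: the name ucs4_char_to_utf8 and the lead_tag table
are inventions of this example, and the caller is assumed to supply a
buffer of at least 6 bytes.

```c
#include <stddef.h>

/*
 * Encode one UCS-4 character code as UTF-8 using shifts and masks.
 * utf8 must have room for 6 bytes. Returns the number of bytes
 * written, or 0 if the code is above 0x7FFFFFFF.
 */
size_t ucs4_char_to_utf8(unsigned long ucs4, unsigned char *utf8)
{
    /* Tag bits of the first byte, indexed by sequence length. */
    static const unsigned char lead_tag[7] =
        { 0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t n, i;

    if      (ucs4 < 0x80UL)        n = 1;
    else if (ucs4 < 0x800UL)       n = 2;
    else if (ucs4 < 0x10000UL)     n = 3;
    else if (ucs4 < 0x200000UL)    n = 4;
    else if (ucs4 < 0x4000000UL)   n = 5;
    else if (ucs4 <= 0x7FFFFFFFUL) n = 6;
    else
        return 0;

    /* Fill trailing bytes from the end, six data bits at a time. */
    for (i = n - 1; i > 0; i--) {
        utf8[i] = (unsigned char)(0x80 | (ucs4 & 0x3F));
        ucs4 >>= 6;
    }
    /* Whatever bits remain belong in the first byte, below its tag. */
    utf8[0] = (unsigned char)(lead_tag[n] | ucs4);
    return n;
}
```

Called with 0x05D5 this writes the two bytes 0xD7 0x95; called with
0x41 it writes the single byte 0x41.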
Additional code examples and numerous mapping tables can be found at
the Unicode site, HTTP://www.unicode.org or FTP://unicode.org.
Note that the conversion examples below assume that the local
character set supported in the operating system is something other
than UCS2/UTF-16. There are some operating systems that already
support UCS2/UTF-16 (notably Plan 9 and Windows NT). In this case no
conversion will be necessary from the local character set to the UCS.
B.2.1 Conversion from Local Character Set to UTF-8
Conversion from the local filesystem character set to UTF-8 will
normally involve a two step process. First convert the local
character set to the UCS; then convert the UCS to UTF-8.
The first step in the process can be performed by maintaining a
mapping table that includes the local character set code and the
corresponding UCS code. For instance the ISO/IEC 8859-8 [ISO-8859]
code for the Hebrew letter "VAV" is 0xE5. The corresponding 4 byte
ISO/IEC 10646 code is 0x000005D5.
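A mapping table for this first step can be quite small when the local
character set maps to the UCS in contiguous runs. The sketch below
stores such runs rather than one entry per character. It is
deliberately partial and rests on assumptions: it covers only the
ASCII range and the Hebrew letters of ISO/IEC 8859-8 (taken here as
0xE0 through 0xFA, mapping to 0x05D0 through 0x05EA), and the type,
table, and function names are hypothetical. Every other byte is
reported as unmapped.

```c
#include <stddef.h>

#define UCS_UNMAPPED 0xFFFFFFFFUL   /* sentinel: no mapping in this sketch */

/* One contiguous run of local codes and the UCS code of its first member. */
struct charset_range {
    unsigned char first, last;
    unsigned long ucs_first;
};

/* Partial table for ISO/IEC 8859-8: ASCII and the Hebrew letters only. */
static const struct charset_range iso8859_8_ranges[] = {
    { 0x00, 0x7F, 0x0000UL },   /* ASCII range, identical in the UCS */
    { 0xE0, 0xFA, 0x05D0UL },   /* Hebrew letters ALEF through TAV   */
};

/* Look up one local byte; returns its UCS code or UCS_UNMAPPED. */
unsigned long iso8859_8_to_ucs4(unsigned char c)
{
    size_t i;
    size_t count = sizeof iso8859_8_ranges / sizeof iso8859_8_ranges[0];

    for (i = 0; i < count; i++) {
        const struct charset_range *r = &iso8859_8_ranges[i];
        if (c >= r->first && c <= r->last)
            return r->ucs_first + (unsigned long)(c - r->first);
    }
    return UCS_UNMAPPED;
}
```

A server following the advice in A.1 could treat UCS_UNMAPPED as the
signal that a name cannot be converted and should be left in its raw
form.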
The next step is to convert the UCS character code to the UTF-8
encoding. The following routine can be used to determine and encode
the correct number of bytes based on the UCS-4 character code:
unsigned int ucs4_to_utf8 (unsigned long *ucs4_buf, unsigned int ucs4_len,
                           unsigned char *utf8_buf)
{
    const unsigned long *ucs4_endbuf = ucs4_buf + ucs4_len;
    unsigned int utf8_len = 0;             // return value for UTF8 size
    unsigned char *t_utf8_buf = utf8_buf;  // Temporary pointer