Network Working Group                                          B. Curtin
Request for Comments: 2640            Defense Information Systems Agency
Updates: 959                                                   July 1999
Category: Proposed Standard


           Internationalization of the File Transfer Protocol

Status of this Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1999).  All Rights Reserved.

Abstract

   The File Transfer Protocol, as defined in RFC 959 [RFC959] and RFC
   1123 Section 4 [RFC1123], is one of the oldest and most widely used
   protocols on the Internet. The protocol's primary character set, 7
   bit ASCII, has served the protocol well through the early growth
   years of the Internet. However, as the Internet becomes more global,
   there is a need to support character sets beyond 7 bit ASCII.

   This document addresses the internationalization (I18n) of FTP, which
   includes supporting the multiple character sets and languages found
   throughout the Internet community.  This is achieved by extending the
   FTP specification and giving recommendations for proper
   internationalization support.

Table of Contents

   ABSTRACT.......................................................1
   1 INTRODUCTION.................................................2
    1.1 Requirements Terminology..................................2
   2 INTERNATIONALIZATION.........................................3
    2.1 International Character Set...............................3
    2.2 Transfer Encoding Set.....................................4
   3 PATHNAMES....................................................5
    3.1 General compliance........................................5
    3.2 Servers compliance........................................6
    3.3 Clients compliance........................................7
   4 LANGUAGE SUPPORT.............................................7



Curtin                     Proposed Standard                    [Page 1]

RFC 2640                  FTP Internalization                  July 1999


    4.1 The LANG command..........................................8
    4.2 Syntax of the LANG command................................9
    4.3 Feat response for LANG command...........................11
     4.3.1 Feat examples.........................................11
   5 SECURITY CONSIDERATIONS.....................................12
   6 ACKNOWLEDGMENTS.............................................12
   7 GLOSSARY....................................................13
   8 BIBLIOGRAPHY................................................13
   9 AUTHOR'S ADDRESS............................................15
   ANNEX A - IMPLEMENTATION CONSIDERATIONS.......................16
    A.1 General Considerations...................................16
    A.2 Transition Considerations................................18
   ANNEX B - SAMPLE CODE AND EXAMPLES............................19
    B.1 Valid UTF-8 check........................................19
    B.2 Conversions..............................................20
     B.2.1 Conversion from Local Character Set to UTF-8..........20
     B.2.2 Conversion from UTF-8 to Local Character Set..........23
     B.2.3 ISO/IEC 8859-8 Example................................25
     B.2.4 Vendor Codepage Example...............................25
    B.3 Pseudo Code for Translating Servers......................26
   Full Copyright Statement......................................27

1 Introduction

   As the Internet grows throughout the world the requirement to support
   character sets outside of the ASCII [ASCII] / Latin-1 [ISO-8859]
   character set becomes ever more urgent.  For FTP, because of the
   large installed base, it is paramount that this is done without
   breaking existing clients and servers. This document addresses this
   need. In doing so it defines a solution which will still allow the
   installed base to interoperate with new clients and servers.

   This document enhances the capabilities of the File Transfer Protocol
   by removing the 7-bit restrictions on pathnames used in client
   commands and server responses, RECOMMENDs the use of a Universal
   Character Set (UCS) ISO/IEC 10646 [ISO-10646], RECOMMENDs a UCS
   transformation format (UTF) UTF-8 [UTF-8], and defines a new command
   for language negotiation.

   The recommendations made in this document are consistent with the
   recommendations expressed by the IETF policy related to character
   sets and languages as defined in RFC 2277 [RFC2277].

1.1 Requirements Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14 [BCP14].

2 Internationalization

   The File Transfer Protocol was developed when the predominant
   character sets were 7 bit ASCII and 8 bit EBCDIC. Today these
   character sets cannot support the wide range of characters needed by
   multinational systems. Given that there are a number of character
   sets in current use that provide more characters than 7-bit ASCII, it
   makes sense to decide on a convenient way to represent the union of
   those possibilities. Working globally requires either support for a
   number of character sets and the ability to convert between them, or
   the use of a single preferred character set. To assure global
   interoperability this document RECOMMENDS the latter approach and
   defines a single character set, in addition to NVT ASCII and EBCDIC,
   which is understandable by all systems. For FTP this character set
   SHALL be ISO/IEC 10646:1993.  For support of global compatibility it
   is STRONGLY RECOMMENDED that clients and servers use UTF-8 encoding
   when exchanging pathnames.  Clients and servers are, however, under
   no obligation to perform any conversion on the contents of a file for
   operations such as STOR or RETR.

   The character set used to store files SHALL remain a local decision
   and MAY depend on the capability of local operating systems. Prior to
   the exchange of pathnames they SHOULD be converted into an ISO/IEC
   10646 format and UTF-8 encoded. This approach, while allowing
   international exchange of pathnames, will still allow backward
   compatibility with older systems because the code set positions for
   ASCII characters are identical to the one byte sequence in UTF-8.

   Sections 2.1 and 2.2 give a brief description of the international
   character set and transfer encoding RECOMMENDED by this document. A
   more thorough description of UTF-8, ISO/IEC 10646, and UNICODE
   [UNICODE], beyond that given in this document, can be found in RFC
   2279 [RFC2279].

2.1 International Character Set

   The character set defined for international support of FTP SHALL be
   the Universal Character Set as defined in ISO 10646:1993 as amended.
   This standard incorporates the character sets of many existing
   international, national, and corporate standards. ISO/IEC 10646
   defines two alternate forms of encoding, UCS-4 and UCS-2. UCS-4 is a
   four byte (31 bit) encoding containing 2**31 code positions divided
   into 128 groups of 256 planes. Each plane consists of 256 rows of 256
   cells. UCS-2 is a 2 byte (16 bit) character set consisting of plane
   zero or the Basic Multilingual Plane (BMP).  Currently, no codesets
   have been defined outside of the 2 byte BMP.

   The Unicode standard version 2.0 [UNICODE] is consistent with the
   UCS-2 subset of ISO/IEC 10646. The Unicode standard version 2.0
   includes the repertoire of IS 10646 characters, amendments 1-7 of IS
   10646, and editorial and technical corrigenda.
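
   As an illustration of the UCS-4 layout described in this section,
   the following sketch splits a code position into its group, plane,
   row, and cell. The function names ucs4_position and in_bmp are
   hypothetical and are not defined by ISO/IEC 10646 or by this
   document; the 7/8/8/8 bit split is inferred from the 31 bit
   description above.

```python
def ucs4_position(code_point):
    """Split a UCS-4 code position into (group, plane, row, cell).

    Hypothetical helper: 7 bits of group, then 8 bits each of plane,
    row and cell, which together make up the 31 bit code space.
    """
    if not 0 <= code_point < 2**31:
        raise ValueError("UCS-4 code positions are limited to 31 bits")
    group = (code_point >> 24) & 0x7F
    plane = (code_point >> 16) & 0xFF
    row = (code_point >> 8) & 0xFF
    cell = code_point & 0xFF
    return group, plane, row, cell


def in_bmp(code_point):
    """Return True if the code position lies in plane zero of group zero."""
    group, plane, _row, _cell = ucs4_position(code_point)
    return group == 0 and plane == 0
```

   Under these assumptions, a code position is representable in UCS-2
   exactly when in_bmp returns True.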

2.2 Transfer Encoding

   UCS Transformation Format 8 (UTF-8), in the past referred to as UTF-2
   or UTF-FSS, SHALL be used as a transfer encoding to transmit the
   international character set. UTF-8 is a file safe encoding which
   avoids the use of byte values that have special significance during
   the parsing of pathname character strings. UTF-8 is an 8 bit encoding
   of the characters in the UCS. Some of UTF-8's benefits are that it is
   compatible with 7 bit ASCII, so it doesn't affect programs that give
   special meanings to various ASCII characters; it is immune to
   synchronization errors; its encoding rules allow for easy
   identification; and it has enough space to support a large number of
   character sets.

   UTF-8 encoding represents each UCS character as a sequence of 1 to 6
   bytes in length. For all sequences of one byte the most significant
   bit is ZERO. For all sequences of more than one byte the number of
   ONE bits in the first byte, starting from the most significant bit
   position, indicates the number of bytes in the UTF-8 sequence
   followed by a ZERO bit. For example, the first byte of a 3 byte UTF-8
   sequence would have 1110 as its most significant bits. Each
   additional byte (continuing byte) in the UTF-8 sequence contains a
   ONE bit followed by a ZERO bit as its most significant bits. The
   remaining free bit positions in the continuing bytes are used to
   identify characters in the UCS. The relationship between UCS and
   UTF-8 is demonstrated in the following table:

   UCS-4 range(hex)          UTF-8 byte sequence(binary)
   00000000 - 0000007F       0xxxxxxx
   00000080 - 000007FF       110xxxxx 10xxxxxx
   00000800 - 0000FFFF       1110xxxx 10xxxxxx 10xxxxxx
   00010000 - 001FFFFF       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   00200000 - 03FFFFFF       111110xx 10xxxxxx 10xxxxxx 10xxxxxx
                             10xxxxxx
   04000000 - 7FFFFFFF       1111110x 10xxxxxx 10xxxxxx 10xxxxxx
                             10xxxxxx 10xxxxxx
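
   The table above can be read as an encoding procedure. The following
   is a minimal sketch of that procedure, not a reference
   implementation; the name utf8_encode_ucs4 is hypothetical. It
   covers the full 1 to 6 byte range shown in the table, whereas later
   revisions of UTF-8, and Python's built-in codec, stop at four bytes.

```python
def utf8_encode_ucs4(code_point):
    """Encode one UCS-4 code position following the table above."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code position outside the 31 bit UCS-4 range")
    if code_point <= 0x7F:
        # Single byte sequence: most significant bit is ZERO.
        return bytes([code_point])
    # (upper limit of the range, sequence length, marker bits of first byte)
    ranges = [
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ]
    for upper, length, marker in ranges:
        if code_point <= upper:
            break
    out = bytearray(length)
    # Fill the continuing bytes from the end, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (code_point & 0x3F)
        code_point >>= 6
    # The remaining bits go into the first byte, below its marker bits.
    out[0] = marker | code_point
    return bytes(out)
```

   For code positions inside the BMP the result agrees with a standard
   UTF-8 encoder, so the sketch can be checked against an existing
   library before being relied upon.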

   A beneficial property of UTF-8 is that its single byte sequence is
   consistent with the ASCII character set. This feature will allow a
   transition where old ASCII-only clients can still interoperate with
   new servers that support the UTF-8 encoding.

   Another feature is that the encoding rules make it very unlikely that
   a character sequence from a different character set will be mistaken
   for a UTF-8 encoded character sequence. Clients and servers can use a
   simple routine to determine if the character set being exchanged is
   valid UTF-8. Section B.1 shows a code example of this check.
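
   A structural check of this kind might look like the sketch below.
   It is an illustration in Python and is not the sample code of
   Section B.1; the name is_valid_utf8 is hypothetical. It verifies
   only that each first byte is followed by the right number of
   continuing bytes, and does not reject overlong encodings.

```python
def is_valid_utf8(data):
    """Return True if data is structurally valid UTF-8 (1 to 6 byte forms)."""
    remaining = 0  # continuing bytes still expected for the current sequence
    for byte in data:
        if remaining:
            # Continuing bytes must have 10 as their most significant bits.
            if byte & 0xC0 != 0x80:
                return False
            remaining -= 1
        elif byte & 0x80 == 0x00:
            continue  # single byte sequence (7 bit ASCII)
        elif byte & 0xE0 == 0xC0:
            remaining = 1
        elif byte & 0xF0 == 0xE0:
            remaining = 2
        elif byte & 0xF8 == 0xF0:
            remaining = 3
        elif byte & 0xFC == 0xF8:
            remaining = 4
        elif byte & 0xFE == 0xFC:
            remaining = 5
        else:
            return False  # stray continuing byte, or 0xFE / 0xFF
    # A sequence that is cut short at the end of the data is invalid.
    return remaining == 0
```

   Pure ASCII pathnames always pass this check, which is consistent
   with the backward compatibility property described in Section 2.2.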

3 Pathnames

3.1 General compliance

   - The 7-bit restriction for pathnames exchanged is dropped.

   - Many operating systems allow the use of space <SP>, carriage return
     <CR>, and line feed <LF> characters as part of the pathname. The
     exchange of pathnames with these special command characters will
     cause the pathnames to be parsed improperly. This is because FTP
     commands associated with pathnames have the form:

      COMMAND <SP> <pathname> <CRLF>.

   To allow the exchange of pathnames containing these characters, the
   definition of pathname is changed from

     <pathname> ::= <string>   ; in BNF format
   to
     pathname = 1*(%x01..%xFF) ; in ABNF format [ABNF].

   To avoid mistaking these characters within pathnames as special
   command characters the following rules will apply:

   There MUST be only one <SP> between an FTP command and the pathname.
   Implementations MUST assume <SP> characters following the initial
   <SP> as part of the pathname. For example the pathname in STOR
   <SP><SP><SP>foo.bar<CRLF> is <SP><SP>foo.bar.

   Current implementations, which may allow multiple <SP> characters as
   separators between the command and pathname, MUST assure that they
   comply with this single <SP> convention. Note: Implementations which
   treat 3 character commands (e.g. CWD, MKD, etc.) as a fixed 4
   character command by padding the command with a trailing <SP> do
   not comply with this specification.
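
   The single <SP> convention can be sketched as follows. This is a
   minimal illustration and the name split_command is hypothetical.
   The command is separated from its argument at the first <SP> only,
   so any further <SP> characters stay in the pathname.

```python
def split_command(line):
    """Split one FTP command line into (command, argument).

    The argument is None when the line carries no <SP> at all.
    """
    if line.endswith(b"\r\n"):
        line = line[:-2]  # drop the terminating <CRLF>
    # partition() splits at the first <SP> only; later <SP>s are data.
    command, separator, argument = line.partition(b" ")
    return command.upper(), (argument if separator else None)
```

   With this sketch, split_command(b"STOR   foo.bar\r\n") yields the
   command STOR and the pathname <SP><SP>foo.bar, matching the example
   above.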

   When a <CR> character is encountered as part of a pathname it MUST be
   padded with a <NUL> character prior to sending the command. On
   receipt of a pathname containing a <CR><NUL> sequence the <NUL>
   character MUST be stripped away. This approach is described in the
   Telnet protocol [RFC854] on pages 11 and 12. For example, to store a
   pathname foo<CR><LF>boo.bar the pathname would become
   foo<CR><NUL><LF>boo.bar prior to sending the command STOR
   <SP>foo<CR><NUL><LF>boo.bar<CRLF>. Upon receipt of the altered
   pathname the <NUL> character following the <CR> would be stripped
   away to form the original pathname.
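
   The <CR><NUL> padding described above amounts to a pair of byte
   substitutions. The sketch below assumes pathnames are handled as
   byte strings; the names pad_cr and strip_cr_padding are
   hypothetical.

```python
def pad_cr(pathname):
    """Sender side: follow every <CR> in the pathname with a <NUL>."""
    return pathname.replace(b"\r", b"\r\x00")


def strip_cr_padding(pathname):
    """Receiver side: remove the <NUL> that follows each <CR>."""
    return pathname.replace(b"\r\x00", b"\r")


# Building the command from the example in the text.
original = b"foo\r\nboo.bar"
wire_command = b"STOR " + pad_cr(original) + b"\r\n"
```

   Because the pathname definition in this section excludes <NUL>
   (%x00), the padding is unambiguous, and after padding the only
   <CR><LF> pair in the command is the terminating <CRLF>.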

   - Conforming clients and servers MUST support UTF-8 for the transfer
     and receipt of pathnames. Clients and servers MAY in addition give
     users a choice of specifying interpretation of pathnames in another
     encoding. Note that configuring clients and servers to use
     character sets / encoding other than UTF-8 is outside of the scope
     of this document. While it is recognized that in certain
     operational scenarios this may be desirable, this is left as a
     quality of implementation and operational issue.

   - Pathnames are sequences of bytes.  The encoding of names that are
     valid UTF-8 sequences is assumed to be UTF-8.  The character set of
     other names is undefined. Clients and servers, unless otherwise
     configured to support a specific native character set, MUST check
     for a valid UTF-8 byte sequence to determine if the pathname being
     presented is UTF-8.

   - To avoid data loss, clients and servers SHOULD use the UTF-8
     encoded pathnames when unable to convert them to a usable code set.

   - There may be cases when the code set / encoding presented to the
     server or client cannot be determined. In such cases the raw bytes
     SHOULD be used.
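
   One possible reading of the rules above is sketched below: try
   UTF-8 first, then an optionally configured local character set, and
   otherwise keep the raw bytes. The name interpret_pathname and the
   local_charset parameter are hypothetical. Python's built-in UTF-8
   decoder is used for the validity check; it is stricter than a
   purely structural check, which is an assumption of this sketch.

```python
def interpret_pathname(raw, local_charset=None):
    """Return (pathname, encoding) for a received pathname.

    pathname is text when an encoding could be determined; otherwise
    it is the unmodified raw bytes and encoding is None.
    """
    try:
        # Names that are valid UTF-8 sequences are assumed to be UTF-8.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    if local_charset is not None:
        try:
            return raw.decode(local_charset), local_charset
        except (UnicodeDecodeError, LookupError):
            pass  # not decodable, or the charset name is unknown
    # The encoding cannot be determined: use the raw bytes.
    return raw, None
```

   Returning the raw bytes in the last case avoids the data loss that
   a lossy conversion would cause.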

3.2 Servers compliance

   - Servers MUST support the UTF-8 feature in response to the FEAT
     command [RFC2389]. The UTF-8 feature is a line containing the exact
     string "UTF8". This string is not case sensitive, but SHOULD be
     transmitted in upper case. The response to a FEAT command SHOULD
     be:

        C> feat
        S> 211- <any descriptive text>
        S>  ...
        S>  UTF8
        S>  ...
        S> 211 end

   The ellipses indicate placeholders where other features may be
   included, but are NOT REQUIRED. The one space indentation of the
   feature lines is mandatory [RFC2389].

   - Mirror servers may want to exactly reflect the site that they are
     mirroring. In such cases servers MAY store and present the exact
     pathname bytes that they received from the main server.
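
   A client might examine the FEAT response shown earlier in this
   section along the following lines. This is a sketch only; the name
   supports_utf8 is hypothetical, and the response is assumed to have
   been read in full and decoded to text already.

```python
def supports_utf8(feat_response):
    """Return True if a complete FEAT response lists the UTF8 feature."""
    lines = feat_response.splitlines()
    # Feature lines sit between the opening "211-" line and the closing
    # "211 " line, and each one begins with a single space.
    for line in lines[1:-1]:
        if line.startswith(" ") and line[1:].strip().upper() == "UTF8":
            return True
    return False
```

   The comparison is made in upper case because the feature string is
   not case sensitive.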

3.3 Clients compliance

   - Clients which do not require display of pathnames are under no
     obligation to do so. Non-display clients do not need to conform to
     requirements associated with display.

   - Clients, which are presented UTF-8 pathnames by the server, SHOULD
     parse UTF-8 correctly and attempt to display the pathname within
     the limitation of the resources available.

   - Clients MUST support the FEAT command and recognize the "UTF8"
     feature (defined in 3.2 above) to determine if a server supports
     UTF-8 encoding.

   - Character semantics of other names shall remain undefined. If a
     client detects that a server is non UTF-8, it SHOULD change its
     display appropriately. How a client implementation handles non
     UTF-8 is a quality of implementation issue. It MAY try to assume
     some other encoding, give the user a chance to try to assume
     something, or save encoding assumptions for a server from one FTP
     session to another.

   - Glyph rendering is outside the scope of this document. How a client
     presents characters it cannot display is a quality of
     implementation issue. This document RECOMMENDS that octets
     corresponding to non-displayable characters SHOULD be presented in
     URL %HH format defined in RFC 1738 [RFC1738]. They MAY, however,
     display them as question marks, with their UCS hexadecimal value,
     or in any other suitable fashion.

   - Many existing clients interpret 8-bit pathnames as being in the
     local character set. They MAY continue to do so for pathnames that
     are not valid UTF-8.
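
   The %HH presentation recommended above could be sketched as
   follows, assuming a client that can display printable ASCII only.
   The name present_pathname is hypothetical, and escaping the "%"
   character itself is a choice made in this sketch so that the output
   stays unambiguous.

```python
def present_pathname(raw):
    """Render pathname octets, showing non-displayable ones as %HH."""
    parts = []
    for byte in raw:
        # Printable ASCII is shown as is, except "%" (0x25) itself.
        if 0x20 <= byte <= 0x7E and byte != 0x25:
            parts.append(chr(byte))
        else:
            parts.append("%{:02X}".format(byte))
    return "".join(parts)
```

   A client with a richer display would widen the set of octets shown
   directly, for example after decoding valid UTF-8 sequences.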