rfc2396.txt

来自「RFC 的详细文档!」· 文本 代码 · 共 1,584 行 · 第 1/5 页

TXT
1,584
字号
   character in any particular coded character set.  How a URI is
   represented in terms of bits and bytes on the wire is dependent upon
   the character encoding of the protocol used to transport it, or the
   charset of the document which contains it.

   The following definitions are common to many elements:

      alpha    = lowalpha | upalpha

      lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
                 "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
                 "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

      upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
                 "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
                 "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"




Berners-Lee, et. al.        Standards Track                     [Page 6]

RFC 2396                   URI Generic Syntax                August 1998


      digit    = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
                 "8" | "9"

      alphanum = alpha | digit

   The complete URI syntax is collected in Appendix A.

2. URI Characters and Escape Sequences

   URI consist of a restricted set of characters, primarily chosen to
   aid transcribability and usability both in computer systems and in
   non-computer communications. Characters used conventionally as
   delimiters around URI were excluded.  The restricted set of
   characters consists of digits, letters, and a few graphic symbols
   were chosen from those common to most of the character encodings and
   input facilities available to Internet users.

      uric          = reserved | unreserved | escaped

   Within a URI, characters are either used as delimiters, or to
   represent strings of data (octets) within the delimited portions.
   Octets are either represented directly by a character (using the US-
   ASCII character for that octet [ASCII]) or by an escape encoding.
   This representation is elaborated below.

2.1 URI and non-ASCII characters

   The relationship between URI and characters has been a source of
   confusion for characters that are not part of US-ASCII. To describe
   the relationship, it is useful to distinguish between a "character"
   (as a distinguishable semantic entity) and an "octet" (an 8-bit
   byte). There are two mappings, one from URI characters to octets, and
   a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

   A URI is represented as a sequence of characters, not as a sequence
   of octets. That is because URI might be "transported" by means that
   are not through a computer network, e.g., printed on paper, read over
   the radio, etc.

   A URI scheme may define a mapping from URI characters to octets;
   whether this is done depends on the scheme. Commonly, within a
   delimited component of a URI, a sequence of characters may be used to
   represent a sequence of octets. For example, the character "a"
   represents the octet 97 (decimal), while the character sequence "%",
   "0", "a" represents the octet 10 (decimal).




Berners-Lee, et. al.        Standards Track                     [Page 7]

RFC 2396                   URI Generic Syntax                August 1998


   There is a second translation for some resources: the sequence of
   octets defined by a component of the URI is subsequently used to
   represent a sequence of characters. A 'charset' defines this mapping.
   There are many charsets in use in Internet protocols. For example,
   UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
   of characters in the repertoire of ISO 10646.

   In the simplest case, the original character sequence contains only
   characters that are defined in US-ASCII, and the two levels of
   mapping are simple and easily invertible: each 'original character'
   is represented as the octet for the US-ASCII code for it, which is,
   in turn, represented as either the US-ASCII character, or else the
   "%" escape sequence for that octet.

   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used.

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this
   specification.

2.2. Reserved Characters

   Many URI include components consisting of or delimited by, certain
   special characters.  These characters are called "reserved", since
   their usage within the URI component is limited to their reserved
   purpose.  If the data for a URI component would conflict with the
   reserved purpose, then the conflicting data must be escaped before
   forming the URI.

      reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

   The "reserved" syntax class above refers to those characters that are
   allowed within a URI, but which may not be allowed within a
   particular component of the generic URI syntax; they are used as
   delimiters of the components described in Section 3.







Berners-Lee, et. al.        Standards Track                     [Page 8]

RFC 2396                   URI Generic Syntax                August 1998


   Characters in the "reserved" set are not reserved in all contexts.
   The set of characters actually reserved within any given URI
   component is defined by that component. In general, a character is
   reserved if the semantics of the URI changes if the character is
   replaced with its escaped US-ASCII encoding.

2.3. Unreserved Characters

   Data characters that are allowed in a URI but do not have a reserved
   purpose are called unreserved.  These include upper and lower case
   letters, decimal digits, and a limited set of punctuation marks and
   symbols.

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

   Unreserved characters can be escaped without changing the semantics
   of the URI, but this should not be done unless the URI is being used
   in a context that does not allow the unescaped character to appear.

2.4. Escape Sequences

   Data must be escaped if it does not have a representation using an
   unreserved character; this includes data that does not correspond to
   a printable character of the US-ASCII coded character set, or that
   corresponds to any US-ASCII character that is disallowed, as
   explained below.

2.4.1. Escaped Encoding

   An escaped octet is encoded as a character triplet, consisting of the
   percent character "%" followed by the two hexadecimal digits
   representing the octet code. For example, "%20" is the escaped
   encoding for the US-ASCII space character.

      escaped     = "%" hex hex
      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                            "a" | "b" | "c" | "d" | "e" | "f"

2.4.2. When to Escape and Unescape

   A URI is always in an "escaped" form, since escaping or unescaping a
   completed URI might change its semantics.  Normally, the only time
   escape encodings can safely be made is when the URI is being created
   from its component parts; each component may have its own set of
   characters that are reserved, so only the mechanism responsible for
   generating or interpreting that component can determine whether or



Berners-Lee, et. al.        Standards Track                     [Page 9]

RFC 2396                   URI Generic Syntax                August 1998


   not escaping a character will change its semantics. Likewise, a URI
   must be separated into its components before the escaped characters
   within those components can be safely decoded.

   In some cases, data that could be represented by an unreserved
   character may appear escaped; for example, some of the unreserved
   "mark" characters are automatically escaped by some systems.  If the
   given URI scheme defines a canonicalization algorithm, then
   unreserved characters may be unescaped according to that algorithm.
   For example, "%7e" is sometimes used instead of "~" in an http URL
   path, but the two are equivalent for an http URL.

   Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI.  Implementers should be careful not to
   escape or unescape the same string more than once, since unescaping
   an already unescaped string might lead to misinterpreting a percent
   data character as another escaped character, or vice versa in the
   case of escaping an already escaped string.

2.4.3. Excluded US-ASCII Characters

   Although they are disallowed within the URI syntax, we include here a
   description of those US-ASCII characters that have been excluded and
   the reasons for their exclusion.

   The control characters in the US-ASCII coded character set are not
   used within a URI, both because they are non-printable and because
   they are likely to be misinterpreted by some control mechanisms.

   control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>

   The space character is excluded because significant spaces may
   disappear and insignificant spaces may be introduced when URI are
   transcribed or typeset or subjected to the treatment of word-
   processing programs.  Whitespace is also used to delimit URI in many
   contexts.

   space       = <US-ASCII coded character 20 hexadecimal>

   The angle-bracket "<" and ">" and double-quote (") characters are
   excluded because they are often used as the delimiters around URI in
   text documents and protocol fields.  The character "#" is excluded
   because it is used to delimit a URI from a fragment identifier in URI
   references (Section 4). The percent character "%" is excluded because
   it is used for the encoding of escaped characters.

   delims      = "<" | ">" | "#" | "%" | <">



Berners-Lee, et. al.        Standards Track                    [Page 10]

RFC 2396                   URI Generic Syntax                August 1998


   Other characters are excluded because gateways and other transport
   agents are known to sometimes modify such characters, or they are
   used as delimiters.

   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

   Data corresponding to excluded characters must be escaped in order to
   be properly represented within a URI.

3. URI Syntactic Components

   The URI syntax is dependent upon the scheme.  In general, absolute
   URI are written as follows:

      <scheme>:<scheme-specific-part>

   An absolute URI contains the name of the scheme being used (<scheme>)
   followed by a colon (":") and then a string (the <scheme-specific-
   part>) whose interpretation depends on the scheme.

   The URI syntax does not require that the scheme-specific-part have
   any general structure or set of semantics which is common among all
   URI.  However, a subset of URI do share a common syntax for
   representing hierarchical relationships within the namespace.  This
   "generic URI" syntax consists of a sequence of four main components:

      <scheme>://<authority><path>?<query>

   each of which, except <scheme>, may be absent from a particular URI.
   For example, some URI schemes do not allow an <authority> component,
   and others do not use a <query> component.

      absoluteURI   = scheme ":" ( hier_part | opaque_part )

   URI that are hierarchical in nature use the slash "/" character for
   separating hierarchical components.  For some file systems, a "/"
   character (used to denote the hierarchical structure of a URI) is the
   delimiter used to construct a file name hierarchy, and thus the URI
   path will look similar to a file pathname.  This does NOT imply that
   the resource is a file or that the URI maps to an actual filesystem
   pathname.

      hier_part     = ( net_path | abs_path ) [ "?" query ]

      net_path      = "//" authority [ abs_path ]

      abs_path      = "/"  path_segments




Berners-Lee, et. al.        Standards Track                    [Page 11]

RFC 2396                   URI Generic Syntax                August 1998


   URI that do not make use of the slash "/" character for separating
   hierarchical components are considered opaque by the generic URI
   parser.

      opaque_part   = uric_no_slash *uric

      uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
                      "&" | "=" | "+" | "$" | ","

   We use the term <path> to refer to both the <abs_path> and
   <opaque_part> constructs, since they are mutually exclusive for any
   given URI and can be parsed as a single component.

⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?