Network Working Group T. Berners-Lee
Request for Comments: 1630 CERN
Category: Informational June 1994
Universal Resource Identifiers in WWW
A Unifying Syntax for the Expression of
Names and Addresses of Objects on the Network
as used in the World-Wide Web
Status of this Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
IESG Note:
Note that the work contained in this memo does not describe an
Internet standard. An Internet standard for general Resource
Identifiers is under development within the IETF.
Introduction
This document defines the syntax used by the World-Wide Web
initiative to encode the names and addresses of objects on the
Internet. The web is considered to include objects accessed using an
extendable number of protocols, existing, invented for the web
itself, or to be invented in the future. Access instructions for an
individual object under a given protocol are encoded into forms of
address string. Other protocols allow the use of object names of
various forms. In order to abstract the idea of a generic object,
the web needs the concepts of the universal set of objects, and of
the universal set of names or addresses of objects.
A Universal Resource Identifier (URI) is a member of this universal
set of names in registered name spaces and addresses referring to
registered protocols or name spaces. A Uniform Resource Locator
(URL), defined elsewhere, is a form of URI which expresses an address
which maps onto an access algorithm using network protocols. Existing
URI schemes which correspond to the (still mutating) concept of IETF
URLs are listed here. The Uniform Resource Name (URN) debate attempts
to define a name space (and presumably resolution protocols) for
persistent object names. This area is not addressed by this document,
which is written in order to document existing practice and provide a
reference point for URL and URN discussions.
Berners-Lee [Page 1]
RFC 1630 URIs in WWW June 1994
The World-Wide Web protocols are discussed on the mailing list
www-talk-request@info.cern.ch, and the newsgroup comp.infosystems.www
is preferable for beginners' questions. The mailing list
uri-request@bunyip.com has discussion related particularly to the URI
issue. The author may be contacted as timbl@info.cern.ch.
This document is available in hypertext form at:
http://info.cern.ch/hypertext/WWW/Addressing/URL/URI_Overview.html
The Need For a Universal Syntax
This section describes the concept of the URI and does not form part
of the specification.
Many protocols and systems for document search and retrieval are
currently in use, and many more protocols or refinements of existing
protocols are to be expected in a field whose expansion is explosive.
These systems are aiming to achieve global search and readership of
documents across differing computing platforms, and despite a
plethora of protocols and data formats. As protocols evolve,
gateways can allow global access to remain possible. As data formats
evolve, format conversion programs can preserve global access. There
is one area, however, in which it is impractical to make conversions,
and that is in the names and addresses used to identify objects.
This is because names and addresses of objects are passed on in so
many ways, from the backs of envelopes to hypertext objects, and may
have a long life.
A common feature of almost all the data models of past and proposed
systems is something which can be mapped onto a concept of "object"
and some kind of name, address, or identifier for that object. One
can therefore define a set of name spaces in which these objects can
be said to exist.
Practical systems need to access and mix objects which are part of
different existing and proposed systems. Therefore, the concept of
the universal set of all objects, and hence the universal set of
names and addresses, in all name spaces, becomes important. This
allows names in different spaces to be treated in a common way, even
though names in different spaces have differing characteristics, as
do the objects to which they refer.
URIs
This document defines a way to encapsulate a name in any
registered name space, and label it with the name space,
producing a member of the universal set. Such an encoded and
labelled member of this set is known as a Universal Resource
Identifier, or URI.
The universal syntax allows access to objects available using
existing protocols, and may be extended as technology develops.
The specification of the URI syntax does not imply anything about
the properties of names and addresses in the various name spaces
which are mapped onto the set of URI strings. The properties
follow from the specifications of the protocols and the associated
usage conventions for each scheme.
URLs
For existing Internet access protocols, it is necessary in most
cases to define the encoding of the access algorithm into
something concise enough to be termed an address. URIs which refer
to objects accessed with existing protocols are known as "Uniform
Resource Locators" (URLs) and are listed here as used in WWW, but
to be formally defined in a separate document.
URNs
There is currently a drive to define a space of more persistent
names than any URLs. These "Uniform Resource Names" are the
subject of an IETF working group's discussions. (See Sollins and
Masinter, Functional Specifications for URNs, circulated
informally.)
The URI syntax and URL forms have been in widespread use by
World-Wide Web software since 1990.
Design Criteria and Choices
This section is not part of the specification: it is simply an
explanation of the way in which the specification was derived.
Design criteria
The syntax was designed to be:
Extensible     New naming schemes may be added later.

Complete       It is possible to encode any naming scheme.

Printable      It is possible to express any URI using 7-bit ASCII
               characters so that URIs may, if necessary, be passed
               using pen and ink.
Choices for a universal syntax
For the syntax itself there is little choice except for the order
and punctuation of the elements, and the acceptable characters and
escaping rules.
The extensibility requirement is met by allowing an arbitrary (but
registered) string to be used as a prefix. A prefix is chosen as
left to right parsing is more common than right to left. The
choice of a colon as separator of the prefix from the rest of the
URI was arbitrary.
The decoding of the rest of the string is defined as a function of
the prefix. New prefixes are introduced for new schemes as
necessary, in agreement with the registration authority. The
registration of a new scheme clearly requires the definition of
the decoding of the URI into a given name space, and a definition
of the properties and, where applicable, resolution protocols, for
the name space.
The completeness requirement is easily met by allowing
particularly strange or plain binary names to be encoded in base
16 or 64 using the acceptable characters.
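As an illustration of the completeness requirement, the following Python sketch encodes an arbitrary binary name in base 16 so that only digits and the letters A to F remain. The function names and the sample bytes are assumptions made for this example; no scheme described here prescribes them.

```python
import base64

def encode_binary_name(raw: bytes) -> str:
    """Encode arbitrary bytes as base 16 (hexadecimal) text."""
    return base64.b16encode(raw).decode("ascii")

def decode_binary_name(text: str) -> bytes:
    """Recover the original bytes from their base 16 form."""
    return base64.b16decode(text)

# Hypothetical binary name containing a NUL, an 8-bit value,
# a space and DEL: none of these could appear directly in a URI.
raw_name = bytes([0x00, 0xFF, 0x20, 0x7F])
encoded = encode_binary_name(raw_name)
print(encoded)  # 00FF207F
```

Base 16 doubles the length of the name. Base 64 is more compact, but its usual alphabet includes "+" and "/", so a scheme choosing it would need to pick an alphabet that avoids reserved characters.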
The printability requirement could have been met by requiring all
schemes to encode characters not part of a basic set. This led to
many discussions of what the basic set should be. A difficult
case, for example, is when an ISO Latin-1 string appears in a URL,
and within an application with ISO Latin-1 capability, it can be
handled intact. However, for transport in general, the non-ASCII
characters need to be escaped.
The solution to this was to specify a safe set of characters, and
a general escaping scheme which may be used for encoding "unsafe"
characters. This "safe" set is suitable, for example, for use in
electronic mail. This is the canonical form of a URI.
The choice of escape character for introducing representations of
non-allowed characters also tends to be a matter of taste. An
ANSI standard exists in the C language, using the back-slash
character "\". The use of this character on unix command lines,
however, can be a problem as it is interpreted by many shell
programs, and would itself have to be escaped. It is also a
character which is not available on certain keyboards. The equals
sign is commonly used in the encoding of names having
attribute=value pairs. The percent sign was eventually chosen as
a suitable escape character.
There is a conflict between the need to be able to represent many
characters including spaces within a URI directly, and the need to
be able to use a URI in environments which have limited character
sets or in which certain characters are prone to corruption. This
conflict has been resolved by use of a hexadecimal escaping
method which may be applied to any characters forbidden in a given
context. When URLs are moved between contexts, the set of
characters escaped may be enlarged or reduced unambiguously.
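The enlarging and reducing of the escaped set can be sketched in Python with the standard urllib.parse module. The sample name and the two "safe" sets are assumptions chosen for illustration; the point is only that both encodings decode to the same string.

```python
from urllib.parse import quote, unquote

# Hypothetical name containing a space, an apostrophe and a comma.
name = "O'Brien report,final.txt"

# Restrictive context: escape everything except letters, digits
# and the characters "_.-~", which quote() never escapes.
narrow = quote(name, safe="")

# More permissive context: the apostrophe and comma may stay as they are.
wide = quote(name, safe="',")

print(narrow)  # O%27Brien%20report%2Cfinal.txt
print(wide)    # O'Brien%20report,final.txt

# Either form decodes back to the same name, so the escaped set
# can change between contexts without ambiguity.
print(unquote(narrow) == unquote(wide) == name)  # True
```

This only holds for characters that have no reserved meaning in the scheme; escaping or unescaping a reserved delimiter would change how the URI is parsed.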
The use of white space characters is risky in URIs to be printed
or sent by electronic mail, and the use of multiple white space
characters is very risky. This is because of the frequent
introduction of extraneous white space when lines are wrapped by
systems such as mail, or sheer necessity of narrow column width,
and because of the inter-conversion of various forms of white
space which occurs during character code conversion and the
transfer of text between applications. This is why the canonical
form for URIs has all white spaces encoded.
Recommendations
This section describes the syntax for URIs as used in the World-Wide
Web initiative. The generic syntax provides a framework for new
schemes for names to be resolved using as yet undefined protocols.
URI syntax
A complete URI consists of a naming scheme specifier followed by a
string whose format is a function of the naming scheme. For locators
of information on the Internet, a common syntax is used for the IP
address part. A BNF description of the URL syntax is given in a
later section. The components are as follows. Fragment identifiers
and relative URIs are not involved in the basic URL definition.
SCHEME
Within the URI of an object, the first element is the name of the
scheme, separated from the rest of the object by a colon.
PATH
The rest of the URI follows the colon in a format depending on the
scheme. The path is interpreted in a manner dependent on the
protocol being used. However, when it contains slashes, these
must imply a hierarchical structure.
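The division into scheme and path can be sketched as follows in Python. The helper names split_scheme and path_segments are hypothetical, and the sketch handles only complete URIs; what each segment means is still up to the scheme.

```python
from typing import List, Tuple

def split_scheme(uri: str) -> Tuple[str, str]:
    """Split a complete URI at the first colon into (scheme, rest)."""
    scheme, colon, rest = uri.partition(":")
    if not colon or not scheme:
        raise ValueError("not a complete URI: no scheme prefix found")
    return scheme, rest

def path_segments(rest: str) -> List[str]:
    """Split the part after the colon at slashes, most significant first."""
    return rest.split("/")

scheme, rest = split_scheme(
    "http://info.cern.ch/hypertext/WWW/Addressing/URL/URI_Overview.html")
print(scheme)  # http
print(rest)    # //info.cern.ch/hypertext/WWW/Addressing/URL/URI_Overview.html
print(path_segments(rest)[-1])  # URI_Overview.html
```

Splitting at the first colon only, rather than at every colon, leaves any later colons for the scheme to interpret.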
Reserved characters
The path in the URI has a significance defined by the particular
scheme. Typically, it is used to encode a name in a given name
space, or an algorithm for accessing an object. In either case, the
encoding may use those characters allowed by the BNF syntax, or
hexadecimal encoding of other characters.
Some of the reserved characters have special uses as defined here.
THE PERCENT SIGN
The percent sign ("%", ASCII 25 hex) is used as the escape
character in the encoding scheme and is never allowed for anything
else.
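A consequence is that a literal percent sign must itself be escaped. The Python sketch below assumes the escape form is a percent sign followed by two hexadecimal digits and maps each escape to the character with that code; the function names are hypothetical.

```python
import re

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def escape_percent(text: str) -> str:
    """Escape literal percent signs as %25 (25 hex is the code for "%")."""
    return text.replace("%", "%25")

def unescape(text: str) -> str:
    """Decode %XX escapes; reject a percent sign used for anything else."""
    if _STRAY_PERCENT.search(text):
        raise ValueError("percent sign not followed by two hex digits")
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(escape_percent("100% sure"))  # 100%25 sure
print(unescape("100%25%20sure"))    # 100% sure
```

The decoder makes a single left-to-right pass, so the "%" produced by decoding %25 is never re-read as the start of another escape.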
HIERARCHICAL FORMS
The slash ("/", ASCII 2F hex) character is reserved for the
delimiting of substrings whose relationship is hierarchical. This
enables partial forms of the URI. Substrings consisting of single
or double dots ("." or "..") are similarly reserved.
The significance of the slash between two segments is that the
segment of the path to the left is more significant than the
segment of the path to the right. ("Significance" in this case
refers solely to closeness to the root of the hierarchical
structure and makes no value judgement!)
Note
The similarity to unix and other disk operating system filename
conventions should be taken as purely coincidental, and should
not be taken to indicate that URIs should be interpreted as
file names.
HASH FOR FRAGMENT IDENTIFIERS
The hash ("#", ASCII 23 hex) character is reserved as a delimiter
to separate the URI of an object from a fragment identifier.
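Separating the two parts can be sketched in Python as below. The function name split_fragment and the fragment value "z1" are assumptions made for this example.

```python
from typing import Optional, Tuple

def split_fragment(reference: str) -> Tuple[str, Optional[str]]:
    """Split at the first "#" into (object URI, fragment identifier)."""
    uri, hash_sign, fragment = reference.partition("#")
    # No hash at all is reported as None, which is distinct from
    # a hash followed by an empty fragment identifier.
    return uri, (fragment if hash_sign else None)

uri, fragment = split_fragment(
    "http://info.cern.ch/hypertext/WWW/Addressing/URL/URI_Overview.html#z1")
print(uri)       # http://info.cern.ch/hypertext/WWW/Addressing/URL/URI_Overview.html
print(fragment)  # z1
```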
QUERY STRINGS
The question mark ("?", ASCII 3F hex) is used to delimit the
boundary between the URI of a queryable object, and a set of words
used to express a query on that object. When this form is used,
the combined URI stands for the object which results from the
query being applied to the original object.
Within the query string, the plus sign is reserved as shorthand
notation for a space. Therefore, real plus signs must be encoded.
This method was used to make query URIs easier to pass in systems
which did not allow spaces.
The query string represents some operation applied to the object,
but this specification gives no common syntax or semantics for it.
In practice the syntax and semantics may depend on the scheme and
may even depend on the base URI.
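The plus-for-space convention can be sketched in Python with urllib.parse. The base URI "http://info.cern.ch/index" and the query words are hypothetical, as are the helper names.

```python
from urllib.parse import quote_plus, unquote_plus

def make_query_uri(base_uri: str, words: str) -> str:
    """Append a query, writing spaces as "+" and real plus signs as %2B."""
    return base_uri + "?" + quote_plus(words)

def read_query(uri: str) -> str:
    """Recover the query words from the part after the first "?"."""
    _, _, query = uri.partition("?")
    return unquote_plus(query)

query_uri = make_query_uri("http://info.cern.ch/index", "C++ name space")
print(query_uri)              # http://info.cern.ch/index?C%2B%2B+name+space
print(read_query(query_uri))  # C++ name space
```

Because the real plus signs are escaped before the spaces are replaced, the decoder can tell the two apart.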
OTHER RESERVED CHARACTERS
The asterisk ("*", ASCII 2A hex) and exclamation mark ("!", ASCII
21 hex) are reserved for use as having special significance within
specific schemes.
Unsafe characters
In canonical form, certain characters such as spaces, control
characters, some characters whose ASCII code is used differently in
different national character variant 7-bit sets, and all 8-bit
characters beyond DEL (7F hex) of the ISO Latin-1 set, shall not be
used unencoded. This is a recommendation for trouble-free
interchange, and as indicated below, the encoded set may be extended
or reduced.
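A minimal Python sketch of this recommendation follows. It assumes the text is ISO Latin-1 and that escapes take the form of a percent sign followed by two hexadecimal digits; the function name canonicalize is hypothetical. It encodes spaces, control characters and 8-bit characters only. The national-variant characters are left alone, since they are not enumerated here, and so is the percent sign.

```python
def canonicalize(text: str) -> str:
    """Escape spaces, control characters and 8-bit characters as %XX."""
    out = []
    for byte in text.encode("latin-1"):
        # 00-1F hex are control characters, 20 hex is the space,
        # 7F hex is DEL, and everything above is 8-bit.
        if byte <= 0x20 or byte >= 0x7F:
            out.append("%{:02X}".format(byte))
        else:
            out.append(chr(byte))
    return "".join(out)

print(canonicalize("caf\u00e9 menu\t1"))  # caf%E9%20menu%091
```

Text containing characters outside ISO Latin-1 raises UnicodeEncodeError in this sketch rather than being silently altered.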