📄 rfc2047.txt
字号:
Network Working Group K. Moore
Request for Comments: 2047 University of Tennessee
Obsoletes: 1521, 1522, 1590 November 1996
Category: Standards Track
MIME (Multipurpose Internet Mail Extensions) Part Three:
Message Header Extensions for Non-ASCII Text
Status of this Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Abstract
STD 11, RFC 822, defines a message representation protocol specifying
considerable detail about US-ASCII message headers, and leaves the
message content, or message body, as flat US-ASCII text. This set of
documents, collectively called the Multipurpose Internet Mail
Extensions, or MIME, redefines the format of messages to allow for
(1) textual message bodies in character sets other than US-ASCII,
(2) an extensible set of different formats for non-textual message
bodies,
(3) multi-part message bodies, and
(4) textual header information in character sets other than US-ASCII.
These documents are based on earlier work documented in RFC 934, STD
11, and RFC 1049, but extends and revises them. Because RFC 822 said
so little about message bodies, these documents are largely
orthogonal to (rather than a revision of) RFC 822.
This particular document is the third document in the series. It
describes extensions to RFC 822 to allow non-US-ASCII text data in
Internet mail header fields.
Moore Standards Track [Page 1]
RFC 2047 Message Header Extensions November 1996
Other documents in this series include:
+ RFC 2045, which specifies the various headers used to describe
the structure of MIME messages.
+ RFC 2046, which defines the general structure of the MIME media
typing system and defines an initial set of media types,
+ RFC 2048, which specifies various IANA registration procedures
for MIME-related facilities, and
+ RFC 2049, which describes MIME conformance criteria and
provides some illustrative examples of MIME message formats,
acknowledgements, and the bibliography.
These documents are revisions of RFCs 1521, 1522, and 1590, which
themselves were revisions of RFCs 1341 and 1342. An appendix in RFC
2049 describes differences and changes from previous versions.
1. Introduction
RFC 2045 describes a mechanism for denoting textual body parts which
are coded in various character sets, as well as methods for encoding
such body parts as sequences of printable US-ASCII characters. This
memo describes similar techniques to allow the encoding of non-ASCII
text in various portions of a RFC 822 [2] message header, in a manner
which is unlikely to confuse existing message handling software.
Like the encoding techniques described in RFC 2045, the techniques
outlined here were designed to allow the use of non-ASCII characters
in message headers in a way which is unlikely to be disturbed by the
quirks of existing Internet mail handling programs. In particular,
some mail relaying programs are known to (a) delete some message
header fields while retaining others, (b) rearrange the order of
addresses in To or Cc fields, (c) rearrange the (vertical) order of
header fields, and/or (d) "wrap" message headers at different places
than those in the original message. In addition, some mail reading
programs are known to have difficulty correctly parsing message
headers which, while legal according to RFC 822, make use of
backslash-quoting to "hide" special characters such as "<", ",", or
":", or which exploit other infrequently-used features of that
specification.
While it is unfortunate that these programs do not correctly
interpret RFC 822 headers, to "break" these programs would cause
severe operational problems for the Internet mail system. The
extensions described in this memo therefore do not rely on little-
used features of RFC 822.
Moore Standards Track [Page 2]
RFC 2047 Message Header Extensions November 1996
Instead, certain sequences of "ordinary" printable ASCII characters
(known as "encoded-words") are reserved for use as encoded data. The
syntax of encoded-words is such that they are unlikely to
"accidentally" appear as normal text in message headers.
Furthermore, the characters used in encoded-words are restricted to
those which do not have special meanings in the context in which the
encoded-word appears.
Generally, an "encoded-word" is a sequence of printable ASCII
characters that begins with "=?", ends with "?=", and has two "?"s in
between. It specifies a character set and an encoding method, and
also includes the original text encoded as graphic ASCII characters,
according to the rules for that encoding method.
A mail composer that implements this specification will provide a
means of inputting non-ASCII text in header fields, but will
translate these fields (or appropriate portions of these fields) into
encoded-words before inserting them into the message header.
A mail reader that implements this specification will recognize
encoded-words when they appear in certain portions of the message
header. Instead of displaying the encoded-word "as is", it will
reverse the encoding and display the original text in the designated
character set.
NOTES
This memo relies heavily on notation and terms defined RFC 822 and
RFC 2045. In particular, the syntax for the ABNF used in this memo
is defined in RFC 822, as well as many of the terminal or nonterminal
symbols from RFC 822 are used in the grammar for the header
extensions defined here. Among the symbols defined in RFC 822 and
referenced in this memo are: 'addr-spec', 'atom', 'CHAR', 'comment',
'CTLs', 'ctext', 'linear-white-space', 'phrase', 'quoted-pair'.
'quoted-string', 'SPACE', and 'word'. Successful implementation of
this protocol extension requires careful attention to the RFC 822
definitions of these terms.
When the term "ASCII" appears in this memo, it refers to the "7-Bit
American Standard Code for Information Interchange", ANSI X3.4-1986.
The MIME charset name for this character set is "US-ASCII". When not
specifically referring to the MIME charset name, this document uses
the term "ASCII", both for brevity and for consistency with RFC 822.
However, implementors are warned that the character set name must be
spelled "US-ASCII" in MIME message and body part headers.
Moore Standards Track [Page 3]
RFC 2047 Message Header Extensions November 1996
This memo specifies a protocol for the representation of non-ASCII
text in message headers. It specifically DOES NOT define any
translation between "8-bit headers" and pure ASCII headers, nor is
any such translation assumed to be possible.
2. Syntax of encoded-words
An 'encoded-word' is defined by the following ABNF grammar. The
notation of RFC 822 is used, with the exception that white space
characters MUST NOT appear between components of an 'encoded-word'.
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
charset = token ; see section 3
encoding = token ; see section 4
token = 1*<Any CHAR except SPACE, CTLs, and especials>
especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
<"> / "/" / "[" / "]" / "?" / "." / "="
encoded-text = 1*<Any printable ASCII character other than "?"
or SPACE>
; (but see "Use of encoded-words in message
; headers", section 5)
Both 'encoding' and 'charset' names are case-independent. Thus the
charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the
encoding named "Q" may be spelled either "Q" or "q".
An 'encoded-word' may not be more than 75 characters long, including
'charset', 'encoding', 'encoded-text', and delimiters. If it is
desirable to encode more text than will fit in an 'encoded-word' of
75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
be used.
While there is no limit to the length of a multiple-line header
field, each line of a header field that contains one or more
'encoded-word's is limited to 76 characters.
The length restrictions are included both to ease interoperability
through internetwork mail gateways, and to impose a limit on the
amount of lookahead a header parser must employ (while looking for a
final ?= delimiter) before it can decide whether a token is an
"encoded-word" or something else.
Moore Standards Track [Page 4]
RFC 2047 Message Header Extensions November 1996
IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
by an RFC 822 parser. As a consequence, unencoded white space
characters (such as SPACE and HTAB) are FORBIDDEN within an
'encoded-word'. For example, the character sequence
=?iso-8859-1?q?this is some text?=
would be parsed as four 'atom's, rather than as a single 'atom' (by
an RFC 822 parser) or 'encoded-word' (by a parser which understands
'encoded-words'). The correct way to encode the string "this is some
text" is to encode the SPACE characters as well, e.g.
=?iso-8859-1?q?this=20is=20some=20text?=
The characters which may appear in 'encoded-text' are further
restricted by the rules in section 5.
3. Character sets
The 'charset' portion of an 'encoded-word' specifies the character
set associated with the unencoded text. A 'charset' can be any of
the character set names allowed in an MIME "charset" parameter of a
"text/plain" body part, or any character set name registered with
IANA for use with the MIME text/plain content-type.
Some character sets use code-switching techniques to switch between
"ASCII mode" and other modes. If unencoded text in an 'encoded-word'
contains a sequence which causes the charset interpreter to switch
out of ASCII mode, it MUST contain additional control codes such that
ASCII mode is again selected at the end of the 'encoded-word'. (This
rule applies separately to each 'encoded-word', including adjacent
'encoded-word's within a single header field.)
When there is a possibility of using more than one character set to
represent the text in an 'encoded-word', and in the absence of
private agreements between sender and recipients of a message, it is
recommended that members of the ISO-8859-* series be used in
preference to other character sets.
4. Encodings
Initially, the legal values for "encoding" are "Q" and "B". These
encodings are described below. The "Q" encoding is recommended for
use when most of the characters to be encoded are in the ASCII
character set; otherwise, the "B" encoding should be used.
Nevertheless, a mail reader which claims to recognize 'encoded-word's
MUST be able to accept either encoding for any character set which it
supports.
Moore Standards Track [Page 5]
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -