RFC 789
Vulnerabilities of Network Control Protocols: An Example
Eric C. Rosen
Bolt Beranek and Newman Inc.
RFC 789 Bolt Beranek and Newman Inc.
Eric C. Rosen
This paper has appeared in the January 1981 edition of the
SIGSOFT Software Engineering Notes, and will soon appear in the
SIGCOMM Computer Communications Review. It is being circulated
as an RFC because it is thought that it may be of interest to a
wider audience, particularly to the internet community. It is a
case study of a particular kind of problem that can arise in
large distributed systems, and of the approach used in the
ARPANET to deal with one such problem.
On October 27, 1980, there was an unusual occurrence on the
ARPANET. For a period of several hours, the network appeared to
be unusable, due to what was later diagnosed as a high-priority
software process running out of control. Network-wide
disturbances are extremely unusual in the ARPANET (none has
occurred in several years), and as a result, many people have
expressed interest in learning more about the etiology of this
particular incident. The purpose of this note is to explain what
the symptoms of the problem were, what the underlying causes
were, and what lessons can be drawn. As we shall see, the
immediate cause of the problem was a rather freakish hardware
malfunction (one not likely to recur) that caused a faulty
sequence of network control packets to be generated. This faulty
sequence of control packets in turn affected the apportionment of
software resources in the IMPs, causing one of the IMP processes
to use an excessive amount of resources, to the detriment of
other IMP processes. Restoring the network to operational
condition was a relatively straightforward task. There was no
damage other than the outage itself, and no residual problems
once the network was restored. Nevertheless, it is quite
interesting to see the way in which unusual (indeed, unique)
circumstances can bring out vulnerabilities in network control
protocols, and that shall be the focus of this paper.
The problem began suddenly when we discovered that, with
very few exceptions, no IMP was able to communicate reliably with
any other IMP. Attempts to go from a TIP to a host on some other
IMP only brought forth the "net trouble" error message,
indicating that no physical path existed between the pair of
IMPs. Connections which already existed were summarily broken.
A flood of phone calls to the Network Control Center (NCC) from
all around the country indicated that the problem was not
localized, but rather seemed to be affecting virtually every IMP.
As a first step towards trying to find out what the state of
the network actually was, we dialed up a number of TIPs around
the country. What we generally found was that the TIPs were up,
but that their lines were down. That is, the TIPs were
communicating properly with the user over the dial-up line, but
no connections to other IMPs were possible.
We tried manually restarting a number of IMPs which are in
our own building (after taking dumps, of course). This procedure
initializes all of the IMPs' dynamic data structures, and will
often clear up problems which arise when, as sometimes happens in
most complex software systems, the IMPs' software gets into a
"funny" state. The IMPs which were restarted worked well until
they were connected to the rest of the net, after which they
exhibited the same complex of symptoms as the IMPs which had not
been restarted.
From the facts so far presented, we were able to draw a
number of conclusions. Any problem which affects all IMPs
throughout the network is usually a routing problem. Restarting
an IMP re-initializes the routing data structures, so the fact
that restarting an IMP did not alleviate the problem in that IMP
suggested that the problem was due to one or more "bad" routing
updates circulating in the network. IMPs which were restarted
would just receive the bad updates from those of their neighbors
which were not restarted. The fact that IMPs seemed unable to
keep their lines up was also a significant clue as to the nature
of the problem. Each pair of neighboring IMPs runs a line
up/down protocol to determine whether the line connecting them is
of sufficient quality to be put into operation. This protocol
involves the sending of HELLO and I-HEARD-YOU messages. We have
noted in the past that under conditions of extremely heavy CPU
utilization, so many buffers can pile up waiting to be served by
the bottleneck CPU process, that the IMPs are unable to acquire
the buffers needed for receiving the HELLO or I-HEARD-YOU
messages. If a condition like this lasts for any length of time,
the IMPs may not be able to run the line up/down protocol, and
lines will be declared down by the IMPs' software. On the basis
of all these facts, our tentative conclusion was that some
malformed update was causing the routing process in the IMPs to
use an excessive amount of CPU time, possibly even to be running
in an infinite loop. (This would be quite a surprise though,
since we tried very hard to protect ourselves against malformed
updates when we designed the routing process.) As we shall see,
this tentative conclusion, although on the right track, was not
quite correct, and the actual situation turned out to be much
more complex.
When we examined core dumps from several IMPs, we noted that
most, in some cases all, of the IMPs' buffers contained routing
updates waiting to be processed. Before describing this
situation further, it is necessary to explain some of the details
of the routing algorithm's updating scheme. (The following
explanation will of course be very brief and incomplete. Readers
with a greater level of interest are urged to consult the
references.) Every so often, each IMP generates a routing update
indicating which other IMPs are its immediate neighbors over
operational lines, and the average per-packet delay (in
milliseconds) over each such line. Every IMP is required to generate
such an update at least once per minute, and no IMP is permitted
to generate more than a dozen such updates over the course of a
minute. Each update has a 6-bit sequence number which is
advanced by 1 (modulo 64) for each successive update generated by
a particular IMP. If two updates generated by the same IMP have
sequence numbers n and m, update n is considered to be LATER
(i.e., more recently generated) than update m if and only if one
of the following two conditions holds:
(a) n > m, and n - m <= 32
(b) n < m, and m - n > 32
(where the comparisons and subtractions treat n and m as unsigned
6-bit numbers, with no modulus). When an IMP generates an
update, it sends a copy of the update to each neighbor. When an
IMP A receives an update u1 which was generated by a different
IMP B, it first compares the sequence number of u1 with the
sequence number of the last update, u2, that it accepted from B.
If this comparison indicates that u2 is LATER than u1, u1 is
simply discarded. If, on the other hand, u1 appears to be the
LATER update, IMP A will send u1 to all its neighbors (including
the one from which it was received). The sequence number of u1
will be retained in A's tables as the LATEST received update from
B. Of course, u1 is always accepted if A has seen no previous
update from B. Note that this procedure is designed to ensure
that an update generated by a particular IMP is received,
unchanged, by all other IMPs in the network, IN THE PROPER
SEQUENCE. Each routing update is broadcast (or flooded) to all
IMPs, not just to immediate neighbors of the IMP which generated
the update (as in some other routing algorithms). The purpose of
the sequence numbers is to ensure that all IMPs will agree as to
which update from a given IMP is the most recently generated
update from that IMP.
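The comparison rule and the accept-and-flood step described above
can be sketched in a few lines of Python. This is only an
illustration, not the IMP code: the names is_later, Imp, latest and
receive are hypothetical, and the choice to discard an update whose
sequence number equals the last accepted one is an assumption (the
text specifies only the two LATER cases).

```python
SEQ_BITS = 6
SEQ_SPACE = 1 << SEQ_BITS   # 64 possible sequence numbers
HALF = SEQ_SPACE // 2       # 32


def is_later(n, m):
    """True if update n is LATER than update m under conditions (a) and (b)."""
    if not (0 <= n < SEQ_SPACE and 0 <= m < SEQ_SPACE):
        raise ValueError("sequence numbers must fit in 6 bits")
    if n > m:
        return n - m <= HALF    # condition (a)
    if n < m:
        return m - n > HALF     # condition (b)
    return False                # equal numbers: neither is LATER


class Imp:
    """Hypothetical per-IMP state: last accepted sequence number per originator."""

    def __init__(self, name, neighbors):
        self.name = name
        self.neighbors = list(neighbors)
        self.latest = {}        # originating IMP -> last accepted sequence number

    def receive(self, origin, seq):
        """Return the neighbors the update is forwarded to (empty if discarded)."""
        last = self.latest.get(origin)
        if last is not None and not is_later(seq, last):
            # Assumption: an update that is not LATER (including an equal
            # sequence number) is discarded rather than flooded again.
            return []
        self.latest[origin] = seq
        # Forward to every neighbor, including the one the update came from.
        return list(self.neighbors)


a = Imp("A", ["X", "Y"])
print(a.receive("B", 5))    # first update from B is always accepted
print(a.receive("B", 4))    # older update is discarded
print(is_later(1, 63))      # wrap-around: 1 follows 63
```

Because the rule compares numbers on a circle of 64 values, it is
not a total order: under this sketch 44 is LATER than 40, 8 is LATER
than 44, and 40 is LATER than 8.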
For reliability, there is a protocol for retransmitting
updates over individual links. Let X and Y be neighboring IMPs,
and let A be a third IMP. Suppose X receives an update which was
generated by A, and transmits it to Y. Now if in the next 100 ms
or so, X does not receive from Y an update which originated at A
and whose sequence number is at least as recent as that of the
update X sent to Y, X concludes that its transmission of the
update did not get through to Y, and that a retransmission is
required. (This conclusion is warranted, since an update which
is received and adjudged to be the most recent from its
originating IMP is sent to all neighbors, including the one from
which it was received.) The IMPs do not keep the original update
packets buffered pending retransmission. Rather, all the
information in the update packet is kept in tables, and the
packet is re-created from the tables if necessary for a
retransmission.
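A minimal sketch of this per-link bookkeeping follows, from the
point of view of IMP X and one neighbor Y. All names here
(NeighborLink, record_send, record_echo, origins_to_retransmit,
rebuild_packet) and the table layout are hypothetical, and the fixed
100 ms threshold stands in for the "100 ms or so" of the text. The
echo from Y serves as the acknowledgment, and a retransmitted packet
is rebuilt from the tables rather than kept in a buffer.

```python
RETRANSMIT_AFTER_MS = 100   # assumption: the text says "100 ms or so"
HALF = 32                   # half of the 64-value sequence space


def is_later(n, m):
    """LATER comparison for 6-bit sequence numbers, conditions (a) and (b)."""
    return (n > m and n - m <= HALF) or (n < m and m - n > HALF)


class NeighborLink:
    """Hypothetical state kept by X for updates sent to one neighbor Y."""

    def __init__(self):
        self.awaiting_echo = {}     # originating IMP -> (seq sent, time sent in ms)

    def record_send(self, origin, seq, now_ms):
        self.awaiting_echo[origin] = (seq, now_ms)

    def record_echo(self, origin, seq):
        """Y sent us an update from `origin`; it acknowledges our
        transmission if it is at least as recent as what we sent."""
        entry = self.awaiting_echo.get(origin)
        if entry is None:
            return
        sent_seq, _ = entry
        if seq == sent_seq or is_later(seq, sent_seq):
            del self.awaiting_echo[origin]

    def origins_to_retransmit(self, now_ms):
        """Originators whose update has gone unacknowledged for too long."""
        return sorted(origin
                      for origin, (_, sent_at) in self.awaiting_echo.items()
                      if now_ms - sent_at >= RETRANSMIT_AFTER_MS)


def rebuild_packet(tables, origin):
    """Re-create an update packet from table contents (layout is assumed)."""
    seq, delays = tables[origin]
    return {"origin": origin, "seq": seq, "delays": dict(delays)}


tables = {"A": (7, {"B": 12})}      # A's update 7: neighbor B, 12 ms delay
link = NeighborLink()
link.record_send("A", 7, now_ms=0)
for origin in link.origins_to_retransmit(now_ms=100):
    packet = rebuild_packet(tables, origin)
    link.record_send(origin, packet["seq"], now_ms=100)
```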
This transmission protocol ("flooding") distributes the
routing updates in a very rapid and reliable manner. Once
generated by an IMP, an update will almost always reach all other
IMPs in a time period on the order of 100 ms. Since an IMP can
generate no more than a dozen updates per minute, and there are
64 possible sequence numbers, sequence number wrap-around is not
a problem. There is only one exception to this. Suppose two
IMPs A and B are out of communication for a period of time
because there is no physical path between them. (This may be due
either to a network partition, or to a more mundane occurrence,
such as one of the IMPs being down.) When communication is
re-established, A and B have no way of knowing how long they have
been out of communication, or how many times the other's sequence
numbers may have wrapped around. Comparing the sequence number
of a newly received update with the sequence number of an update
received before the outage may give an incorrect result. To deal
with this problem, the following scheme is adopted. Let t0 be
the time at which IMP A receives update number n generated by IMP
B. Let t1 be t0 plus 1 minute. If by t1, A receives no update
generated by B with a LATER sequence number than n, A will accept
any update from B as being more recent than n. So if two IMPs
are out of communication for a period of time which is long
enough for the sequence numbers to have wrapped around, this
procedure ensures that proper resynchronization of sequence
numbers is effected when communication is re-established.
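The arithmetic behind the wrap-around claim, and the one-minute
resynchronization rule, can be illustrated as follows. This is a
sketch under stated assumptions: should_accept and its (sequence
number, receive time) tuple are hypothetical names, and the rule is
modeled by noting that every accepted update restarts the
one-minute clock, so "no LATER update by t1" is the same as "a
minute has passed since the last accepted update".

```python
SEQ_SPACE = 64                  # 6-bit sequence numbers
HALF = SEQ_SPACE // 2           # 32
MAX_UPDATES_PER_MINUTE = 12     # "no more than a dozen" per minute
RESYNC_TIMEOUT_S = 60           # t1 = t0 plus 1 minute

# Shortest time in which one IMP's sequence numbers can move half way
# around, or all the way around, the sequence space.
half_wrap_s = HALF * 60 // MAX_UPDATES_PER_MINUTE
full_wrap_s = SEQ_SPACE * 60 // MAX_UPDATES_PER_MINUTE
print(half_wrap_s, full_wrap_s)     # 160 320


def is_later(n, m):
    """LATER comparison for 6-bit sequence numbers, conditions (a) and (b)."""
    return (n > m and n - m <= HALF) or (n < m and m - n > HALF)


def should_accept(seq, now_s, last):
    """Decide whether A accepts update `seq` from B.

    `last` is None if A has seen no update from B, otherwise a tuple
    (n, t0): the last accepted sequence number and when it arrived.
    """
    if last is None:
        return True                 # first update from B is always accepted
    n, t0 = last
    if now_s - t0 >= RESYNC_TIMEOUT_S:
        return True                 # a minute of silence: accept any update
    return is_later(seq, n)


print(should_accept(5, 10, (10, 0)))    # False: 5 is not LATER than 10
print(should_accept(5, 61, (10, 0)))    # True: timeout has expired
```

At a dozen updates per minute the sequence numbers need at least
160 seconds to advance by 32, which is far longer than the roughly
100 ms an update takes to reach every IMP, and also longer than the
one-minute timeout after which any update is accepted.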
There is just one more facet of the updating process which
needs to be discussed. Because of the way the line up/down
protocol works, a line cannot be brought up until 60 seconds
after its performance becomes good enough to warrant operational
use. (Roughly speaking, this is the time it takes to determine
that the line's performance is good enough.) During this
60-second period, no data is sent over the line, but routing
updates are transmitted. Remember that every node is required to
generate a routing update at least once per minute. Therefore,
this procedure ensures that if two IMPs are out of communication
because of the failure of some line, each has the most recent