                                                          RFC 789

















    Vulnerabilities of Network Control Protocols: An Example



                          Eric C. Rosen


                  Bolt Beranek and Newman Inc.

RFC 789                              Bolt Beranek and Newman Inc.
                                                    Eric C. Rosen

     This paper has appeared in the January 1981 edition  of  the

SIGSOFT  Software  Engineering Notes, and will soon appear in the

SIGCOMM Computer Communications Review.  It is  being  circulated

as  an  RFC because it is thought that it may be of interest to a

wider audience, particularly to the internet community.  It is  a

case  study  of  a  particular  kind of problem that can arise in

large distributed systems,  and  of  the  approach  used  in  the

ARPANET to deal with one such problem.


     On  October 27, 1980, there was an unusual occurrence on the

ARPANET.  For a period of several hours, the network appeared  to

be  unusable,  due to what was later diagnosed as a high priority

software  process   running   out   of   control.    Network-wide

disturbances  are  extremely  unusual  in  the  ARPANET (none has

occurred in several years), and as a  result,  many  people  have

expressed  interest  in  learning more about the etiology of this

particular incident.  The purpose of this note is to explain what

the symptoms of the problem  were,  what  the  underlying  causes

were,  and  what  lessons  can  be  drawn.   As we shall see, the

immediate cause of the problem was  a  rather  freakish  hardware

malfunction  (which is not likely to recur) that caused a  faulty

sequence of network control packets to be generated.  This faulty

sequence of control packets in turn affected the apportionment of

software resources in the IMPs, causing one of the IMP  processes

to  use  an  excessive  amount  of resources, to the detriment of

other  IMP  processes.   Restoring  the  network  to  operational

condition  was  a  relatively straightforward task.  There was no

damage other than the outage itself,  and  no  residual  problems

once  the  network  was  restored.   Nevertheless,  it  is  quite

interesting to see the way  in  which  unusual  (indeed,  unique)

circumstances  can  bring  out vulnerabilities in network control

protocols, and that shall be the focus of this paper.


     The problem began suddenly when  we  discovered  that,  with

very few exceptions, no IMP was able to communicate reliably with

any other IMP.  Attempts to go from a TIP to a host on some other

IMP   only   brought  forth  the  "net  trouble"  error  message,

indicating that no physical path  existed  between  the  pair  of

IMPs.   Connections  which already existed were summarily broken.

A flood of phone calls to the Network Control Center  (NCC)  from

all  around  the  country  indicated  that  the  problem  was not

localized, but rather seemed to be affecting virtually every IMP.


     As a first step towards trying to find out what the state of

the network actually was, we dialed up a number  of  TIPs  around

the  country.  What we generally found was that the TIPs were up,

but  that  their  lines  were  down.   That  is,  the  TIPs  were

communicating  properly  with the user over the dial-up line, but

no connections to other IMPs were possible.


     We tried manually restarting a number of IMPs which  are  in

our own building (after taking dumps, of course).  This procedure

initializes  all  of  the IMPs' dynamic data structures, and will

often clear up problems which arise when, as sometimes happens in

most complex software systems, the IMPs'  software  gets  into  a

"funny"  state.   The IMPs which were restarted worked well until

they were connected to the rest of  the  net,  after  which  they

exhibited  the same complex of symptoms as the IMPs which had not

been restarted.


     From the facts so far presented, we  were  able  to  draw  a

number  of  conclusions.   Any  problem  which  affects  all IMPs

throughout the network is usually a routing problem.   Restarting

an  IMP  re-initializes  the routing data structures, so the fact

that restarting an IMP did not alleviate the problem in that  IMP

suggested  that  the problem was due to one or more "bad" routing

updates circulating in the network.  IMPs  which  were  restarted

would  just receive the bad updates from those of their neighbors

which were not restarted.  The fact that IMPs  seemed  unable  to

keep  their lines up was also a significant clue as to the nature

of the problem.  Each  pair  of  neighboring  IMPs  runs  a  line

up/down protocol to determine whether the line connecting them is

of  sufficient  quality  to be put into operation.  This protocol

involves the sending of HELLO and I-HEARD-YOU messages.  We  have

noted  in  the  past that under conditions of extremely heavy CPU

utilization, so many buffers can pile up waiting to be served  by

the  bottleneck  CPU process, that the IMPs are unable to acquire

the  buffers  needed  for  receiving  the  HELLO  or  I-HEARD-YOU

messages.  If a condition like this lasts for any length of time,

the  IMPs  may  not be able to run the line up/down protocol, and

lines will be declared down by the IMPs' software.  On the  basis

of  all  these  facts,  our  tentative  conclusion  was that some

malformed update was causing the routing process in the  IMPs  to

use  an excessive amount of CPU time, possibly even to be running

in an infinite loop.  (This would be  quite  a  surprise  though,

since  we  tried very hard to protect ourselves against malformed

updates when we designed the routing process.)  As we shall  see,

this  tentative  conclusion, although on the right track, was not

quite correct, and the actual situation turned  out  to  be  much

more complex.


     When we examined core dumps from several IMPs, we noted that

most,  in  some cases all, of the IMPs' buffers contained routing

updates  waiting  to  be  processed.   Before   describing   this

situation further, it is necessary to explain some of the details

of  the  routing  algorithm's  updating  scheme.   (The following

explanation will of course be very brief and incomplete.  Readers

with a greater  level  of  interest  are  urged  to  consult  the

references.)  Every so often, each IMP generates a routing update

indicating  which  other  IMPs  are  its immediate neighbors over

operational  lines,  and  the  average   per-packet   delay   (in

milliseconds)  over that line.  Every IMP is required to generate

such an update at least once per minute, and no IMP is  permitted

to  generate  more than a dozen such updates over the course of a

minute.  Each  update  has  a  6-bit  sequence  number  which  is

advanced by 1 (modulo 64) for each successive update generated by

a  particular IMP.  If two updates generated by the same IMP have

sequence numbers n and m, update n  is  considered  to  be  LATER

(i.e.,  more recently generated) than update m if and only if one

of the following two conditions holds:



         (a) n > m, and n - m <= 32

         (b) n < m, and m - n > 32


(where the comparisons and subtractions treat n and m as unsigned

6-bit numbers, with  no  modulus).   When  an  IMP  generates  an

update,  it sends a copy of the update to each neighbor.  When an

IMP A receives an update u1 which was generated  by  a  different

IMP  B,  it  first  compares  the  sequence number of u1 with the

sequence number of the last update, u2, that it accepted from  B.

If  this  comparison  indicates  that  u2 is LATER than u1, u1 is

simply discarded.  If, on the other hand, u1 appears  to  be  the

LATER  update, IMP A will send u1 to all its neighbors (including

the one from which it was received).  The sequence number  of  u1

will be retained in A's tables as the LATEST received update from

B.   Of  course,  u1 is always accepted if A has seen no previous

update from B.  Note that this procedure is  designed  to  ensure

that  an  update  generated  by  a  particular  IMP  is received,

unchanged, by all other  IMPs  in  the  network,  IN  THE  PROPER

SEQUENCE.    Each routing update is broadcast (or flooded) to all

IMPs, not just to immediate neighbors of the IMP which  generated

the update (as in some other routing algorithms).  The purpose of

the  sequence numbers is to ensure that all IMPs will agree as to

which update from a given IMP  is  the  most  recently  generated

update from that IMP.
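

     The comparison rule and the accept-or-discard step described
above can be sketched in Python as follows.  This is  only  an
illustration, not the IMP code.  The names is_later,
RoutingUpdate, Imp and receive are hypothetical, and the handling
of an update whose sequence number equals the stored one (treated
here as a duplicate and not forwarded) is an assumption that the
text does not state explicitly.

```python
from dataclasses import dataclass, field

SEQ_SPACE = 64               # 6-bit sequence numbers: 0..63
HALF_SPACE = SEQ_SPACE // 2  # 32


def is_later(n: int, m: int) -> bool:
    """True if update n is LATER than update m, per conditions (a) and (b)."""
    for s in (n, m):
        if not 0 <= s < SEQ_SPACE:
            raise ValueError(f"sequence number out of range: {s}")
    return (n > m and n - m <= HALF_SPACE) or (n < m and m - n > HALF_SPACE)


@dataclass
class RoutingUpdate:
    origin: str   # IMP that generated the update
    seq: int      # 6-bit sequence number
    delays_ms: dict = field(default_factory=dict)  # neighbor -> average delay


@dataclass
class Imp:
    name: str
    neighbors: list = field(default_factory=list)
    latest_seq: dict = field(default_factory=dict)  # origin -> last accepted seq

    def receive(self, update: RoutingUpdate) -> list:
        """Return the neighbors to forward to; an empty list means discarded."""
        last = self.latest_seq.get(update.origin)
        # Assumption: an equal sequence number is a duplicate and is not
        # flooded again. The text only covers the two LATER cases.
        if last is not None and not is_later(update.seq, last):
            return []
        # Accepted: remember the number and flood to all neighbors,
        # including the one the update arrived from.
        self.latest_seq[update.origin] = update.seq
        return list(self.neighbors)
```

In this sketch is_later(0, 63) is true, because the numbering has
wrapped around, while is_later(63, 0) is false.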


     For  reliability,  there  is  a  protocol for retransmitting

updates over individual links.  Let X and Y be neighboring  IMPs,

and let A be a third IMP.  Suppose X receives an update which was

generated by A, and transmits it to Y.  Now if in the next 100 ms

or  so, X does not receive from Y an update which originated at A

and whose sequence number is at least as recent as  that  of  the

update  X  sent  to  Y,  X concludes that its transmission of the

update did not get through to Y, and  that  a  retransmission  is

required.   (This  conclusion is warranted, since an update which

is  received  and  adjudged  to  be  the  most  recent  from  its

originating  IMP is sent to all neighbors, including the one from

which it was received.)  The IMPs do not keep the original update

packets  buffered  pending  retransmission.   Rather,   all   the

information  in  the  update  packet  is  kept in tables, and the

packet  is  re-created  from  the  tables  if  necessary  for   a

retransmission.
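

     A minimal sketch of this per-link retransmission  rule,  as
seen from X for a single neighbor Y, might look like the
following.  The class LinkState and its methods send, on_echo and
due are hypothetical names, time is passed in explicitly as
milliseconds, and "100 ms or so" is taken to be exactly 100 ms
for the purpose of the example.

```python
SEQ_SPACE = 64
HALF_SPACE = SEQ_SPACE // 2
RETRANSMIT_AFTER_MS = 100  # assumption: "100 ms or so" taken as exactly 100


def is_later(n: int, m: int) -> bool:
    """Sequence comparison from conditions (a) and (b)."""
    return (n > m and n - m <= HALF_SPACE) or (n < m and m - n > HALF_SPACE)


class LinkState:
    """X's retransmission state toward one neighbor Y (hypothetical)."""

    def __init__(self):
        self.table = {}    # origin -> (seq, contents); no packet is buffered
        self.pending = {}  # origin -> time (ms) of the last transmission

    def send(self, origin, seq, contents, now_ms):
        """Record the update in the tables and return the packet to send."""
        self.table[origin] = (seq, contents)
        self.pending[origin] = now_ms
        return (origin, seq, contents)

    def on_echo(self, origin, seq):
        """Y sent back an update from origin; it acknowledges ours if it
        is at least as recent as the one we transmitted."""
        if origin in self.pending:
            sent_seq, _ = self.table[origin]
            if seq == sent_seq or is_later(seq, sent_seq):
                del self.pending[origin]

    def due(self, now_ms):
        """Re-create, from the tables, every packet still unacknowledged
        after the timeout, and restart its timer."""
        packets = []
        for origin, sent_at in list(self.pending.items()):
            if now_ms - sent_at >= RETRANSMIT_AFTER_MS:
                seq, contents = self.table[origin]
                packets.append((origin, seq, contents))
                self.pending[origin] = now_ms
        return packets
```

The point of keeping only table entries is visible in due(): the
retransmitted packet is rebuilt from whatever the tables hold at
that moment, rather than from a saved copy of the original packet.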


     This  transmission  protocol  ("flooding")  distributes  the

routing updates  in a  very  rapid  and  reliable  manner.   Once

generated by an IMP, an update will almost always reach all other

IMPs  in  a time period on the order of 100 ms.  Since an IMP can

generate no more than a dozen updates per minute, and  there  are

64  possible sequence numbers, sequence number wrap-around is not

a problem.  There is only one exception  to  this.   Suppose  two

IMPs  A  and  B  are  out  of  communication for a period of time

because there is no physical path between them.  (This may be due

either to a network partition, or to a more  mundane  occurrence,

such  as  one  of  the  IMPs  being down.)  When communication is

re-established, A and B have no way of knowing how long they have

been out of communication, or how many times the other's sequence

numbers may have wrapped around.  Comparing the  sequence  number

of  a newly received update with the sequence number of an update

received before the outage may give an incorrect result.  To deal

with this problem, the following scheme is adopted.   Let  t0  be

the time at which IMP A receives update number n generated by IMP

B.   Let  t1 be t0 plus 1 minute.  If by t1, A receives no update

generated by B with a LATER sequence number than n, A will accept

any update from B as being more recent than n.  So  if  two  IMPs

are  out  of  communication  for  a  period of time which is long

enough for the sequence numbers  to  have  wrapped  around,  this

procedure  ensures  that  proper  resynchronization  of  sequence

numbers is effected when communication is re-established.
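

     As a rough check on the figures above: at the maximum rate of
a dozen updates per minute, the 64 sequence numbers need at least
64/12 minutes (5 minutes 20 seconds) to wrap around completely,
and 32/12 minutes (2 minutes 40 seconds) to advance by half the
sequence space.  The one-minute rule can then be sketched as
below.  The names SeqTracker and accept are hypothetical, time is
passed in explicitly as seconds, and treating the boundary t1
itself as already expired is an assumption.

```python
SEQ_SPACE = 64
HALF_SPACE = SEQ_SPACE // 2
AGE_LIMIT_S = 60  # t1 = t0 plus 1 minute


def is_later(n: int, m: int) -> bool:
    """Sequence comparison from conditions (a) and (b)."""
    return (n > m and n - m <= HALF_SPACE) or (n < m and m - n > HALF_SPACE)


class SeqTracker:
    """Per-origin record of the last accepted update (hypothetical)."""

    def __init__(self):
        self.last = {}  # origin -> (seq n, time t0 at which n was accepted)

    def accept(self, origin, seq, now_s):
        """Return True if the update is accepted as more recent."""
        entry = self.last.get(origin)
        if entry is not None:
            last_seq, t0 = entry
            # Assumption: the stored number counts as expired at t1 exactly.
            expired = now_s - t0 >= AGE_LIMIT_S
            if not expired and not is_later(seq, last_seq):
                return False
        # Accepted: either no previous update, a LATER number, or the
        # stored number is over a minute old, so any number is accepted.
        self.last[origin] = (seq, now_s)
        return True
```

For example, after accepting number 10 from B at time 0, number 9
is rejected at 30 seconds but accepted at 60 seconds, since no
LATER update arrived within the minute.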


     There is just one more facet of the updating  process  which

needs  to  be  discussed.   Because  of  the way the line up/down

protocol works, a line cannot be  brought  up  until  60  seconds

after  its performance becomes good enough to warrant operational

use.  (Roughly speaking, this is the time it takes  to  determine

that  the  line's  performance  is  good  enough.)   During  this

60-second period, no data is sent  over  the  line,  but  routing

updates are transmitted.  Remember that every node is required to

generate  a  routing update at least once per minute.  Therefore,

this procedure ensures that if two IMPs are out of  communication

because  of  the  failure  of some line, each has the most recent
