📄 rfc789.txt
字号:
RFC 789 Vulnerabilities of Network Control Protocols: An Example Eric C. Rosen Bolt Beranek and Newman Inc.RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosen This paper has appeared in the January 1981 edition of theSIGSOFT Software Engineering Notes, and will soon appear in theSIGCOMM Computer Communications Review. It is being circulatedas an RFC because it is thought that it may be of interest to awider audience, particularly to the internet community. It is acase study of a particular kind of problem that can arise inlarge distributed systems, and of the approach used in theARPANET to deal with one such problem. On October 27, 1980, there was an unusual occurrence on theARPANET. For a period of several hours, the network appeared tobe unusable, due to what was later diagnosed as a high prioritysoftware process running out of control. Network-widedisturbances are extremely unusual in the ARPANET (none hasoccurred in several years), and as a result, many people haveexpressed interest in learning more about the etiology of thisparticular incident. The purpose of this note is to explain whatthe symptoms of the problem were, what the underlying causeswere, and what lessons can be drawn. As we shall see, theimmediate cause of the problem was a rather freakish hardwaremalfunction (which is not likely to recur) which caused a faultysequence of network control packets to be generated. This faultysequence of control packets in turn affected the apportionment ofsoftware resources in the IMPs, causing one of the IMP processesto use an excessive amount of resources, to the detriment ofother IMP processes. Restoring the network to operational - 1 -RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosencondition was a relatively straightforward task. There was nodamage other than the outage itself, and no residual problemsonce the network was restored. Nevertheless, it is quiteinteresting to see the way in which unusual (indeed, unique)circumstances can bring out vulnerabilities in network controlprotocols, and that shall be the focus of this paper. The problem began suddenly when we discovered that, withvery few exceptions, no IMP was able to communicate reliably withany other IMP. Attempts to go from a TIP to a host on some otherIMP only brought forth the "net trouble" error message,indicating that no physical path existed between the pair ofIMPs. Connections which already existed were summarily broken.A flood of phone calls to the Network Control Center (NCC) fromall around the country indicated that the problem was notlocalized, but rather seemed to be affecting virtually every IMP. As a first step towards trying to find out what the state ofthe network actually was, we dialed up a number of TIPs aroundthe country. What we generally found was that the TIPs were up,but that their lines were down. That is, the TIPs werecommunicating properly with the user over the dial-up line, butno connections to other IMPs were possible. We tried manually restarting a number of IMPs which are inour own building (after taking dumps, of course). This procedureinitializes all of the IMPs' dynamic data structures, and will - 2 -RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosenoften clear up problems which arise when, as sometimes happens inmost complex software systems, the IMPs' software gets into a"funny" state. The IMPs which were restarted worked well untilthey were connected to the rest of the net, after which theyexhibited the same complex of symptoms as the IMPs which had notbeen restarted. From the facts so far presented, we were able to draw anumber of conclusions. Any problem which affects all IMPsthroughout the network is usually a routing problem. Restartingan IMP re-initializes the routing data structures, so the factthat restarting an IMP did not alleviate the problem in that IMPsuggested that the problem was due to one or more "bad" routingupdates circulating in the network. IMPs which were restartedwould just receive the bad updates from those of their neighborswhich were not restarted. The fact that IMPs seemed unable tokeep their lines up was also a significant clue as to the natureof the problem. Each pair of neighboring IMPs runs a lineup/down protocol to determine whether the line connecting them isof sufficient quality to be put into operation. This protocolinvolves the sending of HELLO and I-HEARD-YOU messages. We havenoted in the past that under conditions of extremely heavy CPUutilization, so many buffers can pile up waiting to be served bythe bottleneck CPU process, that the IMPs are unable to acquirethe buffers needed for receiving the HELLO or I-HEARD-YOUmessages. If a condition like this lasts for any length of time, - 3 -RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosenthe IMPs may not be able to run the line up/down protocol, andlines will be declared down by the IMPs' software. On the basisof all these facts, our tentative conclusion was that somemalformed update was causing the routing process in the IMPs touse an excessive amount of CPU time, possibly even to be runningin an infinite loop. (This would be quite a surprise though,since we tried very hard to protect ourselves against malformedupdates when we designed the routing process.) As we shall see,this tentative conclusion, although on the right track, was notquite correct, and the actual situation turned out to be muchmore complex. When we examined core dumps from several IMPs, we noted thatmost, in some cases all, of the IMPs' buffers contained routingupdates waiting to be processed. Before describing thissituation further, it is necessary to explain some of the detailsof the routing algorithm's updating scheme. (The followingexplanation will of course be very brief and incomplete. Readerswith a greater level of interest are urged to consult thereferences.) Every so often, each IMP generates a routing updateindicating which other IMPs are its immediate neighbors overoperational lines, and the average per-packet delay (inmilliseconds) over that line. Every IMP is required to generatesuch an update at least once per minute, and no IMP is permittedto generate more than a dozen such updates over the course of aminute. Each update has a 6-bit sequence number which is - 4 -RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosenadvanced by 1 (modulo 64) for each successive update generated bya particular IMP. If two updates generated by the same IMP havesequence numbers n and m, update n is considered to be LATER(i.e., more recently generated) than update m if and only if oneof the following two conditions hold: (a) n > m, and n - m <= 32 (b) n < m, and m - n > 32(where the comparisons and subtractions treat n and m as unsigned6-bit numbers, with no modulus). When an IMP generates anupdate, it sends a copy of the update to each neighbor. When anIMP A receives an update u1 which was generated by a differentIMP B, it first compares the sequence number of u1 with thesequence number of the last update, u2, that it accepted from B.If this comparison indicates that u2 is LATER than u1, u1 issimply discarded. If, on the other hand, u1 appears to be theLATER update, IMP A will send u1 to all its neighbors (includingthe one from which it was received). The sequence number of u1will be retained in A's tables as the LATEST received update fromB. Of course, u1 is always accepted if A has seen no previousupdate from B. Note that this procedure is designed to ensurethat an update generated by a particular IMP is received,unchanged, by all other IMPs in the network, IN THE PROPERSEQUENCE. Each routing update is broadcast (or flooded) to allIMPs, not just to immediate neighbors of the IMP which generated - 5 -RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosenthe update (as in some other routing algorithms). The purpose ofthe sequence numbers is to ensure that all IMPs will agree as towhich update from a given IMP is the most recently generatedupdate from that IMP. For reliability, there is a protocol for retransmittingupdates over individual links. Let X and Y be neighboring IMPs,and let A be a third IMP. Suppose X receives an update which wasgenerated by A, and transmits it to Y. Now if in the next 100 msor so, X does not receive from Y an update which originated at Aand whose sequence number is at least as recent as that of theupdate X sent to Y, X concludes that its transmission of theupdate did not get through to Y, and that a retransmission isrequired. (This conclusion is warranted, since an update whichis received and adjudged to be the most recent from itsoriginating IMP is sent to all neighbors, including the one fromwhich it was received.) The IMPs do not keep the original updatepackets buffered pending retransmission. Rather, all theinformation in the update packet is kept in tables, and thepacket is re-created from the tables if necessary for aretransmission. This transmission protocol ("flooding") distributes therouting updates in a very rapid and reliable manner. Oncegenerated by an IMP, an update will almost always reach all otherIMPs in a time period on the order of 100 ms. Since an IMP cangenerate no more than a dozen updates per minute, and there are - 6 -RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosen64 possible sequence numbers, sequence number wrap-around is nota problem. There is only one exception to this. Suppose twoIMPs A and B are out of communication for a period of timebecause there is no physical path between them. (This may be dueeither to a network partition, or to a more mundane occurrence,such as one of the IMPs being down.) When communication isre-established, A and B have no way of knowing how long they havebeen out of communication, or how many times the other's sequencenumbers may have wrapped around. Comparing the sequence numberof a newly received update with the sequence number of an updatereceived before the outage may give an incorrect result. To dealwith this problem, the following scheme is adopted. Let t0 bethe time at which IMP A receives update number n generated by IMPB. Let t1 be t0 plus 1 minute. If by t1, A receives no updategenerated by B with a LATER sequence number than n, A will acceptany update from B as being more recent than n. So if two IMPsare out of communication for a period of time which is longenough for the sequence numbers to have wrapped around, thisprocedure ensures that proper resynchronization of sequencenumbers is effected when communication is re-established. There is just one more facet of the updating process whichneeds to be discussed. Because of the way the line up/downprotocol works, a line cannot be brought up until 60 secondsafter its performance becomes good enough to warrant operationaluse. (Roughly speaking, this is the time it takes to determine - 7 -RFC 789 Bolt Beranek and Newman Inc. Eric C. Rosenthat the line's performance is good enough.) During this60-second period, no data is sent over the line, but routingupdates are transmitted. Remember that every node is required togenerate a routing update at least once per minute. Therefore,this procedure ensures that if two IMPs are out of communicationbecause of the failure of some line, each has the most recent
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -