update from the other by the time communication is re-established. This very short introduction to the routing algorithm's updating protocol should provide enough background to enable the reader to understand the particular problem under discussion; further justification and detail can be found in the references.

Let us return now to the discussion of the network outage. I have already mentioned that the core dumps showed almost all buffers holding routing updates which were waiting to be processed. Close inspection showed that all the updates were from a single IMP, IMP 50. By a strange "coincidence," IMP 50 had been malfunctioning just before the network-wide outage occurred, and was off the net during the period of the outage. Hence it was not generating any updates during the period of the outage. In addition, IMP 29, an immediate neighbor of IMP 50, was also suffering hardware malfunctions (in particular, dropping bits), but was up (though somewhat flaky) while the network was in bad shape. Furthermore, the malfunction in IMP 50 had to do with its ability to communicate properly with the neighboring IMP

RFC 789        Bolt Beranek and Newman Inc.        Eric C. Rosen

29. Although we did not yet understand how it was possible for so many updates from one IMP to be extant simultaneously, we did understand enough to be able to get the network to recover. All that was necessary was to patch the IMPs to disregard any updates from IMP 50, which after all was down anyway. When the network is operating normally, broadcasting a patch to all IMPs can be done in a matter of minutes. With the network operating as it was during the period of the outage, this can take as much as 3 or 4 hours. (Remember that the IMPs are generally unmanned, and that the only way of controlling them from the NCC is via the network itself. This is perfectly satisfactory when an outage affects only a small group of IMPs, but is an obvious problem when the outage has network-wide effects.)
This procedure was fully successful in bringing the network back up.

When we looked closely at the dumps, we saw that not only were all the updates on the queue from IMP 50, but they all had one of three sequence numbers (either 8, 40, or 44), and were ordered in the queue as follows:

        8, 40, 44, 8, 40, 44, 8, 40, 44, ...

Note that by the definition of LATER, 44 is LATER than 40 (44 > 40 and 44 - 40 <= 32), 40 is LATER than 8 (40 > 8 and 40 - 8 <= 32), and 8 is LATER than 44 (8 < 44 and 44 - 8 > 32). Given the presence of three updates from the same IMP with these three sequence numbers, this is what would be expected. Since each update is LATER than one of the others, a cycle is formed which keeps the three updates floating around the network indefinitely. Thus the IMPs spend most of their CPU time and buffer space in processing these updates. The problem was to figure out how these three updates could possibly have existed at the same time. After all, getting from update 8 to update 40 should require 2 or 3 full minutes, plus 31 intervening sequence numbers. So how could 8 still be around when 40 was generated, especially since no updates with intervening sequence numbers were present?

Our first thought was that maybe the real-time clock in IMP 50 was running one or two orders of magnitude faster than normal, invalidating our assumptions about the maximum number of updates which could be generated in a given time. An alternative hypothesis suggested itself however when we looked at the binary representations of the three sequence numbers:

         8 - 001000
        40 - 101000
        44 - 101100

Note that 44 has only one more bit than 40, which has only one more bit than 8. Furthermore, the three different updates were completely identical, except for their sequence numbers. This suggests that there was really only one update, 44, whose sequence number was twice corrupted by dropped bits.
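The cycle among the three sequence numbers can be checked mechanically. The sketch below is illustrative, not IMP code; the function name `later` and the 6-bit sequence space it assumes are taken from the definition quoted above.

```python
# The updating protocol's LATER relation over a 6-bit sequence-number
# space (0..63): b is LATER than a if either
#   (a) b > a and b - a <= 32, or
#   (b) b < a and a - b > 32.
def later(b: int, a: int) -> bool:
    return (b > a and b - a <= 32) or (b < a and a - b > 32)

# The three corrupted updates form a cycle under LATER, so each one
# keeps displacing the "oldest" of the others indefinitely:
assert later(44, 40)   # 44 > 40 and 44 - 40 <= 32
assert later(40, 8)    # 40 > 8  and 40 - 8 <= 32 (exactly 32)
assert later(8, 44)    # 8 < 44  and 44 - 8 > 32

# Consistent with the dropped-bit hypothesis: each corrupted number
# is a bitwise subset of 44 (dropping bits can only clear them).
assert 44 & 40 == 40 and 40 & 8 == 8
```

Note that the cycle depends on 40 - 8 being exactly 32, the boundary case of clause (a); this is the observation behind the fix discussed later in this memo.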
(Of course, it's also possible that the "real" update was 8, and was corrupted by added bits. However, bit-dropping has proven itself to be a much more common sort of hardware malfunction than bit-adding, although spontaneously dropped bits may sometimes come back on spontaneously.)

Surely, the reader will object, there must be protection against dropped bits. Yes there is protection, but apparently not enough. The update packets themselves are checksummed, so a dropped bit in an update packet is readily detected. Remember though that if an update needs to be retransmitted, it is recreated from tabled information. For maximal reliability, the tables must be checksummed also, and the checksum must be recomputed every time the table is accessed. However, this would require either a large number of CPU cycles (for frequent checksumming of a large area of memory) or a large amount of memory (to store the checksums for a lot of small areas). Since CPU cycles and memory are both potentially scarce resources, this did not seem to us to be a cost-effective way to deal with problems that arise, say, once per year (this is the first such problem encountered in a year and a half of running this routing algorithm). Time and space can be saved by recomputing the checksum at a somewhat slower frequency, but this is less reliable, in that it allows a certain number of dropped bits to "fall between the cracks." It seems likely then that one of the malfunctioning IMPs had to retransmit update 44 at least twice (recreating it each time from tabled information), retransmitting it at least once with the corrupted sequence number 40, and at least once with the corrupted sequence number 8. This would cause those three sequence numbers to be extant in the network simultaneously, even though the protocol is supposed to ensure that this is impossible.
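The tradeoff described above — checksum the tabled information, and verify it before each rebuild of a retransmitted update — can be sketched as follows. This is not BBN code; the checksum function and table layout are hypothetical, and a real system might use a CRC rather than a simple sum.

```python
# Illustrative sketch: checksum the table from which retransmitted
# updates are rebuilt, and verify before each rebuild.  If verification
# runs less often than table accesses, a dropped bit in the interval
# still "falls between the cracks."
def table_checksum(table: bytes) -> int:
    # Simple 16-bit sum; chosen for brevity, not for strength.
    return sum(table) & 0xFFFF

table = bytearray([44, 0, 7, 12])      # hypothetical tabled update fields
good = table_checksum(table)

table[0] &= ~0b000100                  # hardware drops one bit: 44 -> 40
assert table_checksum(table) != good   # recomputation catches the corruption
```

The cost the memo weighs is exactly the `table_checksum` call: running it on every access burns CPU cycles over a large memory area, while splitting the table into many small separately-checksummed areas burns memory instead.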
Actually, the detection of dropped bits is most properly a hardware function. The next generation of IMP hardware (the "C30 IMP") will be able to detect and correct all single-bit errors, and will detect all other bit errors. Uncorrectable bit errors will cause the IMP to go into its "loader/dumper." (An IMP in its loader/dumper is not usable for transferring data, and is officially in the "down" state. However, an IMP in its loader/dumper is easily controllable from the NCC, and can be restarted or reloaded without on-site intervention.) Current hardware does have parity checking (which should detect single dropped bits), but this feature has had to be turned off since (a) there are too many spurious parity "errors," i.e., most of the time when the machines complain of parity errors there don't really seem to be any, and (b) parity errors cause the machines to simply halt, rather than go into their loader/dumpers, which means that on-site intervention is required to restart them.

Pending the introduction of improved hardware, what can be done to prevent problems like this from recurring in the future? It is easy to think of many ways of avoiding this particular problem, especially if one does not consider the problems that may arise from the "fixes." For example, we might be able to avoid this sort of problem by spending a lot more CPU cycles on checksumming, but this may be too expensive because of the side effects it would introduce. (Also, it is not clear that any memory checksumming strategy can be totally free of "cracks.") A very simple and conservative fix to prevent this particular problem from recurring is to modify clause (a) of the definition of LATER so that the "<=" is replaced by "<" (strictly less than). We will implement this fix, but it cannot be guaranteed that no related problems will ever arise.

What is really needed is not some particular fix to the routing algorithm, but a more general fix.
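The effect of replacing "<=" with "<" in clause (a) can be verified directly against the three corrupted sequence numbers. The sketch below is illustrative; `later_fixed` is my name for the modified relation, not the IMP code's.

```python
# LATER with the proposed fix: clause (a) uses strict "<" instead of "<=".
def later_fixed(b: int, a: int) -> bool:
    return (b > a and b - a < 32) or (b < a and a - b > 32)

# 40 - 8 is exactly 32, so with strict "<" neither 40 nor 8 is LATER
# than the other, and the three-update cycle can no longer form:
assert not later_fixed(40, 8)
assert not later_fixed(8, 40)

# The other two comparisons are unaffected by the change:
assert later_fixed(44, 40)
assert later_fixed(8, 44)
```

The fix is conservative precisely because it only shrinks the relation: two numbers exactly half the sequence space apart become incomparable, so no sequence of pairwise-LATER updates can ever wrap all the way around into a cycle.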
In some sense, the problem we saw was not really a routing problem. The routing code was working correctly, and the routes that were generated were correct and consistent. The real problem is that a freakish hardware malfunction caused a high priority process to run wild, devouring resources needed by other processes, thereby making the network unusable. The fact that the wild process was the routing process is incidental. In designing the routing process, we carefully considered the amount of resource utilization it would require. By strictly controlling and limiting the rate at which updates can be generated, we tried to prevent any situation in which the routing process would make excessive demands on the system. As we have seen though, even our carefully designed mechanisms were unable to protect against every possible sort of hardware failure. We need a better means of detecting that some high priority process in the IMP, despite all the safeguards we have built in, is still consuming too many resources. Once this is detected, the IMP can be automatically placed in its loader/dumper. In the case under discussion, we would have liked to have all the IMPs go into their loader/dumpers when the problem arose. This would have enabled us to re-initialize and restart all the IMPs much more quickly. (Although restarting individual IMPs did little good, restarting all the IMPs simultaneously would have cleared up the problem instantly, since all routing tables in all IMPs would have been initialized simultaneously.) It took us no more than an hour to figure out how to restore the network; several additional hours were required because it took so long for us to gain control of the misbehaving IMPs and get them back to normal. A built-in software alarm system (assuming, of course, that it was not subject to false alarms) might have enabled us to restore the network more quickly, significantly reducing the duration of the outage.
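One possible shape for the software alarm suggested above is sketched below. Everything here is hypothetical — the class, the budget, and the sampling scheme are mine, not a description of IMP software; the point is only that requiring sustained over-budget readings before tripping guards against the false alarms the memo worries about.

```python
# Hypothetical sketch of a resource alarm: trip only if a high-priority
# process stays over its resource budget for `patience` consecutive
# checks, then signal that the node should drop into its loader/dumper.
class ResourceAlarm:
    def __init__(self, budget: int, patience: int = 3):
        self.budget = budget      # e.g., buffers the process may hold
        self.patience = patience  # consecutive violations before tripping
        self.over = 0

    def check(self, usage: int) -> bool:
        # A single spike resets nothing permanent; only sustained
        # overuse trips the alarm (reducing false alarms).
        self.over = self.over + 1 if usage > self.budget else 0
        return self.over >= self.patience   # True => go to loader/dumper

alarm = ResourceAlarm(budget=100, patience=3)
samples = [40, 250, 300, 280, 270]          # buffers held at each check
tripped = [alarm.check(s) for s in samples]
assert tripped == [False, False, False, True, True]
```

Had every IMP carried something like this, all of them would have dropped into their loader/dumpers as the update queue grew, leaving the network controllable from the NCC instead of requiring hours to regain control.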
This is not to say that a better alarm and control system could ever be a replacement for careful study and design which attempts to properly distribute the utilization of important resources, but only that it is a necessary adjunct, to handle the cases that will inevitably fall between the cracks of even the most careful design.

                              REFERENCES

J.M. McQuillan, I. Richer, and E.C. Rosen, "The New Routing Algorithm for the ARPANET," IEEE Transactions on Communications, May 1980.

E.C. Rosen, "The Updating Protocol of ARPANET's New Routing Algorithm," Computer Networks, February 1980.