update from the other by the time communication is re-established. This very short introduction to the routing algorithm's updating protocol should provide enough background to enable the reader to understand the particular problem under discussion; further justification and detail can be found in the references.

Let us return now to the discussion of the network outage. I have already mentioned that the core dumps showed almost all buffers holding routing updates which were waiting to be processed. Close inspection showed that all the updates were from a single IMP, IMP 50. By a strange "coincidence," IMP 50 had been malfunctioning just before the network-wide outage occurred, and was off the net during the period of the outage. Hence it was not generating any updates during the period of the outage. In addition, IMP 29, an immediate neighbor of IMP 50, was also suffering hardware malfunctions (in particular, dropping bits), but was up (though somewhat flaky) while the network was in bad shape. Furthermore, the malfunction in IMP 50 had to do with its ability to communicate properly with the neighboring IMP

RFC 789        Bolt Beranek and Newman Inc.        Eric C. Rosen

29. Although we did not yet understand how it was possible for so many updates from one IMP to be extant simultaneously, we did understand enough to be able to get the network to recover. All that was necessary was to patch the IMPs to disregard any updates from IMP 50, which after all was down anyway. When the network is operating normally, broadcasting a patch to all IMPs can be done in a matter of minutes. With the network operating as it was during the period of the outage, this can take as much as 3 or 4 hours. (Remember that the IMPs are generally unmanned, and that the only way of controlling them from the NCC is via the network itself. This is perfectly satisfactory when an outage affects only a small group of IMPs, but is an obvious problem when the outage has network-wide effects.)
This procedure was fully successful in bringing the network back up.

When we looked closely at the dumps, we saw that not only were all the updates on the queue from IMP 50, but they all had one of three sequence numbers (either 8, 40, or 44), and were ordered in the queue as follows:

        8, 40, 44, 8, 40, 44, 8, 40, 44, ...

Note that by the definition of LATER, 44 is LATER than 40 (44 > 40 and 44 - 40 <= 32), 40 is LATER than 8 (40 > 8 and 40 - 8 <= 32), and 8 is LATER than 44 (8 < 44 and 44 - 8 > 32). Given the presence of three updates from the same IMP with these three sequence numbers, this is what would be expected. Since each update is LATER than one of the others, a cycle is formed which keeps the three updates floating around the network indefinitely. Thus the IMPs spend most of their CPU time and buffer space in processing these updates. The problem was to figure out how these three updates could possibly have existed at the same time. After all, getting from update 8 to update 40 should require 2 or 3 full minutes, plus 31 intervening sequence numbers. So how could 8 still be around when 40 was generated, especially since no updates with intervening sequence numbers were present?

Our first thought was that maybe the real-time clock in IMP 50 was running one or two orders of magnitude faster than normal, invalidating our assumptions about the maximum number of updates which could be generated in a given time. An alternative hypothesis suggested itself however when we looked at the binary representations of the three sequence numbers:

         8 - 001000
        40 - 101000
        44 - 101100

Note that 44 has only one more bit than 40, which has only one more bit than 8. Furthermore, the three different updates were completely identical, except for their sequence numbers. This suggests that there was really only one update, 44, whose sequence number was twice corrupted by dropped bits.
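The cycle among the three sequence numbers can be checked mechanically. The sketch below is illustrative, not IMP code; the function name `later` and the 6-bit sequence space it assumes are taken from the definition quoted above.

```python
# The updating protocol's LATER relation over a 6-bit sequence-number
# space (0..63): b is LATER than a if either
#   (a) b > a and b - a <= 32, or
#   (b) b < a and a - b > 32.
def later(b: int, a: int) -> bool:
    return (b > a and b - a <= 32) or (b < a and a - b > 32)

# The three corrupted updates form a cycle under LATER, so each one
# keeps displacing the "oldest" of the others indefinitely:
assert later(44, 40)   # 44 > 40 and 44 - 40 <= 32
assert later(40, 8)    # 40 > 8  and 40 - 8 <= 32 (exactly 32)
assert later(8, 44)    # 8 < 44  and 44 - 8 > 32

# Consistent with the dropped-bit hypothesis: each corrupted number
# is a bitwise subset of 44 (dropping bits can only clear them).
assert 44 & 40 == 40 and 40 & 8 == 8
```

Note that the cycle depends on 40 - 8 being exactly 32, the boundary case of clause (a); this is the observation behind the fix discussed later in this memo.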
(Of course, it's also possible that the "real" update was 8, and was corrupted by added bits. However, bit-dropping has proven itself to be a much more common sort of hardware malfunction than bit-adding, although spontaneously dropped bits may sometimes come back on spontaneously.)

Surely, the reader will object, there must be protection against dropped bits. Yes there is protection, but apparently not enough. The update packets themselves are checksummed, so a dropped bit in an update packet is readily detected. Remember though that if an update needs to be retransmitted, it is recreated from tabled information. For maximal reliability, the tables must be checksummed also, and the checksum must be recomputed every time the table is accessed. However, this would require either a large number of CPU cycles (for frequent checksumming of a large area of memory) or a large amount of memory (to store the checksums for a lot of small areas). Since CPU cycles and memory are both potentially scarce resources, this did not seem to us to be a cost-effective way to deal with problems that arise, say, once per year (this is the first such problem encountered in a year and a half of running this routing algorithm). Time and space can be saved by recomputing the checksum at a somewhat slower frequency, but this is less reliable, in that it allows a certain number of dropped bits to "fall between the cracks." It seems likely then that one of the malfunctioning IMPs had to retransmit update 44 at least twice (recreating it each time from tabled information), retransmitting it at least once with the corrupted sequence number 40, and at least once with the corrupted sequence number 8. This would cause those three sequence numbers to be extant in the network simultaneously, even though the protocol is supposed to ensure that this is impossible.
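The tradeoff described above — checksum the tabled information, and verify it before each rebuild of a retransmitted update — can be sketched as follows. This is not BBN code; the checksum function and table layout are hypothetical, and a real system might use a CRC rather than a simple sum.

```python
# Illustrative sketch: checksum the table from which retransmitted
# updates are rebuilt, and verify before each rebuild.  If verification
# runs less often than table accesses, a dropped bit in the interval
# still "falls between the cracks."
def table_checksum(table: bytes) -> int:
    # Simple 16-bit sum; chosen for brevity, not for strength.
    return sum(table) & 0xFFFF

table = bytearray([44, 0, 7, 12])      # hypothetical tabled update fields
good = table_checksum(table)

table[0] &= ~0b000100                  # hardware drops one bit: 44 -> 40
assert table_checksum(table) != good   # recomputation catches the corruption
```

The cost the memo weighs is exactly the `table_checksum` call: running it on every access burns CPU cycles over a large memory area, while splitting the table into many small separately-checksummed areas burns memory instead.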
Actually, the detection of dropped bits is most properly a hardware function. The next generation of IMP hardware (the "C30 IMP") will be able to detect and correct all single-bit errors, and will detect all other bit errors. Uncorrectable bit errors will cause the IMP to go into its "loader/dumper." (An IMP in its loader/dumper is not usable for transferring data, and is officially in the "down" state. However, an IMP in its loader/dumper is easily controllable from the NCC, and can be restarted or reloaded without on-site intervention.) Current hardware does have parity checking (which should detect single dropped bits), but this feature has had to be turned off since (a) there are too many spurious parity "errors," i.e., most of the time when the machines complain of parity errors there don't really seem to be any, and (b) parity errors cause the machines to simply halt, rather than go into their loader/dumpers, which means that on-site intervention is required to restart them.

Pending the introduction of improved hardware, what can be done to prevent problems like this from recurring in the future? It is easy to think of many ways of avoiding this particular problem, especially if one does not consider the problems that may arise from the "fixes." For example, we might be able to avoid this sort of problem by spending a lot more CPU cycles on checksumming, but this may be too expensive because of the side effects it would introduce. (Also, it is not clear that any memory checksumming strategy can be totally free of "cracks.") A very simple and conservative fix to prevent this particular problem from recurring is to modify clause (a) of the definition of LATER so that the "<=" is replaced by "<" (strictly less than). We will implement this fix, but it cannot be guaranteed that no related problems will ever arise.

What is really needed is not some particular fix to the routing algorithm, but a more general fix.
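The effect of replacing "<=" with "<" in clause (a) can be verified directly against the three corrupted sequence numbers. The sketch below is illustrative; `later_fixed` is my name for the modified relation, not the IMP code's.

```python
# LATER with the proposed fix: clause (a) uses strict "<" instead of "<=".
def later_fixed(b: int, a: int) -> bool:
    return (b > a and b - a < 32) or (b < a and a - b > 32)

# 40 - 8 is exactly 32, so with strict "<" neither 40 nor 8 is LATER
# than the other, and the three-update cycle can no longer form:
assert not later_fixed(40, 8)
assert not later_fixed(8, 40)

# The other two comparisons are unaffected by the change:
assert later_fixed(44, 40)
assert later_fixed(8, 44)
```

The fix is conservative precisely because it only shrinks the relation: two numbers exactly half the sequence space apart become incomparable, so no sequence of pairwise-LATER updates can ever wrap all the way around into a cycle.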
In some sense, the problem we saw was not really a routing problem. The routing code was working correctly, and the routes that were generated were correct and consistent. The real problem is that a freakish hardware malfunction caused a high priority process to run wild, devouring resources needed by other processes, thereby making the network unusable. The fact that the wild process was the routing process is incidental. In designing the routing process, we carefully considered the amount of resource utilization it would require. By strictly controlling and limiting the rate at which updates can be generated, we tried to prevent any situation in which the routing process would make excessive demands on the system. As we have seen though, even our carefully designed mechanisms were unable to protect against every possible sort of hardware failure. We need a better means of detecting that some high priority process in the IMP, despite all the safeguards we have built in, is still consuming too many resources. Once this is detected, the IMP can be automatically placed in its loader/dumper. In the case under discussion, we would have liked to have all the IMPs go into their loader/dumpers when the problem arose. This would have enabled us to re-initialize and restart all the IMPs much more quickly. (Although restarting individual IMPs did little good, restarting all the IMPs simultaneously would have cleared up the problem instantly, since all routing tables in all IMPs would have been initialized simultaneously.) It took us no more than an hour to figure out how to restore the network; several additional hours were required because it took so long for us to gain control of the misbehaving IMPs and get them back to normal. A built-in software alarm system (assuming, of course, that it was not subject to false alarms) might have enabled us to restore the network more quickly, significantly reducing the duration of the outage.
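One possible shape for the software alarm suggested above is sketched below. Everything here is hypothetical — the class, the budget, and the sampling scheme are mine, not a description of IMP software; the point is only that requiring sustained over-budget readings before tripping guards against the false alarms the memo worries about.

```python
# Hypothetical sketch of a resource alarm: trip only if a high-priority
# process stays over its resource budget for `patience` consecutive
# checks, then signal that the node should drop into its loader/dumper.
class ResourceAlarm:
    def __init__(self, budget: int, patience: int = 3):
        self.budget = budget      # e.g., buffers the process may hold
        self.patience = patience  # consecutive violations before tripping
        self.over = 0

    def check(self, usage: int) -> bool:
        # A single spike resets nothing permanent; only sustained
        # overuse trips the alarm (reducing false alarms).
        self.over = self.over + 1 if usage > self.budget else 0
        return self.over >= self.patience   # True => go to loader/dumper

alarm = ResourceAlarm(budget=100, patience=3)
samples = [40, 250, 300, 280, 270]          # buffers held at each check
tripped = [alarm.check(s) for s in samples]
assert tripped == [False, False, False, True, True]
```

Had every IMP carried something like this, all of them would have dropped into their loader/dumpers as the update queue grew, leaving the network controllable from the NCC instead of requiring hours to regain control.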
This is not to say that a better alarm and control system could ever be a replacement for careful study and design which attempts to properly distribute the utilization of important resources, but only that it is a necessary adjunct, to handle the cases that will inevitably fall between the cracks of even the most careful design.

                              REFERENCES

J.M. McQuillan, I. Richer, and E.C. Rosen, "The New Routing Algorithm for the ARPANET," IEEE Transactions on Communications, May 1980.

E.C. Rosen, "The Updating Protocol of ARPANET's New Routing Algorithm," Computer Networks, February 1980.