Network Working Group                                      J.  McQuillan
Request for Comments: 528                                        BBN-NET
NIC: 17164                                                  20 June 1973


        SOFTWARE CHECKSUMMING IN THE IMP AND NETWORK RELIABILITY

   As the ARPA Network has developed over the last few years, and our
   experience with operating the IMP subnetwork has grown, the issue of
   reliability has assumed greater importance and greater complexity.
   This note describes some modifications that have recently been made
   to the IMP and TIP programs in this regard.  These changes are
   mechanically minor and do not affect Host operation at all, but they
   are logically noteworthy, and for this reason we have explained the
   workings of the new IMP and TIP programs in some detail.  Host
   personnel are advised to note particularly the modifications
   described in sections 4 and 5, as they may wish to change their own
   programs or operating procedures.

1. A Changing View of Network Reliability

   Our idea of the Network has evolved as the Network itself has grown.
   Initially, it was thought that the only components in the network
   design that were prone to errors were the communications circuits,
   and the modem interfaces in the IMPs are equipped with a CRC checksum
   to detect "almost all" such errors.  The rest of the system,
   including Host interfaces, IMP processors, memories, and interfaces,
   were all considered to be error-free.  We have had to re-evaluate
   this position in the light of our experience.  In operating the
   network we are faced with the problem of having to perform remote
   diagnosis on failures which cannot easily be classified or
   understood.
   Some examples of such problems include reports from Host
   personnel of lost RFNMs and lost Host-Host protocol allocate
   messages, inexplicable behavior in the IMP of a transient nature,
   and, finally, the problem of crashes -- the total failure of an IMP,
   perhaps affecting adjacent IMPs.  These circumstances are infrequent
   and are therefore difficult to correlate with other failures or with
   particular attempted remedies.  Indeed, it is often impossible to
   distinguish a software failure from a hardware failure.

   In attempting to post-mortem crashes, we have sometimes found the IMP
   program has had instructions incorrect -- sometimes just one or two
   bits picked or dropped.  Clearly, memory errors can account for
   almost any failure, not only program crashes but also data errors
   which can lead to many other syndromes.  For instance, if the address
   of a message is changed in transit, then one Host thinks the message
   was lost, and another Host may receive an extra message.  Errors of
   this kind fall into two general classes: errors in Host messages,
   whether in the control information or the data, and errors in
   inter-IMP messages, primarily routing update messages.  In the course
   of the last few years, it has become increasingly clear that such
   errors were occurring, though it was difficult to speculate as to
   where, why, and how often.

   One of the earliest problems of this kind was discovered in 1971.
   The Harvard IMP was sometimes crashing in an unknown manner so that
   all the other IMPs were affected.  It was finally determined that its
   memory was faulty and sometimes the routing messages read out from
   memory by the modem output interfaces were all zeroes.
   The adjacent IMPs interpreted such an erroneous message as stating
   that the Harvard IMP had zero delay to all destinations -- that it
   was the best route to everywhere!  Once this information propagated
   to the other IMPs, the whole network was in a shambles.  The solution
   to this problem was to generate a software checksum for each routing
   message before it was sent from one IMP, and to check it after it was
   received at the other IMP.  This software checksum, in addition to
   the hardware checksum of the circuit, checks the modem interfaces and
   memories at each IMP, and protects the IMPs from erroneous routing
   information.  The overhead in computing these checksums is not great
   since the messages are only exchanged every 2/3 of a second.

   In the first few months of 1973, we began to have a great deal of
   trouble with the reliability of some IMPs, especially those in the
   Washington area.  The normal procedures of calling in and working
   with Honeywell field engineers had not cleared up several of these
   persistent failures, and it was felt that an escalation of BBN
   involvement was needed to identify the exact causes of the problems.
   Therefore, during much of February and March there were one or more
   members of the staff at various sites in the network where hardware
   problems were suspected.  The first thing we found out was that the
   operational IMP program did not give enough diagnostic information
   about failures when they occurred, and that the available test
   programs did not detect errors frequently enough to justify their
   use.  That is, the errors were appearing at rather low frequency,
   from once every few hours to once every few days, compared to message
   rates of once a second or faster.  Therefore, we decided to try to
   make the operational IMP program run when it could, and report more
   information about detected hardware errors, rather than keep the
   failing IMPs off the network for days at a time.

   Modifications to the IMP program had two independent goals: we wanted
   to make the software less vulnerable to hardware failures, and we
   wanted the software to isolate the failures and report them to the
   NCC.  The technique we chose to use was generating a software
   checksum on all packets as they are sent out over a line.  We
   suspected that the hardware failures in the Washington area were
   happening between IMPs, that is, the packets were correct before they
   were sent.  Thus, a memory-to-memory software checksum, similar to
   the technique installed two years before for routing messages only,
   should be able to detect these errors.  On March 13, a new version of
   the IMP program was released with software checksum code.  In this
   program, when a packet is found to have an incorrect checksum it is
   discarded, and a copy of the data is sent to the NCC.  The previous
   IMP retransmits the packet, since an acknowledgment is not returned.

   A partial list of the hardware problems that were uncovered by
   software checksums, and subsequently fixed, includes:

      *  One modem interface at the Aberdeen IMP dropped several bits
         from several successive words in transferring data into memory.

      *  One modem interface at the Belvoir IMP picked one or two bits
         in a single word in transferring data into memory.

      *  One modem interface at the ETAC TIP dropped the first word in
         transferring data out of memory.

      *  A region in memory at the Utah IMP changed the low order two
         bits in some words on an irregular basis.

   Each of these problems resulted in two or three detected errors per
   day.  There were other problems that were not detected by the
   software checksum, such as dropped interrupts.
   This set of problems may be explained by the electronics of the
   high-speed DMC on 316 IMPs.  The first three machines cited above are
   316 IMPs with 3 modem interfaces, and they are the only such machines
   in the network.  The third interface is in a separate drawer and the
   total bus length seems to be too long for the driving electronics in
   the original design.  We are presently investigating various ways to
   fix these problems, and have had some success already.

2. An End-to-End Software Checksum on Packets

   This last experience, and the earlier checksum on routing messages,
   proved the value of a software checksum on all inter-IMP
   transmissions.  We have decided to extend the checksum to detect
   intra-IMP failures as well, and make software checksums on all
   network transmissions a permanent feature of the IMP system.  We can
   obtain an end-to-end software checksum on packets, without any time
   gaps, as follows:

          +--------+        +--------+        +---------+
          |  IMP  2|--------|3 IMP  4|--------|5  IMP   |
          |   1    |        |        |        |    6    |
          +---|----+        +--------+        +----|----+
              |                                    |
          +---|----+                          +----|----+
          |        |                          |         |
          |  Host  |                          |  Host   |
          +--------+                          +---------+

      *  A checksum is computed at the source IMP for each packet as it
         is received from the source Host. (interface 1)

      *  The checksum is verified at each intermediate IMP as it is
         received over the circuit from the previous IMP.
         (interfaces 3 and 5)

      *  If the checksum is in error, the packet is discarded, and the
         previous IMP retransmits the packet when it does not receive an
         acknowledgment. (interfaces 2 and 4)

      *  The previous IMP does not verify the checksum before the
         original transmission, to cut the number of checks in half.
         But when it must retransmit a packet it does verify the
         checksum.  If it finds an error, it has detected an intra-IMP
         failure, and the packet is lost.  If not, then the first
         transmission was lost due to an inter-IMP failure, a circuit
         error, or was simply refused by the adjacent IMP.  The previous
         IMP holds a good copy of the packet, which it then retransmits.
         (interfaces 2 and 4)

      *  After the packet has successfully traversed several
         intermediate IMPs, it arrives at the destination IMP.  The
         checksum is verified just before the packet is sent to the
         Host. (interface 6)

   This technique provides a checksum from the source IMP to the
   destination IMP on each packet, with no gaps in time when the packet
   is unchecked.  Any errors are reported to the NCC in full, with a
   copy of the packet in question.  This method answers both
   requirements stated above: it makes the IMPs more reliable and
   fault-tolerant, and it provides a maximum of diagnostic information
   for use in fault isolation.  This expanded checksum logic was
   installed in the network on June 19.

   One of the major questions about such approaches is their efficiency.
   We have been able to include the software checksum on all packets
   without greatly increasing the processing overhead in the IMP.  The
   method described above involves one checksum calculation at each IMP
   through which a packet travels.
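
   The per-hop procedure in the list above can be sketched roughly as
   follows.  This is an illustration only, not the IMP code: the Packet
   class and all function names are hypothetical, and the 16-bit word
   size with carries dropped is an assumption.  The checksum itself
   follows the add checksum of this note (a simple sum of the words
   plus the number of words).

```python
from dataclasses import dataclass
from typing import List

WORD_MASK = 0xFFFF  # assumption: 16-bit words, carries out of bit 15 dropped


def add_checksum(words: List[int]) -> int:
    """Simple sum of the words plus the word count (truncation is assumed)."""
    return (sum(words) + len(words)) & WORD_MASK


@dataclass
class Packet:  # hypothetical container, not the IMP's real packet layout
    words: List[int]
    checksum: int


def accept_from_host(words: List[int]) -> Packet:
    """Interface 1: the source IMP computes the checksum once."""
    return Packet(list(words), add_checksum(words))


def checksum_ok(packet: Packet) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    return add_checksum(packet.words) == packet.checksum


def receive_over_circuit(packet: Packet) -> bool:
    """Interfaces 3 and 5: True means acknowledge, False means discard."""
    return checksum_ok(packet)


def on_missing_ack(stored: Packet) -> str:
    """Interfaces 2 and 4: the stored copy is checked only before a retransmission."""
    if checksum_ok(stored):
        return "retransmit"  # inter-IMP failure, circuit error, or refusal
    return "lost"            # intra-IMP failure: the stored copy itself is bad


def deliver_to_host(packet: Packet) -> bool:
    """Interface 6: final check just before the packet goes to the Host."""
    return checksum_ok(packet)


# One bit dropped between IMPs: the receiver discards, the sender's copy is good.
good = accept_from_host([0x1234, 0x00FF, 0x0007])
damaged = Packet([0x1234, 0x00FE, 0x0007], good.checksum)
print(receive_over_circuit(damaged))  # False
print(on_missing_ack(good))           # retransmit
```

   Because the word count is part of the sum, a packet that loses or
   gains a word of zero no longer verifies: in this sketch
   add_checksum([5, 0]) is 7 while add_checksum([5]) is 6.
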
   We developed a very fast checksum technique, which takes only 2 usec
   per word.  The program computes the number of words in a packet and
   then jumps to the appropriate entry in a chain of add instructions.
   This produces a simple sum of the words in the packet, to which the
   number of words in the packet is added to detect missing or extra
   words of zero.  With the inclusion of this code, the effective
   processor bandwidth of a 516 IMP is reduced by one-eighth for
   full-length store-and-forward packets, from a megabit per second to
   875 kilobits per second.  That is, the IMP now has the processing
   capability to connect to 17 full duplex 50 kilobit per second lines,
   as compared to 20 such lines without the checksum program.  We are
   aware that this add checksum is not a very good one in terms of its
   error-detecting capabilities, but it is as much as the IMP can afford
   to do in software.  Furthermore, we emphasize that the primary goal
   of this modification is to assist in the remote diagnosis of
   intermittent hardware failures.

3. Checksumming to Improve the Reliability of Routing

   We mentioned earlier the catastrophic effects that follow for the
   Network as a whole when a single IMP begins to propagate incorrect
   routing information.  The experience described above involved a
   specific memory failure which has not recurred in the last two years,
