but the problem is easily understood to be of a general nature. In
fact, we recently had another network-wide failure that was traced to
a hardware error that resulted in erroneous routing messages, after
we had installed a software checksum on all inter-IMP transmissions.
The problem we had was due to a single broken instruction in the
part of the IMP program that builds the routing message. As a
result, the routing messages from that IMP were random data, and the
neighboring IMPs interpreted these messages as routing update
information. When this happened, traffic flow through the Network
was completely disrupted and no useful work could be done until the
failed IMP was halted.
This kind of problem, the introduction of incorrect routing
information into the Network, can happen in three ways:
* The routing message is changed in transmission. The inter-IMP
checksum should catch this. The bad routing messages we saw in
the Network had good checksums.
* The routing message is changed as it is constructed, say by a
memory or processor failure, or before it is transmitted. This
is what we termed above an intra-IMP failure.
McQuillan [Page 5]
RFC 528 SOFTWARE CHECKSUMMING IN THE IMP 20 June 1973
* The routing program is incorrect for hardware or software
reasons.
We have attempted to solve the last two kinds of problems by
extending the concept of software checksums. The routing program has
been modified to build a software checksum for the routing message as
it builds the message, just as if it came from a Host. It is
important that this checksum refer to the intended contents of the
routing message, not the actual contents. That is, the program which
generates the routing message builds its own software checksum as it
proceeds, not by reading what has been stored in the routing message
area, but by adding up the intended contents for each entry as it
computes them. The process which sends out routing messages then
always verifies the checksum before transmitting them. This scheme
should detect all intra-IMP failures.
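
The distinction between intended and actual contents can be sketched
as follows. This is only an illustration, not the IMP code: the
16-bit additive checksum and all names (add16, build_routing_message,
verify_before_send, store) are assumptions made for the example.

```python
# Illustrative sketch only; the checksum algorithm and all names are assumed.
MASK = 0xFFFF


def add16(total, word):
    """Add one word into a running 16-bit additive checksum."""
    return (total + word) & MASK


def build_routing_message(entries, store):
    """Build the message and, separately, sum the values the program intended.

    `store` stands in for the write into the routing message area, which
    may be faulty. Returns (message_area, intended_checksum).
    """
    area = []
    checksum = 0
    for value in entries:
        checksum = add16(checksum, value)  # sum the intended value
        area.append(store(value))          # what actually lands in memory
    return area, checksum


def verify_before_send(area, checksum):
    """Recompute the checksum from the stored contents and compare."""
    actual = 0
    for word in area:
        actual = add16(actual, word)
    return actual == checksum


# A healthy store, and one that sets a stray bit in every word written.
good_area, good_sum = build_routing_message([3, 7, 1], lambda v: v)
bad_area, bad_sum = build_routing_message([3, 7, 1], lambda v: v | 0x0100)
```

Because the checksum is accumulated from the computed values rather
than read back from the message area, the corrupted message fails
verify_before_send and would not be transmitted. Had the checksum
been computed from the stored words, it would simply have matched the
damaged data.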
Finally, the routing program itself can be checksummed to detect any
changes in the code. The programs which copy in received routing
messages, compute new routing tables, and send out routing messages
each calculate the checksum of the code before executing it. If the
program finds a discrepancy in the checksum of the program it is
about to run, it immediately requests a program reload from an
adjacent IMP. These checksums include the checksum computation
itself, the routing program and any constants referenced. This
modification should prevent a hardware failure at one IMP from
affecting the Network at large by stopping the IMP before it does any
damage in terms of spreading bad routing. A version of the IMP
program with this added protection for routing was released on May
22.
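
A minimal sketch of checking a program's own checksum before running
it is given here. The Segment class, run_checked, and the
request_reload callback are hypothetical names, and the additive
checksum is an assumption; the sketch covers the code words and
constants but does not model checksumming the checksum routine itself.

```python
# Illustrative sketch only; names and the checksum algorithm are assumed.
def checksum(words):
    """16-bit additive checksum over a sequence of words."""
    total = 0
    for w in words:
        total = (total + w) & 0xFFFF
    return total


class Segment:
    """A piece of program: its code words plus the constants it references."""

    def __init__(self, code_words, constants):
        self.code_words = list(code_words)
        self.constants = list(constants)
        # Recorded once, when the segment is known to be good.
        self.expected = checksum(self.code_words + self.constants)

    def intact(self):
        """True if the code and constants still match the recorded checksum."""
        return checksum(self.code_words + self.constants) == self.expected


def run_checked(segment, body, request_reload):
    """Run `body` only if the segment verifies; otherwise ask for a reload."""
    if not segment.intact():
        request_reload()  # e.g. ask an adjacent IMP for a fresh copy
        return None
    return body()
```

The point of checking first is that a damaged routine never executes,
so it cannot emit bad routing information before the fault is noticed.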
In the first few months of 1973, there have been several other
efforts aimed at improving the reliability of the Network, in
addition to software checksumming in the IMPs. At the same time that
we were discovering inter-IMP failures with the software checksum
packets, we began to notice a different kind of problem with intra-
IMP failures. In these cases we were primarily faced with memory
problems, and they often affected the IMP program itself, rather than
the packets flowing through the IMP. Our first attack on this
problem was to build a PDP-1 program to verify the running IMP and
TIP programs at a site against the correct core images held at the
PDP-1. The program interrogates the IMP with DDT messages, and
prints out a list of discrepancies. Using this program, we have
already found memory failures at one site.
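
The comparison such a verifier performs can be sketched as follows.
The function name find_discrepancies and the read_word callback
(standing in for a DDT interrogation of the running IMP) are
hypothetical; the actual PDP-1 program is not described in further
detail here.

```python
# Illustrative sketch only; `read_word` is a stand-in for querying the IMP.
def find_discrepancies(reference, read_word):
    """Compare a reference core image against the running program.

    `reference` is the correct image, indexed by address.
    `read_word(addr)` returns the word currently held at that address.
    Returns a list of (address, expected, actual) for every mismatch.
    """
    diffs = []
    for addr, expected in enumerate(reference):
        actual = read_word(addr)
        if actual != expected:
            diffs.append((addr, expected, actual))
    return diffs


def format_report(diffs):
    """Render the discrepancy list as printable lines (octal, as an assumption)."""
    return [f"{addr:06o}: expected {exp:06o}, found {act:06o}"
            for addr, exp, act in diffs]
```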
4. TIP Modifications
The hardware difficulties which we began to experience during the
first few months of 1973 had two effects on Host-to-Host
communication. First, the intermittent modem interface failures, of
the type seen at Belvoir, Aberdeen, and ETAC, meant that messages
were occasionally lost by the network. This loss is reported to the
transmitting Host by the "Incomplete Transmission" message generated
by the source IMP; the Host must then decide whether to retransmit or
to take some other action. Second, the higher than normal incidence
of machine failures meant that the network sometimes "partitioned" so
that there was no path between the two communicating Hosts. (It
should be noted that, contrary to the original design, two sites are
currently connected to the network by only a single path; other
similar connections are planned. For any such sites, any failure
along the single path will be seen as a partition.) Since a TIP acts
as a Host for its users, its resilience when these types of failures
occur has a major effect on user satisfaction.
Prior to this time the TIP program "aborted" the user's connection if
it received an Incomplete Transmission indication from the IMP
program. In March the TIP program (and the programs of several other
Hosts) was changed to retransmit messages for which the Incomplete
Transmission indication was returned; some Hosts (e.g. MULTICS) have
done this from the start. This modification has turned out to be
relatively simple, and we urge other Hosts to consider implementing
some sort of error recovery software. On the other hand, it has not
seemed reasonable to continue attempting to transmit when the program
receives a "Destination Unreachable" indication, since this could
arise either from a network partition or from a failure at the
destination site. The interactive user is, of course, free to try
again manually.
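
The recovery policy just described might look like the following
sketch for a Host. The Status enumeration, the send callback, and the
max_attempts limit are all assumptions made for illustration; no
retry limit is specified here for the TIP.

```python
# Illustrative sketch only; names and the retry limit are assumed.
from enum import Enum


class Status(Enum):
    DELIVERED = 1
    INCOMPLETE_TRANSMISSION = 2
    DESTINATION_UNREACHABLE = 3


def send_with_recovery(send, message, max_attempts=5):
    """Retransmit on Incomplete Transmission; stop on anything else.

    `send(message)` hands the message to the IMP and returns a Status.
    Returns (final_status, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        status = send(message)
        if status is Status.INCOMPLETE_TRANSMISSION:
            continue  # the message was lost in the network: try again
        # Delivered, or Destination Unreachable: automatic retry does not
        # help with the latter, so leave it to the user.
        return status, attempt
    return Status.INCOMPLETE_TRANSMISSION, max_attempts
```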
A different situation pertains to tape transfers involving TIPs with
the magnetic tape option. In these cases, the user would like to
start the process and then ignore it until the transfer is finished.
Network partitions, even if infrequent, may occur when tape transfers
many hours in length are in progress. Therefore, we made a
significant modification to the TIP magnetic tape option to include a
sequencing mechanism in the tape transfer protocol which permits
automatic recovery and transmission continuation after most kinds of
network transients. With this mechanism in effect, and assuming a
tape is mounted at the "other end", the complete transfer of a tape
is possible with a single command given at either end. If the
connection goes dead in mid-transfer, the TIP magnetic tape software
will attempt to reopen the connection until successful and then
continue the transfer from where it was left off. In addition to
modifying the TIP magnetic tape option as specified above, we also
modified the TENEX program which is able to communicate with the TIP
magnetic tape option so that it remained compatible. These changes
were installed in April.
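
A sequencing mechanism of this general kind can be sketched as
follows. This is not the TIP tape protocol itself, whose details are
not given here: the per-record sequence numbers, the duplicate
handling, the max_reopens bound, and every name in the code are
assumptions made for the example.

```python
# Illustrative sketch only; not the actual TIP magnetic tape protocol.
class ConnectionLost(Exception):
    """Raised when the connection goes dead in mid-transfer."""


class Receiver:
    """Far end: accepts records in order and ignores duplicates."""

    def __init__(self):
        self.expected = 0
        self.tape = []

    def accept(self, seq, record):
        if seq == self.expected:
            self.tape.append(record)
            self.expected += 1
        # A record with seq < expected was already written: drop it.


class FlakyLink:
    """Test double: delivers each record, but on the chosen send calls the
    connection dies before the sender learns the record arrived."""

    def __init__(self, receiver, drop_on):
        self.receiver = receiver
        self.drop_on = set(drop_on)
        self.calls = 0

    def send(self, seq, record):
        self.calls += 1
        self.receiver.accept(seq, record)
        if self.calls in self.drop_on:
            raise ConnectionLost()


def transfer_tape(records, open_connection, max_reopens=100):
    """Send all records, reopening and resuming after each lost connection.

    Returns the number of times the connection had to be reopened.
    """
    next_seq = 0
    reopens = 0
    conn = open_connection()
    while next_seq < len(records):
        try:
            conn.send(next_seq, records[next_seq])
            next_seq += 1  # advance only after a confirmed send
        except ConnectionLost:
            if reopens == max_reopens:
                raise
            reopens += 1
            conn = open_connection()  # resume from the unconfirmed record
    return reopens
```

Numbering the records is what makes resumption safe: after a
reconnection the sender repeats the record it is unsure about, and
the receiver can recognize and discard it if it had already arrived.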
5. Future Plans
We have been considering some of the issues of network reliability
discussed above in connection with the development of the new High
Speed Modular IMP. This design effort and the experiences with the
current IMP system are, of course, linked together, and we have
already decided on several approaches to be taken in the new line of
IMPs:
* The IMP will have a hardware CRC checksum generator which
returns the checksum on a specified range of memory.
* The IMP will use this facility to generate and check an end-
to-end checksum on messages. This checksum will therefore be
more comprehensive and better for error detection than the
current software checksum. It will ensure a high degree of
reliability for Host transmissions.
* In addition, the IMP will perform a verification of a packet
checksum at each hop to provide diagnostic information. This
check will be on an optional basis, whenever the system has
available resources for the check.
* The code for the new IMP system will be read-only (this is
impractical for the present 516 and 316 IMPs), and the program
will periodically checksum itself using the hardware CRC
generator. We hope to design the program so that it can be
reloaded in segments in the event of a detected error in the
code, with no service interruption.
* Finally, we are looking into the structure of an optional
IMP-Host/Host-IMP checksum to complete the Host/Host end-to-end
checksum. Under such an arrangement, the IMP and Host could
agree to verify the checksums on the messages transferred over
the interface between them, and the appropriate signalling
mechanisms would be provided to handle errors. With this
technique in effect, two Hosts could be certain that their
messages were delivered error-free or else they would be
notified of an error, and could then retransmit their message
if desired.
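
The planned arrangement of a CRC over a memory range, an end-to-end
check, and an optional per-hop check can be sketched as follows.
The CRC polynomial for the new IMP is not specified here, so the
sketch uses CRC-32 from Python's zlib module purely as a stand-in
for the hardware generator. All function names are hypothetical, and
forwarding a packet even when the hop check fails is an assumption.

```python
# Illustrative sketch only; CRC-32 stands in for the unspecified hardware CRC.
import zlib


def crc_of_range(memory, start, length):
    """CRC of memory[start:start+length], as the hardware generator would return."""
    return zlib.crc32(bytes(memory[start:start + length]))


def make_packet(payload):
    """Source IMP: attach an end-to-end checksum to the message."""
    data = bytes(payload)
    return {"payload": data, "crc": zlib.crc32(data)}


def checksum_ok(packet):
    """True if the payload still matches the checksum it was sent with."""
    return zlib.crc32(packet["payload"]) == packet["crc"]


def forward(packet, resources_available, log):
    """Intermediate hop: optional check, used only for diagnostics here."""
    if resources_available and not checksum_ok(packet):
        log.append("checksum mismatch seen at this hop")
    return packet


def deliver(packet):
    """Destination IMP: the end-to-end check decides acceptance."""
    return packet["payload"] if checksum_ok(packet) else None
```

The same crc_of_range routine could serve the periodic self-check of
read-only code: record the CRC of a code segment when it is loaded,
recompute it later, and reload the segment if the two values differ.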
More details on any such modifications to the IMP and to the
IMP-Host interface will be published when appropriate.
[This RFC was put into machine readable form for entry]
[into the online RFC archives by Via Genie 12/1999]