RFC:  816

FAULT ISOLATION AND RECOVERY

David D. Clark
MIT Laboratory for Computer Science
Computer Systems and Communications Group
July, 1982

1.  Introduction

Occasionally, a network or a gateway will go down, and the sequence
of hops which the packet takes from source to destination must change.
Fault isolation is that action which hosts and gateways collectively
take to determine that something is wrong; fault recovery is the
identification and selection of an alternative route which will serve to
reconnect the source to the destination.  In fact, the gateways perform
most of the functions of fault isolation and recovery.  There are,
however, a few actions which hosts must take if they wish to provide a
reasonable level of service.  This document describes the portion of
fault isolation and recovery which is the responsibility of the host.

2.  What Gateways Do

Gateways collectively implement an algorithm which identifies the
best route between all pairs of networks.  They do this by exchanging
packets which contain each gateway's latest opinion about the
operational status of its neighbor networks and gateways.  Assuming that
this algorithm is operating properly, one can expect the gateways to go
through a period of confusion immediately after some network or gateway
has failed, but one can assume that once a period of negotiation has
passed, the gateways are equipped with a consistent and correct model of
the connectivity of the internet.  At present this period of negotiation
may actually take several minutes, and many TCP implementations time out
within that period, but it is a design goal of the eventual algorithm
that the gateway should be able to reconstruct the topology quickly
enough that a TCP connection should be able to survive a failure of the
route.

3.  Host Algorithm for Fault Recovery

Since the gateways always attempt to have a consistent and correct
model of the internetwork topology, the host strategy for fault recovery
is very simple.
Whenever the host feels that something is wrong, it
asks the gateway for advice, and, assuming the advice is forthcoming, it
believes the advice completely.  The advice will be wrong only during
the transient period of negotiation, which immediately follows an
outage, but will otherwise be reliably correct.

In fact, it is never necessary for a host to explicitly ask a
gateway for advice, because the gateway will provide it as appropriate.
When a host sends a datagram to some distant net, the host should be
prepared to receive back either of two advisory messages which the
gateway may send.  The ICMP "redirect" message indicates that the
gateway to which the host sent the datagram is no longer the best
gateway to reach the net in question.  The gateway will have forwarded
the datagram, but the host should revise its routing table to have a
different immediate address for this net.  The ICMP "destination
unreachable" message indicates that as a result of an outage, it is
currently impossible to reach the addressed net or host in any manner.
On receipt of this message, a host can either abandon the connection
immediately without any further retransmission, or resend slowly to see
if the fault is corrected in reasonable time.

If a host could assume that these two ICMP messages would always
arrive when something was amiss in the network, then no other action on
the part of the host would be required in order to maintain its tables
in an optimal condition.  Unfortunately, there are two circumstances
under which the messages will not arrive properly.  First, during the
transient following a failure, error messages may arrive that do not
correctly represent the state of the world.  Thus, hosts must take an
isolated error message with some scepticism.  (This transient period is
discussed more fully below.)
Second, if the host has been sending
datagrams to a particular gateway, and that gateway itself crashes, then
all the other gateways in the internet will reconstruct the topology,
but the gateway in question will still be down, and therefore cannot
provide any advice back to the host.  As long as the host continues to
direct datagrams at this dead gateway, the datagrams will simply vanish
off the face of the earth, and nothing will come back in return.  Hosts
must detect this failure.

If some gateway many hops away fails, this is not of concern to the
host, for then the discovery of the failure is the responsibility of the
immediate neighbor gateways, which will perform this action in a manner
invisible to the host.  The problem only arises if the very first
gateway, the one to which the host is immediately sending the datagrams,
fails.  We thus identify one single task which the host must perform as
its part of fault isolation in the internet: the host must use some
strategy to detect that a gateway to which it is sending datagrams is
dead.

Let us assume for the moment that the host implements some
algorithm to detect failed gateways; we will return later to discuss
what this algorithm might be.  First, let us consider what the host
should do when it has determined that a gateway is down.  In fact, with
the exception of one small problem, the action the host should take is
extremely simple.  The host should select some other gateway, and try
sending the datagram to it.  Assuming that gateway is up, this will
either produce correct results, or some ICMP advice.  Since we assume
that, ignoring temporary periods immediately following an outage, any
gateway is capable of giving correct advice, once the host has received
advice from any gateway, that host is in as good a condition as it can
hope to be.

There is always the unpleasant possibility that when the host tries
a different gateway, that gateway too will be down.
Therefore, whatever
algorithm the host uses to detect a dead gateway must continuously be
applied, as the host tries every gateway in turn that it knows about.

The only difficult part of this algorithm is to specify the means
by which the host maintains the table of all of the gateways to which it
has immediate access.  Currently, the specification of the internet
protocol does not architect any message by which a host can ask to be
supplied with such a table.  The reason is that different networks may
provide very different mechanisms by which this table can be filled in.
For example, if the net is a broadcast net, such as an ethernet or a
ringnet, every gateway may simply broadcast such a table from time to
time, and the host need do nothing but listen to obtain the required
information.  Alternatively, the network may provide the mechanism of
logical addressing, by which a whole set of machines can be provided
with a single group address, to which a request can be sent for
assistance.  Failing those two schemes, the host can build up its table
of neighbor gateways by remembering all the gateways from which it has
ever received a message.  Finally, in certain cases, it may be necessary
for this table, or at least the initial entries in the table, to be
constructed manually by a manager or operator at the site.  In cases
where the network in question provides absolutely no support for this
kind of host query, at least some manual intervention will be required
to get started, so that the host can find out about at least one
gateway.

4.  Host Algorithms for Fault Isolation

We now return to the question raised above.  What strategy should
the host use to detect that it is talking to a dead gateway, so that it
can know to switch to some other gateway in the list?  In fact, there
are several algorithms which can be used.  All are reasonably simple to
implement, but they have very different implications for the overhead on
the host, the gateway, and the network.
Thus, to a certain extent, the
algorithm picked must depend on the details of the network and of the
host.

1.  NETWORK LEVEL DETECTION

Many networks, particularly the Arpanet, perform precisely the
required function internal to the network.  If a host sends a datagram
to a dead gateway on the Arpanet, the network will return a "host dead"
message, which is precisely the information the host needs to know in
order to switch to another gateway.  Some early implementations of
Internet on the Arpanet threw these messages away.  That is an
exceedingly poor idea.

2.  CONTINUOUS POLLING

The ICMP protocol provides an echo mechanism by which a host may
solicit a response from a gateway.  A host could simply send this
message at a reasonable rate, to assure itself continuously that the
gateway was still up.  This works, but, since the message must be sent
fairly often to detect a fault in a reasonable time, it can imply an
unbearable overhead on the host itself, the network, and the gateway.
This strategy is prohibited except where a specific analysis has
indicated that the overhead is tolerable.

3.  TRIGGERED POLLING

If the use of polling could be restricted to only those times when
something seemed to be wrong, then the overhead would be bearable.
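
As an illustration only, the host recovery strategy of Section 3 (believe
a redirect completely, and on detecting a dead gateway try the next known
gateway in turn) might be sketched in Python as follows.  The class and
method names (HostRouter, learn_gateway, on_redirect, on_gateway_dead) and
the string gateway names are hypothetical and do not come from any
specification.  How a gateway is judged to be dead is deliberately left
out, since that is the subject of Section 4.

```python
class HostRouter:
    """Hypothetical host-side gateway selection (a sketch, not a spec)."""

    def __init__(self, gateways):
        # At least one gateway must be known to get started, possibly
        # entered manually by an operator.
        if not gateways:
            raise ValueError("at least one gateway must be known")
        self.gateways = list(gateways)    # every neighbor gateway known
        self.default = self.gateways[0]   # used for nets with no route yet
        self.routes = {}                  # destination net -> gateway in use

    def learn_gateway(self, gateway):
        # Remember any gateway from which a message has been received.
        if gateway not in self.gateways:
            self.gateways.append(gateway)

    def gateway_for(self, net):
        # First use of a net falls back to the current default gateway.
        return self.routes.setdefault(net, self.default)

    def on_redirect(self, net, better_gateway):
        # ICMP redirect: revise the immediate address used for this net.
        self.learn_gateway(better_gateway)
        self.routes[net] = better_gateway

    def on_gateway_dead(self, dead_gateway):
        # Move every route that used the dead gateway to the next gateway
        # in the table, wrapping around so each one is tried in turn.
        self.learn_gateway(dead_gateway)
        i = self.gateways.index(dead_gateway)
        replacement = self.gateways[(i + 1) % len(self.gateways)]
        for net, gateway in list(self.routes.items()):
            if gateway == dead_gateway:
                self.routes[net] = replacement
        if self.default == dead_gateway:
            self.default = replacement
        return replacement


router = HostRouter(["gw-a", "gw-b", "gw-c"])
router.on_redirect("net-10", "gw-b")   # advice from gw-a is believed
router.on_gateway_dead("gw-b")         # net-10 now goes through gw-c
```

Because the replacement gateway may itself be down, on_gateway_dead is
meant to be called again each time the detection algorithm fires, which
walks the table in order.  With a single known gateway the sketch simply
keeps using it.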