RFC: 816
FAULT ISOLATION AND RECOVERY
David D. Clark
MIT Laboratory for Computer Science
Computer Systems and Communications Group
July, 1982
1. Introduction
Occasionally, a network or a gateway will go down, and the sequence
of hops which the packet takes from source to destination must change.
Fault isolation is that action which hosts and gateways collectively
take to determine that something is wrong; fault recovery is the
identification and selection of an alternative route which will serve to
reconnect the source to the destination. In fact, the gateways perform
most of the functions of fault isolation and recovery. There are,
however, a few actions which hosts must take if they wish to provide a
reasonable level of service. This document describes the portion of
fault isolation and recovery which is the responsibility of the host.
2. What Gateways Do
Gateways collectively implement an algorithm which identifies the
best route between all pairs of networks. They do this by exchanging
packets which contain each gateway's latest opinion about the
operational status of its neighbor networks and gateways. Assuming that
this algorithm is operating properly, one can expect the gateways to go
through a period of confusion immediately after some network or gateway
has failed, but one can assume that once a period of negotiation has
passed, the gateways are equipped with a consistent and correct model of
the connectivity of the internet. At present this period of negotiation
may actually take several minutes, and many TCP implementations time out
within that period, but it is a design goal of the eventual algorithm
that the gateway should be able to reconstruct the topology quickly
enough that a TCP connection should be able to survive a failure of the
route.
3. Host Algorithm for Fault Recovery
Since the gateways always attempt to have a consistent and correct
model of the internetwork topology, the host strategy for fault recovery
is very simple. Whenever the host feels that something is wrong, it
asks the gateway for advice, and, assuming the advice is forthcoming, it
believes the advice completely. The advice will be wrong only during
the transient period of negotiation, which immediately follows an
outage, but will otherwise be reliably correct.
In fact, it is never necessary for a host to explicitly ask a
gateway for advice, because the gateway will provide it as appropriate.
When a host sends a datagram to some distant net, the host should be
prepared to receive back either of two advisory messages which the
gateway may send. The ICMP "redirect" message indicates that the
gateway to which the host sent the datagram is no longer the best
gateway to reach the net in question. The gateway will have forwarded
the datagram, but the host should revise its routing table to have a
different immediate address for this net. The ICMP "destination
unreachable" message indicates that as a result of an outage, it is
currently impossible to reach the addressed net or host in any manner.
On receipt of this message, a host can either abandon the connection
immediately without any further retransmission, or resend slowly to see
if the fault is corrected in reasonable time.
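As an illustration only, the host's reaction to these two advisory messages might be sketched as follows. The class name HostRoutes, its fields, and the string labels for nets and gateways are hypothetical; nothing here is taken from the ICMP specification beyond the behaviour described in the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class HostRoutes:
    """Hypothetical per-host routing state: destination net -> first-hop gateway."""
    default_gateway: str
    first_hop: dict = field(default_factory=dict)
    unreachable: set = field(default_factory=set)

    def gateway_for(self, net):
        # Use the learned first hop if there is one, else the default gateway.
        return self.first_hop.get(net, self.default_gateway)

    def on_redirect(self, net, better_gateway):
        # The gateway has already forwarded the datagram, so nothing is resent;
        # the host only revises its table for future datagrams to this net.
        self.first_hop[net] = better_gateway

    def on_destination_unreachable(self, net):
        # Record the advice. Whether to abandon the connection or to resend
        # slowly is a policy choice left to the caller (an assumption here).
        self.unreachable.add(net)
```

A caller would consult gateway_for() before each send and check the unreachable set when deciding whether to keep retransmitting.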
If a host could assume that these two ICMP messages would always
arrive when something was amiss in the network, then no other action on
the part of the host would be required in order to maintain its tables in
an optimal condition. Unfortunately, there are two circumstances under
which the messages will not arrive properly. First, during the
transient following a failure, error messages may arrive that do not
correctly represent the state of the world. Thus, hosts must take an
isolated error message with some scepticism. (This transient period is
discussed more fully below.) Second, if the host has been sending
datagrams to a particular gateway, and that gateway itself crashes, then
all the other gateways in the internet will reconstruct the topology,
but the gateway in question will still be down, and therefore cannot
provide any advice back to the host. As long as the host continues to
direct datagrams at this dead gateway, the datagrams will simply vanish
off the face of the earth, and nothing will come back in return. Hosts
must detect this failure.
If some gateway many hops away fails, this is not of concern to the
host, for then the discovery of the failure is the responsibility of the
immediate neighbor gateways, which will perform this action in a manner
invisible to the host. The problem only arises if the very first
gateway, the one to which the host is immediately sending the datagrams,
fails. We thus identify one single task which the host must perform as
its part of fault isolation in the internet: the host must use some
strategy to detect that a gateway to which it is sending datagrams is
dead.
Let us assume for the moment that the host implements some
algorithm to detect failed gateways; we will return later to discuss
what this algorithm might be. First, let us consider what the host
should do when it has determined that a gateway is down. In fact, with
the exception of one small problem, the action the host should take is
extremely simple. The host should select some other gateway, and try
sending the datagram to it. Assuming that gateway is up, this will
either produce correct results, or some ICMP advice. Since we assume
that, ignoring temporary periods immediately following an outage, any
gateway is capable of giving correct advice, once the host has received
advice from any gateway, that host is in as good a condition as it can
hope to be.
There is always the unpleasant possibility that when the host tries
a different gateway, that gateway too will be down. Therefore, whatever
algorithm the host uses to detect a dead gateway must continuously be
applied, as the host tries every gateway in turn that it knows about.
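The recovery step just described, trying each known gateway in turn while continuing to apply the dead-gateway test, can be sketched as below. This is a minimal sketch under stated assumptions: the function name pick_live_gateway and the is_dead callback are hypothetical, and is_dead stands in for whichever detection algorithm the host actually uses.

```python
def pick_live_gateway(gateways, current, is_dead):
    """Return the next gateway not flagged dead, or None if all appear dead.

    gateways: list of known first-hop gateways (hypothetical identifiers).
    current:  the gateway the host has just decided is down.
    is_dead:  callable standing in for the host's dead-gateway detector.
    """
    if not gateways:
        return None
    # Start with the entry after the failed one and wrap around the list,
    # so every known gateway is tried once before giving up.
    start = gateways.index(current) + 1 if current in gateways else 0
    for offset in range(len(gateways)):
        candidate = gateways[(start + offset) % len(gateways)]
        if not is_dead(candidate):
            return candidate
    return None
```

Returning None corresponds to the case in which every gateway the host knows about seems to be down; what the host does then is outside this sketch.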
The only difficult part of this algorithm is to specify the means
by which the host maintains the table of all of the gateways to which it
has immediate access. Currently, the specification of the internet
protocol does not architect any message by which a host can ask to be
supplied with such a table. The reason is that different networks may
provide very different mechanisms by which this table can be filled in.
For example, if the net is a broadcast net, such as an ethernet or a
ringnet, every gateway may simply broadcast such a table from time to
time, and the host need do nothing but listen to obtain the required
information. Alternatively, the network may provide the mechanism of
logical addressing, by which a whole set of machines can be provided
with a single group address, to which a request can be sent for
assistance. Failing those two schemes, the host can build up its table
of neighbor gateways by remembering all the gateways from which it has
ever received a message. Finally, in certain cases, it may be necessary
for this table, or at least the initial entries in the table, to be
constructed manually by a manager or operator at the site. In cases
where the network in question provides absolutely no support for this
kind of host query, at least some manual intervention will be required
to get started, so that the host can find out about at least one
gateway.
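Two of the schemes above, manual seeding and remembering every gateway from which a message has been received, could be combined roughly as follows. The class GatewayTable and its method names are hypothetical, and the sketch assumes the caller can already tell which incoming messages came from a gateway.

```python
class GatewayTable:
    """Hypothetical table of neighbor gateways known to this host."""

    def __init__(self, manual_entries=()):
        # Entries configured by a manager or operator; duplicates are
        # dropped while the configured order is preserved.
        self._gateways = list(dict.fromkeys(manual_entries))

    def learn(self, gateway):
        # Called whenever a message is received from a gateway, for
        # example a broadcast table or an ICMP advisory.
        if gateway not in self._gateways:
            self._gateways.append(gateway)

    def all(self):
        # Return a copy so callers cannot alter the table by accident.
        return list(self._gateways)
```

With at least one manually configured entry, the host can send its first datagram and let the table grow from the messages that come back.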
4. Host Algorithms for Fault Isolation
We now return to the question raised above. What strategy should
the host use to detect that it is talking to a dead gateway, so that it
can know to switch to some other gateway in the list? In fact, there are
several algorithms which can be used. All are reasonably simple to
implement, but they have very different implications for the overhead on
the host, the gateway, and the network. Thus, to a certain extent, the
algorithm picked must depend on the details of the network and of the
host.
1. NETWORK LEVEL DETECTION
Many networks, particularly the Arpanet, perform precisely the
required function internal to the network. If a host sends a datagram
to a dead gateway on the Arpanet, the network will return a "host dead"
message, which is precisely the information the host needs to know in
order to switch to another gateway. Some early implementations of
Internet on the Arpanet threw these messages away. That is an
exceedingly poor idea.
2. CONTINUOUS POLLING
The ICMP protocol provides an echo mechanism by which a host may
solicit a response from a gateway. A host could simply send this
message at a reasonable rate, to assure itself continuously that the
gateway was still up. This works, but, since the message must be sent
fairly often to detect a fault in a reasonable time, it can imply an
unbearable overhead on the host itself, the network, and the gateway.
This strategy is prohibited except where a specific analysis has
indicated that the overhead is tolerable.
3. TRIGGERED POLLING
If the use of polling could be restricted to only those times when
something seemed to be wrong, then the overhead would be bearable.