📄 rfc992.txt
字号:
If one rules out "unreliable" communication mechanisms, there are
basically two fault-tolerant alternatives that can be pursued.
The first approach is to provide mechanisms for transactional
interactions between processes that communicate using remote pro-
cedure calls [Birrell]. This has lead to work on nested transactions
(due to nested RPC's) [Moss], support for transactions at the
language level [Liskov], transactions within an operating systems
kernel [Spector] [Allchin] [Popek] [Lazowska], and transactional
access to higher-level replicated services, such as resilient objects
in ISIS or relations in database systems. The primitives in a tran-
sactional system provide mechanisms for distributing the request that
initiates the transaction, accessing data (which may be replicated),
performing concurrency control, and implementing commit or abort.
Additional mechanisms are normally needed for orphan termination,
deadlock detection, etc. The issue then arises of how these mechan-
isms should themselves be implemented.
Our work in ISIS leads us to believe that whereas transactions are
easily implemented on top of fault-tolerant process groups -- we have
done so -- the converse is much harder. Moreover, transactions
Birman & Joseph [Page 5]
RFC 992 November 1986
represent a relatively heavy-weight solution to the problems surveyed
in the previous section, and might impose an unacceptable overhead on
subsystems that need to run non-transactionally, for example because
a pair of concurrent processes needs to interact on a frequent basis.
(We are not saying that "transactional" mechanisms such as cobegins
and toplevel actions can't solve this problem, but just that they
yield a solution that is awkward and costly). This sort of reasoning
has lead us to focus on non-transactional interaction mechanisms, and
to treat transactions as a special class of mechanisms used when
processes that have been designed to employ a transactional protocol
interact.
The second approach involves the provision of a communication primi-
tive, such as atomic broadcast, which can be used as the framework on
which higher level algorithms are designed. Such a primitive seeks
to deliver messages reliably to some set of destinations, despite the
possibility that failures might occur during the execution of the
protocol. Above, we termed this the fault tolerant process group
approach, since it lends itself to the organization of cooperating
processes into groups, as described in the introduction. Process
groups are an extremely flexible abstraction, and have been employed
in the V Kernel [Cheriton] and in UNIX, and more recently in the ISIS
system. A proposal to provide Internet support for host groups was
raised in RFC's 966 and 988. However, the idea of adapting the pro-
cess group approach to work reliably in an environment subject to the
sorts of exception events and concurrency cited in the previous sec-
tion seems to be new.
As noted earlier, existing reliable communication protocols do not
address the requirements of fault-tolerant process groups. For exam-
ple, in [Schneider], an implementation of a reliable multicast primi-
tive is described. Such a primitive ensures that a designated mes-
sage will be transmitted from one site to all other operational sites
in a system; if a failure occurs but any site has received the mes-
sage, all will eventually do so. [Chang] and [Cristian] describe
implementations for atomic broadcast, which is a reliable broadcast
(sent to all sites in a system) with the additional property that
messages are delivered in the same order at all overlapping destina-
tions, and this order preserves the transmission order if messages
originate in a single site.
Atomic broadcast is a powerful abstraction, and essentially the same
behavior is provided by one of the multicast primitives we discuss in
the next section. However, it has several drawbacks which made us
hesitant to adopt it as the only primitive in the system. Most seri-
ous is the latency that is incurred in order to satisfy the delivery
ordering property. Without delving deeply into the implementations,
which are based on a token scheme in [Chang] and an acknowledgement
protocol in [Schneider], we observe that the delaying of certain mes-
sages is fundamental to the establishment of a unique global delivery
ordering; indeed, it is easy to prove on knowledge theoretic grounds
Birman & Joseph [Page 6]
RFC 992 November 1986
that this must always be the case. In [Chang] a primary goal is to
minimize the number of messages sent, and the protocol given performs
extremely well in this regard. However, a delay occurs while waiting
for tokens to arrive and the delivery latency that results may be
high. [Cristian] assumes that clocks are closely synchronized and
that message transit times are bounded by well-known constants, and
uses this to derive atomic broadcast protocols tolerant of increas-
ingly severe classes of failures. The protocols explicitly delay
delivery to achieve the desired global ordering on multicasts. For
reasons discussed below, this tends to result in high latency in typ-
ical local networking environments. An additional drawback of the
atomic broadcast protocols is that no mechanism is provided for
ensuring that all processes observe the same sequence of failures and
recoveries, or for ensuring that failures and recoveries are ordered
relative to ongoing multicasts. Since this problem arises in any
setting where one process monitors another, we felt it should be
addressed at the same level as the communication protocol. Finally,
one wants a group oriented multicast protocol, not a site oriented
broadcast, and this issue must be resolved too.
6. Our multicast primitives
We now describe three multicast protocols - GBCAST, ABCAST, and
CBCAST - for transmitting a message reliably from a sender process to
some set of destination processes. Details of the protocols and
their correctness proofs can be found in [Birman-b]. The protocols
ensure "all or nothing" behavior: if any destination receives a mes-
sage, then unless it fails, all destinations will receive it. Group
addressing is discussed in Sec. 6.5.
The failure model that one adopts has a considerable impact on the
structure of the resulting system. We adopted the model of fail-stop
processors [Schneider]: when failures occur, a processor simply stops
(crashes), as do all the processes executing on it. We also assume
that individual processes can crash, and that this is detected when
it occurs by a monitoring mechanism present at each site. Further
assumptions are sometimes made about the availability of synchronized
realtime clocks. Here, we adopt the position that although reason-
ably accurate elapsed-time clocks may be available, closely synchron-
ized clocks probably will not be. For example, the 60Hz "line"
clocks commonly used on current workstations are only accurate to
16ms. On the other hand, 4-8ms inter-site message transit times are
common and 1-2ms are reported increasingly often. Thus, it is impos-
sible to synchronize clocks to better than 32-48ms, enough time for a
pair of sites to exchange between 4 and 50 messages. Even with
advancing technology, it seems safe to assume that clock skew will
remain "large" when compared to inter-site message transmission
speed. In particular, this argues against time-based protocols such
as the one used in [Cristian]
Birman & Joseph [Page 7]
RFC 992 November 1986
6.1 The GBCAST primitive
GBCAST (group multicast) is the most constrained, and costly, of
the three primitives. It is used to transmit information about
failures and recoveries to members of a process group. A recov-
ering member uses GBCAST to inform the operational ones that it
has become available. Additionally, when a member fails, the
system arranges for a GBCAST to be issued to group members on its
behalf, informing them of its failure. Arguments to GBCAST are a
message and a process group identifier, which is translated into
a set of destinations as described below (Sec. 6.5).
Our GBCAST protocol ensures that if any process receives a multi-
cast B before receiving a GBCAST G, then all overlapping destina-
tions will receive B before G <1> This is true regardless of the
type of multicast involved. Moreover, when a failure occurs, the
corresponding GBCAST message is delivered after any other multi-
casts from the failed process. Each member can therefore main-
tain a VIEW listing the membership of the process group, updating
it when a GBCAST is received. Although VIEW's are not updated
simultaneously in real time, all members observe the same
sequence of VIEW changes. Since, GBCAST's are ordered relative
to all other multicasts, all members receiving a given multicast
will have the same value of VIEW when they receive it.
Notice that GBCAST also provides a convenient way to change other
global properties of a group "atomically". In our work, we have
used GBCAST to dynamically change a ranking on the members of a
group, to request that group members establish checkpoints for
use if recovery is needed after all failure, and to implement
process migration. In each case, the ordering of GBCAST relative
to other events that makes it possible to perform the desired
action without running any additional protocol. Other uses for
GBCAST will no doubt emerge as our research continues.
Members of a process group can also use the value of VIEW to pick
a strategy for processing an incoming request, or to react to
failure or recovery without having to run any special protocol
first. Since the GBCAST ordering is the same everywhere, their
actions will all be consistent. Notice that when all the members
of a process group may have failed, GBCAST also provides an inex-
pensive way to determine the last site that failed: process group
members simply log each value of VIEW that becomes defined on
stable storage before using it; a simplified version of the algo-
rithm in [Skeen-a] can then be executed when recovering from
failure.
Birman & Joseph [Page 8]
RFC 992 November 1986
6.2 The ABCAST primitive
The GBCAST primitive is too costly to be used for general commun-
ication between process group members. This motivates the intro-
duction of weaker (less ordered) primitives, which might be used
in situations where a total order on multicast messages is not
necessary. Our second primitive, ABCAST (atomic multicast),
satisfies such a weaker constraint. Specifically, it is often
desired that if two multicasts are received in some order at a
common destination site, they be received in that order at all
other common destinations, even if this order was not predeter-
mined. For example, if a process group is being used to maintain
a replicated queue and ABCAST is used to transmit queue opera-
tions to all copies, the operations will be done in the same
order everywhere, hence the copies of the queue will remain mutu-
ally consistent. The primitive ABCAST(msg, label, dests) pro-
vides this behavior. Two ABCAST's having the same label are
delivered in the same order at all common destinations.
6.3 The CBCAST primitive
Our third primitive, CBCAST (causal multicast), is weakest in the
sense that it involves less distributed synchronization then
GBCAST or ABCAST. CBCAST(msg, dests) atomically delivers msg to
each operational dest. The CBCAST protocol ensures that if two
multicasts are potentially causally dependent on another, then
the former is delivered after the latter at all overlapping des-
tinations. A multicast B' is potentially causally dependent on a
multicast B if both multicasts originate from the same process,
and B' is sent after B, or if there exists a chain of message
transmissions and receptions or local events by which knowledge
could have been transferred from the process that issued B to the
process that issued B' [Lamport]. For causally independent mul-
ticasts, the delivery ordering is not constrained.
CBCAST is valuable in systems like ISIS, where concurrency con-
trol algorithms are used to synchronize concurrent computations.
In these systems, if two processes communicate concurrently with
the same process the messages are almost always independent ones
that can be processed in any order: otherwise, concurrency con-
trol would have caused one to pause until the other was finished.
On the other hand, order is clearly important within a causally
linked series of multicasts, and it is precisely this sort of
order that CBCAST respects.
6.4 Other multicast primitives
A weaker multicast primitive is reliable multicast, which pro-
vides all-or-nothing delivery, but no ordering properties. The
formulation of CBCAST in [Birman-b] actually includes a mechanism
for performing multicasts of this sort, hence no special
Birman & Joseph [Page 9]
RFC 992 November 1986
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -