📄 rfc992.txt
字号:
primitive is needed for the purpose. Additionally, there may be
situations in which ABCAST protocols that also satisfy a CBCAST
ordering property would be valuable. Our ABCAST primitive could
be changed to respect such a rule, and we made use of a multicast
primitive that is simultaneously causal and atomic in our work on
consistent shared bulletin boards ([Birman-c]). For simplicity,
the presentation here assumes that ABCAST is completely orthogo-
nal to CBCAST, but a simple way to build an efficient "causal
atomic" multicast is described in our full-length paper. The
cost of this protocol is only slightly higher than that of
ABCAST.
6.5 Group addressing protocol
Since group membership can change dynamically, it may be diffi-
cult for a process to compute a list of destinations to which a
message should be sent, for example, as is needed to perform a
GBCAST. In [Birman-b] we report on a protocol for ensuring that
a given multicast will be delivered to all members of a process
group in the same view. This view is either the view that was
operative when the message transmission was initiated, or a view
that was defined subsequently. The algorithm is a simple itera-
tive one that costs nothing unless the group membership changes,
and permits the caching of possibly inaccurate membership infor-
mation near processes that might want to communicate with a
group. Using the protocol, a flexible message addressing scheme
can readily be supported.
Iterative addressing is only required when the process transmit-
ting a message has an inaccurate copy of the process group view.
In the implementation we are now building, this would rarely be
the case, and iteration is never needed if the view is known to
be accurate. Thus, iterated delivery should be very infrequent.
6.6 Synchronous versus asynchronous multicast abstractions
Many systems employ RPC internally, as a lowest level primitive
for interaction between processes. It should be evident that all
of our multicast primitives can be used to implement replicated
remote procedure calls [Cooper]: the caller would simply pause
until replies have been received from all the participants
(observation of a failure constitutes a reply in this case). We
term such a use of the primitives synchronous, to distinguish it
from from an asynchronous multicast in which no replies, or just
one reply, suffices.
In our work on ISIS, GBCAST and ABCAST are normally invoked syn-
chronously, to implement a remote procedure call by one member of
an object on all the members of its process group. However,
CBCAST, which is the most frequently used overall, is almost
never invoked synchronously. Asynchronous CBCAST's are the
Birman & Joseph [Page 10]
RFC 992 November 1986
primary source of concurrency in ISIS: although the delivery ord-
ering is assured, transmission can be delayed to enable a message
to be piggybacked on another, or to schedule IO within the system
as a whole. While the system cannot defer an asynchronous multi-
cast indefinitely, the ability to defer it a little, without
delaying some computation by doing so, permits load to be
smoothed. Since CBCAST respects the delivery orderings on which
a computation might depend, and is ordered with respect to
failures, the concurrency introduced does not complicate higher
level algorithms. Moreover, the protocol itself is extremely
cheap.
A problem is introduced by our decision to allow asynchronous
multicasts: the atomic reception property must now be extended to
address causally related sequences of asynchronous messages. If
a failure were to result in some multicasts being delivered to
all their destinations but others that precede them not being
delivered anywhere, inconsistency might result even if the desti-
nations do not overlap. We therefore extend the atomicity pro-
perty as follows. If process t receives a message m from process
s, and s subsequently fails, then unless t fails as well, all
messages m' that s received prior to its failure must be
delivered to their remaining operational destinations. This is
because the state of t may now depend on the contents of any such
m', hence the system state could become inconsistent if the
delivery of m' were not completed. The costs of the protocols
are not affected by this change.
A second problem arises when the user-level implications of this
atomicity rule are considered. In the event of a failure, any
suffix of a sequence of aysnchronous multicasts could be lost and
the system state would still be internally consistent. A process
that is about to take some action that may leave an externally
visible side-effect will need a way to pause until it is
guaranteed that such multicasts have actually been delivered.
For this purpose, a flush primitive is provided. Occasional
calls to flush do not eliminate the benefit of using CBCAST asyn-
chronously. Unless the system has built up a considerable back-
log of undelivered multicast messages, which should be rare,
flush will only pause while transmission of the last few multi-
casts complete.
7. Using the primitives
The reliable communication primitives described above lead to simple
solutions for the problems cited in Sec. 4:
[1] Synchronization. Many synchronization problems are subsumed
into the primitives themselves. For example, consider the use of
GBCAST to implement recovery. A recovering process would issue a
GBCAST to the process group members, requesting that state
Birman & Joseph [Page 11]
RFC 992 November 1986
information be transferred to it. In addition to sending the
current state of the group to the recovering process, group
members update the process group view at this time. Subsequent
messages to the group will be delivered to the recovered process,
with all necessary synchronization being provided by the ordering
properties of GBCAST. In situations where other forms of syn-
chronization are needed, ABCAST provides a simple way to ensure
that several processes take actions in the same order, and this
form of low-level synchronization simplifies a number of higher-
level synchronization problems. For example, if ABCAST is used
to do P() and V() operations on a distributed semaphore, the
order of operations on the semaphore is set by the ABCAST, hence
all the managers of the semaphore see these operations in a fixed
order.
[2] Failure detection. Consistent failure (and recovery) detec-
tion are trivial using our primitives: a process simply waits for
the appropriate process group view to change. This facilitates
the implementation of algorithms in which one processes monitors
the status of another process. A process that acts on the basis
of a process group view change does so with the assurance that
other group members will (eventually) observe the same event and
will take consistent actions.
[3] Consistency. We believe that consistency is generally
expressible as a set of atomicity and ordering constraints on
message delivery, particularly causal ones of the sort provided
by CBCAST. Our primitives permit a process to specify the com-
munication properties needed to achieve a desired form of con-
sistency. Continued research will be needed to understand pre-
cisely how to pick the weakest primitive in a designated situa-
tion.
[4] Serializability. To achieve serializability, one implements
a concurrency control algorithm and then forces computations to
respect the serialization order that this algorithm choses. The
ABCAST primitive, as observed above, is a powerful tool for
establishing an order between concurrent events, e.g. by lock
acquisition. Having established such an order, CBCAST can be
used to distribute information about the computation and also its
termination (commit or abort). Any process that observes the
commit or abort of a computation will only be able to interact
with data managers that have received messages preceding the com-
mit or abort, hence a highly asynchronous transactional execution
results. If a process running a computation fails, this is
detected when a failure GBCAST is received instead of the commit.
Thus, executions are simple and quite deterministic.
If commit is conditional, CBCAST would be used to first interro-
gate participants to learn if they are prepared to commit, and
then to transmit the commit or abort decision (the usual two-
Birman & Joseph [Page 12]
RFC 992 November 1986
phase commit). On the other hand, conditional commits can often
be avoided using our approach. A method for building transac-
tions that will roll-forward after failure after failure is dis-
cussed in more detail in [Birman-a] [Joseph-a] [Joseph-b]. Other
forms of concurrency control, such as timestamp generation, can
similarly be implemented using ABCAST and CBCAST. We view tran-
sactional data storage as an application-level concern, which can
be handled using a version stack approach or a multi-version
store, or any other appropriate mechanism.
8. Implementation
The communication primitives can be built in layers, starting with a
bare network providing unreliable Internet datagrams. The software
structure is, however, less mature and more complex than the one sug-
gested in RFC's 966 and 988. For example, at this stage of our
research we do not understand how to optimize our protocols to the
same extent as for the unreliable host multicast approach described
in those RFC's. Thus, the implementation we describe here should be
understood to be a prototype. A particularly intriguing question,
which we are investigating actively, concerns the use of a "best
effort" ethernet or Internet multicast as a tool to optimize the
implementation of our protocols.
Our basic approach is to view large area networks as a set of clus-
ters of sites interconnected by high speed LAN devices and intercon-
nected by slower long-haul links. We first provide protocols for use
within clusters, and then extend them to run between clusters too.
Network partitioning can be tolerated at all levels of the hierarchy
in the sense that no incorrect actions can result after network par-
titioning, although our approach will sometimes block until the par-
tition is repaired. Our protocols also tend to block within a clus-
ter while the list of operational sites for that cluster is being
changed. In normal LAN's, this happens infrequently (during site
failure or recovery), and would not pose a problem. (In failure
intensive applications, alternative protocols might be needed to
address this issue).
The lowest level of our software uses a site-to-site acknowledgement
protocol to convert the unreliable packet transport this into a
sequenced, error-free message abstraction, using timeouts to detect
apparent failures. TCP can also be used for this purpose, provided
that a "filter" is placed on the incoming message stream and certain
types of messages are handled specially. An agreement protocol is
then used to order the site-failures and recoveries consistently. If
timeouts cause a failure to be detected erroneously, the protocol
forces the affected site to undergo recovery.
Built on this is a layer that supports the primitives themselves.
CBCAST has a very light-weight implementation, based on the idea of
flooding the system with copies of a message: Each process buffers
Birman & Joseph [Page 13]
RFC 992 November 1986
copies of any messages needed to ensure the consistency of its view
of the system. If message m is delivered to process p, and m is
potentially causally dependent on a message m prime, then a copy of m
prime is sent to p as well (duplicates are discarded). A garbage
collector deletes superfluous copies after a message has reached all
its destinations. By using extensive piggybacking and a simple
scheduling algorithm to control message transmission, the cost of a
CBCAST is kept low -- often, less than one packet per destination.
ABCAST employs a two-phase protocol based on one suggested to us by
Skeen [Skeen-b]. This protocol has higher latency than CBCAST
because delivery can only occur during the second phase; ABCAST is
thus inherently synchronous. In ISIS, however, ABCAST is used
rarely; we believe that this would be the case in other systems as
well. GBCAST is implemented using a two-phase protocol similar to
the one for ABCAST, but with an additional mechanism that flushes
messages from a failed process before delivering the GBCAST announc-
ing the failure. Although GBCAST is slower than ABCAST or CBCAST, it
is used rarely enough so that performance is probably less of an
issue here -- and in any case, even GBCAST could be tuned to give
very high throughput. Preliminary performance figures appear in
[Birman-b].
Although satisfactory performance should be possible using an imple-
mentation that sits on top of a conventional Internet mechanism, it
should be noted that to achieve really high rates of communication
the layers of software described above must reside in the kernel,
because they run on behalf of large numbers of clients, run fre-
quently, and tend to execute for very brief periods before doing I/O
and pausing. A non-kernel implementation will thus incur high
scheduling and context switching overhead. Additionally, it is not
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -