📄 rfc992.txt

📁 RFC 的详细文档！
💻 TXT
📖 第 1 页 / 共 4 页
字号:
       primitive is needed for the purpose.  Additionally, there may be
       situations in which ABCAST protocols that also satisfy a CBCAST
       ordering property would be valuable.  Our ABCAST primitive could
       be changed to respect such a rule, and we made use of a multicast
       primitive that is simultaneously causal and atomic in our work on
       consistent shared bulletin boards ([Birman-c]).  For simplicity,
       the presentation here assumes that ABCAST is completely orthogo-
       nal to CBCAST, but a simple way to build an efficient "causal
       atomic" multicast is described in our full-length paper.  The
       cost of this protocol is only slightly higher than that of
       ABCAST.

   6.5 Group addressing protocol

       Since group membership can change dynamically, it may be diffi-
       cult for a process to compute a list of destinations to which a
       message should be sent, for example, as is needed to perform a
       GBCAST.  In [Birman-b] we report on a protocol for ensuring that
       a given multicast will be delivered to all members of a process
       group in the same view.  This view is either the view that was
       operative when the message transmission was initiated, or a view
       that was defined subsequently.  The algorithm is a simple itera-
       tive one that costs nothing unless the group membership changes,
       and permits the caching of possibly inaccurate membership infor-
       mation near processes that might want to communicate with a
       group.  Using the protocol, a flexible message addressing scheme
       can readily be supported.

       Iterative addressing is only required when the process transmit-
       ting a message has an inaccurate copy of the process group view.
       In the implementation we are now building, this would rarely be
       the case, and iteration is never needed if the view is known to
       be accurate.  Thus, iterated delivery should be very infrequent.

   6.6 Synchronous versus asynchronous multicast abstractions

       Many systems employ RPC internally, as a lowest level primitive
       for interaction between processes.  It should be evident that all
       of our multicast primitives can be used to implement replicated
       remote procedure calls [Cooper]: the caller would simply pause
       until replies have been received from all the participants
       (observation of a failure constitutes a reply in this case).  We
       term such a use of the primitives synchronous, to distinguish it
       from from an asynchronous multicast in which no replies, or just
       one reply, suffices.

       In our work on ISIS, GBCAST and ABCAST are normally invoked syn-
       chronously, to implement a remote procedure call by one member of
       an object on all the members of its process group.  However,
       CBCAST, which is the most frequently used overall, is almost
       never invoked synchronously.  Asynchronous CBCAST's are the



Birman & Joseph                                                [Page 10]

RFC 992                                                    November 1986


       primary source of concurrency in ISIS: although the delivery ord-
       ering is assured, transmission can be delayed to enable a message
       to be piggybacked on another, or to schedule IO within the system
       as a whole.  While the system cannot defer an asynchronous multi-
       cast indefinitely, the ability to defer it a little, without
       delaying some computation by doing so, permits load to be
       smoothed.  Since CBCAST respects the delivery orderings on which
       a computation might depend, and is ordered with respect to
       failures, the concurrency introduced does not complicate higher
       level algorithms.  Moreover, the protocol itself is extremely
       cheap.

       A problem is introduced by our decision to allow asynchronous
       multicasts: the atomic reception property must now be extended to
       address causally related sequences of asynchronous messages.  If
       a failure were to result in some multicasts being delivered to
       all their destinations but others that precede them not being
       delivered anywhere, inconsistency might result even if the desti-
       nations do not overlap.  We therefore extend the atomicity pro-
       perty as follows.  If process t receives a message m from process
       s, and s subsequently fails, then unless t fails as well, all
       messages m' that s received prior to its failure must be
       delivered to their remaining operational destinations.  This is
       because the state of t may now depend on the contents of any such
       m', hence the system state could become inconsistent if the
       delivery of m' were not completed.  The costs of the protocols
       are not affected by this change.

       A second problem arises when the user-level implications of this
       atomicity rule are considered.  In the event of a failure, any
       suffix of a sequence of aysnchronous multicasts could be lost and
       the system state would still be internally consistent.  A process
       that is about to take some action that may leave an externally
       visible side-effect will need a way to pause until it is
       guaranteed that such multicasts have actually been delivered.
       For this purpose, a flush primitive is provided.  Occasional
       calls to flush do not eliminate the benefit of using CBCAST asyn-
       chronously.  Unless the system has built up a considerable back-
       log of undelivered multicast messages, which should be rare,
       flush will only pause while transmission of the last few multi-
       casts complete.

7. Using the primitives

   The reliable communication primitives described above lead to simple
   solutions for the problems cited in Sec. 4:

       [1]  Synchronization.  Many synchronization problems are subsumed
       into the primitives themselves.  For example, consider the use of
       GBCAST to implement recovery.  A recovering process would issue a
       GBCAST to the process group members, requesting that state



Birman & Joseph                                                [Page 11]

RFC 992                                                    November 1986


       information be transferred to it.  In addition to sending the
       current state of the group to the recovering process, group
       members update the process group view at this time.  Subsequent
       messages to the group will be delivered to the recovered process,
       with all necessary synchronization being provided by the ordering
       properties of GBCAST.  In situations where other forms of syn-
       chronization are needed, ABCAST provides a simple way to ensure
       that several processes take actions in the same order, and this
       form of low-level synchronization simplifies a number of higher-
       level synchronization problems.  For example, if ABCAST is used
       to do P() and V() operations on a distributed semaphore, the
       order of operations on the semaphore is set by the ABCAST, hence
       all the managers of the semaphore see these operations in a fixed
       order.

       [2]  Failure detection.  Consistent failure (and recovery) detec-
       tion are trivial using our primitives: a process simply waits for
       the appropriate process group view to change.  This facilitates
       the implementation of algorithms in which one processes monitors
       the status of another process.  A process that acts on the basis
       of a process group view change does so with the assurance that
       other group members will (eventually) observe the same event and
       will take consistent actions.

       [3]  Consistency.  We believe that consistency is generally
       expressible as a set of atomicity and ordering constraints on
       message delivery, particularly causal ones of the sort provided
       by CBCAST.  Our primitives permit a process to specify the com-
       munication properties needed to achieve a desired form of con-
       sistency.  Continued research will be needed to understand pre-
       cisely how to pick the weakest primitive in a designated situa-
       tion.

       [4]  Serializability.  To achieve serializability, one implements
       a concurrency control algorithm and then forces computations to
       respect the serialization order that this algorithm choses.  The
       ABCAST primitive, as observed above, is a powerful tool for
       establishing an order between concurrent events, e.g. by lock
       acquisition.  Having established such an order, CBCAST can be
       used to distribute information about the computation and also its
       termination (commit or abort).  Any process that observes the
       commit or abort of a computation will only be able to interact
       with data managers that have received messages preceding the com-
       mit or abort, hence a highly asynchronous transactional execution
       results.  If a process running a computation fails, this is
       detected when a failure GBCAST is received instead of the commit.
       Thus, executions are simple and quite deterministic.

       If commit is conditional, CBCAST would be used to first interro-
       gate participants to learn if they are prepared to commit, and
       then to transmit the commit or abort decision (the usual two-



Birman & Joseph                                                [Page 12]

RFC 992                                                    November 1986


       phase commit).  On the other hand, conditional commits can often
       be avoided using our approach.  A method for building transac-
       tions that will roll-forward after failure after failure is dis-
       cussed in more detail in [Birman-a] [Joseph-a] [Joseph-b].  Other
       forms of concurrency control, such as timestamp generation, can
       similarly be implemented using ABCAST and CBCAST.  We view tran-
       sactional data storage as an application-level concern, which can
       be handled using a version stack approach or a multi-version
       store, or any other appropriate mechanism.

8. Implementation

   The communication primitives can be built in layers, starting with a
   bare network providing unreliable Internet datagrams.  The software
   structure is, however, less mature and more complex than the one sug-
   gested in RFC's 966 and 988.  For example, at this stage of our
   research we do not understand how to optimize our protocols to the
   same extent as for the unreliable host multicast approach described
   in those RFC's.  Thus, the implementation we describe here should be
   understood to be a prototype.  A particularly intriguing question,
   which we are investigating actively, concerns the use of a "best
   effort" ethernet or Internet multicast as a tool to optimize the
   implementation of our protocols.

   Our basic approach is to view large area networks as a set of clus-
   ters of sites interconnected by high speed LAN devices and intercon-
   nected by slower long-haul links.  We first provide protocols for use
   within clusters, and then extend them to run between clusters too.
   Network partitioning can be tolerated at all levels of the hierarchy
   in the sense that no incorrect actions can result after network par-
   titioning, although our approach will sometimes block until the par-
   tition is repaired.  Our protocols also tend to block within a clus-
   ter while the list of operational sites for that cluster is being
   changed.  In normal LAN's, this happens infrequently (during site
   failure or recovery), and would not pose a problem.  (In failure
   intensive applications, alternative protocols might be needed to
   address this issue).

   The lowest level of our software uses a site-to-site acknowledgement
   protocol to convert the unreliable packet transport this into a
   sequenced, error-free message abstraction, using timeouts to detect
   apparent failures.  TCP can also be used for this purpose, provided
   that a "filter" is placed on the incoming message stream and certain
   types of messages are handled specially.  An agreement protocol is
   then used to order the site-failures and recoveries consistently.  If
   timeouts cause a failure to be detected erroneously, the protocol
   forces the affected site to undergo recovery.

   Built on this is a layer that supports the primitives themselves.
   CBCAST has a very light-weight implementation, based on the idea of
   flooding the system with copies of a message: Each process buffers



Birman & Joseph                                                [Page 13]

RFC 992                                                    November 1986


   copies of any messages needed to ensure the consistency of its view
   of the system.  If message m is delivered to process p, and m is
   potentially causally dependent on a message m prime, then a copy of m
   prime is sent to p as well (duplicates are discarded).  A garbage
   collector deletes superfluous copies after a message has reached all
   its destinations.  By using extensive piggybacking and a simple
   scheduling algorithm to control message transmission, the cost of a
   CBCAST is kept low -- often, less than one packet per destination.
   ABCAST employs a two-phase protocol based on one suggested to us by
   Skeen [Skeen-b].  This protocol has higher latency than CBCAST
   because delivery can only occur during the second phase; ABCAST is
   thus inherently synchronous.  In ISIS, however, ABCAST is used
   rarely; we believe that this would be the case in other systems as
   well.  GBCAST is implemented using a two-phase protocol similar to
   the one for ABCAST, but with an additional mechanism that flushes
   messages from a failed process before delivering the GBCAST announc-
   ing the failure.  Although GBCAST is slower than ABCAST or CBCAST, it
   is used rarely enough so that performance is probably less of an
   issue here -- and in any case, even GBCAST could be tuned to give
   very high throughput.  Preliminary performance figures appear in
   [Birman-b].

   Although satisfactory performance should be possible using an imple-
   mentation that sits on top of a conventional Internet mechanism, it
   should be noted that to achieve really high rates of communication
   the layers of software described above must reside in the kernel,
   because they run on behalf of large numbers of clients, run fre-
   quently, and tend to execute for very brief periods before doing I/O
   and pausing.  A non-kernel implementation will thus incur high
   scheduling and context switching overhead.  Additionally, it is not
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -