📄 rfc992.txt

📁 RFC 的详细文档！
💻 TXT
📖 第 1 页 / 共 4 页
字号:

   If one rules out "unreliable" communication mechanisms, there are
   basically two fault-tolerant alternatives that can be pursued.

   The first approach is to provide mechanisms for transactional
   interactions between processes that communicate using remote pro-
   cedure calls [Birrell].  This has lead to work on nested transactions
   (due to nested RPC's) [Moss], support for transactions at the
   language level [Liskov], transactions within an operating systems
   kernel [Spector] [Allchin] [Popek] [Lazowska], and transactional
   access to higher-level replicated services, such as resilient objects
   in ISIS or relations in database systems.  The primitives in a tran-
   sactional system provide mechanisms for distributing the request that
   initiates the transaction, accessing data (which may be replicated),
   performing concurrency control, and implementing commit or abort.
   Additional mechanisms are normally needed for orphan termination,
   deadlock detection, etc.  The issue then arises of how these mechan-
   isms should themselves be implemented.

   Our work in ISIS leads us to believe that whereas transactions are
   easily implemented on top of fault-tolerant process groups -- we have
   done so -- the converse is much harder.  Moreover, transactions



Birman & Joseph                                                 [Page 5]

RFC 992                                                    November 1986


   represent a relatively heavy-weight solution to the problems surveyed
   in the previous section, and might impose an unacceptable overhead on
   subsystems that need to run non-transactionally, for example because
   a pair of concurrent processes needs to interact on a frequent basis.
   (We are not saying that "transactional" mechanisms such as cobegins
   and toplevel actions can't solve this problem, but just that they
   yield a solution that is awkward and costly).  This sort of reasoning
   has lead us to focus on non-transactional interaction mechanisms, and
   to treat transactions as a special class of mechanisms used when
   processes that have been designed to employ a transactional protocol
   interact.

   The second approach involves the provision of a communication primi-
   tive, such as atomic broadcast, which can be used as the framework on
   which higher level algorithms are designed.  Such a primitive seeks
   to deliver messages reliably to some set of destinations, despite the
   possibility that failures might occur during the execution of the
   protocol.  Above, we termed this the fault tolerant process group
   approach, since it lends itself to the organization of cooperating
   processes into groups, as described in the introduction.  Process
   groups are an extremely flexible abstraction, and have been employed
   in the V Kernel [Cheriton] and in UNIX, and more recently in the ISIS
   system.  A proposal to provide Internet support for host groups was
   raised in RFC's 966 and 988.  However, the idea of adapting the pro-
   cess group approach to work reliably in an environment subject to the
   sorts of exception events and concurrency cited in the previous sec-
   tion seems to be new.

   As noted earlier, existing reliable communication protocols do not
   address the requirements of fault-tolerant process groups.  For exam-
   ple, in [Schneider], an implementation of a reliable multicast primi-
   tive is described.  Such a primitive ensures that a designated mes-
   sage will be transmitted from one site to all other operational sites
   in a system; if a failure occurs but any site has received the mes-
   sage, all will eventually do so.  [Chang] and [Cristian] describe
   implementations for atomic broadcast, which is a reliable broadcast
   (sent to all sites in a system) with the additional property that
   messages are delivered in the same order at all overlapping destina-
   tions, and this order preserves the transmission order if messages
   originate in a single site.

   Atomic broadcast is a powerful abstraction, and essentially the same
   behavior is provided by one of the multicast primitives we discuss in
   the next section.  However, it has several drawbacks which made us
   hesitant to adopt it as the only primitive in the system.  Most seri-
   ous is the latency that is incurred in order to satisfy the delivery
   ordering property.  Without delving deeply into the implementations,
   which are based on a token scheme in [Chang] and an acknowledgement
   protocol in [Schneider], we observe that the delaying of certain mes-
   sages is fundamental to the establishment of a unique global delivery
   ordering; indeed, it is easy to prove on knowledge theoretic grounds



Birman & Joseph                                                 [Page 6]

RFC 992                                                    November 1986


   that this must always be the case.  In [Chang] a primary goal is to
   minimize the number of messages sent, and the protocol given performs
   extremely well in this regard.  However, a delay occurs while waiting
   for tokens to arrive and the delivery latency that results may be
   high.  [Cristian] assumes that clocks are closely synchronized and
   that message transit times are bounded by well-known constants, and
   uses this to derive atomic broadcast protocols tolerant of increas-
   ingly severe classes of failures.  The protocols explicitly delay
   delivery to achieve the desired global ordering on multicasts.  For
   reasons discussed below, this tends to result in high latency in typ-
   ical local networking environments.  An additional drawback of the
   atomic broadcast protocols is that no mechanism is provided for
   ensuring that all processes observe the same sequence of failures and
   recoveries, or for ensuring that failures and recoveries are ordered
   relative to ongoing multicasts.  Since this problem arises in any
   setting where one process monitors another, we felt it should be
   addressed at the same level as the communication protocol.  Finally,
   one wants a group oriented multicast protocol, not a site oriented
   broadcast, and this issue must be resolved too.

6. Our multicast primitives

   We now describe three multicast protocols - GBCAST, ABCAST, and
   CBCAST - for transmitting a message reliably from a sender process to
   some set of destination processes.  Details of the protocols and
   their correctness proofs can be found in [Birman-b].  The protocols
   ensure "all or nothing" behavior: if any destination receives a mes-
   sage, then unless it fails, all destinations will receive it.  Group
   addressing is discussed in Sec. 6.5.

   The failure model that one adopts has a considerable impact on the
   structure of the resulting system.  We adopted the model of fail-stop
   processors [Schneider]: when failures occur, a processor simply stops
   (crashes), as do all the processes executing on it.  We also assume
   that individual processes can crash, and that this is detected when
   it occurs by a monitoring mechanism present at each site.  Further
   assumptions are sometimes made about the availability of synchronized
   realtime clocks.  Here, we adopt the position that although reason-
   ably accurate elapsed-time clocks may be available, closely synchron-
   ized clocks probably will not be.  For example, the 60Hz "line"
   clocks commonly used on current workstations are only accurate to
   16ms.  On the other hand, 4-8ms inter-site message transit times are
   common and 1-2ms are reported increasingly often.  Thus, it is impos-
   sible to synchronize clocks to better than 32-48ms, enough time for a
   pair of sites to exchange between 4 and 50 messages.  Even with
   advancing technology, it seems safe to assume that clock skew will
   remain "large" when compared to inter-site message transmission
   speed.  In particular, this argues against time-based protocols such
   as the one used in [Cristian]





Birman & Joseph                                                 [Page 7]

RFC 992                                                    November 1986


   6.1 The GBCAST primitive

       GBCAST (group multicast) is the most constrained, and costly, of
       the three primitives.  It is used to transmit information about
       failures and recoveries to members of a process group.  A recov-
       ering member uses GBCAST to inform the operational ones that it
       has become available.  Additionally, when a member fails, the
       system arranges for a GBCAST to be issued to group members on its
       behalf, informing them of its failure.  Arguments to GBCAST are a
       message and a process group identifier, which is translated into
       a set of destinations as described below (Sec. 6.5).

       Our GBCAST protocol ensures that if any process receives a multi-
       cast B before receiving a GBCAST G, then all overlapping destina-
       tions will receive B before G <1> This is true regardless of the
       type of multicast involved.  Moreover, when a failure occurs, the
       corresponding GBCAST message is delivered after any other multi-
       casts from the failed process.  Each member can therefore main-
       tain a VIEW listing the membership of the process group, updating
       it when a GBCAST is received.  Although VIEW's are not updated
       simultaneously in real time, all members observe the same
       sequence of VIEW changes.  Since, GBCAST's are ordered relative
       to all other multicasts, all members receiving a given multicast
       will have the same value of VIEW when they receive it.

       Notice that GBCAST also provides a convenient way to change other
       global properties of a group "atomically".  In our work, we have
       used GBCAST to dynamically change a ranking on the members of a
       group, to request that group members establish checkpoints for
       use if recovery is needed after all failure, and to implement
       process migration.  In each case, the ordering of GBCAST relative
       to other events that makes it possible to perform the desired
       action without running any additional protocol.  Other uses for
       GBCAST will no doubt emerge as our research continues.

       Members of a process group can also use the value of VIEW to pick
       a strategy for processing an incoming request, or to react to
       failure or recovery without having to run any special protocol
       first.  Since the GBCAST ordering is the same everywhere, their
       actions will all be consistent.  Notice that when all the members
       of a process group may have failed, GBCAST also provides an inex-
       pensive way to determine the last site that failed: process group
       members simply log each value of VIEW that becomes defined on
       stable storage before using it; a simplified version of the algo-
       rithm in [Skeen-a] can then be executed when recovering from
       failure.








Birman & Joseph                                                 [Page 8]

RFC 992                                                    November 1986


   6.2 The ABCAST primitive

       The GBCAST primitive is too costly to be used for general commun-
       ication between process group members.  This motivates the intro-
       duction of weaker (less ordered) primitives, which might be used
       in situations where a total order on multicast messages is not
       necessary.  Our second primitive, ABCAST (atomic multicast),
       satisfies such a weaker constraint.  Specifically, it is often
       desired that if two multicasts are received in some order at a
       common destination site, they be received in that order at all
       other common destinations, even if this order was not predeter-
       mined.  For example, if a process group is being used to maintain
       a replicated queue and ABCAST is used to transmit queue opera-
       tions to all copies, the operations will be done in the same
       order everywhere, hence the copies of the queue will remain mutu-
       ally consistent.  The primitive ABCAST(msg, label, dests) pro-
       vides this behavior.  Two ABCAST's having the same label are
       delivered in the same order at all common destinations.

   6.3 The CBCAST primitive

       Our third primitive, CBCAST (causal multicast), is weakest in the
       sense that it involves less distributed synchronization then
       GBCAST or ABCAST.  CBCAST(msg, dests) atomically delivers msg to
       each operational dest.  The CBCAST protocol ensures that if two
       multicasts are potentially causally dependent on another, then
       the former is delivered after the latter at all overlapping des-
       tinations.  A multicast B' is potentially causally dependent on a
       multicast B if both multicasts originate from the same process,
       and B' is sent after B, or if there exists a chain of message
       transmissions and receptions or local events by which knowledge
       could have been transferred from the process that issued B to the
       process that issued B' [Lamport].  For causally independent mul-
       ticasts, the delivery ordering is not constrained.

       CBCAST is valuable in systems like ISIS, where concurrency con-
       trol algorithms are used to synchronize concurrent computations.
       In these systems, if two processes communicate concurrently with
       the same process the messages are almost always independent ones
       that can be processed in any order: otherwise, concurrency con-
       trol would have caused one to pause until the other was finished.
       On the other hand, order is clearly important within a causally
       linked series of multicasts, and it is precisely this sort of
       order that CBCAST respects.

   6.4 Other multicast primitives

       A weaker multicast primitive is reliable multicast, which pro-
       vides all-or-nothing delivery, but no ordering properties.  The
       formulation of CBCAST in [Birman-b] actually includes a mechanism
       for performing multicasts of this sort, hence no special



Birman & Joseph                                                 [Page 9]

RFC 992                                                    November 1986
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -