📄 rfc992.txt

📁 RFC 相关的技术文档
💻 TXT
📖 第 1 页 / 共 4 页
字号:
   If one rules out "unreliable" communication mechanisms, there are   basically two fault-tolerant alternatives that can be pursued.   The first approach is to provide mechanisms for transactional   interactions between processes that communicate using remote pro-   cedure calls [Birrell].  This has lead to work on nested transactions   (due to nested RPC's) [Moss], support for transactions at the   language level [Liskov], transactions within an operating systems   kernel [Spector] [Allchin] [Popek] [Lazowska], and transactional   access to higher-level replicated services, such as resilient objects   in ISIS or relations in database systems.  The primitives in a tran-   sactional system provide mechanisms for distributing the request that   initiates the transaction, accessing data (which may be replicated),   performing concurrency control, and implementing commit or abort.   Additional mechanisms are normally needed for orphan termination,   deadlock detection, etc.  The issue then arises of how these mechan-   isms should themselves be implemented.   Our work in ISIS leads us to believe that whereas transactions are   easily implemented on top of fault-tolerant process groups -- we have   done so -- the converse is much harder.  Moreover, transactionsBirman & Joseph                                                 [Page 5]RFC 992                                                    November 1986   represent a relatively heavy-weight solution to the problems surveyed   in the previous section, and might impose an unacceptable overhead on   subsystems that need to run non-transactionally, for example because   a pair of concurrent processes needs to interact on a frequent basis.   (We are not saying that "transactional" mechanisms such as cobegins   and toplevel actions can't solve this problem, but just that they   yield a solution that is awkward and costly).  This sort of reasoning   has lead us to focus on non-transactional interaction mechanisms, and   to treat transactions as a special class of mechanisms used when   processes that have been designed to employ a transactional protocol   interact.   The second approach involves the provision of a communication primi-   tive, such as atomic broadcast, which can be used as the framework on   which higher level algorithms are designed.  Such a primitive seeks   to deliver messages reliably to some set of destinations, despite the   possibility that failures might occur during the execution of the   protocol.  Above, we termed this the fault tolerant process group   approach, since it lends itself to the organization of cooperating   processes into groups, as described in the introduction.  Process   groups are an extremely flexible abstraction, and have been employed   in the V Kernel [Cheriton] and in UNIX, and more recently in the ISIS   system.  A proposal to provide Internet support for host groups was   raised in RFC's 966 and 988.  However, the idea of adapting the pro-   cess group approach to work reliably in an environment subject to the   sorts of exception events and concurrency cited in the previous sec-   tion seems to be new.   As noted earlier, existing reliable communication protocols do not   address the requirements of fault-tolerant process groups.  For exam-   ple, in [Schneider], an implementation of a reliable multicast primi-   tive is described.  Such a primitive ensures that a designated mes-   sage will be transmitted from one site to all other operational sites   in a system; if a failure occurs but any site has received the mes-   sage, all will eventually do so.  [Chang] and [Cristian] describe   implementations for atomic broadcast, which is a reliable broadcast   (sent to all sites in a system) with the additional property that   messages are delivered in the same order at all overlapping destina-   tions, and this order preserves the transmission order if messages   originate in a single site.   Atomic broadcast is a powerful abstraction, and essentially the same   behavior is provided by one of the multicast primitives we discuss in   the next section.  However, it has several drawbacks which made us   hesitant to adopt it as the only primitive in the system.  Most seri-   ous is the latency that is incurred in order to satisfy the delivery   ordering property.  Without delving deeply into the implementations,   which are based on a token scheme in [Chang] and an acknowledgement   protocol in [Schneider], we observe that the delaying of certain mes-   sages is fundamental to the establishment of a unique global delivery   ordering; indeed, it is easy to prove on knowledge theoretic groundsBirman & Joseph                                                 [Page 6]RFC 992                                                    November 1986   that this must always be the case.  In [Chang] a primary goal is to   minimize the number of messages sent, and the protocol given performs   extremely well in this regard.  However, a delay occurs while waiting   for tokens to arrive and the delivery latency that results may be   high.  [Cristian] assumes that clocks are closely synchronized and   that message transit times are bounded by well-known constants, and   uses this to derive atomic broadcast protocols tolerant of increas-   ingly severe classes of failures.  The protocols explicitly delay   delivery to achieve the desired global ordering on multicasts.  For   reasons discussed below, this tends to result in high latency in typ-   ical local networking environments.  An additional drawback of the   atomic broadcast protocols is that no mechanism is provided for   ensuring that all processes observe the same sequence of failures and   recoveries, or for ensuring that failures and recoveries are ordered   relative to ongoing multicasts.  Since this problem arises in any   setting where one process monitors another, we felt it should be   addressed at the same level as the communication protocol.  Finally,   one wants a group oriented multicast protocol, not a site oriented   broadcast, and this issue must be resolved too.6. Our multicast primitives   We now describe three multicast protocols - GBCAST, ABCAST, and   CBCAST - for transmitting a message reliably from a sender process to   some set of destination processes.  Details of the protocols and   their correctness proofs can be found in [Birman-b].  The protocols   ensure "all or nothing" behavior: if any destination receives a mes-   sage, then unless it fails, all destinations will receive it.  Group   addressing is discussed in Sec. 6.5.   The failure model that one adopts has a considerable impact on the   structure of the resulting system.  We adopted the model of fail-stop   processors [Schneider]: when failures occur, a processor simply stops   (crashes), as do all the processes executing on it.  We also assume   that individual processes can crash, and that this is detected when   it occurs by a monitoring mechanism present at each site.  Further   assumptions are sometimes made about the availability of synchronized   realtime clocks.  Here, we adopt the position that although reason-   ably accurate elapsed-time clocks may be available, closely synchron-   ized clocks probably will not be.  For example, the 60Hz "line"   clocks commonly used on current workstations are only accurate to   16ms.  On the other hand, 4-8ms inter-site message transit times are   common and 1-2ms are reported increasingly often.  Thus, it is impos-   sible to synchronize clocks to better than 32-48ms, enough time for a   pair of sites to exchange between 4 and 50 messages.  Even with   advancing technology, it seems safe to assume that clock skew will   remain "large" when compared to inter-site message transmission   speed.  In particular, this argues against time-based protocols such   as the one used in [Cristian]Birman & Joseph                                                 [Page 7]RFC 992                                                    November 1986   6.1 The GBCAST primitive       GBCAST (group multicast) is the most constrained, and costly, of       the three primitives.  It is used to transmit information about       failures and recoveries to members of a process group.  A recov-       ering member uses GBCAST to inform the operational ones that it       has become available.  Additionally, when a member fails, the       system arranges for a GBCAST to be issued to group members on its       behalf, informing them of its failure.  Arguments to GBCAST are a       message and a process group identifier, which is translated into       a set of destinations as described below (Sec. 6.5).       Our GBCAST protocol ensures that if any process receives a multi-       cast B before receiving a GBCAST G, then all overlapping destina-       tions will receive B before G <1> This is true regardless of the       type of multicast involved.  Moreover, when a failure occurs, the       corresponding GBCAST message is delivered after any other multi-       casts from the failed process.  Each member can therefore main-       tain a VIEW listing the membership of the process group, updating       it when a GBCAST is received.  Although VIEW's are not updated       simultaneously in real time, all members observe the same       sequence of VIEW changes.  Since, GBCAST's are ordered relative       to all other multicasts, all members receiving a given multicast       will have the same value of VIEW when they receive it.       Notice that GBCAST also provides a convenient way to change other       global properties of a group "atomically".  In our work, we have       used GBCAST to dynamically change a ranking on the members of a       group, to request that group members establish checkpoints for       use if recovery is needed after all failure, and to implement       process migration.  In each case, the ordering of GBCAST relative       to other events that makes it possible to perform the desired       action without running any additional protocol.  Other uses for       GBCAST will no doubt emerge as our research continues.       Members of a process group can also use the value of VIEW to pick       a strategy for processing an incoming request, or to react to       failure or recovery without having to run any special protocol       first.  Since the GBCAST ordering is the same everywhere, their       actions will all be consistent.  Notice that when all the members       of a process group may have failed, GBCAST also provides an inex-       pensive way to determine the last site that failed: process group       members simply log each value of VIEW that becomes defined on       stable storage before using it; a simplified version of the algo-       rithm in [Skeen-a] can then be executed when recovering from       failure.Birman & Joseph                                                 [Page 8]RFC 992                                                    November 1986   6.2 The ABCAST primitive       The GBCAST primitive is too costly to be used for general commun-       ication between process group members.  This motivates the intro-       duction of weaker (less ordered) primitives, which might be used       in situations where a total order on multicast messages is not       necessary.  Our second primitive, ABCAST (atomic multicast),       satisfies such a weaker constraint.  Specifically, it is often       desired that if two multicasts are received in some order at a       common destination site, they be received in that order at all       other common destinations, even if this order was not predeter-       mined.  For example, if a process group is being used to maintain       a replicated queue and ABCAST is used to transmit queue opera-       tions to all copies, the operations will be done in the same       order everywhere, hence the copies of the queue will remain mutu-       ally consistent.  The primitive ABCAST(msg, label, dests) pro-       vides this behavior.  Two ABCAST's having the same label are       delivered in the same order at all common destinations.   6.3 The CBCAST primitive       Our third primitive, CBCAST (causal multicast), is weakest in the       sense that it involves less distributed synchronization then       GBCAST or ABCAST.  CBCAST(msg, dests) atomically delivers msg to       each operational dest.  The CBCAST protocol ensures that if two       multicasts are potentially causally dependent on another, then       the former is delivered after the latter at all overlapping des-       tinations.  A multicast B' is potentially causally dependent on a       multicast B if both multicasts originate from the same process,       and B' is sent after B, or if there exists a chain of message       transmissions and receptions or local events by which knowledge       could have been transferred from the process that issued B to the       process that issued B' [Lamport].  For causally independent mul-       ticasts, the delivery ordering is not constrained.       CBCAST is valuable in systems like ISIS, where concurrency con-       trol algorithms are used to synchronize concurrent computations.       In these systems, if two processes communicate concurrently with       the same process the messages are almost always independent ones       that can be processed in any order: otherwise, concurrency con-       trol would have caused one to pause until the other was finished.       On the other hand, order is clearly important within a causally       linked series of multicasts, and it is precisely this sort of       order that CBCAST respects.   6.4 Other multicast primitives       A weaker multicast primitive is reliable multicast, which pro-       vides all-or-nothing delivery, but no ordering properties.  The       formulation of CBCAST in [Birman-b] actually includes a mechanism       for performing multicasts of this sort, hence no specialBirman & Joseph                                                 [Page 9]RFC 992                                                    November 1986
💿 文件大小 3544 K
👤 上传用户 kzdai22
📂 所属分类文章/文档
🏷️ 相关标签

#RFC #文档
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -