📄 rfc992.txt
字号:
If one rules out "unreliable" communication mechanisms, there are basically two fault-tolerant alternatives that can be pursued. The first approach is to provide mechanisms for transactional interactions between processes that communicate using remote pro- cedure calls [Birrell]. This has lead to work on nested transactions (due to nested RPC's) [Moss], support for transactions at the language level [Liskov], transactions within an operating systems kernel [Spector] [Allchin] [Popek] [Lazowska], and transactional access to higher-level replicated services, such as resilient objects in ISIS or relations in database systems. The primitives in a tran- sactional system provide mechanisms for distributing the request that initiates the transaction, accessing data (which may be replicated), performing concurrency control, and implementing commit or abort. Additional mechanisms are normally needed for orphan termination, deadlock detection, etc. The issue then arises of how these mechan- isms should themselves be implemented. Our work in ISIS leads us to believe that whereas transactions are easily implemented on top of fault-tolerant process groups -- we have done so -- the converse is much harder. Moreover, transactionsBirman & Joseph [Page 5]RFC 992 November 1986 represent a relatively heavy-weight solution to the problems surveyed in the previous section, and might impose an unacceptable overhead on subsystems that need to run non-transactionally, for example because a pair of concurrent processes needs to interact on a frequent basis. (We are not saying that "transactional" mechanisms such as cobegins and toplevel actions can't solve this problem, but just that they yield a solution that is awkward and costly). This sort of reasoning has lead us to focus on non-transactional interaction mechanisms, and to treat transactions as a special class of mechanisms used when processes that have been designed to employ a transactional protocol interact. The second approach involves the provision of a communication primi- tive, such as atomic broadcast, which can be used as the framework on which higher level algorithms are designed. Such a primitive seeks to deliver messages reliably to some set of destinations, despite the possibility that failures might occur during the execution of the protocol. Above, we termed this the fault tolerant process group approach, since it lends itself to the organization of cooperating processes into groups, as described in the introduction. Process groups are an extremely flexible abstraction, and have been employed in the V Kernel [Cheriton] and in UNIX, and more recently in the ISIS system. A proposal to provide Internet support for host groups was raised in RFC's 966 and 988. However, the idea of adapting the pro- cess group approach to work reliably in an environment subject to the sorts of exception events and concurrency cited in the previous sec- tion seems to be new. As noted earlier, existing reliable communication protocols do not address the requirements of fault-tolerant process groups. For exam- ple, in [Schneider], an implementation of a reliable multicast primi- tive is described. Such a primitive ensures that a designated mes- sage will be transmitted from one site to all other operational sites in a system; if a failure occurs but any site has received the mes- sage, all will eventually do so. [Chang] and [Cristian] describe implementations for atomic broadcast, which is a reliable broadcast (sent to all sites in a system) with the additional property that messages are delivered in the same order at all overlapping destina- tions, and this order preserves the transmission order if messages originate in a single site. Atomic broadcast is a powerful abstraction, and essentially the same behavior is provided by one of the multicast primitives we discuss in the next section. However, it has several drawbacks which made us hesitant to adopt it as the only primitive in the system. Most seri- ous is the latency that is incurred in order to satisfy the delivery ordering property. Without delving deeply into the implementations, which are based on a token scheme in [Chang] and an acknowledgement protocol in [Schneider], we observe that the delaying of certain mes- sages is fundamental to the establishment of a unique global delivery ordering; indeed, it is easy to prove on knowledge theoretic groundsBirman & Joseph [Page 6]RFC 992 November 1986 that this must always be the case. In [Chang] a primary goal is to minimize the number of messages sent, and the protocol given performs extremely well in this regard. However, a delay occurs while waiting for tokens to arrive and the delivery latency that results may be high. [Cristian] assumes that clocks are closely synchronized and that message transit times are bounded by well-known constants, and uses this to derive atomic broadcast protocols tolerant of increas- ingly severe classes of failures. The protocols explicitly delay delivery to achieve the desired global ordering on multicasts. For reasons discussed below, this tends to result in high latency in typ- ical local networking environments. An additional drawback of the atomic broadcast protocols is that no mechanism is provided for ensuring that all processes observe the same sequence of failures and recoveries, or for ensuring that failures and recoveries are ordered relative to ongoing multicasts. Since this problem arises in any setting where one process monitors another, we felt it should be addressed at the same level as the communication protocol. Finally, one wants a group oriented multicast protocol, not a site oriented broadcast, and this issue must be resolved too.6. Our multicast primitives We now describe three multicast protocols - GBCAST, ABCAST, and CBCAST - for transmitting a message reliably from a sender process to some set of destination processes. Details of the protocols and their correctness proofs can be found in [Birman-b]. The protocols ensure "all or nothing" behavior: if any destination receives a mes- sage, then unless it fails, all destinations will receive it. Group addressing is discussed in Sec. 6.5. The failure model that one adopts has a considerable impact on the structure of the resulting system. We adopted the model of fail-stop processors [Schneider]: when failures occur, a processor simply stops (crashes), as do all the processes executing on it. We also assume that individual processes can crash, and that this is detected when it occurs by a monitoring mechanism present at each site. Further assumptions are sometimes made about the availability of synchronized realtime clocks. Here, we adopt the position that although reason- ably accurate elapsed-time clocks may be available, closely synchron- ized clocks probably will not be. For example, the 60Hz "line" clocks commonly used on current workstations are only accurate to 16ms. On the other hand, 4-8ms inter-site message transit times are common and 1-2ms are reported increasingly often. Thus, it is impos- sible to synchronize clocks to better than 32-48ms, enough time for a pair of sites to exchange between 4 and 50 messages. Even with advancing technology, it seems safe to assume that clock skew will remain "large" when compared to inter-site message transmission speed. In particular, this argues against time-based protocols such as the one used in [Cristian]Birman & Joseph [Page 7]RFC 992 November 1986 6.1 The GBCAST primitive GBCAST (group multicast) is the most constrained, and costly, of the three primitives. It is used to transmit information about failures and recoveries to members of a process group. A recov- ering member uses GBCAST to inform the operational ones that it has become available. Additionally, when a member fails, the system arranges for a GBCAST to be issued to group members on its behalf, informing them of its failure. Arguments to GBCAST are a message and a process group identifier, which is translated into a set of destinations as described below (Sec. 6.5). Our GBCAST protocol ensures that if any process receives a multi- cast B before receiving a GBCAST G, then all overlapping destina- tions will receive B before G <1> This is true regardless of the type of multicast involved. Moreover, when a failure occurs, the corresponding GBCAST message is delivered after any other multi- casts from the failed process. Each member can therefore main- tain a VIEW listing the membership of the process group, updating it when a GBCAST is received. Although VIEW's are not updated simultaneously in real time, all members observe the same sequence of VIEW changes. Since, GBCAST's are ordered relative to all other multicasts, all members receiving a given multicast will have the same value of VIEW when they receive it. Notice that GBCAST also provides a convenient way to change other global properties of a group "atomically". In our work, we have used GBCAST to dynamically change a ranking on the members of a group, to request that group members establish checkpoints for use if recovery is needed after all failure, and to implement process migration. In each case, the ordering of GBCAST relative to other events that makes it possible to perform the desired action without running any additional protocol. Other uses for GBCAST will no doubt emerge as our research continues. Members of a process group can also use the value of VIEW to pick a strategy for processing an incoming request, or to react to failure or recovery without having to run any special protocol first. Since the GBCAST ordering is the same everywhere, their actions will all be consistent. Notice that when all the members of a process group may have failed, GBCAST also provides an inex- pensive way to determine the last site that failed: process group members simply log each value of VIEW that becomes defined on stable storage before using it; a simplified version of the algo- rithm in [Skeen-a] can then be executed when recovering from failure.Birman & Joseph [Page 8]RFC 992 November 1986 6.2 The ABCAST primitive The GBCAST primitive is too costly to be used for general commun- ication between process group members. This motivates the intro- duction of weaker (less ordered) primitives, which might be used in situations where a total order on multicast messages is not necessary. Our second primitive, ABCAST (atomic multicast), satisfies such a weaker constraint. Specifically, it is often desired that if two multicasts are received in some order at a common destination site, they be received in that order at all other common destinations, even if this order was not predeter- mined. For example, if a process group is being used to maintain a replicated queue and ABCAST is used to transmit queue opera- tions to all copies, the operations will be done in the same order everywhere, hence the copies of the queue will remain mutu- ally consistent. The primitive ABCAST(msg, label, dests) pro- vides this behavior. Two ABCAST's having the same label are delivered in the same order at all common destinations. 6.3 The CBCAST primitive Our third primitive, CBCAST (causal multicast), is weakest in the sense that it involves less distributed synchronization then GBCAST or ABCAST. CBCAST(msg, dests) atomically delivers msg to each operational dest. The CBCAST protocol ensures that if two multicasts are potentially causally dependent on another, then the former is delivered after the latter at all overlapping des- tinations. A multicast B' is potentially causally dependent on a multicast B if both multicasts originate from the same process, and B' is sent after B, or if there exists a chain of message transmissions and receptions or local events by which knowledge could have been transferred from the process that issued B to the process that issued B' [Lamport]. For causally independent mul- ticasts, the delivery ordering is not constrained. CBCAST is valuable in systems like ISIS, where concurrency con- trol algorithms are used to synchronize concurrent computations. In these systems, if two processes communicate concurrently with the same process the messages are almost always independent ones that can be processed in any order: otherwise, concurrency con- trol would have caused one to pause until the other was finished. On the other hand, order is clearly important within a causally linked series of multicasts, and it is precisely this sort of order that CBCAST respects. 6.4 Other multicast primitives A weaker multicast primitive is reliable multicast, which pro- vides all-or-nothing delivery, but no ordering properties. The formulation of CBCAST in [Birman-b] actually includes a mechanism for performing multicasts of this sort, hence no specialBirman & Joseph [Page 9]RFC 992 November 1986
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -