📄 rfc992.txt
字号:
primitive is needed for the purpose. Additionally, there may be situations in which ABCAST protocols that also satisfy a CBCAST ordering property would be valuable. Our ABCAST primitive could be changed to respect such a rule, and we made use of a multicast primitive that is simultaneously causal and atomic in our work on consistent shared bulletin boards ([Birman-c]). For simplicity, the presentation here assumes that ABCAST is completely orthogo- nal to CBCAST, but a simple way to build an efficient "causal atomic" multicast is described in our full-length paper. The cost of this protocol is only slightly higher than that of ABCAST. 6.5 Group addressing protocol Since group membership can change dynamically, it may be diffi- cult for a process to compute a list of destinations to which a message should be sent, for example, as is needed to perform a GBCAST. In [Birman-b] we report on a protocol for ensuring that a given multicast will be delivered to all members of a process group in the same view. This view is either the view that was operative when the message transmission was initiated, or a view that was defined subsequently. The algorithm is a simple itera- tive one that costs nothing unless the group membership changes, and permits the caching of possibly inaccurate membership infor- mation near processes that might want to communicate with a group. Using the protocol, a flexible message addressing scheme can readily be supported. Iterative addressing is only required when the process transmit- ting a message has an inaccurate copy of the process group view. In the implementation we are now building, this would rarely be the case, and iteration is never needed if the view is known to be accurate. Thus, iterated delivery should be very infrequent. 6.6 Synchronous versus asynchronous multicast abstractions Many systems employ RPC internally, as a lowest level primitive for interaction between processes. It should be evident that all of our multicast primitives can be used to implement replicated remote procedure calls [Cooper]: the caller would simply pause until replies have been received from all the participants (observation of a failure constitutes a reply in this case). We term such a use of the primitives synchronous, to distinguish it from from an asynchronous multicast in which no replies, or just one reply, suffices. In our work on ISIS, GBCAST and ABCAST are normally invoked syn- chronously, to implement a remote procedure call by one member of an object on all the members of its process group. However, CBCAST, which is the most frequently used overall, is almost never invoked synchronously. Asynchronous CBCAST's are theBirman & Joseph [Page 10]RFC 992 November 1986 primary source of concurrency in ISIS: although the delivery ord- ering is assured, transmission can be delayed to enable a message to be piggybacked on another, or to schedule IO within the system as a whole. While the system cannot defer an asynchronous multi- cast indefinitely, the ability to defer it a little, without delaying some computation by doing so, permits load to be smoothed. Since CBCAST respects the delivery orderings on which a computation might depend, and is ordered with respect to failures, the concurrency introduced does not complicate higher level algorithms. Moreover, the protocol itself is extremely cheap. A problem is introduced by our decision to allow asynchronous multicasts: the atomic reception property must now be extended to address causally related sequences of asynchronous messages. If a failure were to result in some multicasts being delivered to all their destinations but others that precede them not being delivered anywhere, inconsistency might result even if the desti- nations do not overlap. We therefore extend the atomicity pro- perty as follows. If process t receives a message m from process s, and s subsequently fails, then unless t fails as well, all messages m' that s received prior to its failure must be delivered to their remaining operational destinations. This is because the state of t may now depend on the contents of any such m', hence the system state could become inconsistent if the delivery of m' were not completed. The costs of the protocols are not affected by this change. A second problem arises when the user-level implications of this atomicity rule are considered. In the event of a failure, any suffix of a sequence of aysnchronous multicasts could be lost and the system state would still be internally consistent. A process that is about to take some action that may leave an externally visible side-effect will need a way to pause until it is guaranteed that such multicasts have actually been delivered. For this purpose, a flush primitive is provided. Occasional calls to flush do not eliminate the benefit of using CBCAST asyn- chronously. Unless the system has built up a considerable back- log of undelivered multicast messages, which should be rare, flush will only pause while transmission of the last few multi- casts complete.7. Using the primitives The reliable communication primitives described above lead to simple solutions for the problems cited in Sec. 4: [1] Synchronization. Many synchronization problems are subsumed into the primitives themselves. For example, consider the use of GBCAST to implement recovery. A recovering process would issue a GBCAST to the process group members, requesting that stateBirman & Joseph [Page 11]RFC 992 November 1986 information be transferred to it. In addition to sending the current state of the group to the recovering process, group members update the process group view at this time. Subsequent messages to the group will be delivered to the recovered process, with all necessary synchronization being provided by the ordering properties of GBCAST. In situations where other forms of syn- chronization are needed, ABCAST provides a simple way to ensure that several processes take actions in the same order, and this form of low-level synchronization simplifies a number of higher- level synchronization problems. For example, if ABCAST is used to do P() and V() operations on a distributed semaphore, the order of operations on the semaphore is set by the ABCAST, hence all the managers of the semaphore see these operations in a fixed order. [2] Failure detection. Consistent failure (and recovery) detec- tion are trivial using our primitives: a process simply waits for the appropriate process group view to change. This facilitates the implementation of algorithms in which one processes monitors the status of another process. A process that acts on the basis of a process group view change does so with the assurance that other group members will (eventually) observe the same event and will take consistent actions. [3] Consistency. We believe that consistency is generally expressible as a set of atomicity and ordering constraints on message delivery, particularly causal ones of the sort provided by CBCAST. Our primitives permit a process to specify the com- munication properties needed to achieve a desired form of con- sistency. Continued research will be needed to understand pre- cisely how to pick the weakest primitive in a designated situa- tion. [4] Serializability. To achieve serializability, one implements a concurrency control algorithm and then forces computations to respect the serialization order that this algorithm choses. The ABCAST primitive, as observed above, is a powerful tool for establishing an order between concurrent events, e.g. by lock acquisition. Having established such an order, CBCAST can be used to distribute information about the computation and also its termination (commit or abort). Any process that observes the commit or abort of a computation will only be able to interact with data managers that have received messages preceding the com- mit or abort, hence a highly asynchronous transactional execution results. If a process running a computation fails, this is detected when a failure GBCAST is received instead of the commit. Thus, executions are simple and quite deterministic. If commit is conditional, CBCAST would be used to first interro- gate participants to learn if they are prepared to commit, and then to transmit the commit or abort decision (the usual two-Birman & Joseph [Page 12]RFC 992 November 1986 phase commit). On the other hand, conditional commits can often be avoided using our approach. A method for building transac- tions that will roll-forward after failure after failure is dis- cussed in more detail in [Birman-a] [Joseph-a] [Joseph-b]. Other forms of concurrency control, such as timestamp generation, can similarly be implemented using ABCAST and CBCAST. We view tran- sactional data storage as an application-level concern, which can be handled using a version stack approach or a multi-version store, or any other appropriate mechanism.8. Implementation The communication primitives can be built in layers, starting with a bare network providing unreliable Internet datagrams. The software structure is, however, less mature and more complex than the one sug- gested in RFC's 966 and 988. For example, at this stage of our research we do not understand how to optimize our protocols to the same extent as for the unreliable host multicast approach described in those RFC's. Thus, the implementation we describe here should be understood to be a prototype. A particularly intriguing question, which we are investigating actively, concerns the use of a "best effort" ethernet or Internet multicast as a tool to optimize the implementation of our protocols. Our basic approach is to view large area networks as a set of clus- ters of sites interconnected by high speed LAN devices and intercon- nected by slower long-haul links. We first provide protocols for use within clusters, and then extend them to run between clusters too. Network partitioning can be tolerated at all levels of the hierarchy in the sense that no incorrect actions can result after network par- titioning, although our approach will sometimes block until the par- tition is repaired. Our protocols also tend to block within a clus- ter while the list of operational sites for that cluster is being changed. In normal LAN's, this happens infrequently (during site failure or recovery), and would not pose a problem. (In failure intensive applications, alternative protocols might be needed to address this issue). The lowest level of our software uses a site-to-site acknowledgement protocol to convert the unreliable packet transport this into a sequenced, error-free message abstraction, using timeouts to detect apparent failures. TCP can also be used for this purpose, provided that a "filter" is placed on the incoming message stream and certain types of messages are handled specially. An agreement protocol is then used to order the site-failures and recoveries consistently. If timeouts cause a failure to be detected erroneously, the protocol forces the affected site to undergo recovery. Built on this is a layer that supports the primitives themselves. CBCAST has a very light-weight implementation, based on the idea of flooding the system with copies of a message: Each process buffersBirman & Joseph [Page 13]RFC 992 November 1986 copies of any messages needed to ensure the consistency of its view of the system. If message m is delivered to process p, and m is potentially causally dependent on a message m prime, then a copy of m prime is sent to p as well (duplicates are discarded). A garbage collector deletes superfluous copies after a message has reached all its destinations. By using extensive piggybacking and a simple scheduling algorithm to control message transmission, the cost of a CBCAST is kept low -- often, less than one packet per destination. ABCAST employs a two-phase protocol based on one suggested to us by Skeen [Skeen-b]. This protocol has higher latency than CBCAST because delivery can only occur during the second phase; ABCAST is thus inherently synchronous. In ISIS, however, ABCAST is used rarely; we believe that this would be the case in other systems as well. GBCAST is implemented using a two-phase protocol similar to the one for ABCAST, but with an additional mechanism that flushes messages from a failed process before delivering the GBCAST announc- ing the failure. Although GBCAST is slower than ABCAST or CBCAST, it is used rarely enough so that performance is probably less of an issue here -- and in any case, even GBCAST could be tuned to give very high throughput. Preliminary performance figures appear in [Birman-b]. Although satisfactory performance should be possible using an imple- mentation that sits on top of a conventional Internet mechanism, it should be noted that to achieve really high rates of communication the layers of software described above must reside in the kernel, because they run on behalf of large numbers of clients, run fre- quently, and tend to execute for very brief periods before doing I/O and pausing. A non-kernel implementation will thus incur high scheduling and context switching overhead. Additionally, it is not
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -