📄 rfc992.txt

📁 RFC 相关的技术文档
💻 TXT
📖 第 1 页 / 共 4 页
字号:
       primitive is needed for the purpose.  Additionally, there may be       situations in which ABCAST protocols that also satisfy a CBCAST       ordering property would be valuable.  Our ABCAST primitive could       be changed to respect such a rule, and we made use of a multicast       primitive that is simultaneously causal and atomic in our work on       consistent shared bulletin boards ([Birman-c]).  For simplicity,       the presentation here assumes that ABCAST is completely orthogo-       nal to CBCAST, but a simple way to build an efficient "causal       atomic" multicast is described in our full-length paper.  The       cost of this protocol is only slightly higher than that of       ABCAST.   6.5 Group addressing protocol       Since group membership can change dynamically, it may be diffi-       cult for a process to compute a list of destinations to which a       message should be sent, for example, as is needed to perform a       GBCAST.  In [Birman-b] we report on a protocol for ensuring that       a given multicast will be delivered to all members of a process       group in the same view.  This view is either the view that was       operative when the message transmission was initiated, or a view       that was defined subsequently.  The algorithm is a simple itera-       tive one that costs nothing unless the group membership changes,       and permits the caching of possibly inaccurate membership infor-       mation near processes that might want to communicate with a       group.  Using the protocol, a flexible message addressing scheme       can readily be supported.       Iterative addressing is only required when the process transmit-       ting a message has an inaccurate copy of the process group view.       In the implementation we are now building, this would rarely be       the case, and iteration is never needed if the view is known to       be accurate.  Thus, iterated delivery should be very infrequent.   6.6 Synchronous versus asynchronous multicast abstractions       Many systems employ RPC internally, as a lowest level primitive       for interaction between processes.  It should be evident that all       of our multicast primitives can be used to implement replicated       remote procedure calls [Cooper]: the caller would simply pause       until replies have been received from all the participants       (observation of a failure constitutes a reply in this case).  We       term such a use of the primitives synchronous, to distinguish it       from from an asynchronous multicast in which no replies, or just       one reply, suffices.       In our work on ISIS, GBCAST and ABCAST are normally invoked syn-       chronously, to implement a remote procedure call by one member of       an object on all the members of its process group.  However,       CBCAST, which is the most frequently used overall, is almost       never invoked synchronously.  Asynchronous CBCAST's are theBirman & Joseph                                                [Page 10]RFC 992                                                    November 1986       primary source of concurrency in ISIS: although the delivery ord-       ering is assured, transmission can be delayed to enable a message       to be piggybacked on another, or to schedule IO within the system       as a whole.  While the system cannot defer an asynchronous multi-       cast indefinitely, the ability to defer it a little, without       delaying some computation by doing so, permits load to be       smoothed.  Since CBCAST respects the delivery orderings on which       a computation might depend, and is ordered with respect to       failures, the concurrency introduced does not complicate higher       level algorithms.  Moreover, the protocol itself is extremely       cheap.       A problem is introduced by our decision to allow asynchronous       multicasts: the atomic reception property must now be extended to       address causally related sequences of asynchronous messages.  If       a failure were to result in some multicasts being delivered to       all their destinations but others that precede them not being       delivered anywhere, inconsistency might result even if the desti-       nations do not overlap.  We therefore extend the atomicity pro-       perty as follows.  If process t receives a message m from process       s, and s subsequently fails, then unless t fails as well, all       messages m' that s received prior to its failure must be       delivered to their remaining operational destinations.  This is       because the state of t may now depend on the contents of any such       m', hence the system state could become inconsistent if the       delivery of m' were not completed.  The costs of the protocols       are not affected by this change.       A second problem arises when the user-level implications of this       atomicity rule are considered.  In the event of a failure, any       suffix of a sequence of aysnchronous multicasts could be lost and       the system state would still be internally consistent.  A process       that is about to take some action that may leave an externally       visible side-effect will need a way to pause until it is       guaranteed that such multicasts have actually been delivered.       For this purpose, a flush primitive is provided.  Occasional       calls to flush do not eliminate the benefit of using CBCAST asyn-       chronously.  Unless the system has built up a considerable back-       log of undelivered multicast messages, which should be rare,       flush will only pause while transmission of the last few multi-       casts complete.7. Using the primitives   The reliable communication primitives described above lead to simple   solutions for the problems cited in Sec. 4:       [1]  Synchronization.  Many synchronization problems are subsumed       into the primitives themselves.  For example, consider the use of       GBCAST to implement recovery.  A recovering process would issue a       GBCAST to the process group members, requesting that stateBirman & Joseph                                                [Page 11]RFC 992                                                    November 1986       information be transferred to it.  In addition to sending the       current state of the group to the recovering process, group       members update the process group view at this time.  Subsequent       messages to the group will be delivered to the recovered process,       with all necessary synchronization being provided by the ordering       properties of GBCAST.  In situations where other forms of syn-       chronization are needed, ABCAST provides a simple way to ensure       that several processes take actions in the same order, and this       form of low-level synchronization simplifies a number of higher-       level synchronization problems.  For example, if ABCAST is used       to do P() and V() operations on a distributed semaphore, the       order of operations on the semaphore is set by the ABCAST, hence       all the managers of the semaphore see these operations in a fixed       order.       [2]  Failure detection.  Consistent failure (and recovery) detec-       tion are trivial using our primitives: a process simply waits for       the appropriate process group view to change.  This facilitates       the implementation of algorithms in which one processes monitors       the status of another process.  A process that acts on the basis       of a process group view change does so with the assurance that       other group members will (eventually) observe the same event and       will take consistent actions.       [3]  Consistency.  We believe that consistency is generally       expressible as a set of atomicity and ordering constraints on       message delivery, particularly causal ones of the sort provided       by CBCAST.  Our primitives permit a process to specify the com-       munication properties needed to achieve a desired form of con-       sistency.  Continued research will be needed to understand pre-       cisely how to pick the weakest primitive in a designated situa-       tion.       [4]  Serializability.  To achieve serializability, one implements       a concurrency control algorithm and then forces computations to       respect the serialization order that this algorithm choses.  The       ABCAST primitive, as observed above, is a powerful tool for       establishing an order between concurrent events, e.g. by lock       acquisition.  Having established such an order, CBCAST can be       used to distribute information about the computation and also its       termination (commit or abort).  Any process that observes the       commit or abort of a computation will only be able to interact       with data managers that have received messages preceding the com-       mit or abort, hence a highly asynchronous transactional execution       results.  If a process running a computation fails, this is       detected when a failure GBCAST is received instead of the commit.       Thus, executions are simple and quite deterministic.       If commit is conditional, CBCAST would be used to first interro-       gate participants to learn if they are prepared to commit, and       then to transmit the commit or abort decision (the usual two-Birman & Joseph                                                [Page 12]RFC 992                                                    November 1986       phase commit).  On the other hand, conditional commits can often       be avoided using our approach.  A method for building transac-       tions that will roll-forward after failure after failure is dis-       cussed in more detail in [Birman-a] [Joseph-a] [Joseph-b].  Other       forms of concurrency control, such as timestamp generation, can       similarly be implemented using ABCAST and CBCAST.  We view tran-       sactional data storage as an application-level concern, which can       be handled using a version stack approach or a multi-version       store, or any other appropriate mechanism.8. Implementation   The communication primitives can be built in layers, starting with a   bare network providing unreliable Internet datagrams.  The software   structure is, however, less mature and more complex than the one sug-   gested in RFC's 966 and 988.  For example, at this stage of our   research we do not understand how to optimize our protocols to the   same extent as for the unreliable host multicast approach described   in those RFC's.  Thus, the implementation we describe here should be   understood to be a prototype.  A particularly intriguing question,   which we are investigating actively, concerns the use of a "best   effort" ethernet or Internet multicast as a tool to optimize the   implementation of our protocols.   Our basic approach is to view large area networks as a set of clus-   ters of sites interconnected by high speed LAN devices and intercon-   nected by slower long-haul links.  We first provide protocols for use   within clusters, and then extend them to run between clusters too.   Network partitioning can be tolerated at all levels of the hierarchy   in the sense that no incorrect actions can result after network par-   titioning, although our approach will sometimes block until the par-   tition is repaired.  Our protocols also tend to block within a clus-   ter while the list of operational sites for that cluster is being   changed.  In normal LAN's, this happens infrequently (during site   failure or recovery), and would not pose a problem.  (In failure   intensive applications, alternative protocols might be needed to   address this issue).   The lowest level of our software uses a site-to-site acknowledgement   protocol to convert the unreliable packet transport this into a   sequenced, error-free message abstraction, using timeouts to detect   apparent failures.  TCP can also be used for this purpose, provided   that a "filter" is placed on the incoming message stream and certain   types of messages are handled specially.  An agreement protocol is   then used to order the site-failures and recoveries consistently.  If   timeouts cause a failure to be detected erroneously, the protocol   forces the affected site to undergo recovery.   Built on this is a layer that supports the primitives themselves.   CBCAST has a very light-weight implementation, based on the idea of   flooding the system with copies of a message: Each process buffersBirman & Joseph                                                [Page 13]RFC 992                                                    November 1986   copies of any messages needed to ensure the consistency of its view   of the system.  If message m is delivered to process p, and m is   potentially causally dependent on a message m prime, then a copy of m   prime is sent to p as well (duplicates are discarded).  A garbage   collector deletes superfluous copies after a message has reached all   its destinations.  By using extensive piggybacking and a simple   scheduling algorithm to control message transmission, the cost of a   CBCAST is kept low -- often, less than one packet per destination.   ABCAST employs a two-phase protocol based on one suggested to us by   Skeen [Skeen-b].  This protocol has higher latency than CBCAST   because delivery can only occur during the second phase; ABCAST is   thus inherently synchronous.  In ISIS, however, ABCAST is used   rarely; we believe that this would be the case in other systems as   well.  GBCAST is implemented using a two-phase protocol similar to   the one for ABCAST, but with an additional mechanism that flushes   messages from a failed process before delivering the GBCAST announc-   ing the failure.  Although GBCAST is slower than ABCAST or CBCAST, it   is used rarely enough so that performance is probably less of an   issue here -- and in any case, even GBCAST could be tuned to give   very high throughput.  Preliminary performance figures appear in   [Birman-b].   Although satisfactory performance should be possible using an imple-   mentation that sits on top of a conventional Internet mechanism, it   should be noted that to achieve really high rates of communication   the layers of software described above must reside in the kernel,   because they run on behalf of large numbers of clients, run fre-   quently, and tend to execute for very brief periods before doing I/O   and pausing.  A non-kernel implementation will thus incur high   scheduling and context switching overhead.  Additionally, it is not
💿 文件大小 3544 K
👤 上传用户 kzdai22
📂 所属分类文章/文档
🏷️ 相关标签

#RFC #文档
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -