
Source: http://www.cs.cornell.edu/info/projects/cam/hoti-94.html

The ATM network communication layer most directly comparable to SSAM is the remote memory access model proposed by Thekkath et al. <A HREF="hoti-94.html#REF15859">[10</A>,<A HREF="hoti-94.html#REF28211">11]</A>. The implementation is very similar to SSAM in that it uses traps for reserved opcodes in the MIPS instruction set to implement remote read and write instructions.<A HREF="hoti-94.html#FN10">(10)</A>
<p>The major difference between the two models is that the remote memory operations separate data and control transfer while Active Messages unifies them. With remote memory accesses, data can be transferred to user memory by the kernel without the corresponding process having to run. But the model used does not allow remote reads and writes to the full address space of a process. Rather, each communicating process must allocate special communication memory segments which are pinned by the operating system, just as the buffers used by SSAM are. The communication segments are more flexible than SSAM's buffers in that they can directly hold data structures (limited by the fact that the segments are pinned).
<p>The advantage of SSAM over remote memory accesses is the coupling of data and control: each message causes a small amount of user code to be executed, which allows data to be scattered into complex data structures and the scheduling of computation to be directly influenced by the arrival of data.
In the remote memory access model a limited control transfer is offered through per-segment notification flags in order to cause a file descriptor to become ready.
<p>Finally, SSAM provides a reliable transport mechanism while the remote memory access primitives are unreliable and do not provide flow control.
<p><A HREF="hoti-94.html#REF17269">Table 4</A> compares the performance of the two approaches: Thekkath's implementation uses two DECstation 5000s interconnected by a Turbochannel version of the same Fore-100 ATM interface used for SSAM, and performs a little worse than SSAM for data transfer and significantly worse for control transfer. The remote reads and writes are directly comparable in that they transfer the same payload per cell.
<p><pre><A NAME="REF17269">Table 4: Comparison of SSAM to Remote Memory Accesses between 2 DECstation 5000s over </A>ATM <A HREF="hoti-94.html#REF28211">[11]</A>.
------------------------------------
<B>Operation</B>        <B>SSAM</B>     <B>Remote</B>
                          <B>mem access</B>
------------------------------------
read latency     32us     45us
write latency    22us     30us
addt'l control   none     260us
transfer ovhd
block write      5.5MB/s  4.4MB/s
------------------------------------
</pre>The performance of more traditional communication layers over an ATM network has been evaluated by Lin et al. <A HREF="hoti-94.html#REF34727">[7]</A> and shows over two orders of magnitude higher communication latencies than SSAM offers. <A HREF="hoti-94.html#REF22605">Table 5</A> summarizes the best round-trip latencies and one-way bandwidths attained on Sun 4/690s and SPARCstation 2s connected by Fore SBA-100 interfaces without a switch.
The millisecond scale reflects the costs of the traditional networking architecture used by these layers, although it is not clear why Fore's AAL/5 API is slower than the read/write system call interface described in Section <A HREF="hoti-94.html#REF21197">3.4.2</A>. Note that a TCP/IP implementation with a well-optimized fast path should yield sub-millisecond latencies.
<p><pre><A NAME="REF22605">Table 5: Performance of traditional communication layers on Sun 4/690s and SPARCstation 2s </A>over ATM <A HREF="hoti-94.html#REF34727">[7]</A>.
-------------------------------------------
<B>Communication layer</B>  <B>Round-trip</B>  <B>Peak</B>
                     <B>latency</B>     <B>bandwidth</B>
-------------------------------------------
Fore AAL/5 API       1.7ms        4MB/s
BSD TCP/IP Sockets   3.9ms        2MB/s
PVM over TCP/IP      5.4ms        1.5MB/s
Sun RPC              3.9ms        1.6MB/s
-------------------------------------------
</pre><H2><A NAME="HDR22">5  Conclusions</A></H2>The emergence of high-bandwidth, low-latency networks is making the use of clusters of workstations attractive for parallel-computing applications. From a technical point of view a continuous spectrum of systems can be conceived, ranging from collections of Ethernet-based workstations to tightly integrated custom multiprocessors.
However, this paper argues that clusters will be characterized by the use of off-the-shelf components, which will handicap them with respect to multiprocessors, in which hardware and software are customized to allow a tighter integration of the network into the overall architecture.
<p>The use of standard components, and in particular of ATM networking technology, results in three major disadvantages of clusters with respect to multiprocessors: (i) ATM networks do not offer reliable delivery or flow control, (ii) current network interfaces are not well integrated into the workstation architecture, and (iii) the operating systems on the nodes of a cluster do not coordinate process scheduling or address translations.
<p>The prototype implementation of the Active Messages communication model described in this paper achieves two orders of magnitude better performance than traditional networking layers. <A HREF="hoti-94.html#REF22813">Table 6</A> shows that the resulting communication latencies and bandwidths are in the same ballpark as those of state-of-the-art multiprocessors. Key to this success are the use of large memory buffers and the careful design of a lean user-kernel interface. The major obstacle to closing the remaining performance gap is the slow access to the network interface across the I/O bus, and reducing the buffer memory usage requires coordination of process scheduling across nodes.
While taking care of flow control in software does not dominate performance in this study, the behavior of ATM networks under parallel computing communication loads remains an open question.
<p><pre><A NAME="REF22813">Table 6: Comparison of SSAM's performance with that of recent parallel machines.</A>
----------------------------------------
<B>Machine</B>                 <B>Peak</B>       <B>Round-trip</B>
                        <B>bandwidth</B>  <B>latency</B>
----------------------------------------
SP-1 + MPL/p <A HREF="hoti-94.html#REF34793">[9]</A>        8.3MB/s    56us
Paragon + NX <A HREF="hoti-94.html#REF25308">[8]</A>        73MB/s     44us
CM-5 + Active Mesg <A HREF="hoti-94.html#REF35711">[4]</A>  10MB/s     12us
SS-20 cluster + SSAM    5.6MB/s    32us
----------------------------------------
</pre><H2><A NAME="HDR23">6  Bibliography</A></H2>
<A NAME="REF62152">[1] CCITT. Recommendation I.150: B-ISDN ATM Functional Characteristics (revised version). Geneva: ITU, 1992.</A><BR>
<A NAME="REF66827">[2] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Introduction to Split-C. In Proc. of Supercomputing '93.</A><BR>
<A NAME="REF55218">[3] D. E. Culler, A. Dusseau, R. Martin, and K. E. Schauser. Fast Parallel Sorting: from LogP to Split-C. In Proc. of WPPP '93, July 1993.</A><BR>
<A NAME="REF35711">[4] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In Proc. of the 19th ISCA, pages 256-266, May 1992.</A><BR>
<A NAME="REF32249">[5] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM 3.0 User's Guide and Reference Manual. Oak Ridge National Laboratory, Technical Report ORNL/TM-12187, February 1993.</A><BR>
<A NAME="REF48403">[6] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.</A><BR>
<A NAME="REF34727">[7] M. Lin, J. Hsieh, D. H. C. Du, J. P. Thomas, and J. A. MacDonald. Distributed Network Computing over Local ATM Networks. IEEE Journal on Selected Areas in Communications, Special Issue on ATM LANs, to appear, 1995.</A><BR>
<A NAME="REF25308">[8] P. Pierce and G. Regnier. The Paragon Implementation of the NX Message Passing Interface. In Proc. of SHPCC '94, May 1994.</A><BR>
<A NAME="REF34793">[9] C. B. Stunkel, D. G. Shea, D. G. Grice, P. H. Hochschild, and M. Tsao. The SP1 High-Performance Switch. In Proc. of SHPCC '94, May 1994.</A><BR>
<A NAME="REF15859">[10] C. A. Thekkath, H. M. Levy, and E. D. Lazowska. Efficient Support for Multicomputing on ATM Networks. University of Washington, Technical Report 93-04-03, April 1993.</A><BR>
<A NAME="REF28211">[11] C. A. Thekkath, H. M. Levy, and E. D. Lazowska. Separating Data and Control Transfer in Distributed Operating Systems. In Proc. of the 6th Int'l Conf. on ASPLOS, to appear, October 1994.</A><BR>
<A NAME="REF52517">[12] Thinking Machines Corporation, Cambridge, Massachusetts. <I>Connection Machine CM-5, Technical Summary</I>, November 1992.</A><BR>
<HR><h3>Footnotes</h3>
<DL COMPACT>
<DT><A NAME=FN1>(1)</A><DD>The term cluster is used here to refer to collections of workstation-class machines interconnected by a low-latency, high-bandwidth network.
<DT><A NAME=FN2>(2)</A><DD>This paper focuses exclusively on scalable multiprocessor architectures and specifically excludes bus-based shared-memory multiprocessors.
<DT><A NAME=FN3>(3)</A><DD>Current ATM switches have latencies about an order of magnitude higher than comparable multiprocessor networks; however, this difference does not seem to be inherent in ATM networks, at least not for local-area switches.
<DT><A NAME=FN4>(4)</A><DD>A discussion of differences in fault isolation characteristics is beyond the scope of this paper.
<DT><A NAME=FN5>(5)</A><DD>Although some transmission media may cause burst errors which are beyond the correction capabilities of most CRC codes.
<DT><A NAME=FN6>(6)</A><DD>Cache-coherent shared memory stretches this characterization, given that the cache in the receiving node essentially performs another address translation which may miss and require additional communication with other nodes to complete the request.
<DT><A NAME=FN7>(7)</A><DD>All bandwidths are measured in megabytes per second.
<DT><A NAME=FN8>(8)</A><DD>The kernel write-protects the trap vectors after boot-up. The SSAM prototype uses a permanently loaded trap which performs an indirect jump via a kernel variable to allow simple dynamic driver loading.
<DT><A NAME=FN9>(9)</A><DD>Note that in a more realistic setting a Fore ASX-100 switch will add roughly 10us of latency to the write time and 20us to the round-trip read time [7].
<DT><A NAME=FN10>(10)</A><DD>One could easily describe the traps employed by SSAM as additional emulated communication instructions.
</DL>
