Beyond the problems caused by contention and the resulting retransmissions, the lack of a reliable delivery guarantee in ATM networks imposes a certain overhead on the communication primitives. Specifically, the sender must keep a copy of each cell sent until a corresponding acknowledgment is received, in case the cell must be retransmitted. This means that messages cannot be transferred directly between processor registers and the network interface (as is possible on the CM-5 <A HREF="hoti-94.html#REF52517">[12]</A>); rather, a memory copy must be made as well.<p>
<H3><A NAME="HDR4">2.2  User-level access to the network interface</A></H3>
Recently, multiprocessor communication architectures have achieved a significant reduction of the communication overhead by eliminating the operating system from the critical path. In order not to compromise security, the network interface must offer some form of protection mechanism. In shared memory models, the memory management unit is extended to map remote memory into the local virtual user address space such that the operating system can enforce security by managing the address translation tables. Message-based network interfaces contain a node address translation table which maps the user's virtual node numbers onto the physical node address space. Again, the operating system enforces security by controlling the address translation, thereby preventing a process from sending a message to an arbitrary node. The current generation of message-based network interfaces only controls the destination node address and therefore requires that all processes of a parallel program run at the same time. The next generation adds the sending process id to each message, allowing the receiving network interface to discriminate between messages destined for the currently running process, which can retrieve these messages directly, and messages for dormant processes, which must be queued (typically by the operating system) for later retrieval.<p>
In contrast, the network interfaces available for workstations do not yet incorporate any form of protection mechanism. Instead, the operating system must be involved in the sending and reception of every message. The connection-based nature of ATM networks would in principle allow the design of a protection mechanism to limit the virtual circuits a user process has access to (the operating system would still control virtual circuit set-up). But because the architecture of the networking layers in current operating systems does not seem to be set up to allow user-level network interface access, it appears unlikely that network interfaces with these features will become commonplace soon. The challenge in any high-performance communication layer for clusters is thus to minimize the path through the kernel by judiciously coordinating the user-kernel interactions.<p>
<H3><A NAME="HDR5">2.3  Coordination of system software across all communicating nodes</A></H3>
In almost all communication architectures the message reception logic is the critical performance bottleneck. In order to be able to handle incoming messages at full network bandwidth, the processing required for each arriving message must be minimized carefully.
The trick used in multiprocessor systems to ensure rapid message handling is to constrain the sender to send only messages which are easy to handle.<p>
In shared memory systems this is done by coordinating the address translation tables among all processing nodes such that the originating node can translate the virtual memory address of a remote access and directly place the corresponding physical memory address into the message. The set of communication primitives is small and fixed (e.g., read and write), and because the sender performs the complicated part of a remote memory access (namely the protection checks and the address translation), the handling of a request is relatively simple to implement<A HREF="hoti-94.html#FN6">(6)</A>. If the virtual address were sent, the receiving node could discover that the requested virtual memory location had been paged out to disk, with the result that the handling of the message would become rather involved.<p>
In Active Messages on multiprocessors, the scheduling of processes is assumed to be coordinated among all nodes such that communicating processes execute simultaneously on their respective nodes. This guarantees that messages can be handled immediately on arrival by the destination process itself. In order to accomplish this, the sender of an Active Message specifies a user-level handler at the destination whose role it is to extract the message from the network and integrate it into the ongoing computation. The handler can also implement a simple remote service and send a reply Active Message back. However, in order to prevent deadlock, communication patterns are limited to requests and replies, e.g., the handler of a reply message is not allowed to send any further messages. An implementation of Active Messages typically reserves the first word of each message for the handler address, and the handler at the receiving end is dispatched immediately on message arrival to dispose of the message. The fact that the message layer can call upon the handlers to deal with messages in FIFO order simplifies the buffering considerably over that required by more traditional message passing models such as PVM, MPI, or NX. These models allow processes to consume messages in arbitrary order and at arbitrary times, forcing the communication architecture to implement very general buffering and message-matching mechanisms at high cost.<p>
In clusters, the operating systems of the individual nodes are not nearly as well coordinated, which contradicts the assumption that messages can always be consumed quickly upon arrival. In the case of Active Messages the destination process might have been suspended and therefore cannot run the handler, and in a shared memory model the memory location requested might not be mapped. Although exact coordination is not possible without major changes to the operating system core, an implementation of either communication model is likely to be able to perform some coordination among nodes on its own and to influence the local operating system accordingly.
This may allow the communication layer to assume that in the common case everything works out fine, but it must be able to handle the difficult cases as well.<p>
<H3><A NAME="HDR6">2.4  Summary</A></H3>
Even though superficially a cluster of workstations appears to be technically comparable to a multiprocessor, the reality is that key characteristics are different and cause significant implementation difficulties: although the raw hardware link bandwidths, bisection bandwidths, and routing latencies are very comparable, clusters lack the flow control, reliability, user-level network access, and operating system coordination of multiprocessors.<p>
These shortcomings will inevitably result in lower communication performance; their quantitative effect on performance is evaluated in the next section, which presents a prototype implementation of Active Messages on a cluster of Sun workstations. However, the lack of flow control in ATM networks poses a fundamental problem: can catastrophic performance degradation occur due to significant cell loss in particular communication patterns?<p>
<H2><A NAME="HDR7">3  SSAM: a SPARCstation Active Messages Prototype</A></H2>
The SSAM prototype implements the critical parts of an Active Messages communication architecture on a cluster of SPARCstations connected by an ATM network. The primary goal is to evaluate whether it is possible to provide a parallel programming environment on the cluster that is comparable to those found on multiprocessors. The prototype is primarily concerned with providing performance on par with parallel machines, while addressing the handicaps of ATM networks identified in the previous section. In particular:<p>
<UL><LI>the prototype provides reliable communication to evaluate the cost of performing the necessary flow control and error checking in software,<BR><LI>it minimizes kernel intervention to determine the cost of providing protection in software, and<BR><LI>the buffering is designed to tolerate arbitrary context switching on the nodes.<BR></UL>
At this time only a limited experimental set-up (described below) is available, so the prototype can provide information neither on how cell losses due to contention within the network affect performance nor on how the scheduling of processes can be coordinated to improve the overall performance of parallel applications.<p>
<H3><A NAME="HDR8">3.1  Active Messages Communication Architecture</A></H3>
The Active Messages communication architecture <A HREF="hoti-94.html#REF35711">[4]</A> offers simple, general-purpose communication primitives as a thin veneer over the raw hardware. It is intended to serve as a substrate for building libraries that provide higher-level communication abstractions and for generating communication code directly from a parallel-language compiler. Unlike most communication layers, it is not intended for direct use by application programmers; rather, it provides lower-level services from which communication libraries and run-time systems can be built.<p>
The basic communication primitive is a message with an associated small amount of computation (in the form of a handler) at the receiving end. Typically the first word of an Active Message points to the handler for that message. On message arrival, the computation on the node is interrupted and the handler is executed. The role of the handler is to get the message out of the network, by integrating it into the ongoing computation and/or by sending a reply message back.
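The dispatch mechanism just described is simple enough to capture in a few lines of C. The fragment below is a minimal sketch of the idea that the first word of a message names its handler and that handling a message amounts to calling through that word; the names used (am_handler_t, am_dispatch, cells) are illustrative and are not part of the SSAM interface shown in Figure 1.<p>
<pre>
/* Sketch: the first word of an Active Message holds the handler address,
 * and handling the message is a call through that word.  All names are
 * illustrative only; this is not the SSAM interface of Figure 1.         */
typedef void (*am_handler_t)(void *msg);

/* dispatch every message accumulated in a receive area, in arrival order */
static void am_dispatch(void *cells[], int n_received)
{
    for (int i = 0; i < n_received; i++) {
        am_handler_t h = *(am_handler_t *)cells[i];  /* first word  */
        h(cells[i]);                                 /* run handler */
    }
}
</pre>
Because handlers run in arrival order, little more than a loop of this shape is needed on the receive side; any buffering or reordering beyond that is left to layers built on top of Active Messages.<p>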
The buffering and scheduling provided by Active Messages are extremely primitive and thereby fast: the only buffering is that involved in actual transport and the only scheduling is that required to activate the handler. This is sufficient to support many higher-level abstractions, and more general buffering and scheduling can easily be constructed in layers above Active Messages when needed. This minimalist approach avoids paying a performance penalty for unneeded functionality.<p>
In order to prevent deadlock and livelock, Active Messages restricts communication patterns to requests and replies, i.e., the handler of a request message is only allowed to send a reply message, and a reply handler is not allowed to send further replies.<p>
<H4><A NAME="HDR9">3.1.1  SSAM functionality</A></H4>
The current implementation is geared towards the sending of small messages which fit into the payload of a single ATM cell. Eight of the 48 available bytes of payload in an ATM cell are used by SSAM to hold flow-control information (16 bits), the handler address (32 bits), and an AAL3/4-compatible checksum (16 bits). The remaining 40 bytes hold the Active Message data.<p>
The C header file for the interface to SSAM is shown in <A HREF="hoti-94.html#REF17844">Figure 1</A>. To send a request Active Message, the user places the message data into a per-connection buffer provided by SSAM and calls SSAM_10 with a connection identifier and the remote handler address. SSAM_10 adds the flow-control information and traps to the kernel to have the message injected into the network. It also polls the receiver and processes incoming messages. At the receiving end, the network is polled by SSAM_10 or SSAM_poll (the latter only polls the network) and all messages accumulated in the receive FIFO are moved into a buffer. SSAM then calls the appropriate handler for each message, passing as arguments the originating connection identifier, the address of the buffer holding the message, and the address of a buffer for a reply message. The handler processes the message and may send a reply message back by placing the data in the buffer provided and returning the address of the reply handler (or NULL if no reply is to be sent).<p>
<A HREF="hoti-94.fig1.gif"><img src="hoti-94.fig1.gif"><B><A NAME="REF17844">Figure 1:  C interface for SPARCstation Active Messages</A></B><p></A><p>
The current prototype does not use interrupts; instead, the network is polled every time a message is sent. This means that as long as a process is sending messages it will also handle incoming ones. An explicit polling function is provided for program parts which do not communicate. Using interrupts is planned but not implemented yet.<p>
<H4><A NAME="HDR10">3.1.2  Example: implementing a remote read with SSAM</A></H4>
The sample implementation of a remote double-word read is shown in <A HREF="hoti-94.html#REF17083">Figure 2</A>. The readDouble function increments a counter of outstanding reads, formats a request Active Message with the address to be read as well as information for the reply, and sends the message. The readDouble_h handler fetches the remote location and sends a reply back to the readDouble_rh reply handler, which stores the data into memory and decrements the counter. The originating processor waits for the completion of the read by busy-waiting on the counter at the end of readDouble.
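Figure 2 gives the actual code; as a self-contained illustration of the calling conventions just described, the sketch below reconstructs the remote double-word read from the prose alone. The SSAM calls are replaced by a loop-back stub so the fragment compiles and runs on a single machine, and every declaration in it (handler_t, msg_buf_t, SSAM_10_stub, the message layouts) is an assumption rather than the interface of Figure 1.<p>
<pre>
#include <stdio.h>
#include <string.h>

/* handler convention described above: (connection, message buffer,
 * reply buffer) in, reply handler address (or NULL) out               */
typedef void *(*handler_t)(int conn, void *msg, void *reply);

/* 40-byte message payloads; a union keeps them suitably aligned       */
typedef union { char bytes[40]; double align; } msg_buf_t;
static msg_buf_t send_buf, recv_buf, reply_buf;

/* loop-back stand-in for SSAM_10: "deliver" the request to the handler
 * locally and, if it returns a reply handler, run that as well         */
static void SSAM_10_stub(int conn, handler_t h)
{
    handler_t rh;
    memcpy(&recv_buf, &send_buf, sizeof recv_buf);
    rh = (handler_t)h(conn, &recv_buf, &reply_buf);
    if (rh != NULL)
        rh(conn, &reply_buf, &send_buf);
}

static volatile int reads_outstanding;

struct read_req { double *src;  double *dest; };   /* request payload  */
struct read_rep { double *dest; double  val;  };   /* reply payload    */

/* reply handler: store the data and note the completed read */
static void *readDouble_rh(int conn, void *msg, void *reply)
{
    struct read_rep *m = msg;
    *m->dest = m->val;
    reads_outstanding--;
    return NULL;                          /* a reply handler never replies */
}

/* request handler: fetch the location and hand back a reply */
static void *readDouble_h(int conn, void *msg, void *reply)
{
    struct read_req *q = msg;
    struct read_rep *r = reply;
    r->dest = q->dest;                    /* echo the reply information    */
    r->val  = *q->src;                    /* fetch the remote double       */
    return (void *)readDouble_rh;         /* ask the layer to send a reply */
}

static void readDouble(int conn, double *src, double *dest)
{
    struct read_req *q = (struct read_req *)&send_buf;
    reads_outstanding++;
    q->src = src;
    q->dest = dest;
    SSAM_10_stub(conn, readDouble_h);
    while (reads_outstanding)             /* busy-wait on the counter;     */
        ;                                 /* SSAM_poll() would go here     */
}

int main(void)
{
    double remote = 2.5, local = 0.0;
    readDouble(0, &remote, &local);
    printf("read %g\n", local);           /* prints: read 2.5              */
    return 0;
}
</pre>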
A split-phase read could be constructed easily by exposing the counter to the caller, who could proceed with computation after initiating the read and only wait on the counter when the data is required.<p>
<A HREF="hoti-94.fig4.gif"><img src="hoti-94.fig4.gif"><B><A NAME="REF17083">Figure 2:  Sample remote read implementation using SSAM</A></B><p></A><p>
<H3><A NAME="HDR11">3.2  Experimental set-up</A></H3>
The experimental set-up used to evaluate the performance of the prototype SSAM implementation consists of a 60 MHz SPARCstation-20 and a 25 MHz SPARCstation-1+ running SunOS 4.1. The two machines are connected via Fore Systems SBA-100 ATM interfaces using a 140 Mb/s TAXI fiber. The interfaces are located on the SBus (a 32-bit I/O bus running at 20 or 25 MHz) and provide a 36-cell deep output FIFO as well as a 292-cell input FIFO. To send a cell the processor stores 56 bytes into the memory-mapped output FIFO, and to receive a cell it reads 56 bytes from the input FIFO. A register in the interface indicates the number of cells available in the input FIFO.<p>
Note that the network interface used is much simpler and closer to multiprocessor NIs than most second-generation ATM interfaces available today. The only function performed in hardware, beyond simply moving cells onto/off the fiber, is checksum generation and checking for the ATM header and an AAL3/4-compatible payload. In particular, no DMA or segmentation and reassembly of multi-cell packets is provided.<p>
<H3><A NAME="HDR12">3.3  SSAM implementation</A></H3>
The implementation of the SPARCstation ATM Active Messages layer consists of two parts: a device driver which is dynamically loaded into the kernel and a user-level library to be linked with applications using SSAM. The driver implements standard functionality to open and close the ATM device and it provides two paths to send and receive cells. The fast path described here consists of three trap instructions which lead directly to code for sending and receiving individual ATM cells. The traps are relatively generic, and all functionality specific to Active Messages is in the user-level library, which also performs the flow control and buffer management. A conventional read/write system call interface is provided for comparison purposes and allows cells to be sent and received using a &quot;pure&quot; device driver approach.<p>
The traps to send and receive cells consist of carefully crafted assembly language routines. Each routine is small (28 and 43 instructions for the send and receive traps, respectively) and uses the available registers carefully. The register usage is simplified by the SPARC architecture's circular register file, which provides a clean 8-register window for the trap. By interfacing from the program to the traps via a function call, arguments can be passed and another 8 registers become available to the trap.<p>
The following paragraphs describe the critical parts of the SSAM implementation in more detail.<p>
<H4><A NAME="HDR13">3.3.1  Flow-control</A></H4>
A simple sliding window flow control scheme is used to prevent overrun of the receive buffers and to detect cell losses. The window size is dimensioned to allow close to full-bandwidth communication among pairs of processors.<p>
In order to implement the flow control for a window of size <I>w</I>, each process pre-allocates memory to hold 4<I>w</I> cells for every other process with which it communicates.
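To make this bookkeeping concrete, the sketch below shows one way the 4<I>w</I> pre-allocated cells and the window state might be organized per connection, together with the test a sender would perform before injecting a request. The four-way split of the cells and all of the names are assumptions for illustration; they are not the documented SSAM data structures.<p>
<pre>
#include <stdint.h>

#define W 8                        /* window size w (illustrative value)     */

typedef struct { uint8_t bytes[48]; } cell_t;    /* one ATM cell payload     */

/* per-connection flow-control state: 4w pre-allocated cells plus the
 * sliding-window counters (layout assumed, not the SSAM definition)         */
struct connection {
    int      vci;                  /* virtual circuit to this peer            */
    cell_t   req_copies[W];        /* sent requests kept for retransmission   */
    cell_t   rep_copies[W];        /* sent replies kept for retransmission    */
    cell_t   req_in[W];            /* received requests awaiting handlers     */
    cell_t   rep_in[W];            /* received replies awaiting handlers      */
    uint16_t next_seq;             /* next request sequence number to send    */
    uint16_t last_acked;           /* newest request acknowledged by peer     */
    uint16_t expected;             /* next sequence number expected from peer */
};

/* a new request may be injected only while a window slot is free */
static int window_open(const struct connection *c)
{
    return (uint16_t)(c->next_seq - c->last_acked) < W;
}
</pre>
A send routine would then poll the network while window_open is false, inject the cell once a slot frees up, and save a copy for possible retransmission, which matches the request path described next.<p>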
The algorithm to send a request message polls the receiver until a free window slot is available and then injects the cell into the network, saving a copy in one of the buffers as well in case it has to be retransmitted. Upon receipt of a request message, the message layer moves the cell into a buffer and, as soon as the corresponding process is running, calls the Active Message handler. If the handler issues a reply, it is sent and a copy is held in a buffer. If the handler does not generate a reply, an explicit acknowledgment is sent. Upon receipt of the reply or acknowledgment, the buffer holding the original request message can be reused. Note how the distinction between requests and replies made in Active Messages allows acknowledgments to be piggy-backed onto replies.<p>
The recovery scheme used in case of lost or duplicate cells is standard, except that the reception of duplicate request messages may indicate lost replies which have to be retransmitted. It is important to realize that this flow control mechanism does not really attempt to minimize message losses due to congestion within the network: the lack of flow control in ATM networks effectively precludes any simple congestion avoidance scheme. Until larger test-beds become available and the ATM community agrees on how routers should handle buffer overflows, it seems futile to invest in more sophisticated flow-control mechanisms. Nevertheless, the bursty nature of parallel computing communication patterns is likely to require some solution before the performance characteristics of an ATM network become as robust as those of multiprocessor networks.<p>
<H4><A NAME="HDR14">3.3.2  User-kernel interface and buffer management</A></H4>
The streamlining of the user-kernel interface is the most important factor contributing to the performance of SSAM. In the prototype, the kernel preallocates all buffers for a process when the device is opened. The pages are then pinned to prevent page-outs and are mapped (using mmap) into the process's address space. After every message send, the user-level library chooses a buffer for the next message and places a pointer in an exported variable. The application program moves the message data into that buffer and passes the connection id and the handler address to SSAM, which finishes formatting the cell (adding the flow-control information and the handler address) and traps to the kernel. The trap passes the message offset within the buffer area and the connection id in registers to the kernel. Protection is ensured with simple masks to limit the connection id and offset ranges. A lookup maps the current process and connection ids to a virtual circuit. The kernel finally moves the cell into the output FIFO.<p>
At the receiving end, the read-ATM kernel trap delivers a batch of incoming cells into a pre-determined shared memory buffer. The number of cells received is returned in a register. For each cell the kernel performs four tasks: it loads the first half of the cell into registers, uses the VCI to index into a table to obtain the address of the appropriate process's input buffer, moves the full cell into that buffer, and checks the integrity of the cell using three flag bits set by the NI in the last byte.
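For illustration, the C fragment below renders those four per-cell tasks; in SSAM itself this path is hand-tuned kernel assembly, and the FIFO access, ATM header layout, table size, and flag-bit polarity shown here are all assumptions.<p>
<pre>
#include <stdint.h>
#include <string.h>

#define CELL_BYTES 56                      /* cell as read from the SBA-100 FIFO */

struct ni_cell { uint32_t word[CELL_BYTES / 4]; };

/* memory-mapped input FIFO and per-VCI input buffer table; in the real
 * driver these are set up when the device is opened (declarations assumed) */
extern volatile struct ni_cell *ni_input_fifo;
extern uint8_t *vci_input_buffer[4096];

/* one cell's worth of the receive trap, in C; returns 0 if the cell is good */
static int receive_one_cell(void)
{
    struct ni_cell cell;
    uint32_t vci;
    uint8_t *dst, flags;

    /* 1. pull the cell out of the input FIFO (the real trap loads the
          first half into registers before deciding where it goes)          */
    cell = *ni_input_fifo;

    /* 2. demultiplex on the VCI in the ATM header to find the destination
          process's input buffer (header byte layout assumed)               */
    vci = (cell.word[0] >> 4) & 0xffff;
    dst = vci_input_buffer[vci & 0xfff];

    /* 3. move the full cell into that buffer                               */
    memcpy(dst, &cell, CELL_BYTES);

    /* 4. check integrity via the three flag bits the NI sets in the last
          byte (polarity assumed: zero means the cell arrived intact)       */
    flags = ((uint8_t *)&cell)[CELL_BYTES - 1] & 0x07;
    return flags == 0 ? 0 : -1;
}
</pre>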
Upon return from the trap, the SSAM library loops through all received cells, checking the flow-control information, calling the appropriate handlers for request and reply messages, and sending explicit acknowledgments when needed.<p>
<H3><A NAME="HDR15">3.4  SSAM performance</A></H3>
The following paragraphs describe performance measurements of SSAM made with a number of synthetic benchmarks. The following terminology is used: <I>overhead</I> consists of the processor cycles spent preparing to send or receive a message, <I>latency</I> is the time from when a message send routine is called to the time the message is handled at the remote end, and <I>bandwidth</I> is the rate at which user data is transferred. The performance goal for SSAM is the fiber rate of 140 Mbit/s, which transmits a cell every 3.14us (53+2 bytes), for an ATM payload bandwidth of 15.2 MB/s<A HREF="hoti-94.html#FN7">(7)</A>.<p>
<H4><A NAME="HDR16">3.4.1  ATM traps</A></H4>
A detailed cost breakdown for the operations occurring in each of the traps to send and receive cells is shown in <A HREF="hoti-94.html#REF33440">Table 1</A>. The two timing columns refer to measurements taken on the SPARCstation 20 and on the SPARCstation 1+, respectively. The times have been obtained by measuring repeated executions of each trap with gettimeofday, which uses a microsecond-accurate clock and takes 9.5us on the SS-20. The time break-down for each trap was measured by commenting appropriate instructions out and is somewhat approximate due to the pipeline overlap occurring between successive instructions.<p>
<pre>
<A NAME="REF33440">Table 1: Cost breakdown for traps to send and receive cells.</A>
---------------------------------------------------------
<B>Operation</B>                       <B>SS-20</B>    <B>SS-1+</B>
---------------------------------------------------------
write trap
    trap+rett                     0.44us    2.03us
    check pid and connection id   0.05us    0.49us
    addt'l kernel ovhd            0.05us    0.50us
    load cell to push             0.13us    3.87us
    push cell to NI               2.05us    3.17us
    total                         2.72us   10.11us
read trap
    trap+rett                     0.44us    2.03us
    check cell count              0.81us    1.08us
    addt'l kernel ovhd            0.18us    0.80us
    per cell pull from NI         4.27us    3.68us
    per cell demux                0.09us    0.23us
    per cell store away           0.17us    3.50us
    total for 1 cell              5.87us   11.32us
