  per cell total for 16 cells       4.61us    8.08us
write_read trap
  total, 0 cells read                3.7us    11.2us
  total, 1 cell read                 8.2us    21.4us
null system call                     6.9us      40us
---------------------------------------------------------
</pre>
The write trap cost is broken down into five parts: the cost of the trap and return, the protection checks, overhead for fetching addresses, loading the cell into registers, and pushing the cell into the network interface. The SS-20 numbers show clearly that the fiber can be saturated by sending a cell at a time from user level. They also indicate that the majority of the cost (75%) lies in the access to the network interface across the Sbus. The cost of the trap itself is surprisingly low, even though it is the second largest item. In fact, it could be reduced slightly, as the current implementation adds a level of indirection in the trap dispatch to simplify the dynamic loading of the device driver.<A HREF="hoti-94.html#FN8">(8)</A><p>
The read trap is itemized similarly: the cost to trap and return, fetching the device register with the count of available cells, additional overhead for setting up addresses, loading the cell from the network interface, demultiplexing among processes, and storing the cell away. The table shows the total cost of a trap which receives a single cell, as well as the per-cell cost of a trap which receives 16 cells. Here again the access to the device dominates because each double-word load incurs the full latency of an Sbus access. The per-cell time of 4.61us on the SS-20 exceeds the fiber's cell time and will limit the achievable bandwidth to at most 68% of the fiber.<p>
The write-read trap first sends a cell and then receives a chunk of cells. This amortizes the cost of the trap across both functions and overlaps checking the cell count slightly with sending. The last item in the table shows the cost of a null system call for comparison purposes (a write to file descriptor -1 was used). It is clear that a system call approach would yield performance far inferior to the traps and would achieve only a fraction of the fiber bandwidth.<p>
<H4><A NAME="REF21197">3.4.2 ATM read/write system calls</A></H4>
In addition to the direct traps, the device driver allows cells to be sent and received using traditional read and write system calls on the device file descriptor. At this time this conventional path is provided for comparison purposes only, and the read and write entry points into the device driver are limited to sending and receiving single cells; multi-cell reads and writes could be supported easily. The read and write entry points perform the following operations:<p>
<UL>
<LI>check the validity of the file descriptor,<BR>
<LI>transfer data between user space and an internal buffer using uiomove, and<BR>
<LI>transfer data between the internal buffer and the FIFOs of the network interface.<BR>
</UL>
The internal buffer is needed because the device FIFOs are only word addressable, so the data cannot be transferred directly between user space and the device using uiomove. The internal buffer also allows double-word accesses to the device FIFOs, which improves the access times considerably.<p>
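To make this path concrete, the sketch below shows what such a single-cell write entry point could look like, assuming a SunOS/Solaris-style uiomove interface. The routine and register names (atm_write, ni_out_fifo) and the register layout are invented for illustration and are not the actual SBA-100 driver code; the read entry point is symmetric, pulling a cell from the input FIFO into the buffer before the uiomove toward user space.
<pre>
/*
 * Hedged sketch of a single-cell write entry point along the lines
 * described in the text; all names and the register layout are assumed.
 */
#include <sys/types.h>
#include <sys/errno.h>
#include <sys/uio.h>

#define CELL_BYTES   56                  /* ATM cell as padded by the SBA-100 */
#define CELL_DWORDS  (CELL_BYTES / 8)

typedef unsigned long long dword_t;      /* stand-in for a double word; a real
                                            SPARC driver would use ldd/std accesses */

extern volatile dword_t *ni_out_fifo;    /* mapped NI output FIFO (assumed) */

static dword_t cell_buf[CELL_DWORDS];    /* double-word aligned staging buffer */

int
atm_write(dev_t dev, struct uio *uio)    /* simplified entry point signature */
{
    int i, err;

    /* the real driver also checks that the descriptor refers to the
       cell device; here we only check that the request is one cell */
    if (uio->uio_resid != CELL_BYTES)
        return (EINVAL);

    /* copy the cell from user space into the internal buffer; the
       device FIFOs are only word addressable, so uiomove cannot
       target them directly */
    err = uiomove((caddr_t)cell_buf, CELL_BYTES, UIO_WRITE, uio);
    if (err)
        return (err);

    /* push the cell into the NI FIFO using double-word accesses,
       roughly halving the number of Sbus transactions */
    for (i = 0; i < CELL_DWORDS; i++)
        *ni_out_fifo = cell_buf[i];

    return (0);
}
</pre>
A multi-cell read or write would simply loop over this transfer, which is why the text notes that it could be supported easily.<p>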
The "check fd, do uiomove" entry reflects the time spent in checking the validity of the file descriptor and performing the uiomove. In the case of a read, it also includes the time to check the device register holding the number of cells available in the input FIFO. The "push/pull cell" entries reflect the time spent to transfer the contents of one cell between the internal buffer and the device FIFOs. The "write" and "read 1 cell" totals reflect the cost of the full system call, while the "read 0 cells" entry is the time taken for an unsuccessful poll which includes the system call overhead, the file descriptor checks, and the reading of the receive-ready register.<p><pre><A NAME="REF16785">Table 2: Cost of sending and receiving cells using read and write system calls.</A>-----------------------------------------------------------<B>Operation</B> <B>SS-20</B> <B>SS-1+</B> -----------------------------------------------------------write system call syscall overhead 22.6us 100us check fd, do uiomove 3.4us 16us push cell into NI 2.2us 8us write total 28.2us 124us read system call syscall overhead 22.1us 99us pull cell from NI 5.0us 13us check fd and recv ready, 7.0us 25us do uiomove read total for 1 cell 34.1us 137us read total for 0 cells 28.8us 113us -----------------------------------------------------------</pre>The timings show clearly that the overhead of the read/write system call interface is prohibitive for small messages. For larger messages, however, it may well be a viable choice and it is more portable than the traps.<p><H4><A NAME="HDR17">3.4.3 SSAM</A></H4>Measurements of the Active Messages layer built on the cell send and receive traps are shown in <A HREF="hoti-94.html#REF21433">Table 3</A>. In all cases one word of the Active Message payload carries data and the handlers simply return. The send request uses a write-read-trap and adds a little over 1us of overhead (on the SS-20) for cell formatting and flow-control. The handling times are all roughly the cost of a read-trap (reading 16 cells per trap) plus again a little over 1us for the flow control and handler dispatch. If a reply is sent that adds the time of a write-trap.<p><pre><A NAME="REF21433">Table 3: Cost breakdown for SPARCstation Active Messages.</A> ---------------------------------------<B>Operation</B> <B>SS-20</B> <B>SS-1+</B> ---------------------------------------send request 5.0us 15us handle request, no reply 5.6us 15us sent handle request and send 7.7us 25us reply handle ack 5.0us 11us handle reply 5.2us 12us ---------------------------------------</pre>The measurements show that supporting only single-cell Active Messages is not optimal. Longer messages are required to achieve peak bulk transfer rates: the one-cell-at-a-time prototype can yield up to 5.6MB/s. A simpler interface for shorter messages (e.g., with only 16 bytes of payload) might well be useful as well to accelerate the small requests and acknowledgments that are often found in higher-level protocols. 
<H4><A NAME="HDR18">3.4.4 Split-C</A></H4>
While a full implementation of Split-C <A HREF="hoti-94.html#REF66827">[2]</A> is still in progress, timings of the remote memory access primitives show that the round-trip time for a remote read of 32 double-word aligned bytes is 32us on the SS-20, and a one-way remote store of the same payload takes 22us.<A HREF="hoti-94.html#FN9">(9)</A> Remote accesses with smaller payloads are not noticeably cheaper. A bulk write implemented with the current SSAM layer transfers 5.5Mbytes/s, but experiments show that, using long messages, this could be improved to 9Mbytes/s by using the full ATM payload and simplifying the handling slightly.<p>
<H3><A NAME="HDR19">3.5 Unresolved issues</A></H3>
The current SSAM prototype has no influence on the kernel's process scheduling. Given the current buffering scheme, the operation of the SSAM layer does not depend on which process is running; the performance of applications, however, is likely to depend heavily on the scheduling. How best to influence the scheduler in a semi-portable fashion requires further investigation. The most promising approach appears to be to use real-time thread scheduling priorities, such as those available in Solaris 2.<p>
The amount of memory allocated by the SSAM prototype is somewhat excessive; in fact, for simplicity, the current prototype uses twice as many buffers as strictly necessary. For example, assuming that a flow-control window of 32 cells is used, the kernel allocates and pins 8Kbytes of memory per process per connection. On a 64-node cluster with 10 parallel applications running, this represents roughly 5Mbytes of memory per processor (10 processes x 64 connections x 8Kbytes).<p>
The number of preallocated buffers could be reduced without affecting peak bulk transfer rates by adjusting the flow-control window size dynamically. The idea is that the first cell of a long message would carry a flag requesting a larger window size from the receiver; a few extra buffers would be allocated for this purpose. The receiver grants the larger window to one sender at a time using the first acknowledgment cell of the bulk transfer, and the larger window size remains in effect until the end of the long message. This scheme has two benefits: the request for a larger window is overlapped with the first few cells of the long message, and the receiver can prevent too many senders from transferring large data blocks simultaneously, which would be sub-optimal for the cache. Fundamentally, however, it appears that memory (or, alternatively, low performance) is the price to pay for having neither flow control in the network nor coordinated process scheduling.<p>
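A minimal sketch of how the receiver side of this window negotiation could be structured is given below; since the scheme is only a proposal, the names, flag values, and window sizes are all assumptions made for the illustration.
<pre>
/*
 * Sketch of the proposed dynamic flow-control window (not implemented in
 * the current prototype); all names and constants are illustrative.
 */
#include <stddef.h>

#define SMALL_WINDOW   8        /* default per-connection window (cells)     */
#define LARGE_WINDOW  32        /* window granted for one bulk transfer      */
#define F_WANT_LARGE  0x1       /* flag in the first cell of a long message  */
#define F_GRANT_LARGE 0x2       /* flag returned in the first acknowledgment */

struct conn {
    int window;                 /* window currently granted to this sender   */
};

struct recv_state {
    struct conn *large_owner;   /* at most one sender holds the large window */
};

/* receiver: first cell of a message has arrived; decide what to grant
   and fold the answer into the flags of the first acknowledgment cell */
static void
on_first_cell(struct recv_state *rs, struct conn *c, int flags, int *ack_flags)
{
    if ((flags & F_WANT_LARGE) && rs->large_owner == NULL) {
        rs->large_owner = c;          /* grant the large window to one */
        c->window = LARGE_WINDOW;     /* sender at a time              */
        *ack_flags |= F_GRANT_LARGE;
    }
}

/* receiver: last cell of a long message has arrived; revert the window */
static void
on_last_cell(struct recv_state *rs, struct conn *c)
{
    if (rs->large_owner == c) {
        rs->large_owner = NULL;
        c->window = SMALL_WINDOW;
    }
}
</pre>
Because the grant rides on an acknowledgment cell that is sent anyway, the negotiation adds no extra cells to the transfer.<p>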
A more subtle problem, having to do with the ATM payload alignment used by the SBA-100 interface, will surface in the future: the 53 bytes of an ATM cell are padded by the SBA-100 to 56 bytes, and the 48-byte payload starts with the 6th byte, i.e., it is only half-word aligned. The effect is that bulk transfer payload formats designed with the SBA-100 in mind (and supporting double-word moves of data between memory and the SBA-100) will clash with other network interfaces which double-word align the ATM payload.<p>
<H3><A NAME="HDR20">3.6 Summary</A></H3>
The prototype Active Messages implementation on a SPARCstation ATM cluster provides a preliminary demonstration that this communication architecture, developed for multiprocessors, can be adapted to the peculiarities of a workstation cluster. The performance achieved is roughly comparable to that of a multiprocessor such as the CM-5 (where the one-way latency is about 6us), but it is clear that without a network interface closer to the processor the performance gap cannot be closed.<p>
The time taken by the flow control and protection in software is surprisingly low (at least in comparison with the network interface access times). The cost, in effect, has been shifted to large pre-allocated and pinned buffers. While the prototype's memory usage is somewhat excessive, other schemes with comparable performance will also require large buffers.<p>
Overall, SSAM's speed comes from a careful integration of all layers, from the language level down to the kernel traps. The key issues are avoiding copies, by having the application place the data directly where the kernel picks it up to move it into the device, and passing only easy-to-check information to the kernel (in particular, not passing an arbitrary virtual address).<p>
<H2><A NAME="HDR21">4 Comparison to other approaches</A></H2>