  per cell total for 16 cells       4.61us    8.08us
write_read trap
  total, 0 cells read                3.7us    11.2us
  total, 1 cell read                 8.2us    21.4us
null system call                     6.9us      40us
---------------------------------------------------------
</pre>
The write trap cost is broken down into five parts: the cost of the trap and return, the protection checks, overhead for fetching addresses, loading the cell into registers, and pushing the cell into the network interface. The SS-20 numbers show clearly that the fiber can be saturated by sending a cell at a time from user level. They also indicate that the majority of the cost (75%) lies in the access to the network interface across the Sbus. The cost of the trap itself is surprisingly low, even though it is the second largest item. In fact, it could be reduced slightly, as the current implementation adds a level of indirection in the trap dispatch to simplify the dynamic loading of the device driver.<A HREF="hoti-94.html#FN8">(8)</A><p>
The read trap is itemized similarly: the cost to trap and return, fetching the device register with the count of available cells, additional overhead for setting up addresses, loading the cell from the network interface, demultiplexing among processes, and storing the cell away. The table shows the total cost of a trap which receives a single cell, as well as the per-cell cost of a trap which receives 16 cells. Here again the access to the device dominates because each double-word load incurs the full latency of an Sbus access. The per-cell time of 4.61us on the SS-20 exceeds the fiber's cell time and will limit the achievable bandwidth to at most 68% of the fiber.<p>
The write-read trap first sends a cell and then receives a chunk of cells. This amortizes the cost of the trap across both functions and overlaps checking the cell count slightly with sending. The last item in the table shows the cost of a null system call for comparison purposes (a write to file descriptor -1 was used). It is clear that a system call approach would yield performance far inferior to the traps and would achieve only a fraction of the fiber bandwidth.<p>
<H4><A NAME="REF21197">3.4.2 ATM read/write system calls</A></H4>
In addition to the direct traps, the device driver allows cells to be sent and received using traditional read and write system calls on the device file descriptor. At this time this conventional path is provided for comparison purposes only, and the read and write entry points into the device driver are limited to sending and receiving single cells; multi-cell reads and writes could be supported easily. The read and write entry points perform the following operations:<p>
<UL>
<LI>check the validity of the file descriptor,<BR>
<LI>transfer data between user space and an internal buffer using uiomove, and<BR>
<LI>transfer data between the internal buffer and the FIFOs of the network interface.<BR>
</UL>
The internal buffer is needed because the device FIFOs are only word addressable, so the data cannot be transferred directly between user space and the device using uiomove. The internal buffer also allows double-word accesses to the device FIFOs, which improves the access times considerably.<p>
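To make this path concrete, the sketch below shows what such a single-cell write entry point could look like, assuming a SunOS/Solaris-style uiomove interface. The routine and register names (atm_write, ni_out_fifo) and the register layout are invented for illustration and are not the actual SBA-100 driver code; the read entry point is symmetric, pulling a cell from the input FIFO into the buffer before the uiomove toward user space.
<pre>
/*
 * Hedged sketch of a single-cell write entry point along the lines
 * described in the text; all names and the register layout are assumed.
 */
#include <sys/types.h>
#include <sys/errno.h>
#include <sys/uio.h>

#define CELL_BYTES   56                  /* ATM cell as padded by the SBA-100 */
#define CELL_DWORDS  (CELL_BYTES / 8)

typedef unsigned long long dword_t;      /* stand-in for a double word; a real
                                            SPARC driver would use ldd/std accesses */

extern volatile dword_t *ni_out_fifo;    /* mapped NI output FIFO (assumed) */

static dword_t cell_buf[CELL_DWORDS];    /* double-word aligned staging buffer */

int
atm_write(dev_t dev, struct uio *uio)    /* simplified entry point signature */
{
    int i, err;

    /* the real driver also checks that the descriptor refers to the
       cell device; here we only check that the request is one cell */
    if (uio->uio_resid != CELL_BYTES)
        return (EINVAL);

    /* copy the cell from user space into the internal buffer; the
       device FIFOs are only word addressable, so uiomove cannot
       target them directly */
    err = uiomove((caddr_t)cell_buf, CELL_BYTES, UIO_WRITE, uio);
    if (err)
        return (err);

    /* push the cell into the NI FIFO using double-word accesses,
       roughly halving the number of Sbus transactions */
    for (i = 0; i < CELL_DWORDS; i++)
        *ni_out_fifo = cell_buf[i];

    return (0);
}
</pre>
A multi-cell read or write would simply loop over this transfer, which is why the text notes that it could be supported easily.<p>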
The "check fd, do uiomove" entry reflects the time spent in checking the validity of the file descriptor and performing the uiomove. In the case of a read, it also includes the time to check the device register holding the number of cells available in the input FIFO. The "push/pull cell" entries reflect the time spent to transfer the contents of one cell between the internal buffer and the device FIFOs. The "write" and "read 1 cell" totals reflect the cost of the full system call, while the "read 0 cells" entry is the time taken for an unsuccessful poll which includes the system call overhead, the file descriptor checks, and the reading of the receive-ready register.<p><pre><A NAME="REF16785">Table 2: Cost of sending and receiving cells using read and write system calls.</A>-----------------------------------------------------------<B>Operation</B> <B>SS-20</B> <B>SS-1+</B> -----------------------------------------------------------write system call syscall overhead 22.6us 100us check fd, do uiomove 3.4us 16us push cell into NI 2.2us 8us write total 28.2us 124us read system call syscall overhead 22.1us 99us pull cell from NI 5.0us 13us check fd and recv ready, 7.0us 25us do uiomove read total for 1 cell 34.1us 137us read total for 0 cells 28.8us 113us -----------------------------------------------------------</pre>The timings show clearly that the overhead of the read/write system call interface is prohibitive for small messages. For larger messages, however, it may well be a viable choice and it is more portable than the traps.<p><H4><A NAME="HDR17">3.4.3 SSAM</A></H4>Measurements of the Active Messages layer built on the cell send and receive traps are shown in <A HREF="hoti-94.html#REF21433">Table 3</A>. In all cases one word of the Active Message payload carries data and the handlers simply return. The send request uses a write-read-trap and adds a little over 1us of overhead (on the SS-20) for cell formatting and flow-control. The handling times are all roughly the cost of a read-trap (reading 16 cells per trap) plus again a little over 1us for the flow control and handler dispatch. If a reply is sent that adds the time of a write-trap.<p><pre><A NAME="REF21433">Table 3: Cost breakdown for SPARCstation Active Messages.</A> ---------------------------------------<B>Operation</B> <B>SS-20</B> <B>SS-1+</B> ---------------------------------------send request 5.0us 15us handle request, no reply 5.6us 15us sent handle request and send 7.7us 25us reply handle ack 5.0us 11us handle reply 5.2us 12us ---------------------------------------</pre>The measurements show that supporting only single-cell Active Messages is not optimal. Longer messages are required to achieve peak bulk transfer rates: the one-cell-at-a-time prototype can yield up to 5.6MB/s. A simpler interface for shorter messages (e.g., with only 16 bytes of payload) might well be useful as well to accelerate the small requests and acknowledgments that are often found in higher-level protocols. 
<H4><A NAME="HDR18">3.4.4 Split-C</A></H4>
While a full implementation of Split-C <A HREF="hoti-94.html#REF66827">[2]</A> is still in progress, timings of the remote memory access primitives show that the round-trip time for a remote read of 32 double-word aligned bytes is 32us on the SS-20, and a one-way remote store of the same payload takes 22us.<A HREF="hoti-94.html#FN9">(9)</A> Remote accesses with smaller payloads are not noticeably cheaper. A bulk write implemented with the current SSAM layer transfers 5.5Mbytes/s, but experiments show that, using long messages, this could be improved to 9Mbytes/s by using the full ATM payload and simplifying the handling slightly.<p>
<H3><A NAME="HDR19">3.5 Unresolved issues</A></H3>
The current SSAM prototype has no influence on the kernel's process scheduling. Given the current buffering scheme, the operation of the SSAM layer does not depend on which process is running; the performance of applications, however, is likely to depend heavily on the scheduling. How best to influence the scheduler in a semi-portable fashion requires further investigation. The most promising approach appears to be to use real-time thread scheduling priorities, such as those available in Solaris 2.<p>
The amount of memory allocated by the SSAM prototype is somewhat excessive; in fact, for simplicity, the current prototype uses twice as many buffers as strictly necessary. For example, assuming that a flow-control window of 32 cells is used, the kernel allocates and pins 8Kbytes of memory per process per connection. On a 64-node cluster with 10 parallel applications running, this represents roughly 5Mbytes of memory per processor (10 processes x 64 connections x 8Kbytes).<p>
The number of preallocated buffers could be reduced without affecting peak bulk transfer rates by adjusting the flow-control window size dynamically. The idea is that the first cell of a long message would carry a flag requesting a larger window size from the receiver; a few extra buffers would be allocated for this purpose. The receiver grants the larger window to one sender at a time using the first acknowledgment cell of the bulk transfer, and the larger window size remains in effect until the end of the long message. This scheme has two benefits: the request for a larger window is overlapped with the first few cells of the long message, and the receiver can prevent too many senders from transferring large data blocks simultaneously, which would be sub-optimal for the cache. Fundamentally, however, it appears that memory (or, alternatively, low performance) is the price to pay for having neither flow control in the network nor coordinated process scheduling.<p>
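A minimal sketch of how the receiver side of this window negotiation could be structured is given below; since the scheme is only a proposal, the names, flag values, and window sizes are all assumptions made for the illustration.
<pre>
/*
 * Sketch of the proposed dynamic flow-control window (not implemented in
 * the current prototype); all names and constants are illustrative.
 */
#include <stddef.h>

#define SMALL_WINDOW   8        /* default per-connection window (cells)     */
#define LARGE_WINDOW  32        /* window granted for one bulk transfer      */
#define F_WANT_LARGE  0x1       /* flag in the first cell of a long message  */
#define F_GRANT_LARGE 0x2       /* flag returned in the first acknowledgment */

struct conn {
    int window;                 /* window currently granted to this sender   */
};

struct recv_state {
    struct conn *large_owner;   /* at most one sender holds the large window */
};

/* receiver: first cell of a message has arrived; decide what to grant
   and fold the answer into the flags of the first acknowledgment cell */
static void
on_first_cell(struct recv_state *rs, struct conn *c, int flags, int *ack_flags)
{
    if ((flags & F_WANT_LARGE) && rs->large_owner == NULL) {
        rs->large_owner = c;          /* grant the large window to one */
        c->window = LARGE_WINDOW;     /* sender at a time              */
        *ack_flags |= F_GRANT_LARGE;
    }
}

/* receiver: last cell of a long message has arrived; revert the window */
static void
on_last_cell(struct recv_state *rs, struct conn *c)
{
    if (rs->large_owner == c) {
        rs->large_owner = NULL;
        c->window = SMALL_WINDOW;
    }
}
</pre>
Because the grant rides on an acknowledgment cell that is sent anyway, the negotiation adds no extra cells to the transfer.<p>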
A more subtle problem, having to do with the ATM payload alignment used by the SBA-100 interface, will surface in the future: the 53 bytes of an ATM cell are padded by the SBA-100 to 56 bytes, and the 48-byte payload starts with the 6th byte, i.e., it is only half-word aligned. The effect is that bulk transfer payload formats designed with the SBA-100 in mind (and supporting double-word moves of data between memory and the SBA-100) will clash with other network interfaces which double-word align the ATM payload.<p>
<H3><A NAME="HDR20">3.6 Summary</A></H3>
The prototype Active Messages implementation on a SPARCstation ATM cluster provides a preliminary demonstration that this communication architecture, developed for multiprocessors, can be adapted to the peculiarities of a workstation cluster. The performance achieved is roughly comparable to that of a multiprocessor such as the CM-5 (where the one-way latency is about 6us), but it is clear that without a network interface closer to the processor the performance gap cannot be closed.<p>
The time taken by the flow control and protection in software is surprisingly low (at least in comparison with the network interface access times). The cost, in effect, has been shifted to large pre-allocated and pinned buffers. While the prototype's memory usage is somewhat excessive, other schemes with comparable performance will also require large buffers.<p>
Overall, SSAM's speed comes from a careful integration of all layers, from the language level down to the kernel traps. The key issues are avoiding copies, by having the application place the data directly where the kernel picks it up to move it into the device, and passing only easy-to-check information to the kernel (in particular, not passing an arbitrary virtual address).<p>
<H2><A NAME="HDR21">4 Comparison to other approaches</A></H2>