
http://www.cs.cornell.edu/info/projects/cam/hoti-94.html

Differences remain: in particular, the design and construction of multiprocessors allow better integration of all the components because they can be designed to fit together. In addition, the sharing of physical components such as power supplies, cooling, and cabinets has the potential to reduce cost and to allow denser packaging. While the debate over the significance of these technological differences is still open, it is becoming clear that the two approaches will yield qualitatively similar hardware systems. Indeed, it is possible to take a cluster of workstations and load system software that makes it look almost identical to a multiprocessor. This means that a continuous spectrum of platforms, spanning the entire range from workstations on an Ethernet to state-of-the-art multiprocessors, can become available.

From a pragmatic point of view, however, significant differences are likely to remain. The most important attraction in using a cluster of workstations instead of a multiprocessor lies in the off-the-shelf availability of all its major hardware and software components. This means that all the components are readily available, they are familiar, and their cost is lower because of economies of scale leveraged across the entire workstation user community. Thus, even if from a technical point of view there is a continuous spectrum between clusters and multiprocessors, the use of off-the-shelf components in clusters will maintain differences.

In fact, the use of standard components in clusters raises the question of whether these can reasonably be used for parallel processing. Recent advances in multiprocessor communication performance are principally due to a tighter integration of programming models, compilers, operating system functions, and hardware primitives. It is not clear whether these advances can be carried over to clusters, or whether the use of standard components is squarely at odds with achieving the level of integration required to enable modern parallel programming models. Specifically, new communication architectures such as distributed shared memory, explicit remote memory access, and Active Messages reduced communication costs from hundreds or even thousands of microseconds to just a few dozen, precisely through the integration of all system components. These new communication architectures are designed such that network interfaces can implement common primitives directly in hardware, and they allow the operating system to be moved out of the critical path. This paper examines whether the techniques developed to improve communication performance in multiprocessors, in particular Active Messages, can be carried over to clusters of workstations with standard networks and mostly standard system software.
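To make the mechanism concrete, here is a minimal C sketch of the Active Messages idea referred to above: each message names a handler that the receiving node invokes directly on arrival, so no protocol stack sits between the wire and the computation. All names in the sketch (am_msg, handler_table, am_deliver) are invented for illustration and are not the interface of any actual Active Messages implementation.

```c
/* Minimal sketch of the Active Messages idea: each message names a
 * handler that the receiver runs directly on arrival.  All names here
 * are illustrative, not the API of the paper's prototype. */
#include <stdio.h>
#include <string.h>

#define AM_PAYLOAD 32

typedef struct {
    int  handler;              /* index into the receiver's handler table */
    char data[AM_PAYLOAD];     /* small inline payload */
} am_msg;

typedef void (*am_handler)(const char *data);

static void reply_handler(const char *data)   { printf("reply: %s\n", data); }
static void deposit_handler(const char *data) { printf("deposit: %s\n", data); }

/* The receiver dispatches straight from the message to user code. */
static am_handler handler_table[] = { reply_handler, deposit_handler };

static void am_deliver(const am_msg *m)
{
    handler_table[m->handler](m->data);   /* one indirect call, no OS involvement */
}

int main(void)
{
    am_msg m = { .handler = 1 };
    strcpy(m.data, "cache line 0x40");
    am_deliver(&m);            /* stands in for arrival from the network */
    return 0;
}
```

The point of the design is visible in am_deliver: dispatch is a single indirect call, which is why the per-message software cost can stay within a few dozen instructions when the rest of the system cooperates.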
This paper assumes current state-of-the-art technology, in which clusters using ATM networks differ from multiprocessors in three major aspects (4):

- clusters use standard operating system software, which implies less coordination among individual nodes, in particular with respect to process scheduling and address translation,
- ATM networks do not provide the reliable delivery and flow control that are taken for granted in multiprocessor networks, and
- network interfaces for workstations optimize stream communication (e.g., TCP/IP) and are less well integrated into the overall architecture (e.g., they connect to the I/O bus instead of the memory bus).

In comparing communication on clusters and multiprocessors, this paper makes two major contributions:

- first, it analyzes, in Section 2, the implications that the differences between clusters and multiprocessors have on the design of communication layers similar to those used in multiprocessors, and
- second, it describes, in Section 3, the design of an Active Messages prototype implementation on a collection of Sun workstations interconnected by an ATM network which yields application-to-application latencies on the order of 20µs.

The use of Active Messages in workstation clusters is briefly contrasted with other approaches in Section 4, and Section 5 concludes the paper.

2  Technical Issues

Collections of workstations have been used in many different forms to run large applications. In order to establish a basis for comparison with multiprocessors, this paper limits itself to collections of workstations (called clusters) which consist of a homogeneous set of machines, dedicated to running parallel applications, located in close proximity (such as in the same machine room), and interconnected by an ATM network. Such a cluster can be employed in a large variety of settings. The cluster could simply provide high-performance compute service for a user community to run large parallel applications.

A more typical setting would be as a computational resource in a distributed application. One such example, the Stormcast weather monitoring system in Norway, runs on a very large collection of machines spread across a large portion of the country, but uses a cluster of a few dozen workstations in a machine room (without a high-speed network in this case) to run compute-intensive weather prediction models and to emit storm warnings. The availability of low-latency communication among these workstations would enable the use of parallel programming languages and of more powerful parallel algorithms, both of which require a closer coupling among processors than is possible today.

Concentrating on the compute cluster offers the largest potential for improvement, both because the latency over the long-haul links is dominated by speed-of-light and network-congestion issues and because wide-area communication is comparatively better served by today's distributed computing software.
Note that this paper does not argue that running concurrent applications in a heterogeneous environment, across large distances, and on workstations that happen to be sitting idle is not an interesting design point (it has in fact been used successfully), but rather that the set of communication issues arising in such a context cannot be compared to those in a multiprocessor.

Given that the applications for the clusters considered here exhibit characteristics similar to those on multiprocessors, the programming models used would be comparable, if not identical, to those popular for parallel computing. This includes various forms of message passing (e.g., send/receive, PVM), of shared memory (e.g., cache-coherent shared memory, remote reads and writes, explicit global memory), and of parallel object-oriented languages (e.g., numerous C++ extensions).

On parallel machines, several proposed communication architectures have achieved the low overheads, low latencies, and high bandwidths that are required for high-performance implementations of the above programming models. In particular, cache-coherent shared memory, remote reads and writes, and Active Messages offer round-trip communication within a few hundred instruction times, so that frequent communication at a fine granularity (such as on an object-by-object or cache-line basis) remains compatible with high performance. In these settings, the overhead of communication, that is, the time spent by the processor initiating communication, is essentially the cost of pushing message data into the network interface at the sending end and pulling it out at the receiving end. Virtually no cycles are spent in protocol handling, as all reliability and flow control are handled in hardware. The operating system need not be involved in every communication operation because the network interface hardware can enforce protection boundaries across the network.

The above communication architectures cannot be moved in a straightforward manner from multiprocessors to clusters of workstations with ATM networks because of three major differences between the two: ATM networks offer neither reliable delivery nor flow control, ATM network interfaces provide no support for protected user-level access to the network, and workstation operating systems do not coordinate process scheduling or address translation globally. Coping with these differences poses major technical challenges and may eventually require the integration of some multiprocessor-specific features into clusters. The following three subsections present the nature of these differences in more detail and discuss the resulting issues.

2.1  Reliability and flow control in the network

In multiprocessor networks, flow control is implemented in hardware on a link-by-link basis. Whenever the input buffer of a router fills up, the output of the upstream router is disabled to prevent buffer overflow. The flow control thus has the effect of blocking messages in the network and eventually, as the back-pressure propagates, the sending nodes are prevented from injecting further messages. This mechanism guarantees that messages are never dropped due to buffer space limitations within the network or at the receiving end; a toy model of this back-pressure is sketched below.
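As a concrete picture of that back-pressure, the following toy C model (buffer size, router structure, and all names are invented for illustration, not drawn from any particular machine) shows an upstream sender that is simply stalled, rather than having messages dropped, once the downstream input buffer fills.

```c
/* Toy model of link-by-link back-pressure: an upstream router may
 * forward a message only while the downstream input buffer has space,
 * so a full buffer stalls the sender instead of dropping messages. */
#include <stdio.h>
#include <stdbool.h>

#define BUF_SLOTS 4

typedef struct {
    int count;                 /* occupied slots in the input buffer */
} router;

static bool link_try_send(router *down)
{
    if (down->count == BUF_SLOTS)
        return false;          /* back-pressure: upstream must stall */
    down->count++;
    return true;
}

static void router_drain(router *r)
{
    if (r->count > 0)
        r->count--;            /* one message forwarded per step */
}

int main(void)
{
    router down = { 0 };
    for (int step = 0; step < 10; step++) {
        /* the sender offers two messages per step while the link
         * drains one: the buffer fills, then the sender is throttled,
         * and no message is ever lost */
        for (int i = 0; i < 2; i++)
            if (!link_try_send(&down))
                printf("step %d: sender stalled (buffer full)\n", step);
        router_drain(&down);
    }
    return 0;
}
```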
In addition, the electrical characteristics of the network are designed to ensure very low error rates, such that a simple error detection and correction mechanism (implemented in hardware) can offer the same reliability within the network as is typical of the processing nodes themselves.

In contrast, an ATM network does not provide any form of flow control and does not offer reliable delivery. Instead, higher protocol layers must detect cell loss or corruption and cause retransmission. While this partitioning of responsibilities may be acceptable in the case of stream-based communication (e.g., TCP/IP, video, audio), it is questionable in a parallel computing setting.

The flow control and the error detection and correction in multiprocessor networks serve to cover four causes of message loss: buffer overflow in the receiving software, buffer overflow in the receiving network interface, buffer overflow within the network, and message corruption due to hardware errors. In an ATM network, simple window-based end-to-end flow control schemes and a per-message CRC (as used in AAL-5) can cover the first and last cases (5) of cell loss. In addition, buffer overflow in the receiving network interface can be prevented by ensuring that the rate at which cells can be moved from the interface into main memory is at least as large as the maximal cell arrival rate. Preventing buffer overflow within the network, however, is not realistically possible using end-to-end flow control. This is particularly a problem in a parallel computing setting, in which all nodes tend to communicate with all other nodes in both highly regular and irregular patterns at unpredictable intervals. The degree of contention within the network therefore cannot be measured or predicted with any accuracy by either the sender or the receiver, and communication patterns which result in high contention will cause high cell loss rates and extensive retransmissions.

Traditional flow control schemes used in stream-based communication avoid fruitless retransmission storms by dynamically reducing the transmission rate on connections which experience high cell loss rates. This works in those settings because, following the law of large numbers, contention in a wide-area network does not tend to vary instantaneously, and therefore the degree of contention observed in the recent past is a good predictor of contention in the near future.

As an illustration of the difficulties in a parallel computing setting, consider the implementation of a parallel sort. The most efficient parallel sort algorithms [3] are based on an alternation of local sorts on the nodes and permutation phases in which all nodes exchange data with all other nodes. These permutation phases serve to move the elements to be sorted "towards" their correct position. The communication patterns observed are highly dynamic, and their characteristics depend to a large degree on the input data. If at any point the attempted data rate into a given node exceeds the link rate, then the output buffers at upstream switches will start filling up. Because the communication patterns change very rapidly (essentially with every cell), it is futile to attempt to predict contention, and given the all-to-all communication pattern, the probability of internal contention among seemingly unrelated connections is high.
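Returning to the window-based end-to-end scheme with a per-message CRC mentioned above: the single-process C sketch below (window size, checksum function, and message format are all invented for illustration; this is not AAL-5 itself) shows a sender holding at most WINDOW unacknowledged messages and retaining them for retransmission until their checksums verify.

```c
/* Sketch of a simple end-to-end scheme: a sender window of WINDOW
 * outstanding messages plus a per-message checksum.  This covers
 * receiver-side overflow and corruption, but nothing here observes or
 * limits buffer occupancy inside the network. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define WINDOW 4
#define TOTAL  8

/* toy checksum standing in for a per-message CRC */
static uint32_t cksum(const char *p, size_t n)
{
    uint32_t c = 0;
    while (n--) c = c * 31 + (uint8_t)*p++;
    return c;
}

typedef struct { int seq; char data[16]; uint32_t crc; } msg;

int main(void)
{
    msg inflight[TOTAL];            /* kept for possible retransmission */
    int next_seq = 0, acked = 0;    /* window is [acked, next_seq) */

    while (acked < TOTAL) {
        /* send while fewer than WINDOW messages are unacknowledged */
        while (next_seq - acked < WINDOW && next_seq < TOTAL) {
            msg *m = &inflight[next_seq];
            m->seq = next_seq;
            snprintf(m->data, sizeof m->data, "payload-%d", next_seq);
            m->crc = cksum(m->data, strlen(m->data));
            printf("send seq %d\n", next_seq++);
        }
        /* receiver side: verify the checksum of the oldest message;
         * a mismatch would leave `acked` in place, forcing a resend */
        msg *m = &inflight[acked];
        if (cksum(m->data, strlen(m->data)) == m->crc)
            printf("ack  seq %d\n", acked++);
    }
    return 0;
}
```

Note what the sketch cannot model: the sender's window bounds only its own outstanding traffic, so when many such senders converge on one node, as in the permutation phases of the parallel sort, switch buffers inside the network can still overflow, which is exactly the gap identified above.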
