portals3.lyx
This document describes an application programming interface for message passing between nodes in a system area network. The goal of this interface is to improve the scalability and performance of network communication by defining the functions and semantics of message passing required for scaling a parallel computing system to ten thousand nodes. This goal is achieved by providing an interface that will allow a quality implementation to take advantage of the inherently scalable design of Portals.
\layout Standard

This document is divided into several sections:
\layout Description

Section\SpecialChar ~\begin_inset LatexCommand \ref{sec:intro}\end_inset ---Introduction This section describes the purpose and scope of the Portals API.
\layout Description

Section\SpecialChar ~\begin_inset LatexCommand \ref{sec:apiover}\end_inset ---An\SpecialChar ~Overview\SpecialChar ~of\SpecialChar ~the\SpecialChar ~Portals\SpecialChar ~3.1\SpecialChar ~API This section gives a brief overview of the Portals API. The goal is to introduce the key concepts and terminology used in the description of the API.
\layout Description

Section\SpecialChar ~\begin_inset LatexCommand \ref{sec:api}\end_inset ---The\SpecialChar ~Portals\SpecialChar ~3.2\SpecialChar ~API This section describes the functions and semantics of the Portals application programming interface.
\layout Description

Section\SpecialChar ~\begin_inset LatexCommand \ref{sec:semantics}\end_inset ---The\SpecialChar ~Semantics\SpecialChar ~of\SpecialChar ~Message\SpecialChar ~Transmission This section describes the semantics of message transmission, in particular the information transmitted in each type of message and the processing of incoming messages.
\layout Description

Section\SpecialChar ~\begin_inset LatexCommand \ref{sec:examples}\end_inset ---Examples This section presents several examples intended to illustrate the use of the Portals API.
\layout Section

Purpose
\layout Standard

Existing message passing technologies available for commodity cluster networking hardware do not meet the scalability goals required by the Cplant\SpecialChar ~\begin_inset LatexCommand \cite{Cplant}\end_inset project at Sandia National Laboratories. The goal of the Cplant project is to construct a commodity cluster that can scale to the order of ten thousand nodes. This number greatly exceeds the capacity for which existing message passing technologies have been designed and implemented.
\layout Standard

In addition to the scalability requirements of the network, these technologies must also be able to support a scalable implementation of the Message Passing Interface (MPI)\SpecialChar ~\begin_inset LatexCommand \cite{MPIstandard}\end_inset standard, which has become the \shape italic de facto\shape default standard for parallel scientific computing. While MPI does not impose any scalability limitations, existing message passing technologies do not provide the functionality needed to allow implementations of MPI to meet the scalability requirements of Cplant.
\layout Standard

The following are properties of a network architecture that do not impose any inherent scalability limitations:
\layout Itemize

Connectionless - Many connection-oriented architectures, such as VIA\SpecialChar ~\begin_inset LatexCommand \cite{VIA}\end_inset and TCP/IP sockets, have limitations on the number of peer connections that can be established.
\layout Itemize

Network independence - Many communication systems depend on the host processor to perform operations in order for messages in the network to be consumed. Message consumption from the network should not be dependent on host processor activity, such as the operating system scheduler or user-level thread scheduler.
\layout Itemize

User-level flow control - Many communication systems manage flow control internally to avoid depleting resources, which can significantly impact performance as the number of communicating processes increases.
\layout Itemize

OS Bypass - High performance network communication should not involve memory copies into or out of a kernel-managed protocol stack.
\layout Standard

The following are properties of a network architecture that do not impose scalability limitations for an implementation of MPI:
\layout Itemize

Receiver-managed - Sender-managed message passing implementations require a persistent block of memory to be available for every process, requiring memory resources to increase with job size and requiring user-level flow control mechanisms to manage these resources.
\layout Itemize

User-level Bypass - While OS Bypass is necessary for high performance, it alone is not sufficient to support the Progress Rule of MPI asynchronous operations.
\layout Itemize

Unexpected messages - Few communication systems support receiving messages for which there is no prior notification. Support for these types of messages is necessary to avoid flow control and protocol overhead.
\layout Section

Background
\layout Standard

Portals was originally designed for and implemented on the nCube machine as part of the SUNMOS (Sandia/UNM OS)\SpecialChar ~\begin_inset LatexCommand \cite{SUNMOS}\end_inset and Puma\SpecialChar ~\begin_inset LatexCommand \cite{PumaOS}\end_inset lightweight kernel development projects. Portals went through two design phases, the latter of which is used on the 4500-node Intel TeraFLOPS machine\SpecialChar ~\begin_inset LatexCommand \cite{TFLOPS}\end_inset .
Portals have been very successful in meeting the needs of such a large machine, not only as a layer for a high-performance MPI implementation\SpecialChar ~\begin_inset LatexCommand \cite{PumaMPI}\end_inset , but also for implementing the scalable run-time environment and parallel I/O capabilities of the machine.
\layout Standard

The second-generation Portals implementation was designed to take full advantage of the hardware architecture of large MPP machines. However, efforts to implement this same design on commodity cluster technology identified several limitations, due to differences in network hardware as well as to shortcomings in the design of Portals.
\layout Section

Scalability
\layout Standard

The primary goal in the design of Portals is scalability. Portals are designed specifically for an implementation capable of supporting a parallel job running on tens of thousands of nodes. Performance is critical only in terms of scalability. That is, the level of message passing performance is characterized by how far it allows an application to scale, not by how it performs in micro-benchmarks (e.g., a two-node bandwidth or latency test).
\layout Standard

The Portals API is designed to allow for scalability, not to guarantee it. Portals cannot overcome the shortcomings of a poorly designed application program. Applications that have inherent scalability limitations, either through design or implementation, will not be transformed by Portals into scalable applications. Scalability must be addressed at all levels: Portals do not inhibit scalability, but they do not guarantee it either.
\layout Standard

To support scalability, the Portals interface maintains a minimal amount of state. Portals provide reliable, ordered delivery of messages between pairs of processes. They are connectionless: a process is not required to explicitly establish a point-to-point connection with another process in order to communicate.
Moreover, all buffers used in the transmission of messages are maintained in user space. The target process determines how to respond to incoming messages, and messages for which there are no buffers are discarded.
\layout Section

Communication Model
\layout Standard

Portals combine the characteristics of both one-sided and two-sided communication. They define a \begin_inset Quotes eld\end_inset matching put\begin_inset Quotes erd\end_inset operation and a \begin_inset Quotes eld\end_inset matching get\begin_inset Quotes erd\end_inset operation. The destination of a put (or send) is not an explicit address; instead, each message contains a set of match bits that allow the receiver to determine where incoming messages should be placed. This flexibility allows Portals to support both traditional one-sided operations and two-sided send/receive operations.
\layout Standard

Portals allow the target to determine whether incoming messages are acceptable. A target process can choose to accept or to ignore message operations from any specific process.
\layout Section

Zero Copy, OS Bypass and Application Bypass
\layout Standard

In traditional system architectures, network packets arrive at the network interface card (NIC), are passed through one or more protocol layers in the operating system, and are eventually copied into the address space of the application. As network bandwidth began to approach memory copy rates, reducing the number of memory copies became a critical concern. This concern led to the development of zero-copy message passing protocols, in which message copies are eliminated or pipelined to avoid the loss of bandwidth.
\layout Standard

A typical zero-copy protocol has the NIC generate an interrupt for the CPU when a message arrives from the network. The interrupt handler then controls the transfer of the incoming message into the address space of the appropriate application.
The interrupt latency, the time from the initiation of an interrupt until the interrupt handler is running, is fairly significant. To avoid this cost, some modern NICs have processors that can be programmed to implement part of a message passing protocol. Given a properly designed protocol, it is possible to program the NIC to control the transfer of incoming messages without needing to interrupt the CPU. Because this strategy does not need to involve the OS on every message transfer, it is frequently called \begin_inset Quotes eld\end_inset OS Bypass.\begin_inset Quotes erd\end_inset ST\SpecialChar ~\begin_inset LatexCommand \cite{ST}\end_inset , VIA\SpecialChar ~\begin_inset LatexCommand \cite{VIA}\end_inset , FM\SpecialChar ~\begin_inset LatexCommand \cite{FM2}\end_inset , GM\SpecialChar ~\begin_inset LatexCommand \cite{GM}\end_inset , and Portals are examples of OS Bypass protocols.
\layout Standard

Many protocols that support OS Bypass still require that the application actively participate in the protocol to ensure progress. As an example, the long message protocol of PM requires that the application receive and reply to a request to put or get a long message. This complicates the runtime environment, requiring a thread to process incoming requests, and significantly increases the latency required to initiate a long message protocol. The Portals message passing protocol does not require activity on the part of the application to ensure progress. We use the term \begin_inset Quotes eld\end_inset Application Bypass\begin_inset Quotes erd\end_inset to refer to this aspect of the Portals protocol.
\layout Section

Faults
\layout Standard

Given the number of components that we are dealing with and the fact that we are interested in supporting applications that run for very long times, failures are inevitable. The Portals API recognizes that the underlying transport may not be able to successfully complete an operation once it has been initiated.
This is reflected in the fact that the Portals API reports three types of events: events indicating the initiation of an operation, events indicating the successful completion of an operation, and events indicating the unsuccessful completion of an operation. Every initiation event is eventually followed by either a successful or an unsuccessful completion event.
\layout Standard

Between the time an operation is started and the time that the operation completes (successfully or unsuccessfully), any memory associated with the operation should be considered volatile. That is, the memory may be changed in unpredictable ways while the operation is progressing. Once the operation completes, the memory associated with the operation will not be subject to further modification (from this operation). Notice that unsuccessful operations may alter memory in an essentially unpredictable fashion.
\layout Chapter

An Overview of the Portals API
\begin_inset LatexCommand \label{sec:apiover}\end_inset

\layout Standard

In this section, we give a conceptual overview of the Portals API. The goal is to provide a context for understanding the detailed description of the API presented in the next section.
\layout Section

Data Movement
\begin_inset LatexCommand \label{sec:dmsemantics}\end_inset

\layout Standard

A Portal represents an opening in the address space of a process. Other processes can use a Portal to read (get) or write (put) the memory associated with the Portal. Every data movement operation involves two processes, the \series bold initiator\series default and the \series bold target\series default . The initiator is the process that initiates the data movement operation.
The target is the process that responds to the operation, either by accepting the data for a put operation or by replying with the data for a get operation.
\layout Standard

In this discussion, activities attributed to a process may refer to activities that are actually performed by the process or \emph on on behalf of the process\emph default . The inclusiveness of our terminology is important in the context of \emph on application bypass\emph default . In particular, when we note that the target sends a reply in the case of a get operation, it is possible that the reply will be generated by another component in the system, bypassing the application.
\layout Standard

Figures\SpecialChar ~\begin_inset LatexCommand \ref{fig:put}\end_inset and \begin_inset LatexCommand \ref{fig:get}\end_inset present graphical interpretations of the Portal data movement operations: put and get. In the case of a put operation, the initiator sends a put request message containing the data to the target. The target translates the Portal addressing information in the request using its local Portal structures. When the request has been processed, the target optionally sends an acknowledgement message.
\layout Standard

\begin_inset Float figure
placement htbp
wide false
collapsed false

\layout Standard
\align center
\begin_inset Graphics FormatVersion 1
	filename put.eps
	display color
	size_type 0
	rotateOrigin center
	lyxsize_type 1
	lyxwidth 218pt
	lyxheight 119pt
\end_inset

\layout Caption

Portal Put (Send)\begin_inset LatexCommand \label{fig:put}\end_inset

\end_inset

\layout Standard

In the case of a get operation, the initiator sends a get request to the target. As with the put operation, the target translates the Portal addressing information in the request using its local Portal structures.
Once it has translated the Portal addressing information, the target sends a reply that includes the requested data.
\layout Standard

\begin_inset Float figure
placement htbp
wide false
collapsed false

\layout Standard
\align center
\begin_inset Graphics FormatVersion 1
	filename get.eps
	display color
	size_type 0
	rotateOrigin center
	lyxsize_type 1
	lyxwidth 218pt
	lyxheight 119pt
\end_inset

\layout Caption

Portal Get\begin_inset LatexCommand \label{fig:get}\end_inset

\end_inset

\layout Standard

We should note that Portal address translations are performed only on nodes that respond to operations initiated by other nodes. Acknowledgements and replies to get operations bypass the Portals address translation structures.
\layout Section

Portal Addressing
\begin_inset LatexCommand \label{subsec:paddress}\end_inset

\layout Standard

One-sided data movement models (e.g., shmem\SpecialChar ~\begin_inset LatexCommand \cite{CraySHMEM}\end_inset , ST\SpecialChar ~\begin_inset LatexCommand \cite{ST}\end_inset , MPI-2\SpecialChar ~\begin_inset LatexCommand \cite{MPI2}\end_inset ) typically use a triple to address memory on a remote node. This triple consists of a process id, a memory buffer id, and an offset. The process id identifies the target process, the memory buffer id specifies the region of memory to be used for the operation, and the offset specifies an offset within the memory buffer.
\layout Standard

In addition to the standard address components (process id, memory buffer id, and offset), a Portal address includes a set of match bits. This addressing model is appropriate for supporting one-sided operations as well as traditional two-sided message passing operations.
Specifically, the Portals API provides the flexibility needed for an efficient implementation of MPI-1, which defines two-sided operations with one-sided completion semantics.
\layout Standard

Figure\SpecialChar ~\begin_inset LatexCommand \ref{fig:portals}\end_inset presents a graphical representation of the structures used by a target in the interpretation of a Portal address. The process id is used to route the message to the appropriate node and is not reflected in this diagram. The memory buffer id, called the \series bold portal id\series default , is used as an index into the Portal table. Each element of the Portal table identifies a match list. Each element of the match list specifies two bit patterns: a set of \begin_inset Quotes eld\end_inset don't care\begin_inset Quotes erd\end_inset bits and a set of \begin_inset Quotes eld\end_inset must match\begin_inset Quotes erd\end_inset bits.