📄 http:^^www.cs.cornell.edu^info^people^nikosc^projects^prema^amr^load-balancing^report.html
字号:
MIME-Version: 1.0
Server: CERN/3.0
Date: Monday, 16-Dec-96 21:40:18 GMT
Content-Type: text/html
Content-Length: 10987
Last-Modified: Thursday, 24-Oct-96 23:53:42 GMT
<BODY><TITLE> Mobile Object Layer for Dynamic and Irregular Computations </TITLE><center><H2> Mobile Object Layer for Dynamic and Irregular Computations </H2><H2> Nikos Chrisochoides and Chris Hawblitzel </H2></center><A HREF="../description/index.html"> The motivation for this work is the implementation of runtime support for dynamic and irregular computations. </A> The work consisted of two parts:</P><UL><LI>Implementing a runtime library ("Mobile Objects") to handle migration ofobjects from oneprocessor to another, and to handle the communication between these objects<LI>Parallelizing an adaptive mesh refinement (AMR) program on top of MobileObjects and PORTS</UL>The AMR algorithm starts with a uniform mesh (on which a pde can be solved)and recursively refines areas of the mesh that need a finer mesh to achievethe desired level of accuracy in the pde solution. We cannot tell untilruntime which areas will need refining, so some sort of dynamic loadbalancing is necessary in a parallel implementation of AMR. Our approachwas to break up the mesh into small pieces called "Grid Components", andthen balance the load by transferring grid components from heavily loadedprocessors to lightly loaded processors. In contrast to approachesinvolving centralized decision making, our system is completely decentralized.No collective communication is required, and processors do not have towait for centralized decisions to be made before proceeding on withtheir work.</P>The AMR algorithm on grid componentsas follows: The mesh starts out as a single root grid component. If anyareas of this grid component need refining, the root grid component spawnsmany smaller child grid components which have finer meshes. The childrencan then spawn their own children recursively, forming a tree ofgrid components.</P>In order to balance the load, grid components can move from oneprocessor to another. When this happens, pointers between gridcomponents must remain valid. With conventional global pointersconsisting of <processor, address> pairs, if the processor or theaddress of an object change, all global pointers to that object becomeinvalid. To deal with this, we use "mobile pointers" which remainvalid even when objects move. To keep track of mobile pointers, eachprocessor has a directory which it uses to hold the location of mobileobjects. The entries in the directory may not be current, so messagescan be sent out to the "best guess" at where the object resides, andthe messages will be forwarded to the true location of the object.</P>The current interface to the Mobile Objects layer is contained in mobile.h. Mobile pointers are implemented as a structure containing the number of the processor that created theobject, and an index number which is unique on that processor (inaddition, an epoch number is used to guard against stale data). Themembers of this structure form a unique name for every mobile objectin the system. A directory consists of a set of tables, where eachtable holds information about objects originating at one processor.To send a message to an object specified by a mobile pointer, aprocessor checks the table corresponding to the originating processorof the mobile pointer. It uses the index field of the mobile pointerto look up a specific entry in this table. Each entry holds theprocessor at which the object can be located (if the object is localto a processor, the entry also holds the local memory address of theobject). This entry may not be the true current location of theobject, but is the "best guess" that the processor has as to where theobject resides (if there is no entry in the table, then theoriginating processor field of the mobile pointer serves as the bestguess location of the object). Once an entry has been looked up inthe directory, a message can be sent to the mobile object on a remoteprocessor. If it turns out that the object moved and is no longerlocated at the processor specified in the directory entry, the messageis automatically forwarded (possibly multiple times) to the correctdestination. The directory entry can later be updated with morecurrent information, so that subsequent messages sent to the mobileobject go directly to the correct destination.</P>The two functions <code>mob_ObjReq</code> and <code>mob_MsgReq</code>form the core of the Mobile Objects communication interface. Anapplication can call <code>mob_ObjReq</code> to send out a "requestfor object" from one processor to another. This request invokes auser handler at the remote processor which selects an object and sendsthe object back to the requesting processor. The handler uninstallsthe object from the remote processor by calling<code>mob_uninstallObj</code>. This function takes the new locationof the object as an argument, so that the remote processor knows whereto forward incoming messages. When the object arrives at therequesting processor, the application installs the object with<code>mob_installObj</code>. To send a message to an object, theapplication calls <code>mob_MsgReq</code>. This sends a small messageto the processor that holds the object specified by a mobile pointer.When the message arrives, a user handler is invoked to perform anaction using the object (such as sending a large reply message back).</P>The current implementation of <code>mob_ObjReq</code> and<code>mob_MsgReq</code> uses the PORTS functions<code>ports1_put</code> and <code>ports1_rsr_handler</code>. Eachprocessor has one queue for each other processor in the system to holdincoming messages. Messages data is into one of the queues at thedestination processor using the PORTS function<code>ports1_put</code>. The put is followed by a<code>ports1_rsr_handler</code>, which invokes a handler at thedestination to process the message. As the destination processes theincoming messages, it sends replies back to free up space in thequeue. The communication interface is a compromise in that it onlyhandles buffering and forwarding for the small, fixed sized messagessent out by <code>mob_MsgReq</code>. An interface that also handledbuffering and forwarding for large messages would be easier to use butmore difficult to implement (I think arbitrary sized message bufferingand forwarding would be worth implementing but it is beyond the scopeof this project).The current implementation of Mobile Objects is mobile.c (there are still a few features that are unimplemented in this). As a simple example of how Mobile Objects is used, a small test file is provided.</P>The parallelized AMR code is contained in the directory (the files <A HREF="main.c">main.c</A>, <A HREF="amr0.c">amr0.c</A>,<A HREF="cluster.c>cluster.c</A>, and<A HREF="grid_level.h">grid_level.h</A> are good places to look). TheAMR program creates one mobile object per grid component to handle thegrid component data, and uses one thread for each grid component tohandle control. The code does only the mesh refinement and treeconstruction - no equations are solved on the mesh as it is constructed.</P>One of the most attractive aspects of the threads/mobile objects approachis that it is easy to experiment with different load balancing strategiesin an application without drastically altering the application code. Thecurrent policy is fairly simple. The processors are organized in a grid(actually a 2-d ring). When the number of threads on a processordrops below a certain value (currently 8), the processor sends out requeststo all of its neighbors asking for more grid components. If a processorreceives a request for grid components, it checks its list of availablegrid components to see if it has any work to send out. If it does, itsends out the grid component whose position is the furthest towards theposition of the requesting processor (for instance, if the the request comesfrom the processor on the left, the leftmost grid component is sent out).<H3>Timing Results</H3>Timing measurements of the AMR code were made on 4 processors on theSP-2. The following plots show how much time was spent by eachprocessor doing useful work, and how much time went into communicationand thread management. These measurements are shown as functions oftime, so that it is apparent how the balance of computation andcommunication change as the mesh refinement progresses. Thecommunication times shown include all overhead associated with loadbalancing, including handler execution and packing and unpacking ofobjects. </P>(Note: you may want to make your web browser wide enough to displaytwo plots side by side)</P><IMG SRC="times0.gif"><IMG SRC="athreads0.gif"><IMG SRC="times1.gif"><IMG SRC="athreads1.gif"><IMG SRC="times2.gif"><IMG SRC="athreads2.gif"><IMG SRC="times3.gif"><IMG SRC="athreads3.gif"></P>The best performance is obtained early in the computation, becausesmall objects near the root of the grid component tree quickly spawnlarge amounts of work for the processors. A processor spendsrelatively little time fetching these components, and a lot of timedoing useful work on the components and the components' decendants.However, as the computation progresses, the processors spend more andmore time fetching components, because the components are near thebottom of the grid component tree and therefore lead to relativelylittle work. Processors 0 and 3 struggle to find enough gridcomponents to keep them busy near the end of the refinement. Althoughthe processors do have at least some work to do during most of thecomputation (they request more work before running completely runningout of threads), the resultant communication overhead leads to lowperformance during this period of time on these two processors. Thiscommunication also has an impact on processors 1 and 2, which mustservice the requests for work. </P>As the above plots show, AMR is difficult to load balance, because ofthe explosive growth of the grid component tree in unpredictableplaces. However, the above tests are somewhat of a worst-casesituation, because they only test one part of one time step of a realAMR application. In an application using AMR, the grid component treeis similar in structure from one time step to the next (in fact, itmay be held fixed for several time steps), so refinementsare no longer completely unpredictable and load balancing can occurmore gradually over many time steps. In contrast to centralizedalgorithms that must completely redistribute the load when the meshchanges significantly, an AMR implementation based on threads andmoving objects should be able to incrementally balance the load,preserving data locality and holding down communication costs. Whilewe have not implemented this yet, the tools built here should providean easy platform on which to construct a full AMR application.</BODY>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -