📄 node19.html

📁 Design and building parallel program
💻 HTML
📖 第 1 页 / 共 2 页
字号:
12 下一页
<html><!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN">
<!Converted with LaTeX2HTML 95.1 (Fri Jan 20 1995) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds >
<HEAD>
<TITLE>2.5 Mapping</TITLE>
</HEAD>
<BODY>
<meta name="description" value="2.5 Mapping">
<meta name="keywords" value="book">
<meta name="resource-type" value="document">
<meta name="distribution" value="global">
<P>
 <BR> <HR><a href="msgs0.htm#2" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#2"><img ALIGN=MIDDLE src="asm_color_tiny.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/asm_color_tiny.gif" alt="[DBPP]"></a>    <A NAME=tex2html2071 HREF="node18.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node18.html"><IMG ALIGN=MIDDLE ALT="previous" SRC="previous_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/previous_motif.gif"></A> <A NAME=tex2html2079 HREF="node20.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node20.html"><IMG ALIGN=MIDDLE ALT="next" SRC="next_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/next_motif.gif"></A> <A NAME=tex2html2077 HREF="node14.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node14.html"><IMG ALIGN=MIDDLE ALT="up" SRC="up_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/up_motif.gif"></A> <A NAME=tex2html2081 HREF="node1.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node1.html"><IMG ALIGN=MIDDLE ALT="contents" SRC="contents_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/contents_motif.gif"></A> <A NAME=tex2html2082 HREF="node133.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node133.html"><IMG ALIGN=MIDDLE ALT="index" SRC="index_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/index_motif.gif"></A> <a href="msgs0.htm#3" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#3"><img ALIGN=MIDDLE src="search_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/search_motif.gif" alt="[Search]"></a>   <BR>
<B> Next:</B> <A NAME=tex2html2080 HREF="node20.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node20.html">2.6 Case Study: Atmosphere Model</A>
<B>Up:</B> <A NAME=tex2html2078 HREF="node14.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node14.html">2 Designing Parallel Algorithms</A>
<B> Previous:</B> <A NAME=tex2html2072 HREF="node18.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node18.html">2.4 Agglomeration</A>
<BR><HR><P>
<H1><A NAME=SECTION02350000000000000000>2.5 Mapping</A></H1>
<P>
<A NAME=sectmap>&#160;</A>
<P>
<A NAME=1544>&#160;</A>
In the fourth and final stage of the parallel algorithm design
process, we specify where each task is to execute.  This mapping
problem does not arise on uniprocessors or on shared-memory computers
that provide automatic task scheduling.  In these computers, a set of
tasks and associated communication requirements is a sufficient
specification for a parallel algorithm; operating system or hardware
mechanisms can be relied upon to schedule executable tasks to
available processors.  Unfortunately, general-purpose mapping
mechanisms have yet to be developed for scalable parallel computers.
In general, mapping remains a difficult problem that must be
explicitly addressed when designing parallel algorithms.
<P>
Our goal in developing mapping algorithms is normally to minimize
total execution time.  We use two strategies to achieve this goal:
<OL><LI>
We place tasks that are able to execute concurrently on <em>
different
 </em> processors, so as to enhance concurrency.
<LI>
We place tasks that communicate frequently on the <em> same
 </em>
processor, so as to increase locality.
</OL>
Clearly, these two strategies will sometimes conflict, in which case
our design will involve tradeoffs.  In addition, resource limitations
may restrict the number of tasks that can be placed on a single
processor.
<P>
The mapping problem is known to be <em> NP
 -complete</em>, meaning that no
computationally tractable (polynomial-time) algorithm can exist for
evaluating these tradeoffs in the general case.  However, considerable
knowledge has been gained on specialized strategies and heuristics and
the classes of problem for which they are effective.  In this
section, we provide a rough classification of problems and present
some representative techniques.
<P>
<P><A NAME=2667>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img240.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img240.gif">
<BR><STRONG>Figure 2.16:</STRONG> <em> Mapping in a grid problem in which each task performs the
same amount of computation and communicates only with its four
neighbors.  The heavy dashed lines delineate processor boundaries.
The grid and associated computation is partitioned to give each
processor the same amount of computation and to minimize off-processor
communication.</em><A NAME=figmap1>&#160;</A><BR>
<P>
<P>
Many algorithms developed using domain decomposition techniques
feature a fixed number of equal-sized tasks and structured local and
global communication.  In such cases, an efficient mapping is
straightforward.  We map tasks in a way that minimizes interprocessor
communication (Figure <A HREF="node19.html#figmap1" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node19.html#figmap1">2.16</A>); we may also choose to agglomerate
tasks mapped to the same processor, if this has not already been done,
to yield a total of <em> P</em>
 coarse-grained tasks, one per processor.
<P>
In more complex domain decomposition-based algorithms with variable
amounts of work per task and/or unstructured communication patterns,
efficient agglomeration and mapping strategies may not be obvious to
the programmer.  Hence, we may employ
<em> load balancing
 </em> algorithms that seek to identify efficient
agglomeration and mapping strategies, typically by using heuristic
techniques.  The time required to execute these algorithms must be
weighed against the benefits of reduced execution time.  
<em> Probabilistic load-balancing
 </em> methods tend to have lower overhead
<A NAME=1558>&#160;</A>
than do methods that exploit structure in an application.
<P>
The most complex problems are those in which either the number of
tasks or the amount of computation or communication per task changes
dynamically during program execution.  In the case of problems
developed using domain decomposition techniques, we may use a <em>
<A NAME=1559>&#160;</A>
dynamic load-balancing
 </em> strategy in which a load-balancing
algorithm is executed periodically to determine a new agglomeration
and mapping.  Because load balancing must be performed many times
during program execution, 
<A NAME=1560>&#160;</A>
<em> local
 </em> algorithms may be preferred that do not require
global knowledge of computation state.
<P>
Algorithms based on functional decomposition often yield computations
consisting of many short-lived tasks that coordinate with other tasks
only at the start and end of execution.  In this case, we can
<A NAME=1562>&#160;</A>
use <em> task-scheduling
 </em> algorithms, which allocate tasks to
processors that are idle or that are likely to become idle.
<P>
<H2><A NAME=SECTION02351000000000000000>2.5.1 Load-Balancing Algorithms</A></H2>
<P>
<A NAME=seclbalgs>&#160;</A>
<P>
A wide variety of both general-purpose and application-specific
load-balancing techniques have been proposed for use in parallel
algorithms based on domain decomposition techniques.  We review
several representative approaches here (the chapter notes provide
references to other methods), namely recursive bisection methods,
local algorithms, probabilistic methods, and cyclic mappings.  These
techniques are all intended to agglomerate fine-grained tasks defined
in an initial partition to yield one coarse-grained task per
processor.  Alternatively, we can think of them as partitioning our
computational domain to yield one subdomain
<A NAME=1566>&#160;</A>
for each processor. For this reason, they are often referred to as
<em> partitioning
 </em> algorithms.
<P>
<H4><A NAME=SECTION02351010000000000000> Recursive Bisection.</A></H4>
<P>
<A NAME=1569>&#160;</A>
Recursive bisection techniques are used to partition a domain (e.g., a
<A NAME=1570>&#160;</A>
finite element grid) into subdomains of approximately equal
computational cost while attempting to minimize communication costs,
that is, the number of channels crossing task boundaries.  A
divide-and-conquer approach is taken.  The domain is first cut in one
dimension to yield two subdomains.  Cuts are then made recursively in
the new subdomains until we have as many subdomains as we require
tasks.  Notice that this recursive strategy allows the
partitioning algorithm itself to be executed in parallel.
<P>
The most straightforward of the recursive bisection techniques is <em>
<A NAME=1571>&#160;</A>
recursive coordinate bisection</em>, which is normally applied to
irregular grids that have a mostly local communication structure.
This technique makes cuts based on the physical coordinates of grid
points in the domain, at each step subdividing along the longer
dimension so that if (for example) the cut is made along the
<em> x</em>
 dimension, grid points in one subdomain will all have an
<em> x</em>
-coordinate greater than grid points in the other.  This
approach has the advantages of being simple and inexpensive.  It also
does a good job of partitioning computation.  A disadvantage is that
it does not optimize communication performance.  In particular, it can
generate long, skinny subdomains, which if an algorithm has
significant local communication will result in more messages than will
a decomposition that generates square subdomains.
<P>
A variant of recursive bisection called <em> unbalanced recursive
<A NAME=1574>&#160;</A>
bisection
 </em> attempts to reduce communication costs by forming
<A NAME=1575>&#160;</A>
subgrids that have better aspect ratios.  Instead of automatically
dividing a grid in half, it considers the <em> P-1</em>
 partitions
obtained by forming unbalanced subgrids with <em> 1/P</em>
 and
<em> (P-1)/P</em>
 of the load, with <em> 2/P</em>
 and <em> (P-2)/P</em>
 of the
load, and so on, and chooses the partition that minimizes partition
aspect ratio.  This method increases the cost of computing the
partition but can reduce communication costs.
<A HREF="#plassman">Plate 1</A>

<P>
shows a mapping onto 64 processors constructed by using unbalanced
recursive bisection.  In this instance, the grid in question is an
irregular finite element mesh generated for a superconductivity
simulation.
<P>
<P><HR>
<A NAME=plassman HREF="msgs0.htm#25" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#25"> <img
ALIGN=MIDDLE src="part_small.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/part_small.gif"></A>
<P>
(GIF <A HREF="msgs0.htm#25" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#25">235573</A> bytes; RGB <A
HREF="msgs0.htm#26" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#26">1142568</A> bytes.)
Plate 1: The unbalanced recursive bisection algorithm, applied here to a
superconductivity simulation in which increased computational load
corresponds to an increased number of triangular elements in certain
areas of the grid.  The recursive partitioning yields sixty four
subdomains, with for example the first partition descending vertically
between subdomains 28 and 5.  Image courtesy of P. Plassmann.
<P><HR>

<P>
<A NAME=1587>&#160;</A>
Another technique, called <em> recursive graph bisection
 </em>, can be
useful in the case of more complex unstructured grids, for example,
finite element meshes.  This technique uses connectivity information
to reduce the number of grid edges crossing subdomain boundaries, and
hence to reduce communication requirements.  A grid is treated as a
graph with <em> N</em>
 vertices (grid points) <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img241.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img241.gif">.  The algorithm first
identifies the two extremities of the graph, that is, the two vertices
that are the most separated in terms of graph distance.  (The graph
distance between two vertices is the smallest number of edges that
must be traversed to go between them.)  Each vertex is then assigned
to the subdomain corresponding to the closer extremity.  Another
<A NAME=1590>&#160;</A>
algorithm called <em> recursive spectral bisection
 </em> is even
better in many circumstances (see the chapter notes for references).
<A HREF="#johan">Plate 2</A>

<P>
<A NAME=1596>&#160;</A>
shows a partition computed using the latter algorithm for the grid of
Figure <A HREF="node17.html#figfem" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node17.html#figfem">2.9</A>.
<P>
<P><HR>
<A NAME=johan HREF="asm_color.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/asm_color.gif"> <img
ALIGN=MIDDLE src="asm_color_small.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/asm_color_small.gif"></A>
<P>
(GIF <A HREF="asm_color.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/asm_color.gif">34643</A> bytes; RGB <A
HREF="asm_color.rgb" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/asm_color.rgb">131183</A> bytes.)
Plate 2: The spectral
bisection partitioning algorithm applied to a finite element mesh
generated for an assembly part.  Image courtesy of Z. Johan.
<P><HR>

<P>
<A NAME=1600>&#160;</A>
<P>
<P><A NAME=2691>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img242.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img242.gif">
<BR><STRONG>Figure 2.17:</STRONG> <em> Load balancing in a grid problem.  Variable numbers of
grid points are placed on each processor so as to compensate for load
imbalances.  This sort of load distribution may arise if a local
load-balancing scheme is used in which tasks exchange load information
with neighbors and transfer grid points when load imbalances are
detected.</em><A NAME=figload2>&#160;</A><BR>
<P><H4><A NAME=SECTION02351020000000000000> Local Algorithms.</A></H4>
<P>
<A NAME=1606>&#160;</A>
The techniques just described are relatively expensive because they
require global knowledge of computation state.  In contrast, local
12 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -