<!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN">
<HTML>
<!-- Converted with LaTeX2HTML 95.1 (Fri Jan 20 1995) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HEAD>
<TITLE>2.4 Agglomeration</TITLE>
</HEAD>
<BODY>
<meta name="description" content="2.4 Agglomeration">
<meta name="keywords" content="book">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<P>
 <BR> <HR><a href="msgs0.htm#2"><img ALIGN=MIDDLE src="asm_color_tiny.gif" alt="[DBPP]"></a>    <A NAME=tex2html2059 HREF="node17.html"><IMG ALIGN=MIDDLE ALT="previous" SRC="previous_motif.gif"></A> <A NAME=tex2html2067 HREF="node19.html"><IMG ALIGN=MIDDLE ALT="next" SRC="next_motif.gif"></A> <A NAME=tex2html2065 HREF="node14.html"><IMG ALIGN=MIDDLE ALT="up" SRC="up_motif.gif"></A> <A NAME=tex2html2069 HREF="node1.html"><IMG ALIGN=MIDDLE ALT="contents" SRC="contents_motif.gif"></A> <A NAME=tex2html2070 HREF="node133.html"><IMG ALIGN=MIDDLE ALT="index" SRC="index_motif.gif"></A> <a href="msgs0.htm#3"><img ALIGN=MIDDLE src="search_motif.gif" alt="[Search]"></a>   <BR>
<B> Next:</B> <A NAME=tex2html2068 HREF="node19.html">2.5 Mapping</A>
<B>Up:</B> <A NAME=tex2html2066 HREF="node14.html">2 Designing Parallel Algorithms</A>
<B> Previous:</B> <A NAME=tex2html2060 HREF="node17.html">2.3 Communication</A>
<BR><HR><P>
<H1><A NAME=SECTION02340000000000000000>2.4 Agglomeration</A></H1>
<P>
<A NAME=1453>&#160;</A>
In the first two stages of the design process, we partitioned the
computation to be performed into a set of tasks and introduced
communication to provide data required by these tasks.  The resulting
algorithm is still abstract in the sense that it is not specialized
for efficient execution on any particular parallel computer.  In fact,
it may be highly inefficient if, for example, it creates many more
tasks than there are processors on the target computer and this
computer is not designed for efficient execution of small tasks.
<P>
In the third stage, <em> agglomeration,</em> we move from the abstract
toward the concrete.  We revisit decisions made in the partitioning
and communication phases with a view to obtaining an algorithm that
will execute efficiently on some class of parallel computer.  In
particular, we consider whether it is useful to combine, or <em>
agglomerate</em>, tasks identified by the partitioning phase, so as to
provide a smaller number of tasks, each of greater size
(Figure <A HREF="node18.html#figagglom">2.11</A>).  We also determine whether it is worthwhile
to <em> replicate
 </em> data and/or computation.
<P>
<P><A NAME=2573>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img208.gif">
<BR><STRONG>Figure 2.11:</STRONG> <em> Examples of agglomeration.  In (a), the size of tasks is
increased by reducing the dimension of the decomposition from three to
two.  In (b), adjacent tasks are combined to yield a three-dimensional
decomposition of higher granularity.  In (c), subtrees in a
divide-and-conquer structure are coalesced.  In (d), nodes in a tree
algorithm are combined.</em><A NAME=figagglom>&#160;</A><BR>
<P>
<P>
The number of tasks yielded by the agglomeration phase, although
reduced, may still be greater than the number of processors.  In this
case, our design remains somewhat abstract, since issues relating to
the mapping of tasks to processors remain unresolved.  Alternatively,
we may choose during the agglomeration phase to reduce the number of
tasks to exactly one per processor.  We might do this, for example,
because our target parallel computer or program
<A NAME=1462>&#160;</A>
development environment demands an SPMD program.  In this case, our
design is already largely complete, since in defining <em> P</em>
 tasks
that will execute on <em> P</em>
 processors, we have also addressed the
mapping problem.  In this section, we focus on general issues that
arise when
<A NAME=1465>&#160;</A>
increasing task granularity. Specific issues relating to the
generation of SPMD programs are discussed in Section <A HREF="node19.html#sectmap">2.5</A>.
<P>
Three sometimes-conflicting goals guide decisions concerning
agglomeration and replication: reducing communication costs by
increasing computation and communication <em> granularity</em>, retaining
<em> flexibility
 </em> with respect to scalability and mapping
decisions, and reducing <em> software engineering
 </em> costs.  These
goals are discussed in the next three subsections.
<P>
<H2><A NAME=SECTION02341000000000000000>2.4.1 Increasing Granularity</A></H2>
<P>
<A NAME=secbfly>&#160;</A>
<P>
In the partitioning phase of the design process, our efforts are
focused on defining as many tasks as possible.  This is a useful
discipline because it forces us to consider a wide range of
opportunities for parallel execution.  We note, however, that defining
a large number of fine-grained tasks does not necessarily produce an
efficient parallel algorithm.
<P>
One critical issue influencing parallel performance is communication
costs.  On most parallel computers, we have to stop computing in order
to send and receive messages.  Because we typically would rather be
computing, we can improve performance by reducing the amount of time
spent communicating.  Clearly, this performance improvement can be
achieved by sending less data.  Perhaps less obviously, it can also be
achieved by using fewer messages, even if we send the same amount of
data.  This is because each communication incurs not only a cost
proportional to the amount of data transferred but also a fixed
startup cost.  These issues are discussed in detail in
Chapter <A HREF="node26.html#chapperf">3</A>, where we use analytic models to quantify
communication costs.
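<P>
As a rough illustration of why message count matters, the following Python sketch uses a simple linear cost model; the parameter values <em>t_s</em> and <em>t_w</em> are illustrative assumptions, not measurements, and Chapter 3 develops such models properly:

```python
# Illustrative cost model (assumed parameters, not measured values):
# each message costs a fixed startup time t_s plus t_w per data word.
def transfer_time(num_messages, words_per_message, t_s=1e-4, t_w=1e-8):
    """Total time to send num_messages messages of the given size."""
    return num_messages * (t_s + t_w * words_per_message)

# Sending 1,000,000 words as 1000 messages vs. as a single message:
many = transfer_time(1000, 1000)    # 1000 startup costs
one = transfer_time(1, 1_000_000)   # 1 startup cost, same total data
assert one < many                   # fewer messages win despite equal data
```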
<P>
In addition to communication costs, we may need to be concerned with
task creation costs.  For example, the performance of the fine-grained
search algorithm illustrated in Figure <A HREF="node10.html#figtree">1.13</A>, which creates
one task for each search tree node, is sensitive to task creation
costs.
<P>
<A NAME=1474>&#160;</A>
<P>
<P><A NAME=2590>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img221.gif">
<BR><STRONG>Figure 2.12:</STRONG> <em> Effect of increased granularity on communication costs in
a two-dimensional finite difference problem with a five-point stencil.
The figure shows fine- and coarse-grained two-dimensional partitions
of this problem.  In each case, a single task is exploded to show its
outgoing messages (dark shading) and incoming messages (light
shading).  In (a), a computation on an <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img215.gif"> grid is partitioned
into <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img216.gif"> tasks, each responsible for a single point, while
in (b) the same computation is partitioned into <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img217.gif"> tasks, each
responsible for 16 points.  In (a), <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img218.gif"> communications
are required, 4 per task; these transfer a total of 256 data values.
In (b), only <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img219.gif"> communications are required, and only
<IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img220.gif"> data values are transferred.</em><A NAME=figgran>&#160;</A><BR>
<P>
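<P>
The message and data-volume counts quoted in the caption of Figure 2.12 can be checked with a short Python sketch; edge effects are ignored, i.e., every task is assumed to have four neighbors, matching the counts in the caption:

```python
def comm_costs(grid_n, tasks_per_dim):
    """Messages and data values exchanged per step for a five-point stencil
    on a grid_n x grid_n grid partitioned into tasks_per_dim**2 square tasks.
    Edge effects are ignored: every task is assumed to have four neighbors."""
    sub = grid_n // tasks_per_dim   # side length of each square subdomain
    tasks = tasks_per_dim ** 2
    messages = 4 * tasks            # one message per neighbor per task
    values = messages * sub         # each message carries one subdomain edge
    return messages, values

assert comm_costs(8, 8) == (256, 256)  # fine grain: one point per task
assert comm_costs(8, 2) == (16, 64)    # coarse grain: 16 points per task
```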
<P>
<H4><A NAME=SECTION02341010000000000000> Surface-to-Volume Effects.</A></H4>
<P>
If the number of communication partners per task is small, we can
often reduce both the number of communication operations and the total
communication volume by increasing the granularity of our
<A NAME=1481>&#160;</A>
partition, that is, by agglomerating several tasks into one.  This
effect is illustrated in Figure <A HREF="node18.html#figgran">2.12</A>.  In this figure, the
<A NAME=1483>&#160;</A>
reduction in communication costs is due to a <em> surface-to-volume
effect
 </em>. In other words, the communication requirements of a task
are proportional to the surface of the subdomain on which it operates,
while the computation requirements are proportional to the subdomain's
volume.  In a two-dimensional problem, the ``surface'' scales with the
problem size while the ``volume'' scales as the problem size squared.
Hence, the amount of communication performed for a unit of computation
(the
<A NAME=1485>&#160;</A>
<em> communication/computation ratio
 </em>) decreases as task size
increases.  This effect is often visible when a partition is obtained
by using domain decomposition techniques.
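<P>
A minimal sketch of the surface-to-volume effect for square subdomains follows; the 4<em>n</em>-point perimeter and <em>n</em>-squared-point area are idealizations that ignore boundary tasks:

```python
def comm_to_comp_ratio(n):
    """Surface-to-volume effect for an n x n subdomain of a 2-D stencil:
    communication scales with the perimeter (4n points), while
    computation scales with the area (n*n points)."""
    return (4 * n) / (n * n)

# The ratio falls as task size grows, so larger tasks
# perform relatively less communication per unit of computation.
assert comm_to_comp_ratio(4) == 1.0
assert comm_to_comp_ratio(16) == 0.25
assert comm_to_comp_ratio(16) < comm_to_comp_ratio(4)
```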
<P>
<A NAME=1487>&#160;</A>
A consequence of surface-to-volume effects is that higher-dimensional
decompositions are typically the most efficient, other things being
equal, because they reduce the surface area (communication) required for a
given volume (computation).  Hence, from the viewpoint of efficiency
it is usually best to increase granularity by agglomerating tasks in
all dimensions rather than reducing the dimension of the
decomposition.  This issue is explored quantitatively in
Example <A HREF="node30.html#experfgrid">3.4</A> in Chapter <A HREF="node26.html#chapperf">3</A>.
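<P>
To see why, compare the data volume that a single interior task exchanges under a 1-D (strip) versus a 2-D (block) decomposition; this hypothetical helper assumes a square grid and a square number of tasks:

```python
import math

def per_task_comm(n, p, dims):
    """Data values a single interior task sends per step on an n x n grid
    with a five-point stencil, for a 1-D (strip) or 2-D (block)
    decomposition into p tasks (p assumed a perfect square when dims == 2)."""
    if dims == 1:
        return 2 * n                    # two full rows of n values each
    side = n // math.isqrt(p)           # edge length of each square block
    return 4 * side                     # four block edges

# With 16 tasks on a 64 x 64 grid, the 2-D decomposition communicates less:
assert per_task_comm(64, 16, dims=1) == 128
assert per_task_comm(64, 16, dims=2) == 64
```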
<P>
The design of an efficient agglomeration strategy can be difficult in
problems with unstructured communications, such as the finite element
problem of Figure <A HREF="node17.html#figfem">2.9</A>.  Specialized techniques that can be
used in such cases are discussed in Section <A HREF="node19.html#seclbalgs">2.5.1</A>.
<P>
<H4><A NAME=SECTION02341020000000000000> Replicating Computation.</A></H4>
<P>
<A NAME=1493>&#160;</A>
We can sometimes trade off replicated computation for reduced
communication requirements and/or execution time.  For an example, we
consider a variant of the summation problem presented in
Section <A HREF="node17.html#seccommglobal">2.3.2</A>, in which the sum must be replicated in
each of the <em> N</em>
 tasks that contribute to the sum.
<P>
<P><A NAME=2609>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img226.gif">
<BR><STRONG>Figure 2.13:</STRONG> <em> Using an array (above) and a tree (below) to perform a
summation and a broadcast.  On the left are the communications
performed for the summation (s); on the right, the communications
performed for the broadcast (b).  After <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img224.gif"> or <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img225.gif"> steps,
respectively, the sum of the <b>N</b> values is replicated in each of the
<b>N</b> tasks.</em><A NAME=figsb>&#160;</A><BR>
<P>
<P>
A simple approach to distributing the sum is first to use either a
ring- or tree-based algorithm to compute the sum in a single task, and
then to <em> broadcast
 </em> the sum to each of the
<em> N</em>
 tasks.  The broadcast can be performed using the same
communication structure as the summation; hence, the complete
operation can be performed in either <em> 2(N-1)</em>
 or <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img227.gif"> steps,
depending on which communication structure is used
(Figure <A HREF="node18.html#figsb">2.13</A>).
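<P>
The step counts for the two structures can be transcribed directly from the formulas above; in this sketch, <em>n</em> is assumed to be a power of two for the tree case:

```python
import math

def sum_then_broadcast_steps(n, structure):
    """Steps to accumulate a sum into a single task and then broadcast it
    back to all n tasks, using the same communication structure both ways."""
    if structure == "ring":             # array/ring: N-1 steps each way
        return 2 * (n - 1)
    if structure == "tree":             # binary tree: log2(N) steps each way
        return 2 * int(math.log2(n))
    raise ValueError(f"unknown structure: {structure}")

assert sum_then_broadcast_steps(8, "ring") == 14  # 2(N-1)
assert sum_then_broadcast_steps(8, "tree") == 6   # 2 log2 N
```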
<P>
These algorithms are optimal in the sense that they do not perform any
unnecessary computation or communication.  However, there also exist
alternative algorithms that execute in less elapsed time, although at
the expense of unnecessary (replicated) computation and communication.
The basic idea is to perform multiple summations concurrently, with
each concurrent summation producing a value in a different task.
<P>
<A NAME=1504>&#160;</A>
We first consider a variant of the array summation algorithm based on
this idea.  In this variant, tasks are connected in a <em> ring
 </em>
rather than an array, and all <em> N</em>
 tasks execute the same algorithm
so that <em> N</em>
 partial sums are in motion simultaneously.  After <em>N-1</em> steps, the complete sum is replicated in every task.
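<P>
This ring variant can be simulated in a few lines of Python; each list index plays the role of one task, and the synchronous loop iterations model the simultaneous sends:

```python
def ring_replicated_sum(values):
    """All N tasks run the same algorithm: in each of N-1 steps, every task
    forwards the value it last received to its ring successor and adds the
    value it receives to its partial sum.  Afterwards, the complete sum is
    replicated in every task."""
    n = len(values)
    partial = list(values)    # partial[i] is task i's running sum
    incoming = list(values)   # incoming[i] is the value task i just received
    for _ in range(n - 1):
        # All tasks communicate simultaneously: task i receives from task i-1.
        incoming = [incoming[(i - 1) % n] for i in range(n)]
        partial = [partial[i] + incoming[i] for i in range(n)]
    return partial

assert ring_replicated_sum([1, 2, 3, 4]) == [10, 10, 10, 10]
```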
