📄 node33.html

📁 Design and building parallel program
💻 HTML
📖 第 1 页 / 共 3 页
字号:
上一页 1 23
<P>
<P><A NAME=5065>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img564.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img564.gif">
<BR><STRONG>Figure 3.18:</STRONG> <em> Communications in a bidirectional MIN.  The communication
indicated at (a) involves processors connected to the same crossbar;
it takes just two hops and passes through a single switch.  The
communication at (b) takes three hops and passes through two
switches.</em><A NAME=figperfx>&#160;</A><BR>
<P>
<P>
In a unidirectional MIN, all messages
<A NAME=3879>&#160;</A>
must traverse the same number of wires, and so the cost of sending a
message is independent of processor location.  In effect, all
processors are equidistant.  In a bidirectional MIN, the number of
wires traversed depends to some extent on processor location, although
to a lesser extent than in a mesh or hypercube
(Figure <A HREF="node33.html#figperfx" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figperfx">3.18</A>).
<P>
The fact that messages destined for different destinations may need to
pass over the same wire means that MINs are not immune to competition
for bandwidth.  Nevertheless, a MIN connecting <em> P</em>
 processors
typically provides <em> P</em>
 wires at each stage, so in principle we
should be able to organize communications so that little competition
occurs.
<P>
<BR><HR>
<b> Example <IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img573.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img573.gif">.<IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img565.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img565.gif">    Competition for Bandwidth in Finite Difference</b>:<A NAME=experffd>&#160;</A>
<P>
<A NAME=3885>&#160;</A>
In the first of two examples, we consider the impact of competition
for bandwidth in an algorithm with a high degree of locality: the
one-dimensional finite difference algorithm examined in preceding
sections.  Recall from Equation <A HREF="node29.html#eq76" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node29.html#eq76">3.3</A> that according to the
idealized model of Equation <A HREF="node29.html#eq88" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node29.html#eq88">3.1</A>, the per-processor
communication costs for this algorithm are
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img566.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img566.gif"><P>
<P>
<P><A NAME=5094>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img571.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img571.gif">
<BR><STRONG>Figure 3.19:</STRONG> <em> Speedup of finite difference code with <em> N=512</em>
 and
<em> Z=5</em>
 as measured on Ethernet-connected IBM RS6000 workstations
and as predicted both by a simple performance model that does not take
into account competition for bandwidth and by a more sophisticated
model that does.  Both models assume that <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img569.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img569.gif">sec and
<IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img570.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img570.gif">sec.</em><A NAME=figfdetherspeedup2>&#160;</A><BR>
<P>
<P>
Competition for bandwidth is not an issue on a mesh or hypercube because
the one-dimensional ring-based communication structure of the finite
difference problem can be embedded in these networks using only
nearest-neighbor connections.  On a bus-based network, only one of the
<em> P</em>
 processors can communicate at one time; if we assume that in
the communication phase of the algorithm, half the processors need to
send at once (the other half are receiving), then <em> S=P/2</em>
 and the
communication volume term must be scaled by a factor of <em> P/2</em>
,
giving
<P><A NAME=eqfdb>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img572.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img572.gif"><P>
<P>
Figure <A HREF="node33.html#figfdetherspeedup2" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figfdetherspeedup2">3.19</A> illustrates both the large impact
<A NAME=3904>&#160;</A>
that bandwidth limitations can have on the performance of even a
simple algorithm such as finite difference and the improved accuracy
of the refined performance model.  The figure shows performance
measured on Ethernet-connected workstations and as predicted by
Equations <A HREF="node29.html#eq76" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node29.html#eq76">3.3</A> and <A HREF="node33.html#eqfdb" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#eqfdb">3.11</A>.  We see that the more
sophisticated model is reasonably accurate.
<P>
<BR><HR>
<P>
<A NAME=3907>&#160;</A>
<P>
<BR><HR>
<b> Example <IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img586.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img586.gif">.<IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img574.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img574.gif">    Competition for Bandwidth in Butterfly</b>:<A NAME=experfbfly>&#160;</A>
<P>
<A NAME=3910>&#160;</A>
As a second example, we consider an algorithm in which <em> P</em>
 tasks
<A NAME=3912>&#160;</A>
use the butterfly (or hypercube) communication structure
illustrated in Figure <A HREF="node18.html#figbflyN" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node18.html#figbflyN">2.14</A> to perform <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img575.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img575.gif"> exchanges of
<em> N/P</em>
 data.  The summation algorithm described in Section <A HREF="node18.html#secbfly" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node18.html#secbfly">2.4.1</A>
has this form. Other algorithms with similar characteristics are
described in Chapter <A HREF="node123.html#chapcube" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node123.html#chapcube">11</A>.
<P>
Per-processor communication costs associated with this algorithm are,
in the absence of competition for bandwidth,
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img576.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img576.gif"><P>
<P>
The algorithm can, of course, execute without competition for
bandwidth on a crossbar switch.  Somewhat less obviously, it can also
execute without competition for bandwidth on a <em> P</em>
-processor
hypercube: Computation and communication can be organized so that each
of the <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img577.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img577.gif"> processors with which a processor must communicate is
a neighbor on one of the hypercube links.  On a bus-based network,
only one processor can communicate at a time; hence, as in the finite
difference algorithm considered in Example <A HREF="node33.html#experffd" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#experffd">3.7</A>, we assume
<em> S=P/2</em>
 and from Equation <A HREF="node33.html#eqcomm1" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#eqcomm1">3.10</A> we have
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img578.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img578.gif"><P>
<P>
<P><A NAME=5133>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img579.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img579.gif">
<BR><STRONG>Figure 3.20:</STRONG> <em> Execution of the butterfly summation algorithm on an
eight-processor, one-dimensional mesh.  Shading is used to highlight a
single task and its communication partners, which are one, two, and
four hops distant.</em><A NAME=figbutt3>&#160;</A><BR>
<P>
<P>
On a mesh, the limited number of wires becomes an issue.  For example,
on a one-dimensional mesh of <em> P</em>
 processors, each processor
generates messages that must traverse 1, 2, ..., <IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img580.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img580.gif"> hops in the
<em> p</em>
 steps of the algorithm (Figure <A HREF="node33.html#figbutt3" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figbutt3">3.20</A>).  These
messages travel a total of <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img581.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img581.gif"> hops.
This represents the number of wires to which each processor requires
exclusive access during execution of the summation.  As a
one-dimensional bidirectional mesh provides only <em> 2(P-1)</em>
 wires,
we see that the parallel algorithm cannot possibly proceed in less
than <em> P/2</em>
 steps rather than <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img582.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img582.gif"> steps as supposed
previously.  In fact, it can proceed in <em> P/2</em>
 steps only if we can
define a communication schedule that keeps all wires busy all the
time.  Hence, the following model represents a lower bound on
communication costs:
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img583.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img583.gif"><P>
<P>
<P><A NAME=5156>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img584.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img584.gif">
<BR><STRONG>Figure 3.21:</STRONG> <em> Performance of parallel FFT in a spectral transform code
<A NAME=5151>&#160;</A>
on a one-dimensional mesh in Intel DELTA and on Ethernet-connected
RS/6000 processors.  The simple models do not take into account
competition for bandwidth; the refined models do, and give a better
fit to observed performance.</em><A NAME=figbflyperf1>&#160;</A><BR>
<P>
<P>
Figure <A HREF="node33.html#figbflyperf1" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figbflyperf1">3.21</A> compares observed speedups with those
<A NAME=3954>&#160;</A>
predicted by the simple and bandwidth-limited performance models on a
one-dimensional mesh and on an Ethernet.  These results are from an
<A NAME=3955>&#160;</A>
atmosphere modeling code that uses a parallel fast Fourier transform
<A NAME=3956>&#160;</A>
(FFT) to parallelize a numerical method called the spectral transform.
<A NAME=3957>&#160;</A>
The details of the numerical method are not important here; what is
relevant is that at each step, the code must perform two butterfly
communication operations (specifically, FFT) on a large array.
Details of the two experiments are given in Table <A HREF="node33.html#tabexp" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#tabexp">3.6</A>.  (The
<IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img585.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img585.gif"> term used on the DELTA is significantly smaller than in the
finite difference code of Example <A HREF="node32.html#exfdi" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node32.html#exfdi">3.6</A>; this reflects the
fact that the communication code in the FFT implementation on the
DELTA had been carefully optimized.)
<P>
<BR><HR>
<P>
<P><A NAME=4475>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img589.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img589.gif">
<BR><STRONG>Table 3.6:</STRONG>  Parameters for butterfly performance study (<em> N</em>
 in words,
times in <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img588.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img588.gif">sec).
<A NAME=tabexp>&#160;</A><BR>
<P>
<P>

<BR> <HR><a href="msgs0.htm#2" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#2"><img ALIGN=MIDDLE src="asm_color_tiny.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/asm_color_tiny.gif" alt="[DBPP]"></a>    <A NAME=tex2html2249 HREF="node32.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node32.html"><IMG ALIGN=MIDDLE ALT="previous" SRC="previous_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/previous_motif.gif"></A> <A NAME=tex2html2257 HREF="node34.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node34.html"><IMG ALIGN=MIDDLE ALT="next" SRC="next_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/next_motif.gif"></A> <A NAME=tex2html2255 HREF="node26.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node26.html"><IMG ALIGN=MIDDLE ALT="up" SRC="up_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/up_motif.gif"></A> <A NAME=tex2html2259 HREF="node1.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node1.html"><IMG ALIGN=MIDDLE ALT="contents" SRC="contents_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/contents_motif.gif"></A> <A NAME=tex2html2260 HREF="node133.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node133.html"><IMG ALIGN=MIDDLE ALT="index" SRC="index_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/index_motif.gif"></A> <a href="msgs0.htm#3" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#3"><img ALIGN=MIDDLE src="search_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/search_motif.gif" alt="[Search]"></a>   <BR>
<B> Next:</B> <A NAME=tex2html2258 HREF="node34.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node34.html">3.8 Input/Output</A>
<B>Up:</B> <A NAME=tex2html2256 HREF="node26.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node26.html">3 A Quantitative Basis for Design</A>
<B> Previous:</B> <A NAME=tex2html2250 HREF="node32.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node32.html">3.6 Evaluating Implementations</A>
<BR><HR><P>
<P><ADDRESS>
<I>&#169 Copyright 1995 by <A href="msgs0.htm#6" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#6">Ian Foster</a></I>
</ADDRESS>
</BODY>
上一页 1 23
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -