📄 node33.html
字号:
<P>
<P><A NAME=5065> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img564.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img564.gif">
<BR><STRONG>Figure 3.18:</STRONG> <em> Communications in a bidirectional MIN. The communication
indicated at (a) involves processors connected to the same crossbar;
it takes just two hops and passes through a single switch. The
communication at (b) takes three hops and passes through two
switches.</em><A NAME=figperfx> </A><BR>
<P>
<P>
In a unidirectional MIN, all messages
<A NAME=3879> </A>
must traverse the same number of wires, and so the cost of sending a
message is independent of processor location. In effect, all
processors are equidistant. In a bidirectional MIN, the number of
wires traversed depends to some extent on processor location, although
to a lesser extent than in a mesh or hypercube
(Figure <A HREF="node33.html#figperfx" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figperfx">3.18</A>).
<P>
The fact that messages destined for different destinations may need to
pass over the same wire means that MINs are not immune to competition
for bandwidth. Nevertheless, a MIN connecting <em> P</em>
processors
typically provides <em> P</em>
wires at each stage, so in principle we
should be able to organize communications so that little competition
occurs.
<P>
<BR><HR>
<b> Example <IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img573.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img573.gif">.<IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img565.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img565.gif"> Competition for Bandwidth in Finite Difference</b>:<A NAME=experffd> </A>
<P>
<A NAME=3885> </A>
In the first of two examples, we consider the impact of competition
for bandwidth in an algorithm with a high degree of locality: the
one-dimensional finite difference algorithm examined in preceding
sections. Recall from Equation <A HREF="node29.html#eq76" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node29.html#eq76">3.3</A> that according to the
idealized model of Equation <A HREF="node29.html#eq88" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node29.html#eq88">3.1</A>, the per-processor
communication costs for this algorithm are
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img566.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img566.gif"><P>
<P>
<P><A NAME=5094> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img571.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img571.gif">
<BR><STRONG>Figure 3.19:</STRONG> <em> Speedup of finite difference code with <em> N=512</em>
and
<em> Z=5</em>
as measured on Ethernet-connected IBM RS6000 workstations
and as predicted both by a simple performance model that does not take
into account competition for bandwidth and by a more sophisticated
model that does. Both models assume that <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img569.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img569.gif">sec and
<IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img570.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img570.gif">sec.</em><A NAME=figfdetherspeedup2> </A><BR>
<P>
<P>
Competition for bandwidth is not an issue on a mesh or hypercube because
the one-dimensional ring-based communication structure of the finite
difference problem can be embedded in these networks using only
nearest-neighbor connections. On a bus-based network, only one of the
<em> P</em>
processors can communicate at one time; if we assume that in
the communication phase of the algorithm, half the processors need to
send at once (the other half are receiving), then <em> S=P/2</em>
and the
communication volume term must be scaled by a factor of <em> P/2</em>
,
giving
<P><A NAME=eqfdb> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img572.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img572.gif"><P>
<P>
Figure <A HREF="node33.html#figfdetherspeedup2" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figfdetherspeedup2">3.19</A> illustrates both the large impact
<A NAME=3904> </A>
that bandwidth limitations can have on the performance of even a
simple algorithm such as finite difference and the improved accuracy
of the refined performance model. The figure shows performance
measured on Ethernet-connected workstations and as predicted by
Equations <A HREF="node29.html#eq76" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node29.html#eq76">3.3</A> and <A HREF="node33.html#eqfdb" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#eqfdb">3.11</A>. We see that the more
sophisticated model is reasonably accurate.
<P>
<BR><HR>
<P>
<A NAME=3907> </A>
<P>
<BR><HR>
<b> Example <IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img586.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img586.gif">.<IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img574.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img574.gif"> Competition for Bandwidth in Butterfly</b>:<A NAME=experfbfly> </A>
<P>
<A NAME=3910> </A>
As a second example, we consider an algorithm in which <em> P</em>
tasks
<A NAME=3912> </A>
use the butterfly (or hypercube) communication structure
illustrated in Figure <A HREF="node18.html#figbflyN" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node18.html#figbflyN">2.14</A> to perform <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img575.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img575.gif"> exchanges of
<em> N/P</em>
data. The summation algorithm described in Section <A HREF="node18.html#secbfly" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node18.html#secbfly">2.4.1</A>
has this form. Other algorithms with similar characteristics are
described in Chapter <A HREF="node123.html#chapcube" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node123.html#chapcube">11</A>.
<P>
Per-processor communication costs associated with this algorithm are,
in the absence of competition for bandwidth,
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img576.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img576.gif"><P>
<P>
The algorithm can, of course, execute without competition for
bandwidth on a crossbar switch. Somewhat less obviously, it can also
execute without competition for bandwidth on a <em> P</em>
-processor
hypercube: Computation and communication can be organized so that each
of the <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img577.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img577.gif"> processors with which a processor must communicate is
a neighbor on one of the hypercube links. On a bus-based network,
only one processor can communicate at a time; hence, as in the finite
difference algorithm considered in Example <A HREF="node33.html#experffd" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#experffd">3.7</A>, we assume
<em> S=P/2</em>
and from Equation <A HREF="node33.html#eqcomm1" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#eqcomm1">3.10</A> we have
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img578.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img578.gif"><P>
<P>
<P><A NAME=5133> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img579.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img579.gif">
<BR><STRONG>Figure 3.20:</STRONG> <em> Execution of the butterfly summation algorithm on an
eight-processor, one-dimensional mesh. Shading is used to highlight a
single task and its communication partners, which are one, two, and
four hops distant.</em><A NAME=figbutt3> </A><BR>
<P>
<P>
On a mesh, the limited number of wires becomes an issue. For example,
on a one-dimensional mesh of <em> P</em>
processors, each processor
generates messages that must traverse 1, 2, ..., <IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img580.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img580.gif"> hops in the
<em> p</em>
steps of the algorithm (Figure <A HREF="node33.html#figbutt3" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figbutt3">3.20</A>). These
messages travel a total of <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img581.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img581.gif"> hops.
This represents the number of wires to which each processor requires
exclusive access during execution of the summation. As a
one-dimensional bidirectional mesh provides only <em> 2(P-1)</em>
wires,
we see that the parallel algorithm cannot possibly proceed in less
than <em> P/2</em>
steps rather than <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img582.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img582.gif"> steps as supposed
previously. In fact, it can proceed in <em> P/2</em>
steps only if we can
define a communication schedule that keeps all wires busy all the
time. Hence, the following model represents a lower bound on
communication costs:
<P><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img583.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img583.gif"><P>
<P>
<P><A NAME=5156> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img584.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img584.gif">
<BR><STRONG>Figure 3.21:</STRONG> <em> Performance of parallel FFT in a spectral transform code
<A NAME=5151> </A>
on a one-dimensional mesh in Intel DELTA and on Ethernet-connected
RS/6000 processors. The simple models do not take into account
competition for bandwidth; the refined models do, and give a better
fit to observed performance.</em><A NAME=figbflyperf1> </A><BR>
<P>
<P>
Figure <A HREF="node33.html#figbflyperf1" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#figbflyperf1">3.21</A> compares observed speedups with those
<A NAME=3954> </A>
predicted by the simple and bandwidth-limited performance models on a
one-dimensional mesh and on an Ethernet. These results are from an
<A NAME=3955> </A>
atmosphere modeling code that uses a parallel fast Fourier transform
<A NAME=3956> </A>
(FFT) to parallelize a numerical method called the spectral transform.
<A NAME=3957> </A>
The details of the numerical method are not important here; what is
relevant is that at each step, the code must perform two butterfly
communication operations (specifically, FFT) on a large array.
Details of the two experiments are given in Table <A HREF="node33.html#tabexp" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node33.html#tabexp">3.6</A>. (The
<IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img585.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img585.gif"> term used on the DELTA is significantly smaller than in the
finite difference code of Example <A HREF="node32.html#exfdi" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node32.html#exfdi">3.6</A>; this reflects the
fact that the communication code in the FFT implementation on the
DELTA had been carefully optimized.)
<P>
<BR><HR>
<P>
<P><A NAME=4475> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img589.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img589.gif">
<BR><STRONG>Table 3.6:</STRONG> Parameters for butterfly performance study (<em> N</em>
in words,
times in <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img588.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/img588.gif">sec).
<A NAME=tabexp> </A><BR>
<P>
<P>
<BR> <HR><a href="msgs0.htm#2" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#2"><img ALIGN=MIDDLE src="asm_color_tiny.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/asm_color_tiny.gif" alt="[DBPP]"></a> <A NAME=tex2html2249 HREF="node32.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node32.html"><IMG ALIGN=MIDDLE ALT="previous" SRC="previous_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/previous_motif.gif"></A> <A NAME=tex2html2257 HREF="node34.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node34.html"><IMG ALIGN=MIDDLE ALT="next" SRC="next_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/next_motif.gif"></A> <A NAME=tex2html2255 HREF="node26.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node26.html"><IMG ALIGN=MIDDLE ALT="up" SRC="up_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/up_motif.gif"></A> <A NAME=tex2html2259 HREF="node1.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node1.html"><IMG ALIGN=MIDDLE ALT="contents" SRC="contents_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/contents_motif.gif"></A> <A NAME=tex2html2260 HREF="node133.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node133.html"><IMG ALIGN=MIDDLE ALT="index" SRC="index_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/index_motif.gif"></A> <a href="msgs0.htm#3" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#3"><img ALIGN=MIDDLE src="search_motif.gif" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/search_motif.gif" alt="[Search]"></a> <BR>
<B> Next:</B> <A NAME=tex2html2258 HREF="node34.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node34.html">3.8 Input/Output</A>
<B>Up:</B> <A NAME=tex2html2256 HREF="node26.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node26.html">3 A Quantitative Basis for Design</A>
<B> Previous:</B> <A NAME=tex2html2250 HREF="node32.html" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node32.html">3.6 Evaluating Implementations</A>
<BR><HR><P>
<P><ADDRESS>
<I>© Copyright 1995 by <A href="msgs0.htm#6" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/tppmsgs/msgs0.htm#6">Ian Foster</a></I>
</ADDRESS>
</BODY>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -