requirements are illustrated in Figure <A HREF="node22.html#figfockagg">2.32</A>.
<P>
<P><A NAME=3018> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img347.gif">
<BR><STRONG>Figure 2.32:</STRONG> <em> Agglomeration strategies for Fock matrix construction with
<em> N=P=5</em>
, for (a) the total replication algorithm and (b) the partial
replication algorithm. In each case, the five tasks are shown with
shading used to represent the portion of the symmetric <tt> D</tt> and
<tt> F</tt> matrices allocated to each task. In (a), each matrix is
replicated in each task. In (b), each task is given a single row and
column; this corresponds to a factor of two
replication.</em><A NAME=figfockagg> </A><BR>
<P>
<P>
1. <em> Total replication.</em> Communication costs can be cut dramatically
by replicating the <tt> F</tt> and <tt> D</tt> matrices in each of
<em> P</em>
tasks, one per processor of a parallel computer. Each task is given
responsibility for <em> 1/P</em>
of the integrals. Computation can then
proceed in each task without any communication. The only coordination
required is a final summation to accumulate partial <tt> F</tt> matrices.
This can be achieved using a parallel vector reduction
algorithm described in Section <A HREF="node125.html#secvecred">11.2</A>.
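<P>
The structure of this strategy can be sketched in a few lines of Python with mpi4py. The integral kernel, the contribution pattern, and the cyclic work split below are placeholder assumptions rather than the actual <tt>fock_build</tt> of Algorithm 2.3; the point is the shape of the computation: replicate <tt>D</tt> and <tt>F</tt>, compute locally with no communication, then perform one global reduction.
<PRE>
from mpi4py import MPI
from itertools import product
import numpy as np

comm = MPI.COMM_WORLD
P, rank = comm.Get_size(), comm.Get_rank()

N = 16                                # basis-set size (illustrative only)
D = np.ones((N, N))                   # density matrix, replicated in every task
F = np.zeros((N, N))                  # partial Fock matrix, accumulated locally

def integral(i, j, k, l):
    # Placeholder for the expensive two-electron integral; a real code would
    # call an integral package here.
    return 1.0 / (1.0 + i + j + k + l)

# Each task takes roughly 1/P of the integrals; a simple cyclic split suffices
# because no communication is needed during the computation itself.
for n, (i, j, k, l) in enumerate(product(range(N), repeat=4)):
    if n % P == rank:
        I = integral(i, j, k, l)
        F[i, j] += I * D[k, l]        # schematic Coulomb-like contribution
        F[i, k] -= 0.5 * I * D[j, l]  # schematic exchange-like contribution

# The only coordination: sum the partial F matrices across all P tasks
# (the parallel vector reduction of Section 11.2).
comm.Allreduce(MPI.IN_PLACE, F, op=MPI.SUM)
</PRE>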
<P>
The technique of replicating data structures on each processor of a
parallel computer is commonly used in parallel computing to reduce
software engineering costs. It allows an existing sequential code to
be adapted quickly for parallel execution, since there is no need to
modify data structures. The principal disadvantage of the technique
is that it is nonscalable. Because total memory requirements
scale with the number of tasks created, the largest
<A NAME=2114> </A>
problem that can be solved is determined not by the total amount of
memory in a parallel computer, but by the amount available in a single
processor. For example, on a 512-processor computer with 16 MB of
memory per processor, an implementation of the quantum chemistry code
DISCO that uses this strategy cannot solve problems with <em> N>400</em>
.
<A NAME=2116> </A>
In principle, it would be interesting to solve problems where
<em> N</em>
is 10 times larger.
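<P>
The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes, purely for illustration, that roughly ten <em>N</em>-by-<em>N</em> double-precision arrays are replicated on every processor (the actual DISCO data structures may differ); per-processor memory then depends only on <em>N</em>, so a tenfold increase in <em>N</em> overwhelms a single node no matter how many processors the machine has.
<PRE>
# Back-of-the-envelope arithmetic for the scaling argument above. The figure of
# ten replicated N x N double-precision arrays per processor is an assumption
# made for illustration, not a number taken from DISCO.
ARRAYS = 10          # assumed number of replicated N x N arrays
WORD = 8             # bytes per double-precision word

def replicated_megabytes(N):
    return ARRAYS * N * N * WORD / 2**20

print(replicated_megabytes(400))    # about 12 MB: near a 16 MB per-node limit
print(replicated_megabytes(4000))   # about 1200 MB: beyond any single node,
                                    # regardless of the processor count
</PRE>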
<P>
<P><A NAME=3040> </A><IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img348.gif">
<BR><STRONG>Figure 2.33:</STRONG> <em> Data requirements for integral clusters. Each task accesses
three rows (and sometimes columns) of the <tt> D</tt> and <tt> F</tt>
matrices.</em><A NAME=figfock99> </A><BR>
<P>
<P>
2. <em> Partial replication.</em> An alternative approach is as follows.
<A NAME=2124> </A>
First, we agglomerate computation in what seems an obvious way, namely,
by making the inner loop of the procedure <tt> fock_build</tt> in
Algorithm <A HREF="node22.html#algfock">2.3</A> into a task. This yields <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img349.gif"> computation
tasks, each responsible for <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img350.gif"> integrals. Next, we examine the
communication requirements of each such task. We find that there is
considerable locality in the data required by these clusters of
integrals: each cluster accesses the <em> i</em>
th, <em> j</em>
th, and
<em> k</em>
th row (and sometimes column) of <tt> D</tt> and <tt> F</tt>
(Figure <A HREF="node22.html#figfock99">2.33</A>). To exploit this locality, we
agglomerate data to create <em> N</em>
data tasks, each containing a
row/column pair of the two-dimensional arrays <tt> D</tt> and <tt> F</tt> and
all of the one-dimensional array <tt> A</tt>. In this scheme, each
element of <tt> D</tt> and <tt> F</tt> is replicated once, and <tt> A</tt> is
replicated <em> N</em>
times, so total storage requirements are increased
from an average of <em> N</em>
to <em> 3N</em>
per task. Because of this
replication, each computation task now requires data from just three
data tasks. Hence, the number of messages is reduced from <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img351.gif">
to <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img352.gif">. The total volume communicated remains <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img353.gif">. Because
the cost of communicating a word is typically much less than the cost
of computing an integral, this is an efficient parallel algorithm.
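<P>
The data decomposition can be made concrete with a small Python sketch (an illustration under assumed names, not code from the book): each of the <em>N</em> data tasks owns one row/column of <tt>D</tt> and <tt>F</tt> plus a copy of <tt>A</tt>, and a computation task <em>(i j k)</em> needs to contact only the three data tasks <em>i</em>, <em>j</em>, and <em>k</em>.
<PRE>
import numpy as np
from itertools import combinations

N = 8                                  # matrix dimension = number of data tasks

class DataTask:
    """Data task t: owns row/column t of the symmetric D and F matrices plus a
    full copy of the one-dimensional array A (so A is replicated N times)."""
    def __init__(self, t, N):
        self.t = t
        self.D_row = np.zeros(N)
        self.F_row = np.zeros(N)
        self.A = np.zeros(N)

data_tasks = [DataTask(t, N) for t in range(N)]

def owners(i, j, k):
    """The data tasks that computation task (i, j, k) must communicate with."""
    return {i, j, k}

# A computation task, here (2, 5, 7), fetches what it needs from just three
# data tasks before computing its cluster of integrals.
needed_rows = {t: data_tasks[t].D_row for t in owners(2, 5, 7)}

# The message count grows with the number of (i, j, k) clusters, not with the
# number of individual integrals.
clusters = list(combinations(range(N), 3))
messages = sum(len(owners(i, j, k)) for i, j, k in clusters)
print(len(clusters), "clusters,", messages, "data-task requests")
</PRE>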
<P>
<H4><A NAME=SECTION02382030000000000000> Mapping.</A></H4>
<P>
<A NAME=2144> </A>
The ``partial replication'' Fock matrix construction algorithm creates
<em> N</em>
data tasks and <IMG BORDER=0 ALIGN=BOTTOM ALT="" SRC="img354.gif"> computation tasks. We use the notation
<em> (i j k)</em>
to identify the computation task responsible for
computing the <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img355.gif"> integrals <em> I</em>
<IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img356.gif">; this task requires
data from data tasks <em> i</em>
, <em> j</em>
, and <em> k</em>
. To complete the
parallel algorithm, we must define a mapping of data and computation
tasks to processors.
<P>
We assume <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img357.gif"> processors. Since each data task will receive
roughly the same number of requests, we allocate one data task to each
processor. This leaves the problem of mapping computation tasks. We
can imagine a variety of approaches:
<OL><LI>
A simple mapping, in which task <em> (i j k)</em>
is mapped to the same
processor as data task <em> i</em>
; since each task communicates with data
tasks <em> i</em>
, <em> j</em>
, and <em> k</em>
, off-processor communication
requirements are reduced by one third. A disadvantage of this
strategy is that since both the number of integrals in a task and the
amount of computation per integral can vary, different processors may
be allocated different amounts of computation.
<P>
<LI>
<A NAME=2158> </A>
A probabilistic mapping, in which computation tasks are mapped to
processors at random or using a cyclic strategy (see the sketch following
this list).
<P>
<LI>
A task-scheduling algorithm to allocate tasks to idle processors.
Since a problem can be represented by three integers (<em> i</em>
,
<em> j</em>
, <em> k</em>
) and multiple problems can easily be agglomerated
into a single message, a simple centralized scheduler can be used.
(Empirical studies suggest that a centralized scheduler performs well
on up to a few hundred processors.)
<P>
<LI>
Hybrid schemes in which, for example, tasks are allocated randomly to
sets of processors, within which a manager/worker scheduler is used.
</OL>
The best scheme will depend on performance requirements and on problem
and machine characteristics.
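<P>
For illustration, the first two options can be written down directly; the two functions below are hypothetical sketches, not part of the book's algorithms. The simple mapping co-locates task <em>(i j k)</em> with data task <em>i</em>, trading load balance for locality, while the cyclic mapping ignores locality and simply spreads tasks evenly over the available processors.
<PRE>
# Hypothetical sketches of mapping strategies 1 and 2 above: (i, j, k) names a
# computation task, P is the number of processors, and task_index is any fixed
# enumeration of the computation tasks.

def simple_mapping(i, j, k, P):
    """Strategy 1: place task (i, j, k) on the processor holding data task i,
    so one of its three data requests is always local (load may be uneven)."""
    return i % P

def cyclic_mapping(task_index, P):
    """Strategy 2: round-robin tasks over processors, ignoring locality but
    spreading the variable-cost tasks evenly on average."""
    return task_index % P

print(simple_mapping(2, 5, 7, P=8))        # task (2,5,7) goes to processor 2
print(cyclic_mapping(task_index=45, P=8))  # the 45th task goes to processor 5
</PRE>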
<P>
<H2><A NAME=SECTION02383000000000000000>2.8.3 Chemistry Summary</A></H2>
<P>
We have developed two alternative parallel algorithms for the Fock
matrix construction problem.
<P>
<OL><LI>
The <tt> F</tt> and <tt> D</tt> matrices are replicated in each of
<em> P</em>
tasks. Integral computations are distributed among the tasks, and a
summation algorithm is used to sum <tt> F</tt> matrix contributions
accumulated in the different tasks. This algorithm is simple but
nonscalable.
<P>
<LI>
The <tt> F</tt>, <tt> D</tt>, and <tt> A</tt> matrices are partitioned among
<em> N</em>
tasks, with a small amount of replication. Integral
computations are agglomerated into <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img358.gif"> tasks, each containing
<IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img359.gif"> integrals.
These tasks are mapped to processors either
statically or using a task-scheduling scheme.
<P>
</OL>
<P>
This case study illustrates some of the tradeoffs that can arise in
the design process. The first algorithm slashes communication and
software engineering costs; however, it is not scalable. In
contrast, the second algorithm has higher communication costs but is
highly scalable: its memory requirements increase only with problem
size, not the number of processors. To choose between the
two algorithms, we need to quantify their parallel performance
and then to determine the importance of scalability, by assessing
application requirements and the characteristics of the target
parallel computer.
<P>
<A NAME=2174> </A>
<P>
<P><ADDRESS>
<I>© Copyright 1995 by <A href="msgs0.htm#6">Ian Foster</a></I>
</ADDRESS>
</BODY>