<!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN">
<HTML>
<!-- Converted with LaTeX2HTML 95.1 (Fri Jan 20 1995) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HEAD>
<TITLE>3.6 Evaluating Implementations</TITLE>
</HEAD>
<BODY>
<meta name="description" value="3.6 Evaluating Implementations">
<meta name="keywords" value="book">
<meta name="resource-type" value="document">
<meta name="distribution" value="global">
<P>
 <BR> <HR><a href="msgs0.htm#2"><img ALIGN=MIDDLE src="asm_color_tiny.gif" alt="[DBPP]"></a>    <A NAME=tex2html2237 HREF="node31.html"><IMG ALIGN=MIDDLE ALT="previous" SRC="previous_motif.gif"></A> <A NAME=tex2html2245 HREF="node33.html"><IMG ALIGN=MIDDLE ALT="next" SRC="next_motif.gif"></A> <A NAME=tex2html2243 HREF="node26.html"><IMG ALIGN=MIDDLE ALT="up" SRC="up_motif.gif"></A> <A NAME=tex2html2247 HREF="node1.html"><IMG ALIGN=MIDDLE ALT="contents" SRC="contents_motif.gif"></A> <A NAME=tex2html2248 HREF="node133.html"><IMG ALIGN=MIDDLE ALT="index" SRC="index_motif.gif"></A> <a href="msgs0.htm#3"><img ALIGN=MIDDLE src="search_motif.gif" alt="[Search]"></a>   <BR>
<B> Next:</B> <A NAME=tex2html2246 HREF="node33.html">3.7 A Refined Communication Cost Model</A>
<B>Up:</B> <A NAME=tex2html2244 HREF="node26.html">3 A Quantitative Basis for Design</A>
<B> Previous:</B> <A NAME=tex2html2238 HREF="node31.html">3.5 Experimental Studies</A>
<BR><HR><P>
<H1><A NAME=SECTION02460000000000000000>3.6 Evaluating Implementations</A></H1>
<P>
<A NAME=secperfpt>&#160;</A>
<P>
Performance models also play an important role after a design
<A NAME=3605>&#160;</A>
is complete, when we start to write programs.  Comparisons of observed
and predicted execution times can provide valuable information about
both an algorithm and its implementation.
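<P>
As a purely illustrative sketch of such a comparison, the following C program flags
processor counts at which the observed time deviates from the model's prediction by more
than a chosen tolerance.  The timing values and the 20 percent threshold are invented
for illustration; they are not measurements or predictions from the book.
<PRE>
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical model predictions and measurements (seconds) for a
       range of processor counts; none of these numbers come from the book. */
    int    procs[]     = { 1, 2, 4, 8, 16 };
    double predicted[] = { 3200.0, 1620.0, 830.0, 440.0, 250.0 };
    double observed[]  = { 3210.0, 1640.0, 870.0, 510.0, 340.0 };

    for (int i = 0; i < 5; i++) {
        double rel = fabs(observed[i] - predicted[i]) / predicted[i];
        printf("P=%2d  predicted %6.0f s  observed %6.0f s  error %3.0f%%%s\n",
               procs[i], predicted[i], observed[i], 100.0 * rel,
               rel > 0.20 ? "   *** investigate" : "");
    }
    return 0;
}
</PRE>
A large relative error that grows with the processor count, as in the last rows of this
hypothetical data, is the kind of discrepancy that motivates the checks and the profiling
step described below.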
<P>
Even if considerable care is taken when designing and carrying out
experiments, the idealized nature of our models means that observed
and predicted execution times will seldom completely agree.  Major
discrepancies signify either that the model is incorrect or that the
implementation is inadequate.  In the first case, we can use the
empirical results to determine where the model is deficient; this
information in turn enables us to reassess the quantitative tradeoffs
that we have used to justify design decisions.  In the second case, we
can use the model to identify areas in which the implementation can be
improved.
<P>
When faced with a substantial difference between modeled and observed
execution times, our first step should be to check both the
performance model and our experimental design to verify not only that
the model and experiments are correct but that they are measuring the
same thing.
<P>
Our
<A NAME=3606>&#160;</A>
next step should be to obtain an execution profile of the parallel
code.  (In contrast to the execution profiles discussed in
Section <A HREF="node30.html#sec3prof" tppabs="http://www.dit.hcmut.edu.vn/books/system/par_anl/node30.html#sec3prof">3.4.3</A>, this profile will be based on measured
values.)  The goal here is to obtain a more detailed view of program
behavior, by measuring, for example, time spent in initialization,
time spent in different phases of a computation, total idle time, and
the total number and volume of messages communicated.  Ideally, data
will be obtained for a range of problem sizes and processor counts.
Tables <A HREF="node32.html#tabperfx">3.4</A> and <A HREF="node32.html#tabperfy">3.5</A> show typical examples of
execution profile data, in this case from a parallel computational
chemistry code that incorporates the Fock matrix construction
algorithm of Section <A HREF="node22.html#secchem">2.8</A> as a kernel.  These data were
obtained by using instrumentation inserted manually into the program.
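<P>
As a minimal sketch of the kind of manual instrumentation just described (not the
instrumentation actually used to collect the data in Tables 3.4 and 3.5), the following
C/MPI fragment brackets each program phase with <tt> MPI_Wtime</tt> calls and uses a small
wrapper to record the number and volume of messages sent.  The phase names follow the
tables; the phase bodies and the wrapper are assumptions made for illustration.
<PRE>
#include <mpi.h>
#include <stdio.h>

static double t_init, t_fock, t_diag;   /* per-phase elapsed times (s) */
static long   n_msgs, msg_bytes;        /* count and volume of sends   */

/* Wrapper that records the count and volume of every message we send. */
static int counted_send(const void *buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(type, &size);
    n_msgs++;
    msg_bytes += (long)count * size;
    return MPI_Send(buf, count, type, dest, tag, comm);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    /* ... initialization phase ... */
    t_init += MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    /* ... Fock matrix construction, issuing messages via counted_send() ... */
    t_fock += MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    /* ... matrix diagonalization ... */
    t_diag += MPI_Wtime() - t0;

    /* One line of profile data per processor, in the spirit of Table 3.5. */
    printf("P%d: init %.2f s  fock %.2f s  diag %.2f s  %ld msgs  %ld bytes\n",
           rank, t_init, t_fock, t_diag, n_msgs, msg_bytes);

    MPI_Finalize();
    return 0;
}
</PRE>
Each processor prints one line of profile data, which can then be collected for a range of
problem sizes and processor counts as suggested above.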
<P>
<P><A NAME=4473>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="Table 3.4" SRC="img510.gif">
<BR><STRONG>Table 3.4:</STRONG>  A simple execution profile for a single step of
a parallel computational chemistry code, here applied to a relatively
small problem on an IBM SP computer.  This code combines the Fock
matrix construction algorithm of Chapter 2 (``fock'') with additional
components.  The profile shows both the time spent in different parts of
the program for varying processor counts and the total execution time.
Scalability is reasonably good, although it is evident that the
routine <tt> diag</tt> has not been parallelized.  The <tt> init</tt> routine
does not scale well, but this cost is less important because the code is
normally run for many steps.
<A NAME=tabperfx>&#160;</A><BR>
<P>
<P>
<P><A NAME=4474>&#160;</A><IMG BORDER=0 ALIGN=BOTTOM ALT="Table 3.5" SRC="img513.gif">
<BR><STRONG>Table 3.5:</STRONG>  A more detailed execution profile for the parallel
code of Table 3.4.  This gives call frequencies and execution times
for various communication routines on each processor.  For brevity,
only the first 6 of 16 processors are shown.  Instrumentation overhead
increases total time from the 230 seconds of Table 3.4 to around 241
seconds.  The <tt> get</tt>, <tt> accum</tt>, and <tt> put</tt> routines read and
write data in distributed global arrays.  A <tt> get</tt> operation, which
blocks waiting for a response to a remote request, takes around 1.7
milliseconds on average.  Since each data transfer is relatively
small, and the IBM SP's <IMG BORDER=0 ALIGN=MIDDLE ALT="" SRC="img512.gif"> is low, this time must include
substantial idle time that could perhaps be overlapped with local
computation.  The second major source of communication cost is the
<tt> barrier</tt> operation, which is used to ensure that all updates to a
global array have completed before reading begins.  We may wish to
examine the program to determine whether we really need 85 such
operations per step.
<A NAME=tabperfy>&#160;</A><BR>
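<P>
The caption above suggests that much of the 1.7-millisecond <tt> get</tt> time is idle time
that might be overlapped with local computation.  The sketch below illustrates that general
pattern using MPI-3 one-sided operations as a stand-in for the distributed global-array
<tt> get</tt>; the window setup, block size, and placement of the independent computation are
illustrative assumptions, not the approach taken in the chemistry code itself.
<PRE>
#include <mpi.h>

#define N 1024   /* size of each locally owned block (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double *local;        /* this process's block of the "global array" */
    double  buf[N];       /* destination for the remote fetch           */
    MPI_Win win;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose each process's block in an RMA window. */
    MPI_Win_allocate(N * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &local, &win);
    for (int i = 0; i < N; i++) local[i] = rank + i;
    MPI_Barrier(MPI_COMM_WORLD);            /* all blocks initialized */

    MPI_Win_lock_all(0, win);               /* passive-target epoch */

    int target = (rank + 1) % size;

    /* Start the remote fetch, but do not wait for it yet. */
    MPI_Rget(buf, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win, &req);

    /* ... local computation that does not depend on buf goes here;
       the longer it runs, the less idle time the fetch implies ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* idle only if work ran short */

    /* ... computation that uses buf ... */

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
</PRE>
The same split-phase structure (start the fetch, compute, then wait) is what a nonblocking
<tt> get</tt> in a global-array library would provide.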
<P>
<P>
Once we have obtained an execution profile, we can compare it with the
performance model to identify deficiencies in either the model or the
implementation.  In the following sections, we list several potential problems
that may be revealed in an execution profile.
<P>
<H2><A NAME=SECTION02461000000000000000>3.6.1 Unaccounted-for Overhead</A></H2>
<P>
<A NAME=3649>&#160;</A>
We first consider issues that may lead to observed execution times
greater than predicted by a model.  Most often, such a situation
occurs because the performance model is incomplete: some aspect of an
algorithm or its implementation was neglected as insignificant but
proves in practice to contribute significantly to execution time.
<P>
<UL><LI>
<em> Load imbalances.</em> An algorithm may suffer from computation or
communication imbalances that were not considered in the performance
model.  These problems may be revealed by a study of how costs vary
across processors.
<P>
<LI>
<em> Replicated computation.</em> Disparities between observed and
predicted times can also signal deficiencies in the implementation.
For example, we may have failed to parallelize some component,
assuming that the cost of replicating its computation on every
processor would be insignificant.  On large numbers of processors,
this assumption may not be valid.  These sorts of problem can be
detected by studying costs in different parts of a program and for
varying numbers of processors.
<P>
<LI>
<em> Tool/algorithm mismatch.</em> The tools used to implement the
algorithms may introduce inefficiencies.  For example, we may have
designed an algorithm that creates multiple tasks per processor so as
to overlap computation and communication.  Yet the programming
language or library used to implement the algorithm may not implement
tasks efficiently.
<P>
<LI>
<em> Competition for bandwidth.</em> Equation <A HREF="node29.html#eq88">3.1</A> may not be an
adequate characterization of the communication network on which the
program is executing.  In particular, concurrent communications may
