<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
	<style type="text/css">
	body { font-family: Verdana, Arial, Helvetica, sans-serif; }
	a.at-term { font-style: italic; }
	</style>
	<title>MPI Laplace Solver Performance Characteristics</title>
	<meta name="Generator" content="ATutor" />
	<meta name="Keywords" content="" />
</head>
<body>

<p>The following graphs show the performance of the MPI Laplace solver on several systems, including an SGI Origin 2000, a Cray T3E-600, an IBM SP with 8-processor nodes, and a cluster of quad-Xeon systems connected with Myrinet. As with the OpenMP performance data, the results are presented in terms of speedup and parallel efficiency.</p>
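<p>As a reminder of how the two metrics are computed, here is a minimal sketch in Python. The timing values are invented placeholders for illustration, not numbers taken from the charts below.</p>

```python
# Speedup and parallel efficiency as conventionally defined:
#   S(p) = T(1) / T(p)        E(p) = S(p) / p
# where T(1) is the serial run time and T(p) the time on p tasks.

def speedup(t_serial, t_parallel):
    """Speedup S(p) = T(1) / T(p)."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, p):
    """Parallel efficiency E(p) = S(p) / p; 1.0 means ideal scaling."""
    return speedup(t_serial, t_parallel) / p

# Hypothetical run: 100 s serially, 30 s on 4 MPI tasks.
s = speedup(100.0, 30.0)                  # about 3.33
e = parallel_efficiency(100.0, 30.0, 4)   # about 0.83
```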

<h3>Speedup</h3>


  <img src="mpi-speedup.JPG" alt="MPI speedup chart" align="center" />

<h3>Parallel Efficiency</h3>


  <img src="mpi-pareff.JPG" alt="MPI parallel efficiency chart" align="center" />


<h3>Discussion</h3>

<p>Several systems show superlinear speedups on this code at low task counts. This is a cache effect: as the number of tasks grows, the working set of each individual MPI process shrinks until it fits entirely in cache, and the computational loops then complete much more quickly because the cost of going out to main memory disappears. The effect is most apparent on the Origin 2000, where the large (8 MB) L2 cache causes it to occur almost immediately.</p>

<p>A second effect seen in the graphs above is the consequence of running multiple MPI processes on an SMP node of an SP or Linux cluster. Because each MPI process is a miniature version of the serial Laplace solver, its memory bandwidth requirements are roughly the same as those of the serial solver. When several MPI processes run on the same node, the memory subsystem must have enough bandwidth to keep all of them running at full speed, or the performance of each is degraded. The extent of the degradation is directly related to how far the memory subsystem is oversubscribed: on an 8-processor SP node, the available per-processor memory bandwidth degrades only slightly in going from two to four MPI processes per node, while on the quad-Xeon cluster it begins to degrade immediately, so runs with four MPI processes per node are substantially slower than runs at a comparable total task count using one or two MPI processes per node.</p>
</body>
</html>
