<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
 <style type="text/css">
  body { font-family: Verdana, Arial, Helvetica, sans-serif; }
  a.at-term { font-style: italic; }
 </style>
 <title>Scaling Performance: IBM SP</title>
 <meta name="Generator" content="ATutor" />
 <meta name="Keywords" content="" />
</head>
<body>
<p>The scaling behavior of the pure MPI version of the Laplace code is shown here as a function of the total number of processors: <br/>
<br/>
<img src="SP-MPI.gif" align=center> </p>
<p>In general, the performance is not especially sensitive to whether the PEs are within the same node or spread across different nodes. Note the performance peak at roughly 40 PEs: by this point the communication overheads have grown to dominate the (shrinking) amount of work each PE has to do, so performance begins to decline beyond that critical point.</p>
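<p>The turnover in the curve can be illustrated with a very simple cost model: the per-PE compute time shrinks roughly as W/p while the communication overhead grows with the number of PEs, so the total time eventually rises again and the speedup peaks. The constants in the sketch below are invented for illustration; they are not fitted to the measured data.</p>
<pre>
/* Toy scaling model: compute time falls as W/p, communication overhead
 * grows linearly with p.  The constants are illustrative only.          */
#include &lt;stdio.h&gt;

int main(void)
{
    const double W  = 100.0;   /* total serial work (arbitrary units) */
    const double c  = 0.06;    /* per-PE communication cost (assumed) */
    const double t1 = W + c;   /* modelled single-PE run time         */

    for (int p = 1; p &lt;= 64; p++) {
        double t = W / p + c * p;              /* modelled time on p PEs */
        printf("p=%2d  time=%7.2f  speedup=%5.2f\n", p, t, t1 / t);
    }
    return 0;
}
</pre>
<p>With these made-up numbers the modelled speedup peaks near p = sqrt(W/c), roughly 40 PEs, and then falls, which is qualitatively the behavior seen in the plot above.</p>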
<p>We next compare the pure MPI code to the MLP implementation, considering cases in which the total number of PEs used per node is the same. For example, the following plot compares MPI with two processes per node against MLP with one MPI process per node that creates two OpenMP threads in its parallel regions: <br/>
<br/>
<img src="SP-2PEs.gif" align=center> </p>
<p>We see that the pure MPI code is slightly faster over most of the range considered, but the MLP implementation becomes more efficient at large processor counts. This is because MLP uses fewer MPI processes, which reduces the overhead of, e.g., the collective reduction operation and the other communications. Of course, that saving must be balanced against the OpenMP overhead of entering parallel regions. The net result of these competing effects is that the MLP implementation is superior past about 24 PEs. </p>
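<p>The collective reduction is where the difference in process counts shows up most directly. In the hybrid code the per-process maximum change can be formed by an OpenMP reduction among the threads, and only then does a single MPI_Allreduce combine it across the smaller set of MPI processes. The sketch below is illustrative only (it uses the modern OpenMP max reduction for C and placeholder names), not the course code itself.</p>
<pre>
/* Illustrative convergence check for the hybrid code: an OpenMP reduction
 * inside each process, then one MPI_Allreduce across the MPI processes.
 * Requires OpenMP 3.1+ for the C max reduction; names are placeholders.  */
#include &lt;mpi.h&gt;
#include &lt;omp.h&gt;
#include &lt;math.h&gt;

#define N 1000
static double u[N + 2][N + 2], unew[N + 2][N + 2];

int main(int argc, char **argv)
{
    MPI_Init(&amp;argc, &amp;argv);

    /* per-process maximum change, reduced over the OpenMP threads */
    double err = 0.0;
    #pragma omp parallel for reduction(max:err)
    for (int i = 1; i &lt;= N; i++)
        for (int j = 1; j &lt;= N; j++) {
            double d = fabs(unew[i][j] - u[i][j]);
            if (d &gt; err) err = d;
        }

    /* one collective over the MPI processes only; with fewer MPI
       processes the reduction has fewer participants and costs less */
    MPI_Allreduce(MPI_IN_PLACE, &amp;err, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
</pre>
<p>In the two-PEs-per-node comparison above, for instance, the MLP code runs half as many MPI processes as the pure MPI code at any given total PE count, so this Allreduce has half as many participants.</p>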
<p>The following two plots show MPI versus MLP performance for larger numbers of total processors per node. Several interesting behaviors can be seen. In particular, for eight PEs per node the best performance is achieved with a "balance" of MPI processes and OpenMP threads; that is, the best performance occurs neither with the maximum number of MPI processes nor with the maximum number of OpenMP threads per node. The best overall performance is obtained for this case, with a speedup factor of about 30 on 64 processors, corresponding to a parallel efficiency (speedup divided by processor count, roughly 30/64) of about 46%. <br/>
<br/>
<img src="SP-4PEs.gif" align=center> <br/>
<br/>
<img src="SP-8PEs.gif" align=center> </p></body></html>