<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> <style type="text/css"> body { font-family: Verdana, Arial, Helvetica, sans-serif;} a.at-term { font-style: italic; } </style> <title>Applying MPI to the Laplace Solver</title> <meta name="Generator" content="ATutor"> <meta name="Keywords" content=""></head><body> <p>As opposed to the case of OpenMP, where the compiler does most of the hard work of parallelization, MPI requires much more work on the part of the developer to parallelize an application. Since no data is shared between MPI processes except through the MPI communication routines, it is up to the developer to determine the patterns of communication. The upside to this is that the developer also has much more control over how an application is parallelized, which can
lead to much better performance and scalability.</p>
<h3>Domain Decomposition</h3>
<p>Fundamental to parallelizing many applications with MPI is the concept of <em><a href="../glossary.html#domain+decomposition" target="body" class="at-term">domain decomposition</a></em>, in which the overall domain of the solution is broken into pieces and the calculations for each piece are assigned to an MPI process. Consider again the diagram of the domain shown in Chapter 2:</p>
<img src="domain.jpg">
<p>This domain can be decomposed in three ways. The first is a 1D partitioning along the <i>y</i>-axis (a.k.a. the <code>j</code> index):</p>
<img src="domain-1d-horz.jpg">
<p>Another way of decomposing the domain is a 1D partitioning along the <i>x</i>-axis (a.k.a. the <code>i</code> index):
</p>
<img src="domain-1d-vert.jpg">
<p>The third way of decomposing the domain is a 2D decomposition where the domain is partitioned along both axes:
</p>
<img src="domain-2d.jpg">
<p>In all three diagrams, the number in each stripe or block represents the rank of the MPI process responsible for that area. MPI's virtual topology functionality can also be used to assign areas to MPI processes and determine nearest neighbor connectivity.
</p>
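<p>As a rough sketch of how the virtual topology routines might be used for the 2D decomposition pictured above (the implementation later in this chapter uses the 1D equivalent of these calls), a process grid and its nearest neighbors could be set up as follows. This fragment assumes the same <code>include 'mpif.h'</code> environment as the full listing below, and its variable names are illustrative only:</p>
<pre><code>
! Sketch only: create a 2D Cartesian process grid and locate the four
! nearest neighbors of this rank.  Neighbors that fall off the edge of
! the grid come back as MPI_PROC_NULL, marking a physical boundary.
integer comm2d,ierr,nprocs,dims(2),left,right,below,above
logical periodic(2)
call MPI_COMM_SIZE(MPI_COMM_WORLD,nprocs,ierr)
dims(1)=0                  ! let MPI_DIMS_CREATE choose the grid shape
dims(2)=0
call MPI_DIMS_CREATE(nprocs,2,dims,ierr)
periodic(1)=.FALSE.        ! the Laplace domain is not periodic
periodic(2)=.FALSE.
call MPI_CART_CREATE(MPI_COMM_WORLD,2,dims,periodic,.TRUE.,comm2d,ierr)
call MPI_CART_SHIFT(comm2d,0,1,left,right,ierr)
call MPI_CART_SHIFT(comm2d,1,1,below,above,ierr)
</code></pre>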
<p>A second concept that often goes hand in hand with domain decomposition is <em>ghost cells</em>. In most numerical solutions of differential equations, the solution at a point depends on neighboring points. With a decomposed domain, the solution on the edge of a particular subdomain will usually depend on values in a neighboring subdomain owned by a different MPI process. To handle this, each MPI process maintains ghost cells on the edges of its subdomain that hold the solution values from neighboring MPI processes. This requires pairs of neighboring MPI processes to exchange solution values along the boundary between them.
</p>
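<p>For the 1D decomposition used in the implementation later in this chapter, the ghost cells are simply one extra row of <code>j</code> values on each side of the rows a process owns. The following sketch (using the index names from that implementation) shows the resulting layout of each rank's portion of <code>u</code>:</p>
<pre><code>
! Sketch only: layout of one rank's portion of u for a 1D split along j.
!
!   u(:,jstart-1)        ghost row, copy of the "down" neighbor's row jend
!   u(:,jstart..jend)    rows owned and updated by this rank
!   u(:,jend+1)          ghost row, copy of the "up" neighbor's row jstart
!
! The two ghost rows are refreshed once per iteration by exchanging
! boundary rows with the neighboring ranks.
</code></pre>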
<h3>Algorithmic Changes and Implementation Details</h3>
<p>Domain decomposition requires a number of algorithmic changes to the approach used in the serial implementation. First, each MPI process must determine the ranges of the <code>i</code> and <code>j</code> indices for which it is responsible, as well as which MPI processes are responsible for the neighboring areas. The latter can be achieved using MPI's virtual topology functionality. Second, the main computation loop runs only over the ranges in <code>i</code> and <code>j</code> for which the MPI process is responsible. Third, determining the maximum change from one iteration to the next becomes a two-step process: each MPI process first determines its own local maximum change, and then a global maximum among all the MPI processes is computed using the MPI_REDUCE call. Finally, to maintain current values in the ghost cells, the solution values along boundaries shared with other MPI processes must be sent to the neighboring processes, and values received from those neighboring processes must be placed in the appropriate ghost cells. There are a number of ways these data exchanges can be implemented; the code below uses nonblocking sends (MPI_ISEND) followed by blocking receives (MPI_RECV) and a wait for the completion of the send (MPI_WAIT). However, these communications could also be implemented using a blocking exchange (MPI_SENDRECV), or by preposting nonblocking receives (MPI_IRECV) followed by blocking sends (MPI_SEND).
</p>
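<p>For comparison, the blocking-exchange alternative mentioned above would replace the separate send and receive phases with one MPI_SENDRECV call per neighbor. The sketch below is an illustration only; it reuses the buffer and neighbor names from the implementation that follows, and relies on the fact that communication with MPI_PROC_NULL is a no-op, so no explicit tests for physical boundaries are needed:</p>
<pre><code>
! Sketch only: ghost-cell exchange using blocking MPI_SENDRECV.
! dobuf holds this rank's row jstart and uobuf its row jend, packed as
! in the listing below; dibuf and uibuf receive the neighbors' rows.
length=iend-istart+1
call MPI_SENDRECV(dobuf,length,MPI_REAL8,down,it, &
     uibuf,length,MPI_REAL8,up,it,mpigrid,istat,ierr)
call MPI_SENDRECV(uobuf,length,MPI_REAL8,up,it, &
     dibuf,length,MPI_REAL8,down,it,mpigrid,istat,ierr)
! Afterwards dibuf is unpacked into u(:,jstart-1) and uibuf into
! u(:,jend+1), exactly as in the receive phase of the listing.
</code></pre>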
<p>The particular domain decomposition used for the implementation below is a 1D decomposition along the <i>y</i>-axis (the <code>j</code> index). This is advantageous because of how Fortran 90 lays out arrays in memory: Fortran is column-major, so array elements at successive values of <code>i</code> are contiguous in memory. By keeping the <code>i</code> index unbroken, the compiler is more readily able to generate prefetch instructions and avoid stalling on memory references.
</p>
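<p>The implementation below computes each rank's share of the <code>j</code> range with the simple division <code>jsize=jmax/size</code>, which assumes that <code>jmax</code> splits evenly across the MPI processes. For reference, a more general block distribution that handles a remainder might look like the following sketch; the variables <code>nrows</code>, <code>base</code>, and <code>rem</code> are introduced here purely for illustration:</p>
<pre><code>
! Sketch only: divide the jmax-2 interior rows (j = 2 .. jmax-1) among
! 'size' ranks when they do not divide evenly; the first 'rem' ranks
! each take one extra row.  grdrnk is this rank's Cartesian coordinate.
integer nrows,base,rem
nrows=jmax-2
base=nrows/size
rem=mod(nrows,size)
jstart=2+grdrnk*base+min(grdrnk,rem)
jend=jstart+base-1
if (grdrnk.LT.rem) jend=jend+1
</code></pre>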
<h3>MPI Implementation</h3>
<pre><code>
program lpmpi
include 'mpif.h'
integer imax,jmax,im1,im2,jm1,jm2,it,itmax
parameter (imax=2001,jmax=2001)
parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)
parameter (itmax=100)
real*8 u(imax,jmax),du(imax,jmax),umax,dumax,tol,pi
parameter (umax=10.0,tol=1.0e-6,pi=3.14159)
! Additional MPI parameters
integer istart,iend,jstart,jend
integer size,rank,ierr,istat(MPI_STATUS_SIZE),mpigrid,length
integer grdrnk,dims(1),gloc(1),up,down,isize,jsize
integer ureq,dreq
integer ustat(MPI_STATUS_SIZE),dstat(MPI_STATUS_SIZE)
real*8 tstart,tend,gdumax
logical cyclic(1)
real*8 uibuf(imax),uobuf(imax),dibuf(imax),dobuf(imax)
! Initialize
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)
! 1D linear topology
dims(1)=size
cyclic(1)=.FALSE.
call MPI_CART_CREATE(MPI_COMM_WORLD,1,dims,cyclic,.true.,mpigrid &
,ierr)
call MPI_COMM_RANK(mpigrid,grdrnk,ierr)
call MPI_CART_COORDS(mpigrid,grdrnk,1,gloc,ierr)
call MPI_CART_SHIFT(mpigrid,0,1,down,up,ierr)
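! Local index ranges for this rank: the full interior range in i,
! and a block of rows in j based on the Cartesian coordinate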
istart=2
iend=imax-1
jsize=jmax/size
jstart=gloc(1)*jsize+1
if (jstart.LE.1) jstart=2
jend=(gloc(1)+1)*jsize
if (jend.GE.jmax) jend=jmax-1
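! Initialize the local block (including ghost rows) and apply the
! fixed boundary condition along i=imax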
do j=jstart-1,jend+1
do i=istart-1,iend+1
u(i,j)=0.0
du(i,j)=0.0
enddo
u(imax,j)=umax*sin(pi*float(j-1)/float(jmax-1))
enddo
! Main computation loop
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
tstart=MPI_WTIME()
do it=1,itmax
dumax=0.0
do j=jstart,jend
do i=istart,iend
du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)
dumax=max(dumax,abs(du(i,j)))
enddo
enddo
do j=jstart,jend
do i=istart,iend
u(i,j)=u(i,j)+du(i,j)
enddo
enddo
! Compute the overall residual
call MPI_REDUCE(dumax,gdumax,1,MPI_REAL8,MPI_MAX,0 &
,MPI_COMM_WORLD,ierr)
! Send phase
if (down.NE.MPI_PROC_NULL) then
j=1
do i=istart,iend
dobuf(j)=u(i,jstart)
j=j+1
enddo
length=j-1
call MPI_ISEND(dobuf,length,MPI_REAL8,down,it,mpigrid, &
dreq,ierr)
endif
if (up.NE.MPI_PROC_NULL) then
j=1
do i=istart,iend
uobuf(j)=u(i,jend)
j=j+1
enddo
length=j-1
call MPI_ISEND(uobuf,length,MPI_REAL8,up,it,mpigrid, &
ureq,ierr)
endif
! Receive phase
if (down.NE.MPI_PROC_NULL) then
length=iend-istart+1
call MPI_RECV(dibuf,length,MPI_REAL8,down,it, &
mpigrid,istat,ierr)
call MPI_WAIT(dreq,dstat,ierr)
j=1
do i=istart,iend
u(i,jstart-1)=dibuf(j)
j=j+1
enddo
endif
if (up.NE.MPI_PROC_NULL) then
length=iend-istart+1
call MPI_RECV(uibuf,length,MPI_REAL8,up,it, &
mpigrid,istat,ierr)
call MPI_WAIT(ureq,ustat,ierr)
j=1
do i=istart,iend
u(i,jend+1)=uibuf(j)
j=j+1
enddo
endif
! gdumax is defined only on the MPI_REDUCE root (rank 0), so each rank
! logs its local maximum change to its own file
write (rank+10,*) rank,it,dumax
if (rank.eq.0) write (1,*) it,gdumax
enddo
call MPI_BARRIER(MPI_COMM_WORLD,ierr)
tend=MPI_WTIME()
if (rank.EQ.0) then
write(*,*) 'Calculation took ',tend-tstart,'s. on ',size, &
' MPI processes'
endif
call MPI_FINALIZE(ierr)
stop
end</code></pre></body></html>